[PATCH/RFC] core/nfsd: allow kernel threads to use task_work.

Message ID 170112272125.7109.6245462722883333440@noble.neil.brown.name
State: New
Series: [PATCH/RFC] core/nfsd: allow kernel threads to use task_work.

Commit Message

NeilBrown Nov. 27, 2023, 10:05 p.m. UTC
  I have evidence from a customer site of 256 nfsd threads adding files to
delayed_fput_lists nearly twice as fast as they are retired by a single
work-queue thread running delayed_fput().  As you might imagine this
does not end well (20 million files in the queue at the time a snapshot
was taken for analysis).

While this might point to a problem with the filesystem not handling the
final close efficiently, such problems should only hurt throughput, not
lead to memory exhaustion.

For normal threads, the thread that closes the file also calls the
final fput so there is natural rate limiting preventing excessive growth
in the list of delayed fputs.  For kernel threads, and particularly for
nfsd, delays in the final fput do not impose any throttling to prevent
the thread from closing more files.

A simple way to fix this is to treat nfsd threads like normal processes
for task_work.  Thus the pending files are queued for the thread, and
the same thread finishes the work.

Currently KTHREADs are assumed never to call task_work_run().  With this
patch that is still the default, but it is implemented by storing the
magic value TASK_WORKS_DISABLED in ->task_works.  If a kthread, such as
nfsd, will call task_work_run() periodically, it sets ->task_works
to NULL to indicate this.

Signed-off-by: NeilBrown <neilb@suse.de>
---

I wonder which tree this should go through assuming everyone likes it.
VFS maybe??

Thanks.

 fs/file_table.c           | 2 +-
 fs/nfsd/nfssvc.c          | 4 ++++
 include/linux/sched.h     | 1 +
 include/linux/task_work.h | 4 +++-
 kernel/fork.c             | 2 +-
 kernel/task_work.c        | 7 ++++---
 6 files changed, 14 insertions(+), 6 deletions(-)
  

Comments

Al Viro Nov. 27, 2023, 10:30 p.m. UTC | #1
On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote:

> A simple way to fix this is to treat nfsd threads like normal processes
> for task_work.  Thus the pending files are queued for the thread, and
> the same thread finishes the work.
> 
> Currently KTHREADs are assumed never to call task_work_run().  With this
> patch that it still the default but it is implemented by storing the
> magic value TASK_WORKS_DISABLED in ->task_works.  If a kthread, such as
> nfsd, will call task_work_run() periodically, it sets ->task_works
> to NULL to indicate this.

>  		svc_recv(rqstp);
>  		validate_process_creds();
> +		if (task_work_pending(current))
> +			task_work_run();

What locking environment and call chain do you have here?  And what happens if
you get something stuck in ->release()?

>  
>  	p->pdeath_signal = 0;
> -	p->task_works = NULL;
> +	p->task_works = args->kthread ? TASK_WORKS_DISABLED : NULL;

Umm... why not have them set (by helper in kernel/task_work.c) to
&work_exited?  Then the task_work_run parts wouldn't be needed at all...
  
NeilBrown Nov. 27, 2023, 10:43 p.m. UTC | #2
On Tue, 28 Nov 2023, Al Viro wrote:
> On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote:
> 
> > A simple way to fix this is to treat nfsd threads like normal processes
> > for task_work.  Thus the pending files are queued for the thread, and
> > the same thread finishes the work.
> > 
> > Currently KTHREADs are assumed never to call task_work_run().  With this
> > patch that it still the default but it is implemented by storing the
> > magic value TASK_WORKS_DISABLED in ->task_works.  If a kthread, such as
> > nfsd, will call task_work_run() periodically, it sets ->task_works
> > to NULL to indicate this.
> 
> >  		svc_recv(rqstp);
> >  		validate_process_creds();
> > +		if (task_work_pending(current))
> > +			task_work_run();
> 
> What locking environment and call chain do you have here?  And what happens if
> you get something stuck in ->release()?

No locking. This is in the top level function of the kthread.
A ->release function that waits for an NFS filesystem to flush out data
through a filesystem exported by this nfsd might hit problems.
But that really requires nfs-exporting an nfs filesystem which is
loop-back mounted.  While we do support nfs-reexport and nfs-loop-back
mounts, I don't think we make any pretence of supporting the combination.

Is that the sort of thing you were thinking of?

> 
> >  
> >  	p->pdeath_signal = 0;
> > -	p->task_works = NULL;
> > +	p->task_works = args->kthread ? TASK_WORKS_DISABLED : NULL;
> 
> Umm... why not have them set (by helper in kernel/task_work.c) to
> &work_exited?  Then the task_work_run parts wouldn't be needed at all...
> 

I hadn't tried to understand what work_exited was for - but now I see
that its purpose is precisely to block further work from being queued -
exactly what I need.
Thanks - I'll make that change for a v2.

I've realised that I'll also need to change the flush_delayed_fput() in
nfsd_file_close_inode_sync() to task_work_run().

Thanks,
NeilBrown
  
Chuck Lever Nov. 27, 2023, 10:59 p.m. UTC | #3
On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote:
> 
> I have evidence from a customer site of 256 nfsd threads adding files to
> delayed_fput_lists nearly twice as fast they are retired by a single
> work-queue thread running delayed_fput().  As you might imagine this
> does not end well (20 million files in the queue at the time a snapshot
> was taken for analysis).
> 
> While this might point to a problem with the filesystem not handling the
> final close efficiently, such problems should only hurt throughput, not
> lead to memory exhaustion.

I have this patch queued for v6.8:

https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-next&id=c42661ffa58acfeaf73b932dec1e6f04ce8a98c0


> For normal threads, the thread that closes the file also calls the
> final fput so there is natural rate limiting preventing excessive growth
> in the list of delayed fputs.  For kernel threads, and particularly for
> nfsd, delayed in the final fput do not impose any throttling to prevent
> the thread from closing more files.

I don't think we want to block nfsd threads waiting for files to
close. Won't that be a potential denial of service?


> A simple way to fix this is to treat nfsd threads like normal processes
> for task_work.  Thus the pending files are queued for the thread, and
> the same thread finishes the work.
> 
> Currently KTHREADs are assumed never to call task_work_run().  With this
> patch that it still the default but it is implemented by storing the
> magic value TASK_WORKS_DISABLED in ->task_works.  If a kthread, such as
> nfsd, will call task_work_run() periodically, it sets ->task_works
> to NULL to indicate this.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
> 
> I wonder which tree this should go through assuming everyone likes it.
> VFS maybe??
> 
> Thanks.
> 
>  fs/file_table.c           | 2 +-
>  fs/nfsd/nfssvc.c          | 4 ++++
>  include/linux/sched.h     | 1 +
>  include/linux/task_work.h | 4 +++-
>  kernel/fork.c             | 2 +-
>  kernel/task_work.c        | 7 ++++---
>  6 files changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/file_table.c b/fs/file_table.c
> index de4a2915bfd4..e79351df22be 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -445,7 +445,7 @@ void fput(struct file *file)
>  	if (atomic_long_dec_and_test(&file->f_count)) {
>  		struct task_struct *task = current;
>  
> -		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
> +		if (likely(!in_interrupt())) {
>  			init_task_work(&file->f_rcuhead, ____fput);
>  			if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME))
>  				return;
> diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> index 66ca50b38b27..c047961262ca 100644
> --- a/fs/nfsd/nfssvc.c
> +++ b/fs/nfsd/nfssvc.c
> @@ -13,6 +13,7 @@
>  #include <linux/fs_struct.h>
>  #include <linux/swap.h>
>  #include <linux/siphash.h>
> +#include <linux/task_work.h>
>  
>  #include <linux/sunrpc/stats.h>
>  #include <linux/sunrpc/svcsock.h>
> @@ -941,6 +942,7 @@ nfsd(void *vrqstp)
>  	}
>  
>  	current->fs->umask = 0;
> +	current->task_works = NULL; /* Declare that I will call task_work_run() */
>  
>  	atomic_inc(&nfsdstats.th_cnt);
>  
> @@ -955,6 +957,8 @@ nfsd(void *vrqstp)
>  
>  		svc_recv(rqstp);
>  		validate_process_creds();
> +		if (task_work_pending(current))
> +			task_work_run();
>  	}
>  
>  	atomic_dec(&nfsdstats.th_cnt);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 292c31697248..c63c2bedbf71 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1117,6 +1117,7 @@ struct task_struct {
>  	unsigned int			sas_ss_flags;
>  
>  	struct callback_head		*task_works;
> +#define	TASK_WORKS_DISABLED	((void*)1)
>  
>  #ifdef CONFIG_AUDIT
>  #ifdef CONFIG_AUDITSYSCALL
> diff --git a/include/linux/task_work.h b/include/linux/task_work.h
> index 795ef5a68429..3c74e3de81ed 100644
> --- a/include/linux/task_work.h
> +++ b/include/linux/task_work.h
> @@ -22,7 +22,9 @@ enum task_work_notify_mode {
>  
>  static inline bool task_work_pending(struct task_struct *task)
>  {
> -	return READ_ONCE(task->task_works);
> +	struct callback_head *works = READ_ONCE(task->task_works);
> +
> +	return works && works != TASK_WORKS_DISABLED;
>  }
>  
>  int task_work_add(struct task_struct *task, struct callback_head *twork,
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 10917c3e1f03..903b29804fe1 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2577,7 +2577,7 @@ __latent_entropy struct task_struct *copy_process(
>  	p->dirty_paused_when = 0;
>  
>  	p->pdeath_signal = 0;
> -	p->task_works = NULL;
> +	p->task_works = args->kthread ? TASK_WORKS_DISABLED : NULL;
>  	clear_posix_cputimers_work(p);
>  
>  #ifdef CONFIG_KRETPROBES
> diff --git a/kernel/task_work.c b/kernel/task_work.c
> index 95a7e1b7f1da..ffdf4b0d7a0e 100644
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -49,7 +49,8 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
>  
>  	head = READ_ONCE(task->task_works);
>  	do {
> -		if (unlikely(head == &work_exited))
> +		if (unlikely(head == &work_exited ||
> +			     head == TASK_WORKS_DISABLED))
>  			return -ESRCH;
>  		work->next = head;
>  	} while (!try_cmpxchg(&task->task_works, &head, work));
> @@ -157,7 +158,7 @@ void task_work_run(void)
>  		work = READ_ONCE(task->task_works);
>  		do {
>  			head = NULL;
> -			if (!work) {
> +			if (!work || work == TASK_WORKS_DISABLED) {
>  				if (task->flags & PF_EXITING)
>  					head = &work_exited;
>  				else
> @@ -165,7 +166,7 @@ void task_work_run(void)
>  			}
>  		} while (!try_cmpxchg(&task->task_works, &work, head));
>  
> -		if (!work)
> +		if (!work || work == TASK_WORKS_DISABLED)
>  			break;
>  		/*
>  		 * Synchronize with task_work_cancel(). It can not remove
> -- 
> 2.42.1
>
  
NeilBrown Nov. 28, 2023, 12:16 a.m. UTC | #4
On Tue, 28 Nov 2023, Chuck Lever wrote:
> On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote:
> > 
> > I have evidence from a customer site of 256 nfsd threads adding files to
> > delayed_fput_lists nearly twice as fast they are retired by a single
> > work-queue thread running delayed_fput().  As you might imagine this
> > does not end well (20 million files in the queue at the time a snapshot
> > was taken for analysis).
> > 
> > While this might point to a problem with the filesystem not handling the
> > final close efficiently, such problems should only hurt throughput, not
> > lead to memory exhaustion.
> 
> I have this patch queued for v6.8:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-next&id=c42661ffa58acfeaf73b932dec1e6f04ce8a98c0
> 

Thanks....
I think that change is good, but I don't think it addresses the problem
mentioned in the description, and it is not directly relevant to the
problem I saw ... though it is complicated.

The problem "workqueue ...  hogged cpu..." probably means that
nfsd_file_dispose_list() needs a cond_resched() call in the loop.
That will stop it from hogging the CPU whether it is tied to one CPU or
free to roam.

Also that work is calling filp_close() which primarily calls
filp_flush().
It also calls fput() but that does minimal work.  If there is much work
to do then that is offloaded to another work-item.  *That* is the
workitem that I had problems with.

The problem I saw was with an older kernel which didn't have the nfsd
file cache and so probably is calling filp_close more often.  So maybe
my patch isn't so important now.  Particularly as nfsd now isn't closing
most files in-task but instead offloads that to another task.  So the
final fput will not be handled by the nfsd task either.

But I think there is room for improvement.  Gathering lots of files
together into a list and closing them sequentially is not going to be as
efficient as closing them in parallel.

> 
> > For normal threads, the thread that closes the file also calls the
> > final fput so there is natural rate limiting preventing excessive growth
> > in the list of delayed fputs.  For kernel threads, and particularly for
> > nfsd, delayed in the final fput do not impose any throttling to prevent
> > the thread from closing more files.
> 
> I don't think we want to block nfsd threads waiting for files to
> close. Won't that be a potential denial of service?

Not as much as the denial of service caused by memory exhaustion due to
an indefinitely growing list of files waiting to be closed by a single
workqueue thread.

I think it is perfectly reasonable that when handling an NFSv4 CLOSE,
the nfsd thread should completely handle that request including all the
flush and ->release etc.  If that causes any denial of service, then
simply increase the number of nfsd threads.

For NFSv3 it is more complex.  On the kernel where I saw a problem the
filp_close happen after each READ or WRITE (though I think the customer
was using NFSv4...).  With the file cache there is no thread that is
obviously responsible for the close.
To get the sort of throttling that I think is needed, we could possibly
have each "nfsd_open" check if there are pending closes, and wait for
some small amount of progress.

But I don't think it is reasonable for the nfsd threads to take none of
the burden of closing files as that can result in imbalance.

I'll need to give this more thought.

Thanks,
NeilBrown
  
Chuck Lever Nov. 28, 2023, 1:37 a.m. UTC | #5
On Tue, Nov 28, 2023 at 11:16:06AM +1100, NeilBrown wrote:
> On Tue, 28 Nov 2023, Chuck Lever wrote:
> > On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote:
> > > 
> > > I have evidence from a customer site of 256 nfsd threads adding files to
> > > delayed_fput_lists nearly twice as fast they are retired by a single
> > > work-queue thread running delayed_fput().  As you might imagine this
> > > does not end well (20 million files in the queue at the time a snapshot
> > > was taken for analysis).
> > > 
> > > While this might point to a problem with the filesystem not handling the
> > > final close efficiently, such problems should only hurt throughput, not
> > > lead to memory exhaustion.
> > 
> > I have this patch queued for v6.8:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-next&id=c42661ffa58acfeaf73b932dec1e6f04ce8a98c0
> > 
> 
> Thanks....
> I think that change is good, but I don't think it addresses the problem
> mentioned in the description, and it is not directly relevant to the
> problem I saw ... though it is complicated.
> 
> The problem "workqueue ...  hogged cpu..." probably means that
> nfsd_file_dispose_list() needs a cond_resched() call in the loop.
> That will stop it from hogging the CPU whether it is tied to one CPU or
> free to roam.
> 
> Also that work is calling filp_close() which primarily calls
> filp_flush().
> It also calls fput() but that does minimal work.  If there is much work
> to do then that is offloaded to another work-item.  *That* is the
> workitem that I had problems with.
> 
> The problem I saw was with an older kernel which didn't have the nfsd
> file cache and so probably is calling filp_close more often.

Without the file cache, the filp_close() should be handled directly
by the nfsd thread handling the RPC, IIRC.


> So maybe
> my patch isn't so important now.  Particularly as nfsd now isn't closing
> most files in-task but instead offloads that to another task.  So the
> final fput will not be handled by the nfsd task either.
> 
> But I think there is room for improvement.  Gathering lots of files
> together into a list and closing them sequentially is not going to be as
> efficient as closing them in parallel.

I believe the file cache passes the filps to the work queue one at
a time, but I don't think there's anything that forces the work
queue to handle each flush/close completely before proceeding to the
next.

IOW there is some parallelism there already, especially now that
nfsd_filecache_wq is UNBOUND.


> > > For normal threads, the thread that closes the file also calls the
> > > final fput so there is natural rate limiting preventing excessive growth
> > > in the list of delayed fputs.  For kernel threads, and particularly for
> > > nfsd, delayed in the final fput do not impose any throttling to prevent
> > > the thread from closing more files.
> > 
> > I don't think we want to block nfsd threads waiting for files to
> > close. Won't that be a potential denial of service?
> 
> Not as much as the denial of service caused by memory exhaustion due to
> an indefinitely growing list of files waiting to be closed by a single
> thread of workqueue.

The cache garbage collector is single-threaded, but nfsd_filecache_wq
has a max_active setting of zero.


> I think it is perfectly reasonable that when handling an NFSv4 CLOSE,
> the nfsd thread should completely handle that request including all the
> flush and ->release etc.  If that causes any denial of service, then
> simple increase the number of nfsd threads.
> 
> For NFSv3 it is more complex.  On the kernel where I saw a problem the
> filp_close happen after each READ or WRITE (though I think the customer
> was using NFSv4...).  With the file cache there is no thread that is
> obviously responsible for the close.
> To get the sort of throttling that I think is need, we could possibly
> have each "nfsd_open" check if there are pending closes, and to wait for
> some small amount of progress.

Well nfsd_open() in particular appears to be used only for readdir.

But maybe nfsd_file_acquire() could wait briefly, in the garbage-
collected case, if the nfsd_net's disposal queue is long.


> But don't think it is reasonable for the nfsd threads to take none of
> the burden of closing files as that can result in imbalance.
> 
> I'll need to give this more thought.
  
NeilBrown Nov. 28, 2023, 2:57 a.m. UTC | #6
(trimmed cc...)

On Tue, 28 Nov 2023, Chuck Lever wrote:
> On Tue, Nov 28, 2023 at 11:16:06AM +1100, NeilBrown wrote:
> > On Tue, 28 Nov 2023, Chuck Lever wrote:
> > > On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote:
> > > > 
> > > > I have evidence from a customer site of 256 nfsd threads adding files to
> > > > delayed_fput_lists nearly twice as fast they are retired by a single
> > > > work-queue thread running delayed_fput().  As you might imagine this
> > > > does not end well (20 million files in the queue at the time a snapshot
> > > > was taken for analysis).
> > > > 
> > > > While this might point to a problem with the filesystem not handling the
> > > > final close efficiently, such problems should only hurt throughput, not
> > > > lead to memory exhaustion.
> > > 
> > > I have this patch queued for v6.8:
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-next&id=c42661ffa58acfeaf73b932dec1e6f04ce8a98c0
> > > 
> > 
> > Thanks....
> > I think that change is good, but I don't think it addresses the problem
> > mentioned in the description, and it is not directly relevant to the
> > problem I saw ... though it is complicated.
> > 
> > The problem "workqueue ...  hogged cpu..." probably means that
> > nfsd_file_dispose_list() needs a cond_resched() call in the loop.
> > That will stop it from hogging the CPU whether it is tied to one CPU or
> > free to roam.
> > 
> > Also that work is calling filp_close() which primarily calls
> > filp_flush().
> > It also calls fput() but that does minimal work.  If there is much work
> > to do then that is offloaded to another work-item.  *That* is the
> > workitem that I had problems with.
> > 
> > The problem I saw was with an older kernel which didn't have the nfsd
> > file cache and so probably is calling filp_close more often.
> 
> Without the file cache, the filp_close() should be handled directly
> by the nfsd thread handling the RPC, IIRC.

Yes - but __fput() is handled by a workqueue.

> 
> 
> > So maybe
> > my patch isn't so important now.  Particularly as nfsd now isn't closing
> > most files in-task but instead offloads that to another task.  So the
> > final fput will not be handled by the nfsd task either.
> > 
> > But I think there is room for improvement.  Gathering lots of files
> > together into a list and closing them sequentially is not going to be as
> > efficient as closing them in parallel.
> 
> I believe the file cache passes the filps to the work queue one at

nfsd_file_close_inode() does.  nfsd_file_gc() and nfsd_file_lru_scan()
can pass multiple.

> a time, but I don't think there's anything that forces the work
> queue to handle each flush/close completely before proceeding to the
> next.

Parallelism with workqueues is controlled by the work items (struct
work_struct).  Two different work items can run in parallel.  But any
given work item can never run parallel to itself.

The only work items queued on nfsd_filecache_wq are from
  nn->fcache_disposal->work.
There is one of these for each network namespace.  So in any given
network namespace, all work on nfsd_filecache_wq is fully serialised.

> 
> IOW there is some parallelism there already, especially now that
> nfsd_filecache_wq is UNBOUND.

No there is not.  And UNBOUND makes no difference to parallelism in this
case.  It allows the one work item to migrate between CPUs while it is
running, but it doesn't allow it to run concurrently on two different
CPUs.


(UNBOUND can improve parallelism when multiple different work items are
 submitted all from the same CPU.  Without UNBOUND all the work would
 happen on the same CPU, though if the work sleeps, the different work
 items can be interleaved.  With UNBOUND the different work items can
 enjoy true parallelism when needed).


> 
> 
> > > > For normal threads, the thread that closes the file also calls the
> > > > final fput so there is natural rate limiting preventing excessive growth
> > > > in the list of delayed fputs.  For kernel threads, and particularly for
> > > > nfsd, delayed in the final fput do not impose any throttling to prevent
> > > > the thread from closing more files.
> > > 
> > > I don't think we want to block nfsd threads waiting for files to
> > > close. Won't that be a potential denial of service?
> > 
> > Not as much as the denial of service caused by memory exhaustion due to
> > an indefinitely growing list of files waiting to be closed by a single
> > thread of workqueue.
> 
> The cache garbage collector is single-threaded, but nfsd_filecache_wq
> has a max_active setting of zero.

This allows parallelism between network namespaces, but not within a
network namespace.

> 
> 
> > I think it is perfectly reasonable that when handling an NFSv4 CLOSE,
> > the nfsd thread should completely handle that request including all the
> > flush and ->release etc.  If that causes any denial of service, then
> > simple increase the number of nfsd threads.
> > 
> > For NFSv3 it is more complex.  On the kernel where I saw a problem the
> > filp_close happen after each READ or WRITE (though I think the customer
> > was using NFSv4...).  With the file cache there is no thread that is
> > obviously responsible for the close.
> > To get the sort of throttling that I think is need, we could possibly
> > have each "nfsd_open" check if there are pending closes, and to wait for
> > some small amount of progress.
> 
> Well nfsd_open() in particular appears to be used only for readdir.
> 
> But maybe nfsd_file_acquire() could wait briefly, in the garbage-
> collected case, if the nfsd_net's disposal queue is long.
> 
> 
> > But don't think it is reasonable for the nfsd threads to take none of
> > the burden of closing files as that can result in imbalance.
> > 
> > I'll need to give this more thought.
> 
> 
> -- 
> Chuck Lever
> 

Thanks,
NeilBrown
  
Christian Brauner Nov. 28, 2023, 11:24 a.m. UTC | #7
On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote:
> 
> I have evidence from a customer site of 256 nfsd threads adding files to
> delayed_fput_lists nearly twice as fast they are retired by a single
> work-queue thread running delayed_fput().  As you might imagine this
> does not end well (20 million files in the queue at the time a snapshot
> was taken for analysis).
> 
> While this might point to a problem with the filesystem not handling the
> final close efficiently, such problems should only hurt throughput, not
> lead to memory exhaustion.
> 
> For normal threads, the thread that closes the file also calls the
> final fput so there is natural rate limiting preventing excessive growth
> in the list of delayed fputs.  For kernel threads, and particularly for
> nfsd, delayed in the final fput do not impose any throttling to prevent
> the thread from closing more files.
> 
> A simple way to fix this is to treat nfsd threads like normal processes
> for task_work.  Thus the pending files are queued for the thread, and
> the same thread finishes the work.
> 
> Currently KTHREADs are assumed never to call task_work_run().  With this
> patch that it still the default but it is implemented by storing the
> magic value TASK_WORKS_DISABLED in ->task_works.  If a kthread, such as
> nfsd, will call task_work_run() periodically, it sets ->task_works
> to NULL to indicate this.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
> 
> I wonder which tree this should go through assuming everyone likes it.
> VFS maybe??

Sure.

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 292c31697248..c63c2bedbf71 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1117,6 +1117,7 @@ struct task_struct {
>  	unsigned int			sas_ss_flags;
>  
>  	struct callback_head		*task_works;
> +#define	TASK_WORKS_DISABLED	((void*)1)

Should be simpler if you invert the logic?

COMPLETELY UNTESTED

---
 fs/file_table.c           |  2 +-
 fs/nfsd/nfssvc.c          |  4 ++++
 include/linux/task_work.h |  3 +++
 kernel/fork.c             |  3 +++
 kernel/task_work.c        | 12 ++++++++++++
 5 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index de4a2915bfd4..e79351df22be 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -445,7 +445,7 @@ void fput(struct file *file)
 	if (atomic_long_dec_and_test(&file->f_count)) {
 		struct task_struct *task = current;
 
-		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
+		if (likely(!in_interrupt())) {
 			init_task_work(&file->f_rcuhead, ____fput);
 			if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME))
 				return;
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index d6122bb2d167..cea76bad3a95 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -13,6 +13,7 @@
 #include <linux/fs_struct.h>
 #include <linux/swap.h>
 #include <linux/siphash.h>
+#include <linux/task_work.h>
 
 #include <linux/sunrpc/stats.h>
 #include <linux/sunrpc/svcsock.h>
@@ -943,6 +944,7 @@ nfsd(void *vrqstp)
 
 	current->fs->umask = 0;
 
+	task_work_manage(current); /* Declare that I will call task_work_run() */
 	atomic_inc(&nfsdstats.th_cnt);
 
 	set_freezable();
@@ -956,6 +958,8 @@ nfsd(void *vrqstp)
 
 		svc_recv(rqstp);
 		validate_process_creds();
+		if (task_work_pending(current))
+			task_work_run();
 	}
 
 	atomic_dec(&nfsdstats.th_cnt);
diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index 795ef5a68429..645fb94e47e0 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -20,6 +20,9 @@ enum task_work_notify_mode {
 	TWA_SIGNAL_NO_IPI,
 };
 
+void task_work_off(struct task_struct *task);
+void task_work_manage(struct task_struct *task);
+
 static inline bool task_work_pending(struct task_struct *task)
 {
 	return READ_ONCE(task->task_works);
diff --git a/kernel/fork.c b/kernel/fork.c
index 10917c3e1f03..348ed8fa9333 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2346,6 +2346,9 @@ __latent_entropy struct task_struct *copy_process(
 	if (args->io_thread)
 		p->flags |= PF_IO_WORKER;
 
+	if (args->kthread)
+		task_work_off(p);
+
 	if (args->name)
 		strscpy_pad(p->comm, args->name, sizeof(p->comm));
 
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 95a7e1b7f1da..2ae948d0d124 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -5,6 +5,18 @@
 
 static struct callback_head work_exited; /* all we need is ->next == NULL */
 
+void task_work_off(struct task_struct *task)
+{
+	task->task_works = &work_exited;
+}
+EXPORT_SYMBOL(task_work_off);
+
+void task_work_manage(struct task_struct *task)
+{
+	task->task_works = NULL;
+}
+EXPORT_SYMBOL(task_work_manage);
+
 /**
  * task_work_add - ask the @task to execute @work->func()
  * @task: the task which should run the callback
  
Christian Brauner Nov. 28, 2023, 1:51 p.m. UTC | #8
[Reusing the trimmed Cc]

On Tue, Nov 28, 2023 at 11:16:06AM +1100, NeilBrown wrote:
> On Tue, 28 Nov 2023, Chuck Lever wrote:
> > On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote:
> > > 
> > > I have evidence from a customer site of 256 nfsd threads adding files to
> > > delayed_fput_lists nearly twice as fast they are retired by a single
> > > work-queue thread running delayed_fput().  As you might imagine this
> > > does not end well (20 million files in the queue at the time a snapshot
> > > was taken for analysis).
> > > 
> > > While this might point to a problem with the filesystem not handling the
> > > final close efficiently, such problems should only hurt throughput, not
> > > lead to memory exhaustion.
> > 
> > I have this patch queued for v6.8:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-next&id=c42661ffa58acfeaf73b932dec1e6f04ce8a98c0
> > 
> 
> Thanks....
> I think that change is good, but I don't think it addresses the problem
> mentioned in the description, and it is not directly relevant to the
> problem I saw ... though it is complicated.
> 
> The problem "workqueue ...  hogged cpu..." probably means that
> nfsd_file_dispose_list() needs a cond_resched() call in the loop.
> That will stop it from hogging the CPU whether it is tied to one CPU or
> free to roam.
> 
> Also that work is calling filp_close() which primarily calls
> filp_flush().
> It also calls fput() but that does minimal work.  If there is much work
> to do then that is offloaded to another work-item.  *That* is the
> workitem that I had problems with.
> 
> The problem I saw was with an older kernel which didn't have the nfsd
> file cache and so probably is calling filp_close more often.  So maybe
> my patch isn't so important now.  Particularly as nfsd now isn't closing
> most files in-task but instead offloads that to another task.  So the
> final fput will not be handled by the nfsd task either.
> 
> But I think there is room for improvement.  Gathering lots of files
> together into a list and closing them sequentially is not going to be as
> efficient as closing them in parallel.
> 
> > 
> > > For normal threads, the thread that closes the file also calls the
> > > final fput so there is natural rate limiting preventing excessive growth
> > > in the list of delayed fputs.  For kernel threads, and particularly for
> > > nfsd, delayed in the final fput do not impose any throttling to prevent
> > > the thread from closing more files.
> > 
> > I don't think we want to block nfsd threads waiting for files to
> > close. Won't that be a potential denial of service?
> 
> Not as much as the denial of service caused by memory exhaustion due to
> an indefinitely growing list of files waiting to be closed by a single
> thread of workqueue.

It seems less likely that you run into memory exhaustion than a DoS
because nfsd() is busy closing fds. Especially because you default to a
single nfsd thread, afaict.

> I think it is perfectly reasonable that when handling an NFSv4 CLOSE,
> the nfsd thread should completely handle that request including all the
> flush and ->release etc.  If that causes any denial of service, then
> simple increase the number of nfsd threads.

But isn't that a significant behavioral change? So I would expect to
make this configurable via a module or Kconfig option?

> For NFSv3 it is more complex.  On the kernel where I saw a problem the
> filp_close happen after each READ or WRITE (though I think the customer
> was using NFSv4...).  With the file cache there is no thread that is
> obviously responsible for the close.
> To get the sort of throttling that I think is need, we could possibly
> have each "nfsd_open" check if there are pending closes, and to wait for
> some small amount of progress.
> 
> But don't think it is reasonable for the nfsd threads to take none of
> the burden of closing files as that can result in imbalance.

It feels like this really needs to be tested under a workload similar to
the one in question to see whether this is a viable solution.
  
Oleg Nesterov Nov. 28, 2023, 1:52 p.m. UTC | #9
On 11/28, Christian Brauner wrote:
>
> Should be simpler if you invert the logic?
>
> COMPLETELY UNTESTED

Agreed, this looks much better to me. But perhaps we can just add the new
PF_KTHREAD_XXX flag and change fput


	--- a/fs/file_table.c
	+++ b/fs/file_table.c
	@@ -445,7 +445,8 @@ void fput(struct file *file)
		if (atomic_long_dec_and_test(&file->f_count)) {
			struct task_struct *task = current;
	 
	-		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
	+		if (likely(!in_interrupt() &&
	+		    (task->flags & (PF_KTHREAD|PF_KTHREAD_XXX)) != PF_KTHREAD)) {
				init_task_work(&file->f_rcuhead, ____fput);
				if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME))
					return;

?

Then nfsd() can simply set PF_KTHREAD_XXX. This looks even simpler to me.

Oleg.
  
Oleg Nesterov Nov. 28, 2023, 2:01 p.m. UTC | #10
On 11/28, NeilBrown wrote:
>
> I have evidence from a customer site of 256 nfsd threads adding files to
> delayed_fput_lists nearly twice as fast they are retired by a single
> work-queue thread running delayed_fput().  As you might imagine this
> does not end well (20 million files in the queue at the time a snapshot
> was taken for analysis).

On a related note... Neil, Al, et al, can you look at

	[PATCH 1/3] fput: don't abuse task_work_add() when possible
	https://lore.kernel.org/all/20150908171446.GA14589@redhat.com/

(please ignore 3/3).

Oleg.
  
Jeff Layton Nov. 28, 2023, 2:15 p.m. UTC | #11
On Tue, 2023-11-28 at 14:51 +0100, Christian Brauner wrote:
> [Reusing the trimmed Cc]
> 
> On Tue, Nov 28, 2023 at 11:16:06AM +1100, NeilBrown wrote:
> > On Tue, 28 Nov 2023, Chuck Lever wrote:
> > > [...]
> > > 
> > > I don't think we want to block nfsd threads waiting for files to
> > > close. Won't that be a potential denial of service?
> > 
> > Not as much as the denial of service caused by memory exhaustion due to
> > an indefinitely growing list of files waiting to be closed by a single
> > thread of workqueue.
> 
> It seems less likely that you run into memory exhausting than a DOS
> because nfsd() is busy closing fds. Especially because you default to
> single nfsd thread afaict.
> 

The default is currently 8 threads (which is ridiculously low for most
uses, but that's another discussion). That policy is usually set by
userland nfs-utils though.

This is another place where we might want to reserve a "rescuer" thread
that avoids doing work that can end up blocked. Maybe we could switch
back to queuing them to the list when we're below a certain threshold of
available threads (1? 2? 4?).

> > I think it is perfectly reasonable that when handling an NFSv4 CLOSE,
> > the nfsd thread should completely handle that request including all the
> > flush and ->release etc.  If that causes any denial of service, then
> > simple increase the number of nfsd threads.
> 
> But isn't that a significant behavioral change? So I would expect to
> make this at configurable via a module- or Kconfig option?
> 

I struggle to think of how we would document a new option like this. 

> > For NFSv3 it is more complex.  On the kernel where I saw a problem the
> > filp_close happen after each READ or WRITE (though I think the customer
> > was using NFSv4...).  With the file cache there is no thread that is
> > obviously responsible for the close.
> > To get the sort of throttling that I think is need, we could possibly
> > have each "nfsd_open" check if there are pending closes, and to wait for
> > some small amount of progress.
> > 
> > But don't think it is reasonable for the nfsd threads to take none of
> > the burden of closing files as that can result in imbalance.
> 
> It feels that this really needs to be tested under a similar workload in
> question to see whether this is a viable solution.

Definitely. I'd also like to see how this behaves with NFS or Ceph
reexport. Closing can be quite slow on those filesystems, so that might
be a good place to try and break this.
  
Oleg Nesterov Nov. 28, 2023, 2:20 p.m. UTC | #12
On 11/28, Oleg Nesterov wrote:
>
> On a related note... Neil, Al, et al, can you look at
>
> 	[PATCH 1/3] fput: don't abuse task_work_add() when possible
> 	https://lore.kernel.org/all/20150908171446.GA14589@redhat.com/

Cough... Now that I look at this 8 years old patch again I think
it is wrong, fput() can race with task_work_cancel() so it is not
safe to dereference the first pending work in theory :/

Oleg.
  
Chuck Lever Nov. 28, 2023, 3:22 p.m. UTC | #13
On Tue, Nov 28, 2023 at 09:15:39AM -0500, Jeff Layton wrote:
> On Tue, 2023-11-28 at 14:51 +0100, Christian Brauner wrote:
> > [Reusing the trimmed Cc]
> > 
> > On Tue, Nov 28, 2023 at 11:16:06AM +1100, NeilBrown wrote:
> > > On Tue, 28 Nov 2023, Chuck Lever wrote:
> > > > [...]
> > > > 
> > > > I don't think we want to block nfsd threads waiting for files to
> > > > close. Won't that be a potential denial of service?
> > > 
> > > Not as much as the denial of service caused by memory exhaustion due to
> > > an indefinitely growing list of files waiting to be closed by a single
> > > thread of workqueue.
> > 
> > It seems less likely that you run into memory exhausting than a DOS
> > because nfsd() is busy closing fds. Especially because you default to
> > single nfsd thread afaict.

I would expect a DoS too: the system should start pushing out dirty
file data itself well before exhausting memory.


> The default is currently 8 threads (which is ridiculously low for most
> uses, but that's another discussion). That policy is usually set by
> userland nfs-utils though.

With only 8 threads, it might be /more/ difficult for clients to
generate enough workload to cause an overwhelming flood of closes.
As Neil said in the cover letter text, he observed this issue with
256 nfsd threads.


> This is another place where we might want to reserve a "rescuer" thread
> that avoids doing work that can end up blocked. Maybe we could switch
> back to queuing them to the list when we're below a certain threshold of
> available threads (1? 2? 4?).
> 
> > > I think it is perfectly reasonable that when handling an NFSv4 CLOSE,
> > > the nfsd thread should completely handle that request including all the
> > > flush and ->release etc.  If that causes any denial of service, then
> > > simple increase the number of nfsd threads.
> > 
> > But isn't that a significant behavioral change? So I would expect to
> > make this at configurable via a module- or Kconfig option?
> 
> I struggle to think about how we would document a new option like this. 

I think NFSv4 CLOSE can close files synchronously without an
observable behavior change. NFSv4 clients almost always COMMIT dirty
data before they CLOSE, so there should only rarely be a significant
flush component to an fput done here.

The problem is the garbage-collected (NFSv3) case, where the server
frequently closes files well before a client might have COMMITted
its dirty data.


> > > For NFSv3 it is more complex.  On the kernel where I saw a problem the
> > > filp_close happen after each READ or WRITE (though I think the customer
> > > was using NFSv4...).  With the file cache there is no thread that is
> > > obviously responsible for the close.
> > > To get the sort of throttling that I think is need, we could possibly
> > > have each "nfsd_open" check if there are pending closes, and to wait for
> > > some small amount of progress.
> > > 
> > > But don't think it is reasonable for the nfsd threads to take none of
> > > the burden of closing files as that can result in imbalance.
> > 
> > It feels that this really needs to be tested under a similar workload in
> > question to see whether this is a viable solution.
> 
> Definitely. I'd also like to see how this behaves with NFS or Ceph
> reexport. Closing can be quite slow on those filesystems, so that might
> be a good place to try and break this.
  
Christian Brauner Nov. 28, 2023, 3:33 p.m. UTC | #14
On Tue, Nov 28, 2023 at 02:52:59PM +0100, Oleg Nesterov wrote:
> On 11/28, Christian Brauner wrote:
> >
> > Should be simpler if you invert the logic?
> >
> > COMPLETELY UNTESTED
> 
> Agreed, this looks much better to me. But perhaps we can just add the new
> PF_KTHREAD_XXX flag and change fput
> 
> 
> 	--- a/fs/file_table.c
> 	+++ b/fs/file_table.c
> 	@@ -445,7 +445,8 @@ void fput(struct file *file)
> 		if (atomic_long_dec_and_test(&file->f_count)) {
> 			struct task_struct *task = current;
> 	 
> 	-		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
> 	+		if (likely(!in_interrupt() &&
> 	+		    (task->flags & (PF_KTHREAD|PF_KTHREAD_XXX)) != PF_KTHREAD)) {
> 				init_task_work(&file->f_rcuhead, ____fput);
> 				if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME))
> 					return;
> 
> ?
> 
> Then nfsd() can simply set PF_KTHREAD_XXX. This looks even simpler to me.

Yeah, I had played with that as well. The only reason I didn't do it was
to avoid a PF_* flag. If that's preferable it might be worth just adding
PF_TASK_WORK and decoupling this from PF_KTHREAD. kthread creation and
userspace process creation have been based on the same struct
kernel_clone_args ever since we added it for clone3(), so we catch
everything in copy_process():

diff --git a/fs/file_table.c b/fs/file_table.c
index 6deac386486d..5d3eb5ef4fc7 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -437,7 +437,7 @@ void fput(struct file *file)
                        file_free(file);
                        return;
                }
-               if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
+               if (likely(!in_interrupt() && (task->flags & PF_TASK_WORK))) {
                        init_task_work(&file->f_rcuhead, ____fput);
                        if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME))
                                return;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 292c31697248..8dfc06acc6a0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1755,7 +1755,7 @@ extern struct pid *cad_pid;
                                                 * I am cleaning dirty pages from some other bdi. */
 #define PF_KTHREAD             0x00200000      /* I am a kernel thread */
 #define PF_RANDOMIZE           0x00400000      /* Randomize virtual address space */
-#define PF__HOLE__00800000     0x00800000
+#define PF_TASK_WORK           0x00800000
 #define PF__HOLE__01000000     0x01000000
 #define PF__HOLE__02000000     0x02000000
 #define PF_NO_SETAFFINITY      0x04000000      /* Userland is not allowed to meddle with cpus_mask */
diff --git a/kernel/fork.c b/kernel/fork.c
index 10917c3e1f03..2604235c800f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2346,6 +2346,14 @@ __latent_entropy struct task_struct *copy_process(
        if (args->io_thread)
                p->flags |= PF_IO_WORKER;

+       /*
+        * By default only non-kernel threads can use task work. Kernel
+        * threads that manage task work explicitly can add that flag in
+        * their kthread callback.
+        */
+       if (!args->kthread)
+               p->flags |= PF_TASK_WORK;
+
        if (args->name)
                strscpy_pad(p->comm, args->name, sizeof(p->comm));
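For illustration only, a kernel-style sketch (not buildable as-is, and the
loop shape is schematic) of how a kthread such as nfsd might then opt in
under this scheme:

```c
/* Hypothetical sketch: the PF_TASK_WORK opt-in and the periodic
 * task_work_run() call are the assumptions here; svc_recv() stands
 * in for nfsd's real request-handling loop. */
static int nfsd(void *vrqstp)
{
	current->flags |= PF_TASK_WORK;	/* opt in to task_work delivery */

	while (!kthread_should_stop()) {
		svc_recv(/* ... */);		/* handle one RPC */
		if (task_work_pending(current))
			task_work_run();	/* run our own deferred fputs */
	}
	return 0;
}
```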
  
Chuck Lever Nov. 28, 2023, 3:34 p.m. UTC | #15
On Tue, Nov 28, 2023 at 01:57:30PM +1100, NeilBrown wrote:
> 
> (trimmed cc...)
> 
> On Tue, 28 Nov 2023, Chuck Lever wrote:
> > On Tue, Nov 28, 2023 at 11:16:06AM +1100, NeilBrown wrote:
> > > On Tue, 28 Nov 2023, Chuck Lever wrote:
> > > [...]
> > > 
> > > The problem I saw was with an older kernel which didn't have the nfsd
> > > file cache and so probably is calling filp_close more often.
> > 
> > Without the file cache, the filp_close() should be handled directly
> > by the nfsd thread handling the RPC, IIRC.
> 
> Yes - but __fput() is handled by a workqueue.
> 
> > 
> > 
> > > So maybe
> > > my patch isn't so important now.  Particularly as nfsd now isn't closing
> > > most files in-task but instead offloads that to another task.  So the
> > > final fput will not be handled by the nfsd task either.
> > > 
> > > But I think there is room for improvement.  Gathering lots of files
> > > together into a list and closing them sequentially is not going to be as
> > > efficient as closing them in parallel.
> > 
> > I believe the file cache passes the filps to the work queue one at
> 
> nfsd_file_close_inode() does.  nfsd_file_gc() and nfsd_file_lru_scan()
> can pass multiple.
> 
> > a time, but I don't think there's anything that forces the work
> > queue to handle each flush/close completely before proceeding to the
> > next.
> 
> Parallelism with workqueues is controlled by the work items (struct
> work_struct).  Two different work items can run in parallel.  But any
> given work item can never run parallel to itself.
> 
> The only work items queued on nfsd_filecache_wq are from
>   nn->fcache_disposal->work.
> There is one of these for each network namespace.  So in any given
> network namespace, all work on nfsd_filecache_wq is fully serialised.

OIC, it's that specific case you are concerned with. The per-
namespace laundrette was added by:

  9542e6a643fc ("nfsd: Containerise filecache laundrette")

Its purpose was to confine the close backlog to each container.

Seems like it would be better if there was a struct work_struct
in each struct nfsd_file. That wouldn't add real backpressure to
nfsd threads, but it would enable file closes to run in parallel.
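Sketched in kernel-style pseudocode (field and function names here are
hypothetical, not the actual filecache code), that idea would look
something like:

```c
/* Hypothetical sketch: embed a work item in each nfsd_file so that
 * closes can proceed in parallel on an UNBOUND workqueue, instead of
 * being serialised on one per-namespace work item. */
struct nfsd_file {
	struct file		*nf_file;
	struct work_struct	nf_close_work;	/* assumed new member */
	/* ... existing members ... */
};

static void nfsd_file_close_work(struct work_struct *work)
{
	struct nfsd_file *nf =
		container_of(work, struct nfsd_file, nf_close_work);

	filp_close(nf->nf_file, NULL);
	/* ... then release the nfsd_file itself ... */
}

/* Instead of splicing nf onto a shared dispose list: */
static void nfsd_file_dispose(struct nfsd_file *nf)
{
	INIT_WORK(&nf->nf_close_work, nfsd_file_close_work);
	queue_work(nfsd_filecache_wq, &nf->nf_close_work);
}
```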


> > IOW there is some parallelism there already, especially now that
> > nfsd_filecache_wq is UNBOUND.
> 
> No there is not.  And UNBOUND makes no difference to parallelism in this
> case.  It allows the one work item to migrate between CPUs while it is
> running, but it doesn't allow it to run concurrently on two different
> CPUs.

Right. The laundrette can now run in parallel with other work by
moving to a different core, but there still can be only one
laundrette running per namespace.


> (UNBOUND can improve parallelism when multiple different work items are
>  submitted all from the same CPU.  Without UNBOUND all the work would
>  happen on the same CPU, though if the work sleeps, the different work
>  items can be interleaved.  With UNBOUND the different work items can
>  enjoy true parallelism when needed).
> 
> 
> > 
> > 
> > > > > For normal threads, the thread that closes the file also calls the
> > > > > final fput so there is natural rate limiting preventing excessive growth
> > > > > in the list of delayed fputs.  For kernel threads, and particularly for
> > > > > nfsd, delayed in the final fput do not impose any throttling to prevent
> > > > > the thread from closing more files.
> > > > 
> > > > I don't think we want to block nfsd threads waiting for files to
> > > > close. Won't that be a potential denial of service?
> > > 
> > > Not as much as the denial of service caused by memory exhaustion due to
> > > an indefinitely growing list of files waiting to be closed by a single
> > > thread of workqueue.
> > 
> > The cache garbage collector is single-threaded, but nfsd_filecache_wq
> > has a max_active setting of zero.
> 
> This allows parallelism between network namespaces, but not within a
> network namespace.
> 
> > 
> > 
> > > I think it is perfectly reasonable that when handling an NFSv4 CLOSE,
> > > the nfsd thread should completely handle that request including all the
> > > flush and ->release etc.  If that causes any denial of service, then
> > > simple increase the number of nfsd threads.
> > > 
> > > For NFSv3 it is more complex.  On the kernel where I saw a problem the
> > > filp_close happen after each READ or WRITE (though I think the customer
> > > was using NFSv4...).  With the file cache there is no thread that is
> > > obviously responsible for the close.
> > > To get the sort of throttling that I think is need, we could possibly
> > > have each "nfsd_open" check if there are pending closes, and to wait for
> > > some small amount of progress.
> > 
> > Well nfsd_open() in particular appears to be used only for readdir.
> > 
> > But maybe nfsd_file_acquire() could wait briefly, in the garbage-
> > collected case, if the nfsd_net's disposal queue is long.
> > 
> > 
> > > But don't think it is reasonable for the nfsd threads to take none of
> > > the burden of closing files as that can result in imbalance.
> > > 
> > > I'll need to give this more thought.
> > 
> > 
> > -- 
> > Chuck Lever
> > 
> 
> Thanks,
> NeilBrown
  
Oleg Nesterov Nov. 28, 2023, 4:59 p.m. UTC | #16
On 11/28, Christian Brauner wrote:
>
> Yeah, I had played with that as well. Only reason I didn't do it was to
> avoid a PF_* flag. If that's preferable it might be worth to just add
> PF_TASK_WORK and decouple this from PF_KTHREAD.

OK, I won't insist.

But,

> +       /*
> +        * By default only non-kernel threads can use task work. Kernel
> +        * threads that manage task work explicitly can add that flag in
> +        * their kthread callback.
> +        */
> +       if (!args->kthread)
> +               p->flags |= PF_TASK_WORK;

The comment and the name of the new flag look a bit misleading to me...

kthreads can use task_work.  You can create a kthread which does
task_work_run() from time to time and use task_work_add() on it,
nothing wrong with that.

Probably nobody does this right now (I didn't try to check), but please
note irq_thread()->task_work_add(on_exit_work).

Oleg.
  
Oleg Nesterov Nov. 28, 2023, 5:29 p.m. UTC | #17
Forgot to mention,

On 11/28, Oleg Nesterov wrote:
>
> but please
> note irq_thread()->task_work_add(on_exit_work).

and this means that Neil's and your patches were wrong ;)

Oleg.
  
NeilBrown Nov. 28, 2023, 11:20 p.m. UTC | #18
On Wed, 29 Nov 2023, Christian Brauner wrote:
> [Reusing the trimmed Cc]
> 
> On Tue, Nov 28, 2023 at 11:16:06AM +1100, NeilBrown wrote:
> > On Tue, 28 Nov 2023, Chuck Lever wrote:
> > > On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote:
> > > > [...]
> > > 
> > > I don't think we want to block nfsd threads waiting for files to
> > > close. Won't that be a potential denial of service?
> > 
> > Not as much as the denial of service caused by memory exhaustion due to
> > an indefinitely growing list of files waiting to be closed by a single
> > thread of workqueue.
> 
> It seems less likely that you run into memory exhausting than a DOS
> because nfsd() is busy closing fds. Especially because you default to
> single nfsd thread afaict.

An nfsd thread would not end up being busy closing fds any more than it
can already be busy reading data, syncing out changes, or renaming a file.
Which is to say: of course it can be busy doing this, but doing this sort
of thing is its whole purpose in life.

If an nfsd thread only completes the closes that it itself initiated
(which is what I am currently proposing) then there would be at most
one, or maybe two, fds to close after handling each request.  While that
is certainly a non-zero burden, I can't see how it can realistically be
called a DOS.

> 
> > I think it is perfectly reasonable that when handling an NFSv4 CLOSE,
> > the nfsd thread should completely handle that request including all the
> > flush and ->release etc.  If that causes any denial of service, then
> > simple increase the number of nfsd threads.
> 
> But isn't that a significant behavioral change? So I would expect to
> make this at configurable via a module- or Kconfig option?

Not really.  Certainly not more than the change to introduce the
filecache to nfsd in v5.4.

> 
> > For NFSv3 it is more complex.  On the kernel where I saw a problem the
> > filp_close happen after each READ or WRITE (though I think the customer
> > was using NFSv4...).  With the file cache there is no thread that is
> > obviously responsible for the close.
> > To get the sort of throttling that I think is need, we could possibly
> > have each "nfsd_open" check if there are pending closes, and to wait for
> > some small amount of progress.
> > 
> > But don't think it is reasonable for the nfsd threads to take none of
> > the burden of closing files as that can result in imbalance.
> 
> It feels that this really needs to be tested under a similar workload in
> question to see whether this is a viable solution.
> 

Creating that workload might be a challenge.  I know it involved
accessing 10s of millions of files with a server that was somewhat
memory constrained.  I don't know anything about the access pattern.

Certainly I'll try to reproduce something similar by inserting delays in
suitable places.  This will help exercise the code, but won't really
replicate the actual workload.

Thanks,
NeilBrown
  
NeilBrown Nov. 28, 2023, 11:31 p.m. UTC | #19
On Wed, 29 Nov 2023, Jeff Layton wrote:
> 
> This is another place where we might want to reserve a "rescuer" thread
> that avoids doing work that can end up blocked. Maybe we could switch
> back to queuing them to the list when we're below a certain threshold of
> available threads (1? 2? 4?).

A rescuer isn't for cases where work might be blocked, but for cases
where work might otherwise be deadlocked - though maybe that is what
you meant.

Could nfsd calling filp_close() or __fput() ever deadlock?
I think we know from the experience pre v5.8 that calling filp_close()
doesn't cause deadlocks.  Could __fput, and particularly ->release ever
deadlock w.r.t nfsd?  i.e.  could a ->release function for a file
exported through NFS ever wait for nfsd to handle an NFS request?

We don't need to worry about indirect dependencies like allocating
memory and waiting for nfsd to flush out writes - that is already
handled so that we can support loop-back mounts.

So to have a problem we would need to nfs-export an NFS filesystem that
was being served by the local NFS server.  Now that we support NFS
re-export, and we support loop-back mounts, it is fair to ask if we
support the combination of the two.  If we did, then calling ->release
from the nfsd thread could deadlock.  But calling ->read and ->write (or
whatever those interfaces are called today) would also deadlock.

So I think we have to say that nfs-reexporting a loop-back NFS mount is
not supported, and not supportable.  Whether we should try to detect and
reject this case is an interesting question, but quite a separate
question from that of how to handle the closing of files.

In short - I don't think there is any need or value in a dedicated
"rescuer" thread here.

Thanks,
NeilBrown
  
NeilBrown Nov. 28, 2023, 11:40 p.m. UTC | #20
On Wed, 29 Nov 2023, Oleg Nesterov wrote:
> Forgot to mention,
> 
> On 11/28, Oleg Nesterov wrote:
> >
> > but please
> > note irq_thread()->task_work_add(on_exit_work).
> 
> and this means that Neil's and your patch were wrong ;)

Yes it does - thanks for that!
I guess we need a setting that is focused particularly on fput().
Probably a PF flag is best for that.
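To make the intended split concrete, here is a minimal userspace model of a per-task flag gating where a final fput lands.  Everything here is an illustrative stand-in (the flag name, fake_fput, global_deferred), not a kernel API:

```c
#include <assert.h>
#include <stddef.h>

/* A per-task flag (standing in for a hypothetical PF-style bit such as
 * "runs task work") decides whether a final fput is queued back to the
 * closing task itself or punted to a shared deferred list, the way
 * delayed_fput() offloads it today.
 */
struct fake_file { struct fake_file *next; int closed; };

struct fake_task {
	int runs_task_work;          /* analogue of the proposed PF flag */
	struct fake_file *own_list;  /* work this task will run itself */
};

static struct fake_file *global_deferred; /* analogue of delayed_fput_list */

static void fake_fput(struct fake_task *tsk, struct fake_file *f)
{
	if (tsk->runs_task_work) {
		f->next = tsk->own_list;   /* the task pays for its own close */
		tsk->own_list = f;
	} else {
		f->next = global_deferred; /* punted to the shared worker */
		global_deferred = f;
	}
}

/* What an nfsd-like loop would call at a safe point; returns the number
 * of pending closes it retired. */
static int fake_task_work_run(struct fake_task *tsk)
{
	int n = 0;

	while (tsk->own_list) {
		struct fake_file *f = tsk->own_list;

		tsk->own_list = f->next;
		f->closed = 1;             /* __fput() would happen here */
		n++;
	}
	return n;
}
```

With the flag set, the closing task retires its own backlog at the next safe point; with it clear, work accumulates on the shared list with no back pressure on the producer.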

Thanks,
NeilBrown


  
NeilBrown Nov. 29, 2023, 12:14 a.m. UTC | #21
On Wed, 29 Nov 2023, Oleg Nesterov wrote:
> On 11/28, NeilBrown wrote:
> >
> > I have evidence from a customer site of 256 nfsd threads adding files to
> > delayed_fput_lists nearly twice as fast as they are retired by a single
> > work-queue thread running delayed_fput().  As you might imagine this
> > does not end well (20 million files in the queue at the time a snapshot
> > was taken for analysis).
> 
> On a related note... Neil, Al, et al, can you look at
> 
> 	[PATCH 1/3] fput: don't abuse task_work_add() when possible
> 	https://lore.kernel.org/all/20150908171446.GA14589@redhat.com/
> 

Would it make sense to create a separate task_struct->delayed_fput
llist?
fput() adds the file to this llist and if it was the first item on the
list, it then adds the task_work.  That would probably require adding a
callback_head to struct task_struct as well.

NeilBrown
  
Oleg Nesterov Nov. 29, 2023, 7:55 a.m. UTC | #22
On 11/29, NeilBrown wrote:
>
> On Wed, 29 Nov 2023, Oleg Nesterov wrote:
> > On 11/28, NeilBrown wrote:
> > >
> > > I have evidence from a customer site of 256 nfsd threads adding files to
> > > delayed_fput_lists nearly twice as fast as they are retired by a single
> > > work-queue thread running delayed_fput().  As you might imagine this
> > > does not end well (20 million files in the queue at the time a snapshot
> > > was taken for analysis).
> >
> > On a related note... Neil, Al, et al, can you look at
> >
> > 	[PATCH 1/3] fput: don't abuse task_work_add() when possible
> > 	https://lore.kernel.org/all/20150908171446.GA14589@redhat.com/
> >
>
> Would it make sense to create a separate task_struct->delayed_fput
> llist?

Sure, I too thought about this,

> fput() adds the file to this llist and if it was the first item on the
> list, it then adds the task_work.  That would probably request adding a
> callback_head to struct task_struct as well.

Even simpler, but perhaps I missed something...

We can add a "struct file *fput_xxx" into task_struct and f_fput_xxx into
the f_llist/f_rcuhead union in the struct file.

fput:

	if (task->fput_xxx) {
		file->f_fput_xxx = task->fput_xxx;
		task->fput_xxx = file;
	} else {
		task_work_add(...);
		// XXX: file->f_fput_xxx != NULL
		task->fput_xxx = file;
	}

____fput:

	struct file *file = task->fput_xxx;
	struct file *tail = container_of(work, ...);
	// see XXX in fput()
	tail->f_fput_xxx = NULL;
	current->fput_xxx = NULL;

	do {
		next = READ_ONCE(file->f_fput_xxx);
		__fput(file);
		file = next;
		
	} while (file);
	
Again, quite possibly I missed something, but something like this should work.

But I am still trying to find a simpler solution which doesn't need another
member in task_struct...
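For what it's worth, the linking scheme above can be exercised as a single-threaded userspace model.  Here work_tail stands in for the container_of(work) tail and work_pending for the queued callback_head; the XXX window (the tail's stale link) is resolved at drain time exactly as in the sketch.  All names are illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* Single-threaded model of the fput_xxx idea: the task keeps a head
 * pointer, each file carries a link field, and only the first fput in a
 * batch "schedules" the drain work.
 */
struct xfile { struct xfile *f_fput_xxx; int closed; };

struct xtask {
	struct xfile *fput_xxx;  /* head of the pending batch */
	struct xfile *work_tail; /* stands in for container_of(work, ...) */
	int work_pending;        /* stands in for task_work_add() */
};

static void model_fput(struct xtask *t, struct xfile *f)
{
	if (t->fput_xxx) {
		f->f_fput_xxx = t->fput_xxx; /* link in, work already queued */
		t->fput_xxx = f;
	} else {
		t->work_pending = 1;         /* task_work_add() in the sketch */
		t->work_tail = f;            /* XXX: f->f_fput_xxx left stale */
		t->fput_xxx = f;
	}
}

/* ____fput() analogue: terminate the list at the tail, then walk it. */
static int model_drain(struct xtask *t)
{
	struct xfile *file = t->fput_xxx;
	int n = 0;

	t->work_tail->f_fput_xxx = NULL; /* see XXX in model_fput() */
	t->fput_xxx = NULL;
	t->work_pending = 0;

	while (file) {
		struct xfile *next = file->f_fput_xxx;

		file->closed = 1;            /* __fput() would go here */
		n++;
		file = next;
	}
	return n;
}
```

The invariant to note is that task_work is only queued once per batch, no matter how many files pile on before the drain runs.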

Oleg.
  
Christian Brauner Nov. 29, 2023, 11:38 a.m. UTC | #23
On Tue, Nov 28, 2023 at 06:29:59PM +0100, Oleg Nesterov wrote:
> Forgot to mention,
> 
> On 11/28, Oleg Nesterov wrote:
> >
> > but please
> > note irq_thread()->task_work_add(on_exit_work).
> 
> and this means that Neil's and your patch were wrong ;)

Hm, that's all the more reason to not hang this off of PF_KTHREAD then.
I mean, it's functional but we likely wouldn't have run into this
confusion if this were PF_FPUT_DELAYED, for example.
  
Christian Brauner Nov. 29, 2023, 11:43 a.m. UTC | #24
> If an nfsd thread only completes the close that it initiated (which is
> what I am currently proposing) then there would be at most
> one, or maybe 2, fds to close after handling each request.  While that
> is certainly a non-zero burden, I can't see how it can realistically be
> called a DOS.

The 10s of millions of files is what makes me curious. Because that's
the workload that'd be interesting.

> > It feels that this really needs to be tested under a similar workload in
> > question to see whether this is a viable solution.
> > 
> 
> Creating that workload might be a challenge.  I know it involved
> accessing 10s of millions of files with a server that was somewhat
> memory constrained.  I don't know anything about the access pattern.
> 
> Certainly I'll try to reproduce something similar by inserting delays in
> suitable places.  This will help exercise the code, but won't really
> replicate the actual workload.
  
Chuck Lever Nov. 29, 2023, 2:04 p.m. UTC | #25
On Wed, Nov 29, 2023 at 10:20:23AM +1100, NeilBrown wrote:
> On Wed, 29 Nov 2023, Christian Brauner wrote:
> > [Reusing the trimmed Cc]
> > 
> > On Tue, Nov 28, 2023 at 11:16:06AM +1100, NeilBrown wrote:
> > > On Tue, 28 Nov 2023, Chuck Lever wrote:
> > > > On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote:
> > > > > 
> > > > > I have evidence from a customer site of 256 nfsd threads adding files to
> > > > > > delayed_fput_lists nearly twice as fast as they are retired by a single
> > > > > work-queue thread running delayed_fput().  As you might imagine this
> > > > > does not end well (20 million files in the queue at the time a snapshot
> > > > > was taken for analysis).
> > > > > 
> > > > > While this might point to a problem with the filesystem not handling the
> > > > > final close efficiently, such problems should only hurt throughput, not
> > > > > lead to memory exhaustion.
> > > > 
> > > > I have this patch queued for v6.8:
> > > > 
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-next&id=c42661ffa58acfeaf73b932dec1e6f04ce8a98c0
> > > > 
> > > 
> > > Thanks....
> > > I think that change is good, but I don't think it addresses the problem
> > > mentioned in the description, and it is not directly relevant to the
> > > problem I saw ... though it is complicated.
> > > 
> > > The problem "workqueue ...  hogged cpu..." probably means that
> > > nfsd_file_dispose_list() needs a cond_resched() call in the loop.
> > > That will stop it from hogging the CPU whether it is tied to one CPU or
> > > free to roam.
> > > 
> > > Also that work is calling filp_close() which primarily calls
> > > filp_flush().
> > > It also calls fput() but that does minimal work.  If there is much work
> > > to do then that is offloaded to another work-item.  *That* is the
> > > workitem that I had problems with.
> > > 
> > > The problem I saw was with an older kernel which didn't have the nfsd
> > > file cache and so probably is calling filp_close more often.  So maybe
> > > my patch isn't so important now.  Particularly as nfsd now isn't closing
> > > most files in-task but instead offloads that to another task.  So the
> > > final fput will not be handled by the nfsd task either.
> > > 
> > > But I think there is room for improvement.  Gathering lots of files
> > > together into a list and closing them sequentially is not going to be as
> > > efficient as closing them in parallel.
> > > 
> > > > 
> > > > > For normal threads, the thread that closes the file also calls the
> > > > > final fput so there is natural rate limiting preventing excessive growth
> > > > > in the list of delayed fputs.  For kernel threads, and particularly for
> > > > > nfsd, delays in the final fput do not impose any throttling to prevent
> > > > > the thread from closing more files.
> > > > 
> > > > I don't think we want to block nfsd threads waiting for files to
> > > > close. Won't that be a potential denial of service?
> > > 
> > > Not as much as the denial of service caused by memory exhaustion due to
> > > an indefinitely growing list of files waiting to be closed by a single
> > > thread of workqueue.
> > 
> > It seems less likely that you run into memory exhaustion than a DOS
> > because nfsd() is busy closing fds. Especially because you default to
> > single nfsd thread afaict.
> 
> An nfsd thread would not end up being busy closing fds any more than it
> can already be busy reading data or busy syncing out changes or busy
> renaming a file.
> Which is to say: of course it can be busy doing this, but doing this sort
> of thing is its whole purpose in life.
> 
> If an nfsd thread only completes the close that it initiated (which is
> what I am currently proposing) then there would be at most
> one, or maybe 2, fds to close after handling each request.

Closing files more aggressively would seem to entirely defeat the
purpose of the file cache, which is to avoid the overhead of opens
and closes on frequently-used files.

And usually Linux prefers to let the workload consume as many free
resources as possible before it applies back pressure or cache
eviction.

IMO the first step should be removing head-of-queue blocking from
the file cache's background closing mechanism. That might be enough
to avoid forming a backlog in most cases.


> > > For NFSv3 it is more complex.  On the kernel where I saw a problem the
> > > filp_close happens after each READ or WRITE (though I think the customer
> > > was using NFSv4...).  With the file cache there is no thread that is
> > > obviously responsible for the close.
> > > To get the sort of throttling that I think is needed, we could possibly
> > > have each "nfsd_open" check if there are pending closes, and wait for
> > > some small amount of progress.
> > > 
> > > But I don't think it is reasonable for the nfsd threads to take none of
> > > the burden of closing files as that can result in imbalance.
> > 
> > It feels that this really needs to be tested under a similar workload in
> > question to see whether this is a viable solution.
> 
> Creating that workload might be a challenge.  I know it involved
> accessing 10s of millions of files with a server that was somewhat
> memory constrained.  I don't know anything about the access pattern.
> 
> Certainly I'll try to reproduce something similar by inserting delays in
> suitable places.  This will help exercise the code, but won't really
> replicate the actual workload.

It's likely that the fundamental bottleneck is writeback during
close.
  
Jeff Layton Nov. 30, 2023, 5:47 p.m. UTC | #26
On Wed, 2023-11-29 at 09:04 -0500, Chuck Lever wrote:
> On Wed, Nov 29, 2023 at 10:20:23AM +1100, NeilBrown wrote:
> > If an nfsd thread only completes the close that it initiated (which is
> > what I am currently proposing) then there would be at most
> > one, or maybe 2, fds to close after handling each request.
> 
> Closing files more aggressively would seem to entirely defeat the
> purpose of the file cache, which is to avoid the overhead of opens
> and closes on frequently-used files.
> 
> And usually Linux prefers to let the workload consume as many free
> resources as possible before it applies back pressure or cache
> eviction.
> 
> IMO the first step should be removing head-of-queue blocking from
> the file cache's background closing mechanism. That might be enough
> to avoid forming a backlog in most cases.
> 
> 

That's not quite what task_work does. Neil's patch wouldn't result in
closes happening more aggressively. It would just make it so that we
don't queue the delayed part of the fput process to a workqueue like we
do today.

Instead, the nfsd threads would have to clean that part up themselves,
like syscalls do before returning to userland. I think that idea makes
sense overall since that mirrors what we already do in userland.

In the event that all of the nfsd threads are tied up in slow task_work
jobs...tough luck. That at least makes it more of a self-limiting
problem since RPCs will start queueing, rather than allowing dead
files to just pile onto the list.
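The self-limiting property is easy to see with a toy model: if each request drains its own deferred closes before the next one (the task_work behaviour), the backlog is bounded by closes-per-request, while a lone background retirer that is slower than the producers grows without bound.  The rates below are invented purely for illustration, echoing the "added twice as fast as retired" ratio from the cover letter:

```c
#include <assert.h>

#define REQUESTS           1000
#define CLOSES_PER_RQ      2
#define BG_RETIRE_PER_TICK 1   /* background worker is the bottleneck */

/* task_work-style: the thread drains its own closes after each request,
 * so the backlog can never exceed CLOSES_PER_RQ. */
static int run_task_work_model(int *max_backlog)
{
	int backlog = 0, done = 0;

	for (int rq = 0; rq < REQUESTS; rq++) {
		backlog += CLOSES_PER_RQ;  /* request defers its closes */
		if (backlog > *max_backlog)
			*max_backlog = backlog;
		done += backlog;           /* drained before next request */
		backlog = 0;
	}
	return done;
}

/* workqueue-style: producers add 2 per tick, one worker retires 1 per
 * tick, so the list grows by 1 every tick. */
static int run_workqueue_model(int *max_backlog)
{
	int backlog = 0, done = 0;

	for (int rq = 0; rq < REQUESTS; rq++) {
		int retire;

		backlog += CLOSES_PER_RQ;
		retire = backlog < BG_RETIRE_PER_TICK ? backlog
						      : BG_RETIRE_PER_TICK;
		backlog -= retire;
		done += retire;
		if (backlog > *max_backlog)
			*max_backlog = backlog;
	}
	return done;
}
```

Same total work either way; the difference is that one model has a constant peak backlog and the other a peak proportional to how long the imbalance persists.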


  
Jeff Layton Nov. 30, 2023, 5:50 p.m. UTC | #27
On Tue, 2023-11-28 at 10:34 -0500, Chuck Lever wrote:
> On Tue, Nov 28, 2023 at 01:57:30PM +1100, NeilBrown wrote:
> > 
> > (trimmed cc...)
> > 
> > On Tue, 28 Nov 2023, Chuck Lever wrote:
> > > On Tue, Nov 28, 2023 at 11:16:06AM +1100, NeilBrown wrote:
> > > > The problem I saw was with an older kernel which didn't have the nfsd
> > > > file cache and so probably is calling filp_close more often.
> > > 
> > > Without the file cache, the filp_close() should be handled directly
> > > by the nfsd thread handling the RPC, IIRC.
> > 
> > Yes - but __fput() is handled by a workqueue.
> > 
> > > 
> > > 
> > > > So maybe
> > > > my patch isn't so important now.  Particularly as nfsd now isn't closing
> > > > most files in-task but instead offloads that to another task.  So the
> > > > final fput will not be handled by the nfsd task either.
> > > > 
> > > > But I think there is room for improvement.  Gathering lots of files
> > > > together into a list and closing them sequentially is not going to be as
> > > > efficient as closing them in parallel.
> > > 
> > > I believe the file cache passes the filps to the work queue one at
> > 
> > nfsd_file_close_inode() does.  nfsd_file_gc() and nfsd_file_lru_scan()
> > can pass multiple.
> > 
> > > a time, but I don't think there's anything that forces the work
> > > queue to handle each flush/close completely before proceeding to the
> > > next.
> > 
> > Parallelism with workqueues is controlled by the work items (struct
> > work_struct).  Two different work items can run in parallel.  But any
> > given work item can never run parallel to itself.
> > 
> > The only work items queued on nfsd_filecache_wq are from
> >   nn->fcache_disposal->work.
> > There is one of these for each network namespace.  So in any given
> > network namespace, all work on nfsd_filecache_wq is fully serialised.
> 
> OIC, it's that specific case you are concerned with. The per-
> namespace laundrette was added by:
> 
>   9542e6a643fc ("nfsd: Containerise filecache laundrette")
> 
> Its purpose was to confine the close backlog to each container.
> 
> Seems like it would be better if there was a struct work_struct
> in each struct nfsd_file. That wouldn't add real backpressure to
> nfsd threads, but it would enable file closes to run in parallel.
> 

I like this idea. That seems a lot simpler than all of this weirdo
queueing of delayed closes that we do.
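A userspace sketch of that idea, with pthreads standing in for kernel work items and every name purely illustrative: when each file carries its own work item, closes are no longer serialized behind one per-namespace queue and can make progress concurrently:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NFILES 8

struct pfile {
	pthread_t worker;  /* analogue of a work_struct in each nfsd_file */
	int closed;
};

/* The flush/close that a kernel work handler would do for one file. */
static void *close_work(void *arg)
{
	struct pfile *f = arg;

	f->closed = 1;     /* filp_flush() + __fput() would go here */
	return NULL;
}

/* Queue every file as its own independent work item and wait for all of
 * them; a slow close of one file no longer blocks the others. */
static int close_all_in_parallel(struct pfile *files, int n)
{
	int i, done = 0;

	for (i = 0; i < n; i++)
		pthread_create(&files[i].worker, NULL, close_work, &files[i]);
	for (i = 0; i < n; i++) {
		pthread_join(files[i].worker, NULL);
		done += files[i].closed;
	}
	return done;
}
```

This only shows the dispatch shape; the real trade-off (memory for a work item per nfsd_file versus one shared laundrette) is a kernel design decision the thread is still debating.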
  
Chuck Lever Nov. 30, 2023, 6:07 p.m. UTC | #28
On Thu, Nov 30, 2023 at 12:47:58PM -0500, Jeff Layton wrote:
> On Wed, 2023-11-29 at 09:04 -0500, Chuck Lever wrote:
> > 
> > Closing files more aggressively would seem to entirely defeat the
> > purpose of the file cache, which is to avoid the overhead of opens
> > and closes on frequently-used files.
> > 
> > And usually Linux prefers to let the workload consume as many free
> > resources as possible before it applies back pressure or cache
> > eviction.
> > 
> > IMO the first step should be removing head-of-queue blocking from
> > the file cache's background closing mechanism. That might be enough
> > to avoid forming a backlog in most cases.
> 
> That's not quite what task_work does. Neil's patch wouldn't result in
> closes happening more aggressively. It would just make it so that we
> don't queue the delayed part of the fput process to a workqueue like we
> do today.
> 
> Instead, the nfsd threads would have to clean that part up themselves,
> like syscalls do before returning to userland. I think that idea makes
> sense overall since that mirrors what we already do in userland.
> 
> In the event that all of the nfsd threads are tied up in slow task_work
> jobs...tough luck. That at least makes it more of a self-limiting
> problem since RPCs will start queueing, rather than allowing dead
> files to just pile onto the list.

Thanks for helping me understand the proposal. task_work would cause
nfsd threads to wait for flush/close operations that others have
already started; it would not increase the rate of closing cached
file descriptors.

The thing that nfsd_filecache_wq does is compartmentalize the
flush/close workload so that a heavy flush/close workload in one
net namespace does not negatively impact other namespaces. IIUC,
then, task_work does not discriminate between namespaces -- if one
namespace is creating a backlog of dirty files to close, all nfsd
threads would need to handle that backlog, and thus all namespaces
would bear (a part of) that backlog.
  
Jeff Layton Nov. 30, 2023, 6:33 p.m. UTC | #29
On Thu, 2023-11-30 at 13:07 -0500, Chuck Lever wrote:
> On Thu, Nov 30, 2023 at 12:47:58PM -0500, Jeff Layton wrote:
> > On Wed, 2023-11-29 at 09:04 -0500, Chuck Lever wrote:
> > > On Wed, Nov 29, 2023 at 10:20:23AM +1100, NeilBrown wrote:
> > > > On Wed, 29 Nov 2023, Christian Brauner wrote:
> > > > > [Reusing the trimmed Cc]
> > > > > 
> > > > > On Tue, Nov 28, 2023 at 11:16:06AM +1100, NeilBrown wrote:
> > > > > > On Tue, 28 Nov 2023, Chuck Lever wrote:
> > > > > > > On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote:
> > > > > > > > 
> > > > > > > > I have evidence from a customer site of 256 nfsd threads adding files to
> > > > > > > > delayed_fput_lists nearly twice as fast they are retired by a single
> > > > > > > > work-queue thread running delayed_fput().  As you might imagine this
> > > > > > > > does not end well (20 million files in the queue at the time a snapshot
> > > > > > > > was taken for analysis).
> > > > > > > > 
> > > > > > > > While this might point to a problem with the filesystem not handling the
> > > > > > > > final close efficiently, such problems should only hurt throughput, not
> > > > > > > > lead to memory exhaustion.
> > > > > > > 
> > > > > > > I have this patch queued for v6.8:
> > > > > > > 
> > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-next&id=c42661ffa58acfeaf73b932dec1e6f04ce8a98c0
> > > > > > > 
> > > > > > 
> > > > > > Thanks....
> > > > > > I think that change is good, but I don't think it addresses the problem
> > > > > > mentioned in the description, and it is not directly relevant to the
> > > > > > problem I saw ... though it is complicated.
> > > > > > 
> > > > > > The problem "workqueue ...  hogged cpu..." probably means that
> > > > > > nfsd_file_dispose_list() needs a cond_resched() call in the loop.
> > > > > > That will stop it from hogging the CPU whether it is tied to one CPU or
> > > > > > free to roam.
> > > > > > 
> > > > > > Also that work is calling filp_close() which primarily calls
> > > > > > filp_flush().
> > > > > > It also calls fput() but that does minimal work.  If there is much work
> > > > > > to do then that is offloaded to another work-item.  *That* is the
> > > > > > workitem that I had problems with.
> > > > > > 
> > > > > > The problem I saw was with an older kernel which didn't have the nfsd
> > > > > > file cache and so probably is calling filp_close more often.  So maybe
> > > > > > my patch isn't so important now.  Particularly as nfsd now isn't closing
> > > > > > most files in-task but instead offloads that to another task.  So the
> > > > > > final fput will not be handled by the nfsd task either.
> > > > > > 
> > > > > > But I think there is room for improvement.  Gathering lots of files
> > > > > > together into a list and closing them sequentially is not going to be as
> > > > > > efficient as closing them in parallel.
> > > > > > 
> > > > > > > 
> > > > > > > > For normal threads, the thread that closes the file also calls the
> > > > > > > > final fput so there is natural rate limiting preventing excessive growth
> > > > > > > > in the list of delayed fputs.  For kernel threads, and particularly for
> > > > > > > > nfsd, delayed in the final fput do not impose any throttling to prevent
> > > > > > > > the thread from closing more files.
> > > > > > > 
> > > > > > > I don't think we want to block nfsd threads waiting for files to
> > > > > > > close. Won't that be a potential denial of service?
> > > > > > 
> > > > > > Not as much as the denial of service caused by memory exhaustion due to
> > > > > > an indefinitely growing list of files waiting to be closed by a single
> > > > > > workqueue thread.
> > > > > 
> > > > > It seems less likely that you run into memory exhaustion than a DoS
> > > > > because nfsd() is busy closing fds. Especially because you default to a
> > > > > single nfsd thread afaict.
> > > > 
> > > > An nfsd thread would not end up being busy closing fds any more than it
> > > > can already be busy reading data, busy syncing out changes, or busy
> > > > renaming a file.
> > > > Which is to say: of course it can be busy doing this, but doing this sort
> > > > of thing is its whole purpose in life.
> > > > 
> > > > If an nfsd thread only completes the close that it initiated
> > > > (which is what I am currently proposing) then there would be at most
> > > > one, or maybe two, fds to close after handling each request.
> > > 
> > > Closing files more aggressively would seem to entirely defeat the
> > > purpose of the file cache, which is to avoid the overhead of opens
> > > and closes on frequently-used files.
> > > 
> > > And usually Linux prefers to let the workload consume as many free
> > > resources as possible before it applies back pressure or cache
> > > eviction.
> > > 
> > > IMO the first step should be removing head-of-queue blocking from
> > > the file cache's background closing mechanism. That might be enough
> > > to avoid forming a backlog in most cases.
> > 
> > That's not quite what task_work does. Neil's patch wouldn't result in
> > closes happening more aggressively. It would just make it so that we
> > don't queue the delayed part of the fput process to a workqueue like we
> > do today.
> > 
> > Instead, the nfsd threads would have to clean that part up themselves,
> > like syscalls do before returning to userland. I think that idea makes
> > sense overall since that mirrors what we already do in userland.
> > 
> > In the event that all of the nfsd threads are tied up in slow task_work
> > jobs...tough luck. That at least makes it more of a self-limiting
> > problem since RPCs will start queueing, rather than allowing dead
> > files to just pile onto the list.
> 
> Thanks for helping me understand the proposal. task_work would cause
> nfsd threads to wait for flush/close operations that others have
> already started; it would not increase the rate of closing cached
> file descriptors.
> 

Note that task_work is completely a per-task thing. Each nfsd thread now
becomes responsible for cleaning up the files that were closed as part
of processing the last RPC that it processed.

Closes that occur during the processing of an RPC would be finished
before the thread picks up a new RPC to process, but I don't see how the
thread would be responsible for closes that happen in a different task
altogether.


> The thing that nfsd_filesystem_wq does is compartmentalize the
> flush/close workload so that a heavy flush/close workload in one
> net namespace does not negatively impact other namespaces. IIUC,
> then, task_work does not discriminate between namespaces -- if one
> namespace is creating a backlog of dirty files to close, all nfsd
> threads would need to handle that backlog, and thus all namespaces
> would bear (a part of) that backlog.
> 

Closes that are queued to the nfsd_filesystem_wq won't be affected by 
this patch, since those files get closed in the context of a workqueue
thread. In most cases, those are v2/3 files that have timed out.

FWIW, I expect that this patch will mostly affect NFSv4, since those get
closed more directly by nfsd. I wonder if we might want to consider
doing something like this with lockd as well.
  
NeilBrown Dec. 4, 2023, 1:30 a.m. UTC | #30
On Wed, 29 Nov 2023, Christian Brauner wrote:
> > If an nfsd thread only completes the close that it initiated
> > (which is what I am currently proposing) then there would be at most
> > one, or maybe two, fds to close after handling each request.  While that
> > is certainly a non-zero burden, I can't see how it can realistically be
> > called a DoS.
> 
> The 10s of millions of files is what makes me curious. Because that's
> the workload that'd be interesting.
> 

I think the main effect of the 10s of millions of files is to bloat the
icache, which causes it to be cleaned, and this results in synchronous
reads from storage in situations where you wouldn't usually expect them.

It appears from examining a memory-snapshot that some files being closed
have already been unlinked.  This is quite unusual with NFS but can
happen if they are unlinked from one client while open on another
client.  (The directory containing the file in one case is called
"Cache".  Maybe cleaning of that cache by the applications often gets
files that are in use).

This pattern means that the final __dput calls __dentry_kill and
eventually xfs_fs_destroy_inode.  This sometimes needs to read
synchronously from storage - if the required info isn't cached.  This
causes the delays.

I've modelled the delay with 

diff --git a/fs/file_table.c b/fs/file_table.c
index d36cade6e366..51563f79385a 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -29,6 +29,7 @@
 #include <linux/ima.h>
 #include <linux/swap.h>
 #include <linux/kmemleak.h>
+#include <linux/delay.h>
 
 #include <linux/atomic.h>
 
@@ -375,6 +376,9 @@ static void __fput(struct file *file)
 	eventpoll_release(file);
 	locks_remove_file(file);
 
+	if ((file->f_mode & FMODE_WRITE) &&
+	    (current->flags & PF_KTHREAD))
+		msleep(25);
 	ima_file_free(file);
 	if (unlikely(file->f_flags & FASYNC)) {
 		if (file->f_op->fasync)


I loop-back mount a filesystem with NFS on the test machine.
The PF_KTHREAD test ensures that when "cp -r" writes to the NFS
filesystem there is no delay, but when nfsd writes to the local
filesystem there is a delay.

With this patch I can easily demonstrate the number of open files
growing without bound.  With my patch as well (I'll send new version
shortly) the growth is bounded.

Thanks,
NeilBrown
  

Patch

diff --git a/fs/file_table.c b/fs/file_table.c
index de4a2915bfd4..e79351df22be 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -445,7 +445,7 @@  void fput(struct file *file)
 	if (atomic_long_dec_and_test(&file->f_count)) {
 		struct task_struct *task = current;
 
-		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
+		if (likely(!in_interrupt())) {
 			init_task_work(&file->f_rcuhead, ____fput);
 			if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME))
 				return;
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 66ca50b38b27..c047961262ca 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -13,6 +13,7 @@ 
 #include <linux/fs_struct.h>
 #include <linux/swap.h>
 #include <linux/siphash.h>
+#include <linux/task_work.h>
 
 #include <linux/sunrpc/stats.h>
 #include <linux/sunrpc/svcsock.h>
@@ -941,6 +942,7 @@  nfsd(void *vrqstp)
 	}
 
 	current->fs->umask = 0;
+	current->task_works = NULL; /* Declare that I will call task_work_run() */
 
 	atomic_inc(&nfsdstats.th_cnt);
 
@@ -955,6 +957,8 @@  nfsd(void *vrqstp)
 
 		svc_recv(rqstp);
 		validate_process_creds();
+		if (task_work_pending(current))
+			task_work_run();
 	}
 
 	atomic_dec(&nfsdstats.th_cnt);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 292c31697248..c63c2bedbf71 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1117,6 +1117,7 @@  struct task_struct {
 	unsigned int			sas_ss_flags;
 
 	struct callback_head		*task_works;
+#define	TASK_WORKS_DISABLED	((void*)1)
 
 #ifdef CONFIG_AUDIT
 #ifdef CONFIG_AUDITSYSCALL
diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index 795ef5a68429..3c74e3de81ed 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -22,7 +22,9 @@  enum task_work_notify_mode {
 
 static inline bool task_work_pending(struct task_struct *task)
 {
-	return READ_ONCE(task->task_works);
+	struct callback_head *works = READ_ONCE(task->task_works);
+
+	return works && works != TASK_WORKS_DISABLED;
 }
 
 int task_work_add(struct task_struct *task, struct callback_head *twork,
diff --git a/kernel/fork.c b/kernel/fork.c
index 10917c3e1f03..903b29804fe1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2577,7 +2577,7 @@  __latent_entropy struct task_struct *copy_process(
 	p->dirty_paused_when = 0;
 
 	p->pdeath_signal = 0;
-	p->task_works = NULL;
+	p->task_works = args->kthread ? TASK_WORKS_DISABLED : NULL;
 	clear_posix_cputimers_work(p);
 
 #ifdef CONFIG_KRETPROBES
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 95a7e1b7f1da..ffdf4b0d7a0e 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -49,7 +49,8 @@  int task_work_add(struct task_struct *task, struct callback_head *work,
 
 	head = READ_ONCE(task->task_works);
 	do {
-		if (unlikely(head == &work_exited))
+		if (unlikely(head == &work_exited ||
+			     head == TASK_WORKS_DISABLED))
 			return -ESRCH;
 		work->next = head;
 	} while (!try_cmpxchg(&task->task_works, &head, work));
@@ -157,7 +158,7 @@  void task_work_run(void)
 		work = READ_ONCE(task->task_works);
 		do {
 			head = NULL;
-			if (!work) {
+			if (!work || work == TASK_WORKS_DISABLED) {
 				if (task->flags & PF_EXITING)
 					head = &work_exited;
 				else
@@ -165,7 +166,7 @@  void task_work_run(void)
 			}
 		} while (!try_cmpxchg(&task->task_works, &work, head));
 
-		if (!work)
+		if (!work || work == TASK_WORKS_DISABLED)
 			break;
 		/*
 		 * Synchronize with task_work_cancel(). It can not remove