Message ID: 20231208033006.5546-2-neilb@suse.de
State: New
Series: nfsd: fully close all files in the nfsd threads
From: NeilBrown <neilb@suse.de>
To: Al Viro <viro@zeniv.linux.org.uk>, Christian Brauner <brauner@kernel.org>, Jens Axboe <axboe@kernel.dk>, Oleg Nesterov <oleg@redhat.com>, Chuck Lever <chuck.lever@oracle.com>, Jeff Layton <jlayton@kernel.org>
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-nfs@vger.kernel.org
Subject: [PATCH 1/3] nfsd: use __fput_sync() to avoid delayed closing of files.
Date: Fri, 8 Dec 2023 14:27:26 +1100
In-Reply-To: <20231208033006.5546-1-neilb@suse.de>
References: <20231208033006.5546-1-neilb@suse.de>
Commit Message
NeilBrown
Dec. 8, 2023, 3:27 a.m. UTC
Calling fput() directly or through filp_close() from a kernel thread like
nfsd causes the final __fput() (if necessary) to be called from a
workqueue. This means that nfsd is not forced to wait for any work to
complete. If the ->release or ->destroy_inode function is slow for any
reason, this can result in nfsd closing files more quickly than the
workqueue can complete the close, and the queue of pending closes can
grow without bound (30 million has been seen at one customer site,
though this was in part due to a slowness in xfs which has since been
fixed).
nfsd does not need this. It is quite appropriate and safe for nfsd to do
its own close work. There is no reason that close should ever wait for
nfsd, so no deadlock can occur.
So change all fput() calls to __fput_sync(), and convert filp_close() to
the sequence get_file(); filp_close(); __fput_sync().
This ensures that no fput work is queued to the workqueue.
Note that this removes the only in-module use of flush_delayed_fput().
Signed-off-by: NeilBrown <neilb@suse.de>
---
fs/nfsd/filecache.c | 3 ++-
fs/nfsd/lockd.c | 2 +-
fs/nfsd/nfs4proc.c | 4 ++--
fs/nfsd/nfs4recover.c | 2 +-
fs/nfsd/vfs.c | 12 ++++++------
5 files changed, 12 insertions(+), 11 deletions(-)
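
For illustration, the filp_close() conversion described in the commit message amounts to the following pattern (a minimal sketch; the helper name nfsd_filp_close_sync() is hypothetical and not part of this patch, which open-codes the sequence at its one call site):

static inline void nfsd_filp_close_sync(struct file *file)
{
	/* Take an extra reference so filp_close() cannot drop the last one. */
	get_file(file);
	/* ->flush and lock cleanup happen here, exactly as before. */
	filp_close(file, NULL);
	/* Drop the extra reference; if it is the last, run __fput() in this thread. */
	__fput_sync(file);
}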
Comments
On Fri, Dec 08, 2023 at 02:27:26PM +1100, NeilBrown wrote: > Calling fput() directly or though filp_close() from a kernel thread like > nfsd causes the final __fput() (if necessary) to be called from a > workqueue. This means that nfsd is not forced to wait for any work to > complete. If the ->release of ->destroy_inode function is slow for any > reason, this can result in nfsd closing files more quickly than the > workqueue can complete the close and the queue of pending closes can > grow without bounces (30 million has been seen at one customer site, > though this was in part due to a slowness in xfs which has since been > fixed). > > nfsd does not need this. That is technically true, but IIUC, there is only one case where a synchronous close matters for the backlog problem, and that's when nfsd_file_free() is called from nfsd_file_put(). AFAICT all other call sites (except rename) are error paths, so there aren't negative consequences for the lack of synchronous wait there... Consider at least breaking this patch into pieces so that there is one call site changed per patch: - If a performance regression occurs, a bisect can point out the specific call site replacement that triggered it - There is an opportunity to provide rationale for each call site, because it seems to me there are two or three distinct motivations across this patch, not all of which apply to each replacement. More specific (perhaps naive) questions below. - It's more surgical to drop one or two of these smaller patches if we find a problem or that the particular change is unneeded. Also, it would be convenient if this patch (series) introduced an nfsd_close() that then called __fput_sync(): - The kdoc for that function could explain why __fput_sync() is preferred and safe for nfsd - When troubleshooting, function boundary tracing could use this utility function to easily filter (sometimes costly) calls to __fput_sync() from nfsd and ignore other calls to __fput_sync() But if it turns out only one or two fput() call sites need to be replaced, an nfsd_close() utility doesn't make sense. I'm trying to think of a benchmark workload that I can use to exercise this patch series. We've used fstests generic/531 in the past for similar issues. Any others to consider? > This quite appropriate and safe for nfsd to do > its own close work. There is now reason that close should ever wait for > nfsd, so no deadlock can occur. > > So change all fput() calls to __fput_sync(), and convert filp_close() to > the sequence get_file();filp_close();__fput_sync(). > > This ensure that no fput work is queued to the workqueue. > > Note that this removes the only in-module use of flush_fput_queue(). > > Signed-off-by: NeilBrown <neilb@suse.de> > --- > fs/nfsd/filecache.c | 3 ++- > fs/nfsd/lockd.c | 2 +- > fs/nfsd/nfs4proc.c | 4 ++-- > fs/nfsd/nfs4recover.c | 2 +- > fs/nfsd/vfs.c | 12 ++++++------ > 5 files changed, 12 insertions(+), 11 deletions(-) > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > index ef063f93fde9..e9734c7451b5 100644 > --- a/fs/nfsd/filecache.c > +++ b/fs/nfsd/filecache.c > @@ -283,7 +283,9 @@ nfsd_file_free(struct nfsd_file *nf) > nfsd_file_mark_put(nf->nf_mark); > if (nf->nf_file) { > nfsd_file_check_write_error(nf); > + get_file(nf->nf_file); > filp_close(nf->nf_file, NULL); > + __fput_sync(nf->nf_file); > } This effectively reverts 4c475eee0237 ("nfsd: don't fsync nfsd_files on last close")... Won't this cause a regression? Jeff? 
And this is really the only place where a flushing close can produce back pressure on clients, isn't it? > > /* > @@ -631,7 +633,6 @@ nfsd_file_close_inode_sync(struct inode *inode) > list_del_init(&nf->nf_lru); > nfsd_file_free(nf); > } > - flush_delayed_fput(); > } > > /** > diff --git a/fs/nfsd/lockd.c b/fs/nfsd/lockd.c > index 46a7f9b813e5..f9d1059096a4 100644 > --- a/fs/nfsd/lockd.c > +++ b/fs/nfsd/lockd.c > @@ -60,7 +60,7 @@ nlm_fopen(struct svc_rqst *rqstp, struct nfs_fh *f, struct file **filp, > static void > nlm_fclose(struct file *filp) > { > - fput(filp); > + __fput_sync(filp); > } Will lock file descriptors have dirty data? This function is called from nlm_traverse_files(), which is looping over all files in a table, and it's called while a global mutex is held. Any data showing this won't have a scalability impact? > static const struct nlmsvc_binding nfsd_nlm_ops = { > diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c > index 6f2d4aa4970d..20d60823d530 100644 > --- a/fs/nfsd/nfs4proc.c > +++ b/fs/nfsd/nfs4proc.c > @@ -629,7 +629,7 @@ nfsd4_open(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate, > nn->somebody_reclaimed = true; > out: > if (open->op_filp) { > - fput(open->op_filp); > + __fput_sync(open->op_filp); > open->op_filp = NULL; > } Isn't this just discarding a file descriptor that wasn't needed? What's the need for a flushing close here? > if (resfh && resfh != &cstate->current_fh) { > @@ -1546,7 +1546,7 @@ nfsd4_cleanup_inter_ssc(struct nfsd4_ssc_umount_item *nsui, struct file *filp, > long timeout = msecs_to_jiffies(nfsd4_ssc_umount_timeout); > > nfs42_ssc_close(filp); > - fput(filp); > + __fput_sync(filp); > > spin_lock(&nn->nfsd_ssc_lock); > list_del(&nsui->nsui_list); > diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c > index 3509e73abe1f..f8f0112fd9f5 100644 > --- a/fs/nfsd/nfs4recover.c > +++ b/fs/nfsd/nfs4recover.c > @@ -561,7 +561,7 @@ nfsd4_shutdown_recdir(struct net *net) > > if (!nn->rec_file) > return; > - fput(nn->rec_file); > + __fput_sync(nn->rec_file); > nn->rec_file = NULL; > } What's the justification for a flushing close in this path? > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c > index fbbea7498f02..15a811229211 100644 > --- a/fs/nfsd/vfs.c > +++ b/fs/nfsd/vfs.c > @@ -879,7 +879,7 @@ __nfsd_open(struct svc_rqst *rqstp, struct svc_fh *fhp, umode_t type, > > host_err = ima_file_check(file, may_flags); > if (host_err) { > - fput(file); > + __fput_sync(file); > goto out; > } AFAICT __nfsd_open is used only for creating a file descriptor for either an NLM request or for handling a readdir. In fact, I'm not even sure why there is an ima_file_check() call site here. IMO there doesn't need to be a flushing close in this error flow. > @@ -1884,10 +1884,10 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen, > fh_drop_write(ffhp); > > /* > - * If the target dentry has cached open files, then we need to try to > - * close them prior to doing the rename. Flushing delayed fput > - * shouldn't be done with locks held however, so we delay it until this > - * point and then reattempt the whole shebang. > + * If the target dentry has cached open files, then we need to > + * try to close them prior to doing the rename. Final fput > + * shouldn't be done with locks held however, so we delay it > + * until this point and then reattempt the whole shebang. 
> */ > if (close_cached) { > close_cached = false; > @@ -2141,7 +2141,7 @@ nfsd_readdir(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t *offsetp, > if (err == nfserr_eof || err == nfserr_toosmall) > err = nfs_ok; /* can still be found in ->err */ > out_close: > - fput(file); > + __fput_sync(file); Do we expect a directory file descriptor to need a flushing close? > out: > return err; > } > -- > 2.43.0 > >
On Sat, 09 Dec 2023, Chuck Lever wrote: > On Fri, Dec 08, 2023 at 02:27:26PM +1100, NeilBrown wrote: > > Calling fput() directly or though filp_close() from a kernel thread like > > nfsd causes the final __fput() (if necessary) to be called from a > > workqueue. This means that nfsd is not forced to wait for any work to > > complete. If the ->release of ->destroy_inode function is slow for any > > reason, this can result in nfsd closing files more quickly than the > > workqueue can complete the close and the queue of pending closes can > > grow without bounces (30 million has been seen at one customer site, > > though this was in part due to a slowness in xfs which has since been > > fixed). > > > > nfsd does not need this. > > That is technically true, but IIUC, there is only one case where a > synchronous close matters for the backlog problem, and that's when > nfsd_file_free() is called from nfsd_file_put(). AFAICT all other > call sites (except rename) are error paths, so there aren't negative > consequences for the lack of synchronous wait there... What you say is technically true but it isn't the way I see it. Firstly I should clarify that __fput_sync() is *not* a flushing close as you describe it below. All it does, apart from some trivial book-keeping, is to call ->release and possibly ->destroy_inode immediately rather than shunting them off to another thread. Apparently ->release sometimes does something that can deadlock with some kernel threads or if some awkward locks are held, so the whole final __fput is delayed by default. But this does not apply to nfsd. Standard fput() is really the wrong interface for nfsd to use. It should use __fput_sync() (which shouldn't have such a scary name). The comment above flush_delayed_fput() seems to suggest that unmounting is a core issue. Maybe the fact that __fput() can call dissolve_on_fput() is a reason why it is sometimes safer to leave the work to later. But I don't see that applying to nfsd. Of course a ->release function *could* do synchronous writes just like the XFS ->destroy_inode function used to do synchronous reads. I don't think we should ever try to hide that by putting it in a workqueue. It's probably a bug and it is best if bugs are visible. Note that the XFS ->release function does call filemap_flush() in some cases, but that is an async flush, so __fput_sync doesn't wait for the flush to complete. The way I see this patch is that fput() is the wrong interface for nfsd to use, __fput_sync is the right interface. So we should change. 1 patch. The details about exhausting memory explain a particular symptom that motivated the examination which revealed that nfsd was using the wrong interface. If we have nfsd sometimes using fput() and sometimes __fput_sync, then we need to have clear rules for when to use which. It is much easier to have a simple rule: always use __fput_sync(). I'm certainly happy to revise function documentation and provide wrapper functions if needed. It might be good to have

void filp_close_sync(struct file *f)
{
	get_file(f);
	filp_close(f);
	__fput_sync(f);
}

but as that would only be called once, it was hard to motivate. Having it in linux/fs.h would be nice. Similarly we could wrap __fput_sync() in a more friendly name, but that would be better if we actually renamed the function.

void fput_now(struct file *f)
{
	__fput_sync(f);
}

?? Thanks, NeilBrown
On Mon, Dec 11, 2023 at 09:47:35AM +1100, NeilBrown wrote: > On Sat, 09 Dec 2023, Chuck Lever wrote: > > On Fri, Dec 08, 2023 at 02:27:26PM +1100, NeilBrown wrote: > > > Calling fput() directly or though filp_close() from a kernel thread like > > > nfsd causes the final __fput() (if necessary) to be called from a > > > workqueue. This means that nfsd is not forced to wait for any work to > > > complete. If the ->release of ->destroy_inode function is slow for any > > > reason, this can result in nfsd closing files more quickly than the > > > workqueue can complete the close and the queue of pending closes can > > > grow without bounces (30 million has been seen at one customer site, > > > though this was in part due to a slowness in xfs which has since been > > > fixed). > > > > > > nfsd does not need this. > > > > That is technically true, but IIUC, there is only one case where a > > synchronous close matters for the backlog problem, and that's when > > nfsd_file_free() is called from nfsd_file_put(). AFAICT all other > > call sites (except rename) are error paths, so there aren't negative > > consequences for the lack of synchronous wait there... > > What you say is technically true but it isn't the way I see it. > > Firstly I should clarify that __fput_sync() is *not* a flushing close as > you describe it below. > All it does, apart for some trivial book-keeping, is to call ->release > and possibly ->destroy_inode immediately rather than shunting them off > to another thread. > Apparently ->release sometimes does something that can deadlock with > some kernel threads or if some awkward locks are held, so the whole > final __fput is delay by default. But this does not apply to nfsd. > Standard fput() is really the wrong interface for nfsd to use. > It should use __fput_sync() (which shouldn't have such a scary name). > > The comment above flush_delayed_fput() seems to suggest that unmounting > is a core issue. Maybe the fact that __fput() can call > dissolve_on_fput() is a reason why it is sometimes safer to leave the > work to later. But I don't see that applying to nfsd. > > Of course a ->release function *could* do synchronous writes just like > the XFS ->destroy_inode function used to do synchronous reads. I had assumed ->release for NFS re-export would flush due to close- to-open semantics. There seem to be numerous corner cases that might result in pile-ups which would change the situation in your problem statement but might not result in an overall improvement. > I don't think we should ever try to hide that by putting it in > a workqueue. It's probably a bug and it is best if bugs are visible. I'm not objecting, per se, to this change. I would simply like to see a little more due diligence before moving forward until it is clear how frequently ->release or ->destroy_inode will do I/O (or "is slow for any reason" as you say above). > Note that the XFS ->release function does call filemap_flush() in some > cases, but that is an async flush, so __fput_sync doesn't wait for the > flush to complete. When Jeff was working on the file cache a year ago, I did some performance analysis that shows even an async flush is costly when there is a lot of dirty data in the file being closed. The VFS walks through the whole file and starts I/O on every dirty page. This is quite CPU intensive, and can take on the order of a millisecond before the async flush request returns to its caller. IME async flushes are not free. 
> The way I see this patch is that fput() is the wrong interface for nfsd > to use, __fput_sync is the right interface. So we should change. 1 > patch. The practical matter is I see this as a change with a greater than zero risk, and we need to mitigate that risk. Or rather, as a maintainer of NFSD, /I/ need to see that the risk is as minimal as is practical. > The details about exhausting memory explain a particular symptom that > motivated the examination which revealed that nfsd was using the wrong > interface. > > If we have nfsd sometimes using fput() and sometimes __fput_sync, then > we need to have clear rules for when to use which. It is much easier to > have a simple rule: always use __fput_sync(). I don't agree that we should just flop all these over and hope for the best. In particular: - the changes in fs/nfsd/filecache.c appear to revert a bug fix, so I need to see data that shows that change doesn't cause a re-regression - the changes in fs/lockd/ can result in long waits while a global mutex is held (global as in all namespaces and all locked files on the server), so I need to see data that demonstrates there won't be a regression - the other changes don't appear to have motivation in terms of performance or behavior, and carry similar (if lesser) risks as the other two changes. My preferred solution to potential auditor confusion about the use of __fput_sync() in some places and fput() in others is to document, and leave call sites alone if there's no technical reason to change them at this time. There is enough of a risk of regression that I need to see a clear rationale for each hunk /and/ I need to see data that there is no regression. I know that won't be perfect coverage, but it's better than not having any data at all. > I'm certainly happy to revise function documentation and provide > wrapper functions if needed. > > It might be good to have > > void filp_close_sync(struct file *f) > { > get_file(f); > filp_close(f); > __fput_sync(f); > } > > but as that would only be called once, it was hard to motivate. > Having it in linux/fs.h would be nice. > > Similarly we could wrap __fput_sync() in a more friendly name, but > that would be better if we actually renamed the function. > > void fput_now(struct file *f) > { > __fput_sync(f); > } > > ?? Since this is an issue strictly for nfsd, the place for this utility function is in fs/nfsd/vfs.c, IMO, along with a documenting comment that provides a rationale for why nfsd does not want plain fput() in specific cases. When other subsystems need a similar capability, then let's consider a common helper.
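
A minimal sketch of what such an nfsd-local helper might look like (the name nfsd_fput() and the kdoc wording are illustrative assumptions, not code from this series):

/**
 * nfsd_fput - synchronously release a file opened by nfsd
 * @file: file to release
 *
 * nfsd opens regular files on exportable filesystems and runs in a
 * context where it can safely perform the final __fput() work itself.
 * Doing that work here, rather than deferring it to the delayed-fput
 * workqueue, keeps the backlog of pending closes bounded.
 */
static void nfsd_fput(struct file *file)
{
	__fput_sync(file);
}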
On Mon, Dec 11, 2023 at 09:47:35AM +1100, NeilBrown wrote: > Similarly would could wrap __fput_sync() is a more friendly name, but > that would be better if we actually renamed the function. > > void fput_now(struct file *f) > { > __fput_sync(f); > } It is unfriendly *precisely* because it should not be used without a very good reason. It may be the last opened file keeping a lazy-umounted mount alive. It may be taking pretty much any locks, or eating a lot of stack space. It really isn't a general-purpose API; any "more friendly name" is going to be NAKed for that reason alone. Al, very much tempted to send a patch renaming that sucker to __fput_dont_use_that_unless_you_really_know_what_you_are_doing().
On Tue, 12 Dec 2023, Chuck Lever wrote: > On Mon, Dec 11, 2023 at 09:47:35AM +1100, NeilBrown wrote: > > On Sat, 09 Dec 2023, Chuck Lever wrote: > > > On Fri, Dec 08, 2023 at 02:27:26PM +1100, NeilBrown wrote: > > > > Calling fput() directly or though filp_close() from a kernel thread like > > > > nfsd causes the final __fput() (if necessary) to be called from a > > > > workqueue. This means that nfsd is not forced to wait for any work to > > > > complete. If the ->release of ->destroy_inode function is slow for any > > > > reason, this can result in nfsd closing files more quickly than the > > > > workqueue can complete the close and the queue of pending closes can > > > > grow without bounces (30 million has been seen at one customer site, > > > > though this was in part due to a slowness in xfs which has since been > > > > fixed). > > > > > > > > nfsd does not need this. > > > > > > That is technically true, but IIUC, there is only one case where a > > > synchronous close matters for the backlog problem, and that's when > > > nfsd_file_free() is called from nfsd_file_put(). AFAICT all other > > > call sites (except rename) are error paths, so there aren't negative > > > consequences for the lack of synchronous wait there... > > > > What you say is technically true but it isn't the way I see it. > > > > Firstly I should clarify that __fput_sync() is *not* a flushing close as > > you describe it below. > > All it does, apart for some trivial book-keeping, is to call ->release > > and possibly ->destroy_inode immediately rather than shunting them off > > to another thread. > > Apparently ->release sometimes does something that can deadlock with > > some kernel threads or if some awkward locks are held, so the whole > > final __fput is delay by default. But this does not apply to nfsd. > > Standard fput() is really the wrong interface for nfsd to use. > > It should use __fput_sync() (which shouldn't have such a scary name). > > > > The comment above flush_delayed_fput() seems to suggest that unmounting > > is a core issue. Maybe the fact that __fput() can call > > dissolve_on_fput() is a reason why it is sometimes safer to leave the > > work to later. But I don't see that applying to nfsd. > > > > Of course a ->release function *could* do synchronous writes just like > > the XFS ->destroy_inode function used to do synchronous reads. > > I had assumed ->release for NFS re-export would flush due to close- > to-open semantics. There seem to be numerous corner cases that > might result in pile-ups which would change the situation in your > problem statement but might not result in an overall improvement. That's the ->flush call in filp_close(). > > > > I don't think we should ever try to hide that by putting it in > > a workqueue. It's probably a bug and it is best if bugs are visible. > > > I'm not objecting, per se, to this change. I would simply like to > see a little more due diligence before moving forward until it is > clear how frequently ->release or ->destroy_inode will do I/O (or > "is slow for any reason" as you say above). > > > > Note that the XFS ->release function does call filemap_flush() in some > > cases, but that is an async flush, so __fput_sync doesn't wait for the > > flush to complete. > > When Jeff was working on the file cache a year ago, I did some > performance analysis that shows even an async flush is costly when > there is a lot of dirty data in the file being closed. The VFS walks > through the whole file and starts I/O on every dirty page. 
This is > quite CPU intensive, and can take on the order of a millisecond > before the async flush request returns to its caller. > > IME async flushes are not free. True, they aren't free. But some thread has to pay that price. I think nfsd should. You might argue that nfsd should wait to pay the price until after it has sent a reply to the client. My patches already effectively do that for garbage-collected files. Doing it for all files would probably be easy. But is it really worth the (small) complexity? I don't know. > > > > The way I see this patch is that fput() is the wrong interface for nfsd > > to use, __fput_sync is the right interface. So we should change. 1 > > patch. > > The practical matter is I see this as a change with a greater than > zero risk, and we need to mitigate that risk. Or rather, as a > maintainer of NFSD, /I/ need to see that the risk is as minimal as > is practical. > > > > The details about exhausting memory explain a particular symptom that > > motivated the examination which revealed that nfsd was using the wrong > > interface. > > > > If we have nfsd sometimes using fput() and sometimes __fput_sync, then > > we need to have clear rules for when to use which. It is much easier to > > have a simple rule: always use __fput_sync(). > > I don't agree that we should just flop all these over and hope for > the best. In particular: > > - the changes in fs/nfsd/filecache.c appear to revert a bug > fix, so I need to see data that shows that change doesn't > cause a re-regression The bug fix you refer to is "nfsd: don't fsync nfsd_files on last close" The patch doesn't change when fsync (or ->flush) is called, so it doesn't revert this bugfix. > > - the changes in fs/lockd/ can result in long waits while a > global mutex is held (global as in all namespaces and all > locked files on the server), so I need to see data that > demonstrates there won't be a regression It's probably impossible to provide any such data. The patch certainly moves work inside that mutex and so would increase the hold time, if only slightly. Is that lock hot enough to notice? Conventional wisdom is that locking is only a tiny fraction of NFS traffic. It might be possible to construct a workload that saturates lockd, but I doubt it would be relevant to the real world. Maybe we should just break up that lock so that the problem becomes moot. > > - the other changes don't appear to have motivation in terms > of performance or behavior, and carry similar (if lesser) > risks as the other two changes. My preferred solution to > potential auditor confusion about the use of __fput_sync() > in some places and fput() in others is to document, and > leave call sites alone if there's no technical reason to > change them at this time. Sounds to me like a good way to grow technical debt, but I'll do it like that if you prefer. > > There is enough of a risk of regression that I need to see a clear > rationale for each hunk /and/ I need to see data that there is > no regression. I know that won't be perfect coverage, but it's > better than not having any data at all. > > > > I'm certainly happy to revise function documentation and provide > > wrapper functions if needed. > > > > It might be good to have > > > > void filp_close_sync(struct file *f) > > { > > get_file(f); > > filp_close(f); > > __fput_sync(f); > > } > > > > but as that would only be called once, it was hard to motivate. > > Having it in linux/fs.h would be nice. 
> > > > Similarly we could wrap __fput_sync() in a more friendly name, but > > that would be better if we actually renamed the function. > > > > void fput_now(struct file *f) > > { > > __fput_sync(f); > > } > > > > ?? > > Since this is an issue strictly for nfsd, the place for this > utility function is in fs/nfsd/vfs.c, IMO, along with a documenting > comment that provides a rationale for why nfsd does not want plain > fput() in specific cases. > > When other subsystems need a similar capability, then let's > consider a common helper. fs/smb/server/ probably would benefit from the same helper today. Thanks, NeilBrown > > > -- > Chuck Lever >
On Tue, 12 Dec 2023, Al Viro wrote: > On Mon, Dec 11, 2023 at 09:47:35AM +1100, NeilBrown wrote: > > > Similarly would could wrap __fput_sync() is a more friendly name, but > > that would be better if we actually renamed the function. > > > > void fput_now(struct file *f) > > { > > __fput_sync(f); > > } > > It is unfriendly *precisely* because it should not be used without > a very good reason. > > It may be the last opened file keeping a lazy-umounted mount alive. > It may be taking pretty much any locks, or eating a lot of stack > space. Previously you've suggested problems with ->release blocking. Now you refer to lazy-umount, which is what the comment above __fput_sync() mentions. "pretty much an locks" seems like hyperbole. I don't see it taking nfsd_mutex or nlmsvc_mutex. Maybe you mean any filesystem lock? My understanding is that the advent of vmalloc allocated stacks means that kernel stack space is not an important consideration. It would really help if we could have clear documented explanation of what problems can occur. Maybe an example of contexts where it isn't safe to call __fput_sync(). I can easily see that lazy-unmount is an interesting case which could easily catch people unawares. Punting the tail end of mntput_no_expire (i.e. if count reaches zero) to a workqueue/task_work makes sense and would be much less impact than punting every __fput to a workqueue. Would that make an fput_now() call safe to use in most contexts, or is there something about ->release or dentry_kill() that can still cause problems? Thanks, NeilBrown > > It really isn't a general-purpose API; any "more friendly name" > is going to be NAKed for that reason alone. > > Al, very much tempted to send a patch renaming that sucker to > __fput_dont_use_that_unless_you_really_know_what_you_are_doing(). >
On Tue, Dec 12, 2023 at 09:23:51AM +1100, NeilBrown wrote: > Previously you've suggested problems with ->release blocking. > Now you refer to lazy-umount, which is what the comment above > __fput_sync() mentions. Yes? What I'm saying is that the set of locks involved is too large for any sane analysis. And lest you discard ->release(), that brings ->i_rwsem, and thus anything that might be grabbed under that. Someone's ->mmap_lock, for example. > "pretty much an locks" seems like hyperbole. I don't see it taking > nfsd_mutex or nlmsvc_mutex. I don't know - and I can't tell without serious search. What I can tell is that before making fput() delayed we used to find deadlocks on regular basis; that was a massive source of headache. > Maybe you mean any filesystem lock? Don't forget VM. And drivers. And there was quite a bit of fun happening in net/unix, etc. Sure, in case of nfsd the last two _probably_ won't occur - not directly, anyway. But making it a general nuisan^Wfacility is asking for trouble. > My understanding is that the advent of vmalloc allocated stacks means > that kernel stack space is not an important consideration. > > It would really help if we could have clear documented explanation of > what problems can occur. Maybe an example of contexts where it isn't > safe to call __fput_sync(). > > I can easily see that lazy-unmount is an interesting case which could > easily catch people unawares. Punting the tail end of mntput_no_expire > (i.e. if count reaches zero) to a workqueue/task_work makes sense and > would be much less impact than punting every __fput to a workqueue. > > Would that make an fput_now() call safe to use in most contexts, or is > there something about ->release or dentry_kill() that can still cause > problems? dentry_kill() means ->d_release(), ->d_iput() and anything final iput() could do. Including e.g. anything that might be done by afs_silly_iput(), with its "send REMOVE to server, wait for completion". No, that's not a deadlock per se, but it can stall you a bit more than you would probably consider tolerable... Sure, you could argue that AFS ought to make that thing asynchronous, but... Anyway, it won't be "safe to use in most contexts". ->mmap_lock alone is enough for that, and that's just the one I remember to have given us a lot of headache. And that's without bringing the "nfsd won't touch those files" cases - make it generally accessible and you get to audit all locks that might be taken when we close a socket, etc.
On Mon, Dec 11, 2023 at 11:13:30PM +0000, Al Viro wrote: > dentry_kill() means ->d_release(), ->d_iput() and anything final iput() > could do. Including e.g. anything that might be done by afs_silly_iput(), > with its "send REMOVE to server, wait for completion". No, that's not > a deadlock per se, but it can stall you a bit more than you would > probably consider tolerable... Sure, you could argue that AFS ought to > make that thing asynchronous, but... > > Anyway, it won't be "safe to use in most contexts". ->mmap_lock alone > is enough for that, and that's just the one I remember to have given > us a lot of headache. And that's without bringing the "nfsd won't > touch those files" cases - make it generally accessible and you get > to audit all locks that might be taken when we close a socket, etc. PS: put it that way - I can buy "nfsd is doing that only to regular files and not on an arbitrary filesystem, at that; having the thread wait on that sucker is not going to cause too much trouble"; I do *not* buy turning it into a thing usable outside of a very narrow set of circumstances.
On Tue, Dec 12, 2023 at 09:04:58AM +1100, NeilBrown wrote: > On Tue, 12 Dec 2023, Chuck Lever wrote: > > On Mon, Dec 11, 2023 at 09:47:35AM +1100, NeilBrown wrote: > > > On Sat, 09 Dec 2023, Chuck Lever wrote: > > > > On Fri, Dec 08, 2023 at 02:27:26PM +1100, NeilBrown wrote: > > > > > Calling fput() directly or though filp_close() from a kernel thread like > > > > > nfsd causes the final __fput() (if necessary) to be called from a > > > > > workqueue. This means that nfsd is not forced to wait for any work to > > > > > complete. If the ->release of ->destroy_inode function is slow for any > > > > > reason, this can result in nfsd closing files more quickly than the > > > > > workqueue can complete the close and the queue of pending closes can > > > > > grow without bounces (30 million has been seen at one customer site, > > > > > though this was in part due to a slowness in xfs which has since been > > > > > fixed). > > > > > > > > > > nfsd does not need this. > > > > > > > > That is technically true, but IIUC, there is only one case where a > > > > synchronous close matters for the backlog problem, and that's when > > > > nfsd_file_free() is called from nfsd_file_put(). AFAICT all other > > > > call sites (except rename) are error paths, so there aren't negative > > > > consequences for the lack of synchronous wait there... > > > > > > What you say is technically true but it isn't the way I see it. > > > > > > Firstly I should clarify that __fput_sync() is *not* a flushing close as > > > you describe it below. > > > All it does, apart for some trivial book-keeping, is to call ->release > > > and possibly ->destroy_inode immediately rather than shunting them off > > > to another thread. > > > Apparently ->release sometimes does something that can deadlock with > > > some kernel threads or if some awkward locks are held, so the whole > > > final __fput is delay by default. But this does not apply to nfsd. > > > Standard fput() is really the wrong interface for nfsd to use. > > > It should use __fput_sync() (which shouldn't have such a scary name). > > > > > > The comment above flush_delayed_fput() seems to suggest that unmounting > > > is a core issue. Maybe the fact that __fput() can call > > > dissolve_on_fput() is a reason why it is sometimes safer to leave the > > > work to later. But I don't see that applying to nfsd. > > > > > > Of course a ->release function *could* do synchronous writes just like > > > the XFS ->destroy_inode function used to do synchronous reads. > > > > I had assumed ->release for NFS re-export would flush due to close- > > to-open semantics. There seem to be numerous corner cases that > > might result in pile-ups which would change the situation in your > > problem statement but might not result in an overall improvement. > > That's the ->flush call in filp_close(). > > > > I don't think we should ever try to hide that by putting it in > > > a workqueue. It's probably a bug and it is best if bugs are visible. > > > > > > I'm not objecting, per se, to this change. I would simply like to > > see a little more due diligence before moving forward until it is > > clear how frequently ->release or ->destroy_inode will do I/O (or > > "is slow for any reason" as you say above). > > > > > > > Note that the XFS ->release function does call filemap_flush() in some > > > cases, but that is an async flush, so __fput_sync doesn't wait for the > > > flush to complete. 
> > > > When Jeff was working on the file cache a year ago, I did some > > performance analysis that shows even an async flush is costly when > > there is a lot of dirty data in the file being closed. The VFS walks > > through the whole file and starts I/O on every dirty page. This is > > quite CPU intensive, and can take on the order of a millisecond > > before the async flush request returns to its caller. > > > > IME async flushes are not free. > > True, they aren't free. But some thread has to pay that price. > I think nfsd should. An async flush can be as expensive as a synchronous flush in some cases. I'm not convinced that simply because a flush happens to be asynchronous, that makes it harmless to do at scale in the foreground. Our original desire way back when was to insert tactical async flushes during UNSTABLE WRITEs to get the server writing dirty data to storage sooner. It had an immediate negative throughput and latency impact. I have no philosophical disagreement about nfsd threads doing more work during a file close. I'm just saying that moving even an async flush from a worker to an nfsd thread might be visible to a client workload. > You might argue that nfsd should wait to pay the price until after it > has sent a reply to the client. My patches already effectively do that > for garbage-collected files. Doing it for all files would probably be > easy. But is it really worth the (small) complexity? I don't know. > > > > The way I see this patch is that fput() is the wrong interface for nfsd > > > to use, __fput_sync is the right interface. So we should change. 1 > > > patch. > > > > The practical matter is I see this as a change with a greater than > > zero risk, and we need to mitigate that risk. Or rather, as a > > maintainer of NFSD, /I/ need to see that the risk is as minimal as > > is practical. > > > > > > > The details about exhausting memory explain a particular symptom that > > > motivated the examination which revealed that nfsd was using the wrong > > > interface. > > > > > > If we have nfsd sometimes using fput() and sometimes __fput_sync, then > > > we need to have clear rules for when to use which. It is much easier to > > > have a simple rule: always use __fput_sync(). > > > > I don't agree that we should just flop all these over and hope for > > the best. In particular: > > > > - the changes in fs/nfsd/filecache.c appear to revert a bug > > fix, so I need to see data that shows that change doesn't > > cause a re-regression > > The bug fix you refer to is > "nfsd: don't fsync nfsd_files on last close" > The patch doesn't change when fsync (or ->flush) is called, so > it doesn't revert this bugfix. If the async flush is being done directly by nfsd_file_free instead of deferred to a work queue, that will slow down file closing, IMO, in particular in the single-threaded GC case. Again, not an objection, but we need to be aware of the impact of this change, and if it is negative, try to mitigate it. If that mitigation takes the form of patch 2/3, then maybe that patch needs to be applied before this one. > > - the changes in fs/lockd/ can result in long waits while a > > global mutex is held (global as in all namespaces and all > > locked files on the server), so I need to see data that > > demonstrates there won't be a regression > > It's probably impossible to provide any such data. > The patch certainly moves work inside that mutex and so would increase > the hold time, if only slightly. Is that lock hot enough to notice? 
> Conventional wisdom is that locking is only a tiny fraction of NFS > traffic. It might be possible to construct a workload that saturates > lockd, but I doubt it would be relevant to the real world. I have seen, in the past, customer workloads that are nothing but a stream of NLM lock and unlock requests, and the business requirement was that those need to go as fast as possible. Inserting I/O (even occasional asynchronous I/O) would result in an unwanted regression for that kind of workload. > Maybe we should just break up that lock so that the problem becomes moot. Let's drop this hunk for now until we (I) have a way to assess this change properly. I'm just not convinced this hunk is going to be a harmless change in some cases, and overall it doesn't appear to be necessary to address the close back-pressure concern. Replacing fput() here might be done after the file table data structure becomes more parallel. Maybe an rhashtable would be suitable. > > - the other changes don't appear to have motivation in terms > > of performance or behavior, and carry similar (if lesser) > > risks as the other two changes. My preferred solution to > > potential auditor confusion about the use of __fput_sync() > > in some places and fput() in others is to document, and > > leave call sites alone if there's no technical reason to > > change them at this time. > > Sounds to me like a good way to grow technical debt, but I'll do it like > that if you prefer. Yes, I prefer a helper with an explanation of why nfsd uses __fput_sync() for certain cases. I'm going to take the admonition in __fput_sync()'s kdoc comment seriously.
On Tue, 12 Dec 2023, Al Viro wrote: > On Mon, Dec 11, 2023 at 11:13:30PM +0000, Al Viro wrote: > > > dentry_kill() means ->d_release(), ->d_iput() and anything final iput() > > could do. Including e.g. anything that might be done by afs_silly_iput(), > > with its "send REMOVE to server, wait for completion". No, that's not > > a deadlock per se, but it can stall you a bit more than you would > > probably consider tolerable... Sure, you could argue that AFS ought to > > make that thing asynchronous, but... > > > > Anyway, it won't be "safe to use in most contexts". ->mmap_lock alone > > is enough for that, and that's just the one I remember to have given > > us a lot of headache. And that's without bringing the "nfsd won't > > touch those files" cases - make it generally accessible and you get > > to audit all locks that might be taken when we close a socket, etc. > > PS: put it that way - I can buy "nfsd is doing that only to regular > files and not on an arbitrary filesystem, at that; having the thread > wait on that sucker is not going to cause too much trouble"; I do *not* > buy turning it into a thing usable outside of a very narrow set of > circumstances. > Can you say more about "not on an arbitrary filesystem" ? I guess you mean that procfs and/or sysfs might be problematic, as may similar virtual filesystems (nfsd maybe). Could we encode some of this in the comment for __fput_sync ??

/**
 * __fput_sync : drop reference to a file synchronously
 * @f: file to drop
 *
 * Drop a reference on a file and do most cleanup work before returning.
 *
 * Due to the wide use of files in the design of Linux, dropping the
 * final reference to a file can result in dropping the final reference
 * to any of a large variety of other objects. Dropping those final
 * references can result in nearly arbitrary work. It should be assumed
 * that, unless prior checks or actions confirm otherwise, calling
 * __fput_sync() might:
 * - allocate memory
 * - perform synchronous IO
 * - wait for a remote service (for networked filesystems)
 * - take ->i_rwsem and other related VFS and filesystem locks
 * - take ->s_umount (if file is on a MNT_INTERNAL filesystem)
 * - take locks in a device driver if the file is CHR, BLK or SOCK
 *
 * If the caller cannot be confident that none of these will cause a
 * problem, it should use fput() instead.
 *
 * Note that the final unmount of a lazy-unmounted non-MNT_INTERNAL
 * filesystem will always be handled asynchronously. Individual drivers
 * might also leave some cleanup to asynchronous threads.
 */

Thanks, NeilBrown
... > > PS: put it that way - I can buy "nfsd is doing that only to regular > > files and not on an arbitrary filesystem, at that; having the thread > > wait on that sucker is not going to cause too much trouble"; I do *not* > > buy turning it into a thing usable outside of a very narrow set of > > circumstances. > > > > Can you say more about "not on an arbitrary filesystem" ? > I guess you means that procfs and/or sysfs might be problematic as may > similar virtual filesystems (nfsd maybe). Can nfs export an ext4 fs that is on a loopback mount on a file that is remotely nfs (or other) mounted? As soon as you get loops like that you might find that fput() starts being problematic. I'm also sure I remember that nfs wasn't supposed to respond to a write until it had issued the actual disk write - but maybe no one do that any more because it really is too slow. (Especially if the 'disk' is a USB stick.) David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
.. > My understanding is that the advent of vmalloc allocated stacks means > that kernel stack space is not an important consideration. They are still the same (small) size - just allocated differently. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
> On Dec 15, 2023, at 1:27 PM, David Laight <David.Laight@ACULAB.COM> wrote: > > ... > > I'm also sure I remember that nfs wasn't supposed to respond to a write > until it had issued the actual disk write - but maybe no one do that > any more because it really is too slow. > (Especially if the 'disk' is a USB stick.) That rule applies only to NFSv2, which no-one should use any more. NFSv3 and later use a two-phase write. Clients send a COMMIT after an UNSTABLE WRITE to get the persistence guarantee. -- Chuck Lever
On Sat, 16 Dec 2023, David Laight wrote: > ... > > > PS: put it that way - I can buy "nfsd is doing that only to regular > > > files and not on an arbitrary filesystem, at that; having the thread > > > wait on that sucker is not going to cause too much trouble"; I do *not* > > > buy turning it into a thing usable outside of a very narrow set of > > > circumstances. > > > > > > > Can you say more about "not on an arbitrary filesystem" ? > > I guess you means that procfs and/or sysfs might be problematic as may > > similar virtual filesystems (nfsd maybe). > > Can nfs export an ext4 fs that is on a loopback mount on a file > that is remotely nfs (or other) mounted? Sure. There is no reason this would cause a problem. If the nfs mount were also a loopback mount, that might be interesting. i.e. You have a local filesystem, containing a file with a filesystem image. You nfs-export that local filesystem, and nfs mount it on the same host. Then you loop-mount that file in the nfs-mounted filesystem. Now you are getting into uncharted waters. I've tested loop-back NFS mounting and can assure you that it works. I haven't tried the double-loop. But if that caused a problem, I doubt it would be the fput; it would be fsync or writeback which causes the problem. > > As soon as you get loops like that you might find that fput() starts > being problematic. When calling fput on a regular file there are, I think, only two problem areas. One is that the fput might lead to a lazy-filesystem unmount completing. That only applies to MNT_INTERNAL filesystems, and they are unlikely to be exported (probably it's impossible, but I haven't checked). The other is synchronous (or even async) IO in the filesystem code, maybe completing an unlink or a truncate. This is no worse than any other synchronous IO that nfsd does. Thanks, NeilBrown > > I'm also sure I remember that nfs wasn't supposed to respond to a write > until it had issued the actual disk write - but maybe no one do that > any more because it really is too slow. > (Especially if the 'disk' is a USB stick.) > > David > > - > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK > Registration No: 1397386 (Wales) >
On Mon, Dec 11, 2023 at 09:47:35AM +1100, NeilBrown wrote: > On Sat, 09 Dec 2023, Chuck Lever wrote: > > On Fri, Dec 08, 2023 at 02:27:26PM +1100, NeilBrown wrote: > > > Calling fput() directly or though filp_close() from a kernel thread like > > > nfsd causes the final __fput() (if necessary) to be called from a > > > workqueue. This means that nfsd is not forced to wait for any work to > > > complete. If the ->release of ->destroy_inode function is slow for any > > > reason, this can result in nfsd closing files more quickly than the > > > workqueue can complete the close and the queue of pending closes can > > > grow without bounces (30 million has been seen at one customer site, > > > though this was in part due to a slowness in xfs which has since been > > > fixed). > > > > > > nfsd does not need this. > > > > That is technically true, but IIUC, there is only one case where a > > synchronous close matters for the backlog problem, and that's when > > nfsd_file_free() is called from nfsd_file_put(). AFAICT all other > > call sites (except rename) are error paths, so there aren't negative > > consequences for the lack of synchronous wait there... > > What you say is technically true but it isn't the way I see it. > > Firstly I should clarify that __fput_sync() is *not* a flushing close as > you describe it below. > All it does, apart for some trivial book-keeping, is to call ->release > and possibly ->destroy_inode immediately rather than shunting them off > to another thread. > Apparently ->release sometimes does something that can deadlock with > some kernel threads or if some awkward locks are held, so the whole > final __fput is delay by default. But this does not apply to nfsd. > Standard fput() is really the wrong interface for nfsd to use. > It should use __fput_sync() (which shouldn't have such a scary name). > > The comment above flush_delayed_fput() seems to suggest that unmounting > is a core issue. Maybe the fact that __fput() can call > dissolve_on_fput() is a reason why it is sometimes safer to leave the > work to later. But I don't see that applying to nfsd. > > Of course a ->release function *could* do synchronous writes just like > the XFS ->destroy_inode function used to do synchronous reads. What do you mean "could"? The correct word is "does". > I don't think we should ever try to hide that by putting it in > a workqueue. It's probably a bug and it is best if bugs are visible. Most definitely *not* a bug. XFS, ext4 and btrfs all call filemap_flush() from their respective ->release methods. This is required to protect user data against loss caused by poorly written applications that overwrite user data in an unsafe manner (i.e. the open-truncate-write-close overwrite anti-pattern). The btrfs flush trigger is very similar to XFS: /* * Set by setattr when we are about to truncate a file from a non-zero * size to a zero size. This tries to flush down new bytes that may * have been written if the application were using truncate to replace * a file in place. */ if (test_and_clear_bit(BTRFS_INODE_FLUSH_ON_CLOSE, &BTRFS_I(inode)->runtime_flags)) filemap_flush(inode->i_mapping); XFS started doing this in 2006, ext4 in 2008, and git will tell you when btrfs picked this up, too. IOWs, we've been doing writeback from ->release for a *very long time*. > Note that the XFS ->release function does call filemap_flush() in some > cases, but that is an async flush, so __fput_sync doesn't wait for the > flush to complete. 
"async flush" does not mean it will not block for long periods of time, it just means it won't wait for *all* the IO to complete. i.e. if the async flush saturates the device, bio submission will wait for previous IO that the flush submitted own IO to complete before it can continue flushing the data. But wait, it gets better! XFS, btrfs and ext4 all implement delayed allocation, which means writeback often needs to run extent allocation transactions. In these cases, transaction reservation can block on metadata writeback to free up journal space. In the case of XFS, this could be tens of thousands of metadata IOs needing to be submitted and completed!. Then consider that extent allocation needs to search for free space which may need to read in metadata. i.e. extent allocation will end up submitting and waiting on synchronous read IO. Also, reading that metadata requires memory allocation for the buffers that will store it - memory allocation can also block on IO and other subsystems to free up memory. Even less obvious is the stack usage issues calling ->release from arbitrary code entails. The filesystem writeback stack is -deep-. Remember all the problems we used to have with ->writepage() being called from direct memory reclaim and so putting the writeback path at arbitrary depths in the stack and then running out of stack space? We really don't want to go back to the bad old days where filesystem write paths can be entered from code that has already consumed most of the stack space.... Hence, IMO, __fput_sync() is something that needs to be very carefully controlled and should have big scary warnings on it. We really don't want it to be called from just anywhere... Cheers, Dave.
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index ef063f93fde9..e9734c7451b5 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -283,7 +283,9 @@ nfsd_file_free(struct nfsd_file *nf)
 	nfsd_file_mark_put(nf->nf_mark);
 	if (nf->nf_file) {
 		nfsd_file_check_write_error(nf);
+		get_file(nf->nf_file);
 		filp_close(nf->nf_file, NULL);
+		__fput_sync(nf->nf_file);
 	}
 
 	/*
@@ -631,7 +633,6 @@ nfsd_file_close_inode_sync(struct inode *inode)
 		list_del_init(&nf->nf_lru);
 		nfsd_file_free(nf);
 	}
-	flush_delayed_fput();
 }
 
 /**
diff --git a/fs/nfsd/lockd.c b/fs/nfsd/lockd.c
index 46a7f9b813e5..f9d1059096a4 100644
--- a/fs/nfsd/lockd.c
+++ b/fs/nfsd/lockd.c
@@ -60,7 +60,7 @@ nlm_fopen(struct svc_rqst *rqstp, struct nfs_fh *f, struct file **filp,
 static void
 nlm_fclose(struct file *filp)
 {
-	fput(filp);
+	__fput_sync(filp);
 }
 
 static const struct nlmsvc_binding nfsd_nlm_ops = {
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 6f2d4aa4970d..20d60823d530 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -629,7 +629,7 @@ nfsd4_open(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 		nn->somebody_reclaimed = true;
 out:
 	if (open->op_filp) {
-		fput(open->op_filp);
+		__fput_sync(open->op_filp);
 		open->op_filp = NULL;
 	}
 	if (resfh && resfh != &cstate->current_fh) {
@@ -1546,7 +1546,7 @@ nfsd4_cleanup_inter_ssc(struct nfsd4_ssc_umount_item *nsui, struct file *filp,
 	long timeout = msecs_to_jiffies(nfsd4_ssc_umount_timeout);
 
 	nfs42_ssc_close(filp);
-	fput(filp);
+	__fput_sync(filp);
 
 	spin_lock(&nn->nfsd_ssc_lock);
 	list_del(&nsui->nsui_list);
diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
index 3509e73abe1f..f8f0112fd9f5 100644
--- a/fs/nfsd/nfs4recover.c
+++ b/fs/nfsd/nfs4recover.c
@@ -561,7 +561,7 @@ nfsd4_shutdown_recdir(struct net *net)
 
 	if (!nn->rec_file)
 		return;
-	fput(nn->rec_file);
+	__fput_sync(nn->rec_file);
 	nn->rec_file = NULL;
 }
 
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index fbbea7498f02..15a811229211 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -879,7 +879,7 @@ __nfsd_open(struct svc_rqst *rqstp, struct svc_fh *fhp, umode_t type,
 
 	host_err = ima_file_check(file, may_flags);
 	if (host_err) {
-		fput(file);
+		__fput_sync(file);
 		goto out;
 	}
 
@@ -1884,10 +1884,10 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
 	fh_drop_write(ffhp);
 
 	/*
-	 * If the target dentry has cached open files, then we need to try to
-	 * close them prior to doing the rename. Flushing delayed fput
-	 * shouldn't be done with locks held however, so we delay it until this
-	 * point and then reattempt the whole shebang.
+	 * If the target dentry has cached open files, then we need to
+	 * try to close them prior to doing the rename. Final fput
+	 * shouldn't be done with locks held however, so we delay it
+	 * until this point and then reattempt the whole shebang.
 	 */
 	if (close_cached) {
 		close_cached = false;
@@ -2141,7 +2141,7 @@ nfsd_readdir(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t *offsetp,
 	if (err == nfserr_eof || err == nfserr_toosmall)
 		err = nfs_ok; /* can still be found in ->err */
 out_close:
-	fput(file);
+	__fput_sync(file);
 out:
 	return err;
 }