Message ID | 20230519074047.1739879-4-dhowells@redhat.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:b0ea:0:b0:3b6:4342:cba0 with SMTP id b10csp1054857vqo; Fri, 19 May 2023 00:43:31 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ7xDeaNjxuZCYvzd25DvcgJoYb4Fbp2YOXN/6cAaWCpgefj09XFAD7qgVmtqk+1+7hwVEPt X-Received: by 2002:a17:902:f547:b0:1ac:9890:1c49 with SMTP id h7-20020a170902f54700b001ac98901c49mr2581876plf.15.1684482210839; Fri, 19 May 2023 00:43:30 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1684482210; cv=none; d=google.com; s=arc-20160816; b=kaTvMclPm9dp/UqJPr8UUYh/xVQxBAwhc8/rTmrrdBlcD/1z0wLGBg6dbdHYy4vM4Z DsCr5TpdUHYj8DYzQMuetc2yy+4utVfzlfqYHlHka7OeHYla7VJ2wi2iP0jr2v4RhGlR 7gV5N8EeBlU85p21tysCI3HKKaPvQYCW9i+KtCRtj+1eY9hfhwpfkPOJB3ppGl8c8YPb jU0Qif+Mg/lmhGIrdJregHgh4l+5p4mjnxQLOH3Xz2J0vgQqj1S3IONlh/F83NDU8TCu JMU00fasB55DkmsogCBY7fTLNmRI7ey0+uZZY+ZeUz83XWLrjgqoFpB3MIOXOR5Agv5O 4ECg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=HJSls9joKnzamhDgZoRi/AKfts16br7tyBuA30GIBac=; b=qvjFYEQaboWL9Vdpcaad3r+N2bK7ylzDB3tIM4wpOKOw1AOTLy42JhaDWAlO2fSuj5 eC/BnRDGDV128rAhrw9I16svH5XqICj8ABP36OVD6vLsn1HivwXJHQcaB0XHX8f7pI/M hQK33Ky2koJGw+4aGrjHsSZSpoJeL5CYmoHPln00JCFBDcypCff8sshNQowy5wKhHspq HskNeAVRlu4sf/k3/rt+gXSQIBC0NTG1q0TT9Q96mLjNe6Dt5fbuDfFCH19EVh3MFvvF Ai6hA0zyBYSMGhk54l8cRo/eDAOmbZiqPXZGMCpQG9ehCYSht78TWxaD51xv+oyaNtnQ qpmA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=QFYDajIt; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id iw2-20020a170903044200b001a63b045c86si3087780plb.10.2023.05.19.00.43.18; Fri, 19 May 2023 00:43:30 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=QFYDajIt; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230319AbjESHmO (ORCPT <rfc822;cscallsign@gmail.com> + 99 others); Fri, 19 May 2023 03:42:14 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59100 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230317AbjESHmI (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Fri, 19 May 2023 03:42:08 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 02BA71AC for <linux-kernel@vger.kernel.org>; Fri, 19 May 2023 00:41:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1684482077; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=HJSls9joKnzamhDgZoRi/AKfts16br7tyBuA30GIBac=; b=QFYDajIts4baAcveTWbNX5MjBC+5bl+sLftufvPkOoCN3vvQJvhGzuJ8SFC43xwUrwSAUP Ii83EIhRLxBuwbFoqRRBUtBZA4SFt8MG/r22T+3F9xoBlevmm0zeodyticrH0U8X2n6A4X x3peDFK955G7cdwz6rqh5wfWUslFxHY= Received: from mimecast-mx02.redhat.com (mx3-rdu2.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-662-unilBiGRMYCJnhOgY90jPw-1; Fri, 19 May 2023 03:41:13 -0400 X-MC-Unique: unilBiGRMYCJnhOgY90jPw-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id B0BCD29AB3E8; Fri, 19 May 2023 07:41:12 +0000 (UTC) Received: from warthog.procyon.org.uk (unknown [10.42.28.221]) by smtp.corp.redhat.com (Postfix) with ESMTP id 894042166B28; Fri, 19 May 2023 07:41:10 +0000 (UTC) From: David Howells <dhowells@redhat.com> To: Jens Axboe <axboe@kernel.dk>, Al Viro <viro@zeniv.linux.org.uk>, Christoph Hellwig <hch@infradead.org> Cc: David Howells <dhowells@redhat.com>, Matthew Wilcox <willy@infradead.org>, Jan Kara <jack@suse.cz>, Jeff Layton <jlayton@kernel.org>, David Hildenbrand <david@redhat.com>, Jason Gunthorpe <jgg@nvidia.com>, Logan Gunthorpe <logang@deltatee.com>, Hillf Danton <hdanton@sina.com>, Christian Brauner <brauner@kernel.org>, Linus Torvalds <torvalds@linux-foundation.org>, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Christoph Hellwig <hch@lst.de> Subject: [PATCH v20 03/32] splice: Make direct_read_splice() limit to eof where appropriate Date: Fri, 19 May 2023 08:40:18 +0100 Message-Id: <20230519074047.1739879-4-dhowells@redhat.com> In-Reply-To: <20230519074047.1739879-1-dhowells@redhat.com> References: <20230519074047.1739879-1-dhowells@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Scanned-By: MIMEDefang 3.1 on 10.11.54.6 X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_NONE,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1766307618824559176?= X-GMAIL-MSGID: =?utf-8?q?1766307618824559176?= |
Series |
splice, block: Use page pinning and kill ITER_PIPE
|
|
Commit Message
David Howells
May 19, 2023, 7:40 a.m. UTC
Make direct_read_splice() limit the read to the end of the file for regular
files and block devices, thereby reducing the amount of allocation it will
do in such a case.
This means that the blockdev code doesn't require any special handling as
filemap_read_splice() also limits to i_size.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---
fs/splice.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
Comments
On Fri, May 19, 2023 at 08:40:18AM +0100, David Howells wrote: > Make direct_read_splice() limit the read to the end of the file for regular > files and block devices, thereby reducing the amount of allocation it will > do in such a case. > > This means that the blockdev code doesn't require any special handling as > filemap_read_splice() also limits to i_size. I'm really not sure if this is a good idea. Right now direct_read_splice (which also appears a little misnamed) really is a splice by calling ->read_iter helper. I we don't do any of this validtion we can just call it directly from splice.c instead of calling into ->splice_read for direct I/O and DAX and remove a ton of boilerplate code. How often do we even call into splice beyond i_size and for how much? > + if (S_ISREG(file_inode(in)->i_mode) || > + S_ISBLK(file_inode(in)->i_mode)) { Btw, having these kinds of checks in supposedly generic code is always a marked for a bit of a layering problem.
Christoph Hellwig <hch@infradead.org> wrote: > I'm really not sure if this is a good idea. Right now > direct_read_splice (which also appears a little misnamed) really is > a splice by calling ->read_iter helper. It can be renamed if you want a different name. copy_splice_read() or copy_splice_read_iter()? > I we don't do any of this validtion we can just call it directly from > splice.c instead of calling into ->splice_read for direct I/O and DAX and > remove a ton of boilerplate code. There's a couple of places where we might not want to do that - at least for non-DAX. shmem and f2fs for example. f2fs calls back to buffered reading under some circumstances. shmem ignores O_DIRECT and always splices from the pagecache. David
On Fri, May 19, 2023 at 09:43:34AM +0100, David Howells wrote: > > direct_read_splice (which also appears a little misnamed) really is > > a splice by calling ->read_iter helper. > > It can be renamed if you want a different name. copy_splice_read() or > copy_splice_read_iter()? Maybe something like that, yes. > > > I we don't do any of this validtion we can just call it directly from > > splice.c instead of calling into ->splice_read for direct I/O and DAX and > > remove a ton of boilerplate code. > > There's a couple of places where we might not want to do that - at least for > non-DAX. shmem and f2fs for example. f2fs calls back to buffered reading > under some circumstances. shmem ignores O_DIRECT and always splices from the > pagecache. So? even if ->read_iter does buffered I/O for O_DIRECT it will still work. This can in fact happen for many other file systems due when they fall back to buffeed I/O due to various reasons.
On Fri, May 19, 2023 at 12:41 AM David Howells <dhowells@redhat.com> wrote: > > + > + if (S_ISREG(file_inode(in)->i_mode) || > + S_ISBLK(file_inode(in)->i_mode)) { This really feels fundamentally wrong to me. If block and regular files have this limit, they should have their own splice_read() function that implements that limit. Not make everybody else check it. IOW, this should be a separate function ("block_splice_read()" or whatever), not inside a generic function that other users use. The zero size checking looks fine, although I wondered about that too. Some special files do traditionally have special meanings for zero-sized reads (as in "packet boundary"). But I suspect that isn't an issue for splice, and perhaps more importantly, I think the same rule should be in place: special files that want special rules shouldn't be using this generic function directly then. Linus
Linus Torvalds <torvalds@linux-foundation.org> wrote: > > + if (S_ISREG(file_inode(in)->i_mode) || > > + S_ISBLK(file_inode(in)->i_mode)) { > > This really feels fundamentally wrong to me. > > If block and regular files have this limit, they should have their own > splice_read() function that implements that limit. > > Not make everybody else check it. > > IOW, this should be a separate function ("block_splice_read()" or > whatever), not inside a generic function that other users use. This is just an optimisation to cut down the amount of bufferage allocated, so I could just drop it and leave it to userspace for now as the filesystem/block layer will stop anyway if it hits the EOF. Christoph would prefer that I call direct_splice_read() from generic_file_splice_read() in all O_DIRECT cases, if that's fine with you. David
On Fri, May 19, 2023 at 9:48 AM David Howells <dhowells@redhat.com> wrote: > > This is just an optimisation to cut down the amount of bufferage allocated So the thing is, it's actually very very wrong for some files. Now, admittedly, those files have other issues too, and it's a design mistake to begin with, but look at a number of files in /proc. In particular, look at the regular files that have a size of '0'. It's quite common indeed. Things like /proc/cpuinfo /proc/stat ... you can find a ton of them with find /proc -type f -size 0 Is it horribly wrong and bad? Yes. I hate it. It means that some really basic user space tools refuse to work on them, and the tools are 100% right - this is a kernel misfeature. Trying to do things like less -S /proc/cpuinfo may or may not work depending on your version of 'less', for example, because it's entirely reasonable to do something like fd = open(..); if (!fstat(fd, &st)) len = st.st_size; and limit your reads to the size of the file - exactly like your patch does. Except it fails horribly on those broken /proc files. I hate it, and I blame myself for the above horror, but it's pretty much unfixable. We could make them look like named pipes or something, but that's really ugly and probably would break other things anyway. And we simply don't know the size ahead of time. Now, *most* things work, because they just do the whole "read until EOF". In fact, my current version of 'less' has no problem at all doing the above thing, and gives the "expected" output. Also, honestly, I really don't think that it's necessarily a good idea to splice /proc files, but we actually do have splice wired up to these because people asked for it: fe33850ff798 ("proc: wire up generic_file_splice_read for iter ops") 4bd6a7353ee1 ("sysctl: Convert to iter interfaces") so I suspect those things do exist. > I could just drop it and leave it to userspace for now as the filesystem/block > layer will stop anyway if it hits the EOF. Christoph would prefer that I call > direct_splice_read() from generic_file_splice_read() in all O_DIRECT cases, if > that's fine with you. I guess that's fine, and for O_DIRECT itself it might even make sense to do the size test. That said, I doubt it matters: if you use O_DIRECT on a small file, you only have yourself to blame for doing something stupid. And if it isn't a small file, then who cares about some small EOF-time optimization? Nobody. So I would suggest not doing that optimization at all, because as-is, it's either pointless or actively broken. That said, I would *not* hate some kind of special FMODE_SIZELIMIT flag that allows filesystems to opt in to "limit reads to size". We already have flags like that: FMODE_UNSIGNED_OFFSET and 'sb->s_maxbytes' are both basically variations on that same theme, and having another flag to say "limit reads to i_size" wouldn't be wrong. It's only wrong when it is done mindlessly with S_ISREG(). Linus
Okay. Let's go with that. So I have to put the handling in vfs_splice_read(): long vfs_splice_read(struct file *in, loff_t *ppos, struct pipe_inode_info *pipe, size_t len, unsigned int flags) { ... if (unlikely(!in->f_op->splice_read)) return warn_unsupported(in, "read"); /* * O_DIRECT and DAX don't deal with the pagecache, so we * allocate a buffer, copy into it and splice that into the pipe. */ if ((in->f_flags & O_DIRECT) || IS_DAX(in->f_mapping->host)) return copy_splice_read(in, ppos, pipe, len, flags); return in->f_op->splice_read(in, ppos, pipe, len, flags); } which leaves very little in generic_file_splice_read: ssize_t generic_file_splice_read(struct file *in, loff_t *ppos, struct pipe_inode_info *pipe, size_t len, unsigned int flags) { if (unlikely(*ppos >= in->f_mapping->host->i_sb->s_maxbytes)) return 0; if (unlikely(!len)) return 0; return filemap_splice_read(in, ppos, pipe, len, flags); } so I wonder if the tests in generic_file_splice_read() can be folded into vfs_splice_read(), pointers to generic_file_splice_read() be replaced with pointers to filemap_splice_read() and generic_file_splice_read() just be removed. I suspect we can't quite do this because of the *ppos check - but I wonder if that's actually necessary since filemap_splice_read() checks against i_size... or if the check can be moved there if we definitely want to do it. Certainly, the zero-length check can be done in vfs_splice_read(). David
On Fri, May 19, 2023 at 11:27:51PM +0100, David Howells wrote: > ssize_t generic_file_splice_read(struct file *in, loff_t *ppos, > struct pipe_inode_info *pipe, size_t len, > unsigned int flags) > { > if (unlikely(*ppos >= in->f_mapping->host->i_sb->s_maxbytes)) > return 0; > if (unlikely(!len)) > return 0; > return filemap_splice_read(in, ppos, pipe, len, flags); > } > > so I wonder if the tests in generic_file_splice_read() can be folded into > vfs_splice_read(), pointers to generic_file_splice_read() be replaced with > pointers to filemap_splice_read() and generic_file_splice_read() just be > removed. > > I suspect we can't quite do this because of the *ppos check - but I wonder if > that's actually necessary since filemap_splice_read() checks against > i_size... or if the check can be moved there if we definitely want to do it. > > Certainly, the zero-length check can be done in vfs_splice_read(). The zero length check makes sense in vfs_splice_read. The ppos check I think makes sense for filemap_splice_read - after all we're dealing with the page cache here, and the page cache needs a working maxbytes and i_size. What callers of filemap_splice_read that don't have the checks do you have in your tree right now and how did you end up with them?
diff --git a/fs/splice.c b/fs/splice.c index 4db3eee49423..89c8516554d1 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -315,6 +315,19 @@ ssize_t direct_splice_read(struct file *in, loff_t *ppos, size_t used, npages, chunk, remain, keep = 0; int i; + if (!len) + return 0; + + if (S_ISREG(file_inode(in)->i_mode) || + S_ISBLK(file_inode(in)->i_mode)) { + loff_t i_size = i_size_read(in->f_mapping->host); + + if (*ppos >= i_size) + return 0; + if (len > i_size - *ppos) + len = i_size - *ppos; + } + /* Work out how much data we can actually add into the pipe */ used = pipe_occupancy(pipe->head, pipe->tail); npages = max_t(ssize_t, pipe->max_usage - used, 0);