[v20,03/32] splice: Make direct_read_splice() limit to eof where appropriate

Message ID 20230519074047.1739879-4-dhowells@redhat.com
State New
Headers
Series splice, block: Use page pinning and kill ITER_PIPE |

Commit Message

David Howells May 19, 2023, 7:40 a.m. UTC
  Make direct_read_splice() limit the read to the end of the file for regular
files and block devices, thereby reducing the amount of allocation it will
do in such a case.

This means that the blockdev code doesn't require any special handling as
filemap_read_splice() also limits to i_size.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---
 fs/splice.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)
  

Comments

Christoph Hellwig May 19, 2023, 8:09 a.m. UTC | #1
On Fri, May 19, 2023 at 08:40:18AM +0100, David Howells wrote:
> Make direct_read_splice() limit the read to the end of the file for regular
> files and block devices, thereby reducing the amount of allocation it will
> do in such a case.
> 
> This means that the blockdev code doesn't require any special handling as
> filemap_read_splice() also limits to i_size.

I'm really not sure if this is a good idea.  Right now
direct_read_splice (which also appears a little misnamed) really is
a splice by calling ->read_iter helper.  I we don't do any
of this validtion we can just call it directly from splice.c instead
of calling into ->splice_read for direct I/O and DAX and remove a ton
of boilerplate code.

How often do we even call into splice beyond i_size and for how much?

> +	if (S_ISREG(file_inode(in)->i_mode) ||
> +	    S_ISBLK(file_inode(in)->i_mode)) {

Btw, having these kinds of checks in supposedly generic code is always
a marked for a bit of a layering problem.
  
David Howells May 19, 2023, 8:43 a.m. UTC | #2
Christoph Hellwig <hch@infradead.org> wrote:

> I'm really not sure if this is a good idea.  Right now
> direct_read_splice (which also appears a little misnamed) really is
> a splice by calling ->read_iter helper.

It can be renamed if you want a different name.  copy_splice_read() or
copy_splice_read_iter()?

> I we don't do any of this validtion we can just call it directly from
> splice.c instead of calling into ->splice_read for direct I/O and DAX and
> remove a ton of boilerplate code.

There's a couple of places where we might not want to do that - at least for
non-DAX.  shmem and f2fs for example.  f2fs calls back to buffered reading
under some circumstances.  shmem ignores O_DIRECT and always splices from the
pagecache.

David
  
Christoph Hellwig May 19, 2023, 8:47 a.m. UTC | #3
On Fri, May 19, 2023 at 09:43:34AM +0100, David Howells wrote:
> > direct_read_splice (which also appears a little misnamed) really is
> > a splice by calling ->read_iter helper.
> 
> It can be renamed if you want a different name.  copy_splice_read() or
> copy_splice_read_iter()?

Maybe something like that, yes.

> 
> > I we don't do any of this validtion we can just call it directly from
> > splice.c instead of calling into ->splice_read for direct I/O and DAX and
> > remove a ton of boilerplate code.
> 
> There's a couple of places where we might not want to do that - at least for
> non-DAX.  shmem and f2fs for example.  f2fs calls back to buffered reading
> under some circumstances.  shmem ignores O_DIRECT and always splices from the
> pagecache.

So?  even if ->read_iter does buffered I/O for O_DIRECT it will still
work.  This can in fact happen for many other file systems due when
they fall back to buffeed I/O due to various reasons.
  
Linus Torvalds May 19, 2023, 4:31 p.m. UTC | #4
On Fri, May 19, 2023 at 12:41 AM David Howells <dhowells@redhat.com> wrote:
>
> +
> +       if (S_ISREG(file_inode(in)->i_mode) ||
> +           S_ISBLK(file_inode(in)->i_mode)) {

This really feels fundamentally wrong to me.

If block and regular files have this limit, they should have their own
splice_read() function that implements that limit.

Not make everybody else check it.

IOW, this should be a separate function ("block_splice_read()" or
whatever), not inside a generic function that other users use.

The zero size checking looks fine, although I wondered about that too.
Some special files do traditionally have special meanings for
zero-sized reads (as in "packet boundary"). But I suspect that isn't
an issue for splice, and perhaps more importantly, I think the same
rule should be in place: special files that want special rules
shouldn't be using this generic function directly then.

                 Linus
  
David Howells May 19, 2023, 4:47 p.m. UTC | #5
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > +       if (S_ISREG(file_inode(in)->i_mode) ||
> > +           S_ISBLK(file_inode(in)->i_mode)) {
> 
> This really feels fundamentally wrong to me.
> 
> If block and regular files have this limit, they should have their own
> splice_read() function that implements that limit.
> 
> Not make everybody else check it.
> 
> IOW, this should be a separate function ("block_splice_read()" or
> whatever), not inside a generic function that other users use.

This is just an optimisation to cut down the amount of bufferage allocated, so
I could just drop it and leave it to userspace for now as the filesystem/block
layer will stop anyway if it hits the EOF.  Christoph would prefer that I call
direct_splice_read() from generic_file_splice_read() in all O_DIRECT cases, if
that's fine with you.

David
  
Linus Torvalds May 19, 2023, 5:37 p.m. UTC | #6
On Fri, May 19, 2023 at 9:48 AM David Howells <dhowells@redhat.com> wrote:
>
> This is just an optimisation to cut down the amount of bufferage allocated

So the thing is, it's actually very very wrong for some files.

Now, admittedly, those files have other issues too, and it's a design
mistake to begin with, but look at a number of files in /proc.

In particular, look at the regular files that have a size of '0'. It's
quite common indeed. Things like

    /proc/cpuinfo
    /proc/stat
    ...

you can find a ton of them with

    find /proc -type f -size 0

Is it horribly wrong and bad? Yes. I hate it. It means that some
really basic user space tools refuse to work on them, and the tools
are 100% right - this is a kernel misfeature. Trying to do things like

    less -S /proc/cpuinfo

may or may not work depending on your version of 'less', for example,
because it's entirely reasonable to do something like

    fd = open(..);
    if (!fstat(fd, &st))
         len = st.st_size;

and limit your reads to the size of the file - exactly like your patch does.

Except it fails horribly on those broken /proc files.

I hate it, and I blame myself for the above horror, but it's pretty
much unfixable. We could make them look like named pipes or something,
but that's really ugly and probably would break other things anyway.
And we simply don't know the size ahead of time.

Now, *most* things work, because they just do the whole "read until
EOF". In fact, my current version of 'less' has no problem at all
doing the above thing, and gives the "expected" output.

Also, honestly, I really don't think that it's necessarily a good idea
to splice /proc files, but we actually do have splice wired up to
these because people asked for it:

    fe33850ff798 ("proc: wire up generic_file_splice_read for iter ops")
    4bd6a7353ee1 ("sysctl: Convert to iter interfaces")

so I suspect those things do exist.

> I could just drop it and leave it to userspace for now as the filesystem/block
> layer will stop anyway if it hits the EOF.  Christoph would prefer that I call
> direct_splice_read() from generic_file_splice_read() in all O_DIRECT cases, if
> that's fine with you.

I guess that's fine, and for O_DIRECT itself it might even make sense
to do the size test. That said, I doubt it matters: if you use
O_DIRECT on a small file, you only have yourself to blame for doing
something stupid.

And if it isn't a small file, then who cares about some small EOF-time
optimization? Nobody.

So I would suggest not doing that optimization at all, because as-is,
it's either pointless or actively broken.

That said, I would *not* hate some kind of special FMODE_SIZELIMIT
flag that allows filesystems to opt in to "limit reads to size".

We already have flags like that: FMODE_UNSIGNED_OFFSET and
'sb->s_maxbytes' are both basically variations on that same theme, and
having another flag to say "limit reads to i_size" wouldn't be wrong.

It's only wrong when it is done mindlessly with S_ISREG().

             Linus
  
David Howells May 19, 2023, 10:27 p.m. UTC | #7
Okay.  Let's go with that.  So I have to put the handling in vfs_splice_read():

	long vfs_splice_read(struct file *in, loff_t *ppos,
			     struct pipe_inode_info *pipe, size_t len,
			     unsigned int flags)
	{
	...
		if (unlikely(!in->f_op->splice_read))
			return warn_unsupported(in, "read");
		/*
		 * O_DIRECT and DAX don't deal with the pagecache, so we
		 * allocate a buffer, copy into it and splice that into the pipe.
		 */
		if ((in->f_flags & O_DIRECT) || IS_DAX(in->f_mapping->host))
			return copy_splice_read(in, ppos, pipe, len, flags);
		return in->f_op->splice_read(in, ppos, pipe, len, flags);
	}

which leaves very little in generic_file_splice_read:

	ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
					 struct pipe_inode_info *pipe, size_t len,
					 unsigned int flags)
	{
		if (unlikely(*ppos >= in->f_mapping->host->i_sb->s_maxbytes))
			return 0;
		if (unlikely(!len))
			return 0;
		return filemap_splice_read(in, ppos, pipe, len, flags);
	}

so I wonder if the tests in generic_file_splice_read() can be folded into
vfs_splice_read(), pointers to generic_file_splice_read() be replaced with
pointers to filemap_splice_read() and generic_file_splice_read() just be
removed.

I suspect we can't quite do this because of the *ppos check - but I wonder if
that's actually necessary since filemap_splice_read() checks against
i_size... or if the check can be moved there if we definitely want to do it.

Certainly, the zero-length check can be done in vfs_splice_read().

David
  
Christoph Hellwig May 20, 2023, 3:54 a.m. UTC | #8
On Fri, May 19, 2023 at 11:27:51PM +0100, David Howells wrote:
> 	ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
> 					 struct pipe_inode_info *pipe, size_t len,
> 					 unsigned int flags)
> 	{
> 		if (unlikely(*ppos >= in->f_mapping->host->i_sb->s_maxbytes))
> 			return 0;
> 		if (unlikely(!len))
> 			return 0;
> 		return filemap_splice_read(in, ppos, pipe, len, flags);
> 	}
> 
> so I wonder if the tests in generic_file_splice_read() can be folded into
> vfs_splice_read(), pointers to generic_file_splice_read() be replaced with
> pointers to filemap_splice_read() and generic_file_splice_read() just be
> removed.
> 
> I suspect we can't quite do this because of the *ppos check - but I wonder if
> that's actually necessary since filemap_splice_read() checks against
> i_size... or if the check can be moved there if we definitely want to do it.
> 
> Certainly, the zero-length check can be done in vfs_splice_read().

The zero length check makes sense in vfs_splice_read.  The ppos check
I think makes sense for filemap_splice_read - after all we're dealing
with the page cache here, and the page cache needs a working maxbytes
and i_size.  What callers of filemap_splice_read that don't have the
checks do you have in your tree right now and how did you end up with
them?
  

Patch

diff --git a/fs/splice.c b/fs/splice.c
index 4db3eee49423..89c8516554d1 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -315,6 +315,19 @@  ssize_t direct_splice_read(struct file *in, loff_t *ppos,
 	size_t used, npages, chunk, remain, keep = 0;
 	int i;
 
+	if (!len)
+		return 0;
+
+	if (S_ISREG(file_inode(in)->i_mode) ||
+	    S_ISBLK(file_inode(in)->i_mode)) {
+		loff_t i_size = i_size_read(in->f_mapping->host);
+
+		if (*ppos >= i_size)
+			return 0;
+		if (len > i_size - *ppos)
+			len = i_size - *ppos;
+	}
+
 	/* Work out how much data we can actually add into the pipe */
 	used = pipe_occupancy(pipe->head, pipe->tail);
 	npages = max_t(ssize_t, pipe->max_usage - used, 0);