[v12,01/10] vfs, iomap: Fix generic_file_splice_read() to avoid reversion of ITER_PIPE
Message ID | 20230207171305.3716974-2-dhowells@redhat.com |
---|---|
State | New |
Headers |
From: David Howells <dhowells@redhat.com>
To: Jens Axboe <axboe@kernel.dk>, Al Viro <viro@zeniv.linux.org.uk>, Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>, Matthew Wilcox <willy@infradead.org>, Jan Kara <jack@suse.cz>, Jeff Layton <jlayton@kernel.org>, David Hildenbrand <david@redhat.com>, Jason Gunthorpe <jgg@nvidia.com>, Logan Gunthorpe <logang@deltatee.com>, Hillf Danton <hdanton@sina.com>, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, syzbot+a440341a59e3b7142895@syzkaller.appspotmail.com, Christoph Hellwig <hch@lst.de>, John Hubbard <jhubbard@nvidia.com>
Subject: [PATCH v12 01/10] vfs, iomap: Fix generic_file_splice_read() to avoid reversion of ITER_PIPE
Date: Tue, 7 Feb 2023 17:12:56 +0000
Message-Id: <20230207171305.3716974-2-dhowells@redhat.com>
In-Reply-To: <20230207171305.3716974-1-dhowells@redhat.com>
References: <20230207171305.3716974-1-dhowells@redhat.com> |
Series | iov_iter: Improve page extraction (pin or just list) |
Commit Message
David Howells
Feb. 7, 2023, 5:12 p.m. UTC
With the new iov_iter_extract_pages() function (added in a later patch),
pages extracted from a non-user-backed iterator, such as ITER_PIPE, aren't
pinned. __iomap_dio_rw(), however, calls iov_iter_revert() to shorten the
iterator to just the data it is going to use - which causes the pipe
buffers to be freed, even though they're attached to a bio and may get
written to by DMA (thanks to Hillf Danton for spotting this[1]).
This then causes massive memory corruption that is particularly noticeable
when the syzbot test[2] is run. The test boils down to:
out = creat(argv[1], 0666);
ftruncate(out, 0x800);
lseek(out, 0x200, SEEK_SET);
in = open(argv[1], O_RDONLY | O_DIRECT | O_NOFOLLOW);
sendfile(out, in, NULL, 0x1dd00);
run repeatedly in parallel. What I think is happening is that ftruncate()
occasionally shortens the DIO read that's about to be made by sendfile's
splice core by reducing i_size.
Fix this by replacing the use of an ITER_PIPE iterator with an ITER_BVEC
iterator for which reversion won't free the buffers. Bulk allocate all the
buffers we think we're going to use in advance, do the read synchronously
and only then trim the buffer down. The pages we did use get pushed into
the pipe.
This is more efficient by virtue of doing a bulk page allocation, but
slightly less efficient by ignoring any partial page in the pipe.
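The allocate-up-front-then-trim scheme can be illustrated with a userspace analogue (illustration only, not the kernel code; struct and function names here are made up): allocate every buffer the read might need, do one synchronous read, then free only the buffers the possibly-short read never touched, so trimming can no longer discard data that was actually produced.

```c
#include <stdlib.h>
#include <string.h>

#define CHUNK 4096	/* stand-in for PAGE_SIZE */

struct bulk_read {
	void **bufs;
	size_t nbufs;	/* buffers still holding valid data after the trim */
	size_t total;	/* bytes of valid data */
};

static int bulk_read_do(struct bulk_read *br, size_t want,
			size_t (*read_fn)(void **bufs, size_t nbufs))
{
	size_t n = (want + CHUNK - 1) / CHUNK, i;

	/* Bulk-allocate the worst case up front. */
	br->bufs = calloc(n, sizeof(*br->bufs));
	if (!br->bufs)
		return -1;
	for (i = 0; i < n; i++) {
		br->bufs[i] = malloc(CHUNK);
		if (!br->bufs[i])
			return -1;
	}

	/* Synchronous "I/O" into the preallocated buffers; may be short. */
	br->total = read_fn(br->bufs, n);

	/* Trim: release only whole buffers beyond the data actually read. */
	br->nbufs = (br->total + CHUNK - 1) / CHUNK;
	for (i = br->nbufs; i < n; i++) {
		free(br->bufs[i]);
		br->bufs[i] = NULL;
	}
	return 0;
}

/* A fake reader that produces 6000 bytes, i.e. one and a half chunks. */
static size_t fake_read(void **bufs, size_t nbufs)
{
	size_t want = 6000, off = 0, i;

	for (i = 0; i < nbufs && off < want; i++) {
		size_t part = want - off > CHUNK ? CHUNK : want - off;
		memset(bufs[i], 'x', part);
		off += part;
	}
	return off;
}
```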
Note that this removes the only user of ITER_PIPE.
Fixes: 920756a3306a ("block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages")
Reported-by: syzbot+a440341a59e3b7142895@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20230207094731.1390-1-hdanton@sina.com/ [1]
Link: https://lore.kernel.org/r/000000000000b0b3c005f3a09383@google.com/ [2]
---
fs/splice.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 68 insertions(+), 8 deletions(-)
Comments
Subject nitpick: this does not touch iomap at all.

> Fix this by replacing the use of an ITER_PIPE iterator with an ITER_BVEC
> iterator for which reversion won't free the buffers. Bulk allocate all the
> buffers we think we're going to use in advance, do the read synchronously
> and only then trim the buffer down. The pages we did use get pushed into
> the pipe.
>
> This is more efficient by virtue of doing a bulk page allocation, but
> slightly less efficient by ignoring any partial page in the pipe.

For the usual case of a buffered read into the iter, this completely changes the semantics:

 - before, the pagecache pages just grew a new reference and were added to
   the pipe buffer, and I/O was done without an extra copy
 - with this patch you always allocate an extra set of pages for the pipe
   and copy into them

So I very much doubt this makes anything more efficient, and I don't think we can just do it. We'll have to fix reverting of pipe buffers, just as I already pointed out in your cifs series that tries to play the same game.
Christoph Hellwig <hch@infradead.org> wrote:

> We'll have to fix reverting of pipe buffers, just as I already pointed
> out in your cifs series that tries to play the same game.

Fixing ITER_PIPE reversion isn't very straightforward, unfortunately. It's possible for a partial direct I/O read to use reversion to trim the iterator (thereby discarding anon pages from a pipe) and then fall back to reading the rest by buffered I/O.

I suppose I could set a flag on the pipe_buffers indicating whether iov_iter_revert() is allowed to free them (if they were spliced in from the pagecache, then they must be freed by reverting them).

How about one of two different solutions?

 (1) Repurpose the function I proposed for generic_file_splice_read() but only
     for splicing from O_DIRECT files; reading from non-O_DIRECT files would
     use an ITER_PIPE as upstream.

 (2) Repurpose the function I proposed for generic_file_splice_read() but only
     for splicing from O_DIRECT files, as (1), but also replace the splice
     from a buffered file with something like the patch below. This uses
     filemap_get_pages() to do the reading and to get a bunch of folios from
     the pagecache that we can then splice into the pipe directly.

David
---
splice: Do splice read from a buffered file without using ITER_PIPE

Provide a function to do splice read from a buffered file, pulling the folios out of the pagecache directly by calling filemap_get_pages() to do any required reading and then pasting the returned folios into the pipe.

A helper function is provided to do the actual folio pasting and will handle multipage folios by splicing as many of the relevant subpages as will fit into the pipe.

The ITER_BVEC-based splicing previously added is then only used for splicing from O_DIRECT files.

The code is loosely based on filemap_read() and might belong in mm/filemap.c with that as it needs to use filemap_get_pages().
filemap_get_pages() and some of the functions it uses are changed to take the required byte count rather than an iterator (which was only being used for iter->count). This should be split into a separate file.

Signed-off-by: David Howells <dhowells@redhat.com>
---
 fs/splice.c             | 177 +++++++++++++++++++++++++++++++++++++++++++-----
 include/linux/pagemap.h |   2 
 mm/filemap.c            |  23 +++---
 3 files changed, 176 insertions(+), 26 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index fba93f4a4f9e..3ccecfa50eda 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -22,6 +22,7 @@
 #include <linux/fs.h>
 #include <linux/file.h>
 #include <linux/pagemap.h>
+#include <linux/pagevec.h>
 #include <linux/splice.h>
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h>
@@ -282,22 +283,13 @@ void splice_shrink_spd(struct splice_pipe_desc *spd)
 	kfree(spd->partial);
 }
 
-/**
- * generic_file_splice_read - splice data from file to a pipe
- * @in:		file to splice from
- * @ppos:	position in @in
- * @pipe:	pipe to splice to
- * @len:	number of bytes to splice
- * @flags:	splice modifier flags
- *
- * Description:
- *    Will read pages from given file and fill them into a pipe. Can be
- *    used as long as it has more or less sane ->read_iter().
- *
+/*
+ * Splice data from an O_DIRECT file into pages and then add them to the output
+ * pipe.
  */
-ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
-				 struct pipe_inode_info *pipe, size_t len,
-				 unsigned int flags)
+static ssize_t generic_file_direct_splice_read(struct file *in, loff_t *ppos,
+					       struct pipe_inode_info *pipe,
+					       size_t len, unsigned int flags)
 {
 	LIST_HEAD(pages);
 	struct iov_iter to;
@@ -383,6 +375,161 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 	kfree(bv);
 	return ret;
 }
+
+/*
+ * Splice subpages from a folio into a pipe.
+ */
+static size_t splice_folio_into_pipe(struct pipe_inode_info *pipe,
+				     struct folio *folio,
+				     loff_t fpos, size_t size)
+{
+	struct page *page;
+	size_t spliced = 0, offset = offset_in_folio(folio, fpos);
+
+	page = folio_page(folio, offset / PAGE_SIZE);
+	size = min(size, folio_size(folio) - offset);
+	offset %= PAGE_SIZE;
+
+	while (spliced < size &&
+	       !pipe_full(pipe->head, pipe->tail, pipe->max_usage)) {
+		struct pipe_buffer *buf = &pipe->bufs[pipe->head & (pipe->ring_size - 1)];
+		size_t part = min_t(size_t, PAGE_SIZE - offset, size - spliced);
+
+		*buf = (struct pipe_buffer) {
+			.ops	= &page_cache_pipe_buf_ops,
+			.page	= page,
+			.offset	= offset,
+			.len	= part,
+		};
+		folio_get(folio);
+		pipe->head++;
+		page++;
+		spliced += part;
+		offset = 0;
+	}
+
+	return spliced;
+}
+
+/*
+ * Splice folios from the pagecache of a buffered (ie. non-O_DIRECT) file into
+ * a pipe.
+ */
+static ssize_t generic_file_buffered_splice_read(struct file *in, loff_t *ppos,
+						 struct pipe_inode_info *pipe,
+						 size_t len,
+						 unsigned int flags)
+{
+	struct folio_batch fbatch;
+	size_t total_spliced = 0, used, npages;
+	loff_t isize, end_offset;
+	bool writably_mapped;
+	int i, error = 0;
+
+	struct kiocb iocb = {
+		.ki_filp = in,
+		.ki_pos	 = *ppos,
+	};
+
+	/* Work out how much data we can actually add into the pipe */
+	used = pipe_occupancy(pipe->head, pipe->tail);
+	npages = max_t(ssize_t, pipe->max_usage - used, 0);
+	len = min_t(size_t, len, npages * PAGE_SIZE);
+
+	folio_batch_init(&fbatch);
+
+	do {
+		cond_resched();
+
+		if (*ppos >= i_size_read(file_inode(in)))
+			break;
+
+		iocb.ki_pos = *ppos;
+		error = filemap_get_pages(&iocb, len, &fbatch);
+		if (error < 0)
+			break;
+
+		/*
+		 * i_size must be checked after we know the pages are Uptodate.
+		 *
+		 * Checking i_size after the check allows us to calculate
+		 * the correct value for "nr", which means the zero-filled
+		 * part of the page is not copied back to userspace (unless
+		 * another truncate extends the file - this is desired though).
+		 */
+		isize = i_size_read(file_inode(in));
+		if (unlikely(*ppos >= isize))
+			break;
+		end_offset = min_t(loff_t, isize, *ppos + len);
+
+		/*
+		 * Once we start copying data, we don't want to be touching any
+		 * cachelines that might be contended:
+		 */
+		writably_mapped = mapping_writably_mapped(in->f_mapping);
+
+		for (i = 0; i < folio_batch_count(&fbatch); i++) {
+			struct folio *folio = fbatch.folios[i];
+			size_t n;
+
+			if (folio_pos(folio) >= end_offset)
+				goto out;
+			folio_mark_accessed(folio);
+
+			/*
+			 * If users can be writing to this folio using arbitrary
+			 * virtual addresses, take care of potential aliasing
+			 * before reading the folio on the kernel side.
+			 */
+			if (writably_mapped)
+				flush_dcache_folio(folio);
+
+			n = splice_folio_into_pipe(pipe, folio, *ppos, len);
+			if (!n)
+				goto out;
+			len -= n;
+			total_spliced += n;
+			*ppos += n;
+			in->f_ra.prev_pos = *ppos;
+			if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
+				goto out;
+		}
+
+		folio_batch_release(&fbatch);
+	} while (len);
+
+out:
+	folio_batch_release(&fbatch);
+	file_accessed(in);
+
+	return total_spliced ? total_spliced : error;
+}
+
+/**
+ * generic_file_splice_read - splice data from file to a pipe
+ * @in:		file to splice from
+ * @ppos:	position in @in
+ * @pipe:	pipe to splice to
+ * @len:	number of bytes to splice
+ * @flags:	splice modifier flags
+ *
+ * Description:
+ *    Will read pages from given file and fill them into a pipe. Can be
+ *    used as long as it has more or less sane ->read_iter().
+ *
+ */
+ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
+				 struct pipe_inode_info *pipe, size_t len,
+				 unsigned int flags)
+{
+	if (unlikely(*ppos >= file_inode(in)->i_sb->s_maxbytes))
+		return 0;
+	if (unlikely(!len))
+		return 0;
+	if (in->f_flags & O_DIRECT)
+		return generic_file_direct_splice_read(in, ppos, pipe, len, flags);
+	return generic_file_buffered_splice_read(in, ppos, pipe, len, flags);
+}
 EXPORT_SYMBOL(generic_file_splice_read);
 
 const struct pipe_buf_operations default_pipe_buf_ops = {
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 29e1f9e76eb6..317fbc34849f 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -748,6 +748,8 @@ struct page *read_cache_page(struct address_space *, pgoff_t index,
 		filler_t *filler, struct file *file);
 extern struct page * read_cache_page_gfp(struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
+int filemap_get_pages(struct kiocb *iocb, size_t count,
+		      struct folio_batch *fbatch);
 
 static inline struct page *read_mapping_page(struct address_space *mapping,
 				pgoff_t index, struct file *file)
diff --git a/mm/filemap.c b/mm/filemap.c
index f72e4875bfcb..f8b28352b038 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2440,10 +2440,8 @@ static int filemap_read_folio(struct file *file, filler_t filler,
 }
 
 static bool filemap_range_uptodate(struct address_space *mapping,
-		loff_t pos, struct iov_iter *iter, struct folio *folio)
+		loff_t pos, size_t count, struct folio *folio)
 {
-	int count;
-
 	if (folio_test_uptodate(folio))
 		return true;
 	if (!mapping->a_ops->is_partially_uptodate)
@@ -2451,7 +2449,6 @@ static bool filemap_range_uptodate(struct address_space *mapping,
 	if (mapping->host->i_blkbits >= folio_shift(folio))
 		return false;
 
-	count = iter->count;
 	if (folio_pos(folio) > pos) {
 		count -= folio_pos(folio) - pos;
 		pos = 0;
@@ -2463,7 +2460,7 @@ static bool filemap_range_uptodate(struct address_space *mapping,
 }
 
 static int filemap_update_page(struct kiocb *iocb,
-		struct address_space *mapping, struct iov_iter *iter,
+		struct address_space *mapping, size_t count,
 		struct folio *folio)
 {
 	int error;
@@ -2498,7 +2495,7 @@ static int filemap_update_page(struct kiocb *iocb,
 		goto unlock;
 
 	error = 0;
-	if (filemap_range_uptodate(mapping, iocb->ki_pos, iter, folio))
+	if (filemap_range_uptodate(mapping, iocb->ki_pos, count, folio))
 		goto unlock;
 
 	error = -EAGAIN;
@@ -2574,8 +2571,12 @@ static int filemap_readahead(struct kiocb *iocb, struct file *file,
 	return 0;
 }
 
-static int filemap_get_pages(struct kiocb *iocb, struct iov_iter *iter,
-		struct folio_batch *fbatch)
+/*
+ * Extract some folios from the pagecache of a file, reading those pages from
+ * the backing store if necessary and waiting for them.
+ */
+int filemap_get_pages(struct kiocb *iocb, size_t count,
+		      struct folio_batch *fbatch)
 {
 	struct file *filp = iocb->ki_filp;
 	struct address_space *mapping = filp->f_mapping;
@@ -2585,7 +2586,7 @@ static int filemap_get_pages(struct kiocb *iocb, struct iov_iter *iter,
 	struct folio *folio;
 	int err = 0;
 
-	last_index = DIV_ROUND_UP(iocb->ki_pos + iter->count, PAGE_SIZE);
+	last_index = DIV_ROUND_UP(iocb->ki_pos + count, PAGE_SIZE);
 retry:
 	if (fatal_signal_pending(current))
 		return -EINTR;
@@ -2618,7 +2619,7 @@ static int filemap_get_pages(struct kiocb *iocb, struct iov_iter *iter,
 		if ((iocb->ki_flags & IOCB_WAITQ) &&
 		    folio_batch_count(fbatch) > 1)
 			iocb->ki_flags |= IOCB_NOWAIT;
-		err = filemap_update_page(iocb, mapping, iter, folio);
+		err = filemap_update_page(iocb, mapping, count, folio);
 		if (err)
 			goto err;
 	}
@@ -2688,7 +2689,7 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 		if (unlikely(iocb->ki_pos >= i_size_read(inode)))
 			break;
 
-		error = filemap_get_pages(iocb, iter, &fbatch);
+		error = filemap_get_pages(iocb, iov_iter_count(iter), &fbatch);
 		if (error < 0)
 			break;
On Wed, Feb 08, 2023 at 04:09:51PM +0000, David Howells wrote:

> How about one of two different solutions?
>
> (1) Repurpose the function I proposed for generic_file_splice_read() but only
>     for splicing from O_DIRECT files; reading from non-O_DIRECT files would
>     use an ITER_PIPE as upstream.

Given the amount of problems we had with O_DIRECT vs splice, and the fact that even doing this is a bit pointless, that seems sensible to me.

> for splicing from O_DIRECT files, as (1), but also replace the splice
> from a buffered file with something like the patch below. This uses
> filemap_get_pages() to do the reading and to get a bunch of folios from
> the pagecache that we can then splice into the pipe directly.

I definitely like the idea of killing ITER_PIPE. Isn't the 16 folios in a folio batch often much less than what we could fit into a single pipe buf? Unless you have a file system that can use huge folios for buffered I/O and actually does, this might significantly limit performance.

With that in mind I'll try to find some time to review your actual patch, but I'm a little busy at the moment.
On Wed, Feb 08, 2023 at 04:09:51PM +0000, David Howells wrote:

> @@ -2688,7 +2689,7 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
> 		if (unlikely(iocb->ki_pos >= i_size_read(inode)))
> 			break;
>
> -		error = filemap_get_pages(iocb, iter, &fbatch);
> +		error = filemap_get_pages(iocb, iov_iter_count(iter), &fbatch);

What's the point in iov_iter_count() anyway? It's more characters than iter->count, so why not use it directly?
Matthew Wilcox <willy@infradead.org> wrote:

> What's the point in iov_iter_count() anyway? It's more characters
> than iter->count, so why not use it directly?

I presume it got added for a reason and should be used in preference to direct access.

David
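For readers following the exchange: iov_iter_count() is just a one-line accessor over the count field, so the two spellings are equivalent and the wrapper merely insulates callers from the struct layout. The sketch below is a pared-down userspace mock (the real definitions live in include/linux/uio.h; all fields other than count are omitted here):

```c
#include <stddef.h>

/* Pared-down mock of the iterator; illustration only. */
struct iov_iter {
	size_t count;	/* bytes remaining in the iterator */
};

/* Thin accessor: returns the remaining byte count. */
static inline size_t iov_iter_count(const struct iov_iter *i)
{
	return i->count;
}
```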
Christoph Hellwig <hch@infradead.org> wrote:

> I definitely like the idea of killing ITER_PIPE. Isn't the 16
> folios in a folio batch often much less than what we could fit into
> a single pipe buf? Unless you have a file system that can use
> huge folios for buffered I/O and actually does, this might significantly
> limit performance.

There's a loop there that repeats the filemap_get_pages() until either the pipe is full or we hit EOF, the same as in filemap_read() (upon which this is based). I want to use filemap_get_pages() if I can, as that does all the readahead stuff.

What might be nice, though, is if I could tell it to return rather than waiting for a folio to come uptodate if it has already returned a folio, so that I can push the other side of the splice along whilst the read is in progress.

David
diff --git a/fs/splice.c b/fs/splice.c
index 5969b7a1d353..51778437f31f 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -295,24 +295,62 @@ void splice_shrink_spd(struct splice_pipe_desc *spd)
  *    used as long as it has more or less sane ->read_iter().
  *
  */
-ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
+ssize_t generic_file_splice_read(struct file *file, loff_t *ppos,
 				 struct pipe_inode_info *pipe, size_t len,
 				 unsigned int flags)
 {
+	LIST_HEAD(pages);
 	struct iov_iter to;
+	struct bio_vec *bv;
 	struct kiocb kiocb;
-	int ret;
+	struct page *page;
+	unsigned int head;
+	ssize_t ret;
+	size_t used, npages, chunk, remain, reclaim;
+	int i;
+
+	/* Work out how much data we can actually add into the pipe */
+	used = pipe_occupancy(pipe->head, pipe->tail);
+	npages = max_t(ssize_t, pipe->max_usage - used, 0);
+	len = min_t(size_t, len, npages * PAGE_SIZE);
+	npages = DIV_ROUND_UP(len, PAGE_SIZE);
+
+	bv = kmalloc(array_size(npages, sizeof(bv[0])), GFP_KERNEL);
+	if (!bv)
+		return -ENOMEM;
+
+	npages = alloc_pages_bulk_list(GFP_USER, npages, &pages);
+	if (!npages) {
+		kfree(bv);
+		return -ENOMEM;
+	}
 
-	iov_iter_pipe(&to, ITER_DEST, pipe, len);
-	init_sync_kiocb(&kiocb, in);
+	remain = len = min_t(size_t, len, npages * PAGE_SIZE);
+
+	for (i = 0; i < npages; i++) {
+		chunk = min_t(size_t, PAGE_SIZE, remain);
+		page = list_first_entry(&pages, struct page, lru);
+		list_del_init(&page->lru);
+		bv[i].bv_page = page;
+		bv[i].bv_offset = 0;
+		bv[i].bv_len = chunk;
+		remain -= chunk;
+	}
+
+	/* Do the I/O */
+	iov_iter_bvec(&to, ITER_DEST, bv, npages, len);
+	init_sync_kiocb(&kiocb, file);
 	kiocb.ki_pos = *ppos;
-	ret = call_read_iter(in, &kiocb, &to);
+	ret = call_read_iter(file, &kiocb, &to);
+
+	reclaim = npages * PAGE_SIZE;
+	remain = 0;
 	if (ret > 0) {
+		reclaim -= ret;
+		remain = ret;
 		*ppos = kiocb.ki_pos;
-		file_accessed(in);
+		file_accessed(file);
 	} else if (ret < 0) {
-		/* free what was emitted */
-		pipe_discard_from(pipe, to.start_head);
 		/*
 		 * callers of ->splice_read() expect -EAGAIN on
 		 * "can't put anything in there", rather than -EFAULT.
@@ -321,6 +359,28 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 		ret = -EAGAIN;
 	}
 
+	/* Free any pages that didn't get touched at all. */
+	for (; reclaim >= PAGE_SIZE; reclaim -= PAGE_SIZE)
+		__free_page(bv[--npages].bv_page);
+
+	/* Push the remaining pages into the pipe. */
+	head = pipe->head;
+	for (i = 0; i < npages; i++) {
+		struct pipe_buffer *buf = &pipe->bufs[head & (pipe->ring_size - 1)];
+
+		chunk = min_t(size_t, remain, PAGE_SIZE);
+		*buf = (struct pipe_buffer) {
+			.ops	= &default_pipe_buf_ops,
+			.page	= bv[i].bv_page,
+			.offset	= 0,
+			.len	= chunk,
+		};
+		head++;
+		remain -= chunk;
+	}
+	pipe->head = head;
+
+	kfree(bv);
 	return ret;
 }
 EXPORT_SYMBOL(generic_file_splice_read);