Message ID | 20221105090421.21237-1-mk@cm4all.com |
---|---|
State | New |
Series | fs/splice: don't block splice_direct_to_actor() after data was read |
Commit Message
Max Kellermann
Nov. 5, 2022, 9:04 a.m. UTC
From: Max Kellermann <max.kellermann@ionos.com>

If userspace calls sendfile() with a very large "count" parameter, the
kernel can block for a very long time until 2 GiB (0x7ffff000 bytes)
have been read from the hard disk and pushed into the socket buffer.

Usually, that is not a problem, because the socket write buffer gets
filled quickly, and if the socket is non-blocking, the last
direct_splice_actor() call will return -EAGAIN, causing
splice_direct_to_actor() to break from the loop, and sendfile() will
return a partial transfer.

However, if the network happens to be faster than the hard disk, and
the socket buffer keeps getting drained between two
generic_file_read_iter() calls, the sendfile() system call can keep
running for a long time, blocking for disk I/O over and over.

That is undesirable, because it can block the calling process for too
long.  I discovered a problem where nginx would block for so long that
it would drop the HTTP connection because the kernel had just
transferred 2 GiB in one call, and the HTTP socket was not writable
(EPOLLOUT) for more than 60 seconds, resulting in a timeout:

 sendfile(4, 12, [5518919528] => [5884939344], 1813448856) = 366019816 <3.033067>
 sendfile(4, 12, [5884939344], 1447429040) = -1 EAGAIN (Resource temporarily unavailable) <0.000037>
 epoll_wait(9, [{EPOLLOUT, {u32=2181955104, u64=140572166585888}}], 512, 60000) = 1 <0.003355>
 gettimeofday({tv_sec=1667508799, tv_usec=201201}, NULL) = 0 <0.000024>
 sendfile(4, 12, [5884939344] => [8032418896], 2147480496) = 2147479552 <10.727970>
 writev(4, [], 0) = 0 <0.000439>
 epoll_wait(9, [], 512, 60000) = 0 <60.060430>
 gettimeofday({tv_sec=1667508869, tv_usec=991046}, NULL) = 0 <0.000078>
 write(5, "10.40.5.23 - - [03/Nov/2022:21:5"..., 124) = 124 <0.001097>
 close(12) = 0 <0.000063>
 close(4) = 0 <0.000091>

In newer nginx versions (since 1.21.4), this problem was worked around
by defaulting "sendfile_max_chunk" to 2 MiB:

 https://github.com/nginx/nginx/commit/5636e7f7b4

Instead of asking userspace to provide an artificial upper limit, I'd
like the kernel to block for disk I/O at most once, and then pass back
control to userspace.

There is prior art for this kind of behavior in filemap_read():

	/*
	 * If we've already successfully copied some data, then we
	 * can no longer safely return -EIOCBQUEUED. Hence mark
	 * an async read NOWAIT at that point.
	 */
	if ((iocb->ki_flags & IOCB_WAITQ) && already_read)
		iocb->ki_flags |= IOCB_NOWAIT;

This modifies the caller-provided "struct kiocb", which has an effect
on repeated filemap_read() calls.  This effect however vanishes
because the "struct kiocb" is not persistent; splice_direct_to_actor()
doesn't have one, and each generic_file_splice_read() call initializes
a new one, losing the "IOCB_NOWAIT" flag that was injected by
filemap_read().

There was no way to make generic_file_splice_read() aware that
IOCB_NOWAIT was desired because some data had already been transferred
in a previous call:

- checking whether the input file has O_NONBLOCK doesn't work because
  this should be fixed even if the input file is not non-blocking

- the SPLICE_F_NONBLOCK flag is not appropriate because it affects
  only whether pipe operations are non-blocking, not whether
  file/socket operations are non-blocking

Since there are no other parameters, I suggest adding the
SPLICE_F_NOWAIT flag, which is similar to SPLICE_F_NONBLOCK, but
affects the "non-pipe" file descriptor passed to sendfile() or
splice().  It translates to IOCB_NOWAIT for regular files.
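For comparison, the userspace-side workaround mentioned above boils down to
bounding the count passed to each sendfile() call. The following is a minimal
sketch of that pattern with an illustrative 2 MiB limit and a hypothetical
helper name; it is not taken from nginx:

#include <errno.h>
#include <sys/sendfile.h>
#include <sys/types.h>

/* Illustrative per-call limit, mirroring nginx's 2 MiB "sendfile_max_chunk"
 * default mentioned above; name and value are examples, not nginx code. */
#define MAX_CHUNK (2UL << 20)

/*
 * Push up to "count" bytes from in_fd to the non-blocking socket out_fd,
 * never asking the kernel for more than MAX_CHUNK per sendfile() call, so a
 * single call cannot spend an unbounded amount of time blocked on disk I/O.
 */
static ssize_t send_bounded(int out_fd, int in_fd, off_t *offset, size_t count)
{
	size_t total = 0;

	while (total < count) {
		size_t chunk = count - total;

		if (chunk > MAX_CHUNK)
			chunk = MAX_CHUNK;

		ssize_t n = sendfile(out_fd, in_fd, offset, chunk);
		if (n < 0) {
			if (errno == EAGAIN)
				break;	/* socket full: wait for EPOLLOUT, call again */
			return -1;	/* hard error */
		}
		if (n == 0)
			break;		/* end of file */
		total += n;
	}

	return total;
}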
For now, I have documented the flag to be kernel-internal with a high
bit, like io_uring does with SPLICE_F_FD_IN_FIXED, but making this
part of the system call ABI may be a good idea as well.

To: Alexander Viro <viro@zeniv.linux.org.uk>
To: linux-fsdevel@vger.kernel.org
To: linux-kernel@vger.kernel.org
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
---
 fs/splice.c            | 14 ++++++++++++++
 include/linux/splice.h |  6 ++++++
 2 files changed, 20 insertions(+)
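As background on the "kernel-internal with a high bit" choice: the public
flags occupy the low bits, and the splice()/tee() entry points reject any bit
outside SPLICE_F_ALL, so a high-bit flag set only inside the kernel can never
be passed in from userspace. The snippet below paraphrases the existing
definitions and validity check from include/linux/splice.h and fs/splice.c
together with the new bit; it is a condensed illustration, not the literal
kernel source:

#include <errno.h>

/* Public ABI flags (see splice(2)); values as in include/linux/splice.h. */
#define SPLICE_F_MOVE		0x01	/* move pages instead of copying */
#define SPLICE_F_NONBLOCK	0x02	/* don't block on the pipe splicing */
#define SPLICE_F_MORE		0x04	/* expect more data */
#define SPLICE_F_GIFT		0x08	/* pages passed in are a gift */

#define SPLICE_F_ALL \
	(SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT)

/* New, kernel-internal: set by splice_direct_to_actor() after the first
 * successful read; deliberately placed outside SPLICE_F_ALL. */
#define SPLICE_F_NOWAIT		(1U << 30)

/* Condensed version of the flag validation done at the syscall boundary:
 * any bit outside SPLICE_F_ALL coming from userspace is rejected, which is
 * what keeps the high bit internal. */
static int check_user_splice_flags(unsigned int flags)
{
	if (flags & ~SPLICE_F_ALL)
		return -EINVAL;
	return 0;
}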
Comments
On Sat, Nov 05, 2022 at 10:04:21AM +0100, Max Kellermann wrote:
> Since there are no other parameters, I suggest adding the
> SPLICE_F_NOWAIT flag, which is similar to SPLICE_F_NONBLOCK, but
> affects the "non-pipe" file descriptor passed to sendfile() or
> splice(). It translates to IOCB_NOWAIT for regular files.

This looks reasonable to me and matches the read/write side.

> For now, I have documented the flag to be kernel-internal with a high
> bit, like io_uring does with SPLICE_F_FD_IN_FIXED, but making this
> part of the system call ABI may be a good idea as well.

Yeah, my only comment here is that I see no reason to make this purely
kernel internal. And while looking at that: does anyone remember why
the (public) SPLICE_F_* aren't in a uapi header?
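Purely to illustrate the ABI question raised here (nothing below exists in the
posted patch or in any current uapi header; the flag value and helper are
hypothetical): exposing the flag would presumably mean giving it a low value
next to the existing SPLICE_F_* bits and letting callers request it
explicitly, e.g.:

#define _GNU_SOURCE
#include <fcntl.h>		/* splice(), SPLICE_F_* */
#include <sys/types.h>

/* Hypothetical ABI value, for discussion only; in the posted patch the flag
 * is the kernel-internal (1U << 30), and unknown bits from userspace are
 * rejected with EINVAL on today's kernels. */
#define SPLICE_F_NOWAIT_ABI	0x10

/*
 * With an ABI-level flag, a caller could explicitly ask for "at most one
 * blocking read" per call instead of relying on implicit kernel policy.
 */
static ssize_t splice_nowait(int in_fd, int pipe_wr, size_t len)
{
	return splice(in_fd, NULL, pipe_wr, NULL, len,
		      SPLICE_F_MOVE | SPLICE_F_NOWAIT_ABI);
}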
Greeting, FYI, we noticed a 3388.6% improvement of phoronix-test-suite.stress-ng.SENDFILE.bogo_ops_s due to commit: commit: 86f00c46806e19ccce7fa238fbf3aaf0f1f2f531 ("[PATCH] fs/splice: don't block splice_direct_to_actor() after data was read") url: https://github.com/intel-lab-lkp/linux/commits/Max-Kellermann/fs-splice-don-t-block-splice_direct_to_actor-after-data-was-read/20221105-171212 base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git b208b9fbbcba743fb269d15cb46a4036b01936b1 patch link: https://lore.kernel.org/all/20221105090421.21237-1-mk@cm4all.com/ patch subject: [PATCH] fs/splice: don't block splice_direct_to_actor() after data was read in testcase: phoronix-test-suite on test machine: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake) with 512G memory with following parameters: test: stress-ng-1.3.1 option_a: SENDFILE cpufreq_governor: performance test-description: The Phoronix Test Suite is the most comprehensive testing and benchmarking platform available that provides an extensible framework for which new tests can be easily added. test-url: http://www.phoronix-test-suite.com/ Details are as below: ========================================================================================= compiler/cpufreq_governor/kconfig/option_a/rootfs/tbox_group/test/testcase: gcc-11/performance/x86_64-rhel-8.3/SENDFILE/debian-x86_64-phoronix/lkp-csl-2sp7/stress-ng-1.3.1/phoronix-test-suite commit: b208b9fbbc ("Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux") 86f00c4680 ("fs/splice: don't block splice_direct_to_actor() after data was read") b208b9fbbcba743f 86f00c46806e19ccce7fa238fbf ---------------- --------------------------- %stddev %change %stddev \ | \ 105881 ± 10% +3388.6% 3693772 ± 13% phoronix-test-suite.stress-ng.SENDFILE.bogo_ops_s 2.16 ± 7% +1117.6% 26.36 ± 13% phoronix-test-suite.time.user_time 0.20 +0.2 0.40 ± 4% mpstat.cpu.all.usr% 97.92 ± 6% +14.9% 112.48 ± 5% sched_debug.cfs_rq:/.util_est_enqueued.stddev 2717 +2.5% 2783 turbostat.Bzy_MHz 254.74 -4.6% 243.11 turbostat.PkgWatt 2589095 ± 18% -37.5% 1617492 ± 28% numa-meminfo.node0.FilePages 788063 ± 62% +123.2% 1758697 ± 26% numa-meminfo.node1.FilePages 992948 ± 48% +92.6% 1912574 ± 24% numa-meminfo.node1.Inactive 1803548 ± 28% +52.3% 2746365 ± 17% numa-meminfo.node1.MemUsed 4.443e+09 ± 12% -36.0% 2.846e+09 ± 18% numa-numastat.node0.local_node 4.429e+09 ± 12% -35.8% 2.842e+09 ± 18% numa-numastat.node0.numa_hit 5.558e+09 ± 8% -54.2% 2.548e+09 ± 24% numa-numastat.node1.local_node 5.539e+09 ± 8% -54.0% 2.545e+09 ± 24% numa-numastat.node1.numa_hit 9.968e+09 ± 10% -46.0% 5.388e+09 ± 13% proc-vmstat.numa_hit 1e+10 ± 10% -46.1% 5.393e+09 ± 13% proc-vmstat.numa_local 9.759e+09 ± 10% -45.5% 5.32e+09 ± 13% proc-vmstat.pgalloc_normal 9.759e+09 ± 10% -45.5% 5.32e+09 ± 13% proc-vmstat.pgfree 647020 ± 19% -37.5% 404382 ± 28% numa-vmstat.node0.nr_file_pages 4.429e+09 ± 12% -35.8% 2.842e+09 ± 18% numa-vmstat.node0.numa_hit 4.443e+09 ± 12% -36.0% 2.846e+09 ± 18% numa-vmstat.node0.numa_local 196897 ± 62% +123.3% 439599 ± 26% numa-vmstat.node1.nr_file_pages 5.539e+09 ± 8% -54.0% 2.545e+09 ± 24% numa-vmstat.node1.numa_hit 5.558e+09 ± 8% -54.2% 2.548e+09 ± 24% numa-vmstat.node1.numa_local 2.602e+10 ± 10% -39.6% 1.573e+10 ± 13% perf-stat.i.branch-instructions 24921395 ± 12% +40.0% 34894184 ± 6% perf-stat.i.branch-misses 1.223e+08 ± 11% -37.8% 76029198 ± 13% perf-stat.i.cache-references 1.92 ± 18% +43.1% 2.75 ± 12% perf-stat.i.cpi 2.05e+11 +2.6% 2.103e+11 
perf-stat.i.cpu-cycles 3.522e+10 ± 10% -37.8% 2.192e+10 ± 13% perf-stat.i.dTLB-loads 2.356e+10 ± 10% -40.7% 1.396e+10 ± 13% perf-stat.i.dTLB-stores 2507922 ± 14% +527.4% 15734908 ± 8% perf-stat.i.iTLB-load-misses 1.336e+11 ± 10% -39.6% 8.071e+10 ± 13% perf-stat.i.instructions 50055 ± 12% -90.8% 4589 ± 5% perf-stat.i.instructions-per-iTLB-miss 0.61 ± 10% -35.9% 0.39 ± 10% perf-stat.i.ipc 2135130 +2.7% 2191818 perf-stat.i.metric.GHz 8.845e+08 ± 10% -39.1% 5.387e+08 ± 13% perf-stat.i.metric.M/sec 60647 ± 2% -19.8% 48635 ± 6% perf-stat.i.node-stores 0.10 ± 8% +0.1 0.22 ± 7% perf-stat.overall.branch-miss-rate% 11.06 ± 35% +5.9 16.96 ± 25% perf-stat.overall.cache-miss-rate% 1.55 ± 12% +70.3% 2.65 ± 11% perf-stat.overall.cpi 0.00 ± 41% +0.0 0.00 ± 29% perf-stat.overall.dTLB-store-miss-rate% 86.38 ± 3% +10.8 97.16 perf-stat.overall.iTLB-load-miss-rate% 53972 ± 12% -90.5% 5111 ± 5% perf-stat.overall.instructions-per-iTLB-miss 0.65 ± 11% -41.2% 0.38 ± 12% perf-stat.overall.ipc 2.579e+10 ± 10% -39.5% 1.559e+10 ± 13% perf-stat.ps.branch-instructions 24705183 ± 12% +40.1% 34602922 ± 6% perf-stat.ps.branch-misses 1.212e+08 ± 11% -37.8% 75385416 ± 13% perf-stat.ps.cache-references 2.032e+11 +2.6% 2.085e+11 perf-stat.ps.cpu-cycles 3.491e+10 ± 10% -37.7% 2.173e+10 ± 13% perf-stat.ps.dTLB-loads 2.335e+10 ± 10% -40.7% 1.385e+10 ± 13% perf-stat.ps.dTLB-stores 2486067 ± 14% +527.6% 15601840 ± 8% perf-stat.ps.iTLB-load-misses 1.324e+11 ± 10% -39.5% 8.002e+10 ± 13% perf-stat.ps.instructions 60175 ± 2% -19.8% 48272 ± 6% perf-stat.ps.node-stores 1.529e+13 ± 10% -39.5% 9.254e+12 ± 12% perf-stat.total.instructions 55.97 ± 6% -13.5 42.46 ± 4% perf-profile.calltrace.cycles-pp.do_iter_readv_writev.do_iter_read.ovl_read_iter.generic_file_splice_read.splice_direct_to_actor 55.59 ± 6% -13.4 42.15 ± 4% perf-profile.calltrace.cycles-pp.shmem_file_read_iter.do_iter_readv_writev.do_iter_read.ovl_read_iter.generic_file_splice_read 29.58 ± 11% -12.7 16.90 ± 12% perf-profile.calltrace.cycles-pp.iov_iter_zero.shmem_file_read_iter.do_iter_readv_writev.do_iter_read.ovl_read_iter 57.68 ± 3% -11.7 45.98 perf-profile.calltrace.cycles-pp.do_iter_read.ovl_read_iter.generic_file_splice_read.splice_direct_to_actor.do_splice_direct 17.32 ± 12% -7.4 9.89 ± 14% perf-profile.calltrace.cycles-pp.append_pipe.iov_iter_zero.shmem_file_read_iter.do_iter_readv_writev.do_iter_read 13.20 ± 12% -5.5 7.66 ± 15% perf-profile.calltrace.cycles-pp.__alloc_pages.append_pipe.iov_iter_zero.shmem_file_read_iter.do_iter_readv_writev 10.88 ± 11% -4.6 6.27 ± 9% perf-profile.calltrace.cycles-pp.memset_erms.iov_iter_zero.shmem_file_read_iter.do_iter_readv_writev.do_iter_read 15.34 ± 7% -4.3 11.02 ± 14% perf-profile.calltrace.cycles-pp.direct_splice_actor.splice_direct_to_actor.do_splice_direct.do_sendfile.__x64_sys_sendfile64 15.23 ± 7% -4.3 10.96 ± 14% perf-profile.calltrace.cycles-pp.splice_from_pipe.direct_splice_actor.splice_direct_to_actor.do_splice_direct.do_sendfile 15.05 ± 7% -4.2 10.86 ± 14% perf-profile.calltrace.cycles-pp.__splice_from_pipe.splice_from_pipe.direct_splice_actor.splice_direct_to_actor.do_splice_direct 8.26 ± 11% -3.2 5.06 ± 14% perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages.append_pipe.iov_iter_zero.shmem_file_read_iter 8.38 ± 9% -2.4 6.01 ± 14% perf-profile.calltrace.cycles-pp.free_unref_page.__splice_from_pipe.splice_from_pipe.direct_splice_actor.splice_direct_to_actor 5.64 ± 10% -1.9 3.70 ± 13% perf-profile.calltrace.cycles-pp.rmqueue.get_page_from_freelist.__alloc_pages.append_pipe.iov_iter_zero 99.74 -1.5 98.21 
perf-profile.calltrace.cycles-pp.splice_direct_to_actor.do_splice_direct.do_sendfile.__x64_sys_sendfile64.do_syscall_64 99.75 -1.4 98.36 perf-profile.calltrace.cycles-pp.do_splice_direct.do_sendfile.__x64_sys_sendfile64.do_syscall_64.entry_SYSCALL_64_after_hwframe 2.30 ± 13% -1.1 1.20 ± 16% perf-profile.calltrace.cycles-pp.alloc_pages.append_pipe.iov_iter_zero.shmem_file_read_iter.do_iter_readv_writev 1.74 ± 12% -0.8 0.96 ± 11% perf-profile.calltrace.cycles-pp.xas_load.__filemap_get_folio.shmem_get_folio_gfp.shmem_file_read_iter.do_iter_readv_writev 99.79 -0.8 99.03 perf-profile.calltrace.cycles-pp.do_sendfile.__x64_sys_sendfile64.do_syscall_64.entry_SYSCALL_64_after_hwframe 1.52 ± 12% -0.7 0.79 ± 15% perf-profile.calltrace.cycles-pp.free_unref_page_commit.free_unref_page.__splice_from_pipe.splice_from_pipe.direct_splice_actor 4.13 ± 3% -0.7 3.46 ± 13% perf-profile.calltrace.cycles-pp.generic_pipe_buf_release.__splice_from_pipe.splice_from_pipe.direct_splice_actor.splice_direct_to_actor 1.01 ± 11% -0.6 0.40 ± 72% perf-profile.calltrace.cycles-pp.free_pcp_prepare.free_unref_page.__splice_from_pipe.splice_from_pipe.direct_splice_actor 99.80 -0.4 99.36 perf-profile.calltrace.cycles-pp.__x64_sys_sendfile64.do_syscall_64.entry_SYSCALL_64_after_hwframe 99.82 -0.4 99.47 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe 0.79 ± 10% -0.3 0.48 ± 45% perf-profile.calltrace.cycles-pp.__list_del_entry_valid.rmqueue.get_page_from_freelist.__alloc_pages.append_pipe 0.90 ± 7% -0.3 0.59 ± 4% perf-profile.calltrace.cycles-pp.__might_resched.shmem_file_read_iter.do_iter_readv_writev.do_iter_read.ovl_read_iter 99.83 -0.3 99.52 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe 83.75 +2.5 86.27 ± 2% perf-profile.calltrace.cycles-pp.generic_file_splice_read.splice_direct_to_actor.do_splice_direct.do_sendfile.__x64_sys_sendfile64 83.42 +2.5 85.97 ± 2% perf-profile.calltrace.cycles-pp.ovl_read_iter.generic_file_splice_read.splice_direct_to_actor.do_splice_direct.do_sendfile 11.81 ± 13% +6.4 18.22 perf-profile.calltrace.cycles-pp.override_creds.ovl_read_iter.generic_file_splice_read.splice_direct_to_actor.do_splice_direct 13.34 ± 14% +7.7 21.02 ± 7% perf-profile.calltrace.cycles-pp.revert_creds.ovl_read_iter.generic_file_splice_read.splice_direct_to_actor.do_splice_direct 55.98 ± 6% -13.5 42.46 ± 4% perf-profile.children.cycles-pp.do_iter_readv_writev 55.85 ± 6% -13.5 42.34 ± 4% perf-profile.children.cycles-pp.shmem_file_read_iter 29.70 ± 11% -12.7 16.96 ± 12% perf-profile.children.cycles-pp.iov_iter_zero 57.69 ± 3% -11.7 45.98 perf-profile.children.cycles-pp.do_iter_read 17.38 ± 12% -7.5 9.93 ± 14% perf-profile.children.cycles-pp.append_pipe 13.64 ± 12% -5.7 7.90 ± 15% perf-profile.children.cycles-pp.__alloc_pages 10.94 ± 10% -4.6 6.30 ± 9% perf-profile.children.cycles-pp.memset_erms 15.34 ± 7% -4.3 11.02 ± 14% perf-profile.children.cycles-pp.direct_splice_actor 15.24 ± 7% -4.3 10.96 ± 14% perf-profile.children.cycles-pp.splice_from_pipe 15.13 ± 7% -4.2 10.90 ± 14% perf-profile.children.cycles-pp.__splice_from_pipe 8.36 ± 11% -3.2 5.11 ± 14% perf-profile.children.cycles-pp.get_page_from_freelist 8.62 ± 9% -2.5 6.13 ± 14% perf-profile.children.cycles-pp.free_unref_page 5.76 ± 10% -2.0 3.76 ± 13% perf-profile.children.cycles-pp.rmqueue 99.74 -1.5 98.22 perf-profile.children.cycles-pp.splice_direct_to_actor 99.75 -1.4 98.36 perf-profile.children.cycles-pp.do_splice_direct 2.39 ± 13% -1.1 1.24 ± 16% perf-profile.children.cycles-pp.alloc_pages 2.63 ± 9% -1.0 1.65 ± 8% 
perf-profile.children.cycles-pp.__might_resched 1.84 ± 12% -0.8 1.02 ± 12% perf-profile.children.cycles-pp.xas_load 1.61 ± 12% -0.8 0.84 ± 14% perf-profile.children.cycles-pp.free_unref_page_commit 99.79 -0.7 99.04 perf-profile.children.cycles-pp.do_sendfile 4.18 ± 3% -0.7 3.49 ± 13% perf-profile.children.cycles-pp.generic_pipe_buf_release 1.42 ± 11% -0.6 0.80 ± 8% perf-profile.children.cycles-pp.__cond_resched 1.06 ± 11% -0.5 0.59 ± 14% perf-profile.children.cycles-pp.free_pcp_prepare 0.94 ± 14% -0.4 0.50 ± 15% perf-profile.children.cycles-pp.__might_sleep 99.80 -0.4 99.37 perf-profile.children.cycles-pp.__x64_sys_sendfile64 0.75 ± 15% -0.4 0.38 ± 16% perf-profile.children.cycles-pp.policy_node 0.79 ± 12% -0.4 0.42 ± 15% perf-profile.children.cycles-pp.__folio_put 0.90 ± 10% -0.4 0.54 ± 12% perf-profile.children.cycles-pp._raw_spin_unlock_irqrestore 99.95 -0.3 99.64 perf-profile.children.cycles-pp.do_syscall_64 99.95 -0.3 99.68 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe 0.58 ± 12% -0.3 0.31 ± 10% perf-profile.children.cycles-pp.rcu_all_qs 0.62 ± 14% -0.3 0.36 ± 12% perf-profile.children.cycles-pp.xas_start 0.80 ± 10% -0.2 0.56 ± 10% perf-profile.children.cycles-pp.__list_del_entry_valid 0.40 ± 17% -0.2 0.20 ± 17% perf-profile.children.cycles-pp._find_first_bit 0.32 ± 14% -0.2 0.16 ± 17% perf-profile.children.cycles-pp.__list_add_valid 0.34 ± 12% -0.1 0.20 ± 8% perf-profile.children.cycles-pp.sanity 0.32 ± 11% -0.1 0.18 ± 14% perf-profile.children.cycles-pp.__page_cache_release 0.27 ± 12% -0.1 0.14 ± 15% perf-profile.children.cycles-pp.should_fail_alloc_page 0.25 ± 12% -0.1 0.13 ± 17% perf-profile.children.cycles-pp.__mem_cgroup_uncharge 0.20 ± 14% -0.1 0.10 ± 13% perf-profile.children.cycles-pp.policy_nodemask 0.12 ± 15% -0.0 0.07 ± 12% perf-profile.children.cycles-pp.pipe_to_null 0.04 ± 45% +0.0 0.07 perf-profile.children.cycles-pp.__libc_write 0.05 ± 7% +0.0 0.08 ± 11% perf-profile.children.cycles-pp.ktime_get_coarse_real_ts64 0.04 ± 71% +0.0 0.07 perf-profile.children.cycles-pp.generic_file_write_iter 0.04 ± 71% +0.0 0.07 ± 5% perf-profile.children.cycles-pp.generic_perform_write 0.04 ± 71% +0.0 0.07 perf-profile.children.cycles-pp.__generic_file_write_iter 0.00 +0.1 0.05 ± 8% perf-profile.children.cycles-pp.syscall_return_via_sysret 0.00 +0.1 0.06 ± 11% perf-profile.children.cycles-pp.ovl_d_real 0.00 +0.1 0.07 ± 14% perf-profile.children.cycles-pp.__put_user_nocheck_8 0.16 ± 6% +0.1 0.23 ± 2% perf-profile.children.cycles-pp.current_time 0.00 +0.1 0.08 ± 14% perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack 0.28 ± 8% +0.1 0.37 ± 4% perf-profile.children.cycles-pp.atime_needs_update 0.09 ± 14% +0.1 0.18 ± 15% perf-profile.children.cycles-pp.__fsnotify_parent 0.35 ± 9% +0.1 0.45 ± 6% perf-profile.children.cycles-pp.touch_atime 0.00 +0.1 0.12 ± 12% perf-profile.children.cycles-pp.__might_fault 0.00 +0.1 0.14 ± 14% perf-profile.children.cycles-pp._copy_from_user 0.00 +0.1 0.14 ± 4% perf-profile.children.cycles-pp.__entry_text_start 0.00 +0.2 0.16 ± 19% perf-profile.children.cycles-pp.__fdget 83.76 +2.5 86.28 ± 2% perf-profile.children.cycles-pp.generic_file_splice_read 83.44 +2.5 85.99 ± 2% perf-profile.children.cycles-pp.ovl_read_iter 11.82 ± 13% +6.4 18.26 perf-profile.children.cycles-pp.override_creds 13.35 ± 14% +7.7 21.06 ± 7% perf-profile.children.cycles-pp.revert_creds 10.81 ± 10% -4.6 6.23 ± 9% perf-profile.self.cycles-pp.memset_erms 3.26 ± 13% -1.5 1.74 ± 17% perf-profile.self.cycles-pp.__alloc_pages 3.02 ± 12% -1.4 1.64 ± 14% 
perf-profile.self.cycles-pp.free_unref_page 2.96 ± 12% -1.4 1.60 ± 14% perf-profile.self.cycles-pp.rmqueue 2.52 ± 13% -1.2 1.30 ± 16% perf-profile.self.cycles-pp.get_page_from_freelist 2.58 ± 9% -1.0 1.62 ± 8% perf-profile.self.cycles-pp.__might_resched 1.46 ± 13% -0.7 0.78 ± 18% perf-profile.self.cycles-pp.alloc_pages 4.11 ± 3% -0.7 3.44 ± 13% perf-profile.self.cycles-pp.generic_pipe_buf_release 1.40 ± 13% -0.6 0.77 ± 15% perf-profile.self.cycles-pp.__splice_from_pipe 1.29 ± 12% -0.6 0.68 ± 14% perf-profile.self.cycles-pp.free_unref_page_commit 1.37 ± 10% -0.6 0.79 ± 9% perf-profile.self.cycles-pp.append_pipe 1.21 ± 12% -0.5 0.67 ± 11% perf-profile.self.cycles-pp.xas_load 1.02 ± 12% -0.5 0.54 ± 14% perf-profile.self.cycles-pp.iov_iter_zero 1.54 ± 8% -0.5 1.08 ± 3% perf-profile.self.cycles-pp.shmem_file_read_iter 0.99 ± 11% -0.4 0.55 ± 14% perf-profile.self.cycles-pp.free_pcp_prepare 0.80 ± 13% -0.4 0.43 ± 15% perf-profile.self.cycles-pp.__might_sleep 0.82 ± 9% -0.3 0.49 ± 5% perf-profile.self.cycles-pp.__cond_resched 0.78 ± 10% -0.3 0.46 ± 12% perf-profile.self.cycles-pp._raw_spin_unlock_irqrestore 0.56 ± 13% -0.2 0.32 ± 12% perf-profile.self.cycles-pp.xas_start 0.75 ± 10% -0.2 0.54 ± 11% perf-profile.self.cycles-pp.__list_del_entry_valid 0.44 ± 13% -0.2 0.23 ± 13% perf-profile.self.cycles-pp.rcu_all_qs 0.34 ± 16% -0.2 0.17 ± 17% perf-profile.self.cycles-pp._find_first_bit 0.36 ± 15% -0.2 0.18 ± 16% perf-profile.self.cycles-pp.policy_node 0.32 ± 12% -0.2 0.17 ± 15% perf-profile.self.cycles-pp.__page_cache_release 0.34 ± 12% -0.1 0.19 ± 8% perf-profile.self.cycles-pp.sanity 0.28 ± 14% -0.1 0.14 ± 19% perf-profile.self.cycles-pp.__list_add_valid 0.24 ± 13% -0.1 0.12 ± 15% perf-profile.self.cycles-pp.__mem_cgroup_uncharge 0.18 ± 14% -0.1 0.10 ± 14% perf-profile.self.cycles-pp.__folio_put 0.16 ± 12% -0.1 0.08 ± 12% perf-profile.self.cycles-pp.policy_nodemask 0.12 ± 9% -0.1 0.06 ± 17% perf-profile.self.cycles-pp.should_fail_alloc_page 0.10 ± 13% -0.0 0.06 ± 13% perf-profile.self.cycles-pp.direct_splice_actor 0.10 ± 14% -0.0 0.06 ± 16% perf-profile.self.cycles-pp.splice_from_pipe 0.09 ± 14% -0.0 0.06 ± 14% perf-profile.self.cycles-pp.pipe_to_null 0.10 ± 10% +0.0 0.14 ± 6% perf-profile.self.cycles-pp.current_time 0.03 ± 70% +0.0 0.08 ± 12% perf-profile.self.cycles-pp.ktime_get_coarse_real_ts64 0.00 +0.1 0.05 ± 8% perf-profile.self.cycles-pp.syscall_return_via_sysret 0.07 ± 15% +0.1 0.13 ± 10% perf-profile.self.cycles-pp.generic_file_splice_read 0.00 +0.1 0.06 ± 11% perf-profile.self.cycles-pp.ovl_d_real 0.00 +0.1 0.07 ± 16% perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack 0.00 +0.1 0.07 ± 14% perf-profile.self.cycles-pp.__put_user_nocheck_8 0.09 ± 14% +0.1 0.18 ± 15% perf-profile.self.cycles-pp.__fsnotify_parent 0.00 +0.1 0.14 ± 4% perf-profile.self.cycles-pp.__entry_text_start 0.00 +0.2 0.15 ± 20% perf-profile.self.cycles-pp.__fdget 0.11 ± 14% +0.2 0.30 ± 14% perf-profile.self.cycles-pp.splice_direct_to_actor 0.00 +0.2 0.24 ± 11% perf-profile.self.cycles-pp.do_sendfile 11.40 ± 4% +1.8 13.16 ± 3% perf-profile.self.cycles-pp.__filemap_get_folio 11.73 ± 13% +6.4 18.12 perf-profile.self.cycles-pp.override_creds 13.25 ± 14% +7.6 20.88 ± 7% perf-profile.self.cycles-pp.revert_creds To reproduce: git clone https://github.com/intel/lkp-tests.git cd lkp-tests sudo bin/lkp install job.yaml # job file is attached in this email bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run sudo bin/lkp run generated-yaml-file # if come across any failure that blocks the test, # please 
remove ~/.lkp and /lkp dir to run from a clean state.

Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
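As a rough intuition for why the bogo-ops rate jumps: the SENDFILE stressor's
score is essentially the number of sendfile() calls completed per second, and
with the patch a call may return a short transfer early instead of pushing the
whole requested range. The loop below is only an illustrative approximation of
such a benchmark loop (file name and counts are placeholders; it is not the
stress-ng source):

#include <fcntl.h>
#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Illustrative sendfile-throughput loop: repeatedly sendfile() a file into
 * /dev/null and count completed calls.  If each call is allowed to return
 * early (short transfer) instead of pushing the full range, more calls
 * complete per second, which is what a bogo-ops metric counts.
 */
int main(void)
{
	int in = open("testfile", O_RDONLY);	/* placeholder input file */
	int out = open("/dev/null", O_WRONLY);
	unsigned long ops = 0;

	if (in < 0 || out < 0)
		return 1;

	for (int i = 0; i < 1000000; i++) {
		off_t off = 0;

		if (sendfile(out, in, &off, 1 << 20) < 0)
			break;
		ops++;
	}

	printf("%lu sendfile() calls completed\n", ops);
	return 0;
}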
diff --git a/fs/splice.c b/fs/splice.c
index 0878b852b355..7a8d5fee0965 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -306,6 +306,8 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 	iov_iter_pipe(&to, READ, pipe, len);
 	init_sync_kiocb(&kiocb, in);
 	kiocb.ki_pos = *ppos;
+	if (flags & SPLICE_F_NOWAIT)
+		kiocb.ki_flags |= IOCB_NOWAIT;
 	ret = call_read_iter(in, &kiocb, &to);
 	if (ret > 0) {
 		*ppos = kiocb.ki_pos;
@@ -866,6 +868,18 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 		if (unlikely(ret <= 0))
 			goto out_release;
 
+		/*
+		 * After at least one byte was read from the input
+		 * file, don't wait for blocking I/O in the following
+		 * loop iterations; instead of blocking for arbitrary
+		 * amounts of time in the kernel, let userspace decide
+		 * how to proceed. This avoids excessive latency if
+		 * the output is being consumed faster than the input
+		 * file can fill it (e.g. sendfile() from a slow hard
+		 * disk to a fast network).
+		 */
+		flags |= SPLICE_F_NOWAIT;
+
 		read_len = ret;
 		sd->total_len = read_len;
 
diff --git a/include/linux/splice.h b/include/linux/splice.h
index a55179fd60fc..14021bba7829 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -23,6 +23,12 @@
 #define SPLICE_F_ALL (SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT)
 
+/*
+ * Don't wait for I/O (internal flag for the splice_direct_to_actor()
+ * loop).
+ */
+#define SPLICE_F_NOWAIT	(1U << 30)
+
 /*
  * Passed to the actors
  */
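To show the calling pattern the patch is meant to enable (a sketch under the
stated assumptions, not code from nginx or any particular server): userspace
can hand the kernel a large count on a non-blocking socket, accept short
transfers, and return to its event loop, because sendfile() now comes back
after at most one blocking read instead of pushing up to 2 GiB per call. The
struct and helper names below are hypothetical:

#include <errno.h>
#include <stdbool.h>
#include <sys/sendfile.h>
#include <sys/types.h>

/* Hypothetical per-connection state for an epoll-driven server. */
struct conn {
	int sock_fd;		/* non-blocking socket */
	int file_fd;		/* regular file being served */
	off_t offset;
	off_t remaining;
};

/*
 * Called from the event loop when the connection should make progress.
 * One sendfile() call per invocation: with the patch, the call blocks for
 * disk I/O at most once and then returns a short count, so control comes
 * back here quickly and other connections or timeouts can be serviced.
 * Returns true when the transfer is complete or failed, false to keep the
 * connection armed for EPOLLOUT (level-triggered) and try again later.
 */
static bool conn_pump(struct conn *c)
{
	ssize_t n = sendfile(c->sock_fd, c->file_fd, &c->offset, c->remaining);

	if (n < 0) {
		if (errno == EAGAIN || errno == EINTR)
			return false;	/* socket full or interrupted: retry later */
		return true;		/* hard error: give up on this transfer */
	}

	c->remaining -= n;
	return c->remaining == 0;
}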