fs/splice: don't block splice_direct_to_actor() after data was read

Message ID 20221105090421.21237-1-mk@cm4all.com
State New
Series fs/splice: don't block splice_direct_to_actor() after data was read

Commit Message

Max Kellermann Nov. 5, 2022, 9:04 a.m. UTC
  From: Max Kellermann <max.kellermann@ionos.com>

If userspace calls sendfile() with a very large "count" parameter, the
kernel can block for a very long time, until up to 2 GiB (0x7ffff000
bytes, i.e. MAX_RW_COUNT) have been read from the hard disk and pushed
into the socket buffer.

Usually, that is not a problem, because the socket write buffer gets
filled quickly, and if the socket is non-blocking, the last
direct_splice_actor() call will return -EAGAIN, causing
splice_direct_to_actor() to break from the loop, and sendfile() will
return a partial transfer.
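
As an illustration, the usual userspace pattern around this behavior
looks roughly like the following sketch (the helper name and error
handling are hypothetical, not part of this patch):

	#include <sys/types.h>
	#include <sys/sendfile.h>
	#include <errno.h>

	/* Hypothetical helper: push up to "left" bytes; a non-blocking
	 * socket yields partial transfers or -1/EAGAIN when full. */
	static ssize_t send_some(int sock_fd, int file_fd, off_t *off,
				 size_t left)
	{
		ssize_t n = sendfile(sock_fd, file_fd, off, left);
		if (n < 0 && errno == EAGAIN)
			return 0; /* wait for EPOLLOUT, then call again */
		return n; /* n < left is normal; the caller simply loops */
	}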

However, if the network happens to be faster than the hard disk, and
the socket buffer keeps getting drained between two
generic_file_read_iter() calls, the sendfile() system call can keep
running for a long time, blocking for disk I/O over and over.

That is undesirable, because it can block the calling process for too
long.  I discovered a problem where nginx would drop an HTTP
connection: the kernel had just transferred 2 GiB in one call, the
HTTP socket then stayed unwritable (no EPOLLOUT) for more than 60
seconds, and nginx's timeout fired:

  sendfile(4, 12, [5518919528] => [5884939344], 1813448856) = 366019816 <3.033067>
  sendfile(4, 12, [5884939344], 1447429040) = -1 EAGAIN (Resource temporarily unavailable) <0.000037>
  epoll_wait(9, [{EPOLLOUT, {u32=2181955104, u64=140572166585888}}], 512, 60000) = 1 <0.003355>
  gettimeofday({tv_sec=1667508799, tv_usec=201201}, NULL) = 0 <0.000024>
  sendfile(4, 12, [5884939344] => [8032418896], 2147480496) = 2147479552 <10.727970>
  writev(4, [], 0) = 0 <0.000439>
  epoll_wait(9, [], 512, 60000) = 0 <60.060430>
  gettimeofday({tv_sec=1667508869, tv_usec=991046}, NULL) = 0 <0.000078>
  write(5, "10.40.5.23 - - [03/Nov/2022:21:5"..., 124) = 124 <0.001097>
  close(12) = 0 <0.000063>
  close(4)  = 0 <0.000091>

In newer nginx versions (since 1.21.4), this problem was worked around
by defaulting "sendfile_max_chunk" to 2 MiB:

 https://github.com/nginx/nginx/commit/5636e7f7b4
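
In plain C, that workaround amounts to capping the per-call count (a
hypothetical sketch, not nginx source):

	#include <sys/types.h>
	#include <sys/sendfile.h>

	#define SENDFILE_MAX_CHUNK (2 * 1024 * 1024) /* nginx default since 1.21.4 */

	/* Hypothetical helper: bound each sendfile() call so the kernel
	 * blocks on disk I/O for at most one ~2 MiB read burst. */
	static ssize_t sendfile_chunked(int sock_fd, int file_fd, off_t *off,
					size_t left)
	{
		size_t count = left < SENDFILE_MAX_CHUNK
			     ? left : SENDFILE_MAX_CHUNK;
		return sendfile(sock_fd, file_fd, off, count);
	}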

Instead of asking userspace to provide an artificial upper limit, I'd
like the kernel to block for disk I/O at most once, and then pass back
control to userspace.

There is prior art for this kind of behavior in filemap_read():

	/*
	 * If we've already successfully copied some data, then we
	 * can no longer safely return -EIOCBQUEUED. Hence mark
	 * an async read NOWAIT at that point.
	 */
	if ((iocb->ki_flags & IOCB_WAITQ) && already_read)
		iocb->ki_flags |= IOCB_NOWAIT;

This modifies the caller-provided "struct kiocb", so it carries over
to repeated filemap_read() calls on the same kiocb.  For splice,
however, the effect vanishes because the "struct kiocb" is not
persistent: splice_direct_to_actor() doesn't keep one across loop
iterations, and each generic_file_splice_read() call initializes a
fresh one, losing the "IOCB_NOWAIT" flag that filemap_read() had
injected.
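
For reference, these are the relevant lines of
generic_file_splice_read() (they also appear as context in the patch
below); the kiocb is rebuilt from scratch on every call:

	struct kiocb kiocb;

	iov_iter_pipe(&to, READ, pipe, len);
	init_sync_kiocb(&kiocb, in);	/* ki_flags rebuilt from the file */
	kiocb.ki_pos = *ppos;
	ret = call_read_iter(in, &kiocb, &to);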

There was no way to make generic_file_splice_read() aware that
IOCB_NOWAIT was desired because some data had already been transferred
in a previous call:

- checking whether the input file has O_NONBLOCK doesn't work because
  this should be fixed even if the input file is not non-blocking

- the SPLICE_F_NONBLOCK flag is not appropriate because it affects
  only whether pipe operations are non-blocking, not whether
  file/socket operations are non-blocking

Since there are no other parameters, I suggest adding the
SPLICE_F_NOWAIT flag, which is similar to SPLICE_F_NONBLOCK but
affects the "non-pipe" file descriptor passed to sendfile() or
splice().  It translates to IOCB_NOWAIT for regular files.  For now, I
have documented the flag as kernel-internal and given it a high bit,
like io_uring does with SPLICE_F_FD_IN_FIXED, but making it part of
the system call ABI may be a good idea as well.
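
If the flag were exported to userspace later, usage could look like
this hypothetical sketch (with this patch, the bit is kernel-internal
only and is not accepted at the syscall boundary):

	/* Hypothetical: read from the regular file without sleeping on
	 * disk I/O; -EAGAIN if the data is not in the page cache. */
	ssize_t n = splice(file_fd, &file_off, pipe_fd, NULL, len,
			   SPLICE_F_NOWAIT);
	if (n < 0 && errno == EAGAIN)
		/* no data immediately available; retry or poll */;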

To: Alexander Viro <viro@zeniv.linux.org.uk>
To: linux-fsdevel@vger.kernel.org
To: linux-kernel@vger.kernel.org
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
---
 fs/splice.c            | 14 ++++++++++++++
 include/linux/splice.h |  6 ++++++
 2 files changed, 20 insertions(+)
  

Comments

Christoph Hellwig Nov. 8, 2022, 7:37 a.m. UTC | #1
On Sat, Nov 05, 2022 at 10:04:21AM +0100, Max Kellermann wrote:
> Since there are no other parameters, I suggest adding the
> SPLICE_F_NOWAIT flag, which is similar to SPLICE_F_NONBLOCK, but
> affects the "non-pipe" file descriptor passed to sendfile() or
> splice().  It translates to IOCB_NOWAIT for regular files.

This looks reasonable to me and matches the read/write side.

>
> For now, I
> have documented the flag to be kernel-internal with a high bit, like
> io_uring does with SPLICE_F_FD_IN_FIXED, but making this part of the
> system call ABI may be a good idea as well.

Yeah, my only comment here is that I see no reason to make this
purely kernel internal.

And while looking at that: does anyone remember why the (public)
SPLICE_F_* aren't in a uapi header?
  
Liu, Yujie Nov. 20, 2022, 1:19 p.m. UTC | #2
Greetings,

FYI, we noticed a 3388.6% improvement of phoronix-test-suite.stress-ng.SENDFILE.bogo_ops_s due to commit:

commit: 86f00c46806e19ccce7fa238fbf3aaf0f1f2f531 ("[PATCH] fs/splice: don't block splice_direct_to_actor() after data was read")
url: https://github.com/intel-lab-lkp/linux/commits/Max-Kellermann/fs-splice-don-t-block-splice_direct_to_actor-after-data-was-read/20221105-171212
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git b208b9fbbcba743fb269d15cb46a4036b01936b1
patch link: https://lore.kernel.org/all/20221105090421.21237-1-mk@cm4all.com/
patch subject: [PATCH] fs/splice: don't block splice_direct_to_actor() after data was read

in testcase: phoronix-test-suite
on test machine: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake) with 512G memory
with following parameters:

	test: stress-ng-1.3.1
	option_a: SENDFILE
	cpufreq_governor: performance

test-description: The Phoronix Test Suite is the most comprehensive testing and benchmarking platform available that provides an extensible framework for which new tests can be easily added.
test-url: http://www.phoronix-test-suite.com/


Details are as below:

=========================================================================================
compiler/cpufreq_governor/kconfig/option_a/rootfs/tbox_group/test/testcase:
  gcc-11/performance/x86_64-rhel-8.3/SENDFILE/debian-x86_64-phoronix/lkp-csl-2sp7/stress-ng-1.3.1/phoronix-test-suite

commit: 
  b208b9fbbc ("Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux")
  86f00c4680 ("fs/splice: don't block splice_direct_to_actor() after data was read")

b208b9fbbcba743f 86f00c46806e19ccce7fa238fbf 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
    105881 ± 10%   +3388.6%    3693772 ± 13%  phoronix-test-suite.stress-ng.SENDFILE.bogo_ops_s
      2.16 ±  7%   +1117.6%      26.36 ± 13%  phoronix-test-suite.time.user_time
      0.20            +0.2        0.40 ±  4%  mpstat.cpu.all.usr%
     97.92 ±  6%     +14.9%     112.48 ±  5%  sched_debug.cfs_rq:/.util_est_enqueued.stddev
      2717            +2.5%       2783        turbostat.Bzy_MHz
    254.74            -4.6%     243.11        turbostat.PkgWatt
   2589095 ± 18%     -37.5%    1617492 ± 28%  numa-meminfo.node0.FilePages
    788063 ± 62%    +123.2%    1758697 ± 26%  numa-meminfo.node1.FilePages
    992948 ± 48%     +92.6%    1912574 ± 24%  numa-meminfo.node1.Inactive
   1803548 ± 28%     +52.3%    2746365 ± 17%  numa-meminfo.node1.MemUsed
 4.443e+09 ± 12%     -36.0%  2.846e+09 ± 18%  numa-numastat.node0.local_node
 4.429e+09 ± 12%     -35.8%  2.842e+09 ± 18%  numa-numastat.node0.numa_hit
 5.558e+09 ±  8%     -54.2%  2.548e+09 ± 24%  numa-numastat.node1.local_node
 5.539e+09 ±  8%     -54.0%  2.545e+09 ± 24%  numa-numastat.node1.numa_hit
 9.968e+09 ± 10%     -46.0%  5.388e+09 ± 13%  proc-vmstat.numa_hit
     1e+10 ± 10%     -46.1%  5.393e+09 ± 13%  proc-vmstat.numa_local
 9.759e+09 ± 10%     -45.5%   5.32e+09 ± 13%  proc-vmstat.pgalloc_normal
 9.759e+09 ± 10%     -45.5%   5.32e+09 ± 13%  proc-vmstat.pgfree
    647020 ± 19%     -37.5%     404382 ± 28%  numa-vmstat.node0.nr_file_pages
 4.429e+09 ± 12%     -35.8%  2.842e+09 ± 18%  numa-vmstat.node0.numa_hit
 4.443e+09 ± 12%     -36.0%  2.846e+09 ± 18%  numa-vmstat.node0.numa_local
    196897 ± 62%    +123.3%     439599 ± 26%  numa-vmstat.node1.nr_file_pages
 5.539e+09 ±  8%     -54.0%  2.545e+09 ± 24%  numa-vmstat.node1.numa_hit
 5.558e+09 ±  8%     -54.2%  2.548e+09 ± 24%  numa-vmstat.node1.numa_local
 2.602e+10 ± 10%     -39.6%  1.573e+10 ± 13%  perf-stat.i.branch-instructions
  24921395 ± 12%     +40.0%   34894184 ±  6%  perf-stat.i.branch-misses
 1.223e+08 ± 11%     -37.8%   76029198 ± 13%  perf-stat.i.cache-references
      1.92 ± 18%     +43.1%       2.75 ± 12%  perf-stat.i.cpi
  2.05e+11            +2.6%  2.103e+11        perf-stat.i.cpu-cycles
 3.522e+10 ± 10%     -37.8%  2.192e+10 ± 13%  perf-stat.i.dTLB-loads
 2.356e+10 ± 10%     -40.7%  1.396e+10 ± 13%  perf-stat.i.dTLB-stores
   2507922 ± 14%    +527.4%   15734908 ±  8%  perf-stat.i.iTLB-load-misses
 1.336e+11 ± 10%     -39.6%  8.071e+10 ± 13%  perf-stat.i.instructions
     50055 ± 12%     -90.8%       4589 ±  5%  perf-stat.i.instructions-per-iTLB-miss
      0.61 ± 10%     -35.9%       0.39 ± 10%  perf-stat.i.ipc
   2135130            +2.7%    2191818        perf-stat.i.metric.GHz
 8.845e+08 ± 10%     -39.1%  5.387e+08 ± 13%  perf-stat.i.metric.M/sec
     60647 ±  2%     -19.8%      48635 ±  6%  perf-stat.i.node-stores
      0.10 ±  8%      +0.1        0.22 ±  7%  perf-stat.overall.branch-miss-rate%
     11.06 ± 35%      +5.9       16.96 ± 25%  perf-stat.overall.cache-miss-rate%
      1.55 ± 12%     +70.3%       2.65 ± 11%  perf-stat.overall.cpi
      0.00 ± 41%      +0.0        0.00 ± 29%  perf-stat.overall.dTLB-store-miss-rate%
     86.38 ±  3%     +10.8       97.16        perf-stat.overall.iTLB-load-miss-rate%
     53972 ± 12%     -90.5%       5111 ±  5%  perf-stat.overall.instructions-per-iTLB-miss
      0.65 ± 11%     -41.2%       0.38 ± 12%  perf-stat.overall.ipc
 2.579e+10 ± 10%     -39.5%  1.559e+10 ± 13%  perf-stat.ps.branch-instructions
  24705183 ± 12%     +40.1%   34602922 ±  6%  perf-stat.ps.branch-misses
 1.212e+08 ± 11%     -37.8%   75385416 ± 13%  perf-stat.ps.cache-references
 2.032e+11            +2.6%  2.085e+11        perf-stat.ps.cpu-cycles
 3.491e+10 ± 10%     -37.7%  2.173e+10 ± 13%  perf-stat.ps.dTLB-loads
 2.335e+10 ± 10%     -40.7%  1.385e+10 ± 13%  perf-stat.ps.dTLB-stores
   2486067 ± 14%    +527.6%   15601840 ±  8%  perf-stat.ps.iTLB-load-misses
 1.324e+11 ± 10%     -39.5%  8.002e+10 ± 13%  perf-stat.ps.instructions
     60175 ±  2%     -19.8%      48272 ±  6%  perf-stat.ps.node-stores
 1.529e+13 ± 10%     -39.5%  9.254e+12 ± 12%  perf-stat.total.instructions
     55.97 ±  6%     -13.5       42.46 ±  4%  perf-profile.calltrace.cycles-pp.do_iter_readv_writev.do_iter_read.ovl_read_iter.generic_file_splice_read.splice_direct_to_actor
     55.59 ±  6%     -13.4       42.15 ±  4%  perf-profile.calltrace.cycles-pp.shmem_file_read_iter.do_iter_readv_writev.do_iter_read.ovl_read_iter.generic_file_splice_read
     29.58 ± 11%     -12.7       16.90 ± 12%  perf-profile.calltrace.cycles-pp.iov_iter_zero.shmem_file_read_iter.do_iter_readv_writev.do_iter_read.ovl_read_iter
     57.68 ±  3%     -11.7       45.98        perf-profile.calltrace.cycles-pp.do_iter_read.ovl_read_iter.generic_file_splice_read.splice_direct_to_actor.do_splice_direct
     17.32 ± 12%      -7.4        9.89 ± 14%  perf-profile.calltrace.cycles-pp.append_pipe.iov_iter_zero.shmem_file_read_iter.do_iter_readv_writev.do_iter_read
     13.20 ± 12%      -5.5        7.66 ± 15%  perf-profile.calltrace.cycles-pp.__alloc_pages.append_pipe.iov_iter_zero.shmem_file_read_iter.do_iter_readv_writev
     10.88 ± 11%      -4.6        6.27 ±  9%  perf-profile.calltrace.cycles-pp.memset_erms.iov_iter_zero.shmem_file_read_iter.do_iter_readv_writev.do_iter_read
     15.34 ±  7%      -4.3       11.02 ± 14%  perf-profile.calltrace.cycles-pp.direct_splice_actor.splice_direct_to_actor.do_splice_direct.do_sendfile.__x64_sys_sendfile64
     15.23 ±  7%      -4.3       10.96 ± 14%  perf-profile.calltrace.cycles-pp.splice_from_pipe.direct_splice_actor.splice_direct_to_actor.do_splice_direct.do_sendfile
     15.05 ±  7%      -4.2       10.86 ± 14%  perf-profile.calltrace.cycles-pp.__splice_from_pipe.splice_from_pipe.direct_splice_actor.splice_direct_to_actor.do_splice_direct
      8.26 ± 11%      -3.2        5.06 ± 14%  perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages.append_pipe.iov_iter_zero.shmem_file_read_iter
      8.38 ±  9%      -2.4        6.01 ± 14%  perf-profile.calltrace.cycles-pp.free_unref_page.__splice_from_pipe.splice_from_pipe.direct_splice_actor.splice_direct_to_actor
      5.64 ± 10%      -1.9        3.70 ± 13%  perf-profile.calltrace.cycles-pp.rmqueue.get_page_from_freelist.__alloc_pages.append_pipe.iov_iter_zero
     99.74            -1.5       98.21        perf-profile.calltrace.cycles-pp.splice_direct_to_actor.do_splice_direct.do_sendfile.__x64_sys_sendfile64.do_syscall_64
     99.75            -1.4       98.36        perf-profile.calltrace.cycles-pp.do_splice_direct.do_sendfile.__x64_sys_sendfile64.do_syscall_64.entry_SYSCALL_64_after_hwframe
      2.30 ± 13%      -1.1        1.20 ± 16%  perf-profile.calltrace.cycles-pp.alloc_pages.append_pipe.iov_iter_zero.shmem_file_read_iter.do_iter_readv_writev
      1.74 ± 12%      -0.8        0.96 ± 11%  perf-profile.calltrace.cycles-pp.xas_load.__filemap_get_folio.shmem_get_folio_gfp.shmem_file_read_iter.do_iter_readv_writev
     99.79            -0.8       99.03        perf-profile.calltrace.cycles-pp.do_sendfile.__x64_sys_sendfile64.do_syscall_64.entry_SYSCALL_64_after_hwframe
      1.52 ± 12%      -0.7        0.79 ± 15%  perf-profile.calltrace.cycles-pp.free_unref_page_commit.free_unref_page.__splice_from_pipe.splice_from_pipe.direct_splice_actor
      4.13 ±  3%      -0.7        3.46 ± 13%  perf-profile.calltrace.cycles-pp.generic_pipe_buf_release.__splice_from_pipe.splice_from_pipe.direct_splice_actor.splice_direct_to_actor
      1.01 ± 11%      -0.6        0.40 ± 72%  perf-profile.calltrace.cycles-pp.free_pcp_prepare.free_unref_page.__splice_from_pipe.splice_from_pipe.direct_splice_actor
     99.80            -0.4       99.36        perf-profile.calltrace.cycles-pp.__x64_sys_sendfile64.do_syscall_64.entry_SYSCALL_64_after_hwframe
     99.82            -0.4       99.47        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.79 ± 10%      -0.3        0.48 ± 45%  perf-profile.calltrace.cycles-pp.__list_del_entry_valid.rmqueue.get_page_from_freelist.__alloc_pages.append_pipe
      0.90 ±  7%      -0.3        0.59 ±  4%  perf-profile.calltrace.cycles-pp.__might_resched.shmem_file_read_iter.do_iter_readv_writev.do_iter_read.ovl_read_iter
     99.83            -0.3       99.52        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
     83.75            +2.5       86.27 ±  2%  perf-profile.calltrace.cycles-pp.generic_file_splice_read.splice_direct_to_actor.do_splice_direct.do_sendfile.__x64_sys_sendfile64
     83.42            +2.5       85.97 ±  2%  perf-profile.calltrace.cycles-pp.ovl_read_iter.generic_file_splice_read.splice_direct_to_actor.do_splice_direct.do_sendfile
     11.81 ± 13%      +6.4       18.22        perf-profile.calltrace.cycles-pp.override_creds.ovl_read_iter.generic_file_splice_read.splice_direct_to_actor.do_splice_direct
     13.34 ± 14%      +7.7       21.02 ±  7%  perf-profile.calltrace.cycles-pp.revert_creds.ovl_read_iter.generic_file_splice_read.splice_direct_to_actor.do_splice_direct
     55.98 ±  6%     -13.5       42.46 ±  4%  perf-profile.children.cycles-pp.do_iter_readv_writev
     55.85 ±  6%     -13.5       42.34 ±  4%  perf-profile.children.cycles-pp.shmem_file_read_iter
     29.70 ± 11%     -12.7       16.96 ± 12%  perf-profile.children.cycles-pp.iov_iter_zero
     57.69 ±  3%     -11.7       45.98        perf-profile.children.cycles-pp.do_iter_read
     17.38 ± 12%      -7.5        9.93 ± 14%  perf-profile.children.cycles-pp.append_pipe
     13.64 ± 12%      -5.7        7.90 ± 15%  perf-profile.children.cycles-pp.__alloc_pages
     10.94 ± 10%      -4.6        6.30 ±  9%  perf-profile.children.cycles-pp.memset_erms
     15.34 ±  7%      -4.3       11.02 ± 14%  perf-profile.children.cycles-pp.direct_splice_actor
     15.24 ±  7%      -4.3       10.96 ± 14%  perf-profile.children.cycles-pp.splice_from_pipe
     15.13 ±  7%      -4.2       10.90 ± 14%  perf-profile.children.cycles-pp.__splice_from_pipe
      8.36 ± 11%      -3.2        5.11 ± 14%  perf-profile.children.cycles-pp.get_page_from_freelist
      8.62 ±  9%      -2.5        6.13 ± 14%  perf-profile.children.cycles-pp.free_unref_page
      5.76 ± 10%      -2.0        3.76 ± 13%  perf-profile.children.cycles-pp.rmqueue
     99.74            -1.5       98.22        perf-profile.children.cycles-pp.splice_direct_to_actor
     99.75            -1.4       98.36        perf-profile.children.cycles-pp.do_splice_direct
      2.39 ± 13%      -1.1        1.24 ± 16%  perf-profile.children.cycles-pp.alloc_pages
      2.63 ±  9%      -1.0        1.65 ±  8%  perf-profile.children.cycles-pp.__might_resched
      1.84 ± 12%      -0.8        1.02 ± 12%  perf-profile.children.cycles-pp.xas_load
      1.61 ± 12%      -0.8        0.84 ± 14%  perf-profile.children.cycles-pp.free_unref_page_commit
     99.79            -0.7       99.04        perf-profile.children.cycles-pp.do_sendfile
      4.18 ±  3%      -0.7        3.49 ± 13%  perf-profile.children.cycles-pp.generic_pipe_buf_release
      1.42 ± 11%      -0.6        0.80 ±  8%  perf-profile.children.cycles-pp.__cond_resched
      1.06 ± 11%      -0.5        0.59 ± 14%  perf-profile.children.cycles-pp.free_pcp_prepare
      0.94 ± 14%      -0.4        0.50 ± 15%  perf-profile.children.cycles-pp.__might_sleep
     99.80            -0.4       99.37        perf-profile.children.cycles-pp.__x64_sys_sendfile64
      0.75 ± 15%      -0.4        0.38 ± 16%  perf-profile.children.cycles-pp.policy_node
      0.79 ± 12%      -0.4        0.42 ± 15%  perf-profile.children.cycles-pp.__folio_put
      0.90 ± 10%      -0.4        0.54 ± 12%  perf-profile.children.cycles-pp._raw_spin_unlock_irqrestore
     99.95            -0.3       99.64        perf-profile.children.cycles-pp.do_syscall_64
     99.95            -0.3       99.68        perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
      0.58 ± 12%      -0.3        0.31 ± 10%  perf-profile.children.cycles-pp.rcu_all_qs
      0.62 ± 14%      -0.3        0.36 ± 12%  perf-profile.children.cycles-pp.xas_start
      0.80 ± 10%      -0.2        0.56 ± 10%  perf-profile.children.cycles-pp.__list_del_entry_valid
      0.40 ± 17%      -0.2        0.20 ± 17%  perf-profile.children.cycles-pp._find_first_bit
      0.32 ± 14%      -0.2        0.16 ± 17%  perf-profile.children.cycles-pp.__list_add_valid
      0.34 ± 12%      -0.1        0.20 ±  8%  perf-profile.children.cycles-pp.sanity
      0.32 ± 11%      -0.1        0.18 ± 14%  perf-profile.children.cycles-pp.__page_cache_release
      0.27 ± 12%      -0.1        0.14 ± 15%  perf-profile.children.cycles-pp.should_fail_alloc_page
      0.25 ± 12%      -0.1        0.13 ± 17%  perf-profile.children.cycles-pp.__mem_cgroup_uncharge
      0.20 ± 14%      -0.1        0.10 ± 13%  perf-profile.children.cycles-pp.policy_nodemask
      0.12 ± 15%      -0.0        0.07 ± 12%  perf-profile.children.cycles-pp.pipe_to_null
      0.04 ± 45%      +0.0        0.07        perf-profile.children.cycles-pp.__libc_write
      0.05 ±  7%      +0.0        0.08 ± 11%  perf-profile.children.cycles-pp.ktime_get_coarse_real_ts64
      0.04 ± 71%      +0.0        0.07        perf-profile.children.cycles-pp.generic_file_write_iter
      0.04 ± 71%      +0.0        0.07 ±  5%  perf-profile.children.cycles-pp.generic_perform_write
      0.04 ± 71%      +0.0        0.07        perf-profile.children.cycles-pp.__generic_file_write_iter
      0.00            +0.1        0.05 ±  8%  perf-profile.children.cycles-pp.syscall_return_via_sysret
      0.00            +0.1        0.06 ± 11%  perf-profile.children.cycles-pp.ovl_d_real
      0.00            +0.1        0.07 ± 14%  perf-profile.children.cycles-pp.__put_user_nocheck_8
      0.16 ±  6%      +0.1        0.23 ±  2%  perf-profile.children.cycles-pp.current_time
      0.00            +0.1        0.08 ± 14%  perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
      0.28 ±  8%      +0.1        0.37 ±  4%  perf-profile.children.cycles-pp.atime_needs_update
      0.09 ± 14%      +0.1        0.18 ± 15%  perf-profile.children.cycles-pp.__fsnotify_parent
      0.35 ±  9%      +0.1        0.45 ±  6%  perf-profile.children.cycles-pp.touch_atime
      0.00            +0.1        0.12 ± 12%  perf-profile.children.cycles-pp.__might_fault
      0.00            +0.1        0.14 ± 14%  perf-profile.children.cycles-pp._copy_from_user
      0.00            +0.1        0.14 ±  4%  perf-profile.children.cycles-pp.__entry_text_start
      0.00            +0.2        0.16 ± 19%  perf-profile.children.cycles-pp.__fdget
     83.76            +2.5       86.28 ±  2%  perf-profile.children.cycles-pp.generic_file_splice_read
     83.44            +2.5       85.99 ±  2%  perf-profile.children.cycles-pp.ovl_read_iter
     11.82 ± 13%      +6.4       18.26        perf-profile.children.cycles-pp.override_creds
     13.35 ± 14%      +7.7       21.06 ±  7%  perf-profile.children.cycles-pp.revert_creds
     10.81 ± 10%      -4.6        6.23 ±  9%  perf-profile.self.cycles-pp.memset_erms
      3.26 ± 13%      -1.5        1.74 ± 17%  perf-profile.self.cycles-pp.__alloc_pages
      3.02 ± 12%      -1.4        1.64 ± 14%  perf-profile.self.cycles-pp.free_unref_page
      2.96 ± 12%      -1.4        1.60 ± 14%  perf-profile.self.cycles-pp.rmqueue
      2.52 ± 13%      -1.2        1.30 ± 16%  perf-profile.self.cycles-pp.get_page_from_freelist
      2.58 ±  9%      -1.0        1.62 ±  8%  perf-profile.self.cycles-pp.__might_resched
      1.46 ± 13%      -0.7        0.78 ± 18%  perf-profile.self.cycles-pp.alloc_pages
      4.11 ±  3%      -0.7        3.44 ± 13%  perf-profile.self.cycles-pp.generic_pipe_buf_release
      1.40 ± 13%      -0.6        0.77 ± 15%  perf-profile.self.cycles-pp.__splice_from_pipe
      1.29 ± 12%      -0.6        0.68 ± 14%  perf-profile.self.cycles-pp.free_unref_page_commit
      1.37 ± 10%      -0.6        0.79 ±  9%  perf-profile.self.cycles-pp.append_pipe
      1.21 ± 12%      -0.5        0.67 ± 11%  perf-profile.self.cycles-pp.xas_load
      1.02 ± 12%      -0.5        0.54 ± 14%  perf-profile.self.cycles-pp.iov_iter_zero
      1.54 ±  8%      -0.5        1.08 ±  3%  perf-profile.self.cycles-pp.shmem_file_read_iter
      0.99 ± 11%      -0.4        0.55 ± 14%  perf-profile.self.cycles-pp.free_pcp_prepare
      0.80 ± 13%      -0.4        0.43 ± 15%  perf-profile.self.cycles-pp.__might_sleep
      0.82 ±  9%      -0.3        0.49 ±  5%  perf-profile.self.cycles-pp.__cond_resched
      0.78 ± 10%      -0.3        0.46 ± 12%  perf-profile.self.cycles-pp._raw_spin_unlock_irqrestore
      0.56 ± 13%      -0.2        0.32 ± 12%  perf-profile.self.cycles-pp.xas_start
      0.75 ± 10%      -0.2        0.54 ± 11%  perf-profile.self.cycles-pp.__list_del_entry_valid
      0.44 ± 13%      -0.2        0.23 ± 13%  perf-profile.self.cycles-pp.rcu_all_qs
      0.34 ± 16%      -0.2        0.17 ± 17%  perf-profile.self.cycles-pp._find_first_bit
      0.36 ± 15%      -0.2        0.18 ± 16%  perf-profile.self.cycles-pp.policy_node
      0.32 ± 12%      -0.2        0.17 ± 15%  perf-profile.self.cycles-pp.__page_cache_release
      0.34 ± 12%      -0.1        0.19 ±  8%  perf-profile.self.cycles-pp.sanity
      0.28 ± 14%      -0.1        0.14 ± 19%  perf-profile.self.cycles-pp.__list_add_valid
      0.24 ± 13%      -0.1        0.12 ± 15%  perf-profile.self.cycles-pp.__mem_cgroup_uncharge
      0.18 ± 14%      -0.1        0.10 ± 14%  perf-profile.self.cycles-pp.__folio_put
      0.16 ± 12%      -0.1        0.08 ± 12%  perf-profile.self.cycles-pp.policy_nodemask
      0.12 ±  9%      -0.1        0.06 ± 17%  perf-profile.self.cycles-pp.should_fail_alloc_page
      0.10 ± 13%      -0.0        0.06 ± 13%  perf-profile.self.cycles-pp.direct_splice_actor
      0.10 ± 14%      -0.0        0.06 ± 16%  perf-profile.self.cycles-pp.splice_from_pipe
      0.09 ± 14%      -0.0        0.06 ± 14%  perf-profile.self.cycles-pp.pipe_to_null
      0.10 ± 10%      +0.0        0.14 ±  6%  perf-profile.self.cycles-pp.current_time
      0.03 ± 70%      +0.0        0.08 ± 12%  perf-profile.self.cycles-pp.ktime_get_coarse_real_ts64
      0.00            +0.1        0.05 ±  8%  perf-profile.self.cycles-pp.syscall_return_via_sysret
      0.07 ± 15%      +0.1        0.13 ± 10%  perf-profile.self.cycles-pp.generic_file_splice_read
      0.00            +0.1        0.06 ± 11%  perf-profile.self.cycles-pp.ovl_d_real
      0.00            +0.1        0.07 ± 16%  perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
      0.00            +0.1        0.07 ± 14%  perf-profile.self.cycles-pp.__put_user_nocheck_8
      0.09 ± 14%      +0.1        0.18 ± 15%  perf-profile.self.cycles-pp.__fsnotify_parent
      0.00            +0.1        0.14 ±  4%  perf-profile.self.cycles-pp.__entry_text_start
      0.00            +0.2        0.15 ± 20%  perf-profile.self.cycles-pp.__fdget
      0.11 ± 14%      +0.2        0.30 ± 14%  perf-profile.self.cycles-pp.splice_direct_to_actor
      0.00            +0.2        0.24 ± 11%  perf-profile.self.cycles-pp.do_sendfile
     11.40 ±  4%      +1.8       13.16 ±  3%  perf-profile.self.cycles-pp.__filemap_get_folio
     11.73 ± 13%      +6.4       18.12        perf-profile.self.cycles-pp.override_creds
     13.25 ± 14%      +7.6       20.88 ±  7%  perf-profile.self.cycles-pp.revert_creds



To reproduce:

        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        sudo bin/lkp install job.yaml           # job file is attached in this email
        bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
        sudo bin/lkp run generated-yaml-file

        # if come across any failure that blocks the test,
        # please remove ~/.lkp and /lkp dir to run from a clean state.


Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
  

Patch

diff --git a/fs/splice.c b/fs/splice.c
index 0878b852b355..7a8d5fee0965 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -306,6 +306,8 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 	iov_iter_pipe(&to, READ, pipe, len);
 	init_sync_kiocb(&kiocb, in);
 	kiocb.ki_pos = *ppos;
+	if (flags & SPLICE_F_NOWAIT)
+		kiocb.ki_flags |= IOCB_NOWAIT;
 	ret = call_read_iter(in, &kiocb, &to);
 	if (ret > 0) {
 		*ppos = kiocb.ki_pos;
@@ -866,6 +868,18 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 		if (unlikely(ret <= 0))
 			goto out_release;
 
+		/*
+		 * After at least one byte was read from the input
+		 * file, don't wait for blocking I/O in the following
+		 * loop iterations; instead of blocking for arbitrary
+		 * amounts of time in the kernel, let userspace decide
+		 * how to proceed.  This avoids excessive latency if
+		 * the output is being consumed faster than the input
+		 * file can fill it (e.g. sendfile() from a slow hard
+		 * disk to a fast network).
+		 */
+		flags |= SPLICE_F_NOWAIT;
+
 		read_len = ret;
 		sd->total_len = read_len;
 
diff --git a/include/linux/splice.h b/include/linux/splice.h
index a55179fd60fc..14021bba7829 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -23,6 +23,12 @@ 
 
 #define SPLICE_F_ALL (SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT)
 
+/*
+ * Don't wait for I/O (internal flag for the splice_direct_to_actor()
+ * loop).
+ */
+#define SPLICE_F_NOWAIT	(1U << 30)
+
 /*
  * Passed to the actors
  */