Message ID | 20230919081259.1094971-1-max.kellermann@ionos.com |
---|---|
State | New |
Headers | From: Max Kellermann <max.kellermann@ionos.com>; To: Alexander Viro <viro@zeniv.linux.org.uk>, Christian Brauner <brauner@kernel.org>; Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org; Date: Tue, 19 Sep 2023 10:12:58 +0200 |
Series | fs/splice: don't block splice_direct_to_actor() after data was read |
Commit Message
Max Kellermann
Sept. 19, 2023, 8:12 a.m. UTC
If userspace calls sendfile() with a very large "count" parameter, the
kernel can block for a very long time until 2 GiB (0x7ffff000 bytes)
have been read from the hard disk and pushed into the socket buffer.
Usually, that is not a problem, because the socket write buffer gets
filled quickly, and if the socket is non-blocking, the last
direct_splice_actor() call will return -EAGAIN, causing
splice_direct_to_actor() to break from the loop, and sendfile() will
return a partial transfer.
However, if the network happens to be faster than the hard disk, and
the socket buffer keeps getting drained between two
generic_file_read_iter() calls, the sendfile() system call can keep
running for a long time, blocking for disk I/O over and over.
That is undesirable, because it can block the calling process for too
long. I discovered a problem where nginx would block for so long that
it would drop the HTTP connection because the kernel had just
transferred 2 GiB in one call, and the HTTP socket was not writable
(EPOLLOUT) for more than 60 seconds, resulting in a timeout:
sendfile(4, 12, [5518919528] => [5884939344], 1813448856) = 366019816 <3.033067>
sendfile(4, 12, [5884939344], 1447429040) = -1 EAGAIN (Resource temporarily unavailable) <0.000037>
epoll_wait(9, [{EPOLLOUT, {u32=2181955104, u64=140572166585888}}], 512, 60000) = 1 <0.003355>
gettimeofday({tv_sec=1667508799, tv_usec=201201}, NULL) = 0 <0.000024>
sendfile(4, 12, [5884939344] => [8032418896], 2147480496) = 2147479552 <10.727970>
writev(4, [], 0) = 0 <0.000439>
epoll_wait(9, [], 512, 60000) = 0 <60.060430>
gettimeofday({tv_sec=1667508869, tv_usec=991046}, NULL) = 0 <0.000078>
write(5, "10.40.5.23 - - [03/Nov/2022:21:5"..., 124) = 124 <0.001097>
close(12) = 0 <0.000063>
close(4) = 0 <0.000091>
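For context, the calling side of such a trace is an event loop that calls sendfile() on a non-blocking socket whenever epoll reports EPOLLOUT; a minimal sketch (not actual nginx code; the helper name is made up) could look like this:

#include <sys/types.h>
#include <sys/sendfile.h>
#include <errno.h>

/*
 * Hypothetical helper, called whenever epoll_wait() reports EPOLLOUT on
 * the socket.  The caller expects sendfile() to return a partial count
 * as soon as the socket buffer is full; if the kernel instead keeps
 * blocking on disk reads until the full count has been pushed, the
 * calling process is stuck in the kernel for many seconds, as in the
 * 10.7-second sendfile() call above.
 */
static ssize_t on_writable(int sock_fd, int file_fd, off_t *offset, size_t remaining)
{
	ssize_t n = sendfile(sock_fd, file_fd, offset, remaining);

	if (n < 0 && errno == EAGAIN)
		return 0;	/* socket buffer full: wait for the next EPOLLOUT */

	return n;		/* bytes transferred, 0 at EOF, or -1 on error */
}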
In newer nginx versions (since 1.21.4), this problem was worked around
by defaulting "sendfile_max_chunk" to 2 MiB:
https://github.com/nginx/nginx/commit/5636e7f7b4
Instead of asking userspace to provide an artificial upper limit, I'd
like the kernel to block for disk I/O at most once, and then pass back
control to userspace.
There is prior art for this kind of behavior in filemap_read():
	/*
	 * If we've already successfully copied some data, then we
	 * can no longer safely return -EIOCBQUEUED. Hence mark
	 * an async read NOWAIT at that point.
	 */
	if ((iocb->ki_flags & IOCB_WAITQ) && already_read)
		iocb->ki_flags |= IOCB_NOWAIT;
This modifies the caller-provided "struct kiocb", so the flag carries
over to repeated filemap_read() calls. In the splice path, however,
the effect vanishes because the "struct kiocb" is not persistent:
splice_direct_to_actor() doesn't keep one, and each
generic_file_splice_read() call initializes a new one, losing the
"IOCB_NOWAIT" flag that filemap_read() had injected.
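In kernel terms, every pass through the splice read path builds a fresh on-stack kiocb, roughly like this (a simplified sketch of the pattern visible in the copy_splice_read() hunk below, not a verbatim kernel excerpt; the function name is made up):

#include <linux/fs.h>
#include <linux/uio.h>

/*
 * Sketch: the kiocb lives on the stack of each call, so a flag set
 * during a previous read does not survive into the next one.
 */
static ssize_t splice_read_once(struct file *in, loff_t *ppos,
				struct iov_iter *to)
{
	struct kiocb kiocb;

	init_sync_kiocb(&kiocb, in);	/* ki_flags rebuilt from the file only */
	kiocb.ki_pos = *ppos;

	return call_read_iter(in, &kiocb, to);	/* fresh kiocb every time */
}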
There was no way to make generic_file_splice_read() aware that
IOCB_NOWAIT was desired because some data had already been transferred
in a previous call:
- checking whether the input file has O_NONBLOCK doesn't work because
this should be fixed even if the input file is not non-blocking
- the SPLICE_F_NONBLOCK flag is not appropriate because it affects
only whether pipe operations are non-blocking, not whether
file/socket operations are non-blocking
Since there are no other parameters, I suggest adding the
SPLICE_F_NOWAIT flag, which is similar to SPLICE_F_NONBLOCK, but
affects the "non-pipe" file descriptor passed to sendfile() or
splice(). It translates to IOCB_NOWAIT for regular files. For now, I
have kept the flag kernel-internal and assigned it a high bit, like
io_uring does with SPLICE_F_FD_IN_FIXED, but making it part of the
system call ABI may be a good idea as well.
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
---
fs/splice.c | 14 ++++++++++++++
include/linux/splice.h | 6 ++++++
2 files changed, 20 insertions(+)
Comments
[+Cc Jens]

On Tue, Sep 19, 2023 at 10:12:58AM +0200, Max Kellermann wrote:
> If userspace calls sendfile() with a very large "count" parameter, the
> kernel can block for a very long time until 2 GiB (0x7ffff000 bytes)
> have been read from the hard disk and pushed into the socket buffer.
> [...]
On 9/19/23 8:18 AM, Christian Brauner wrote:
> [+Cc Jens]
>
> On Tue, Sep 19, 2023 at 10:12:58AM +0200, Max Kellermann wrote:
>> If userspace calls sendfile() with a very large "count" parameter, the
>> kernel can block for a very long time until 2 GiB (0x7ffff000 bytes)
>> have been read from the hard disk and pushed into the socket buffer.
>> [...]

I think adding the flag for this case makes sense, and also exposing it
on the UAPI side. My only concern is full coverage of it. We can't
really have a SPLICE_F_NOWAIT flag that only applies to some cases.

That said, asking for a 2G splice, and getting a 2G splice no matter how
slow it may be, is a bit of a "doctor it hurts when I..." scenario.
On Wed, Sep 20, 2023 at 7:28 PM Jens Axboe <axboe@kernel.dk> wrote:
> I think adding the flag for this case makes sense, and also exposing it
> on the UAPI side.

OK. I suggest we get this patch merged first, and then I prepare a
patch for wiring this into uapi, changing SPLICE_F_NOWAIT to 0x10 (the
lowest free bit), adding it to SPLICE_F_ALL and documenting it.

(If you prefer to have it all in this initial patch, I can amend and
resubmit it with the uapi feature.)

> My only concern is full coverage of it. We can't
> really have a SPLICE_F_NOWAIT flag that only applies to some cases.

The feature is already part of uapi - via RWF_NOWAIT, which maps to
IOCB_NOWAIT, just like my proposed SPLICE_F_NOWAIT flag. The semantics
(and the concerns) are the same, aren't they?

> That said, asking for a 2G splice, and getting a 2G splice no matter how
> slow it may be, is a bit of a "doctor it hurts when I..." scenario.

I understand this argument, but I disagree. Compare recv(socket) with
read(regular_file). A read(regular_file) must block until the given
buffer is filled completely (or EOF is reached), which is good for some
programs that do not handle partial reads, but other programs might be
happy with a partial read and prefer lower latency. There is
preadv2(RWF_NOWAIT), but if it returns EAGAIN, userspace cannot know
when data will be available; you can't epoll() regular files. There is
no way to have a read() return at least one byte but not wait for more
(not even with preadv2(), unfortunately).

recv(socket) (or reading from a pipe) behaves differently - it blocks
only until at least one byte arrives, and callers must be able to deal
with partial reads. That's good for latency - imagine recv() behaved
like read(): how much data would you ask the kernel to receive? If
it's too little, you need many system calls; if it's too much, your
process may block indefinitely.

read(regular_file) behaves that way for historical reasons and we can't
change it, only add new APIs like preadv2(); but splice() is a modern
API that we can optimize for how we want it to behave - and that is:
copy as much as the kernel already has, but don't block after that (in
order to avoid huge latencies).

My point is: splice(2G) is a very reasonable thing to do if userspace
wants the kernel to transfer as much as possible with a single system
call, because there's no way for userspace to know what the best number
is, so let's just pass the largest valid value.

Max
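For comparison, the preadv2(RWF_NOWAIT) behavior described above looks like this from userspace (a small, self-contained sketch; the file path and buffer size are arbitrary examples):

#define _GNU_SOURCE
#include <sys/uio.h>
#include <fcntl.h>
#include <errno.h>
#include <stdio.h>

int main(void)
{
	char buf[65536];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int fd = open("/var/tmp/big-file", O_RDONLY);	/* arbitrary example path */

	if (fd < 0)
		return 1;

	/*
	 * RWF_NOWAIT fails with EAGAIN when the data is not already in
	 * the page cache; there is no variant that returns "at least one
	 * byte, then stop", which is the behavior argued for above.
	 */
	ssize_t n = preadv2(fd, &iov, 1, 0, RWF_NOWAIT);
	if (n < 0 && errno == EAGAIN)
		printf("not cached, and no way to wait for just the first byte\n");
	else
		printf("read %zd bytes without blocking on disk I/O\n", n);

	return 0;
}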
On Wed, Sep 20, 2023 at 08:16:04PM +0200, Max Kellermann wrote:
> On Wed, Sep 20, 2023 at 7:28 PM Jens Axboe <axboe@kernel.dk> wrote:
> > I think adding the flag for this case makes sense, and also exposing it
> > on the UAPI side.
>
> OK. I suggest we get this patch merged first, and then I prepare a
> patch for wiring this into uapi, changing SPLICE_F_NOWAIT to 0x10 (the
> lowest free bit), adding it to SPLICE_F_ALL and documenting it.
>
> (If you prefer to have it all in this initial patch, I can amend and
> resubmit it with the uapi feature.)

Please resend it all in one patch. That's easier for all to review.
On Mon, Sep 25, 2023 at 3:10 PM Christian Brauner <brauner@kernel.org> wrote:
> > OK. I suggest we get this patch merged first, and then I prepare a
> > patch for wiring this into uapi, changing SPLICE_F_NOWAIT to 0x10 (the
> > lowest free bit), adding it to SPLICE_F_ALL and documenting it.
> >
> > (If you prefer to have it all in this initial patch, I can amend and
> > resubmit it with the uapi feature.)
>
> Please resend it all in one patch. That's easier for all to review.

Done. Though I was surprised there's no "uapi" header for splice, so I
only changed the value to 0x10 and added it to SPLICE_F_ALL.

Since the splice manpage is in a different repository, I'll submit a
separate patch for that; can't be one patch.

Max
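Based on that description, the revised definitions in include/linux/splice.h would presumably look roughly like this (a sketch inferred from the message above, not the actual resubmitted patch):

/* sketch only -- inferred from the discussion, not the v2 patch itself */
#define SPLICE_F_NOWAIT	(0x10)	/* don't wait for I/O on the non-pipe fd */

#define SPLICE_F_ALL \
	(SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT|SPLICE_F_NOWAIT)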
> +		/*
> +		 * After at least one byte was read from the input
> +		 * file, don't wait for blocking I/O in the following
> +		 * loop iterations; instead of blocking for arbitrary
> +		 * amounts of time in the kernel, let userspace decide
> +		 * how to proceed. This avoids excessive latency if
> +		 * the output is being consumed faster than the input
> +		 * file can fill it (e.g. sendfile() from a slow hard
> +		 * disk to a fast network).
> +		 */
> +		flags |= SPLICE_F_NOWAIT;
> +

Hm, so the thing that is worrisome about this change is that it may
cause regressions afaict, as this is a pretty significant change from
the current behavior.
On Tue, Sep 26, 2023 at 12:21 PM Christian Brauner <brauner@kernel.org> wrote:
> Hm, so the thing that is worrisome about this change is that it may
> cause regressions afaict, as this is a pretty significant change from
> the current behavior.

Would you prefer a new flag for explicitly selecting "wait until at
least one byte was transferred, but don't wait further"? Because many
applications need this behavior, and some (like nginx) have already
worked around the problem by limiting the maximum transaction size,
which I consider a bad workaround, because it leads to unnecessary
system calls and still doesn't really solve the latency problem.

On the other hand, what exactly would the absence of this flag mean...
the old behavior, without my patch, can lead to partial transfers, and
the absence of the flag doesn't mean it can't happen; my patch tackles
just one corner case, but one that is important for me.

We have been running this patch in production for nearly a year (and
will continue to do so until upstream kernels have a proper solution)
and never observed a problem, and I consider it safe, but I
acknowledge the risk that this may reveal obscure application bugs if
applied globally to all Linux kernels, so I understand your worries.

Max
On Tue, Sep 26, 2023 at 12:41:42PM +0200, Max Kellermann wrote:
> On Tue, Sep 26, 2023 at 12:21 PM Christian Brauner <brauner@kernel.org> wrote:
> > Hm, so the thing that is worrisome about this change is that it may
> > cause regressions afaict, as this is a pretty significant change from
> > the current behavior.
>
> Would you prefer a new flag for explicitly selecting "wait until at
> least one byte was transferred, but don't wait further"? Because many

I had thought about it, but afaict it'd be rather annoying, as one can
get into that code from copy_file_range() as well, so we'd need a new
flag for that system call too.

> applications need this behavior, and some (like nginx) have already
> worked around the problem by limiting the maximum transaction size,
> which I consider a bad workaround, because it leads to unnecessary
> system calls and still doesn't really solve the latency problem.
>
> On the other hand, what exactly would the absence of this flag mean...
> the old behavior, without my patch, can lead to partial transfers, and
> the absence of the flag doesn't mean it can't happen; my patch tackles
> just one corner case, but one that is important for me.
>
> We have been running this patch in production for nearly a year (and
> will continue to do so until upstream kernels have a proper solution)
> and never observed a problem, and I consider it safe, but I
> acknowledge the risk that this may reveal obscure application bugs if
> applied globally to all Linux kernels, so I understand your worries.

I think hanging for an insane amount of time is indeed a problem, and
tweaking the code in this way might actually be useful, but we'd need
to let this soak for quite a while to see whether it causes any issues.

@Jens, what do you think? Is this worth it?
diff --git a/fs/splice.c b/fs/splice.c
index d983d375ff11..c192321d5e37 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -361,6 +361,8 @@ ssize_t copy_splice_read(struct file *in, loff_t *ppos,
 	iov_iter_bvec(&to, ITER_DEST, bv, npages, len);
 	init_sync_kiocb(&kiocb, in);
 	kiocb.ki_pos = *ppos;
+	if (flags & SPLICE_F_NOWAIT)
+		kiocb.ki_flags |= IOCB_NOWAIT;
 	ret = call_read_iter(in, &kiocb, &to);
 
 	if (ret > 0) {
@@ -1070,6 +1072,18 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 		if (unlikely(ret <= 0))
 			goto read_failure;
 
+		/*
+		 * After at least one byte was read from the input
+		 * file, don't wait for blocking I/O in the following
+		 * loop iterations; instead of blocking for arbitrary
+		 * amounts of time in the kernel, let userspace decide
+		 * how to proceed. This avoids excessive latency if
+		 * the output is being consumed faster than the input
+		 * file can fill it (e.g. sendfile() from a slow hard
+		 * disk to a fast network).
+		 */
+		flags |= SPLICE_F_NOWAIT;
+
 		read_len = ret;
 		sd->total_len = read_len;
 
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 6c461573434d..abdf94759138 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -23,6 +23,12 @@
 
 #define SPLICE_F_ALL (SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT)
 
+/*
+ * Don't wait for I/O (internal flag for the splice_direct_to_actor()
+ * loop).
+ */
+#define SPLICE_F_NOWAIT (1U << 30)
+
 /*
  * Passed to the actors
  */
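If the flag does become part of the system call ABI as discussed above, userspace usage could look roughly like the following hypothetical sketch. The flag value (0x10, per the thread) and the helper name are assumptions; on a kernel without this change the call would simply fail with EINVAL, and whether every splice_read implementation honors the flag is the coverage question raised in the review.

#define _GNU_SOURCE
#include <fcntl.h>
#include <errno.h>
#include <sys/types.h>

#ifndef SPLICE_F_NOWAIT
#define SPLICE_F_NOWAIT	0x10	/* assumed uapi value, see the thread above */
#endif

/*
 * Move up to "len" bytes from a regular file into a pipe without
 * sleeping for disk I/O: with the (assumed) flag, a read that would
 * block is expected to fail with EAGAIN instead, so the caller can go
 * back to its event loop.
 */
static ssize_t fill_pipe_nowait(int file_fd, loff_t *off, int pipe_wr, size_t len)
{
	ssize_t n = splice(file_fd, off, pipe_wr, NULL, len, SPLICE_F_NOWAIT);

	if (n < 0 && errno == EAGAIN)
		return 0;	/* nothing available without blocking */

	return n;		/* bytes moved, 0 at EOF, or -1 on error */
}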