Message ID | 2cover.1697486714.git.nabijaczleweli@nabijaczleweli.xyz |
---|---|
Headers |
Date: Thu, 14 Dec 2023 19:44:41 +0100 From: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Cc: Jens Axboe <axboe@kernel.dk>, Christian Brauner <brauner@kernel.org>, Alexander Viro <viro@zeniv.linux.org.uk>, linux-fsdevel@vger.kernel.org, "D. Wythe" <alibuda@linux.alibaba.com>, "David S. Miller" <davem@davemloft.net>, "Liam R. Howlett" <Liam.Howlett@oracle.com>, Alexander Mikhalitsyn <alexander@mihalicyn.com>, Andrew Morton <akpm@linux-foundation.org>, Boris Pismenny <borisp@nvidia.com>, Cong Wang <cong.wang@bytedance.com>, David Ahern <dsahern@kernel.org>, David Howells <dhowells@redhat.com>, Eric Dumazet <edumazet@google.com>, Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru>, Hyeonggon Yoo <42.hyeyoo@gmail.com>, Jakub Kicinski <kuba@kernel.org>, Jan Karcher <jaka@linux.ibm.com>, John Fastabend <john.fastabend@gmail.com>, Karsten Graul <kgraul@linux.ibm.com>, Kirill Tkhai <tkhai@ya.ru>, Kuniyuki Iwashima <kuniyu@amazon.com>, Li kunyu <kunyu@nfschina.com>, linux-kernel@vger.kernel.org, linux-s390@vger.kernel.org, netdev@vger.kernel.org, linux-trace-kernel@vger.kernel.org, Masami Hiramatsu <mhiramat@kernel.org>, Miklos Szeredi <miklos@szeredi.hu>, netdev@vger.kernel.org, Paolo Abeni <pabeni@redhat.com>, Pengcheng Yang <yangpc@wangsu.com>, Shigeru Yoshida <syoshida@redhat.com>, Steven Rostedt <rostedt@goodmis.org>, Suren Baghdasaryan <surenb@google.com>, Tony Lu <tonylu@linux.alibaba.com>, Wen Gu <guwen@linux.alibaba.com>, Wenjia Zhang <wenjia@linux.ibm.com>, Xu Panda <xu.panda@zte.com.cn>, Zhang Zhengming <zhang.zhengming@h3c.com> Subject: [PATCH RERESEND 00/11] splice(file<>pipe) I/O on file as-if O_NONBLOCK Message-ID: <2cover.1697486714.git.nabijaczleweli@nabijaczleweli.xyz> User-Agent: NeoMutt/20231103 List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org |
Series |
splice(file<>pipe) I/O on file as-if O_NONBLOCK
|
|
Message
Ahelenia Ziemiańska
Dec. 14, 2023, 6:44 p.m. UTC
First:  https://lore.kernel.org/lkml/cover.1697486714.git.nabijaczleweli@nabijaczleweli.xyz/t/#u
Resend: https://lore.kernel.org/lkml/1cover.1697486714.git.nabijaczleweli@nabijaczleweli.xyz/t/#u
Resending again per https://lore.kernel.org/lkml/20231214093859.01f6e2cd@kernel.org/t/#u

Hi!

As it stands, splice(file -> pipe):
1. locks the pipe,
2. does a read from the file,
3. unlocks the pipe.

For reading from regular files and blockdevs this makes no difference.
But if the file is a tty or a socket, for example, this means that until
data appears, which it may never do, every process trying to read from
or open the pipe enters an uninterruptible sleep,
and will only exit it if the splicing process is killed.

This trivially denies service to:
* any hypothetical pipe-based log collexion system
* all nullmailer installations
* me, personally, when I'm pasting stuff into qemu -serial chardev:pipe

This follows:
1. https://lore.kernel.org/linux-fsdevel/qk6hjuam54khlaikf2ssom6custxf5is2ekkaequf4hvode3ls@zgf7j5j4ubvw/t/#u
2. a security@ thread rooted in
   <irrrblivicfc7o3lfq7yjm2lrxq35iyya4gyozlohw24gdzyg7@azmluufpdfvu>
3. https://nabijaczleweli.xyz/content/blogn_t/011-linux-splice-exclusion.html

Patches were posted and then discarded on principle or funxionality,
all in all terminating in Linus posting
> But it is possible that we need to just bite the bullet and say
> "copy_splice_read() needs to use a non-blocking kiocb for the IO".

This does that, effectively making splice(file -> pipe)
request (and require) O_NONBLOCK on reads from the file:
this doesn't affect splicing from regular files and blockdevs,
since they're always non-blocking
(and requesting the stronger "no kernel sleep" IOCB_NOWAIT is non-sensical),
but always returns -EINVAL for ttys.
Sockets behave as expected from O_NONBLOCK reads:
splice if there's data available else -EAGAIN.

This should all pretty much behave as-expected.
Mostly a re-based version of the summary diff from
<gnj4drf7llod4voaaasoh5jdlq545gduishrbc3ql3665pw7qy@ytd5ykxc4gsr>.

Bisexion yields commit 8924feff66f35fe22ce77aafe3f21eb8e5cff881
("splice: lift pipe_lock out of splice_to_pipe()") as first bad.

The patchset is made quite wide due to the many implementations
of the splice_read callback, and was based entirely on results from
  $ git grep '\.splice_read.*=' | cut -d= -f2 | tr -s ',;[:space:]' '\n' | sort -u

I'm assuming this is exhaustive, but it's 27 distinct implementations.
Of these, I've classified these as trivial delegating wrappers:
  nfs_file_splice_read               filemap_splice_read
  afs_file_splice_read               filemap_splice_read
  ceph_splice_read                   filemap_splice_read
  ecryptfs_splice_read_update_atime  filemap_splice_read
  ext4_file_splice_read              filemap_splice_read
  f2fs_file_splice_read              filemap_splice_read
  ntfs_file_splice_read              filemap_splice_read
  ocfs2_file_splice_read             filemap_splice_read
  orangefs_file_splice_read          filemap_splice_read
  v9fs_file_splice_read              filemap_splice_read
  xfs_file_splice_read               filemap_splice_read
  zonefs_file_splice_read            filemap_splice_read
  sock_splice_read                   copy_splice_read or a socket-specific one
  coda_file_splice_read              vfs_splice_read
  ovl_splice_read                    vfs_splice_read

filemap_splice_read() is used for regular files and blockdevs,
and thus needs no changes, and is thus unchanged.

vfs_splice_read() delegates to copy_splice_read() or f_op->splice_read().

The rest are fixed, in patch order:
01. copy_splice_read() by simply doing the I/O with IOCB_NOWAIT;
    diff from Linus:
      https://lore.kernel.org/lkml/5osglsw36dla3mubtpsmdwdid4fsdacplyd6acx2igo4atogdg@yur3idyim3cc/t/#ee67de5a9ec18886c434113637d7eff6cd7acac4b
02. unix_stream_splice_read() by unconditionally passing MSG_DONTWAIT
03. fuse_dev_splice_read() by behaving as-if O_NONBLOCK
04. tracing_buffers_splice_read() by behaving as-if O_NONBLOCK
    (this also removes the retry loop)
05. relay_file_splice_read() by behaving as-if SPLICE_F_NONBLOCK
    (this just means EAGAINing unconditionally for an empty transfer)
06. smc_splice_read() by unconditionally passing MSG_DONTWAIT
07. kcm_splice_read() by unconditionally passing MSG_DONTWAIT
08. tls_sw_splice_read() by behaving as-if SPLICE_F_NONBLOCK
09. tcp_splice_read() by behaving as-if O_NONBLOCK
    (this also removes the retry loop)

10. EINVALs on files that neither have FMODE_NOWAIT nor are S_ISREG

    We don't want this to be just FMODE_NOWAIT since most regular files
    don't have it set and that's not the right semantic anyway,
    as noted at the top of this mail.

    But this allows blockdevs "by accident", effectively,
    since they have FMODE_NOWAIT (at least the ones I tried),
    even though they're just like regular files:
    handled by filemap_splice_read(),
    thus not dispatched with IOCB_NOWAIT, since always non-blocking.

    Should this be a check for FMODE_NOWAIT && (S_ISREG || S_ISBLK)?
    Should it remain FMODE_NOWAIT && S_ISREG?
    Is there an even better way of spelling this?

In net/kcm, this also fixes kcm_splice_read() passing SPLICE_F_*-style
flags to skb_recv_datagram(), which takes MSG_*-style flags. I don't think
they did anything anyway? But.

I would of course be remiss to not analyse splice(pipe -> file) as well:
  gfs2_file_splice_write   iter_file_splice_write
  ovl_splice_write         iter_file_splice_write
  splice_write_null        splice_from_pipe(pipe_to_null), does nothing

fuse_dev_splice_write()  locks, copies the iovs, unlocks, does I/O,
                         locks, frees the pipe's iovs, unlocks
port_fops_splice_write() locks, steals or copies pages, unlocks, does I/O

11. splice_to_socket(): has sock_sendmsg() inside the pipe lock;
    filling the socket buffer sleeps in splice with the pipe locked,
    and this is trivial to trigger with
      ./af_unix_ptf ./splicing-cat < fifo &
      cat > fifo &
      cp 64k fifo    # a couple times
    patch does unconditional MSG_DONTWAIT, tests sensibly

iter_file_splice_write(): has vfs_iter_write() inside the pipe lock,
but appears to be attached to regular files and blockdevs,
but also random_fops & urandom_fops (looks like not an issue)
and tty_fops & console_fops
(this only means non-pty ttys so no issue with a full buffer?
 idk if there's a situation where a tty or the discipline can block forever
 or if it's guaranteed forward progress, however slow?
 still kinda ass to have the pipe lock hard-held for, say,
 (64*1024)/(300/8)s=30min if the pipe has 64k in the buffer?
 this predixion aligns precisely with what I measured:
   1# stty 300 < /dev/ttyS0
   1# ./splicing-cat < fifo > /dev/ttyS0
   2$ cat > fifo    # and typing works
   3$ cp 64k fifo   # uninterruptibly sleeps in write(4, "SzmprOmdIIkciMwbpxhsEyFVORaPGbRQ"..., 66560
   1: now sleeping in splice
   2: typing more into the cat uninterruptibly sleeps in write
   4$ : > /tmp/fifo # uninterruptibly hangs in open
 similarly, "cp 10k fifo" uninterruptibly sleeps in close,
 with the same effects on other (potential) writers,
 and woke up after around five minutes, which matches my maths
 so presumably something should be done about this as well?
 just idk what)

So. AFAIK, just iter_file_splice_write() on ttys remains.

This needs a man-pages patch as well,
but I'd go rabid if I were to write it rn.
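The back-of-the-envelope figures above can be reproduced with a few lines of C. This is only an illustrative sketch: `drain_seconds` is a hypothetical helper, and it follows the mail's own approximation of 8 bits per byte, ignoring start, stop and parity bits.

```c
#include <stddef.h>

/* Hypothetical helper: seconds needed to push `bytes` through a serial
 * line running at `baud` bits per second, assuming 8 bits per byte
 * (the same simplification as (64*1024)/(300/8) in the text). */
double drain_seconds(size_t bytes, unsigned baud)
{
	return (double)bytes / ((double)baud / 8.0);
}
```

Under these assumptions, 64 KiB at 300 baud comes to roughly 1748 seconds (about 29 minutes) and 10 KiB to roughly 273 seconds (about four and a half minutes), in line with the half hour and "around five minutes" mentioned above.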
For the samples above, af_unix_ptf.c:
-- >8 --
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>
int main(int argc, char ** argv) {
	int fds[2];
	if(socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, fds))
		abort();

	if(!vfork()) {
		dup2(fds[1], 1);
		_exit(execvp(argv[1], argv + 1));
	}
	dup2(fds[0], 0);
	for(;;) {
		char buf[16];
		int r = read(0, buf, 16);
		fprintf(stderr, "read %d\n", r);
		sleep(10);
	}
}
-- >8 --

splicing-cat.c:
-- >8 --
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <errno.h>
int main() {
	int lasterr = -1;
	unsigned ctr = 0;
	for(;;) {
		errno = 0;
		ssize_t ret = splice(0, 0, 1, 0, 128 * 1024 * 1024, 0);
		if(ret >= 0 || errno != lasterr) {
			fprintf(stderr, "\n\t%m" + (lasterr == -1));
			lasterr = errno;
			ctr = 0;
		}
		if(ret == -1) {
			++ctr;
			fprintf(stderr, "\r%u", ctr);
		} else
			fprintf(stderr, "\r%zu", ret);
		if(!ret)
			break;
	}
	fprintf(stderr, "\n");
}
-- >8 --

Ahelenia Ziemiańska (11):
  splice: copy_splice_read: do the I/O with IOCB_NOWAIT
  af_unix: unix_stream_splice_read: always request MSG_DONTWAIT
  fuse: fuse_dev_splice_read: use nonblocking I/O
  tracing: tracing_buffers_splice_read: behave as-if non-blocking I/O
  relayfs: relay_file_splice_read: always return -EAGAIN for no data
  net/smc: smc_splice_read: always request MSG_DONTWAIT
  kcm: kcm_splice_read: always request MSG_DONTWAIT
  tls/sw: tls_sw_splice_read: always request non-blocking I/O
  net/tcp: tcp_splice_read: always do non-blocking reads
  splice: file->pipe: -EINVAL for non-regular files w/o FMODE_NOWAIT
  splice: splice_to_socket: always request MSG_DONTWAIT

 fs/fuse/dev.c        | 10 ++++++----
 fs/splice.c          |  7 ++++---
 kernel/relay.c       |  3 +--
 kernel/trace/trace.c | 32 ++++----------------------------
 net/ipv4/tcp.c       | 30 +++---------------------------
 net/kcm/kcmsock.c    |  2 +-
 net/smc/af_smc.c     |  6 +-----
 net/tls/tls_sw.c     |  5 ++---
 net/unix/af_unix.c   |  5 +----
 9 files changed, 23 insertions(+), 77 deletions(-)

base-commit: 58720809f52779dc0f08e53e54b014209d13eebb
--
2.39.2
Comments
On 12/14/23 11:44 AM, Ahelenia Ziemiańska wrote:
> First: https://lore.kernel.org/lkml/cover.1697486714.git.nabijaczleweli@nabijaczleweli.xyz/t/#u
> Resend: https://lore.kernel.org/lkml/1cover.1697486714.git.nabijaczleweli@nabijaczleweli.xyz/t/#u
> Resending again per https://lore.kernel.org/lkml/20231214093859.01f6e2cd@kernel.org/t/#u
>
> Hi!
>
> As it stands, splice(file -> pipe):
> 1. locks the pipe,
> 2. does a read from the file,
> 3. unlocks the pipe.
>
> For reading from regular files and blockdevs this makes no difference.
> But if the file is a tty or a socket, for example, this means that until
> data appears, which it may never do, every process trying to read from
> or open the pipe enters an uninterruptible sleep,
> and will only exit it if the splicing process is killed.
>
> This trivially denies service to:
> * any hypothetical pipe-based log collexion system
> * all nullmailer installations
> * me, personally, when I'm pasting stuff into qemu -serial chardev:pipe
>
> This follows:
> 1. https://lore.kernel.org/linux-fsdevel/qk6hjuam54khlaikf2ssom6custxf5is2ekkaequf4hvode3ls@zgf7j5j4ubvw/t/#u
> 2. a security@ thread rooted in
>    <irrrblivicfc7o3lfq7yjm2lrxq35iyya4gyozlohw24gdzyg7@azmluufpdfvu>
> 3. https://nabijaczleweli.xyz/content/blogn_t/011-linux-splice-exclusion.html
>
> Patches were posted and then discarded on principle or funxionality,
> all in all terminating in Linus posting
>> But it is possible that we need to just bite the bullet and say
>> "copy_splice_read() needs to use a non-blocking kiocb for the IO".
>
> This does that, effectively making splice(file -> pipe)
> request (and require) O_NONBLOCK on reads from the file:
> this doesn't affect splicing from regular files and blockdevs,
> since they're always non-blocking
> (and requesting the stronger "no kernel sleep" IOCB_NOWAIT is non-sensical),

Not sure how you got the idea that regular files or block devices are
always non-blocking, this is certainly not true without IOCB_NOWAIT.
Without IOCB_NOWAIT, you can certainly be waiting for previous IO to
complete.

> but always returns -EINVAL for ttys.
> Sockets behave as expected from O_NONBLOCK reads:
> splice if there's data available else -EAGAIN.
>
> This should all pretty much behave as-expected.

Should it? Seems like there's a very high risk of breaking existing use
cases here.

Have you at all looked into the approach of enabling splice to/from
_without_ holding the pipe lock? That, to me, would seem like a much
saner approach, with the caveat that I have not looked into that at all
so there may indeed be reasons why this is not feasible.
On Thu, Dec 14, 2023 at 12:06:57PM -0700, Jens Axboe wrote:
> On 12/14/23 11:44 AM, Ahelenia Ziemiańska wrote:
> > This does that, effectively making splice(file -> pipe)
> > request (and require) O_NONBLOCK on reads from the file:
> > this doesn't affect splicing from regular files and blockdevs,
> > since they're always non-blocking
> > (and requesting the stronger "no kernel sleep" IOCB_NOWAIT is non-sensical),
> Not sure how you got the idea that regular files or block devices are
> always non-blocking, this is certainly not true without IOCB_NOWAIT.
> Without IOCB_NOWAIT, you can certainly be waiting for previous IO to
> complete.
Maybe "always non-blocking" is an abuse of the term, but the terminology
is lost on me. By this I mean that O_NONBLOCK files/blockdevs have the
same semantics as non-O_NONBLOCK files/blockdevs ‒ they may block for a
bit while the I/O queue drains, but are guaranteed to complete within
a relatively narrow bounded time; any contending writer/opener
will be blocked for a short bit but will always wake up.

This is in contrast to pipes/sockets/ttys/&c., which wait for a peer to
send some data, and block until there is some; any contending
writer/opener will be blocked potentially ad infinitum.

Or, the way I see it, splice(socket -> pipe) can trivially be used to
lock the pipe forever, whereas I don't think splice(regfile -> pipe) can,
regardless of IOCB_NOWAIT, so the specific semantic IOCB_NOWAIT provides
is immaterial here, so not specifying IOCB_NOWAIT in filemap_splice_read()
provides semantics consistent to "file is read as-if it had O_NONBLOCK set".

> > but always returns -EINVAL for ttys.
> > Sockets behave as expected from O_NONBLOCK reads:
> > splice if there's data available else -EAGAIN.
> >
> > This should all pretty much behave as-expected.
> Should it? Seems like there's a very high risk of breaking existing use
> cases here.
If something wants to splice from a socket to a pipe and doesn't
degrade to read/write if it gets EAGAIN then it will either retry and
hotloop in the splice or error out, yeah.

I don't think this is surmountable.

> Have you at all looked into the approach of enabling splice to/from
> _without_ holding the pipe lock? That, to me, would seem like a much
> saner approach, with the caveat that I have not looked into that at all
> so there may indeed be reasons why this is not feasible.
IIUC Linus prepared a patch on security@ in
  <CAHk-=whPmrWvXBqcK6ey_mnd-0fz_HNUHEfz3NX97mqoCCcwtA@mail.gmail.com>
(you're in To:) and an evolution of this is in
  https://lore.kernel.org/lkml/CAHk-=wgmLd78uSLU9A9NspXyTM9s6C23OVDiN2YjA-d8_S0zRg@mail.gmail.com/t/#u
(you're in Cc:) that does this.

He summarises it below as
> So while fixing your NULL pointer check should be trivial, I think
> that first patch is actually fundamentally broken wrt pipe resizing,
> and I see no really sane way to fix it. We could add a new lock just
> for that, but I don't think it's worth it.
and
> But it is possible that we need to just bite the bullet and say
> "copy_splice_read() needs to use a non-blocking kiocb for the IO".
so that's what I did.

If Linus, who drew up and maintained this code for ~30 years,
didn't arrive at a satisfactory approach, I, after ~30 minutes,
won't either.

It would be very sane to both not change the semantic and fix the lock
by just not locking, but at the top of that thread Christian said
> Splice would have to be
> refactored to not rely on pipe_lock(). That's likely major work with a
> good portion of regressions if the past is any indication.
and clearly this is true, so lacking the ability and capability
to do that and any reasonable other ideas
(You could either limit splices to a proportion of the pipe size,
 but then just doing five splices gets you where we are rn
 (or, as Linus construed it, into "write() returns -EBUSY" territory,
  which is just as bad but at least the writers aren't unkillable),
 and it reduces the I/O per splice significantly.

 You could limit each pipe to one outstanding splice
 and always leave Some space for normal I/O.
 This falls into "another lock just for this" territory I think,
 and it also sucks for the 99% of normal users.)
I did this because Linus vaguely sanxioned it.

It's probably feasible, but alas it isn't feasible for me.