Pending splice(file -> FIFO) always blocks read(FIFO), regardless of O_NONBLOCK on read side?
Message ID | qk6hjuam54khlaikf2ssom6custxf5is2ekkaequf4hvode3ls@zgf7j5j4ubvw |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:994d:0:b0:3d9:f83d:47d9 with SMTP id k13csp7182766vqr; Sun, 25 Jun 2023 18:29:27 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ50whLF8hmMMows0hrS5hGhjmZpEH1FovA6HrOlehc9WnEwyLzt9TYwwv3bUEpNlgzpPBcM X-Received: by 2002:a05:6808:6397:b0:39c:767e:bfc6 with SMTP id ec23-20020a056808639700b0039c767ebfc6mr25702160oib.10.1687742967035; Sun, 25 Jun 2023 18:29:27 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1687742967; cv=none; d=google.com; s=arc-20160816; b=Q+v6XPUQ3nCYEGuxNXjHfIIMfK5jCcn1Ab8xks7c2iIRXa5wFUK2PNIPgLQ+U/kRg0 2+Iq7rtX2KBkWoQ+1OOANzfpc/IYo/fvR+3XLKI0CPANzXmpmMuBZGhAnfupQA3NCPJw 9j33+Q8IJrOEyX7GBup2l7g9GcAtsZkMUF2Pjs46LwXBjDR+oXmO3VGIEsu2eHfvG322 oX9ZoWLsMi8l485k/D0RNhoPICv0SfrBsdlWMzUt6Lzc+7/jjQErX3OSulUukeU7o8qN llRFFSc/QhrXK8g+N+RmWF3rPVmzv7khztoS+r9nVYpdFDWuUeWV7w6E7Rg9P/ERty5V iZ1w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:user-agent:content-disposition:mime-version :message-id:subject:to:from:date:dkim-signature; bh=QU1610WbVyZSZ86sWpXs/Om4sV/ealLtN9xqXvEU3kg=; fh=c1hve9LK7bidUltaxotdOOkNtWZ8Drk/KOyPdUWU/OI=; b=R3MJdV0Sq9T5ST8KhVn8u3ZfpARAGfCjVQIQ3Z1oxKIhC17SsxQpCVP5Cb0MVK9grJ IOYyN6CCJW6MxVT2qojoh358IfHqPZbgHGXAXBm0NXu76eBzf/yCi6XC012AR9rZ6G41 Z9KNJxS7KLDMbjegATKseIDCX19ShyTEt8SD8YhG5NtCTcAogLyWcR29uq3tt+Hrl5YW XqREW+crCl8tyrthIe7zEtHZXZLO2mTUlHIs4q3qsW9bSZ5t0MUG2XF6sYX+fGm0woWB HTq5j+nU9miP7iLJwKp1u+IX5Y6NBzW+JT8mJLnmmoLBViOAWmMHNSuTw5Efu/Gnou5H FxAA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@nabijaczleweli.xyz header.s=202305 header.b=iXxfdo8L; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=nabijaczleweli.xyz Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id ix19-20020a170902f81300b001b80b3106basi382459plb.564.2023.06.25.18.29.13; Sun, 25 Jun 2023 18:29:26 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@nabijaczleweli.xyz header.s=202305 header.b=iXxfdo8L; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=nabijaczleweli.xyz Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230102AbjFZBMQ (ORCPT <rfc822;filip.gregor98@gmail.com> + 99 others); Sun, 25 Jun 2023 21:12:16 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51288 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229683AbjFZBMO (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Sun, 25 Jun 2023 21:12:14 -0400 Received: from tarta.nabijaczleweli.xyz (unknown [139.28.40.42]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id F2400194; Sun, 25 Jun 2023 18:12:12 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=nabijaczleweli.xyz; s=202305; t=1687741930; bh=pqnPYGTfh1ZIMZQeiwgOUejdA3LSVPWbpNYvcMysFf8=; h=Date:From:To:Subject:From; b=iXxfdo8LgtiVhBj7vyKKmHRzb/590CG4zTFOGSqiTockvak6aMAkWhkHOhoKXoKcZ lzt13zDH66SgwQUNyi7DZLBK1yDoCGlKaBHxzmIfWmDEHRl2xM9GiSgFWeG25k/m5j 7JbQLes6Kst46omDevVJ9Id7YVj59Q9WeydH6Bi7UYSHyUH7rgFry7AweJg5NS5CIv GlmHDe9UO+C6Q+RNJqSgdr5ZgukGNISPVMwxVzYlsdfbm1tk9+K5U9v0nIvxPe6qjT gv+bpHkoFdWmKdcPXfwFvDrsHAFFEiUEmJ7mwz+E4NTdZougael8cT1FbM6D2rOo9s Kr15xH3VfZ6Tw== Received: from tarta.nabijaczleweli.xyz (unknown [192.168.1.250]) by tarta.nabijaczleweli.xyz (Postfix) with ESMTPSA id 957CA12AE; Mon, 26 Jun 2023 03:12:10 +0200 (CEST) Date: Mon, 26 Jun 2023 03:12:09 +0200 From: Ahelenia =?utf-8?q?Ziemia=C5=84ska?= <nabijaczleweli@nabijaczleweli.xyz> To: Alexander Viro <viro@zeniv.linux.org.uk>, Christian Brauner <brauner@kernel.org>, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Pending splice(file -> FIFO) always blocks read(FIFO), regardless of O_NONBLOCK on read side? Message-ID: <qk6hjuam54khlaikf2ssom6custxf5is2ekkaequf4hvode3ls@zgf7j5j4ubvw> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="5sqj5xnglnda2l7b" Content-Disposition: inline User-Agent: NeoMutt/20230517 X-Spam-Status: No, score=-1.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,PDS_RDNS_DYNAMIC_FP, RDNS_DYNAMIC,SPF_HELO_PASS,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1769726769609381065?= X-GMAIL-MSGID: =?utf-8?q?1769726769609381065?= |
Series |
Pending splice(file -> FIFO) always blocks read(FIFO), regardless of O_NONBLOCK on read side?
|
|
Commit Message
Ahelenia Ziemiańska
June 26, 2023, 1:12 a.m. UTC
Hi! (starting with get_maintainers.pl fs/splice.c, idk if that's right though) Per fs/splice.c: * The traditional unix read/write is extended with a "splice()" operation * that transfers data buffers to or from a pipe buffer. so I expect splice() to work just about the same as read()/write() (and, to a large extent, it does so). Thus, a refresher on pipe read() semantics (quoting Issue 8 Draft 3; Linux when writing with write()): 60746 When attempting to read from an empty pipe or FIFO: 60747 • If no process has the pipe open for writing, read( ) shall return 0 to indicate end-of-file. 60748 • If some process has the pipe open for writing and O_NONBLOCK is set, read( ) shall return 60749 −1 and set errno to [EAGAIN]. 60750 • If some process has the pipe open for writing and O_NONBLOCK is clear, read( ) shall 60751 block the calling thread until some data is written or the pipe is closed by all processes that 60752 had the pipe open for writing. However, I've observed that this is not the case when splicing from something that sleeps on read to a pipe, and that in that case all readers block, /including/ ones that are reading from fds with O_NONBLOCK set! As an example, consider these two programs: -- >8 -- // wr.c #define _GNU_SOURCE #include <fcntl.h> #include <stdio.h> int main() { while (splice(0, 0, 1, 0, 128 * 1024 * 1024, 0) > 0) ; fprintf(stderr, "wr: %m\n"); } -- >8 -- -- >8 -- // rd.c #define _GNU_SOURCE #include <errno.h> #include <fcntl.h> #include <stdio.h> #include <unistd.h> int main() { fcntl(0, F_SETFL, fcntl(0, F_GETFL) | O_NONBLOCK); char buf[64 * 1024] = {}; for (ssize_t rd;;) { #if 1 while ((rd = read(0, buf, sizeof(buf))) == -1 && errno == EINTR) ; #else while ((rd = splice(0, 0, 1, 0, 128 * 1024 * 1024, 0)) == -1 && errno == EINTR) ; #endif fprintf(stderr, "rd=%zd: %m\n", rd); write(1, buf, rd); errno = 0; sleep(1); } } -- >8 -- Thus: -- >8 -- a$ make rd wr a$ mkfifo fifo a$ ./rd < fifo b$ echo qwe > fifo rd=4: Success qwe rd=0: Success rd=0: Success b$ sleep 2 > fifo rd=-1: Resource temporarily unavailable rd=-1: Resource temporarily unavailable rd=0: Success rd=0: Success rd=-1: Resource temporarily unavailable b$ /bin/cat > fifo rd=-1: Resource temporarily unavailable rd=4: Success abc abc rd=-1: Resource temporarily unavailable rd=4: Success def def rd=0: Success ^D rd=0: Success rd=0: Success b$ ./wr > fifo -- >8 -- and nothing. Until you actually type a line (or a few) into teletype b so that the splice completes, at which point so does the read. An even simpler case is -- >8 -- $ ./wr | ./rd abc def rd=8: Success abc def ghi jkl rd=8: Success ghi jkl ^D wr: Success rd=-1: Resource temporarily unavailable rd=0: Success rd=0: Success -- >8 -- splice flags don't do anything. Tested on bookworm (6.1.27-1) and Linus' HEAD (v6.4-rc7-234-g547cc9be86f4). You could say this is a "denial of service", since this is a valid way of following pipes (and, sans SIGIO, the only portable one), and this makes it so the reader is completely blocked, and impervious to all deadly signals (incl. SIGKILL). I've also observed strace hanging (but it responded to SIGKILL). Rudimentary analysis shows that sys_splice() -> __do_splice() -> do_splice() -> {!(ipipe && opipe) -> !(ipipe) -> (ipipe)}: splice_file_to_pipe which then does -- >8 -- pipe_lock(opipe); ret = wait_for_space(opipe, flags); if (!ret) ret = do_splice_to(in, offset, opipe, len, flags); pipe_unlock(opipe); -- >8 -- conversely: -- >8 -- static ssize_t pipe_read(struct kiocb *iocb, struct iov_iter *to) { size_t total_len = iov_iter_count(to); struct file *filp = iocb->ki_filp; struct pipe_inode_info *pipe = filp->private_data; bool was_full, wake_next_reader = false; ssize_t ret; /* Null read succeeds. */ if (unlikely(total_len == 0)) return 0; ret = 0; __pipe_lock(pipe); -- >8 -- so, naturally(?), all non-empty reads wait for the splice to end. To validate this, I've applied the following (which I'm 100% sure is wrong and breaks unrelated stuff): -- >8 -- (naturally this now means it always EAGAINs even if there is data to be read since the splice() loop keeps it locked, but you get the picture).
Comments
On Mon, Jun 26, 2023 at 03:12:09AM +0200, Ahelenia Ziemiańska wrote: > Hi! (starting with get_maintainers.pl fs/splice.c, > idk if that's right though) > > Per fs/splice.c: > * The traditional unix read/write is extended with a "splice()" operation > * that transfers data buffers to or from a pipe buffer. > so I expect splice() to work just about the same as read()/write() > (and, to a large extent, it does so). > > Thus, a refresher on pipe read() semantics > (quoting Issue 8 Draft 3; Linux when writing with write()): > 60746 When attempting to read from an empty pipe or FIFO: > 60747 • If no process has the pipe open for writing, read( ) shall return 0 to indicate end-of-file. > 60748 • If some process has the pipe open for writing and O_NONBLOCK is set, read( ) shall return > 60749 −1 and set errno to [EAGAIN]. > 60750 • If some process has the pipe open for writing and O_NONBLOCK is clear, read( ) shall > 60751 block the calling thread until some data is written or the pipe is closed by all processes that > 60752 had the pipe open for writing. > > However, I've observed that this is not the case when splicing from > something that sleeps on read to a pipe, and that in that case all > readers block, /including/ ones that are reading from fds with > O_NONBLOCK set! > > As an example, consider these two programs: > -- >8 -- > // wr.c > #define _GNU_SOURCE > #include <fcntl.h> > #include <stdio.h> > int main() { > while (splice(0, 0, 1, 0, 128 * 1024 * 1024, 0) > 0) > ; > fprintf(stderr, "wr: %m\n"); > } > -- >8 -- > > -- >8 -- > // rd.c > #define _GNU_SOURCE > #include <errno.h> > #include <fcntl.h> > #include <stdio.h> > #include <unistd.h> > int main() { > fcntl(0, F_SETFL, fcntl(0, F_GETFL) | O_NONBLOCK); > > char buf[64 * 1024] = {}; > for (ssize_t rd;;) { > #if 1 > while ((rd = read(0, buf, sizeof(buf))) == -1 && errno == EINTR) > ; > #else > while ((rd = splice(0, 0, 1, 0, 128 * 1024 * 1024, 0)) == -1 && > errno == EINTR) > ; > #endif > fprintf(stderr, "rd=%zd: %m\n", rd); > write(1, buf, rd); > > errno = 0; > sleep(1); > } > } > -- >8 -- > > Thus: > -- >8 -- > a$ make rd wr > a$ mkfifo fifo > a$ ./rd < fifo b$ echo qwe > fifo > rd=4: Success > qwe > rd=0: Success > rd=0: Success b$ sleep 2 > fifo > rd=-1: Resource temporarily unavailable > rd=-1: Resource temporarily unavailable > rd=0: Success > rd=0: Success > rd=-1: Resource temporarily unavailable b$ /bin/cat > fifo > rd=-1: Resource temporarily unavailable > rd=4: Success abc > abc > rd=-1: Resource temporarily unavailable > rd=4: Success def > def > rd=0: Success ^D > rd=0: Success > rd=0: Success b$ ./wr > fifo > -- >8 -- > and nothing. Until you actually type a line (or a few) into teletype b > so that the splice completes, at which point so does the read. > > An even simpler case is > -- >8 -- > $ ./wr | ./rd > abc > def > rd=8: Success > abc > def > ghi > jkl > rd=8: Success > ghi > jkl > ^D > wr: Success > rd=-1: Resource temporarily unavailable > rd=0: Success > rd=0: Success > -- >8 -- > > splice flags don't do anything. > Tested on bookworm (6.1.27-1) and Linus' HEAD (v6.4-rc7-234-g547cc9be86f4). > > You could say this is a "denial of service", since this is a valid > way of following pipes (and, sans SIGIO, the only portable one), splice() may block for any of the two file descriptors if they don't have O_NONBLOCK set even if SPLICE_F_NONBLOCK is raised. SPLICE_F_NONBLOCK in splice_file_to_pipe() is only relevant if the pipe is full. If the pipe isn't full then the write is attempted. That of course involves reading the data to splice from the source file. If the source file isn't O_NONBLOCK that read may block holding pipe_lock(). If you raise O_NONBLOCK on the source fd in wr.c then your problems go away. This is pretty long-standing behavior. Splice would have to be refactored to not rely on pipe_lock(). That's likely major work with a good portion of regressions if the past is any indication. If you need that ability to fully async read from a pipe with splice rn then io_uring will at least allow you to punt that read into an async worker thread afaict.
On Mon, Jun 26, 2023 at 11:32:16AM +0200, Christian Brauner wrote: > On Mon, Jun 26, 2023 at 03:12:09AM +0200, Ahelenia Ziemiańska wrote: > > Hi! (starting with get_maintainers.pl fs/splice.c, > > idk if that's right though) > > > > Per fs/splice.c: > > * The traditional unix read/write is extended with a "splice()" operation > > * that transfers data buffers to or from a pipe buffer. > > so I expect splice() to work just about the same as read()/write() > > (and, to a large extent, it does so). > > > > Thus, a refresher on pipe read() semantics > > (quoting Issue 8 Draft 3; Linux when writing with write()): > > 60746 When attempting to read from an empty pipe or FIFO: > > 60747 • If no process has the pipe open for writing, read( ) shall return 0 to indicate end-of-file. > > 60748 • If some process has the pipe open for writing and O_NONBLOCK is set, read( ) shall return > > 60749 −1 and set errno to [EAGAIN]. > > 60750 • If some process has the pipe open for writing and O_NONBLOCK is clear, read( ) shall > > 60751 block the calling thread until some data is written or the pipe is closed by all processes that > > 60752 had the pipe open for writing. > > > > However, I've observed that this is not the case when splicing from > > something that sleeps on read to a pipe, and that in that case all > > readers block, /including/ ones that are reading from fds with > > O_NONBLOCK set! > > > > As an example, consider these two programs: > > -- >8 -- > > // wr.c > > #define _GNU_SOURCE > > #include <fcntl.h> > > #include <stdio.h> > > int main() { > > while (splice(0, 0, 1, 0, 128 * 1024 * 1024, 0) > 0) > > ; > > fprintf(stderr, "wr: %m\n"); > > } > > -- >8 -- > > > > -- >8 -- > > // rd.c > > #define _GNU_SOURCE > > #include <errno.h> > > #include <fcntl.h> > > #include <stdio.h> > > #include <unistd.h> > > int main() { > > fcntl(0, F_SETFL, fcntl(0, F_GETFL) | O_NONBLOCK); > > > > char buf[64 * 1024] = {}; > > for (ssize_t rd;;) { > > #if 1 > > while ((rd = read(0, buf, sizeof(buf))) == -1 && errno == EINTR) > > ; > > #else > > while ((rd = splice(0, 0, 1, 0, 128 * 1024 * 1024, 0)) == -1 && > > errno == EINTR) > > ; > > #endif > > fprintf(stderr, "rd=%zd: %m\n", rd); > > write(1, buf, rd); > > > > errno = 0; > > sleep(1); > > } > > } > > -- >8 -- > > > > Thus: > > -- >8 -- > > a$ make rd wr > > a$ mkfifo fifo > > a$ ./rd < fifo b$ echo qwe > fifo > > rd=4: Success > > qwe > > rd=0: Success > > rd=0: Success b$ sleep 2 > fifo > > rd=-1: Resource temporarily unavailable > > rd=-1: Resource temporarily unavailable > > rd=0: Success > > rd=0: Success > > rd=-1: Resource temporarily unavailable b$ /bin/cat > fifo > > rd=-1: Resource temporarily unavailable > > rd=4: Success abc > > abc > > rd=-1: Resource temporarily unavailable > > rd=4: Success def > > def > > rd=0: Success ^D > > rd=0: Success > > rd=0: Success b$ ./wr > fifo > > -- >8 -- > > and nothing. Until you actually type a line (or a few) into teletype b > > so that the splice completes, at which point so does the read. > > > > An even simpler case is > > -- >8 -- > > $ ./wr | ./rd > > abc > > def > > rd=8: Success > > abc > > def > > ghi > > jkl > > rd=8: Success > > ghi > > jkl > > ^D > > wr: Success > > rd=-1: Resource temporarily unavailable > > rd=0: Success > > rd=0: Success > > -- >8 -- > > > > splice flags don't do anything. > > Tested on bookworm (6.1.27-1) and Linus' HEAD (v6.4-rc7-234-g547cc9be86f4). > > > > You could say this is a "denial of service", since this is a valid > > way of following pipes (and, sans SIGIO, the only portable one), > splice() may block for any of the two file descriptors if they don't > have O_NONBLOCK set even if SPLICE_F_NONBLOCK is raised. > > SPLICE_F_NONBLOCK in splice_file_to_pipe() is only relevant if the pipe > is full. If the pipe isn't full then the write is attempted. That of > course involves reading the data to splice from the source file. If the > source file isn't O_NONBLOCK that read may block holding pipe_lock(). > > If you raise O_NONBLOCK on the source fd in wr.c then your problems go > away. This is pretty long-standing behavior. I don't see how this is relevant here. Whether the writer splice blocks ‒ or how it behaves at all ‒ doesn't matter. The /reader/ demands non-blocking reads. Just by running a splice() we've managed to permanently hang the reader in a way that's fully impervious to everything. Actually, hold that: in testing this on an actual program that relies on this (nullmailer), I've found that trying to /open the FIFO/ also hangs forever, in that same signal-impervious state. To wit: $ ps 3766 PID TTY STAT TIME COMMAND 3766 ? Ss 0:01 /usr/sbin/nullmailer-send $ ls -l /proc/3766/fd total 0 lr-x------ 1 mail mail 64 Jun 14 15:03 0 -> /dev/null lrwx------ 1 mail mail 64 Jun 14 15:03 1 -> 'socket:[81721760]' lrwx------ 1 mail mail 64 Jun 14 15:03 2 -> 'socket:[81721760]' lr-x------ 1 mail mail 64 Apr 28 15:38 3 -> 'pipe:[81721763]' l-wx------ 1 mail mail 64 Jun 14 15:03 4 -> 'pipe:[81721763]' lr-x------ 1 mail mail 64 Jun 14 15:03 5 -> /var/spool/nullmailer/trigger lrwx------ 1 mail mail 64 Jun 14 15:03 9 -> /dev/null # cat /proc/3766/fdinfo/5 pos: 0 flags: 0104000 mnt_id: 64 ino: 393969 # < /proc/3766/fdinfo/5 fdinfo O_RDONLY O_NONBLOCK O_LARGEFILE # strace -yp 3766 & strace: Process 3766 attached $ strace out/cmd/cat > /var/spool/nullmailer/trigger [cat] (normal libc setup) [cat] splice(0, NULL, 1, NULL, 134217728, SPLICE_F_MOVE|SPLICE_F_MOREa [cat] ) = 2 [cat] splice(0, NULL, 1, NULL, 134217728, SPLICE_F_MOVE|SPLICE_F_MORE [nullmailer] pselect6(6, [5</var/spool/nullmailer/trigger>], NULL, NULL, {tv_sec=86397, tv_nsec=624894145}, NULL) = 1 (in [5], left {tv_sec=86394, tv_nsec=841299215}) [nullmailer] write(1<socket:[81721760]>, "Trigger pulled.\n", 16) = 16 [nullmailer] read(5</var/spool/nullmailer/trigger>, and $ strace -y sh -c 'echo zupa > /var/spool/nullmailer/trigger' (...whatever shell setup) rt_sigaction(SIGTERM, {sa_handler=SIG_DFL, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER, sa_restorer=0xf7d21bb0}, NULL, 8) = 0 openat(AT_FDCWD, "/var/spool/nullmailer/trigger", O_WRONLY|O_CREAT|O_TRUNC, 0666 This is a "you've lost" situation to me. This system will /never/ send mail now, and any mailer program will also hang forever (again, to wit: # echo zupa | strace -yfo /tmp/ss mail root does hang forever and /tmp/ss ends in 16915 close(6</var/spool/nullmailer/queue>) = 0 16915 unlink("/var/spool/nullmailer/tmp/16915") = 0 16915 openat(AT_FDCWD, "/var/spool/nullmailer/trigger", O_WRONLY|O_NONBLOCK ) which means that, on this system, I will never get events from smartd or ZED, so fuck me if I wanted to get "scrub errored" or "disk will die soon" notifications (in pre-2.0.0 ZED this would also have broken autoreplace=on since it waited synchronously), or from other monitoring, so again fuck me if I wanted to get overheating/packet drops/whatever notifications, or again fuck me if I wanted to get cron mail. In many ways I've brought the system down (or will have done in like a day once some mails go out) by sending a mail weird. Naturally systemd stopping nullmailer failed after a few minutes with × nullmailer.service - Nullmailer relay-only MTA Loaded: loaded (/lib/systemd/system/nullmailer.service; enabled; preset: enabled) Active: failed (Result: timeout) since Mon 2023-06-26 13:10:02 CEST; 6min ago Duration: 1month 4w 10h 55min 29.666s Docs: man:nullmailer(7) Main PID: 3766 Tasks: 1 (limit: 4673) Memory: 3.1M CPU: 1min 26.893s CGroup: /system.slice/nullmailer.service └─3766 /usr/sbin/nullmailer-send Jun 26 13:05:32 szarotka systemd[1]: nullmailer.service: State 'stop-sigterm' timed out. Killing. Jun 26 13:05:32 szarotka systemd[1]: nullmailer.service: Killing process 3766 (nullmailer-send) with signal SIGKILL. Jun 26 13:07:02 szarotka systemd[1]: nullmailer.service: Processes still around after SIGKILL. Ignoring. Jun 26 13:08:32 szarotka systemd[1]: nullmailer.service: State 'final-sigterm' timed out. Killing. Jun 26 13:08:32 szarotka systemd[1]: nullmailer.service: Killing process 3766 (nullmailer-send) with signal SIGKILL. Jun 26 13:10:02 szarotka systemd[1]: nullmailer.service: Processes still around after final SIGKILL. Entering failed mode. Jun 26 13:10:02 szarotka systemd[1]: nullmailer.service: Failed with result 'timeout'. Jun 26 13:10:02 szarotka systemd[1]: nullmailer.service: Unit process 3766 (nullmailer-send) remains running after unit s> Jun 26 13:10:02 szarotka systemd[1]: Stopped nullmailer.service - Nullmailer relay-only MTA. Jun 26 13:10:02 szarotka systemd[1]: nullmailer.service: Consumed 1min 26.893s CPU time. But not to fret! Maybe we can still kill it with the cgroup! No: # strace -y sh -c 'echo 1 > /sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill' ... dup2(3</sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill>, 1) = 1</sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill> close(3</sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill>) = 0 write(1</sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill>, "1\n", 2) = 2 ... This completes, sure, but doesn't do anything at all (admittedly, I'm not a cgroup expert, but it did work on other, non-poisoned, cgroups, so I'd expect it to work). Opening the FIFO with O_NONBLOCK also hangs, obviously. Killing the splicer restores order, as expected. > Splice would have to be > refactored to not rely on pipe_lock(). That's likely major work with a > good portion of regressions if the past is any indication. That's likely; however, it ‒ or an equivalent solution ‒ would probably be a good idea to do, on balance of all my points above, I think. > If you need that ability to fully async read from a pipe with splice > rn then io_uring will at least allow you to punt that read into an async > worker thread afaict. I need my system to not be hanged by any user with a magic syscall. I think you've confused the splice bit as being contentious ‒ I don't care, and couldn't care less about how this is triggered ‒ the issue is that a splice fully excludes /ALL OTHER/ operations on a pipe, and zombifies all processes that try. Consider the case where messages arrive at a collection pipe (this is well-specified and well-used with O_DIRECT and <=PIPE_MAX, I don't have a demo off-hand), or, hell, even a case where logs do: giving any user with append capability effectively exclusive control for as long as they want is, well, suboptimal; you could analyse this as a stronger variant of https://lore.kernel.org/linux-fsdevel/fa6de786ee1241c6ba54c3cce0b980aa@AcuMS.aculab.com/t/#e74be7131861099a7f3d82d51dfc96645d26e9a94 and indeed, my original use-case was this broke tail -f, but you know how it be.
On Mon, Jun 26, 2023 at 01:59:07PM +0200, Ahelenia Ziemiańska wrote: > On Mon, Jun 26, 2023 at 11:32:16AM +0200, Christian Brauner wrote: > > On Mon, Jun 26, 2023 at 03:12:09AM +0200, Ahelenia Ziemiańska wrote: > > > Hi! (starting with get_maintainers.pl fs/splice.c, > > > idk if that's right though) > > > > > > Per fs/splice.c: > > > * The traditional unix read/write is extended with a "splice()" operation > > > * that transfers data buffers to or from a pipe buffer. > > > so I expect splice() to work just about the same as read()/write() > > > (and, to a large extent, it does so). > > > > > > Thus, a refresher on pipe read() semantics > > > (quoting Issue 8 Draft 3; Linux when writing with write()): > > > 60746 When attempting to read from an empty pipe or FIFO: > > > 60747 • If no process has the pipe open for writing, read( ) shall return 0 to indicate end-of-file. > > > 60748 • If some process has the pipe open for writing and O_NONBLOCK is set, read( ) shall return > > > 60749 −1 and set errno to [EAGAIN]. > > > 60750 • If some process has the pipe open for writing and O_NONBLOCK is clear, read( ) shall > > > 60751 block the calling thread until some data is written or the pipe is closed by all processes that > > > 60752 had the pipe open for writing. > > > > > > However, I've observed that this is not the case when splicing from > > > something that sleeps on read to a pipe, and that in that case all > > > readers block, /including/ ones that are reading from fds with > > > O_NONBLOCK set! > > > > > > As an example, consider these two programs: > > > -- >8 -- > > > // wr.c > > > #define _GNU_SOURCE > > > #include <fcntl.h> > > > #include <stdio.h> > > > int main() { > > > while (splice(0, 0, 1, 0, 128 * 1024 * 1024, 0) > 0) > > > ; > > > fprintf(stderr, "wr: %m\n"); > > > } > > > -- >8 -- > > > > > > -- >8 -- > > > // rd.c > > > #define _GNU_SOURCE > > > #include <errno.h> > > > #include <fcntl.h> > > > #include <stdio.h> > > > #include <unistd.h> > > > int main() { > > > fcntl(0, F_SETFL, fcntl(0, F_GETFL) | O_NONBLOCK); > > > > > > char buf[64 * 1024] = {}; > > > for (ssize_t rd;;) { > > > #if 1 > > > while ((rd = read(0, buf, sizeof(buf))) == -1 && errno == EINTR) > > > ; > > > #else > > > while ((rd = splice(0, 0, 1, 0, 128 * 1024 * 1024, 0)) == -1 && > > > errno == EINTR) > > > ; > > > #endif > > > fprintf(stderr, "rd=%zd: %m\n", rd); > > > write(1, buf, rd); > > > > > > errno = 0; > > > sleep(1); > > > } > > > } > > > -- >8 -- > > > > > > Thus: > > > -- >8 -- > > > a$ make rd wr > > > a$ mkfifo fifo > > > a$ ./rd < fifo b$ echo qwe > fifo > > > rd=4: Success > > > qwe > > > rd=0: Success > > > rd=0: Success b$ sleep 2 > fifo > > > rd=-1: Resource temporarily unavailable > > > rd=-1: Resource temporarily unavailable > > > rd=0: Success > > > rd=0: Success > > > rd=-1: Resource temporarily unavailable b$ /bin/cat > fifo > > > rd=-1: Resource temporarily unavailable > > > rd=4: Success abc > > > abc > > > rd=-1: Resource temporarily unavailable > > > rd=4: Success def > > > def > > > rd=0: Success ^D > > > rd=0: Success > > > rd=0: Success b$ ./wr > fifo > > > -- >8 -- > > > and nothing. Until you actually type a line (or a few) into teletype b > > > so that the splice completes, at which point so does the read. > > > > > > An even simpler case is > > > -- >8 -- > > > $ ./wr | ./rd > > > abc > > > def > > > rd=8: Success > > > abc > > > def > > > ghi > > > jkl > > > rd=8: Success > > > ghi > > > jkl > > > ^D > > > wr: Success > > > rd=-1: Resource temporarily unavailable > > > rd=0: Success > > > rd=0: Success > > > -- >8 -- > > > > > > splice flags don't do anything. > > > Tested on bookworm (6.1.27-1) and Linus' HEAD (v6.4-rc7-234-g547cc9be86f4). > > > > > > You could say this is a "denial of service", since this is a valid > > > way of following pipes (and, sans SIGIO, the only portable one), > > splice() may block for any of the two file descriptors if they don't > > have O_NONBLOCK set even if SPLICE_F_NONBLOCK is raised. > > > > SPLICE_F_NONBLOCK in splice_file_to_pipe() is only relevant if the pipe > > is full. If the pipe isn't full then the write is attempted. That of > > course involves reading the data to splice from the source file. If the > > source file isn't O_NONBLOCK that read may block holding pipe_lock(). > > > > If you raise O_NONBLOCK on the source fd in wr.c then your problems go > > away. This is pretty long-standing behavior. > I don't see how this is relevant here. Whether the writer splice blocks > ‒ or how it behaves at all ‒ doesn't matter. > > The /reader/ demands non-blocking reads. Just by running a splice() > we've managed to permanently hang the reader in a way that's fully > impervious to everything. > > Actually, hold that: in testing this on an actual program that relies on > this (nullmailer), I've found that trying to /open the FIFO/ also hangs > forever, in that same signal-impervious state. > > To wit: > $ ps 3766 > PID TTY STAT TIME COMMAND > 3766 ? Ss 0:01 /usr/sbin/nullmailer-send > $ ls -l /proc/3766/fd > total 0 > lr-x------ 1 mail mail 64 Jun 14 15:03 0 -> /dev/null > lrwx------ 1 mail mail 64 Jun 14 15:03 1 -> 'socket:[81721760]' > lrwx------ 1 mail mail 64 Jun 14 15:03 2 -> 'socket:[81721760]' > lr-x------ 1 mail mail 64 Apr 28 15:38 3 -> 'pipe:[81721763]' > l-wx------ 1 mail mail 64 Jun 14 15:03 4 -> 'pipe:[81721763]' > lr-x------ 1 mail mail 64 Jun 14 15:03 5 -> /var/spool/nullmailer/trigger > lrwx------ 1 mail mail 64 Jun 14 15:03 9 -> /dev/null > # cat /proc/3766/fdinfo/5 > pos: 0 > flags: 0104000 > mnt_id: 64 > ino: 393969 > # < /proc/3766/fdinfo/5 fdinfo > O_RDONLY O_NONBLOCK O_LARGEFILE > # strace -yp 3766 & > strace: Process 3766 attached > $ strace out/cmd/cat > /var/spool/nullmailer/trigger > [cat] (normal libc setup) > [cat] splice(0, NULL, 1, NULL, 134217728, SPLICE_F_MOVE|SPLICE_F_MOREa > [cat] ) = 2 > [cat] splice(0, NULL, 1, NULL, 134217728, SPLICE_F_MOVE|SPLICE_F_MORE > [nullmailer] pselect6(6, [5</var/spool/nullmailer/trigger>], NULL, NULL, {tv_sec=86397, tv_nsec=624894145}, NULL) = 1 (in [5], left {tv_sec=86394, tv_nsec=841299215}) > [nullmailer] write(1<socket:[81721760]>, "Trigger pulled.\n", 16) = 16 > [nullmailer] read(5</var/spool/nullmailer/trigger>, > and > $ strace -y sh -c 'echo zupa > /var/spool/nullmailer/trigger' > (...whatever shell setup) > rt_sigaction(SIGTERM, {sa_handler=SIG_DFL, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER, sa_restorer=0xf7d21bb0}, NULL, 8) = 0 > openat(AT_FDCWD, "/var/spool/nullmailer/trigger", O_WRONLY|O_CREAT|O_TRUNC, 0666 > > This is a "you've lost" situation to me. This system will /never/ > send mail now, and any mailer program will also hang forever > (again, to wit: > # echo zupa | strace -yfo /tmp/ss mail root > does hang forever and /tmp/ss ends in > 16915 close(6</var/spool/nullmailer/queue>) = 0 > 16915 unlink("/var/spool/nullmailer/tmp/16915") = 0 > 16915 openat(AT_FDCWD, "/var/spool/nullmailer/trigger", O_WRONLY|O_NONBLOCK > ) > which means that, on this system, I will never get events from smartd > or ZED, so fuck me if I wanted to get "scrub errored" or "disk > will die soon" notifications (in pre-2.0.0 ZED this would also have > broken autoreplace=on since it waited synchronously), > or from other monitoring, so again fuck me if I wanted to get > overheating/packet drops/whatever notifications, > or again fuck me if I wanted to get cron mail. > In many ways I've brought the system down (or will have done in like a > day once some mails go out) by sending a mail weird. > > > Naturally systemd stopping nullmailer failed after a few minutes with > × nullmailer.service - Nullmailer relay-only MTA > Loaded: loaded (/lib/systemd/system/nullmailer.service; enabled; preset: enabled) > Active: failed (Result: timeout) since Mon 2023-06-26 13:10:02 CEST; 6min ago > Duration: 1month 4w 10h 55min 29.666s > Docs: man:nullmailer(7) > Main PID: 3766 > Tasks: 1 (limit: 4673) > Memory: 3.1M > CPU: 1min 26.893s > CGroup: /system.slice/nullmailer.service > └─3766 /usr/sbin/nullmailer-send > > Jun 26 13:05:32 szarotka systemd[1]: nullmailer.service: State 'stop-sigterm' timed out. Killing. > Jun 26 13:05:32 szarotka systemd[1]: nullmailer.service: Killing process 3766 (nullmailer-send) with signal SIGKILL. > Jun 26 13:07:02 szarotka systemd[1]: nullmailer.service: Processes still around after SIGKILL. Ignoring. > Jun 26 13:08:32 szarotka systemd[1]: nullmailer.service: State 'final-sigterm' timed out. Killing. > Jun 26 13:08:32 szarotka systemd[1]: nullmailer.service: Killing process 3766 (nullmailer-send) with signal SIGKILL. > Jun 26 13:10:02 szarotka systemd[1]: nullmailer.service: Processes still around after final SIGKILL. Entering failed mode. > Jun 26 13:10:02 szarotka systemd[1]: nullmailer.service: Failed with result 'timeout'. > Jun 26 13:10:02 szarotka systemd[1]: nullmailer.service: Unit process 3766 (nullmailer-send) remains running after unit s> > Jun 26 13:10:02 szarotka systemd[1]: Stopped nullmailer.service - Nullmailer relay-only MTA. > Jun 26 13:10:02 szarotka systemd[1]: nullmailer.service: Consumed 1min 26.893s CPU time. > > But not to fret! Maybe we can still kill it with the cgroup! No: > # strace -y sh -c 'echo 1 > /sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill' > ... > dup2(3</sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill>, 1) = 1</sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill> > close(3</sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill>) = 0 > write(1</sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill>, "1\n", 2) = 2 > ... > This completes, sure, but doesn't do anything at all > (admittedly, I'm not a cgroup expert, but it did work on other, > non-poisoned, cgroups, so I'd expect it to work). > > Opening the FIFO with O_NONBLOCK also hangs, obviously. > Killing the splicer restores order, as expected. > > > Splice would have to be > > refactored to not rely on pipe_lock(). That's likely major work with a > > good portion of regressions if the past is any indication. > That's likely; however, it ‒ or an equivalent solution ‒ would > probably be a good idea to do, on balance of all my points above, > I think. In-kernel consumers already have a way of detecting when the pipe isn't safe for non-blocking read anymore because splice has been called and cleared FMODE_NOWAIT. I mean, one workaround would probably be poll() even with O_NONBLOCK but I get why that's annoying and half of a solution. So there are three options afaict: (1) rewrite splice.c to kill its reliance on pipe_lock() Very involved and would need a splice + pipe expert. (2) Add pipe_lock_interruptible() to stop the bleeding and give userspace the ability to at least kill a hanging reader. Also a potentially sensitive change probably regression prone. (3) Somehow factor in FMODE_NOWAIT when acquiring pipe_lock(). If FMODE_NOWAIT is set, try to acquire the lock and if not report EAGAIN otherwise proceed as before. I think Jens proposed a version of this a while back. Adding Linus as well since he probably has thoughts on this. tl;dr it by splicing from a regular file to a pipe where the regular file in splice isn't O_NONBLOCK we can hold pipe_lock() as long as we want and hang pipe_read() even with O_NONBLOCK unkillable trying to acquire pipe_lock().
On Mon, Jun 26, 2023 at 05:56:28PM +0200, Christian Brauner wrote: > I mean, one workaround would probably be poll() even with O_NONBLOCK but > I get why that's annoying and half of a solution. poll() doesn't really change anything since it turns into a hotloop of .revents=POLLHUP with a disconnected writer, cf. https://lore.kernel.org/linux-fsdevel/jbyihkyk5dtaohdwjyivambb2gffyjs3dodpofafnkkunxq7bu@jngkdxx65pux/t/#m747e2bbd0c5cffb6baaf1955f6f8b0d97e216839 The only apparently-supported way to poll pipes under linux is to sleep()/read(O_NONBLOCK) like in the bad old days. And even if that was a working work-around, the fundamental problem of ./spl>fifo excluding all other access to fifo is quite unfortunate too. > tl;dr it by splicing from a regular file to a pipe where the regular > file in splice isn't O_NONBLOCK (Most noticeable when the "regular file" is a tty and thus the I/O never completes by design.)
On Mon, 26 Jun 2023 at 09:14, Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> wrote: > > And even if that was a working work-around, the fundamental problem of > ./spl>fifo excluding all other access to fifo is quite unfortunate too. So I already sent an earlier broken version of this patch to Ahelenia and Christian, but here's an actually slightly tested version with the obvious bugs fixed. The keyword here being "obvious". It's probably broken in many non-obvious ways, but I'm sending it out in case anybody wants to play around with it. It boots for me. It even changes behavior of programs that splice() and used to keep the pipe lock over the IO and no longer do. But it might do unspeakable things to your pets, so caveat emptor. It "feels" right to me. But it's a rather quick hack, and really needs more eyes and more thought. I mention O_NDELAY in the (preliminary) commit message, but there are probably other issues that need thinking about. Linus
On Thu, Jul 06, 2023 at 02:56:45PM -0700, Linus Torvalds wrote: > On Mon, 26 Jun 2023 at 09:14, Ahelenia Ziemiańska > <nabijaczleweli@nabijaczleweli.xyz> wrote: > > > > And even if that was a working work-around, the fundamental problem of > > ./spl>fifo excluding all other access to fifo is quite unfortunate too. > > So I already sent an earlier broken version of this patch to Ahelenia > and Christian, but here's an actually slightly tested version with the > obvious bugs fixed. > > The keyword here being "obvious". It's probably broken in many > non-obvious ways, but I'm sending it out in case anybody wants to play > around with it. > > It boots for me. It even changes behavior of programs that splice() > and used to keep the pipe lock over the IO and no longer do. > > But it might do unspeakable things to your pets, so caveat emptor. It > "feels" right to me. But it's a rather quick hack, and really needs > more eyes and more thought. I mention O_NDELAY in the (preliminary) > commit message, but there are probably other issues that need thinking > about. Forgot to say, fwiw, I've been running this through the LTP splice, pipe, and ipc tests without issues. A hanging reader can be signaled away cleanly with this.
On Fri, 7 Jul 2023 at 10:21, Christian Brauner <brauner@kernel.org> wrote: > > Forgot to say, fwiw, I've been running this through the LTP splice, > pipe, and ipc tests without issues. A hanging reader can be signaled > away cleanly with this. So that patch still has a couple of "wait for this" cases remaining. In particular, when we do a read, and we do have pipe buffers, both the read() system call and a number of internal splice functions will go "Ahh, I have data", and then do pipe_buf_confirm() and read it. Which then results in pipe_buf_confirm() blocking. It now blocks interruptibly, which is much nicer, but several of these users *could* just do a non-blocking confirmation instead, and wait for pipe readability. HOWEVER, that's slightly less trivial than you'd expect, because the "wait for readability" needs to be done without the pipe lock held - so you can't actually check the pipe buffer state at that point (since you need the pipe lock to look up the buffer). That's true even of "trivial" cases like actual user-space "read() with O_NONBLOCK and poll()" situations. Now, the solution to all this is *fairly* straightforward: (a) don't use "!pipe_empty()" for a readability check. We already have "pipe_readable()", but it's hidden in fs/pipe.c, so all the splice() code ended up writing the "does this pipe have data" using "!pipe_empty()" instead. (b) make "pipe_buf_confirm()" take a "non-blocking" boolean argument, and if it is non-blocking but hits one of those blocked pages, set "pipe->not_ready", and return -EAGAIN. This is ok, because "pipe_buf_confirm()" is always under the pipe lock, and we'll just clear "pipe->not_ready" under the pipe lock after finalizing all those pages (and before waking up readers) (c) make "pipe_wait_readable()" and "poll()" know about this all, so that we wait properly for a pipe that was not ready to become ready This all makes *most* users deal properly with these blocking events. In particular, things like splice_to_socket() can now do the whole proper "wait without holding the pipe lock" sequence, even when the pipe is not empty, just in this blocked state. This *may* also make all the cases Jens had with io_uring and splicing JustWork(tm). NOTE! NOTE! NOTE! Once more, this "feels right to me", and I'd argue that the basic approach is fairly straightfoward. The patch is also not horrendous. It all makes a fair amount of sense. BUT! I haven't tested this, and like the previous patch, I really would want people to think about this a lot. Comments? Jens? Linus
On 7/7/23 1:10?PM, Linus Torvalds wrote: > On Fri, 7 Jul 2023 at 10:21, Christian Brauner <brauner@kernel.org> wrote: >> >> Forgot to say, fwiw, I've been running this through the LTP splice, >> pipe, and ipc tests without issues. A hanging reader can be signaled >> away cleanly with this. > > So that patch still has a couple of "wait for this" cases remaining. > > In particular, when we do a read, and we do have pipe buffers, both > the read() system call and a number of internal splice functions will > go "Ahh, I have data", and then do pipe_buf_confirm() and read it. > > Which then results in pipe_buf_confirm() blocking. It now blocks > interruptibly, which is much nicer, but several of these users *could* > just do a non-blocking confirmation instead, and wait for pipe > readability. > > HOWEVER, that's slightly less trivial than you'd expect, because the > "wait for readability" needs to be done without the pipe lock held - > so you can't actually check the pipe buffer state at that point (since > you need the pipe lock to look up the buffer). > > That's true even of "trivial" cases like actual user-space "read() > with O_NONBLOCK and poll()" situations. > > Now, the solution to all this is *fairly* straightforward: > > (a) don't use "!pipe_empty()" for a readability check. > > We already have "pipe_readable()", but it's hidden in fs/pipe.c, > so all the splice() code ended up writing the "does this pipe have > data" using "!pipe_empty()" instead. > > (b) make "pipe_buf_confirm()" take a "non-blocking" boolean argument, > and if it is non-blocking but hits one of those blocked pages, set > "pipe->not_ready", and return -EAGAIN. > > This is ok, because "pipe_buf_confirm()" is always under the pipe > lock, and we'll just clear "pipe->not_ready" under the pipe lock after > finalizing all those pages (and before waking up readers) > > (c) make "pipe_wait_readable()" and "poll()" know about this all, so > that we wait properly for a pipe that was not ready to become ready > > This all makes *most* users deal properly with these blocking events. > In particular, things like splice_to_socket() can now do the whole > proper "wait without holding the pipe lock" sequence, even when the > pipe is not empty, just in this blocked state. > > This *may* also make all the cases Jens had with io_uring and splicing > JustWork(tm). Exactly! I was reading this thread with excitement just now, would be nice to get rid of that kludge. > NOTE! NOTE! NOTE! Once more, this "feels right to me", and I'd argue > that the basic approach is fairly straightfoward. The patch is also > not horrendous. It all makes a fair amount of sense. BUT! I haven't > tested this, and like the previous patch, I really would want people > to think about this a lot. > > Comments? Jens? I'll take a closer look at this, but won't be until Monday most likely. But the approach seems sane, and going in a more idiomatic direction than before. So seems promising.
On Fri, Jul 07, 2023 at 12:10:36PM -0700, Linus Torvalds wrote: > On Fri, 7 Jul 2023 at 10:21, Christian Brauner <brauner@kernel.org> wrote: > > Forgot to say, fwiw, I've been running this through the LTP splice, > > pipe, and ipc tests without issues. A hanging reader can be signaled > > away cleanly with this. > NOTE! NOTE! NOTE! Once more, this "feels right to me", and I'd argue > that the basic approach is fairly straightfoward. The patch is also > not horrendous. It all makes a fair amount of sense. BUT! I haven't > tested this, and like the previous patch, I really would want people > to think about this a lot. > > Comments? Jens? I applied the patch upthread + this diff to 4f6b6c2b2f86b7878a770736bf478d8a263ff0bc; during test setup I got a null deref (building defconfig minus graphics). Reproducible, full BUG dump attached; trace of [ 149.878931] <TASK> [ 149.879533] ? __die+0x1e/0x60 [ 149.880309] ? page_fault_oops+0x17c/0x470 [ 149.881313] ? search_module_extables+0x14/0x50 [ 149.882422] ? exc_page_fault+0x67/0x150 [ 149.883397] ? asm_exc_page_fault+0x26/0x30 [ 149.884426] ? __pfx_pipe_to_null+0x10/0x10 [ 149.885451] ? splice_from_pipe_next+0x129/0x150 [ 149.886580] __splice_from_pipe+0x39/0x1c0 [ 149.887594] ? __pfx_pipe_to_null+0x10/0x10 [ 149.888615] ? __pfx_pipe_to_null+0x10/0x10 [ 149.889635] splice_from_pipe+0x5c/0x90 [ 149.890579] do_splice+0x35c/0x840 [ 149.891407] __do_splice+0x1eb/0x210 [ 149.892176] __x64_sys_splice+0xad/0x120 [ 149.893019] do_syscall_64+0x3e/0x90 [ 149.893798] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 $ scripts/faddr2line vmlinux splice_from_pipe_next+0x129 splice_from_pipe_next+0x129/0x150: pipe_buf_release at include/linux/pipe_fs_i.h:221 (inlined by) eat_empty_buffer at fs/splice.c:594 (inlined by) splice_from_pipe_next at fs/splice.c:640 I gamed this down to echo c | grep c >/dev/null where grep is ii grep 3.8-5 amd64 GNU grep, egrep and fgrep and strace of the same invocation (on the host) ends with newfstatat(1, "", {st_mode=S_IFCHR|0666, st_rdev=makedev(0x1, 0x3), ...}, AT_EMPTY_PATH) = 0 newfstatat(AT_FDCWD, "/dev/null", {st_mode=S_IFCHR|0666, st_rdev=makedev(0x1, 0x3), ...}, 0) = 0 newfstatat(0, "", {st_mode=S_IFIFO|0600, st_size=0, ...}, AT_EMPTY_PATH) = 0 lseek(0, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek) read(0, "c\n", 98304) = 2 splice(0, NULL, 1, NULL, 98304, SPLICE_F_MOVE) = 0 close(1) = 0 close(2) = 0 exit_group(0) = ? +++ exited with 0 +++ And can also reproduce it with echo | { read -r _; exec ./wr; } > /dev/null (where ./wr is "while (splice(0, 0, 1, 0, 128 * 1024 * 1024, 0) > 0) {}"). However: echo | ./wr > /dev/null does NOT crash. Besides that, this doesn't solve the original issue, inasmuch as ./v > fifo & head fifo & echo zupa > fifo (where ./v splices from an empty pty to stdout; v.c attached) echo still sleeps until ./v dies, though it also succumbs to ^C now. "OTOH, on 4f6b6c2b2f86b7878a770736bf478d8a263ff0bc, "timeout 10 ./v > fifo &" (then lines 2 and 3 as above) does kill ./v -> unblock echo -> head copies "zupa", i.e. life resumes as normal after the splicer went away. With the patches, echo zupa is stuck forever (until you signal it)! This is kinda worse. [ 149.843966] BUG: kernel NULL pointer dereference, address: 0000000000000008 [ 149.845820] #PF: supervisor read access in kernel mode [ 149.847190] #PF: error_code(0x0000) - not-present page [ 149.848540] PGD 0 P4D 0 [ 149.849231] Oops: 0000 [#1] PREEMPT SMP PTI [ 149.850345] CPU: 0 PID: 230 Comm: grep Not tainted 6.4.0-12317-gabf530ed3e36-dirty #3 [ 149.852411] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 149.854900] RIP: 0010:splice_from_pipe_next+0x129/0x150 [ 149.856328] Code: ff c6 45 38 00 eb af 5b b8 00 fe ff ff 5d 41 5c 41 5d c3 cc cc cc cc 48 8b 46 10 41 83 c5 01 48 89 df 48 c7 46 10 00 00 00 00 <48> 8b 40 08 e8 ce a5 9a [ 149.861118] RSP: 0018:ffffb2ed40347d70 EFLAGS: 00010202 [ 149.862488] RAX: 0000000000000000 RBX: ffff8c06c1d9a0c0 RCX: 0000000000000000 [ 149.864357] RDX: 0000000000000005 RSI: ffff8c06c8c98028 RDI: ffff8c06c1d9a0c0 [ 149.866217] RBP: ffffb2ed40347de0 R08: 0000000000000001 R09: ffffffffaa428db0 [ 149.868088] R10: 0000000000018000 R11: 0000000000000000 R12: ffff8c06c2625580 [ 149.869950] R13: 0000000000000002 R14: ffff8c06c1d9a0c0 R15: ffffb2ed40347de0 [ 149.871828] FS: 00007fa5a6b3e740(0000) GS:ffff8c06dd800000(0000) knlGS:0000000000000000 [ 149.873937] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 149.875459] CR2: 0000000000000008 CR3: 000000000269a000 CR4: 00000000000006f0 [ 149.877327] Call Trace: [ 149.878931] <TASK> [ 149.879533] ? __die+0x1e/0x60 [ 149.880309] ? page_fault_oops+0x17c/0x470 [ 149.881313] ? search_module_extables+0x14/0x50 [ 149.882422] ? exc_page_fault+0x67/0x150 [ 149.883397] ? asm_exc_page_fault+0x26/0x30 [ 149.884426] ? __pfx_pipe_to_null+0x10/0x10 [ 149.885451] ? splice_from_pipe_next+0x129/0x150 [ 149.886580] __splice_from_pipe+0x39/0x1c0 [ 149.887594] ? __pfx_pipe_to_null+0x10/0x10 [ 149.888615] ? __pfx_pipe_to_null+0x10/0x10 [ 149.889635] splice_from_pipe+0x5c/0x90 [ 149.890579] do_splice+0x35c/0x840 [ 149.891407] __do_splice+0x1eb/0x210 [ 149.892176] __x64_sys_splice+0xad/0x120 [ 149.893019] do_syscall_64+0x3e/0x90 [ 149.893798] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 149.894881] RIP: 0033:0x7fa5a6c49dd3 [ 149.895682] Code: 64 89 02 48 c7 c0 ff ff ff ff eb b9 66 2e 0f 1f 84 00 00 00 00 00 90 80 3d 11 18 0d 00 00 49 89 ca 74 14 b8 13 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 74 [ 149.899538] RSP: 002b:00007ffc83d77768 EFLAGS: 00000202 ORIG_RAX: 0000000000000113 [ 149.901116] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fa5a6c49dd3 [ 149.902602] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000 [ 149.904048] RBP: 0000564d8aaeb000 R08: 0000000000018000 R09: 0000000000000001 [ 149.905439] R10: 0000000000000000 R11: 0000000000000202 R12: 000000000000000a [ 149.906832] R13: 0000564d8aaeb010 R14: 0000564d8aaeb000 R15: 0000000000000000 [ 149.908239] </TASK> [ 149.908692] Modules linked in: [ 149.909326] CR2: 0000000000000008 [ 149.910050] ---[ end trace 0000000000000000 ]--- [ 149.910986] RIP: 0010:splice_from_pipe_next+0x129/0x150 [ 149.912063] Code: ff c6 45 38 00 eb af 5b b8 00 fe ff ff 5d 41 5c 41 5d c3 cc cc cc cc 48 8b 46 10 41 83 c5 01 48 89 df 48 c7 46 10 00 00 00 00 <48> 8b 40 08 e8 ce a5 9a [ 149.915639] RSP: 0018:ffffb2ed40347d70 EFLAGS: 00010202 [ 149.916589] RAX: 0000000000000000 RBX: ffff8c06c1d9a0c0 RCX: 0000000000000000 [ 149.917877] RDX: 0000000000000005 RSI: ffff8c06c8c98028 RDI: ffff8c06c1d9a0c0 [ 149.919172] RBP: ffffb2ed40347de0 R08: 0000000000000001 R09: ffffffffaa428db0 [ 149.920457] R10: 0000000000018000 R11: 0000000000000000 R12: ffff8c06c2625580 [ 149.921737] R13: 0000000000000002 R14: ffff8c06c1d9a0c0 R15: ffffb2ed40347de0 [ 149.923021] FS: 00007fa5a6b3e740(0000) GS:ffff8c06dd800000(0000) knlGS:0000000000000000 [ 149.924481] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 149.925529] CR2: 0000000000000008 CR3: 000000000269a000 CR4: 00000000000006f0
On Fri, 7 Jul 2023 at 15:41, Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> wrote: > > (inlined by) eat_empty_buffer at fs/splice.c:594 Ahh, eat_empty_buffer() ends up releasing the buffer without waiting for it. And the reason for that is actually somewhat interesting: we do have that while (!pipe_readable(pipe)) { .. above it, but the logic for this all is that pipes with pipe buffers are by *default* considered readable until they try to actually confirm the buffer, and at that point they might say "oh, I have to return -EAGAIN and set 'not_ready'". And that splice_from_pipe_next() doesn't do that. End result: it will happily free that pipe buffer that is still in the process of being filled. The good news is that I think the fix is probably trivial. Something like the attached? Again - NOT TESTED. > Besides that, this doesn't solve the original issue, inasmuch as > ./v > fifo & > head fifo & > echo zupa > fifo > (where ./v splices from an empty pty to stdout; v.c attached) > echo still sleeps until ./v dies, though it also succumbs to ^C now. Yeah, I concentrated on just making everything interruptible, But the fact that the echo has to wait for the previous write to finish is kind of fundamental. We can't just magically do writes out of order. 'v' is busy writing to the fifo, we can't let some other write just come in. (We *could* make the splice in ./v not fill the whole pipe buffer, and allow some other writes to fill in buffers afterwards, but at _some_ point you'll hit the "pipe buffers are full and busy, can't add any more without waiting for them to empty"). One thing we could possibly do is to say that we just don't accept any new writes if there are old busy splices in process. So we could make new writes return -EBUSY or something, I guess. Linus
On Thu, Jul 06, 2023 at 02:56:45PM -0700, Linus Torvalds wrote: > +static int busy_pipe_buf_confirm(struct pipe_inode_info *pipe, > + struct pipe_buffer *buf) > +{ > + struct page *page = buf->page; > + > + if (folio_wait_bit_interruptible(page_folio(page), PG_locked)) > + return -EINTR; Do we really want interruptible here rather than killable? That is, do we want SIGWINCH or SIGALRM to result in a short read? I assume it's OK to return a short read because userspace has explicitly asked for O_NONBLOCK and can therefore be expected to actually check the return value from read().
On Fri, 7 Jul 2023 at 17:00, Matthew Wilcox <willy@infradead.org> wrote: > > Do we really want interruptible here rather than killable? Yes, we actually do need to be just regular interruptible, This is a bog-standard "IO to/from pipe" situation, which is interruptible. > That is, do we want SIGWINCH or SIGALRM to result in a short read? Now, that's a different issue, and is actually handled by the signal layer: a signal that is ignored (where "ignored" includes the case of "default handler") will be dropped early, exactly because we don't want to interrupt things like tty or pipe reads when you resize the window. Of course, if you actually *catch* SIGWINCH, then you will get that short read on a window change. Linus
On Fri, 7 Jul 2023 at 17:07, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > That is, do we want SIGWINCH or SIGALRM to result in a short read? > > Now, that's a different issue, and is actually handled by the signal > layer: a signal that is ignored (where "ignored" includes the case of > "default handler") will be dropped early, exactly because we don't > want to interrupt things like tty or pipe reads when you resize the > window. In case you care, it's prepare_signal() -> sig_ignored() -> sig_task_ignored() -> sig_handler_ignored() logic that does this short-circuiting. And while I don't think it's required by POSIX, this was definitely a case where lots of programs that *don't* use any signal handlers at all are very much not expecting to see -EINTR as a return value, and used to break exactly on things like SIGWINCH when reading from a tty or a pipe. But that logic goes back to before linux-1.0. In fact, I think it goes back to at least 0.99.10 (June '93). Linus
On Fri, Jul 07, 2023 at 03:57:40PM -0700, Linus Torvalds wrote: > On Fri, 7 Jul 2023 at 15:41, Ahelenia Ziemiańska > <nabijaczleweli@nabijaczleweli.xyz> wrote: > > (inlined by) eat_empty_buffer at fs/splice.c:594 > Ahh, eat_empty_buffer() ends up releasing the buffer without waiting for it. > > And the reason for that is actually somewhat interesting: we do have that > > while (!pipe_readable(pipe)) { > .. > > above it, but the logic for this all is that pipes with pipe buffers > are by *default* considered readable until they try to actually > confirm the buffer, and at that point they might say "oh, I have to > return -EAGAIN and set 'not_ready'". > > And that splice_from_pipe_next() doesn't do that. > > End result: it will happily free that pipe buffer that is still in the > process of being filled. > > The good news is that I think the fix is probably trivial. Something > like the attached? > > Again - NOT TESTED Same reproducer, backtrace attached: $ scripts/faddr2line vmlinux splice_from_pipe_next+0x6e splice_from_pipe_next+0x6e/0x180: pipe_buf_confirm at include/linux/pipe_fs_i.h:233 (inlined by) eat_empty_buffer at fs/splice.c:597 (inlined by) splice_from_pipe_next at fs/splice.c:647 Looks like the same place. > > Besides that, this doesn't solve the original issue, inasmuch as > > ./v > fifo & > > head fifo & > > echo zupa > fifo > > (where ./v splices from an empty pty to stdout; v.c attached) > > echo still sleeps until ./v dies, though it also succumbs to ^C now. > Yeah, I concentrated on just making everything interruptible, > > But the fact that the echo has to wait for the previous write to > finish is kind of fundamental. We can't just magically do writes out > of order. 'v' is busy writing to the fifo, we can't let some other > write just come in. (It's no longer busy writing to it when it gets killed by timeout in my second example, but echo still doesn't wake up.) > (We *could* make the splice in ./v not fill the whole pipe buffer, and > allow some other writes to fill in buffers afterwards, but at _some_ > point you'll hit the "pipe buffers are full and busy, can't add any > more without waiting for them to empty"). You are, but, well, that's also the case when the pipe is full. As it stands, the pipe is /empty/ and yet /no-one can write to it/. This is the crux of the issue at hand. (Coincidentally, you're describing what looks quite similar to pt 1. from <naljsvzzemr6pjiwwcdpdsdh5vtycdr6fmi2fk2dlr4nn4kq6o@ycanbgxhslti>.) I think we got away with it for so long because most files behave like regular files/blockdevs and the read is always guaranteed to complete ~instantly, but splice is, fundamentally, /not/ writing the whole time because it's not /reading/ the whole time when it's reading an empty socket/a chardev/whatever. Or, rather: splice() from a non-seekable (non-mmap()pable?) fundamentally doesn't really make much sense, because you're not gluing a bit of the pro-verbial page cache (forgive me my term abuse here, you see what I'm getting at tho) to the end of the pipe, you're just telling the kernel to run a read()/write() loop for you. sendfile() works around this by reading and then separately writing to its special buffer (in the form of an anonymous process-local pipe). splice() just raw-dogs the read with the write lock held, but just doesn't check if it can do it. That's how it's, honestly, shaking out to me: someone just forgot a check the first time they wrote it. Because the precondition for "does reading directly into the pipe buffer make sense" is "is it directly equivalent to read(f)/write(p)", and that holds only for seekables. /Maybe/ for, like, sockets if there's already data, or as a special case similar to pipe->pipe. But then for sockets you're already using sendfile(), so? To that end, I'm including a patch based on 4f6b6c2b2f86b7878a770736bf478d8a263ff0bc that does just that: EINVAL. Maybe if you're worried about breaking compatibility (which idk if it's an issue since splice and sendfile are temperamental w.r.t. what they take anyway across versions; you need a read()/write() fallback anyway) that case could become sendfile-like by first reading into a buffer once then pipe->pipe splicing out of it separately. Though, even the usual sendfile read/write loop only works on seekables. > One thing we could possibly do is to say that we just don't accept any > new writes if there are old busy splices in process. So we could make > new writes return -EBUSY or something, I guess. Same logic as above applies + no-one's really expecting EBUSY + what /would/ you do about EBUSY in this case? -- >8 -- From 552d93bee8e1b8333ce84ed0ca8490f6712c5b8b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ahelenia=20Ziemia=C5=84ska?= <nabijaczleweli@nabijaczleweli.xyz> Date: Sat, 8 Jul 2023 01:47:59 +0200 Subject: [PATCH] splice: file->pipe: -EINVAL if file non-seekable MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Both the logical semantic of "tie a page from the page cache to the pipe" and the implemented semantic of "lock the pipe, then read into it" (thus holding the write lock for as as long as the read is outstanding) only make sense if the read is guaranteed to complete instantly. This has been a long-standing omission and, when the read doesn't immediately complete (because it's a tty, socket, &c.), the write lock ‒ which for pipes is a pipe-global mutex ‒ is held until, thus excluding all mutual users of the pipe, until it does. Refuse it. Use read()/write() in userspace instead of getting the kernel to do it for you, badly, when there's no point to doing so. Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wjEC_Rh8+-rtEi8C45upO-Ffw=M_i1211qS_3AvWZCbOg@mail.gmail.com/t/#u Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> --- fs/splice.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/splice.c b/fs/splice.c index 004eb1c4ce31..14cf3cea1091 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -1309,6 +1309,8 @@ long do_splice(struct file *in, loff_t *off_in, struct file *out, if (opipe) { if (off_out) return -ESPIPE; + if (!(in->f_mode & FMODE_LSEEK)) + return -EINVAL; if (off_in) { if (!(in->f_mode & FMODE_PREAD)) return -EINVAL;
On Fri, 7 Jul 2023 at 17:30, Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> wrote: > > Same reproducer, backtrace attached: > $ scripts/faddr2line vmlinux splice_from_pipe_next+0x6e > splice_from_pipe_next+0x6e/0x180: > pipe_buf_confirm at include/linux/pipe_fs_i.h:233 Bah. I should have looked more closely at your case. This is a buffer without an 'ops' pointer. So it looks like was already released. And the reason is that the pipe was readable because there were no writers, and I had put the if (!pipe->writers) return 0; check in splice_from_pipe_next() in the wrong place. It needs to be *before* the eat_empty_buffer() call. Duh. Anyway, while I think that fixes your NULL pointer thing, having looked at my original patch I realized that keeping the pointer to the pipe buffer in copy_splice_read() across the dropping of the pipe lock is completely broken. I thought (because I didn't remember the code) that pipe resizing will just copy the pipe buffer pointers around. That would have made it ok to remember a pipe buffer pointer. But it is not so. It will actually create new pipe buffers and copy the buffer contents around. So while fixing your NULL pointer check should be trivial, I think that first patch is actually fundamentally broken wrt pipe resizing, and I see no really sane way to fix it. We could add a new lock just for that, but I don't think it's worth it. > You are, but, well, that's also the case when the pipe is full. > As it stands, the pipe is /empty/ and yet /no-one can write to it/. > This is the crux of the issue at hand. No, I think you are mis-representing things. The pipe isn't empty. It's full of things that just aren't finalized yet. > Or, rather: splice() from a non-seekable (non-mmap()pable?) Please stop with the non-seekable nonsense. Any time I see a patch like this: > + if (!(in->f_mode & FMODE_LSEEK)) > + return -EINVAL; I will just go "that person is not competent". This has absolutely nothing to do with seekability. But it is possible that we need to just bite the bullet and say "copy_splice_read() needs to use a non-blocking kiocb for the IO". Of course, that then doesn't work, because while doing this is trivial: --- a/fs/splice.c +++ b/fs/splice.c @@ -364,6 +364,7 @@ ssize_t copy_splice_read(struct file *in, loff_t *ppos, iov_iter_bvec(&to, ITER_DEST, bv, npages, len); init_sync_kiocb(&kiocb, in); kiocb.ki_pos = *ppos; + kiocb.ki_flags |= IOCB_NOWAIT; ret = call_read_iter(in, &kiocb, &to); if (ret > 0) { I suspectr you'll find that it makes no difference, because the tty layer doesn't actually honor the IOCB_NOWAIT flag for various historical reasons. In fact, the kiocb isn't even passed down to the low-level routine, which only gets the 'struct file *', and instead it looks at tty_io_nonblock(), which just does that legacy file->f_flags & O_NONBLOCK test. I guess combined with something like if (!(in->f_mode & FMODE_NOWAIT)) return -EINVAL; it might all work. Linus
On Sat, Jul 08, 2023 at 01:06:56PM -0700, Linus Torvalds wrote: > On Fri, 7 Jul 2023 at 17:30, Ahelenia Ziemiańska > <nabijaczleweli@nabijaczleweli.xyz> wrote: > > > > Same reproducer, backtrace attached: > > $ scripts/faddr2line vmlinux splice_from_pipe_next+0x6e > > splice_from_pipe_next+0x6e/0x180: > > pipe_buf_confirm at include/linux/pipe_fs_i.h:233 > Bah. I should have looked more closely at your case. > > This is a buffer without an 'ops' pointer. So it looks like was > already released. > > And the reason is that the pipe was readable because there were no > writers, and I had put the > > if (!pipe->writers) > return 0; > > check in splice_from_pipe_next() in the wrong place. It needs to be > *before* the eat_empty_buffer() call. > > Anyway, while I think that fixes your NULL pointer thing, It does. > So while fixing your NULL pointer check should be trivial, I think > that first patch is actually fundamentally broken wrt pipe resizing, > and I see no really sane way to fix it. We could add a new lock just > for that, but I don't think it's worth it. > > > You are, but, well, that's also the case when the pipe is full. > > As it stands, the pipe is /empty/ and yet /no-one can write to it/. > > This is the crux of the issue at hand. > No, I think you are mis-representing things. The pipe isn't empty. > It's full of things that just aren't finalized yet. Being full of no data (as part of some hidden state) doesn't make it any less empty, but meh; neither here not there. > > Or, rather: splice() from a non-seekable (non-mmap()pable?) > Please stop with the non-seekable nonsense. > > Any time I see a patch like this: > > > + if (!(in->f_mode & FMODE_LSEEK)) > > + return -EINVAL; > > I will just go "that person is not competent". Accurate assessment. > This has absolutely nothing to do with seekability. Yes, and as noted, I was using it as a stand-in for "I/O won't block" due to the above (and splice_direct_to_actor() already uses it). Glad to see you've managed to synthesise my drivel into something workable. > But it is possible that we need to just bite the bullet and say > "copy_splice_read() needs to use a non-blocking kiocb for the IO". > > Of course, that then doesn't work, because while doing this is trivial: > > --- a/fs/splice.c > +++ b/fs/splice.c > @@ -364,6 +364,7 @@ ssize_t copy_splice_read(struct file *in, loff_t *ppos, > iov_iter_bvec(&to, ITER_DEST, bv, npages, len); > init_sync_kiocb(&kiocb, in); > kiocb.ki_pos = *ppos; > + kiocb.ki_flags |= IOCB_NOWAIT; > ret = call_read_iter(in, &kiocb, &to); > > if (ret > 0) { > > I suspectr you'll find that it makes no difference, because the tty > layer doesn't actually honor the IOCB_NOWAIT flag for various > historical reasons. Indeed, neither when splicing from a tty, nor from a socket (same setup but socketpair(AF_UNIX, SOCK_STREAM, 0); w.c). > In fact, the kiocb isn't even passed down to the > low-level routine, which only gets the 'struct file *', and instead it > looks at tty_io_nonblock(), which just does that legacy > > file->f_flags & O_NONBLOCK > > test. > > I guess combined with something like > > if (!(in->f_mode & FMODE_NOWAIT)) > return -EINVAL; > > it might all work. Yes, that makes the splice instantly -EINVAL for ttys, but doesn't affect the socketpair() case above, which still blocks forever. This appears to be because vfs_splice_read() does if ((in->f_flags & O_DIRECT) || IS_DAX(in->f_mapping->host)) return copy_splice_read(in, ppos, pipe, len, flags); return in->f_op->splice_read(in, ppos, pipe, len, flags); so the c_s_r() isn't even called for the socket, which uses unix_stream_splice_read(). Thus, diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 123b35ddfd71..384d5a479b4a 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -2880,15 +2880,12 @@ static ssize_t unix_stream_splice_read(struct socket *sock, loff_t *ppos, .pipe = pipe, .size = size, .splice_flags = flags, + .flags = MSG_DONTWAIT, }; if (unlikely(*ppos)) return -ESPIPE; - if (sock->file->f_flags & O_NONBLOCK || - flags & SPLICE_F_NONBLOCK) - state.flags = MSG_DONTWAIT; - return unix_stream_read_generic(&state, false); } makes the splice -EAGAIN w/o data and splice whatever's in the socket when there is data. git grep '\.splice_read.*=' | cut -d= -f2 | tr -s ',;[:space:]' '\n' | sort -u gives me 27 distinct splice_read implementations and 1 cocci config. These are simple delegations: nfs_file_splice_read filemap_splice_read afs_file_splice_read filemap_splice_read ceph_splice_read filemap_splice_read ecryptfs_splice_read_update_atime filemap_splice_read ext4_file_splice_read filemap_splice_read f2fs_file_splice_read filemap_splice_read ntfs_file_splice_read filemap_splice_read ocfs2_file_splice_read filemap_splice_read orangefs_file_splice_read filemap_splice_read v9fs_file_splice_read filemap_splice_read xfs_file_splice_read filemap_splice_read zonefs_file_splice_read filemap_splice_read sock_splice_read copy_splice_read or a socket-specific one coda_file_splice_read vfs_splice_read ovl_splice_read vfs_splice_read vfs_splice_read calls copy_splice_read (not used as a .splice_read). And the rest are: 01. copy_splice_read you've fixed 02. filemap_splice_read is, as I understand it, only applicable to files/blockdevs and already has the required semantics? 03. unix_stream_splice_read can be fixed as above 04. fuse_dev_splice_read allocates a buffer and fuse_dev_do_read()s into it, fuse_dev_do_read does if (file->f_flags & O_NONBLOCK) return -EAGAIN; so this is likely a similarly easy fix 05. tracing_buffers_splice_read, when it doesn't read anything does ret = -EAGAIN; if ((file->f_flags & O_NONBLOCK) || (flags & SPLICE_F_NONBLOCK)) goto out; and waits for at least one thing to read; can probably just goto out instantly 06. tracing_splice_read_pipe delegates to whatever it's tracing, and errors if that errored, so it's fine(?) 07. shmem_file_splice_read is always nonblocking 08. relay_file_splice_read only checks SPLICE_F_NONBLOCK to turn 0 into -EAGAIN so I think it also just doesn't block 09. smc_splice_read falls back to an embedded socket's splice_read, or uses if (flags & SPLICE_F_NONBLOCK) flags = MSG_DONTWAIT; else flags = 0; SMC_STAT_INC(smc, splice_cnt); rc = smc_rx_recvmsg(smc, NULL, pipe, len, flags); so also probably remove the condition 10. kcm_splice_read: 10a. passes flags (SPLICE_F_...) to skb_recv_datagram(), which does timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT); verbatim!? and forwards them to try_recv which also checks for socket-style bits 10b. give it MSG_DONTWAIT, call it a day? 10c. it also passes flags to skb_splice_bits() to actually splice to the pipe, but that discards flags, so no change needed 11. tls_sw_splice_read I don't really understand but it does err = tls_rx_reader_lock(sk, ctx, flags & SPLICE_F_NONBLOCK); and err = tls_rx_rec_wait(sk, NULL, flags & SPLICE_F_NONBLOCK, true); (and skb_splice_bits()) so make them both true? 12. tcp_splice_read uses skb_splice_bits() and timeout governed by timeo = sock_rcvtimeo(sk, sock->file->f_flags & O_NONBLOCK); and re-tries on empty read if timeo!=0 (i.e. !(sock->file->f_flags & O_NONBLOCK)); so turning that into true (timeo = 0) and propagating that would make it behave accd'g to the new semantic Would that make sense?
On Sun, Jul 09, 2023 at 03:03:22AM +0200, Ahelenia Ziemiańska wrote: > On Sat, Jul 08, 2023 at 01:06:56PM -0700, Linus Torvalds wrote: > > I guess combined with something like > > > > if (!(in->f_mode & FMODE_NOWAIT)) > > return -EINVAL; > > > > it might all work. > Yes, that makes the splice instantly -EINVAL for ttys, but doesn't > affect the socketpair() case above, which still blocks forever. This also triggers for regular file -> pipe splices, which is probably a no-go. Attaching a summary diff that does all I said in the previous mail. filemap_get_pages() does use and inspect IOCB_NOWAIT if set in filemap_splice_read(), but it appears to not really make much sense, inasmuch it returns EAGAIN for the first splice() from a blockdev and then instantly return with data on the next go-around. This doesn't really make much sense ‒ and open(2) says blockdevs don't have O_NONBLOCK semantics, so I'm assuming this is not supposed to be exposed to userspace ‒ so I'm not setting it in the diff. I've tested this for: * tty: -EINVAL * socketpair(AF_UNIX, SOCK_STREAM): works as expected $ wc -c fifo & [1] 261 $ ./af_unix ./s > fifo 5 Success 6454 Resource temporarily unavailable 5 Success 6445 Resource temporarily unavailable 0 Success 10 fifo * socket(AF_INET, SOCK_STREAM, 0): works as expected $ wc fifo & [1] 249 $ ./tcp ./s > fifo 23099 Resource temporarily unavailable 7 Success 2099 Resource temporarily unavailable 4 Success 1751 Resource temporarily unavailable 3 Success 21655 Resource temporarily unavailable 95 Success 19589 Resource temporarily unavailable 0 Success 4 15 109 fifo corresponding to host$ nc 127.0.0.1 14640 żupan asd ad asdda dasj aiojd askdl; jasiopdij as[pkdo[p askd9p ias90dk aso[pjd 890pasid90[ jaskl;dj il;asd ^C * blockdev (/dev/vda): as expected (with filemap_splice_read() unchanged), copies it all * regular file: -EINVAL(!) Splicers still sleep if the pipe's full, of course, unless SPLICE_F_NONBLOCK. Test drivers attached as well.
On Mon, Jul 10, 2023 at 12:33:07AM +0200, Ahelenia Ziemiańska wrote: > On Sun, Jul 09, 2023 at 03:03:22AM +0200, Ahelenia Ziemiańska wrote: > > On Sat, Jul 08, 2023 at 01:06:56PM -0700, Linus Torvalds wrote: > > > I guess combined with something like > > > > > > if (!(in->f_mode & FMODE_NOWAIT)) > > > return -EINVAL; > > > > > > it might all work. > > Yes, that makes the splice instantly -EINVAL for ttys, but doesn't > > affect the socketpair() case above, which still blocks forever. > This also triggers for regular file -> pipe splices, > which is probably a no-go. Actually, that's only the case for regular files on some filesystems? I originally tested on tmpfs, and now on vfat, ramfs, procfs, and sysfs, and none have FMODE_NOWAIT set. Conversely, ext4 and XFS files both have FMODE_NOWAIT set, and behave like blockdevs incl. the filemap_splice_read() oddities below. Indeed, it looks like Some filesystems (btrfs/ext4/f2fs/ocfs2/xfs, blockdevs, /dev/{null,zero,random,urandom}, pipes, tun/tap) set FMODE_NOWAIT, but that's by far not All of them, so maybe /* File is capable of returning -EAGAIN if I/O will block */ is not the right check for regular files. > filemap_get_pages() does use and inspect IOCB_NOWAIT if set in > filemap_splice_read(), but it appears to not really make much sense, > inasmuch it returns EAGAIN for the first splice() from a > blockdev and then instantly return with data on the next go-around. Indeed, this is inconsistent to both: * readv2(off=-1, RWF_NOWAIT), which always returns EAGAIN, and * fcntl(0, F_SETFL, O_NONBLOCK), read(), which always reads. Thus, > This doesn't really make much sense ‒ and open(2) says blockdevs > don't have O_NONBLOCK semantics, so I'm assuming this is not supposed > to be exposed to userspace ‒ so I'm not setting it in the diff. not specifying IOCB_NOWAIT in filemap_splice_read() provides consistent semantics to "file is read as-if it had O_NONBLOCK set". I've tentatively updated the check to if (!((in->f_mode & FMODE_NOWAIT) || S_ISREG(in->f_inode->i_mode))) (with the reasoning, as previously, that regular files don't /have/ a distinct O_NONBLOCK mode, because they always behave non-blockingly) and with that > I've tested this for: * regular file: works as expected
diff --git a/fs/pipe.c b/fs/pipe.c index 2d88f73f585a..a76535839d32 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -240,6 +240,11 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) if (unlikely(total_len == 0)) return 0; + if ((filp->f_flags & O_NONBLOCK) || (iocb->ki_flags & IOCB_NOWAIT)) { + if (mutex_is_locked(&pipe->mutex)) + return -EAGAIN; + } + ret = 0; __pipe_lock(pipe); -- >8 -- which does make the aforementioned cases less pathological