Message ID | 20230320071442.172228-1-pedro.falcato@gmail.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a5d:604a:0:0:0:0:0 with SMTP id j10csp1064376wrt; Mon, 20 Mar 2023 00:20:38 -0700 (PDT) X-Google-Smtp-Source: AK7set974AyRlGt2SQuoMzQD2zQLXE4hqVl90drQyAKEeY6Bq0lsm6n4JoDEHwWkXTfnGWo0pPQh X-Received: by 2002:a17:902:d3c9:b0:1a0:53ba:ff1f with SMTP id w9-20020a170902d3c900b001a053baff1fmr16630651plb.0.1679296837916; Mon, 20 Mar 2023 00:20:37 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1679296837; cv=none; d=google.com; s=arc-20160816; b=SYdTXnNwmfCH12O3btEf/xEz3P8GGTM02XoRrQa8UJN+cLyfm7BLiwRpqBVILkzrcF XN3Hq09AKY/dbHWZNZdLWFkEOEdXIkmPLCHg/BFp73bqO132i/lASm4Ozy46Z/M8okHK J3KzYWl28R5jpebd6qiTQFNOA59HtoMIs2bjxLMItdEIDRhb14/gIJID+1nG/kOB1xNP xqXxJe0lIydT7YcqKl0J7aVp6UEjNqTp/WX+8pFs+hq8TC4d6od+GAm3dhbxEh34E+vA e81fm8IOkUYhHmnwov/Tyts+SB0h805YhiNqp+b6Rfu+rzF+2A+BdBlTLswNL73tWlcR SGbg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:dkim-signature; bh=zfkxMOxAGfgWPUNtNItdws/lAov0WMT4Tn6jUy/fV/w=; b=Dz/RRQyAnqHWyniwhKgXrLA7QlO21smcPbY+mKnaOparIhjzqZfjcxCej2zo+6HmPA 9d/zJ3qRyVTmNV7TTAr3Zmb0FBcw3dKY454JLLZkOygS3UkmsuNmLG28AF94yLnT6o76 RsLhue0VTtpAdbYptYwMMqxVjlf4omqmrrCrPiuwRoXowoPKUjQvCMpyXq1upGxIqnEB u5q9vAnhCmt3RZRATE30ba97cQWxpK0eGPAxFFHyOh4kcPDpG45twHCtQvbwah/vQ5yq d3jt/exI8JEH0uyPaS7zMFNthb/COSV35Mikdw/8JniN7x++WGat/nfrUeNVTDWg6Mig n/Eg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20210112 header.b=GjiILlTC; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id s10-20020a170902ea0a00b001a1b454e535si6880133plg.257.2023.03.20.00.20.21; Mon, 20 Mar 2023 00:20:37 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20210112 header.b=GjiILlTC; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229496AbjCTHOz (ORCPT <rfc822;chrisfriedt@gmail.com> + 99 others); Mon, 20 Mar 2023 03:14:55 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50878 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229889AbjCTHOw (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Mon, 20 Mar 2023 03:14:52 -0400 Received: from mail-wr1-x42f.google.com (mail-wr1-x42f.google.com [IPv6:2a00:1450:4864:20::42f]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BA42D1E5D8; Mon, 20 Mar 2023 00:14:50 -0700 (PDT) Received: by mail-wr1-x42f.google.com with SMTP id t15so9296680wrz.7; Mon, 20 Mar 2023 00:14:50 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; t=1679296489; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:from:to:cc:subject:date:message-id:reply-to; bh=zfkxMOxAGfgWPUNtNItdws/lAov0WMT4Tn6jUy/fV/w=; b=GjiILlTCr5jfQ1cyHmMIYapJx0INf4D0wivvzByi6ZopCTgn/AEn2+rbNfguJgkRp6 S6vEh49Mmbr7nsPSYfd7V5dapTpUPjmeioLBRNzEoo+Jdeef1NOwW5qtU/Ip0HcvhTOa B35E5BpyCMILEG/AI7JYr8AMbTZ75lrrtRukiDiDrwvg8F8cmJE9/r/xdAlkEJQqHcKn 2dJBMLeWrwcX7jpnqQmsjwKp94DeSeTxAhg3NhOW6Tct5nWBAKucie0tjKrI0zE/l1uK HaW83YzF5zdsgfMS0wDWDTa3yqC8x6WNd1aobB1sVxUzZwX8nxNHK4cwCQAlEjR6Nudr FMMQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1679296489; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=zfkxMOxAGfgWPUNtNItdws/lAov0WMT4Tn6jUy/fV/w=; b=vbdaUXzT9pfXHPw9PSM0MQckNJ/9ae1KyRg8tPaEICVSQpA8YVFAoh9HvaCiaQRJmu vn4WDc5EHbctNmKfucyUilPVjkqivP31S7nIVaa7EGr+sqbSfdP7ASalfHMme+RklYJX OSStj+jrbP5VFr0am6kVwvdHzg0SEmj38apuZ6zb1bgS5iX852OPUqMlo4hRHTl96Gbg nMEgWCEcO8mSS52RaMGDlBzeTruocj15RsLVDB0+w3mgRTgiYwVby3HxWOoNnrEeXkbk qGiN9hpnUOV7bqyTaZpA+GRTXWUKATiSxcRsk4dCXkugogsR/JIIAbHSVx7seBACK4W7 g2oA== X-Gm-Message-State: AO0yUKWA5klA6l0J1FsSBUoDl/jVewfyH/t4HFjuUVkumUeSaotrdOnl GKKQNdtfJRPDK2hbYY/7S4g= X-Received: by 2002:a5d:4046:0:b0:2ce:ac70:5113 with SMTP id w6-20020a5d4046000000b002ceac705113mr12797192wrp.41.1679296489069; Mon, 20 Mar 2023 00:14:49 -0700 (PDT) Received: from PC-PEDRO-ARCH.lan ([2001:8a0:7280:5801:9441:3dce:686c:bfc7]) by smtp.gmail.com with ESMTPSA id e23-20020a5d5957000000b002cfefa50a8esm8092442wri.98.2023.03.20.00.14.48 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 20 Mar 2023 00:14:48 -0700 (PDT) From: Pedro Falcato <pedro.falcato@gmail.com> To: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org>, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Pedro Falcato <pedro.falcato@gmail.com> Subject: [PATCH] do_open(): Fix O_DIRECTORY | O_CREAT behavior Date: Mon, 20 Mar 2023 07:14:42 +0000 Message-Id: <20230320071442.172228-1-pedro.falcato@gmail.com> X-Mailer: git-send-email 2.39.2 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM, RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1760870360907059228?= X-GMAIL-MSGID: =?utf-8?q?1760870360907059228?= |
Series |
do_open(): Fix O_DIRECTORY | O_CREAT behavior
|
|
Commit Message
Pedro Falcato
March 20, 2023, 7:14 a.m. UTC
On Linux, open(O_DIRECTORY | O_CREAT) has historically meant "open
directory or create a regular file". This has remained mostly true,
except open(O_DIR | O_CREAT) has started returning an error *while
creating the file*. Restore the old behavior.
Signed-off-by: Pedro Falcato <pedro.falcato@gmail.com>
---
I did not explicitly add a Fixes: tag because I was unable to bisect this locally,
but it seems to me that this was introduced in the path walking refactoring done in early 2020.
Al, if you have a rough idea of what may have added this bug, feel free to add a Fixes.
This should also probably get CC'd to stable, but I'll leave this to your criteria.
fs/namei.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
Comments
On Mon, Mar 20, 2023 at 07:14:42AM +0000, Pedro Falcato wrote: > On Linux, open(O_DIRECTORY | O_CREAT) has historically meant "open > directory or create a regular file". This has remained mostly true, > except open(O_DIR | O_CREAT) has started returning an error *while > creating the file*. Restore the old behavior. > > Signed-off-by: Pedro Falcato <pedro.falcato@gmail.com> > --- > I did not explicitly add a Fixes: tag because I was unable to bisect this locally, This dates back to the lookup rework done for v5.7. Specifically, this was introduced by: Fixes: 973d4b73fbaf ("do_last(): rejoin the common path even earlier in FMODE_{OPENED,CREATED} case") > but it seems to me that this was introduced in the path walking refactoring done in early 2020. > Al, if you have a rough idea of what may have added this bug, feel free to add a Fixes. > > This should also probably get CC'd to stable, but I'll leave this to your criteria. > fs/namei.c | 14 ++++++++++++-- > 1 file changed, 12 insertions(+), 2 deletions(-) > > diff --git a/fs/namei.c b/fs/namei.c > index edfedfbccae..7b26db2f0f8 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -3540,8 +3540,18 @@ static int do_open(struct nameidata *nd, > if (unlikely(error)) > return error; > } > - if ((nd->flags & LOOKUP_DIRECTORY) && !d_can_lookup(nd->path.dentry)) > - return -ENOTDIR; > + > + if ((open_flag & (O_DIRECTORY | O_CREAT)) != (O_DIRECTORY | O_CREAT) || > + !(file->f_mode & FMODE_CREATED)) { > + /* O_DIRECTORY | O_CREAT has the strange property of being the > + * only open(O_DIRECTORY) lookup that can create and return a > + * regular file *if we indeed did create*. Because of this, > + * only return -ENOTDIR if we're not O_DIR | O_CREAT or if we > + * did not create a file. > + */ So before we continue down that road should we maybe treat this as a chance to fix the old bug? Because this behavior of returning -ENOTDIR has existed ever since v5.7 now. Since that time we had three LTS releases all returning ENOTDIR even if the file was created. So yeah, we could return to the old behavior. But we could also see this as a chance to get rid of the bug. IOW, fail right at O_DIRECTORY | O_CREAT with ENOTDIR and not even create the file anymore. No one has really noticed this from 5.7 until now and afaict this has been a bug ever since the dark ages and is mentioned as a bug on man 2 openat.
On Mon, Mar 20, 2023 at 4:52 AM Christian Brauner <brauner@kernel.org> wrote: > > So before we continue down that road should we maybe treat this as a > chance to fix the old bug? Because this behavior of returning -ENOTDIR > has existed ever since v5.7 now. Since that time we had three LTS > releases all returning ENOTDIR even if the file was created. Ack. I think considering that the return value has been broken for so long, I think we can pretty much assume that there are no actual users of it, and we might as well clean up the semantics properly. Willing to send that patch in and we'll get it tested in the crucible of the real world? Linus
On Mon, Mar 20, 2023 at 5:14 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Mon, Mar 20, 2023 at 4:52 AM Christian Brauner <brauner@kernel.org> wrote: > > > > So before we continue down that road should we maybe treat this as a > > chance to fix the old bug? Because this behavior of returning -ENOTDIR > > has existed ever since v5.7 now. Since that time we had three LTS > > releases all returning ENOTDIR even if the file was created. > > Ack. > > I think considering that the return value has been broken for so long, > I think we can pretty much assume that there are no actual users of > it, and we might as well clean up the semantics properly. > > Willing to send that patch in and we'll get it tested in the crucible > of the real world? That sounds good to me, I can do that. What kind of new semantics are we talking about here? I originally found this when testing every kind of possibly-odd edge case in path walking/open(2). From the systems I've tested on (not too diverse, basically NetBSD, FreeBSD, Linux 6.2.2, and now, when looking for a regression, a variety of kernels since 2009), 4 behaviors occurred: 1) Pre v5.7 Linux did the open-dir-if-exists-else-create-regular-file we all know and """love""". 2) Post 5.7, we started returning this buggy -ENOTDIR error, even when successfully creating a file. 3) NetBSD just straight up returns EINVAL on open(O_DIRECTORY | O_CREAT) 4) FreeBSD's open(O_CREAT | O_DIRECTORY) succeeds if the file exists and is a directory. Fails with -ENOENT if it falls onto the "O_CREAT" path (i.e it doesn't try to create the file at all, just ENOENT's; this changed relatively recently, in 2015) Note that all of these behaviors are allowed by POSIX (in fact, I would not call the old Linux behavior a *bug*, just really odd semantics). So, again, what kind of behavior change do we want here? IMO, the best and least destructive would probably be to emulate FreeBSD behavior here. I don't think open(O_DIRECTORY | O_CREAT) returning -ENOTDIR if it doesn't exist (as Christian suggested, I think?) makes any sort of sense here. Just my 2c.
On Mon, Mar 20, 2023 at 12:27 PM Pedro Falcato <pedro.falcato@gmail.com> wrote: > > 1) Pre v5.7 Linux did the open-dir-if-exists-else-create-regular-file > we all know and """love""". So I think we should fall back to this as a last resort, as a "well, it's our historical behavior". > 2) Post 5.7, we started returning this buggy -ENOTDIR error, even when > successfully creating a file. Yeah, I think this is the worst of the bunch and has no excuse (unless some crazy program has started depending on it, which sounds really *really* unlikely). > 3) NetBSD just straight up returns EINVAL on open(O_DIRECTORY | O_CREAT) > 4) FreeBSD's open(O_CREAT | O_DIRECTORY) succeeds if the file exists > and is a directory. Fails with -ENOENT if it falls onto the "O_CREAT" > path (i.e it doesn't try to create the file at all, just ENOENT's; > this changed relatively recently, in 2015) Either of these sound sensible to me. I suspect (3) is the clearest case. And (4) might be warranted just because it's closer to what we used to do, and it's *possible* that somebody happens to use O_DIRECTORY | O_CREAT on directories that exist, and never noticed how broken that was. And (4) has another special case: O_EXCL. Because I'm really hoping that O_DIRECTORY | O_EXCL will always fail. Is the proper patch something along the lines of this? --- a/fs/open.c +++ b/fs/open.c @@ -1186,6 +1186,8 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op) /* Deal with the mode. */ if (WILL_CREATE(flags)) { + if (flags & O_DIRECTORY) + return -EINVAL; if (how->mode & ~S_IALLUGO) return -EINVAL; op->mode = how->mode | S_IFREG; I dunno. Not tested, not thought about very much. What about O_PATH? I guess it's fine to create a file and only get a path fd to the result? Linus
On 2023-03-20, Linus Torvalds <torvalds@linux-foundation.org> wrote: > What about O_PATH? I guess it's fine to create a file and only get a > path fd to the result? O_PATH ignores the O_CREAT flag so it's the same as just passing O_PATH. This is the case for all flags not in O_PATH_FLAGS (so only O_DIRECTORY, O_NOFOLLOW, O_CLOEXEC have an effect on O_PATH). With openat2, passing any other flag with O_PATH returns -EINVAL.
On Mon, Mar 20, 2023 at 01:24:52PM -0700, Linus Torvalds wrote: > On Mon, Mar 20, 2023 at 12:27 PM Pedro Falcato <pedro.falcato@gmail.com> wrote: > > > > 1) Pre v5.7 Linux did the open-dir-if-exists-else-create-regular-file > > we all know and """love""". > > So I think we should fall back to this as a last resort, as a "well, > it's our historical behavior". > > > 2) Post 5.7, we started returning this buggy -ENOTDIR error, even when > > successfully creating a file. > > Yeah, I think this is the worst of the bunch and has no excuse (unless > some crazy program has started depending on it, which sounds really > *really* unlikely). > > > 3) NetBSD just straight up returns EINVAL on open(O_DIRECTORY | O_CREAT) > > 4) FreeBSD's open(O_CREAT | O_DIRECTORY) succeeds if the file exists > > and is a directory. Fails with -ENOENT if it falls onto the "O_CREAT" > > path (i.e it doesn't try to create the file at all, just ENOENT's; > > this changed relatively recently, in 2015) > > Either of these sound sensible to me. > > I suspect (3) is the clearest case. Yeah, we should try that. > > And (4) might be warranted just because it's closer to what we used to > do, and it's *possible* that somebody happens to use O_DIRECTORY | > O_CREAT on directories that exist, and never noticed how broken that > was. I really really hope that isn't the case because (4) is pretty nasty. Having to put this on a manpage seems nightmarish. > > And (4) has another special case: O_EXCL. Because I'm really hoping > that O_DIRECTORY | O_EXCL will always fail. I've detailed the semantics for that in the commit message below... > > Is the proper patch something along the lines of this? Yeah, I think that would do it and I think that's what we should try to get away with. I just spent time and took a look who passes O_DIRECTORY and O_CREAT and interestingly there are a number of comments roughly along the lines of the following example: /* Ideally we could use openat(O_DIRECTORY | O_CREAT | O_EXCL) here * to create and open the directory atomically suggests that people who specify O_DIRECTORY | O_CREAT are interested in creating a directory. But since this never did work they don't tend to use that flag combination (I've collected a few samples in [1] to [4].). (As a sidenote, posix made an interpretation change a long time ago to at least allow for O_DIRECTORY | O_CREAT to create a directory (see [3]). But that's a whole different can of worms and I haven't spent any thoughts even on feasibility. And even if we should probably get through a couple of kernels with O_DIRECTORY | O_CREAT failing with EINVAL first.) > > --- a/fs/open.c > +++ b/fs/open.c > @@ -1186,6 +1186,8 @@ inline int build_open_flags(const struct > open_how *how, struct open_flags *op) > > /* Deal with the mode. */ > if (WILL_CREATE(flags)) { > + if (flags & O_DIRECTORY) > + return -EINVAL; This will be problematic because for weird historical reasons O_TMPFILE includes O_DIRECTORY so this would unfortunately break O_TMPFILE. :/ I'll try to have a patch ready in a bit. I spent a long time digging through potential users of this nonsense. Link: https://lore.kernel.org/lkml/20230320071442.172228-1-pedro.falcato@gmail.com Link: https://sources.debian.org/src/flatpak/1.14.4-1/subprojects/libglnx/glnx-dirfd.c/?hl=324#L324 [1] Link: https://sources.debian.org/src/flatpak-builder/1.2.3-1/subprojects/libglnx/glnx-shutil.c/?hl=251#L251 [2] Link: https://sources.debian.org/src/ostree/2022.7-2/libglnx/glnx-dirfd.c/?hl=324#L324 [3] Link: https://www.openwall.com/lists/oss-security/2014/11/26/14 [4]
On Tue, Mar 21, 2023 at 03:24:19PM +0100, Christian Brauner wrote: > On Mon, Mar 20, 2023 at 01:24:52PM -0700, Linus Torvalds wrote: > > On Mon, Mar 20, 2023 at 12:27 PM Pedro Falcato <pedro.falcato@gmail.com> wrote: > > > > > > 1) Pre v5.7 Linux did the open-dir-if-exists-else-create-regular-file > > > we all know and """love""". > > > > So I think we should fall back to this as a last resort, as a "well, > > it's our historical behavior". > > > > > 2) Post 5.7, we started returning this buggy -ENOTDIR error, even when > > > successfully creating a file. > > > > Yeah, I think this is the worst of the bunch and has no excuse (unless > > some crazy program has started depending on it, which sounds really > > *really* unlikely). > > > > > 3) NetBSD just straight up returns EINVAL on open(O_DIRECTORY | O_CREAT) > > > 4) FreeBSD's open(O_CREAT | O_DIRECTORY) succeeds if the file exists > > > and is a directory. Fails with -ENOENT if it falls onto the "O_CREAT" > > > path (i.e it doesn't try to create the file at all, just ENOENT's; > > > this changed relatively recently, in 2015) > > > > Either of these sound sensible to me. > > > > I suspect (3) is the clearest case. > > Yeah, we should try that. > > > > > And (4) might be warranted just because it's closer to what we used to > > do, and it's *possible* that somebody happens to use O_DIRECTORY | > > O_CREAT on directories that exist, and never noticed how broken that > > was. > > I really really hope that isn't the case because (4) is pretty nasty. > Having to put this on a manpage seems nightmarish. > > > > > And (4) has another special case: O_EXCL. Because I'm really hoping > > that O_DIRECTORY | O_EXCL will always fail. > > I've detailed the semantics for that in the commit message below... > > > > > Is the proper patch something along the lines of this? > > Yeah, I think that would do it and I think that's what we should try to > get away with. I just spent time and took a look who passes O_DIRECTORY > and O_CREAT and interestingly there are a number of comments roughly > along the lines of the following example: > > /* Ideally we could use openat(O_DIRECTORY | O_CREAT | O_EXCL) here > * to create and open the directory atomically > > suggests that people who specify O_DIRECTORY | O_CREAT are interested in > creating a directory. But since this never did work they don't tend to > use that flag combination (I've collected a few samples in [1] to [4].). > > (As a sidenote, posix made an interpretation change a long time ago to > at least allow for O_DIRECTORY | O_CREAT to create a directory (see [3]). > > But that's a whole different can of worms and I haven't spent any > thoughts even on feasibility. And even if we should probably get through > a couple of kernels with O_DIRECTORY | O_CREAT failing with EINVAL first.) > > > > > --- a/fs/open.c > > +++ b/fs/open.c > > @@ -1186,6 +1186,8 @@ inline int build_open_flags(const struct > > open_how *how, struct open_flags *op) > > > > /* Deal with the mode. */ > > if (WILL_CREATE(flags)) { > > + if (flags & O_DIRECTORY) > > + return -EINVAL; > > This will be problematic because for weird historical reasons O_TMPFILE > includes O_DIRECTORY so this would unfortunately break O_TMPFILE. :/ > I'll try to have a patch ready in a bit. So, had to do some testing first: This survives xfstests: sudo ./check -g unlink,idmapped which pounds on the creation and O_TMPFILE paths. It also survives LTP: # The following includes openat2, openat, open, fsopen, open_tree, etc. sudo ./runtltp -f syscalls open I've also written a (very nasty) test script: #define _GNU_SOURCE #include <fcntl.h> #include <stdio.h> #include <stdlib.h> int main(int argc, char *argv[]) { int fd; fd = open("/tmp/TEST", O_DIRECTORY | O_CREAT); if (fd >= 0) printf("%d\n", fd); else printf("%m: fd(%d)\n", fd); fd = open("/tmp/TEST", O_DIRECTORY | O_CREAT | O_EXCL); if (fd >= 0) printf("%d\n", fd); else printf("%m: fd(%d)\n", fd); fd = open("/tmp/TEST", O_DIRECTORY | O_EXCL); if (fd >= 0) printf("%d\n", fd); else printf("%m: fd(%d)\n", fd); exit(EXIT_SUCCESS); } ubuntu@vm1:~/ssd$ sudo ./create_test Invalid argument: fd(-1) Invalid argument: fd(-1) No such file or directory: fd(-1) I think this should work. From the commit message it's hopefully clear that this is proper semantic change. But I think we might be able to pull this off given how rare this combination should be and how unnoticed the current regression has gone and for how long... So I came up with the following description trying to detail the semantics prior to v5.7, post v5.7 up until this patch, and then after the patch. Linus, I added your SOB to this but tell me if you'd rather not be associated with this potential mess... It would be very nice if we had tests for the new behavior. So if @Pedro would be up for it that would be highly appreciated. If not I'll put it on my ToDo... So I can carry this and sent a pr or it can be picked up directly. I'm not fuzzed. Hopefully there are no surprises or objections... From 6bc6e6a4c9ed0dcbe0c85cbbaca90953e27889e5 Mon Sep 17 00:00:00 2001 From: Christian Brauner <brauner@kernel.org> Date: Tue, 21 Mar 2023 09:18:07 +0100 Subject: [PATCH] open: return EINVAL for O_DIRECTORY | O_CREAT After a couple of years and multiple LTS releases we received a report that the behavior of O_DIRECTORY | O_CREAT changed starting with v5.7. On kernels prior to v5.7 combinations of O_DIRECTORY, O_CREAT, O_EXCL had the following semantics: (1) open("/tmp/d", O_DIRECTORY | O_CREAT) * d doesn't exist: create regular file * d exists and is a regular file: ENOTDIR * d exists and is a directory: EISDIR (2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL) * d doesn't exist: create regular file * d exists and is a regular file: EEXIST * d exists and is a directory: EEXIST (3) open("/tmp/d", O_DIRECTORY | O_EXCL) * d doesn't exist: ENOENT * d exists and is a regular file: ENOTDIR * d exists and is a directory: open directory On kernels since to v5.7 combinations of O_DIRECTORY, O_CREAT, O_EXCL have the following semantics: (1) open("/tmp/d", O_DIRECTORY | O_CREAT) * d doesn't exist: ENOTDIR (create regular file) * d exists and is a regular file: ENOTDIR * d exists and is a directory: EISDIR (2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL) * d doesn't exist: ENOTDIR (create regular file) * d exists and is a regular file: EEXIST * d exists and is a directory: EEXIST (3) open("/tmp/d", O_DIRECTORY | O_EXCL) * d doesn't exist: ENOENT * d exists and is a regular file: ENOTDIR * d exists and is a directory: open directory This is a fairly substantial semantic change that userspace didn't notice until Pedro took the time to deliberately figure out corner cases. Since no one noticed this breakage we can somewhat safely assume that O_DIRECTORY | O_CREAT combinations are likely unused. The v5.7 breakage is especially weird because while ENOTDIR is returned indicating failure a regular file is actually created. This doesn't make a lot of sense. Time was spent finding potential users of this combination. Searching on codesearch.debian.net showed that codebases often express semantical expectations about O_DIRECTORY | O_CREAT which are completely contrary to what our code has done and currently does. The expectation often is that this particular combination would create and open a directory. This suggests users who tried to use that combination would stumble upon the counterintuitive behavior no matter if pre-v5.7 or post v5.7 and quickly realize neither semantics give them what they want. For some examples see the code examples in [1] to [3] and the discussion in [4]. There are various ways to address this issue. The lazy/simple option would be to restore the pre-v5.7 behavior and to just live with that bug forever. But since there's a real chance that the O_DIRECTORY | O_CREAT quirk isn't relied upon we should try to get away with murder(ing bad semantics) first. If we need to Frankenstein pre-v5.7 behavior later so be it. So let's simply return EINVAL categorically for O_DIRECTORY | O_CREAT combinations. In addition to cleaning up the old bug this also opens up the possiblity to make that flag combination do something more intuitive in the future. Starting with this commit the following semantics apply: (1) open("/tmp/d", O_DIRECTORY | O_CREAT) * d doesn't exist: EINVAL * d exists and is a regular file: EINVAL * d exists and is a directory: EINVAL (2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL) * d doesn't exist: EINVAL * d exists and is a regular file: EINVAL * d exists and is a directory: EINVAL (3) open("/tmp/d", O_DIRECTORY | O_EXCL) * d doesn't exist: ENOENT * d exists and is a regular file: ENOTDIR * d exists and is a directory: open directory One additional note, O_TMPFILE is implemented as: #define __O_TMPFILE 020000000 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY) #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT) For older kernels it was important to return an explicit error when O_TMPFILE wasn't supported. So it is implemented to look like O_DIRECTORY | O_RDWR but without O_CREAT. The reason was that O_DIRECTORY | O_CREAT used to work and created a regular file. Allowing it to be specified would've potentially caused a regular file to be created on older kernels while the caller would believe they created an actual O_TMPFILE. So instead O_TMPFILE has included O_DIRECTORY | O_RDWR and the code uses O_TMPFILE_MASK to check that O_CREAT isn't specified returning EINVAL if it is. With this patch, we categorically reject O_DIRECTORY | O_CREAT and return EINVAL. That means, we could write this in a way that makes the if ((flags & O_TMPFILE_MASK) != O_TMPFILE) check superfluous but let's not do that. The check documents the peculiar aspects of O_TMPFILE quite nicely and can serve as a reminder how easy it is to break. As Aleksa pointed out O_PATH is unaffected by this change since it always returned EINVAL if O_CREAT was specified - with or without O_DIRECTORY. Link: https://lore.kernel.org/lkml/20230320071442.172228-1-pedro.falcato@gmail.com Link: https://sources.debian.org/src/flatpak/1.14.4-1/subprojects/libglnx/glnx-dirfd.c/?hl=324#L324 [1] Link: https://sources.debian.org/src/flatpak-builder/1.2.3-1/subprojects/libglnx/glnx-shutil.c/?hl=251#L251 [2] Link: https://sources.debian.org/src/ostree/2022.7-2/libglnx/glnx-dirfd.c/?hl=324#L324 [3] Link: https://www.openwall.com/lists/oss-security/2014/11/26/14 [4] Reported-by: Pedro Falcato <pedro.falcato@gmail.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> --- This survives xfstests: sudo ./check -g unlink,idmapped which pounds on the creation and O_TMPFILE paths. It also survives LTP: sudo ./runtltp -f syscalls openat2 sudo ./runtltp -f syscalls openat # The following includes openat2, openat, open, fsopen, open_tree, etc. sudo ./runtltp -f syscalls open sudo ./runtltp -f syscalls create_file sudo ./runtltp -f syscalls create_files I've also written a (very nasty) test script: #define _GNU_SOURCE #include <fcntl.h> #include <stdio.h> #include <stdlib.h> int main(int argc, char *argv[]) { int fd; fd = open("/tmp/TEST", O_DIRECTORY | O_CREAT); if (fd >= 0) printf("%d\n", fd); else printf("%m: fd(%d)\n", fd); fd = open("/tmp/TEST", O_DIRECTORY | O_CREAT | O_EXCL); if (fd >= 0) printf("%d\n", fd); else printf("%m: fd(%d)\n", fd); fd = open("/tmp/TEST", O_DIRECTORY | O_EXCL); if (fd >= 0) printf("%d\n", fd); else printf("%m: fd(%d)\n", fd); exit(EXIT_SUCCESS); } ubuntu@vm1:~/ssd$ sudo ./create_test Invalid argument: fd(-1) Invalid argument: fd(-1) No such file or directory: fd(-1) --- fs/open.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/fs/open.c b/fs/open.c index 4401a73d4032..39e360f9490d 100644 --- a/fs/open.c +++ b/fs/open.c @@ -1135,6 +1135,8 @@ struct file *open_with_fake_path(const struct path *path, int flags, EXPORT_SYMBOL(open_with_fake_path); #define WILL_CREATE(flags) (flags & (O_CREAT | __O_TMPFILE)) +#define INVALID_CREATE(flags) \ + ((flags & (O_DIRECTORY | O_CREAT)) == (O_DIRECTORY | O_CREAT)) #define O_PATH_FLAGS (O_DIRECTORY | O_NOFOLLOW | O_PATH | O_CLOEXEC) inline struct open_how build_open_how(int flags, umode_t mode) @@ -1207,6 +1209,10 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op) if (!(acc_mode & MAY_WRITE)) return -EINVAL; } + + if (INVALID_CREATE(flags)) + return -EINVAL; + if (flags & O_PATH) { /* O_PATH only permits certain other flags to be set. */ if (flags & ~O_PATH_FLAGS)
On Tue, Mar 21, 2023 at 9:17 AM Christian Brauner <brauner@kernel.org> wrote: > > #define WILL_CREATE(flags) (flags & (O_CREAT | __O_TMPFILE)) > +#define INVALID_CREATE(flags) \ > + ((flags & (O_DIRECTORY | O_CREAT)) == (O_DIRECTORY | O_CREAT)) > #define O_PATH_FLAGS (O_DIRECTORY | O_NOFOLLOW | O_PATH | O_CLOEXEC) > > inline struct open_how build_open_how(int flags, umode_t mode) > @@ -1207,6 +1209,10 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op) > if (!(acc_mode & MAY_WRITE)) > return -EINVAL; > } > + > + if (INVALID_CREATE(flags)) > + return -EINVAL; > + > if (flags & O_PATH) { > /* O_PATH only permits certain other flags to be set. */ > if (flags & ~O_PATH_FLAGS) So the patch looks simple enough, but (a) I'm not entirely sure I like the extra indirection through another #define. This impenetrable thicket of different macros makes it a bit hard to see what is going on. I'm not blaming you for it, it predates this patch, but.. (b) this seems to make that O_TMPFILE_MASK macro pointless. I think (b) kind of re-inforces the point of (a) here. The only reason for O_TMPFILE_MASK is literally that old historical "make sure old kernels return errors when they don't support O_TEMPFILE", and thus the magic re-use of old bit patterns. But now that we do that "return error if both O_DIRECTORY and O_CREAT are set", the O_TMPFILE_MASK check is basically dead, because it ends up checking for that same bit pattern except also __O_TMPFILE. And that is *not* obvious from the code, exactly because of that thicket of different macros. In fact, since that whole if ((flags & O_TMPFILE_MASK) != O_TMPFILE) return -EINVAL; is done inside an "if (flags & __O_TMPFILE)", the compiler might as well reduce it *exactly* down to that exact same test as INVALID_CREATE() now is. So I really get the feeling that the macros actually hide what is going on, and are the exact opposite of being helpful. Case in point: with your patch, you now have the exact same test twice in a row, except it *looks* like two different tests and one of them is conditional on __O_TMPFILE. For all I know, the compiler may actually notice the redundancy and remove one of them, but we shouldn't write bad code with the expectation that "the compiler will fix it up". Particularly when it just makes it harder for people to understand too. Linus
On Tue, Mar 21, 2023 at 10:35:47AM -0700, Linus Torvalds wrote: > On Tue, Mar 21, 2023 at 9:17 AM Christian Brauner <brauner@kernel.org> wrote: > > > > #define WILL_CREATE(flags) (flags & (O_CREAT | __O_TMPFILE)) > > +#define INVALID_CREATE(flags) \ > > + ((flags & (O_DIRECTORY | O_CREAT)) == (O_DIRECTORY | O_CREAT)) > > #define O_PATH_FLAGS (O_DIRECTORY | O_NOFOLLOW | O_PATH | O_CLOEXEC) > > > > inline struct open_how build_open_how(int flags, umode_t mode) > > @@ -1207,6 +1209,10 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op) > > if (!(acc_mode & MAY_WRITE)) > > return -EINVAL; > > } > > + > > + if (INVALID_CREATE(flags)) > > + return -EINVAL; > > + > > if (flags & O_PATH) { > > /* O_PATH only permits certain other flags to be set. */ > > if (flags & ~O_PATH_FLAGS) > > So the patch looks simple enough, but > > (a) I'm not entirely sure I like the extra indirection through > another #define. This impenetrable thicket of different macros makes > it a bit hard to see what is going on. I'm not blaming you for it, it > predates this patch, but.. > > (b) this seems to make that O_TMPFILE_MASK macro pointless. So I tried to justify this in the commit message but it might've gotten lost in there. My thinking had been: With this patch, we categorically reject O_DIRECTORY | O_CREAT and return EINVAL. That means, we could write this in a way that makes the if ((flags & O_TMPFILE_MASK) != O_TMPFILE) check superfluous but let's not do that. The check documents the peculiar aspects of O_TMPFILE quite nicely and can serve as a reminder how easy it is to break. But I'm not married to keeping it and it could be misleading. > > I think (b) kind of re-inforces the point of (a) here. > > The only reason for O_TMPFILE_MASK is literally that old historical > "make sure old kernels return errors when they don't support > O_TEMPFILE", and thus the magic re-use of old bit patterns. > > But now that we do that "return error if both O_DIRECTORY and O_CREAT > are set", the O_TMPFILE_MASK check is basically dead, because it ends > up checking for that same bit pattern except also __O_TMPFILE. Yes. > > And that is *not* obvious from the code, exactly because of that > thicket of different macros. > > In fact, since that whole > > if ((flags & O_TMPFILE_MASK) != O_TMPFILE) > return -EINVAL; > > is done inside an "if (flags & __O_TMPFILE)", the compiler might as > well reduce it *exactly* down to that exact same test as > INVALID_CREATE() now is. > > So I really get the feeling that the macros actually hide what is > going on, and are the exact opposite of being helpful. Case in point: > with your patch, you now have the exact same test twice in a row, > except it *looks* like two different tests and one of them is > conditional on __O_TMPFILE. Yeah, see above why I did that originally. > > For all I know, the compiler may actually notice the redundancy and > remove one of them, but we shouldn't write bad code with the > expectation that "the compiler will fix it up". Particularly when it > just makes it harder for people to understand too. But yes, that is a valid complaint so - without having tested - sm like: diff --git a/fs/open.c b/fs/open.c index 4401a73d4032..3c5227d84cfe 100644 --- a/fs/open.c +++ b/fs/open.c @@ -1195,6 +1195,13 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op) op->mode = 0; } + /* + * Block nonsensical creation requests and ensure that O_CREAT isn't + * slipped alongside O_TMPFILE which relies on O_DIRECTORY. + */ + if ((flags & (O_DIRECTORY | O_CREAT)) == (O_DIRECTORY | O_CREAT)) + return -EINVAL; + /* * In order to ensure programs get explicit errors when trying to use * O_TMPFILE on old kernels, O_TMPFILE is implemented such that it @@ -1202,7 +1209,7 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op) * have to require userspace to explicitly set it. */ if (flags & __O_TMPFILE) { - if ((flags & O_TMPFILE_MASK) != O_TMPFILE) + if ((flags & O_TMPFILE) != O_TMPFILE) return -EINVAL; if (!(acc_mode & MAY_WRITE)) return -EINVAL; diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h index 1ecdb911add8..80f37a0d40d7 100644 --- a/include/uapi/asm-generic/fcntl.h +++ b/include/uapi/asm-generic/fcntl.h @@ -91,7 +91,6 @@ /* a horrid kludge trying to make sure that this will fail on old kernels */ #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY) -#define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT) #ifndef O_NDELAY #define O_NDELAY O_NONBLOCK diff --git a/tools/include/uapi/asm-generic/fcntl.h b/tools/include/uapi/asm-generic/fcntl.h index b02c8e0f4057..1c7a0f6632c0 100644 --- a/tools/include/uapi/asm-generic/fcntl.h +++ b/tools/include/uapi/asm-generic/fcntl.h @@ -91,7 +91,6 @@ /* a horrid kludge trying to make sure that this will fail on old kernels */ #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY) -#define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT) #ifndef O_NDELAY #define O_NDELAY O_NONBLOCK -- 2.34.1
On Tue, Mar 21, 2023 at 1:16 PM Christian Brauner <brauner@kernel.org> wrote: > > But yes, that is a valid complaint so - without having tested - sm like: I'd actually go a bit further, and really spell all the bits out explicitly. I mean, I was *literally* involved in that original O_TMPFILE_MASK thing: https://lore.kernel.org/all/CA+55aFxA3qoM5wpMUya7gEA8SZyJep7kMBRjrPOsOm_OudD8aQ@mail.gmail.com/ with the whole O_DIRECOTY games to make O_TMPFILE safer, but despite that I didn't remember this at all, and my suggested "maybe something like this" patch was broken for the O_TMPFILE case. So while we do have all this documented in our history (both git commit logs and lore.kernel.org), I actually think it would be lovely to just make build_open_flags() be very explicit about all the exact O_xyz flags, and really write out the logic fully. For example, even your clarified version that gets rid of the "O_TMPFILE_MASK" thing still eends up doing if (flags & __O_TMPFILE) { if ((flags & O_TMPFILE) != O_TMPFILE) return -EINVAL; and so when you look at that code, you don't actually realize that O_TMPFILE _cotnains_ that __O_TMPFILE bit, and what the above really means is "also check O_DIRECTORY". So considering how I couldn't remember this mess myself, despite having been involved with it personally (a decade ago..), I really do think that maybe this shoudl be open-coded with a comment, and the above code should instead be if (flags & __O_TMPFILE) { if (!(flags & O_DIRECTORY)) return -EINVAL; together with an explicit comment about how O_TMPFILE is the *combination* of __O_TMPFILE and O_DIRECTORY, along with a short explanation as to why. Now, I agree that that test for O_DIRECTORY then _looks_ odd, but the thing is, it then makes the reality of this all much more explicit. In contrast, doing that if ((flags & O_TMPFILE) != O_TMPFILE) may *look* more natural in that context, but if you actually start thinking about it, that check makes no sense unless you then look up what O_TMPFILE is, and the history behind it. So I'd rather have code that looks a bit odd, but that explains itself and is explicit about what it does, than code that _tries_ to look natural but actually hides the reason for what it is doing. And then next time somebody looks at that O_DIRECTORY | O_CREAT combination, suddenly the __O_TMPFILE interaction is there, and very explicit. Hmm? I don't feel *hugely* strongly about this, so in the end I'll bow to your decision, but considering that my initial patch looked sane but was buggy because I had forgotten about O_TMPFILE, I really think we should make this more explicit at a source level.. Linus Linus
On Tue, Mar 21, 2023 at 02:47:55PM -0700, Linus Torvalds wrote: > On Tue, Mar 21, 2023 at 1:16 PM Christian Brauner <brauner@kernel.org> wrote: > > > > But yes, that is a valid complaint so - without having tested - sm like: > > I'd actually go a bit further, and really spell all the bits out explicitly. > > I mean, I was *literally* involved in that original O_TMPFILE_MASK thing: > > https://lore.kernel.org/all/CA+55aFxA3qoM5wpMUya7gEA8SZyJep7kMBRjrPOsOm_OudD8aQ@mail.gmail.com/ > > with the whole O_DIRECOTY games to make O_TMPFILE safer, but despite > that I didn't remember this at all, and my suggested "maybe something > like this" patch was broken for the O_TMPFILE case. > > So while we do have all this documented in our history (both git > commit logs and lore.kernel.org), I actually think it would be lovely > to just make build_open_flags() be very explicit about all the exact > O_xyz flags, and really write out the logic fully. > > For example, even your clarified version that gets rid of the > "O_TMPFILE_MASK" thing still eends up doing > > if (flags & __O_TMPFILE) { > if ((flags & O_TMPFILE) != O_TMPFILE) > return -EINVAL; > > and so when you look at that code, you don't actually realize that > O_TMPFILE _cotnains_ that __O_TMPFILE bit, and what the above really > means is "also check O_DIRECTORY". > > So considering how I couldn't remember this mess myself, despite > having been involved with it personally (a decade ago..), I really do > think that maybe this shoudl be open-coded with a comment, and the > above code should instead be > > if (flags & __O_TMPFILE) { > if (!(flags & O_DIRECTORY)) > return -EINVAL; > > together with an explicit comment about how O_TMPFILE is the > *combination* of __O_TMPFILE and O_DIRECTORY, along with a short > explanation as to why. > > Now, I agree that that test for O_DIRECTORY then _looks_ odd, but the > thing is, it then makes the reality of this all much more explicit. > > In contrast, doing that > > if ((flags & O_TMPFILE) != O_TMPFILE) > > may *look* more natural in that context, but if you actually start > thinking about it, that check makes no sense unless you then look up > what O_TMPFILE is, and the history behind it. > > So I'd rather have code that looks a bit odd, but that explains itself > and is explicit about what it does, than code that _tries_ to look > natural but actually hides the reason for what it is doing. > > And then next time somebody looks at that O_DIRECTORY | O_CREAT > combination, suddenly the __O_TMPFILE interaction is there, and very > explicit. > > Hmm? > > I don't feel *hugely* strongly about this, so in the end I'll bow to > your decision, but considering that my initial patch looked sane but > was buggy because I had forgotten about O_TMPFILE, I really think we > should make this more explicit at a source level.. I don't feel strongly about this either and your points are valid. So I incorporated that and updated the comments in the code. In case you'd like to take another look I've now put this up at: The following changes since commit e8d018dd0257f744ca50a729e3d042cf2ec9da65: Linux 6.3-rc3 (2023-03-19 13:27:55 -0700) are available in the Git repository at: ssh://git@gitolite.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping.git tags/vfs.open.directory.creat.einval for you to fetch changes up to 43b450632676fb60e9faeddff285d9fac94a4f58: open: return EINVAL for O_DIRECTORY | O_CREAT (2023-03-22 11:06:55 +0100) ---------------------------------------------------------------- vfs.open.directory.creat.einval ---------------------------------------------------------------- Christian Brauner (1): open: return EINVAL for O_DIRECTORY | O_CREAT fs/open.c | 18 +++++++++++++----- include/uapi/asm-generic/fcntl.h | 1 - tools/include/uapi/asm-generic/fcntl.h | 1 - 3 files changed, 13 insertions(+), 7 deletions(-)
On Wed, Mar 22, 2023 at 3:17 AM Christian Brauner <brauner@kernel.org> wrote: > > I don't feel strongly about this either and your points are valid. So I > incorporated that and updated the comments in the code. In case you'd like to > take another look I've now put this up at: Ack, LGTM. Linus
On Tue, Mar 21, 2023 at 4:17 PM Christian Brauner <brauner@kernel.org> wrote: > It would be very nice if we had tests for the new behavior. So if @Pedro > would be up for it that would be highly appreciated. If not I'll put it > on my ToDo... Where do you want them? selftests? I have a relatively self-contained ""testsuite"" of namei stuff that could fit in there well, after some cleanup. > The expectation often is that this particular combination would create > and open a directory. This suggests users who tried to use that > combination would stumble upon the counterintuitive behavior no matter > if pre-v5.7 or post v5.7 and quickly realize neither semantics give them > what they want. For some examples see the code examples in [1] to [3] > and the discussion in [4]. Ok so, silly question: Could it not be desirable to have these semantics (open a dir or mkdir, atomically)? It does seem to be why POSIX left this edge case implementation defined, and if folks are asking for it, could it be the right move? And yes, I do understand (from reading the room) that no one here is too excited about this possibility.
On Mon, Mar 20, 2023 at 10:14:29AM -0700, Linus Torvalds wrote: > On Mon, Mar 20, 2023 at 4:52 AM Christian Brauner <brauner@kernel.org> wrote: > > > > So before we continue down that road should we maybe treat this as a > > chance to fix the old bug? Because this behavior of returning -ENOTDIR > > has existed ever since v5.7 now. Since that time we had three LTS > > releases all returning ENOTDIR even if the file was created. > > Ack. > > I think considering that the return value has been broken for so long, > I think we can pretty much assume that there are no actual users of > it, and we might as well clean up the semantics properly. If there are no users of this and we can clean up the semantics, is there a strong reason *not* to make `O_DIRECTORY | O_CREATE` actually create a directory and atomically return a file descriptor for that directory? That seems like genuinely useful behavior that we don't currently have a syscall for. I didn't see any suggestion in the thread for reasons why we can't or shouldn't do that. Would that break some existing software? It doesn't *sound* like it would. As far as I can tell, that *also* wouldn't cause a problem with O_TMPFILE, because older kernels will still fail as desired, and we wouldn't change the behavior of O_TMPFILE on new kernels (it'd still create a temporary file, not a directory). - Josh Triplett
On Mon, Mar 27, 2023 at 7:15 PM Josh Triplett <josh@joshtriplett.org> wrote: > > If there are no users of this and we can clean up the semantics, is > there a strong reason *not* to make `O_DIRECTORY | O_CREATE` actually > create a directory and atomically return a file descriptor for that > directory? That seems like genuinely useful behavior that we don't > currently have a syscall for. I didn't see any suggestion in the thread > for reasons why we can't or shouldn't do that. Absolutely not. For one thing, it is clearly not "genuinely useful behavior". It's just stupid. Name a *single* real situation where that would be a big improvement? Point to code, and point to a reason, and point to why it would make a difference. No made-up hypotheticals. If you want to open a directory, just do that fd = open(.., O_DIRECTORY); and if that directory doesn't exist, and you still want to create it, then just do mkdir(...); on it. Done. And mkdir() is atomic, so there's no race there with somebody else doing something else to that path. And no, there is no race with a subsequent open of that mkdir case, because you already know the result empty, so what would you do with the fd you just got? Absolutely nothing. It's useless. Edwin Starr sang a song all about it. So there is *zero* real reasons for that "open a directory and create it atomically". It's a nonsensical operation. Ok, just to play along - maybe you can make it slightly less nonsensical by throwing O_PATH into the mix, and now an empty directory file descriptor at least has *some* use. But even *if* you can point to such a thing being useful (and I'm really doubtful), it would *still* be stupid. Now your code would not only be specific to Linux, it would be specific to some very new version of Linux, and do something completely different on older versions. Because those older versions will do random things, ranging from "always return an error" to "create a regular file - not a directory - and then return an error anyway" and finally "create a regular file - not a directory - and return that resulting fd". So no. We're not adding a *fourth* set of semantics to something that is silly and useless to do in the first place, and that has already had three existing semantics. The reason it has had three different behaviors over the years is *literally* that nobody has ever wanted to do it, and so the fact that it has been broken for years hasn't even mattered. Don't try to make that situation worse by then making up new pointless meanings for it and try to come up with excuses why somebody would want to do that operation. Linus
On March 28, 2023 12:32:59 PM GMT+09:00, Linus Torvalds <torvalds@linux-foundation.org> wrote: >Ok, just to play along - maybe you can make it slightly less >nonsensical by throwing O_PATH into the mix, and now an empty >directory file descriptor at least has *some* use. That's the case I was thinking of: create a directory, then use exclusively *at system calls, never anything path-based. (I was using "atomic" loosely; not concerned about races here, just convenience.) >Now your code would not only be specific to Linux, it would be >specific to some very new version of Linux, and do something >completely different on older versions. I'm extremely not concerned with depending on current Linux. But that said... >Because those older versions will do random things, ranging from >"always return an error" to "create a regular file - not a directory - >and then return an error anyway" and finally "create a regular file - >not a directory - and return that resulting fd". ... Right, open has the un-extendable semantics, hence O_TMPFILE. Fair enough. Nevermind then. As is often the case for multi-operation syscalls, I'm better off just using io_uring for a mkdir-then-open.
On Tue, Mar 28, 2023 at 01:00:30PM +0900, Josh Triplett wrote: > On March 28, 2023 12:32:59 PM GMT+09:00, Linus Torvalds <torvalds@linux-foundation.org> wrote: > >Ok, just to play along - maybe you can make it slightly less > >nonsensical by throwing O_PATH into the mix, and now an empty > >directory file descriptor at least has *some* use. > > That's the case I was thinking of: create a directory, then use exclusively *at system calls, never anything path-based. (I was using "atomic" loosely; not concerned about races here, just convenience.) > > >Now your code would not only be specific to Linux, it would be > >specific to some very new version of Linux, and do something > >completely different on older versions. > > I'm extremely not concerned with depending on current Linux. But that said... > > >Because those older versions will do random things, ranging from > >"always return an error" to "create a regular file - not a directory - > >and then return an error anyway" and finally "create a regular file - > >not a directory - and return that resulting fd". > > ... Right, open has the un-extendable semantics, hence O_TMPFILE. Fair enough. Nevermind then. That's not even the issue per se as most applications would probably just be able to test whether O_DIRECTORY|O_CREAT creates and opens a directory. It's not that we haven't had to contend with similar issues in userspace for other syscalls before. The bigger problem for me is that we'd be advancing from fixing the semantics to not do completely weird/unexpected things to making it do something that users would expect or want it to do in one big step. Right now we're making a clean break by telling userspace EINVAL. But if that turns out to be problematic we can easily just roll back to a version of the old weird behavior with probably little fanfare. But if we already introduced support for new semantics that express user's intuition about what it's supposed to do we'll have a much harder time and created a flame war for ourselves. If however, EINVAL works just fine for a couple of kernel releases then it would be - separate from the sensibility of this specific request - another matter to make it do something else. Because at that point it's no different from reusing deprecated bits like we did for e.g., CLONE_DETACHED -> CLONE_PIDFD which has exactly the same ignore unknown or removed flags semantics as open/openat/openat2. Moving slow even in the face of excitement about new possibilities isn't always the wrong thing. This is one case were it isn't.
On Mon, Mar 27, 2023 at 09:13:18PM +0100, Pedro Falcato wrote: > On Tue, Mar 21, 2023 at 4:17 PM Christian Brauner <brauner@kernel.org> wrote: > > It would be very nice if we had tests for the new behavior. So if @Pedro > > would be up for it that would be highly appreciated. If not I'll put it > > on my ToDo... > > Where do you want them? selftests? I have a relatively self-contained > ""testsuite"" of namei stuff that could fit in there well, after some > cleanup. I think I would prefer to have them as part of xfstests: https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git as that's where nearly all of the fs testing is taking place. It's never great when developers have to run 3 separate testsuites to get meaningful coverage. So having it central to xfstests would be my preference. A while ago I added a testsuite that tests generic core VFS behavior it's located under src/vfs: https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/src/vfs and covers a lot of different things. So I would ask you to consider adding a new testsuite into that file: https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/src/vfs/vfstest.c I think the structure should be somewhat understandable. Then create a new test in xfstests using the "new" helper in the generic sectionA > ./new generic Next test id is 728 Append a name to the ID? Test name will be 728-$name. y,[n]: Creating test file '728' Add to group(s) [auto] (separate by space, ? for list): auto quick Creating skeletal script for you to edit ... then call the vfstest binary from the generated test case: echo "Silence is golden" $here/src/vfs/vfstest --test-THAT-NEW-SWITCH-NAME-YOU-ADDED --device "$TEST_DEV" \ --mount "$TEST_DIR" --fstype "$FSTYP" status=$? exit (You can also submit this to LTP or tell them about this change and they'll likely add tests in addition to xfstests.) > > > The expectation often is that this particular combination would create > > and open a directory. This suggests users who tried to use that > > combination would stumble upon the counterintuitive behavior no matter > > if pre-v5.7 or post v5.7 and quickly realize neither semantics give them > > what they want. For some examples see the code examples in [1] to [3] > > and the discussion in [4]. > > Ok so, silly question: Could it not be desirable to have these > semantics (open a dir or mkdir, atomically)? > It does seem to be why POSIX left this edge case implementation > defined, and if folks are asking for it, could it be the right move? > > And yes, I do understand (from reading the room) that no one here is > too excited about this possibility. Forgive me for being lazy and instead of repeating everything I'll just leave a link to the other part of the thread https://lore.kernel.org/lkml/20230328075735.d3rs27jjvarmn6dw@wittgenstein
diff --git a/fs/namei.c b/fs/namei.c index edfedfbccae..7b26db2f0f8 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3540,8 +3540,18 @@ static int do_open(struct nameidata *nd, if (unlikely(error)) return error; } - if ((nd->flags & LOOKUP_DIRECTORY) && !d_can_lookup(nd->path.dentry)) - return -ENOTDIR; + + if ((open_flag & (O_DIRECTORY | O_CREAT)) != (O_DIRECTORY | O_CREAT) || + !(file->f_mode & FMODE_CREATED)) { + /* O_DIRECTORY | O_CREAT has the strange property of being the + * only open(O_DIRECTORY) lookup that can create and return a + * regular file *if we indeed did create*. Because of this, + * only return -ENOTDIR if we're not O_DIR | O_CREAT or if we + * did not create a file. + */ + if ((nd->flags & LOOKUP_DIRECTORY) && !d_can_lookup(nd->path.dentry)) + return -ENOTDIR; + } do_truncate = false; acc_mode = op->acc_mode;