do_open(): Fix O_DIRECTORY | O_CREAT behavior

Message ID 20230320071442.172228-1-pedro.falcato@gmail.com
State New
Headers
Series do_open(): Fix O_DIRECTORY | O_CREAT behavior |

Commit Message

Pedro Falcato March 20, 2023, 7:14 a.m. UTC
  On Linux, open(O_DIRECTORY | O_CREAT) has historically meant "open
directory or create a regular file". This has remained mostly true,
except open(O_DIR | O_CREAT) has started returning an error *while
creating the file*. Restore the old behavior.

Signed-off-by: Pedro Falcato <pedro.falcato@gmail.com>
---
I did not explicitly add a Fixes: tag because I was unable to bisect this locally,
but it seems to me that this was introduced in the path walking refactoring done in early 2020.
Al, if you have a rough idea of what may have added this bug, feel free to add a Fixes.

This should also probably get CC'd to stable, but I'll leave this to your criteria.
 fs/namei.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)
  

Comments

Christian Brauner March 20, 2023, 11:51 a.m. UTC | #1
On Mon, Mar 20, 2023 at 07:14:42AM +0000, Pedro Falcato wrote:
> On Linux, open(O_DIRECTORY | O_CREAT) has historically meant "open
> directory or create a regular file". This has remained mostly true,
> except open(O_DIR | O_CREAT) has started returning an error *while
> creating the file*. Restore the old behavior.
> 
> Signed-off-by: Pedro Falcato <pedro.falcato@gmail.com>
> ---
> I did not explicitly add a Fixes: tag because I was unable to bisect this locally,

This dates back to the lookup rework done for v5.7. Specifically, this
was introduced by:

Fixes: 973d4b73fbaf ("do_last(): rejoin the common path even earlier in FMODE_{OPENED,CREATED} case")

> but it seems to me that this was introduced in the path walking refactoring done in early 2020.
> Al, if you have a rough idea of what may have added this bug, feel free to add a Fixes.
> 
> This should also probably get CC'd to stable, but I'll leave this to your criteria.
>  fs/namei.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index edfedfbccae..7b26db2f0f8 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3540,8 +3540,18 @@ static int do_open(struct nameidata *nd,
>  		if (unlikely(error))
>  			return error;
>  	}
> -	if ((nd->flags & LOOKUP_DIRECTORY) && !d_can_lookup(nd->path.dentry))
> -		return -ENOTDIR;
> +
> +	if ((open_flag & (O_DIRECTORY | O_CREAT)) != (O_DIRECTORY | O_CREAT) ||
> +	    !(file->f_mode & FMODE_CREATED)) {
> +		/* O_DIRECTORY | O_CREAT has the strange property of being the
> +		 * only open(O_DIRECTORY) lookup that can create and return a
> +		 * regular file *if we indeed did create*. Because of this,
> +		 * only return -ENOTDIR if we're not O_DIR | O_CREAT or if we
> +		 * did not create a file.
> +		 */

So before we continue down that road should we maybe treat this as a
chance to fix the old bug? Because this behavior of returning -ENOTDIR
has existed ever since v5.7 now. Since that time we had three LTS
releases all returning ENOTDIR even if the file was created.

So yeah, we could return to the old behavior. But we could also see this
as a chance to get rid of the bug. IOW, fail right at O_DIRECTORY | O_CREAT
with ENOTDIR and not even create the file anymore. No one has really
noticed this from 5.7 until now and afaict this has been a bug ever
since the dark ages and is mentioned as a bug on man 2 openat.
  
Linus Torvalds March 20, 2023, 5:14 p.m. UTC | #2
On Mon, Mar 20, 2023 at 4:52 AM Christian Brauner <brauner@kernel.org> wrote:
>
> So before we continue down that road should we maybe treat this as a
> chance to fix the old bug? Because this behavior of returning -ENOTDIR
> has existed ever since v5.7 now. Since that time we had three LTS
> releases all returning ENOTDIR even if the file was created.

Ack.

I think considering that the return value has been broken for so long,
I think we can pretty much assume that there are no actual users of
it, and we might as well clean up the semantics properly.

Willing to send that patch in and we'll get it tested in the crucible
of the real world?

                Linus
  
Pedro Falcato March 20, 2023, 7:27 p.m. UTC | #3
On Mon, Mar 20, 2023 at 5:14 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, Mar 20, 2023 at 4:52 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > So before we continue down that road should we maybe treat this as a
> > chance to fix the old bug? Because this behavior of returning -ENOTDIR
> > has existed ever since v5.7 now. Since that time we had three LTS
> > releases all returning ENOTDIR even if the file was created.
>
> Ack.
>
> I think considering that the return value has been broken for so long,
> I think we can pretty much assume that there are no actual users of
> it, and we might as well clean up the semantics properly.
>
> Willing to send that patch in and we'll get it tested in the crucible
> of the real world?

That sounds good to me, I can do that. What kind of new semantics are
we talking about here?

I originally found this when testing every kind of possibly-odd edge
case in path
walking/open(2). From the systems I've tested on (not too diverse, basically
NetBSD, FreeBSD, Linux 6.2.2, and now, when looking for a regression,
a variety of kernels since 2009), 4 behaviors occurred:

1) Pre v5.7 Linux did the open-dir-if-exists-else-create-regular-file
we all know and """love""".
2) Post 5.7, we started returning this buggy -ENOTDIR error, even when
successfully creating a file.
3) NetBSD just straight up returns EINVAL on open(O_DIRECTORY | O_CREAT)
4) FreeBSD's open(O_CREAT | O_DIRECTORY) succeeds if the file exists
and is a directory. Fails with -ENOENT if it falls onto the "O_CREAT"
path (i.e it doesn't try to create the file at all, just ENOENT's;
this changed relatively recently, in 2015)

Note that all of these behaviors are allowed by POSIX (in fact, I
would not call the old Linux behavior a *bug*, just really odd
semantics).

So, again, what kind of behavior change do we want here? IMO, the best
and least destructive would probably
be to emulate FreeBSD behavior here. I don't think open(O_DIRECTORY |
O_CREAT) returning -ENOTDIR
if it doesn't exist (as Christian suggested, I think?) makes any sort
of sense here.

Just my 2c.
  
Linus Torvalds March 20, 2023, 8:24 p.m. UTC | #4
On Mon, Mar 20, 2023 at 12:27 PM Pedro Falcato <pedro.falcato@gmail.com> wrote:
>
> 1) Pre v5.7 Linux did the open-dir-if-exists-else-create-regular-file
> we all know and """love""".

So I think we should fall back to this as a last resort, as a "well,
it's our historical behavior".

> 2) Post 5.7, we started returning this buggy -ENOTDIR error, even when
> successfully creating a file.

Yeah, I think this is the worst of the bunch and has no excuse (unless
some crazy program has started depending on it, which sounds really
*really* unlikely).

> 3) NetBSD just straight up returns EINVAL on open(O_DIRECTORY | O_CREAT)
> 4) FreeBSD's open(O_CREAT | O_DIRECTORY) succeeds if the file exists
> and is a directory. Fails with -ENOENT if it falls onto the "O_CREAT"
> path (i.e it doesn't try to create the file at all, just ENOENT's;
> this changed relatively recently, in 2015)

Either of these sound sensible to me.

I suspect (3) is the clearest case.

And (4) might be warranted just because it's closer to what we used to
do, and it's *possible* that somebody happens to use O_DIRECTORY |
O_CREAT on directories that exist, and never noticed how broken that
was.

And (4) has another special case: O_EXCL. Because I'm really hoping
that O_DIRECTORY | O_EXCL will always fail.

Is the proper patch something along the lines of this?

   --- a/fs/open.c
   +++ b/fs/open.c
   @@ -1186,6 +1186,8 @@ inline int build_open_flags(const struct
open_how *how, struct open_flags *op)

        /* Deal with the mode. */
        if (WILL_CREATE(flags)) {
   +            if (flags & O_DIRECTORY)
   +                    return -EINVAL;
                if (how->mode & ~S_IALLUGO)
                        return -EINVAL;
                op->mode = how->mode | S_IFREG;

I dunno. Not tested, not thought about very much.

What about O_PATH? I guess it's fine to create a file and only get a
path fd to the result?

             Linus
  
Aleksa Sarai March 20, 2023, 10:10 p.m. UTC | #5
On 2023-03-20, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> What about O_PATH? I guess it's fine to create a file and only get a
> path fd to the result?

O_PATH ignores the O_CREAT flag so it's the same as just passing O_PATH.
This is the case for all flags not in O_PATH_FLAGS (so only O_DIRECTORY,
O_NOFOLLOW, O_CLOEXEC have an effect on O_PATH). With openat2, passing
any other flag with O_PATH returns -EINVAL.
  
Christian Brauner March 21, 2023, 2:24 p.m. UTC | #6
On Mon, Mar 20, 2023 at 01:24:52PM -0700, Linus Torvalds wrote:
> On Mon, Mar 20, 2023 at 12:27 PM Pedro Falcato <pedro.falcato@gmail.com> wrote:
> >
> > 1) Pre v5.7 Linux did the open-dir-if-exists-else-create-regular-file
> > we all know and """love""".
> 
> So I think we should fall back to this as a last resort, as a "well,
> it's our historical behavior".
> 
> > 2) Post 5.7, we started returning this buggy -ENOTDIR error, even when
> > successfully creating a file.
> 
> Yeah, I think this is the worst of the bunch and has no excuse (unless
> some crazy program has started depending on it, which sounds really
> *really* unlikely).
> 
> > 3) NetBSD just straight up returns EINVAL on open(O_DIRECTORY | O_CREAT)
> > 4) FreeBSD's open(O_CREAT | O_DIRECTORY) succeeds if the file exists
> > and is a directory. Fails with -ENOENT if it falls onto the "O_CREAT"
> > path (i.e it doesn't try to create the file at all, just ENOENT's;
> > this changed relatively recently, in 2015)
> 
> Either of these sound sensible to me.
> 
> I suspect (3) is the clearest case.

Yeah, we should try that.

> 
> And (4) might be warranted just because it's closer to what we used to
> do, and it's *possible* that somebody happens to use O_DIRECTORY |
> O_CREAT on directories that exist, and never noticed how broken that
> was.

I really really hope that isn't the case because (4) is pretty nasty.
Having to put this on a manpage seems nightmarish.

> 
> And (4) has another special case: O_EXCL. Because I'm really hoping
> that O_DIRECTORY | O_EXCL will always fail.

I've detailed the semantics for that in the commit message below...

> 
> Is the proper patch something along the lines of this?

Yeah, I think that would do it and I think that's what we should try to
get away with. I just spent time and took a look who passes O_DIRECTORY
and O_CREAT and interestingly there are a number of comments roughly
along the lines of the following example:

/* Ideally we could use openat(O_DIRECTORY | O_CREAT | O_EXCL) here
 * to create and open the directory atomically

suggests that people who specify O_DIRECTORY | O_CREAT are interested in
creating a directory. But since this never did work they don't tend to
use that flag combination (I've collected a few samples in [1] to [4].).

(As a sidenote, posix made an interpretation change a long time ago to
at least allow for O_DIRECTORY | O_CREAT to create a directory (see [3]).

But that's a whole different can of worms and I haven't spent any
thoughts even on feasibility. And even if we should probably get through
a couple of kernels with O_DIRECTORY | O_CREAT failing with EINVAL first.)

> 
>    --- a/fs/open.c
>    +++ b/fs/open.c
>    @@ -1186,6 +1186,8 @@ inline int build_open_flags(const struct
> open_how *how, struct open_flags *op)
> 
>         /* Deal with the mode. */
>         if (WILL_CREATE(flags)) {
>    +            if (flags & O_DIRECTORY)
>    +                    return -EINVAL;

This will be problematic because for weird historical reasons O_TMPFILE
includes O_DIRECTORY so this would unfortunately break O_TMPFILE. :/
I'll try to have a patch ready in a bit.

I spent a long time digging through potential users of this nonsense.

Link: https://lore.kernel.org/lkml/20230320071442.172228-1-pedro.falcato@gmail.com
Link: https://sources.debian.org/src/flatpak/1.14.4-1/subprojects/libglnx/glnx-dirfd.c/?hl=324#L324 [1]
Link: https://sources.debian.org/src/flatpak-builder/1.2.3-1/subprojects/libglnx/glnx-shutil.c/?hl=251#L251 [2]
Link: https://sources.debian.org/src/ostree/2022.7-2/libglnx/glnx-dirfd.c/?hl=324#L324 [3]
Link: https://www.openwall.com/lists/oss-security/2014/11/26/14 [4]
  
Christian Brauner March 21, 2023, 4:17 p.m. UTC | #7
On Tue, Mar 21, 2023 at 03:24:19PM +0100, Christian Brauner wrote:
> On Mon, Mar 20, 2023 at 01:24:52PM -0700, Linus Torvalds wrote:
> > On Mon, Mar 20, 2023 at 12:27 PM Pedro Falcato <pedro.falcato@gmail.com> wrote:
> > >
> > > 1) Pre v5.7 Linux did the open-dir-if-exists-else-create-regular-file
> > > we all know and """love""".
> > 
> > So I think we should fall back to this as a last resort, as a "well,
> > it's our historical behavior".
> > 
> > > 2) Post 5.7, we started returning this buggy -ENOTDIR error, even when
> > > successfully creating a file.
> > 
> > Yeah, I think this is the worst of the bunch and has no excuse (unless
> > some crazy program has started depending on it, which sounds really
> > *really* unlikely).
> > 
> > > 3) NetBSD just straight up returns EINVAL on open(O_DIRECTORY | O_CREAT)
> > > 4) FreeBSD's open(O_CREAT | O_DIRECTORY) succeeds if the file exists
> > > and is a directory. Fails with -ENOENT if it falls onto the "O_CREAT"
> > > path (i.e it doesn't try to create the file at all, just ENOENT's;
> > > this changed relatively recently, in 2015)
> > 
> > Either of these sound sensible to me.
> > 
> > I suspect (3) is the clearest case.
> 
> Yeah, we should try that.
> 
> > 
> > And (4) might be warranted just because it's closer to what we used to
> > do, and it's *possible* that somebody happens to use O_DIRECTORY |
> > O_CREAT on directories that exist, and never noticed how broken that
> > was.
> 
> I really really hope that isn't the case because (4) is pretty nasty.
> Having to put this on a manpage seems nightmarish.
> 
> > 
> > And (4) has another special case: O_EXCL. Because I'm really hoping
> > that O_DIRECTORY | O_EXCL will always fail.
> 
> I've detailed the semantics for that in the commit message below...
> 
> > 
> > Is the proper patch something along the lines of this?
> 
> Yeah, I think that would do it and I think that's what we should try to
> get away with. I just spent time and took a look who passes O_DIRECTORY
> and O_CREAT and interestingly there are a number of comments roughly
> along the lines of the following example:
> 
> /* Ideally we could use openat(O_DIRECTORY | O_CREAT | O_EXCL) here
>  * to create and open the directory atomically
> 
> suggests that people who specify O_DIRECTORY | O_CREAT are interested in
> creating a directory. But since this never did work they don't tend to
> use that flag combination (I've collected a few samples in [1] to [4].).
> 
> (As a sidenote, posix made an interpretation change a long time ago to
> at least allow for O_DIRECTORY | O_CREAT to create a directory (see [3]).
> 
> But that's a whole different can of worms and I haven't spent any
> thoughts even on feasibility. And even if we should probably get through
> a couple of kernels with O_DIRECTORY | O_CREAT failing with EINVAL first.)
> 
> > 
> >    --- a/fs/open.c
> >    +++ b/fs/open.c
> >    @@ -1186,6 +1186,8 @@ inline int build_open_flags(const struct
> > open_how *how, struct open_flags *op)
> > 
> >         /* Deal with the mode. */
> >         if (WILL_CREATE(flags)) {
> >    +            if (flags & O_DIRECTORY)
> >    +                    return -EINVAL;
> 
> This will be problematic because for weird historical reasons O_TMPFILE
> includes O_DIRECTORY so this would unfortunately break O_TMPFILE. :/
> I'll try to have a patch ready in a bit.

So, had to do some testing first:

This survives xfstests:
        sudo ./check -g unlink,idmapped
which pounds on the creation and O_TMPFILE paths.

It also survives LTP:
        # The following includes openat2, openat, open, fsopen, open_tree, etc.
        sudo ./runtltp -f syscalls open

I've also written a (very nasty) test script:

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>

        int main(int argc, char *argv[])
        {
                int fd;

                fd = open("/tmp/TEST", O_DIRECTORY | O_CREAT);
                if (fd >= 0)
                        printf("%d\n", fd);
                else
                        printf("%m: fd(%d)\n", fd);

                fd = open("/tmp/TEST", O_DIRECTORY | O_CREAT | O_EXCL);
                if (fd >= 0)
                        printf("%d\n", fd);
                else
                        printf("%m: fd(%d)\n", fd);

                fd = open("/tmp/TEST", O_DIRECTORY | O_EXCL);
                if (fd >= 0)
                        printf("%d\n", fd);
                else
                        printf("%m: fd(%d)\n", fd);

                exit(EXIT_SUCCESS);
        }
        ubuntu@vm1:~/ssd$ sudo ./create_test
        Invalid argument: fd(-1)
        Invalid argument: fd(-1)
        No such file or directory: fd(-1)

I think this should work. From the commit message it's hopefully clear
that this is proper semantic change. But I think we might be able to
pull this off given how rare this combination should be and how
unnoticed the current regression has gone and for how long...

So I came up with the following description trying to detail the
semantics prior to v5.7, post v5.7 up until this patch, and then after
the patch. Linus, I added your SOB to this but tell me if you'd rather
not be associated with this potential mess...

It would be very nice if we had tests for the new behavior. So if @Pedro
would be up for it that would be highly appreciated. If not I'll put it
on my ToDo...

So I can carry this and sent a pr or it can be picked up directly. I'm
not fuzzed. Hopefully there are no surprises or objections...

From 6bc6e6a4c9ed0dcbe0c85cbbaca90953e27889e5 Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Tue, 21 Mar 2023 09:18:07 +0100
Subject: [PATCH] open: return EINVAL for O_DIRECTORY | O_CREAT

After a couple of years and multiple LTS releases we received a report
that the behavior of O_DIRECTORY | O_CREAT changed starting with v5.7.

On kernels prior to v5.7 combinations of O_DIRECTORY, O_CREAT, O_EXCL
had the following semantics:

(1) open("/tmp/d", O_DIRECTORY | O_CREAT)
    * d doesn't exist:                create regular file
    * d exists and is a regular file: ENOTDIR
    * d exists and is a directory:    EISDIR

(2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL)
    * d doesn't exist:                create regular file
    * d exists and is a regular file: EEXIST
    * d exists and is a directory:    EEXIST

(3) open("/tmp/d", O_DIRECTORY | O_EXCL)
    * d doesn't exist:                ENOENT
    * d exists and is a regular file: ENOTDIR
    * d exists and is a directory:    open directory

On kernels since to v5.7 combinations of O_DIRECTORY, O_CREAT, O_EXCL
have the following semantics:

(1) open("/tmp/d", O_DIRECTORY | O_CREAT)
    * d doesn't exist:                ENOTDIR (create regular file)
    * d exists and is a regular file: ENOTDIR
    * d exists and is a directory:    EISDIR

(2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL)
    * d doesn't exist:                ENOTDIR (create regular file)
    * d exists and is a regular file: EEXIST
    * d exists and is a directory:    EEXIST

(3) open("/tmp/d", O_DIRECTORY | O_EXCL)
    * d doesn't exist:                ENOENT
    * d exists and is a regular file: ENOTDIR
    * d exists and is a directory:    open directory

This is a fairly substantial semantic change that userspace didn't
notice until Pedro took the time to deliberately figure out corner
cases. Since no one noticed this breakage we can somewhat safely assume
that O_DIRECTORY | O_CREAT combinations are likely unused.

The v5.7 breakage is especially weird because while ENOTDIR is returned
indicating failure a regular file is actually created. This doesn't make
a lot of sense.

Time was spent finding potential users of this combination. Searching on
codesearch.debian.net showed that codebases often express semantical
expectations about O_DIRECTORY | O_CREAT which are completely contrary
to what our code has done and currently does.

The expectation often is that this particular combination would create
and open a directory. This suggests users who tried to use that
combination would stumble upon the counterintuitive behavior no matter
if pre-v5.7 or post v5.7 and quickly realize neither semantics give them
what they want. For some examples see the code examples in [1] to [3]
and the discussion in [4].

There are various ways to address this issue. The lazy/simple option
would be to restore the pre-v5.7 behavior and to just live with that bug
forever. But since there's a real chance that the O_DIRECTORY | O_CREAT
quirk isn't relied upon we should try to get away with murder(ing bad
semantics) first. If we need to Frankenstein pre-v5.7 behavior later so
be it.

So let's simply return EINVAL categorically for O_DIRECTORY | O_CREAT
combinations. In addition to cleaning up the old bug this also opens up
the possiblity to make that flag combination do something more intuitive
in the future.

Starting with this commit the following semantics apply:

(1) open("/tmp/d", O_DIRECTORY | O_CREAT)
    * d doesn't exist:                EINVAL
    * d exists and is a regular file: EINVAL
    * d exists and is a directory:    EINVAL

(2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL)
    * d doesn't exist:                EINVAL
    * d exists and is a regular file: EINVAL
    * d exists and is a directory:    EINVAL

(3) open("/tmp/d", O_DIRECTORY | O_EXCL)
    * d doesn't exist:                ENOENT
    * d exists and is a regular file: ENOTDIR
    * d exists and is a directory:    open directory

One additional note, O_TMPFILE is implemented as:

    #define __O_TMPFILE    020000000
    #define O_TMPFILE      (__O_TMPFILE | O_DIRECTORY)
    #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)

For older kernels it was important to return an explicit error when
O_TMPFILE wasn't supported. So it is implemented to look like
O_DIRECTORY | O_RDWR but without O_CREAT. The reason was that
O_DIRECTORY | O_CREAT used to work and created a regular file. Allowing
it to be specified would've potentially caused a regular file to be
created on older kernels while the caller would believe they created an
actual O_TMPFILE. So instead O_TMPFILE has included O_DIRECTORY | O_RDWR
and the code uses O_TMPFILE_MASK to check that O_CREAT isn't specified
returning EINVAL if it is.

With this patch, we categorically reject O_DIRECTORY | O_CREAT and
return EINVAL. That means, we could write this in a way that makes the
if ((flags & O_TMPFILE_MASK) != O_TMPFILE) check superfluous but let's
not do that. The check documents the peculiar aspects of O_TMPFILE quite
nicely and can serve as a reminder how easy it is to break.

As Aleksa pointed out O_PATH is unaffected by this change since it
always returned EINVAL if O_CREAT was specified - with or without
O_DIRECTORY.

Link: https://lore.kernel.org/lkml/20230320071442.172228-1-pedro.falcato@gmail.com
Link: https://sources.debian.org/src/flatpak/1.14.4-1/subprojects/libglnx/glnx-dirfd.c/?hl=324#L324 [1]
Link: https://sources.debian.org/src/flatpak-builder/1.2.3-1/subprojects/libglnx/glnx-shutil.c/?hl=251#L251 [2]
Link: https://sources.debian.org/src/ostree/2022.7-2/libglnx/glnx-dirfd.c/?hl=324#L324 [3]
Link: https://www.openwall.com/lists/oss-security/2014/11/26/14 [4]
Reported-by: Pedro Falcato <pedro.falcato@gmail.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
---
This survives xfstests:

        sudo ./check -g unlink,idmapped

which pounds on the creation and O_TMPFILE paths.

It also survives LTP:

        sudo ./runtltp -f syscalls openat2
        sudo ./runtltp -f syscalls openat
        # The following includes openat2, openat, open, fsopen, open_tree, etc.
        sudo ./runtltp -f syscalls open
        sudo ./runtltp -f syscalls create_file
        sudo ./runtltp -f syscalls create_files

I've also written a (very nasty) test script:

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>

        int main(int argc, char *argv[])
        {
                int fd;

                fd = open("/tmp/TEST", O_DIRECTORY | O_CREAT);
                if (fd >= 0)
                        printf("%d\n", fd);
                else
                        printf("%m: fd(%d)\n", fd);

                fd = open("/tmp/TEST", O_DIRECTORY | O_CREAT | O_EXCL);
                if (fd >= 0)
                        printf("%d\n", fd);
                else
                        printf("%m: fd(%d)\n", fd);

                fd = open("/tmp/TEST", O_DIRECTORY | O_EXCL);
                if (fd >= 0)
                        printf("%d\n", fd);
                else
                        printf("%m: fd(%d)\n", fd);

                exit(EXIT_SUCCESS);
        }
        ubuntu@vm1:~/ssd$ sudo ./create_test
        Invalid argument: fd(-1)
        Invalid argument: fd(-1)
        No such file or directory: fd(-1)
---
 fs/open.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/open.c b/fs/open.c
index 4401a73d4032..39e360f9490d 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1135,6 +1135,8 @@ struct file *open_with_fake_path(const struct path *path, int flags,
 EXPORT_SYMBOL(open_with_fake_path);
 
 #define WILL_CREATE(flags)	(flags & (O_CREAT | __O_TMPFILE))
+#define INVALID_CREATE(flags) \
+	((flags & (O_DIRECTORY | O_CREAT)) == (O_DIRECTORY | O_CREAT))
 #define O_PATH_FLAGS		(O_DIRECTORY | O_NOFOLLOW | O_PATH | O_CLOEXEC)
 
 inline struct open_how build_open_how(int flags, umode_t mode)
@@ -1207,6 +1209,10 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
 		if (!(acc_mode & MAY_WRITE))
 			return -EINVAL;
 	}
+
+	if (INVALID_CREATE(flags))
+		return -EINVAL;
+
 	if (flags & O_PATH) {
 		/* O_PATH only permits certain other flags to be set. */
 		if (flags & ~O_PATH_FLAGS)
  
Linus Torvalds March 21, 2023, 5:35 p.m. UTC | #8
On Tue, Mar 21, 2023 at 9:17 AM Christian Brauner <brauner@kernel.org> wrote:
>
>  #define WILL_CREATE(flags)     (flags & (O_CREAT | __O_TMPFILE))
> +#define INVALID_CREATE(flags) \
> +       ((flags & (O_DIRECTORY | O_CREAT)) == (O_DIRECTORY | O_CREAT))
>  #define O_PATH_FLAGS           (O_DIRECTORY | O_NOFOLLOW | O_PATH | O_CLOEXEC)
>
>  inline struct open_how build_open_how(int flags, umode_t mode)
> @@ -1207,6 +1209,10 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
>                 if (!(acc_mode & MAY_WRITE))
>                         return -EINVAL;
>         }
> +
> +       if (INVALID_CREATE(flags))
> +               return -EINVAL;
> +
>         if (flags & O_PATH) {
>                 /* O_PATH only permits certain other flags to be set. */
>                 if (flags & ~O_PATH_FLAGS)

So the patch looks simple enough, but

 (a) I'm not entirely sure I like the extra indirection through
another #define. This impenetrable thicket of different macros makes
it a bit hard to see what is going on. I'm not blaming you for it, it
predates this patch, but..

 (b) this seems to make that O_TMPFILE_MASK macro pointless.

I think (b) kind of re-inforces the point of (a) here.

The only reason for O_TMPFILE_MASK is literally that old historical
"make sure old kernels return errors when they don't support
O_TEMPFILE", and thus the magic re-use of old bit patterns.

But now that we do that "return error if both O_DIRECTORY and O_CREAT
are set", the O_TMPFILE_MASK check is basically dead, because it ends
up checking for that same bit pattern except also __O_TMPFILE.

And that is *not* obvious from the code, exactly because of that
thicket of different macros.

In fact, since that whole

        if ((flags & O_TMPFILE_MASK) != O_TMPFILE)
                return -EINVAL;

is done inside an "if (flags & __O_TMPFILE)", the compiler might as
well reduce it *exactly* down to that exact same test as
INVALID_CREATE() now is.

So I really get the feeling that the macros actually hide what is
going on, and are the exact opposite of being helpful. Case in point:
with your patch, you now have the exact same test twice in a row,
except it *looks* like two different tests and one of them is
conditional on __O_TMPFILE.

For all I know, the compiler may actually notice the redundancy and
remove one of them, but we shouldn't write bad code with the
expectation that "the compiler will fix it up". Particularly when it
just makes it harder for people to understand too.

                     Linus
  
Christian Brauner March 21, 2023, 8:16 p.m. UTC | #9
On Tue, Mar 21, 2023 at 10:35:47AM -0700, Linus Torvalds wrote:
> On Tue, Mar 21, 2023 at 9:17 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> >  #define WILL_CREATE(flags)     (flags & (O_CREAT | __O_TMPFILE))
> > +#define INVALID_CREATE(flags) \
> > +       ((flags & (O_DIRECTORY | O_CREAT)) == (O_DIRECTORY | O_CREAT))
> >  #define O_PATH_FLAGS           (O_DIRECTORY | O_NOFOLLOW | O_PATH | O_CLOEXEC)
> >
> >  inline struct open_how build_open_how(int flags, umode_t mode)
> > @@ -1207,6 +1209,10 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
> >                 if (!(acc_mode & MAY_WRITE))
> >                         return -EINVAL;
> >         }
> > +
> > +       if (INVALID_CREATE(flags))
> > +               return -EINVAL;
> > +
> >         if (flags & O_PATH) {
> >                 /* O_PATH only permits certain other flags to be set. */
> >                 if (flags & ~O_PATH_FLAGS)
> 
> So the patch looks simple enough, but
> 
>  (a) I'm not entirely sure I like the extra indirection through
> another #define. This impenetrable thicket of different macros makes
> it a bit hard to see what is going on. I'm not blaming you for it, it
> predates this patch, but..
> 
>  (b) this seems to make that O_TMPFILE_MASK macro pointless.

So I tried to justify this in the commit message but it might've gotten lost in
there. My thinking had been:

    With this patch, we categorically reject O_DIRECTORY | O_CREAT and
    return EINVAL. That means, we could write this in a way that makes the
    if ((flags & O_TMPFILE_MASK) != O_TMPFILE) check superfluous but let's
    not do that. The check documents the peculiar aspects of O_TMPFILE quite
    nicely and can serve as a reminder how easy it is to break.

But I'm not married to keeping it and it could be misleading.

> 
> I think (b) kind of re-inforces the point of (a) here.
> 
> The only reason for O_TMPFILE_MASK is literally that old historical
> "make sure old kernels return errors when they don't support
> O_TEMPFILE", and thus the magic re-use of old bit patterns.
> 
> But now that we do that "return error if both O_DIRECTORY and O_CREAT
> are set", the O_TMPFILE_MASK check is basically dead, because it ends
> up checking for that same bit pattern except also __O_TMPFILE.

Yes.

> 
> And that is *not* obvious from the code, exactly because of that
> thicket of different macros.
> 
> In fact, since that whole
> 
>         if ((flags & O_TMPFILE_MASK) != O_TMPFILE)
>                 return -EINVAL;
> 
> is done inside an "if (flags & __O_TMPFILE)", the compiler might as
> well reduce it *exactly* down to that exact same test as
> INVALID_CREATE() now is.
> 
> So I really get the feeling that the macros actually hide what is
> going on, and are the exact opposite of being helpful. Case in point:
> with your patch, you now have the exact same test twice in a row,
> except it *looks* like two different tests and one of them is
> conditional on __O_TMPFILE.

Yeah, see above why I did that originally.

> 
> For all I know, the compiler may actually notice the redundancy and
> remove one of them, but we shouldn't write bad code with the
> expectation that "the compiler will fix it up". Particularly when it
> just makes it harder for people to understand too.

But yes, that is a valid complaint so - without having tested - sm like:

diff --git a/fs/open.c b/fs/open.c
index 4401a73d4032..3c5227d84cfe 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1195,6 +1195,13 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
                op->mode = 0;
        }

+       /*
+        * Block nonsensical creation requests and ensure that O_CREAT isn't
+        * slipped alongside O_TMPFILE which relies on O_DIRECTORY.
+        */
+       if ((flags & (O_DIRECTORY | O_CREAT)) == (O_DIRECTORY | O_CREAT))
+               return -EINVAL;
+
        /*
         * In order to ensure programs get explicit errors when trying to use
         * O_TMPFILE on old kernels, O_TMPFILE is implemented such that it
@@ -1202,7 +1209,7 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
         * have to require userspace to explicitly set it.
         */
        if (flags & __O_TMPFILE) {
-               if ((flags & O_TMPFILE_MASK) != O_TMPFILE)
+               if ((flags & O_TMPFILE) != O_TMPFILE)
                        return -EINVAL;
                if (!(acc_mode & MAY_WRITE))
                        return -EINVAL;
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 1ecdb911add8..80f37a0d40d7 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -91,7 +91,6 @@

 /* a horrid kludge trying to make sure that this will fail on old kernels */
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
-#define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)

 #ifndef O_NDELAY
 #define O_NDELAY       O_NONBLOCK
diff --git a/tools/include/uapi/asm-generic/fcntl.h b/tools/include/uapi/asm-generic/fcntl.h
index b02c8e0f4057..1c7a0f6632c0 100644
--- a/tools/include/uapi/asm-generic/fcntl.h
+++ b/tools/include/uapi/asm-generic/fcntl.h
@@ -91,7 +91,6 @@

 /* a horrid kludge trying to make sure that this will fail on old kernels */
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
-#define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)

 #ifndef O_NDELAY
 #define O_NDELAY       O_NONBLOCK
--
2.34.1
  
Linus Torvalds March 21, 2023, 9:47 p.m. UTC | #10
On Tue, Mar 21, 2023 at 1:16 PM Christian Brauner <brauner@kernel.org> wrote:
>
> But yes, that is a valid complaint so - without having tested - sm like:

I'd actually go a bit further, and really spell all the bits out explicitly.

I mean, I was *literally* involved in that original O_TMPFILE_MASK thing:

   https://lore.kernel.org/all/CA+55aFxA3qoM5wpMUya7gEA8SZyJep7kMBRjrPOsOm_OudD8aQ@mail.gmail.com/

with the whole O_DIRECOTY games to make O_TMPFILE safer, but despite
that I didn't remember this at all, and my suggested "maybe something
like this" patch was broken for the O_TMPFILE case.

So while we do have all this documented in our history (both git
commit logs and lore.kernel.org), I actually think it would be lovely
to just make build_open_flags() be very explicit about all the exact
O_xyz flags, and really write out the logic fully.

For example, even your clarified version that gets rid of the
"O_TMPFILE_MASK" thing still eends up doing

        if (flags & __O_TMPFILE) {
                if ((flags & O_TMPFILE) != O_TMPFILE)
                        return -EINVAL;

and so when you look at that code, you don't actually realize that
O_TMPFILE _cotnains_ that __O_TMPFILE bit, and what the above really
means is "also check O_DIRECTORY".

So considering how I couldn't remember this mess myself, despite
having been involved with it personally (a decade ago..), I really do
think that maybe this shoudl be open-coded with a comment, and the
above code should instead be

        if (flags & __O_TMPFILE) {
                if (!(flags & O_DIRECTORY))
                        return -EINVAL;

together with an explicit comment about how O_TMPFILE is the
*combination* of __O_TMPFILE and O_DIRECTORY, along with a short
explanation as to why.

Now, I agree that that test for O_DIRECTORY then _looks_ odd, but the
thing is, it then makes the reality of this all much more explicit.

In contrast, doing that

                if ((flags & O_TMPFILE) != O_TMPFILE)

may *look* more natural in that context, but if you actually start
thinking about it, that check makes no sense unless you then look up
what O_TMPFILE is, and the history behind it.

So I'd rather have code that looks a bit odd, but that explains itself
and is explicit about what it does, than code that _tries_ to look
natural but actually hides the reason for what it is doing.

And then next time somebody looks at that O_DIRECTORY | O_CREAT
combination, suddenly the __O_TMPFILE interaction is there, and very
explicit.

Hmm?

I don't feel *hugely* strongly about this, so in the end I'll bow to
your decision, but considering that my initial patch looked sane but
was buggy because I had forgotten about O_TMPFILE, I really think we
should make this more explicit at a source level..

               Linus

            Linus
  
Christian Brauner March 22, 2023, 10:17 a.m. UTC | #11
On Tue, Mar 21, 2023 at 02:47:55PM -0700, Linus Torvalds wrote:
> On Tue, Mar 21, 2023 at 1:16 PM Christian Brauner <brauner@kernel.org> wrote:
> >
> > But yes, that is a valid complaint so - without having tested - sm like:
> 
> I'd actually go a bit further, and really spell all the bits out explicitly.
> 
> I mean, I was *literally* involved in that original O_TMPFILE_MASK thing:
> 
>    https://lore.kernel.org/all/CA+55aFxA3qoM5wpMUya7gEA8SZyJep7kMBRjrPOsOm_OudD8aQ@mail.gmail.com/
> 
> with the whole O_DIRECOTY games to make O_TMPFILE safer, but despite
> that I didn't remember this at all, and my suggested "maybe something
> like this" patch was broken for the O_TMPFILE case.
> 
> So while we do have all this documented in our history (both git
> commit logs and lore.kernel.org), I actually think it would be lovely
> to just make build_open_flags() be very explicit about all the exact
> O_xyz flags, and really write out the logic fully.
> 
> For example, even your clarified version that gets rid of the
> "O_TMPFILE_MASK" thing still eends up doing
> 
>         if (flags & __O_TMPFILE) {
>                 if ((flags & O_TMPFILE) != O_TMPFILE)
>                         return -EINVAL;
> 
> and so when you look at that code, you don't actually realize that
> O_TMPFILE _cotnains_ that __O_TMPFILE bit, and what the above really
> means is "also check O_DIRECTORY".
> 
> So considering how I couldn't remember this mess myself, despite
> having been involved with it personally (a decade ago..), I really do
> think that maybe this shoudl be open-coded with a comment, and the
> above code should instead be
> 
>         if (flags & __O_TMPFILE) {
>                 if (!(flags & O_DIRECTORY))
>                         return -EINVAL;
> 
> together with an explicit comment about how O_TMPFILE is the
> *combination* of __O_TMPFILE and O_DIRECTORY, along with a short
> explanation as to why.
> 
> Now, I agree that that test for O_DIRECTORY then _looks_ odd, but the
> thing is, it then makes the reality of this all much more explicit.
> 
> In contrast, doing that
> 
>                 if ((flags & O_TMPFILE) != O_TMPFILE)
> 
> may *look* more natural in that context, but if you actually start
> thinking about it, that check makes no sense unless you then look up
> what O_TMPFILE is, and the history behind it.
> 
> So I'd rather have code that looks a bit odd, but that explains itself
> and is explicit about what it does, than code that _tries_ to look
> natural but actually hides the reason for what it is doing.
> 
> And then next time somebody looks at that O_DIRECTORY | O_CREAT
> combination, suddenly the __O_TMPFILE interaction is there, and very
> explicit.
> 
> Hmm?
> 
> I don't feel *hugely* strongly about this, so in the end I'll bow to
> your decision, but considering that my initial patch looked sane but
> was buggy because I had forgotten about O_TMPFILE, I really think we
> should make this more explicit at a source level..

I don't feel strongly about this either and your points are valid. So I
incorporated that and updated the comments in the code. In case you'd like to
take another look I've now put this up at:

The following changes since commit e8d018dd0257f744ca50a729e3d042cf2ec9da65:

  Linux 6.3-rc3 (2023-03-19 13:27:55 -0700)

are available in the Git repository at:

  ssh://git@gitolite.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping.git tags/vfs.open.directory.creat.einval

for you to fetch changes up to 43b450632676fb60e9faeddff285d9fac94a4f58:

  open: return EINVAL for O_DIRECTORY | O_CREAT (2023-03-22 11:06:55 +0100)

----------------------------------------------------------------
vfs.open.directory.creat.einval

----------------------------------------------------------------
Christian Brauner (1):
      open: return EINVAL for O_DIRECTORY | O_CREAT

 fs/open.c                              | 18 +++++++++++++-----
 include/uapi/asm-generic/fcntl.h       |  1 -
 tools/include/uapi/asm-generic/fcntl.h |  1 -
 3 files changed, 13 insertions(+), 7 deletions(-)
  
Linus Torvalds March 22, 2023, 5:07 p.m. UTC | #12
On Wed, Mar 22, 2023 at 3:17 AM Christian Brauner <brauner@kernel.org> wrote:
>
> I don't feel strongly about this either and your points are valid. So I
> incorporated that and updated the comments in the code. In case you'd like to
> take another look I've now put this up at:

Ack, LGTM.

               Linus
  
Pedro Falcato March 27, 2023, 8:13 p.m. UTC | #13
On Tue, Mar 21, 2023 at 4:17 PM Christian Brauner <brauner@kernel.org> wrote:
> It would be very nice if we had tests for the new behavior. So if @Pedro
> would be up for it that would be highly appreciated. If not I'll put it
> on my ToDo...

Where do you want them? selftests? I have a relatively self-contained
""testsuite"" of namei stuff that could fit in there well, after some
cleanup.

> The expectation often is that this particular combination would create
> and open a directory. This suggests users who tried to use that
> combination would stumble upon the counterintuitive behavior no matter
> if pre-v5.7 or post v5.7 and quickly realize neither semantics give them
> what they want. For some examples see the code examples in [1] to [3]
> and the discussion in [4].

Ok so, silly question: Could it not be desirable to have these
semantics (open a dir or mkdir, atomically)?
It does seem to be why POSIX left this edge case implementation
defined, and if folks are asking for it, could it be the right move?

And yes, I do understand (from reading the room) that no one here is
too excited about this possibility.
  
Josh Triplett March 28, 2023, 2:15 a.m. UTC | #14
On Mon, Mar 20, 2023 at 10:14:29AM -0700, Linus Torvalds wrote:
> On Mon, Mar 20, 2023 at 4:52 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > So before we continue down that road should we maybe treat this as a
> > chance to fix the old bug? Because this behavior of returning -ENOTDIR
> > has existed ever since v5.7 now. Since that time we had three LTS
> > releases all returning ENOTDIR even if the file was created.
> 
> Ack.
> 
> I think considering that the return value has been broken for so long,
> I think we can pretty much assume that there are no actual users of
> it, and we might as well clean up the semantics properly.

If there are no users of this and we can clean up the semantics, is
there a strong reason *not* to make `O_DIRECTORY | O_CREATE` actually
create a directory and atomically return a file descriptor for that
directory? That seems like genuinely useful behavior that we don't
currently have a syscall for. I didn't see any suggestion in the thread
for reasons why we can't or shouldn't do that.

Would that break some existing software? It doesn't *sound* like it
would.

As far as I can tell, that *also* wouldn't cause a problem with
O_TMPFILE, because older kernels will still fail as desired, and we
wouldn't change the behavior of O_TMPFILE on new kernels (it'd still
create a temporary file, not a directory).

- Josh Triplett
  
Linus Torvalds March 28, 2023, 3:32 a.m. UTC | #15
On Mon, Mar 27, 2023 at 7:15 PM Josh Triplett <josh@joshtriplett.org> wrote:
>
> If there are no users of this and we can clean up the semantics, is
> there a strong reason *not* to make `O_DIRECTORY | O_CREATE` actually
> create a directory and atomically return a file descriptor for that
> directory? That seems like genuinely useful behavior that we don't
> currently have a syscall for. I didn't see any suggestion in the thread
> for reasons why we can't or shouldn't do that.

Absolutely not.

For one thing, it is clearly not "genuinely useful behavior". It's just stupid.

Name a *single* real situation where that would be a big improvement?
Point to code, and point to a reason, and point to why it would make a
difference. No made-up hypotheticals.

If you want to open a directory, just do that

    fd = open(.., O_DIRECTORY);

and if that directory doesn't exist, and you still want to create it,
then just do

   mkdir(...);

on it. Done. And mkdir() is atomic, so there's no race there with
somebody else doing something else to that path.

And no, there is no race with a subsequent open of that mkdir case,
because you already know the result empty, so what would you do with
the fd you just got? Absolutely nothing. It's useless. Edwin Starr
sang a song all about it.

So there is *zero* real reasons for that "open a directory and create
it atomically". It's a nonsensical operation.

Ok, just to play along - maybe you can make it slightly less
nonsensical by throwing O_PATH into the mix, and now an empty
directory file descriptor at least has *some* use.

But even *if* you can point to such a thing being useful (and I'm
really doubtful), it would *still* be stupid.

Now your code would not only be specific to Linux, it would be
specific to some very new version of Linux, and do something
completely different on older versions.

Because those older versions will do random things, ranging from
"always return an error" to "create a regular file - not a directory -
and then return an error anyway" and finally "create a regular file -
not a directory - and return that resulting fd".

So no. We're not adding a *fourth* set of semantics to something that
is silly and useless to do in the first place, and that has already
had three existing semantics.

The reason it has had three different behaviors over the years is
*literally* that nobody has ever wanted to do it, and so the fact that
it has been broken for years hasn't even mattered.

Don't try to make that situation worse by then making up new pointless
meanings for it and try to come up with excuses why somebody would
want to do that operation.

                Linus
  
Josh Triplett March 28, 2023, 4 a.m. UTC | #16
On March 28, 2023 12:32:59 PM GMT+09:00, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>Ok, just to play along - maybe you can make it slightly less
>nonsensical by throwing O_PATH into the mix, and now an empty
>directory file descriptor at least has *some* use.

That's the case I was thinking of: create a directory, then use exclusively *at system calls, never anything path-based. (I was using "atomic" loosely; not concerned about races here, just convenience.)

>Now your code would not only be specific to Linux, it would be
>specific to some very new version of Linux, and do something
>completely different on older versions.

I'm extremely not concerned with depending on current Linux. But that said...

>Because those older versions will do random things, ranging from
>"always return an error" to "create a regular file - not a directory -
>and then return an error anyway" and finally "create a regular file -
>not a directory - and return that resulting fd".

... Right, open has the un-extendable semantics, hence O_TMPFILE. Fair enough. Nevermind then.

As is often the case for multi-operation syscalls, I'm better off just using io_uring for a mkdir-then-open.
  
Christian Brauner March 28, 2023, 7:57 a.m. UTC | #17
On Tue, Mar 28, 2023 at 01:00:30PM +0900, Josh Triplett wrote:
> On March 28, 2023 12:32:59 PM GMT+09:00, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >Ok, just to play along - maybe you can make it slightly less
> >nonsensical by throwing O_PATH into the mix, and now an empty
> >directory file descriptor at least has *some* use.
> 
> That's the case I was thinking of: create a directory, then use exclusively *at system calls, never anything path-based. (I was using "atomic" loosely; not concerned about races here, just convenience.)
> 
> >Now your code would not only be specific to Linux, it would be
> >specific to some very new version of Linux, and do something
> >completely different on older versions.
> 
> I'm extremely not concerned with depending on current Linux. But that said...
> 
> >Because those older versions will do random things, ranging from
> >"always return an error" to "create a regular file - not a directory -
> >and then return an error anyway" and finally "create a regular file -
> >not a directory - and return that resulting fd".
> 
> ... Right, open has the un-extendable semantics, hence O_TMPFILE. Fair enough. Nevermind then.

That's not even the issue per se as most applications would probably
just be able to test whether O_DIRECTORY|O_CREAT creates and opens a
directory. It's not that we haven't had to contend with similar issues
in userspace for other syscalls before.

The bigger problem for me is that we'd be advancing from fixing the
semantics to not do completely weird/unexpected things to making it do
something that users would expect or want it to do in one big step.

Right now we're making a clean break by telling userspace EINVAL. But if
that turns out to be problematic we can easily just roll back to a
version of the old weird behavior with probably little fanfare. But if
we already introduced support for new semantics that express user's
intuition about what it's supposed to do we'll have a much harder time
and created a flame war for ourselves.

If however, EINVAL works just fine for a couple of kernel releases then
it would be - separate from the sensibility of this specific request -
another matter to make it do something else. Because at that point it's
no different from reusing deprecated bits like we did for e.g.,
CLONE_DETACHED -> CLONE_PIDFD which has exactly the same ignore unknown
or removed flags semantics as open/openat/openat2. Moving slow even in
the face of excitement about new possibilities isn't always the wrong
thing. This is one case were it isn't.
  
Christian Brauner March 28, 2023, 8:12 a.m. UTC | #18
On Mon, Mar 27, 2023 at 09:13:18PM +0100, Pedro Falcato wrote:
> On Tue, Mar 21, 2023 at 4:17 PM Christian Brauner <brauner@kernel.org> wrote:
> > It would be very nice if we had tests for the new behavior. So if @Pedro
> > would be up for it that would be highly appreciated. If not I'll put it
> > on my ToDo...
> 
> Where do you want them? selftests? I have a relatively self-contained
> ""testsuite"" of namei stuff that could fit in there well, after some
> cleanup.

I think I would prefer to have them as part of xfstests:
https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git

as that's where nearly all of the fs testing is taking place. It's never
great when developers have to run 3 separate testsuites to get
meaningful coverage. So having it central to xfstests would be my
preference.

A while ago I added a testsuite that tests generic core VFS behavior
it's located under src/vfs:
https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/src/vfs

and covers a lot of different things. So I would ask you to consider
adding a new testsuite into that file:

https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/src/vfs/vfstest.c

I think the structure should be somewhat understandable. Then create a
new test in xfstests using the "new" helper in the generic sectionA

> ./new generic
Next test id is 728
Append a name to the ID? Test name will be 728-$name. y,[n]:
Creating test file '728'
Add to group(s) [auto] (separate by space, ? for list): auto quick
Creating skeletal script for you to edit ...

then call the vfstest binary from the generated test case:

echo "Silence is golden"

$here/src/vfs/vfstest --test-THAT-NEW-SWITCH-NAME-YOU-ADDED --device "$TEST_DEV" \
        --mount "$TEST_DIR" --fstype "$FSTYP"

status=$?
exit

(You can also submit this to LTP or tell them about this change and
they'll likely add tests in addition to xfstests.)

> 
> > The expectation often is that this particular combination would create
> > and open a directory. This suggests users who tried to use that
> > combination would stumble upon the counterintuitive behavior no matter
> > if pre-v5.7 or post v5.7 and quickly realize neither semantics give them
> > what they want. For some examples see the code examples in [1] to [3]
> > and the discussion in [4].
> 
> Ok so, silly question: Could it not be desirable to have these
> semantics (open a dir or mkdir, atomically)?
> It does seem to be why POSIX left this edge case implementation
> defined, and if folks are asking for it, could it be the right move?
> 
> And yes, I do understand (from reading the room) that no one here is
> too excited about this possibility.

Forgive me for being lazy and instead of repeating everything I'll just
leave a link to the other part of the thread
https://lore.kernel.org/lkml/20230328075735.d3rs27jjvarmn6dw@wittgenstein
  

Patch

diff --git a/fs/namei.c b/fs/namei.c
index edfedfbccae..7b26db2f0f8 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3540,8 +3540,18 @@  static int do_open(struct nameidata *nd,
 		if (unlikely(error))
 			return error;
 	}
-	if ((nd->flags & LOOKUP_DIRECTORY) && !d_can_lookup(nd->path.dentry))
-		return -ENOTDIR;
+
+	if ((open_flag & (O_DIRECTORY | O_CREAT)) != (O_DIRECTORY | O_CREAT) ||
+	    !(file->f_mode & FMODE_CREATED)) {
+		/* O_DIRECTORY | O_CREAT has the strange property of being the
+		 * only open(O_DIRECTORY) lookup that can create and return a
+		 * regular file *if we indeed did create*. Because of this,
+		 * only return -ENOTDIR if we're not O_DIR | O_CREAT or if we
+		 * did not create a file.
+		 */
+		if ((nd->flags & LOOKUP_DIRECTORY) && !d_can_lookup(nd->path.dentry))
+			return -ENOTDIR;
+	}
 
 	do_truncate = false;
 	acc_mode = op->acc_mode;