[v2] vfs: shave work on failed file open

Message ID 20230926162228.68666-1-mjguzik@gmail.com
State New

Commit Message

Mateusz Guzik Sept. 26, 2023, 4:22 p.m. UTC
Failed opens (mostly ENOENT) legitimately happen a lot, for example here
are stats from stracing a kernel build for a few seconds (strace -fc make):

  % time     seconds  usecs/call     calls    errors syscall
  ------ ----------- ----------- --------- --------- ------------------
    0.76    0.076233           5     15040      3688 openat

(this is tons of header files tried in different paths)

In the common case of there being nothing to close (only the file object
to free) there is a lot of overhead which can be avoided.

This is most notably delegation of freeing to task_work, which comes
with an enormous cost (see 021a160abf62 ("fs: use __fput_sync in
close(2)") for an example).

Benchmarked with will-it-scale with a custom testcase based on
tests/open1.c, stuffed into tests/openneg.c:
[snip]
        while (1) {
                int fd = open("/tmp/nonexistent", O_RDONLY);
                assert(fd == -1);

                (*iterations)++;
        }
[/snip]

Sapphire Rapids, openneg_processes -t 1 (ops/s):
before:	1950013
after:	2914973 (+49%)

file refcount is checked as a safety belt against buggy consumers with
an atomic cmpxchg. Technically it is not necessary, but it happens to
not be measurable due to several other atomics which immediately follow.
Optimizing them away to make this atomic into a problem is left as an
exercise for the reader.

v2:
- unexport fput_badopen and move to fs/internal.h
- handle the refcount with cmpxchg, adjust commentary accordingly
- tweak the commit message

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 fs/file_table.c | 35 +++++++++++++++++++++++++++++++++++
 fs/internal.h   |  2 ++
 fs/namei.c      |  2 +-
 3 files changed, 38 insertions(+), 1 deletion(-)
  

Comments

Linus Torvalds Sept. 26, 2023, 7 p.m. UTC | #1
On Tue, 26 Sept 2023 at 09:22, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> +void fput_badopen(struct file *file)
> +{
> +       if (unlikely(file->f_mode & (FMODE_BACKING | FMODE_OPENED))) {
> +               fput(file);
> +               return;
> +       }

I don't understand.

Why the FMODE_BACKING test?

The only thing that sets FMODE_BACKING is alloc_empty_backing_file(),
and we know that isn't involved, because the file that is free'd is

        file = alloc_empty_file(op->open_flag, current_cred());

so that test makes no sense.

It might make sense as another WARN_ON_ONCE(), but honestly, why even
that?  Why worry about FMODE_BACKING?

Now, the FMODE_OPENED check makes sense to me, in that it most
definitely can be set, and means we need to call the ->release()
callback and a lot more. Although I get the feeling that this test
would make more sense in the caller, since path_openat() _already_
checks for FMODE_OPENED in the non-error path too.

> +       if (WARN_ON_ONCE(atomic_long_cmpxchg(&file->f_count, 1, 0) != 1)) {
> +               fput(file);
> +               return;
> +       }

Ok, I kind of see why you'd want this safety check.  I don't see how
f_count could be validly anything else, but that's what the
WARN_ON_ONCE is all about.
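
As a rough userspace illustration of that safety belt (a sketch with C11 atomics, not the kernel code; `struct obj` and `claim_sole_ref()` are hypothetical names): the compare-exchange only succeeds when the count is exactly 1, so an unexpected extra reference is detected instead of the object being freed from under its holder.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted object such as struct file. */
struct obj {
	atomic_long count;
};

/*
 * Try to claim the last reference. Succeeds only if the count is
 * exactly 1, in which case it becomes 0 and the caller may free the
 * object directly. On failure the count is left untouched and the
 * caller has to fall back to the regular release path.
 */
static bool claim_sole_ref(struct obj *o)
{
	long expected = 1;

	return atomic_compare_exchange_strong(&o->count, &expected, 0);
}
```

A caller would free immediately when `claim_sole_ref()` returns true and otherwise warn and take the slow path, which mirrors the shape of the quoted hunk.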

Anyway, I think I'd be happier about this if it was more of a "just
the reverse of alloc_empty_file()", and path_openat() literally did
just

        if (likely(!(file->f_mode & FMODE_OPENED)))
                release_empty_file(file);
        else
                fput(file);

instead of having this fput_badopen() helper that feels like it needs
to care about other cases than alloc_empty_file().

Don't take this email as a NAK, though. I don't hate the patch. I just
feel it could be more targeted, and more clearly "this is explicitly
avoiding the cost of 'fput()' in just path_openat() if we never
actually filled things in".

                   Linus
  
Christian Brauner Sept. 27, 2023, 2:09 p.m. UTC | #2
> I don't have a strong opinion, I think my variant is cleaner and more
> generic, but this boils down to taste and this is definitely not the
> hill I'm willing to die on.

I kinda like the release_empty_file() approach but we should keep the
WARN_ON_ONCE() so we can see whether anyone is taking an extra reference
on this thing. It's super unlikely but I guess zebras exist and if some
(buggy) code were to call get_file() during ->open() and keep that
reference for some reason we'd want to know why. But I don't think
anything does that.

No need to resend I can massage this well enough in-tree.
  
Mateusz Guzik Sept. 27, 2023, 2:34 p.m. UTC | #3
On 9/27/23, Christian Brauner <brauner@kernel.org> wrote:
>> I don't have a strong opinion, I think my variant is cleaner and more
>> generic, but this boils down to taste and this is definitely not the
>> hill I'm willing to die on.
>
> I kinda like the release_empty_file() approach but we should keep the
> WARN_ON_ONCE() so we can see whether anyone is taking an extra reference
> on this thing. It's super unlikely but I guess zebras exist and if some
> (buggy) code were to call get_file() during ->open() and keep that
> reference for some reason we'd want to know why. But I don't think
> anything does that.
>
> No need to resend I can massage this well enough in-tree.
>

Ok, I'm buggering off to other patches.

Thanks.
  
Linus Torvalds Sept. 27, 2023, 5:48 p.m. UTC | #4
On Wed, 27 Sept 2023 at 07:10, Christian Brauner <brauner@kernel.org> wrote:
>
> No need to resend I can massage this well enough in-tree.

Hmm. Please do, but here's some more food for thought for at least the
commit message.

Because there's more than the "__fput_sync()" issue at hand, we have
another delayed thing that this patch ends up short-circuiting, which
wasn't obvious from the original description.

I'm talking about the fact that our existing "file_free()" ends up
doing the actual release with

        call_rcu(&f->f_rcuhead, file_free_rcu);

and the patch under discussion avoids that part too.

And I actually like that it avoids it, I just think it should be
mentioned explicitly, because it wasn't obvious to me until I actually
looked at the old __fput() path. Particularly since it means that the
f_creds are free'd synchronously now.

I do think that's fine, although I forget what path it was that
required that rcu-delayed cred freeing. Worth mentioning, and maybe
worth thinking about.

However, when I *did* look at it, it strikes me that we could do this
differently.

Something like this (ENTIRELY UNTESTED) patch, which just moves this
logic into fput() itself.

Again: ENTIRELY UNTESTED, and I might easily have screwed up. But it
looks simpler and more straightforward to me. But again: that may be
because I missed something.

             Linus
  
Mateusz Guzik Sept. 27, 2023, 5:56 p.m. UTC | #5
On 9/27/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Wed, 27 Sept 2023 at 07:10, Christian Brauner <brauner@kernel.org>
> wrote:
>>
>> No need to resend I can massage this well enough in-tree.
>
> Hmm. Please do, but here's some more food for thought for at least the
> commit message.
>
> Because there's more than the "__fput_sync()" issue at hand, we have
> another delayed thing that this patch ends up short-circuiting, which
> wasn't obvious from the original description.
>
> I'm talking about the fact that our existing "file_free()" ends up
> doing the actual release with
>
>         call_rcu(&f->f_rcuhead, file_free_rcu);
>
> and the patch under discussion avoids that part too.
>

Comments in the patch explicitly mention dodging RCU for the file object.

> And I actually like that it avoids it, I just think it should be
> mentioned explicitly, because it wasn't obvious to me until I actually
> looked at the old __fput() path. Particularly since it means that the
> f_creds are free'd synchronously now.
>

Well put_cred is called synchronously, but should this happen to be
the last ref on them, they will get call_rcu(&cred->rcu,
put_cred_rcu)'ed.

> I do think that's fine, although I forget what path it was that
> required that rcu-delayed cred freeing. Worth mentioning, and maybe
> worth thinking about.
>

See above. The only spot which plays tricks with it is
faccessat, other than that all creds are explicitly freed with rcu.

> However, when I *did* look at it, it strikes me that we could do this
> differently.
>
> Something like this (ENTIRELY UNTESTED) patch, which just moves this
> logic into fput() itself.
>

I did not want to do it because failed open is a special case, quite
specific to one syscall (and maybe few others later).

As is you are adding a branch to all final fputs and are preventing
whacking that 1 -> 0 unref down the road, unless it gets moved out
again like in my patch.
  
Linus Torvalds Sept. 27, 2023, 6:05 p.m. UTC | #6
On Wed, 27 Sept 2023 at 10:56, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> Comments in the patch explicitly mention dodging RCU for the file object.

Not the commit message, and the comment is also actually pretty
obscure and only talks about the freeing part.

The cred part is what actually made me go "why is that even rcu-free'd".

I *think* it's bogus, but I didn't go look at the history of it.

> Well put_cred is called synchronously, but should this happen to be
> the last ref on them, they will get call_rcu(&cred->rcu,
> put_cred_rcu)'ed.

Yes. But the way it's done in __fput() you end up potentially
RCU-delaying it twice. Odd.

The reason we rcu-delay the 'struct file *' is because of the
__fget_files_rcu() games.

But I don't see why the cred thing is there.

Historical mistake? But it all looks a bit odd, and because of that it
worries me.

              Linus
  
Mateusz Guzik Sept. 27, 2023, 6:32 p.m. UTC | #7
On Wed, Sep 27, 2023 at 11:05:37AM -0700, Linus Torvalds wrote:
> On Wed, 27 Sept 2023 at 10:56, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > Comments in the patch explicitly mention dodging RCU for the file object.
> 
> Not the commit message, and the comment is also actually pretty
> obscure and only talks about the freeing part.
> 

How about this:

================== cut here ==================

vfs: shave work on failed file open

Failed opens (mostly ENOENT) legitimately happen a lot, for example here
are stats from stracing a kernel build for a few seconds (strace -fc make):

  % time     seconds  usecs/call     calls    errors syscall
  ------ ----------- ----------- --------- --------- ------------------
    0.76    0.076233           5     15040      3688 openat

(this is tons of header files tried in different paths)

In the common case of there being nothing to close (only the file object
to free) there is a lot of overhead which can be avoided.

This boils down to 2 items:
1. avoiding delegation of fput to task_work, see 021a160abf62 ("fs:
use __fput_sync in close(2)") for more details on overhead
2. avoiding freeing the file with RCU

Benchmarked with will-it-scale with a custom testcase based on
tests/open1.c, stuffed into tests/openneg.c:
[snip]
        while (1) {
                int fd = open("/tmp/nonexistent", O_RDONLY);
                assert(fd == -1);

                (*iterations)++;
        }
[/snip]

Sapphire Rapids, openneg_processes -t 1 (ops/s):
before:	1950013
after:	2914973 (+49%)

file refcount is checked with an atomic cmpxchg as a safety belt against
buggy consumers. Technically it is not necessary, but it happens to not
be measurable due to several other atomics which immediately follow.
Optimizing them away to make this atomic into a problem is left as an
exercise for the reader.

================== cut here ==================
 
Comment in v2 is:

/*
 * Clean up after failing to open (e.g., open(2) returns with -ENOENT).
 *
 * This represents opportunities to shave on work in the common case of
 * FMODE_OPENED not being set:
 * 1. there is nothing to close, just the file object to free and consequently
 *    no need to delegate to task_work
 * 2. as nobody else had seen the file then there is no need to delegate
 *    freeing to RCU
 */

I don't see anything wrong with it as far as information goes.

> > Well put_cred is called synchronously, but should this happen to be
> > the last ref on them, they will get call_rcu(&cred->rcu,
> > put_cred_rcu)'ed.
> 
> Yes. But the way it's done in __fput() you end up potentially
> RCU-delaying it twice. Odd.
> 
> The reason we rcu-delay the 'struct file *' is because of the
> __fget_files_rcu() games.
> 
> But I don't see why the cred thing is there.
> 
> Historical mistake? But it all looks a bit odd, and because of that it
> worries me.
> 

put_cred showed up in file_free_rcu in d76b0d9b2d87 ("CRED: Use creds in
file structs"). Commit message does not claim any dependency on this
being in an rcu callback already and it looks like it was done this way
because this was the only spot with kmem_cache_free(filp_cachep, f) --
you ensured put_cred was always called without inspecting any other
places.

If there is something magic going on here I don't see it, it definitely
was not intended at least.
  
Linus Torvalds Sept. 27, 2023, 8:27 p.m. UTC | #8
On Wed, 27 Sept 2023 at 11:32, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> put_cred showed up in file_free_rcu in d76b0d9b2d87 ("CRED: Use creds in
> file structs"). Commit message does not claim any dependency on this
> being in an rcu callback already and it looks like it was done this way
> because this was the only spot with kmem_cache_free(filp_cachep, f)

Yes, that looks about right. So the rcu-freeing is almost an accident.

Btw, I think we could get rid of the RCU freeing of 'struct file *' entirely.

The way to fix it is

 (a) make sure all f_count accesses are atomic ops (the one special
case is the "0 -> X" initialization, which is ok)

 (b) make filp_cachep be SLAB_TYPESAFE_BY_RCU

because then get_file_rcu() can do the atomic_long_inc_not_zero()
knowing it's still a 'struct file *' while holding the RCU read lock
even if it was just free'd.

And __fget_files_rcu() will then re-check that it's the *right*
'struct file *' and do a fput() on it and re-try if it isn't. End
result: no need for any RCU freeing.

But the difference is that a *new* 'struct file *' might see a
temporary atomic increment / decrement of the file pointer because
another CPU is going through that __fget_files_rcu() dance.

Which is why "0 -> X" is ok to do as a "atomic_long_set()", but
everything else would need to be done as "atomic_long_inc()" etc.
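
To make the pattern concrete, here is a userspace sketch with C11 atomics (not the kernel implementation; `struct obj`, `get_ref_not_zero()`, `put_ref()` and `lookup()` are hypothetical stand-ins for 'struct file', get_file_rcu(), fput() and __fget_files_rcu()). It assumes the memory behind an object always stays a `struct obj`, which is what SLAB_TYPESAFE_BY_RCU would guarantee under the RCU read lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct obj {
	atomic_long count;	/* 0 means freed, memory may be reused */
};

/* Take a reference unless the count already dropped to zero. */
static bool get_ref_not_zero(struct obj *o)
{
	long c = atomic_load(&o->count);

	while (c != 0) {
		/* On failure c is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&o->count, &c, c + 1))
			return true;
	}
	return false;
}

/* Drop a reference. Freeing on the final put is omitted in this sketch. */
static void put_ref(struct obj *o)
{
	atomic_fetch_sub(&o->count, 1);
}

/*
 * Look an object up through a published slot. Because the memory is
 * only type-stable, a successful increment may have hit a recycled
 * object, so the slot is re-read and the temporary reference dropped
 * if it no longer points at the same object.
 */
static struct obj *lookup(struct obj *_Atomic *slot)
{
	for (;;) {
		struct obj *o = atomic_load(slot);

		if (!o)
			return NULL;
		if (!get_ref_not_zero(o))
			continue;	/* racing with the final put, retry */
		if (atomic_load(slot) == o)
			return o;	/* still the right object */
		put_ref(o);		/* recycled under us, retry */
	}
}
```

The temporary increment followed by `put_ref()` in the retry path is exactly the transient count change that a freshly allocated object could observe.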

Which all seems to be the case already, so with the put_cred() not
needing the RCU delay, I think we really could do this patch (note:
independent of other issues, but makes your patch require that
"atomic_long_cmpxchg()" and the WARN_ON() should probably go away,
because it can actually happen).

That should help the normal file open/close case a bit, in that it
doesn't cause that extra RCU work.

Of course, on some loads it might be advantageous to do a delayed
de-allocation in some other RCU context, so ..

What do you think?

             Linus

PS. And as always: ENTIRELY UNTESTED.
  
Mateusz Guzik Sept. 27, 2023, 9:06 p.m. UTC | #9
On 9/27/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Btw, I think we could get rid of the RCU freeing of 'struct file *'
> entirely.
>
> The way to fix it is
>
>  (a) make sure all f_count accesses are atomic ops (the one special
> case is the "0 -> X" initialization, which is ok)
>
>  (b) make filp_cachep be SLAB_TYPESAFE_BY_RCU
>
> because then get_file_rcu() can do the atomic_long_inc_not_zero()
> knowing it's still a 'struct file *' while holding the RCU read lock
> even if it was just free'd.
>
> And __fget_files_rcu() will then re-check that it's the *right*
> 'struct file *' and do a fput() on it and re-try if it isn't. End
> result: no need for any RCU freeing.
>
> But the difference is that a *new* 'struct file *' might see a
> temporary atomic increment / decrement of the file pointer because
> another CPU is going through that __fget_files_rcu() dance.
>

I think you attached the wrong file, it has next to no changes and in
particular nothing for fd lookup.

You may find it interesting that both NetBSD and FreeBSD have been
doing something to that extent for years now in order to provide
lockless fd lookup despite not having an equivalent to RCU (what they
did have at the time is "type stable" -- objs can get reused but the
memory can *never* get freed. utterly gross, but that's old Unix for
you).

It does work, but I always found it dodgy because it backpedals in a
way which is not free of side effects.

Note that validating you got the right file bare minimum requires
reloading the fd table pointer because you might have been racing
against close *and* resize.
  
Linus Torvalds Sept. 27, 2023, 9:18 p.m. UTC | #10
On Wed, 27 Sept 2023 at 14:06, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> I think you attached the wrong file, it has next to no changes and in
> particular nothing for fd lookup.

The fd lookup is already safe.

It already does the whole "double-check the file pointer after doing
the increment" for other reasons - namely the whole "oh, the file
table can be re-allocated under us" thing.

So the fd lookup needs rcu, but it does all the checks to make it all
work with SLAB_TYPESAFE_BY_RCU.

> You may find it interesting that both NetBSD and FreeBSD have been
> doing something to that extent for years now in order to provide
> lockless fd lookup despite not having an equivalent to RCU (what they
> did have at the time is "type stable" -- objs can get reused but the
> memory can *never* get freed. utterly gross, but that's old Unix for
> you).

That kind of "never free'd" thing is indeed gross, but the
type-stability is useful.

Our SLAB_TYPESAFE_BY_RCU is somewhat widely used, exactly because it's
much cheaper than an *actual* RCU delayed free.

Of course, it also requires more care, but it so happens that we
already have that for other reasons for 'struct file'.

> It does work, but I always found it dodgy because it backpedals in a
> way which is not free of side effects.

Grep around for SLAB_TYPESAFE_BY_RCU and you'll see that we actually
have it in multiple places, most notably the sighand_struct.

> Note that validating you got the right file bare minimum requires
> reloading the fd table pointer because you might have been racing
> against close *and* resize.

Exactly. See __fget_files_rcu().

          Linus
  
Mateusz Guzik Sept. 27, 2023, 9:30 p.m. UTC | #11
On 9/27/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Wed, 27 Sept 2023 at 14:06, Mateusz Guzik <mjguzik@gmail.com> wrote:
>>
>> I think you attached the wrong file, it has next to no changes and in
>> particular nothing for fd lookup.
>
> The fd lookup is already safe.
>
> It already does the whole "double-check the file pointer after doing
> the increment" for other reasons - namely the whole "oh, the file
> table can be re-allocated under us" thing.
>
> So the fd lookup needs rcu, but it does all the checks to make it all
> work with SLAB_TYPESAFE_BY_RCU.
>

Indeed, nice.

Sorry, I discounted the patch after not seeing anything for fd and
file_free_rcu still being there. Looked like a WIP.

I'm going to give it a spin tomorrow along with some benching.
  
Christian Brauner Sept. 28, 2023, 1:25 p.m. UTC | #12
> Which all seems to be the case already, so with the put_cred() not
> needing the RCU delay, I think we really could do this patch (note:

So I spent a good chunk of time going through this patch.

Before file->f_cred was introduced file->f_{g,u}id would have been
accessible just under rcu protection. And file->f_cred->f_fs{g,u}id
replaced that access. So I think the intention was that file->f_cred
would function the same way, i.e., it would be possible to go from file
to cred under rcu without requiring a reference.

But basically, file->f_cred is the only field that would give this
guarantee. Other pointers such as file->f_security
(security_file_free()) don't and are freed outside of the rcu delay
already as well.

This patch means that if someone wants to access file->f_cred under rcu
they now need to call get_file_rcu() first.

Nothing has relied on this rcu-only file->f_cred quirk/feature until now
so I think it's fine to change it.

Does that make sense?

Please take a look at:
https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs.misc&id=e3f15ee79197fc8b17d3496b6fa4fa0fc20f5406
for testing.
  
Christian Brauner Sept. 28, 2023, 2:05 p.m. UTC | #13
> So I spent a good chunk of time going through this patch.

The main thing that makes me go "we shouldn't do this" is that KASAN
isn't able to detect UAF issues as Jann pointed out so I'm getting
really nervous about this.

And Jann also pointed out some potential issues with
__fget_files_rcu() as well...
  
Jann Horn Sept. 28, 2023, 2:43 p.m. UTC | #14
On Thu, Sep 28, 2023 at 4:05 PM Christian Brauner <brauner@kernel.org> wrote:
>
> > So I spent a good chunk of time going through this patch.
>
> The main thing that makes me go "we shouldn't do this" is that KASAN
> isn't able to detect UAF issues as Jann pointed out so I'm getting
> really nervous about this.

(FWIW there is an in-progress patch to address this that I sent a few
weeks ago but that is not landed yet,
<https://lore.kernel.org/linux-mm/20230825211426.3798691-1-jannh@google.com/>.
So currently KASAN can only detect UAF in SLAB_TYPESAFE_BY_RCU slabs
when the slab allocator has given them back to the page allocator.)

> And Jann also pointed out some potential issues with
> __fget_files_rcu() as well...

The issue I see with the current __fget_files_rcu() is that the
"file->f_mode & mask" is no longer effective in its current position,
it would have to be moved down below the get_file_rcu() call.
That's a semantic difference between manually RCU-freeing and
SLAB_TYPESAFE_BY_RCU - we no longer have the guarantee that an object
can't be freed and reallocated within a single RCU grace period.
With the current patch, we could race like this:

```
static inline struct file *__fget_files_rcu(struct files_struct *files,
        unsigned int fd, fmode_t mask)
{
        for (;;) {
                struct file *file;
                struct fdtable *fdt = rcu_dereference_raw(files->fdt);
                struct file __rcu **fdentry;

                if (unlikely(fd >= fdt->max_fds))
                        return NULL;

                fdentry = fdt->fd + array_index_nospec(fd, fdt->max_fds);
                file = rcu_dereference_raw(*fdentry);
                if (unlikely(!file))
                        return NULL;

                if (unlikely(file->f_mode & mask))
                        return NULL;

                [in another thread:]
                [file is removed from fd table and freed]
                [file is reallocated as something like an O_PATH file,
                 which the check above would not permit]
                [reallocated file is inserted in the fd table in the
                 same position]

                /*
                 * Ok, we have a file pointer. However, because we do
                 * this all locklessly under RCU, we may be racing with
                 * that file being closed.
                 *
                 * Such a race can take two forms:
                 *
                 *  (a) the file ref already went down to zero,
                 *      and get_file_rcu() fails. Just try again:
                 */
                if (unlikely(!get_file_rcu(file))) [succeeds]
                        continue;

                /*
                 *  (b) the file table entry has changed under us.
                 *       Note that we don't need to re-check the 'fdt->fd'
                 *       pointer having changed, because it always goes
                 *       hand-in-hand with 'fdt'.
                 *
                 * If so, we need to put our ref and try again.
                 */
                [recheck succeeds because the new file was inserted in
                 the same position]
                if (unlikely(rcu_dereference_raw(files->fdt) != fdt) ||
                    unlikely(rcu_dereference_raw(*fdentry) != file)) {
                        fput(file);
                        continue;
                }

                /*
                 * Ok, we have a ref to the file, and checked that it
                 * still exists.
                 */
                [a file incompatible with the supplied mask is returned]
                return file;
        }
}
```

There are also some weird get_file_rcu() users in other places like
BPF's task_file_seq_get_next and in gfs2_glockfd_next_file that do
weird stuff without the recheck, especially gfs2_glockfd_next_file
even looks at the inodes of files without taking a reference (which
seems a little dodgy but maybe actually currently works because inodes
are also RCU-freed?). So I think you'd have to clean all of that up
before you can make this change.

Similar thing with get_mm_exe_file(), that relies on get_file_rcu()
success meaning that the file was not reallocated. And tid_fd_mode()
in procfs assumes that task_lookup_fd_rcu() returns a file* whose mode
can be inspected under RCU.

As Linus already mentioned, release_empty_file() is also broken now,
because it assumes that nobody will grab references to unopened files,
but actually that can now happen spuriously when a concurrent fget()
has called get_file_rcu() on a recycled file and not yet hit the
recheck fput(). Kinda like the thing with "struct page" where GUP can
randomly spuriously bump up the refcount of any page including ones
that are not mapped into userspace. So that would have to go through
the same fput() path as every other file freeing.

We also now rely on the "f_count" initialization in init_file()
happening after the point of no return, which is currently the case,
but that'd have to be documented to avoid someone adding a later
bailout in the future, and maybe could be clarified by actually moving
the count initialization after the bailout?

Heh, I grepped for `__rcu.*file`, and BPF has a thing in
kernel/bpf/verifier.c that seems to imply it would be safe for some
types of BPF programs to follow the mm->exe_file reference solely
protected by RCU, which already seems a little dodgy now but more
after this change:

```
/* RCU trusted: these fields are trusted in RCU CS and can be NULL */
BTF_TYPE_SAFE_RCU_OR_NULL(struct mm_struct) {
        struct file __rcu *exe_file;
};
```

(To be clear: This is not intended to be an exhaustive list.)


So I think conceptually this is something you can do but it would
require a bit of cleanup all around the kernel to make sure you really
just have one or two central functions that make use of the limited
RCU-ness of "struct file", and that nothing else relies on that or
makes assumptions about how non-zero refcounts move.
  
Linus Torvalds Sept. 28, 2023, 5:21 p.m. UTC | #15
On Thu, 28 Sept 2023 at 07:44, Jann Horn <jannh@google.com> wrote:
>
> The issue I see with the current __fget_files_rcu() is that the
> "file->f_mode & mask" is no longer effective in its current position,
> it would have to be moved down below the get_file_rcu() call.

Yes, you're right.

But moving it down below the "re-check that the fdt pointer and the
file pointer still matches" should be easy and sufficient.
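
A userspace sketch of that ordering, using C11 atomics (illustrative only; `struct obj`, `MODE_PATH` and `lookup_masked()` are hypothetical names, and `put_ref()` does not free anything here): the mode is inspected only after a reference is held and the slot has been re-checked, so it is known to belong to the object that is actually returned.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODE_PATH 0x1u	/* hypothetical flag, stands in for FMODE_PATH */

struct obj {
	atomic_long count;	/* 0 means freed, memory may be reused */
	unsigned int mode;	/* only trusted while a reference is held */
};

static bool get_ref_not_zero(struct obj *o)
{
	long c = atomic_load(&o->count);

	while (c != 0) {
		if (atomic_compare_exchange_weak(&o->count, &c, c + 1))
			return true;
	}
	return false;
}

static void put_ref(struct obj *o)
{
	atomic_fetch_sub(&o->count, 1);
}

/* Return a referenced object from the slot, or NULL if mask forbids it. */
static struct obj *lookup_masked(struct obj *_Atomic *slot, unsigned int mask)
{
	for (;;) {
		struct obj *o = atomic_load(slot);

		if (!o)
			return NULL;
		if (!get_ref_not_zero(o))
			continue;
		if (atomic_load(slot) != o) {
			put_ref(o);
			continue;
		}
		/* Reference held and identity confirmed: check the mode now. */
		if (o->mode & mask) {
			put_ref(o);
			return NULL;
		}
		return o;
	}
}
```

Checking the mode before `get_ref_not_zero()` would reintroduce the window in which the object can be recycled between the check and the reference being taken.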

> There are also some weird get_file_rcu() users in other places like
> BPF's task_file_seq_get_next and in gfs2_glockfd_next_file that do
> weird stuff without the recheck, especially gfs2_glockfd_next_file
> even looks at the inodes of files without taking a reference (which
> seems a little dodgy but maybe actually currently works because inodes
> are also RCU-freed?).

The inodes are also RCU-free'd, but that is indeed dodgy.

I think it happens to work, and we actually have a somewhat similar
pattern in the RCU lookup code (except with dentry->d_inode, not
file->f_inode), because as you say the inode data structure itself is
rcu-free'd, but more importantly, that code does the "get_file_rcu()"
afterwards.

And yes, right now that works fine, because it will fail if the file
f_count goes down to zero.

And f_count will go down to zero before we really tear down the inode with

        file->f_op->release(inode, file);

and (more importantly) the dput -> dentry_kill -> dentry_unlink_inode
-> release.

So that get_file_rcu() will currently protect against any "oh, the
inode is stale and about to be released".

But yes, that protection would be broken by SLAB_TYPESAFE_BY_RCU,
since then the "f_count is zero" is no longer a final thing.

It's fixable by having the same "double check the file table" that I
do think we should do regardless. That get_file_rcu() pattern may
*work*, but it's very very dodgy.

                Linus
  
Christian Brauner Sept. 29, 2023, 9:20 a.m. UTC | #16
> But yes, that protection would be broken by SLAB_TYPESAFE_BY_RCU,
> since then the "f_count is zero" is no longer a final thing.

I've tried coming up with a patch that is simple enough so the pattern
is easy to follow and then converting all places to rely on a pattern
that combines lookup_fd_rcu() or similar with get_file_rcu(). The obvious
thing is that we'll force a few places to now always acquire a reference
when they don't really need one right now and that already may cause
performance issues.

We also can't fully get rid of plain get_file_rcu() uses itself because
of users such as mm->exe_file. They don't go from one of the rcu fdtable
lookup helpers to the struct file obviously. They rcu replace the file
pointer in their struct ofc so we could change get_file_rcu() to take a
struct file __rcu **f and then comparing that the passed in pointer
hasn't changed before we managed to do atomic_long_inc_not_zero(). Which
afaict should work for such cases.

But overall we would introduce a fairly big and at the same time subtle
semantic change. The idea is pretty neat and it was fun to do but I'm
just not convinced we should do it given how ubiquitous struct file is
used and now to make the semantics even more special by allowing
refcounts.

I've kept your original release_empty_file() proposal in vfs.misc which
I think is a really nice change.

Let me know if you all passionately disagree. ;)
  
Jann Horn Sept. 29, 2023, 1:31 p.m. UTC | #17
On Fri, Sep 29, 2023 at 11:20 AM Christian Brauner <brauner@kernel.org> wrote:
> > But yes, that protection would be broken by SLAB_TYPESAFE_BY_RCU,
> > since then the "f_count is zero" is no longer a final thing.
>
> I've tried coming up with a patch that is simple enough so the pattern
> is easy to follow and then converting all places to rely on a pattern
> that combines lookup_fd_rcu() or similar with get_file_rcu(). The obvious
> thing is that we'll force a few places to now always acquire a reference
> when they don't really need one right now and that already may cause
> performance issues.

(Those places are probably used way less often than the hot
open/fget/close paths though.)

> We also can't fully get rid of plain get_file_rcu() uses itself because
> of users such as mm->exe_file. They don't go from one of the rcu fdtable
> lookup helpers to the struct file obviously. They rcu replace the file
> pointer in their struct ofc so we could change get_file_rcu() to take a
> struct file __rcu **f and then comparing that the passed in pointer
> hasn't changed before we managed to do atomic_long_inc_not_zero(). Which
> afaict should work for such cases.
>
> But overall we would introduce a fairly big and at the same time subtle
> semantic change. The idea is pretty neat and it was fun to do but I'm
> just not convinced we should do it given how ubiquitous struct file is
> used and now to make the semantics even more special by allowing
> refcounts.
>
> I've kept your original release_empty_file() proposal in vfs.misc which
> I think is a really nice change.
>
> Let me know if you all passionately disagree. ;)
  
Christian Brauner Sept. 29, 2023, 7:57 p.m. UTC | #18
On Fri, Sep 29, 2023 at 03:31:29PM +0200, Jann Horn wrote:
> On Fri, Sep 29, 2023 at 11:20 AM Christian Brauner <brauner@kernel.org> wrote:
> > > But yes, that protection would be broken by SLAB_TYPESAFE_BY_RCU,
> > > since then the "f_count is zero" is no longer a final thing.
> >
> > I've tried coming up with a patch that is simple enough so the pattern
> > is easy to follow and then converting all places to rely on a pattern
> > that combines lookup_fd_rcu() or similar with get_file_rcu(). The obvious
> > thing is that we'll force a few places to now always acquire a reference
> > when they don't really need one right now and that already may cause
> > performance issues.
> 
> (Those places are probably used way less often than the hot
> open/fget/close paths though.)
> 
> > We also can't fully get rid of plain get_file_rcu() uses itself because
> > of users such as mm->exe_file. They don't go from one of the rcu fdtable
> > lookup helpers to the struct file obviously. They rcu replace the file
> > pointer in their struct ofc so we could change get_file_rcu() to take a
> > struct file __rcu **f and then comparing that the passed in pointer
> > hasn't changed before we managed to do atomic_long_inc_not_zero(). Which
> > afaict should work for such cases.
> >
> > But overall we would introduce a fairly big and at the same time subtle
> > semantic change. The idea is pretty neat and it was fun to do but I'm
> > just not convinced we should do it given how ubiquitous struct file is
> > used and now to make the semantics even more special by allowing
> > refcounts.
> >
> > I've kept your original release_empty_file() proposal in vfs.misc which
> > I think is a really nice change.
> >
> > Let me know if you all passionately disagree. ;)

So I'm appending the patch I had played with and a fix from Jann on top.
@Linus, if you have an opinion, let me know what you think.

Also available here:
https://gitlab.com/brauner/linux/-/commits/vfs.file.rcu

Might be interesting if this could be perfed to see if there is any real
gain for workloads with massive numbers of fds.
  
Mateusz Guzik Sept. 29, 2023, 9:23 p.m. UTC | #19
On 9/29/23, Christian Brauner <brauner@kernel.org> wrote:
> On Fri, Sep 29, 2023 at 03:31:29PM +0200, Jann Horn wrote:
>> On Fri, Sep 29, 2023 at 11:20 AM Christian Brauner <brauner@kernel.org>
>> wrote:
>> > > But yes, that protection would be broken by SLAB_TYPESAFE_BY_RCU,
>> > > since then the "f_count is zero" is no longer a final thing.
>> >
>> > I've tried coming up with a patch that is simple enough so the pattern
>> > is easy to follow and then converting all places to rely on a pattern
>> > that combines lookup_fd_rcu() or similar with get_file_rcu(). The
>> > obvious
>> > thing is that we'll force a few places to now always acquire a
>> > reference
>> > when they don't really need one right now and that already may cause
>> > performance issues.
>>
>> (Those places are probably used way less often than the hot
>> open/fget/close paths though.)
>>
>> > We also can't fully get rid of plain get_file_rcu() uses itself because
>> > of users such as mm->exe_file. They don't go from one of the rcu
>> > fdtable
>> > lookup helpers to the struct file obviously. They rcu replace the file
>> > pointer in their struct ofc so we could change get_file_rcu() to take a
>> > struct file __rcu **f and then comparing that the passed in pointer
>> > hasn't changed before we managed to do atomic_long_inc_not_zero().
>> > Which
>> > afaict should work for such cases.
>> >
>> > But overall we would introduce a fairly big and at the same time subtle
>> > semantic change. The idea is pretty neat and it was fun to do but I'm
>> > just not convinced we should do it given how ubiquitous struct file is
>> > used and now to make the semantics even more special by allowing
>> > refcounts.
>> >
>> > I've kept your original release_empty_file() proposal in vfs.misc which
>> > I think is a really nice change.
>> >
>> > Let me know if you all passionately disagree. ;)
>
> So I'm appending the patch I had played with and a fix from Jann on top.
> @Linus, if you have an opinion, let me know what you think.
>
> Also available here:
> https://gitlab.com/brauner/linux/-/commits/vfs.file.rcu
>
> Might be interesting if this could be perfed to see if there is any real
> gain for workloads with massive numbers of fds.
>

I would feel safer with a guaranteed way to tell that the file was reallocated.

I think this could track allocs/frees with a sequence counter embedded
into the object, say odd means deallocated and even means allocated.

Then you would know for a fact whether you raced with the file getting
whacked and would never have to wonder if you double-checked
everything you needed (like that f_mode thing).

This would also mean that consumers which get away with poking around
the file without getting a ref could still do it, this is at least
true for tid_fd_mode. All of them would need patching though.

Extending struct file is not ideal by any means, but the good news is that:
1. there is a 4 byte hole in there, if one is fine with an int-sized counter
2. if one insists on 8 bytes, the struct is 232 bytes on my kernel
(debian). still some room up to 256, so it may be tolerable?
  
Mateusz Guzik Sept. 29, 2023, 9:39 p.m. UTC | #20
On 9/29/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> On 9/29/23, Christian Brauner <brauner@kernel.org> wrote:
>> On Fri, Sep 29, 2023 at 03:31:29PM +0200, Jann Horn wrote:
>>> On Fri, Sep 29, 2023 at 11:20 AM Christian Brauner <brauner@kernel.org>
>>> wrote:
>>> > > But yes, that protection would be broken by SLAB_TYPESAFE_BY_RCU,
>>> > > since then the "f_count is zero" is no longer a final thing.
>>> >
>>> > I've tried coming up with a patch that is simple enough so the pattern
>>> > is easy to follow and then converting all places to rely on a pattern
>>> > that combines lookup_fd_rcu() or similar with get_file_rcu(). The
>>> > obvious
>>> > thing is that we'll force a few places to now always acquire a
>>> > reference
>>> > when they don't really need one right now and that already may cause
>>> > performance issues.
>>>
>>> (Those places are probably used way less often than the hot
>>> open/fget/close paths though.)
>>>
>>> > We also can't fully get rid of plain get_file_rcu() uses itself
>>> > because
>>> > of users such as mm->exe_file. They don't go from one of the rcu
>>> > fdtable
>>> > lookup helpers to the struct file obviously. They rcu replace the file
>>> > pointer in their struct ofc so we could change get_file_rcu() to take
>>> > a
>>> > struct file __rcu **f and then comparing that the passed in pointer
>>> > hasn't changed before we managed to do atomic_long_inc_not_zero().
>>> > Which
>>> > afaict should work for such cases.
>>> >
>>> > But overall we would introduce a fairly big and at the same time
>>> > subtle
>>> > semantic change. The idea is pretty neat and it was fun to do but I'm
>>> > just not convinced we should do it given how ubiquitous struct file is
>>> > used and now to make the semantics even more special by allowing
>>> > refcounts.
>>> >
>>> > I've kept your original release_empty_file() proposal in vfs.misc
>>> > which
>>> > I think is a really nice change.
>>> >
>>> > Let me know if you all passionately disagree. ;)
>>
>> So I'm appending the patch I had played with and a fix from Jann on top.
>> @Linus, if you have an opinion, let me know what you think.
>>
>> Also available here:
>> https://gitlab.com/brauner/linux/-/commits/vfs.file.rcu
>>
>> Might be interesting if this could be perfed to see if there is any real
>> gain for workloads with massive numbers of fds.
>>
>
> I would feel safer with a guaranteed way to tell that the file was
> reallocated.
>
> I think this could track allocs/frees with a sequence counter embedded
> into the object, say odd means deallocated and even means allocated.
>
> Then you would know for a fact whether you raced with the file getting
> whacked and would never have to wonder if you double-checked
> everything you needed (like that f_mode thing).
>
> This would also mean that consumers which get away with poking around
> the file without getting a ref could still do it, this is at least
> true for tid_fd_mode. All of them would need patching though.
>
> Extending struct file is not ideal by any means, but the good news is that:
> 1. there is a 4 byte hole in there, if one is fine with an int-sized
> counter
> 2. if one insists on 8 bytes, the struct is 232 bytes on my kernel
> (debian). still some room up to 256, so it may be tolerable?
>

So to be clear, obtaining the initial count would require a dedicated
accessor. First you would find the file obj, wait for the count to
reach "allocated" state, validate the source still has the right
pointer, validate the count did not change (with acq fences sprinkled
in there). At the end of it you know that the seq counter you got from
the file was there when the file was still "installed".

Then you can poke around and validate you poked around the right thing
by once more validating the counter.

Maybe I missed something, but the idea in general should work.
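
A minimal userspace sketch of the scheme described above, using C11
atomics. Everything here is hypothetical: struct obj stands in for
struct file, struct slot for whatever publishes the pointer, and none
of the helper names exist in the kernel. It also assumes the object's
memory stays type-stable, and it reports failure instead of waiting
for the count to reach the "allocated" state.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct file; not the kernel layout. */
struct obj {
	_Atomic uint64_t seq;	/* even: allocated, odd: deallocated */
	_Atomic int payload;	/* stands in for a field such as f_mode */
};

/* Hypothetical publishing slot, e.g. an fdtable entry. */
struct slot {
	struct obj *_Atomic ptr;
};

/* Allocation moves the counter from odd to even. */
static void obj_mark_allocated(struct obj *o)
{
	uint64_t prev = atomic_fetch_add_explicit(&o->seq, 1,
						  memory_order_release);
	assert(prev & 1);
	(void)prev;
}

/* Freeing moves the counter from even to odd. */
static void obj_mark_freed(struct obj *o)
{
	uint64_t prev = atomic_fetch_add_explicit(&o->seq, 1,
						  memory_order_release);
	assert(!(prev & 1));
	(void)prev;
}

/*
 * Dedicated accessor for the initial count: find the object, require
 * the "allocated" state, then check that the slot still holds the same
 * pointer and that the count did not change in the meantime.
 */
static bool obj_seq_begin(struct slot *s, struct obj **out, uint64_t *seq)
{
	struct obj *o = atomic_load_explicit(&s->ptr, memory_order_acquire);
	uint64_t v;

	if (o == NULL)
		return false;
	v = atomic_load_explicit(&o->seq, memory_order_acquire);
	if (v & 1)
		return false;	/* currently deallocated */
	if (atomic_load_explicit(&s->ptr, memory_order_acquire) != o)
		return false;	/* slot was replaced */
	if (atomic_load_explicit(&o->seq, memory_order_acquire) != v)
		return false;	/* freed or recycled meanwhile */
	*out = o;
	*seq = v;
	return true;
}

/* After poking around: were all reads made against the same allocation? */
static bool obj_seq_valid(struct obj *o, uint64_t seq)
{
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&o->seq, memory_order_relaxed) == seq;
}
```

A reader would call obj_seq_begin(), read the fields it cares about,
and trust those reads only if obj_seq_valid() then returns true.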
  
Matthew Wilcox Sept. 29, 2023, 10:24 p.m. UTC | #21
On Fri, Sep 29, 2023 at 11:23:04PM +0200, Mateusz Guzik wrote:
> Extending struct file is not ideal by any means, but the good news is that:
> 1. there is a 4 byte hole in there, if one is fine with an int-sized counter
> 2. if one insists on 8 bytes, the struct is 232 bytes on my kernel
> (debian). still some room up to 256, so it may be tolerable?

256 isn't quite the magic number for slabs ... at 256 bytes, we'd get 16
per 4kB page, but at 232 bytes we get 17 objects per 4kB page (or 35 per
8kB pair of pages).

That said, I think a 32-bit counter is almost certainly sufficient.
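
The packing figures above can be reproduced with a small helper. This
is only a sketch: objs_per_slab() is a hypothetical name, and it
assumes 4096-byte pages with no per-slab metadata and no alignment
rounding of the object size, which a real slab cache may apply.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of objects of obj_size bytes that fit into a slab made of
 * `pages` 4 kB pages, ignoring metadata and alignment (assumptions).
 */
static size_t objs_per_slab(size_t obj_size, size_t pages)
{
	assert(obj_size > 0);
	return (pages * 4096) / obj_size;
}
```

With these assumptions, objs_per_slab(256, 1) is 16, objs_per_slab(232, 1)
is 17 and objs_per_slab(232, 2) is 35.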
  
Jann Horn Sept. 29, 2023, 11:02 p.m. UTC | #22
On Sat, Sep 30, 2023 at 12:24 AM Matthew Wilcox <willy@infradead.org> wrote:
> On Fri, Sep 29, 2023 at 11:23:04PM +0200, Mateusz Guzik wrote:
> > Extending struct file is not ideal by any means, but the good news is that:
> > 1. there is a 4 byte hole in there, if one is fine with an int-sized counter
> > 2. if one insists on 8 bytes, the struct is 232 bytes on my kernel
> > (debian). still some room up to 256, so it may be tolerable?
>
> 256 isn't quite the magic number for slabs ... at 256 bytes, we'd get 16
> per 4kB page, but at 232 bytes we get 17 objects per 4kB page (or 35 per
> 8kB pair of pages).
>
> That said, I think a 32-bit counter is almost certainly sufficient.

I don't like the sequence number proposal because it seems to me like
it's adding one more layer of complication, but if this does happen, I
very much would want that number to be 64-bit. A computer doesn't take
_that_ long to count to 2^32, and especially with preemptible RCU it's
kinda hard to reason about how long a task might stay in the middle of
an RCU grace period. Like, are we absolutely sure that there is no
pessimal case where the scheduler will not schedule a runnable
cpu-pinned idle-priority task for a few minutes? Either because we hit
some pessimal case in the scheduler or because the task gets preempted
by something that's spinning a very long time with preemption
disabled?
(And yes, I know, seqlocks...)
  
Linus Torvalds Sept. 29, 2023, 11:57 p.m. UTC | #23
On Fri, 29 Sept 2023 at 14:39, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> So to be clear, obtaining the initial count would require a dedicated
> accessor.

Please, no.

Sequence numbers here are fundamentally broken, since getting that
initial sequence number would involve either (a) making it something
outside of 'struct file' itself or (b) require the same re-validation
of the file pointer that the non-sequence number code needed in the
first place.

We already have the right model in the only place that really matters
(ie fd lookup). We should use that same "validate file pointer after you got
the ref to it" for the two or three other cases that didn't do it (and
are simpler: the exec pointer in particular doesn't need the fdt
re-validation at all).

The fact that we had some fd lookup that didn't do the full thing that
a *real* fd lookup did is just bad. Let's fix it, not introduce a
sequence counter that only adds more complexity.
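
A userspace sketch of the "validate file pointer after you got the ref
to it" model, using C11 atomics. It is an analogue, not the kernel
code: struct obj, struct slot and all helper names are hypothetical,
obj_get_not_zero() merely mimics what atomic_long_inc_not_zero() does,
and the object's memory is assumed to stay type-stable while readers
may still touch it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct file. */
struct obj {
	atomic_long count;
};

/* Hypothetical publishing slot, e.g. an fdtable entry. */
struct slot {
	struct obj *_Atomic ptr;
};

/* Take a reference only if the count is not already zero. */
static bool obj_get_not_zero(struct obj *o)
{
	long c = atomic_load_explicit(&o->count, memory_order_relaxed);

	while (c != 0) {
		if (atomic_compare_exchange_weak_explicit(&o->count, &c, c + 1,
							  memory_order_acquire,
							  memory_order_relaxed))
			return true;
	}
	return false;
}

/* Drop a reference; actually freeing the object is out of scope here. */
static void obj_put(struct obj *o)
{
	long prev = atomic_fetch_sub_explicit(&o->count, 1,
					      memory_order_release);
	assert(prev > 0);
	(void)prev;
}

/*
 * Lookup: load the pointer, take a reference, then re-check that the
 * slot still holds the same pointer. If it does not, the object may
 * have been recycled for another user, so drop the reference and retry.
 */
static struct obj *slot_get(struct slot *s)
{
	for (;;) {
		struct obj *o = atomic_load_explicit(&s->ptr,
						     memory_order_acquire);

		if (o == NULL)
			return NULL;
		if (!obj_get_not_zero(o))
			continue;	/* on its way out, reload the slot */
		if (atomic_load_explicit(&s->ptr, memory_order_acquire) == o)
			return o;
		obj_put(o);
	}
}
```

The second load is what makes a reference on a recycled object
detectable: the count alone cannot tell the old user from a new one.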

          Linus
  
Christian Brauner Sept. 30, 2023, 9:04 a.m. UTC | #24
On Fri, Sep 29, 2023 at 04:57:29PM -0700, Linus Torvalds wrote:
> On Fri, 29 Sept 2023 at 14:39, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > So to be clear, obtaining the initial count would require a dedicated
> > accessor.
> 
> Please, no.
> 
> Sequence numbers here are fundamentally broken, since getting that
> initial sequence number would involve either (a) making it something
> outside of 'struct file' itself or (b) require the same re-validation
> of the file pointer that the non-sequence number code needed in the
> first place.
> 
> We already have the right model in the only place that really matters
> (ie fd lookup). Using that same "validate file pointer after you got
> the ref to it" for the two or three other cases that didn't do it (and
> are simpler: the exec pointer in particular doesn't need the fdt
> re-validation at all).
> 
> The fact that we had some fd lookup that didn't do the full thing that
> a *real* fd lookup did is just bad. Let's fix it, not introduce a
> sequence counter that only adds more complexity.

I agree.

So I guess we're trying this. The appended patch now includes
documentation and renames *lookup_*_fd_rcu() to *lookup_*_fdget_rcu() to
reflect the refcount bump. It's now tentatively in vfs.misc (cf. [1])
and I've merged it into vfs.all to let -next chew on it. Please take a
close look and may the rcu gods be with us all...

[1]: git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
  
Nathan Chancellor Oct. 3, 2023, 4:45 p.m. UTC | #25
Hi Christian,

> From d266eee9d9d917f07774e2c2bab0115d2119a311 Mon Sep 17 00:00:00 2001
> From: Christian Brauner <brauner@kernel.org>
> Date: Fri, 29 Sep 2023 08:45:59 +0200
> Subject: [PATCH] file: convert to SLAB_TYPESAFE_BY_RCU
> 
> In recent discussions around some performance improvements in the file
> handling area we discussed switching the file cache to rely on
> SLAB_TYPESAFE_BY_RCU which allows us to get rid of call_rcu() based
> freeing for files completely. This is a pretty sensitive change overall
> but it might actually be worth doing.
> 
> The main downside is the subtlety. The other one is that we should
> really wait for Jann's patch to land that enables KASAN to handle
> SLAB_TYPESAFE_BY_RCU UAFs. Currently it doesn't but a patch for this
> exists.
> 
> With SLAB_TYPESAFE_BY_RCU objects may be freed and reused multiple times
> which requires a few changes. So it isn't sufficient anymore to just
> acquire a reference to the file in question under rcu using
> atomic_long_inc_not_zero() since the file might have already been
> recycled and someone else might have bumped the reference.
> 
> In other words, callers might see reference count bumps from newer
> users. For this reason it is necessary to verify that the pointer is
> the same before and after the reference count increment. This pattern
> can be seen in get_file_rcu() and __files_get_rcu().
> 
> In addition, it isn't possible to access or check fields in struct file
> without first acquiring a reference on it. Not doing that was always
> very dodgy and it was only usable for non-pointer data in struct file.
> With SLAB_TYPESAFE_BY_RCU it is necessary that callers first acquire a
> reference under rcu or they must hold the file_lock of the fdtable.
> Failing to do either of these is a bug.
> 
> Thanks to Jann for pointing out that we need to ensure memory ordering
> between reallocations and pointer check by ensuring that all subsequent
> loads have a dependency on the second load in get_file_rcu() and
> providing a fixup that was folded into this patch.
> 
> Cc: Jann Horn <jannh@google.com>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---

<snip>

> --- a/arch/powerpc/platforms/cell/spufs/coredump.c
> +++ b/arch/powerpc/platforms/cell/spufs/coredump.c
> @@ -74,10 +74,13 @@ static struct spu_context *coredump_next_context(int *fd)
>  	*fd = n - 1;
>  
>  	rcu_read_lock();
> -	file = lookup_fd_rcu(*fd);
> -	ctx = SPUFS_I(file_inode(file))->i_ctx;
> -	get_spu_context(ctx);
> +	file = lookup_fdget_rcu(*fd);
>  	rcu_read_unlock();
> +	if (file) {
> +		ctx = SPUFS_I(file_inode(file))->i_ctx;
> +		get_spu_context(ctx);
> +		fput(file);
> +	}
>  
>  	return ctx;
>  }

This hunk now causes a clang warning (or error, since arch/powerpc builds
with -Werror by default) in next-20231003.

  $ make -skj"$(nproc)" ARCH=powerpc LLVM=1 ppc64_guest_defconfig arch/powerpc/platforms/cell/spufs/coredump.o
  ...
  arch/powerpc/platforms/cell/spufs/coredump.c:79:6: error: variable 'ctx' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
     79 |         if (file) {
        |             ^~~~
  arch/powerpc/platforms/cell/spufs/coredump.c:85:9: note: uninitialized use occurs here
     85 |         return ctx;
        |                ^~~
  arch/powerpc/platforms/cell/spufs/coredump.c:79:2: note: remove the 'if' if its condition is always true
     79 |         if (file) {
        |         ^~~~~~~~~
  arch/powerpc/platforms/cell/spufs/coredump.c:69:25: note: initialize the variable 'ctx' to silence this warning
     69 |         struct spu_context *ctx;
        |                                ^
        |                                 = NULL
  1 error generated.
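
One way to address this, following the compiler's own note, is to
initialize the pointer so the failed-lookup path returns a defined
value. The sketch below is a userspace stand-in, not the spufs code:
struct ctx, struct file_like and next_context() are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct spu_context and struct file. */
struct ctx {
	int id;
};

struct file_like {
	struct ctx *ctx;
};

static struct ctx *next_context(struct file_like *file)
{
	struct ctx *ctx = NULL;	/* defined result when the lookup fails */

	if (file)
		ctx = file->ctx;

	return ctx;
}
```

Whether returning NULL is the right behaviour for the real caller is a
separate question that this sketch does not answer.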

Cheers,
Nathan
  
Al Viro Oct. 10, 2023, 3:06 a.m. UTC | #26
On Sat, Sep 30, 2023 at 11:04:20AM +0200, Christian Brauner wrote:
> +On newer kernels rcu based file lookup has been switched to rely on
> +SLAB_TYPESAFE_BY_RCU instead of call_rcu(). It isn't sufficient anymore to just
> +acquire a reference to the file in question under rcu using
> +atomic_long_inc_not_zero() since the file might have already been recycled and
> +someone else might have bumped the reference. In other words, the caller might
> +see reference count bumps from newer users. For this reason it is necessary
> +to verify that the pointer is the same before and after the reference count
> +increment. This pattern can be seen in get_file_rcu() and __files_get_rcu().
> +
> +In addition, it isn't possible to access or check fields in struct file without
> +first acquiring a reference on it. Not doing that was always very dodgy and it
> +was only usable for non-pointer data in struct file. With SLAB_TYPESAFE_BY_RCU
> +it is necessary that callers first acquire a reference under rcu or they must
> +hold the file_lock of the fdtable. Failing to do either of these is a bug.

Trivial correction: the last paragraph applies only to rcu lookups - something
like
        spin_lock(&files->file_lock);
        fdt = files_fdtable(files);
        if (close->fd >= fdt->max_fds) {
                spin_unlock(&files->file_lock);
                goto err;  
        }
        file = rcu_dereference_protected(fdt->fd[close->fd],
                        lockdep_is_held(&files->file_lock));
        if (!file || io_is_uring_fops(file)) {
		     ^^^^^^^^^^^^^^^^^^^^^ fetches file->f_op
                spin_unlock(&files->file_lock);
                goto err;
        }
	...

should be still valid.  As written, the reference to "rcu based file lookup"
is buried in the previous paragraph and it's not obvious that it applies to
the last one as well.  Incidentally, I would probably turn that fragment
(in io_uring/openclose.c:io_close()) into
	spin_lock(&files->file_lock);
	file = files_lookup_fd_locked(files, close->fd);
	if (!file || io_is_uring_fops(file)) {
		spin_unlock(&files->file_lock);
		goto err;
	}
	...

> diff --git a/arch/powerpc/platforms/cell/spufs/coredump.c b/arch/powerpc/platforms/cell/spufs/coredump.c
> index 1a587618015c..5e157f48995e 100644
> --- a/arch/powerpc/platforms/cell/spufs/coredump.c
> +++ b/arch/powerpc/platforms/cell/spufs/coredump.c
> @@ -74,10 +74,13 @@ static struct spu_context *coredump_next_context(int *fd)
>  	*fd = n - 1;
>  
>  	rcu_read_lock();
> -	file = lookup_fd_rcu(*fd);
> -	ctx = SPUFS_I(file_inode(file))->i_ctx;
> -	get_spu_context(ctx);
> +	file = lookup_fdget_rcu(*fd);
>  	rcu_read_unlock();
> +	if (file) {
> +		ctx = SPUFS_I(file_inode(file))->i_ctx;
> +		get_spu_context(ctx);
> +		fput(file);
> +	}

Well...  Here we should have descriptor table unshared, and we really
do rely upon that - we expect the file we'd found to have been a spufs
one *and* to have stayed that way.  So if anyone could change the
descriptor table behind our back, we'd be FUBAR.
  
Christian Brauner Oct. 10, 2023, 8:29 a.m. UTC | #27
> is buried in the previous paragraph and it's not obvious that it applies to
> the last one as well.  Incidentally, I would probably turn that fragment

massaged to clarify

> (in io_uring/openclose.c:io_close()) into
> 	spin_lock(&files->file_lock);
> 	file = files_lookup_fd_locked(files, close->fd);
> 	if (!file || io_is_uring_fops(file)) {
> 		spin_unlock(&files->file_lock);
> 		goto err;
> 	}

done
  

Patch

diff --git a/fs/file_table.c b/fs/file_table.c
index ee21b3da9d08..6cbd5bc551d0 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -82,6 +82,16 @@  static inline void file_free(struct file *f)
 	call_rcu(&f->f_rcuhead, file_free_rcu);
 }
 
+static inline void file_free_badopen(struct file *f)
+{
+	BUG_ON(f->f_mode & (FMODE_BACKING | FMODE_OPENED));
+	security_file_free(f);
+	put_cred(f->f_cred);
+	if (likely(!(f->f_mode & FMODE_NOACCOUNT)))
+		percpu_counter_dec(&nr_files);
+	kmem_cache_free(filp_cachep, f);
+}
+
 /*
  * Return the total number of open files in the system
  */
@@ -468,6 +478,31 @@  void __fput_sync(struct file *file)
 EXPORT_SYMBOL(fput);
 EXPORT_SYMBOL(__fput_sync);
 
+/*
+ * Clean up after failing to open (e.g., open(2) returns with -ENOENT).
+ *
+ * This represents opportunities to shave on work in the common case of
+ * FMODE_OPENED not being set:
+ * 1. there is nothing to close, just the file object to free and consequently
+ *    no need to delegate to task_work
+ * 2. as nobody else had seen the file then there is no need to delegate
+ *    freeing to RCU
+ */
+void fput_badopen(struct file *file)
+{
+	if (unlikely(file->f_mode & (FMODE_BACKING | FMODE_OPENED))) {
+		fput(file);
+		return;
+	}
+
+	if (WARN_ON_ONCE(atomic_long_cmpxchg(&file->f_count, 1, 0) != 1)) {
+		fput(file);
+		return;
+	}
+
+	file_free_badopen(file);
+}
+
 void __init files_init(void)
 {
 	filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
diff --git a/fs/internal.h b/fs/internal.h
index d64ae03998cc..93da6d815e90 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -95,6 +95,8 @@  struct file *alloc_empty_file(int flags, const struct cred *cred);
 struct file *alloc_empty_file_noaccount(int flags, const struct cred *cred);
 struct file *alloc_empty_backing_file(int flags, const struct cred *cred);
 
+void fput_badopen(struct file *);
+
 static inline void put_file_access(struct file *file)
 {
 	if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ) {
diff --git a/fs/namei.c b/fs/namei.c
index 567ee547492b..67579fe30b28 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3802,7 +3802,7 @@  static struct file *path_openat(struct nameidata *nd,
 		WARN_ON(1);
 		error = -EINVAL;
 	}
-	fput(file);
+	fput_badopen(file);
 	if (error == -EOPENSTALE) {
 		if (flags & LOOKUP_RCU)
 			error = -ECHILD;