[GIT,PULL] execve updates for v6.8-rc1

Message ID 202401081028.0E908F9E0A@keescook
State New
Headers
Series [GIT,PULL] execve updates for v6.8-rc1 |

Pull-request

https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git tags/execve-v6.8-rc1

Message

Kees Cook Jan. 8, 2024, 6:35 p.m. UTC
  Hi Linus,

Please pull these execve updates for v6.8-rc1. A fast-fail check has
been added to dramatically speed up execve-based PATH searches, and has
been in -next for the entire development window. A minor conflict with
netdev exists due to neighboring MAINTAINERS entries:
https://lore.kernel.org/linux-next/20231218161704.05c25766@canb.auug.org.au/

Thanks!

-Kees

The following changes since commit 21ca59b365c091d583f36ac753eaa8baf947be6f:

  binfmt_misc: enable sandboxed mounts (2023-10-11 08:46:01 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git tags/execve-v6.8-rc1

for you to fetch changes up to 0a8a952a75f2c5c140939c1616423e240677666c:

  ELF, MAINTAINERS: specifically mention ELF (2023-12-06 14:55:31 -0800)

----------------------------------------------------------------
execve updates for v6.8-rc1

- Update MAINTAINERS entry to explicitly mention ELF (Alexey Dobriyan)

- Add a fail-fast check to speed up execve-based PATH searches (Josh
  Triplett)

----------------------------------------------------------------
Alexey Dobriyan (1):
      ELF, MAINTAINERS: specifically mention ELF

Josh Triplett (1):
      fs/exec.c: Add fast path for ENOENT on PATH search before allocating mm

 MAINTAINERS |  3 ++-
 fs/exec.c   | 13 +++++++++++++
 2 files changed, 15 insertions(+), 1 deletion(-)
  

Comments

Linus Torvalds Jan. 9, 2024, 12:19 a.m. UTC | #1
On Mon, 8 Jan 2024 at 10:35, Kees Cook <keescook@chromium.org> wrote:
>
> Josh Triplett (1):
>       fs/exec.c: Add fast path for ENOENT on PATH search before allocating mm

No, we're not doing this.

If you want to open the file before the allocations, then dammit, do
exactly that.

Don't look up the path twice. Something (ENTIRELY UNTESTED) like this
patch that just moves the open from "bprm_execve()" to "alloc_bprm()".
It actually cleans up the odd BINPRM_FLAGS_PATH_INACCESSIBLE case too,
by setting it where it makes sense.

Anyway, I want to repeat: this patch is UNTESTED. It compiles for me.
But that is literally all the testing it has gotten apart from a
cursory "this patch looks sane".

There might be something seriously wrong with this patch, but it at
least makes sense, unlike that horror that will look up the filename
twice.

I bet whatever benchmark did the original was not using long filenames
with lots of components, or was only testing the ENOENT case.

                   Linus
  
Linus Torvalds Jan. 9, 2024, 12:30 a.m. UTC | #2
On Mon, 8 Jan 2024 at 16:19, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Anyway, I want to repeat: this patch is UNTESTED. It compiles for me.

Actually, I take that back. I did a clang build, and clang noted that
my "remove the retval initialization as unnecessary" was wrong,
because the

                if (!bprm->fdpath)
                        goto out_free;

code path in alloc_bprm() still wanted that initial -ENOMEM initialization.

So you need to fix the

        int retval;

in alloc_bprm() to be back to the original

        int retval = -ENOMEM;

but then it might all work.

Again - note the "might". Somebody needs to actually test it.  I may
try to do that in between pulls.

            Linus
  
Linus Torvalds Jan. 9, 2024, 12:46 a.m. UTC | #3
On Mon, 8 Jan 2024 at 16:30, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Again - note the "might". Somebody needs to actually test it.  I may
> try to do that in between pulls.

It boots. It builds a kernel. It must be perfect.

             Linus
  
Kees Cook Jan. 9, 2024, 1:48 a.m. UTC | #4
On January 8, 2024 4:19:45 PM PST, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>On Mon, 8 Jan 2024 at 10:35, Kees Cook <keescook@chromium.org> wrote:
>>
>> Josh Triplett (1):
>>       fs/exec.c: Add fast path for ENOENT on PATH search before allocating mm
>
>No, we're not doing this.
>
>If you want to open the file before the allocations, then dammit, do
>exactly that.

This was exactly the feedback I had originally and wrote almost what you suggest:

https://lore.kernel.org/lkml/202209161637.9EDAF6B18@keescook/

>Anyway, I want to repeat: this patch is UNTESTED. It compiles for me.
>But that is literally all the testing it has gotten apart from a
>cursory "this patch looks sane".
>
>There might be something seriously wrong with this patch, but it at
>least makes sense, unlike that horror that will look up the filename
>twice.
>
>I bet whatever benchmark did the original was not using long filenames
>with lots of components, or was only testing the ENOENT case.

But the perf testing of my proposed "look it up once" patch showed a net loss to the successful execs which no one could explain. In the end we went with the original proposal.

If you think this is too much of a hack, I'm happy to drop it. My very first reaction was "fix userspace; shells use access() not execve()" but it seems enough other runtimes (Python?) use execve PATH searches that it would make a measurable real-world difference.

-Kees
  
Linus Torvalds Jan. 9, 2024, 1:53 a.m. UTC | #5
On Mon, 8 Jan 2024 at 17:48, Kees Cook <kees@kernel.org> wrote:
>
> This was exactly the feedback I had originally and wrote almost what you suggest:
>
> https://lore.kernel.org/lkml/202209161637.9EDAF6B18@keescook/
>
> But the perf testing of my proposed "look it up once" patch showed a
> net loss to the successful execs which no one could explain. In the
> end we went with the original proposal.

Basing things one one random benchmark which must clearly have some
very particular cache effects or something is not ok.

End result: I'm not taking a random "look up filename twice because we
can't understand what is going on".

Because I *guarantee* that we can trivially write another benchmark
that shows that looking up the pathname twice is worse.

                Linus
  
Linus Torvalds Jan. 9, 2024, 3:28 a.m. UTC | #6
On Mon, 8 Jan 2024 at 17:53, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Because I *guarantee* that we can trivially write another benchmark
> that shows that looking up the pathname twice is worse.

Ok, so I just took a look at the alleged benchmark that was used for
the "look  up twice" argument.

It looks quite broken.

What it seems to do is to "fork+execve" on a small file, and do
clock_gettime() in the parent and in the child, and add up the
differences between the times.

But that's just testing random scheduler interactions, not the speed
of fork/exec.

IOW, that one improves performance if you always run the child first
after the fork(), so that the child runs immediately, finishes the
work, and when the parent then resumes, it reads the completed result
from the pipe.

It will give big behavior changes for any scheduling behavior - like
trying to run children concurrently on another CPU vs running it
immediately on the same CPU etc etc.

Using "vfork()" instead of "fork()" will remove *one* variable, in
that it will force that "child runs first" behavior that you want, and
would likely help performance a lot. But even then you'll end up with
a scheduling benchmark: when the child does "execve()" that will now
wake up the parent again, and the *optimal* behavior is probably to
run the child fully until it does "exit" (well, at least until it runs
"clock_gettime()") before scheduling the parent.

You might get that by just forcing it all to run on one single CPU,
unless the wakeup by the execve() synchronously wakes up the parent.

IOW, you can probably get closer to the numbers you want with vfork(),
but even then it's a crap-shoot and depends on scheduling.

If you want to actually test execve() itself, you shouldn't use fork()
at all - you should literally execve() in a loop, using the execve()
argument as the "loop variable". That will actually test execve(), not
the scheduling of the child, which will be pretty random.

IOW, something (truly stuipid) like the attached, and then you do

    $ gcc -O2 --static t.c
    $ time ./a.out 100000 1

to time a hundred thousand execve() calls.

Look ma, no fork, vfork, or scheduler interactions.

Of course, if you then want to check the pathname lookup failure cost,
you'd need to change the "execve()" into a "execvpe()" and play around
with the PATH variable, putting "." in different places etc. And you
might want to write your own PATH lookup one, to make sure it actually
uses the "execve()" system call and not "stat()" to find the
executable.

. and do you want to then check using "execveat()" (new model) vs
"path string created by appending in user space" (classic model)?

Tons of variables. For example, modern "execveat()" behavior is
*probably* using a small pathname that is looked up by opening the
different directories in $PATH, but the old-school thing that creates
pathnames all in user space and then does "execve()" on them will
probably have fairly heavy path lookup costs.

So now the whole "look up path twice" might be very differently
expensive depending on just how you ended up dealing with the $PATH
components. It *could* be cheap. Or it might be looking up a long
path.

End result: there's a million interactions here. You need to decide
what you want to test. But you *definitely* shouldn't decide to test
some random scheduler behavior and call it "execve cost".

                Linus
  
Josh Triplett Jan. 9, 2024, 6:57 p.m. UTC | #7
On Mon, Jan 08, 2024 at 05:48:38PM -0800, Kees Cook wrote:
> If you think this is too much of a hack, I'm happy to drop it. My very
> first reaction was "fix userspace; shells use access() not execve()"
> but it seems enough other runtimes (Python?) use execve PATH searches
> that it would make a measurable real-world difference.

In particular, execvpe and all the p variants of exec functions in both
glibc and musl have this exact behavior, and thus anything that uses
those functions will have the same behavior.

If someone wants to try other variations on this patch that only look up
the path once, and show via benchmarks that they're faster, I'm all for
it. I would *prefer* the approach of only looking up the path once, if
it's actually faster rather than slower. But I do think the spawnbench
benchmark I provided (which has fork-execvpe and vfork-execvpe and
posix_spawnp variants) is representative of real-world patterns for how
programs execute other programs on $PATH. Doing a microbenchmark on just
execvpe chaining from a program to itself is also valid, but I thought
it would be preferable to benchmark real-world patterns and measure the
actual time-to-first-instruction of the executed program as closely as
possible.
  
Linus Torvalds Jan. 9, 2024, 11:40 p.m. UTC | #8
On Tue, 9 Jan 2024 at 10:57, Josh Triplett <josh@joshtriplett.org> wrote:
>
> But I do think the spawnbench
> benchmark I provided (which has fork-execvpe and vfork-execvpe and
> posix_spawnp variants) is representative of real-world patterns for how
> programs execute other programs on $PATH.

No, it really isn't.

I already pointed out that the benchmark is entirely broken, because
what it times is broken.

It basically shows the difference in times between the parent and
child both doing a clock_gettime(), but what happens in between those
is
 - the parent doing a fork
 - the child doing an execve
 - the child reading the time
and that makes it a "at which point did we happen to schedule"
benchmark, not an execve() benchmark.

Just as an example, imagine if we schedule the child immediately after
the fork, and the parent doesn't run at all.

That obviously gives the minimum time difference - what your benchmark
then treats as "best". Agreed?

Except does it?

You have random details like "who happens to do a page fault and has
to copy the the stack page that has been marked COW, and that both the
parent and child have mapped and both will write to immediately  after
the fork()"

If the child runs first, the child will take that page fault, and the
child will have to do the page copy and insert it into its own VM.

So maybe it's better if the parent runs first and takes the page fault
and does the copy, and the child runs on another CPU just a little bit
later, and sees that it now has an exclusive page and doesn't need to
copy it at all? Maybe it gets to the execve() faster that way, and
gets a lower time difference just by pure luck? Or at least has a CPU
core of its own while the parent does something else?

Now, with "fork()" *something* has to do the page copy before the
execve() happens, unless it's all very lucky and the child happens to
run with the stack just at a page boundary and just gets its own page
that way.

I suspect you'll get the best performance if you run everything on
just one CPU, and don't try to spread things out, at least if your L2
caches are big enough to fit there - just for the best cache
utilization.

Because if you try to run the two loads on different CPU cores (and
maybe avoid HT siblings too, to get the best throughput), you'll have
to push all the cached contents from the parent to the child.

And maybe thats' ok on this one. It's almost certainly a good thing on
*some* loads, particularly if the child then ends up having more work
it does longer-term.

And yes, our scheduler tries to actually take cache affinity etc into
account, although the interaction with fork() may or may not be
optimal.

But my point is that what you are testing isn't actually the execve()
cycle, you're basically testing all these scheduler interactions on a
benchmark that doesn't actually match any real load.

Now, using vfork() instead of fork() will improve things, both from a
performance standpoint and from a "not as much a scheduler benchmark"
standpoint.

At least we don't have the issue with COW pages and trying to aim for
cache re-use, because there will be no overlap in execution of the
child and parent while they share the same VM. The parent is going to
stop in vfork(), the child is obviously best run on the same CPU until
it does an execve() and releases the parent, and at that point it's
*probably* best to try to run the new child on a different CPU, and
bring the parent back on the original CPU,.

Except that behavior (which sounds to me like the best option in
general) is not AT ALL what your benchmark would consider the best
option - because all your spawn bench thing looks at is how quickly
the child gets to run, so things like "cache placement for parent" no
longer matter at all for spawnbench.

So for that benchmark, instead of maybe trying to keep the parent
local to its own caches, and run the (new) child with no cache
footprint on another CPU, the best numbers for your benchmark probably
come from running the new execve() on the same CPU and not running the
parent at all until later.

And those are good numbers for the spawnbench just because the process
was already on that CPU in the kernel, so not running the parent where
it makes sense is good, because alll that matterns by then is that you
want to run the child asap.

See? your benchmark doesn't actually even *attempt* to time how good
our fork-execve sequence is. It times something entirely different. It
basically gives the best score to a scheduler decision that probably
doesn't even make sense.

Or maybe it does. Who knows? Maybe we *should* change the scheduler to
do what is best for spawnbench.

But do you see why I think it's at least as much a scheduler benchmark
as it is a execve() one, and why I think it's likely not a very good
benchmark at all, because I _suspect_ that the best numbers come from
doing things that may or may not make sense..

Now, I sent out that other benchmark, which at least avoids the whole
scheduler thing, because it does everything as one single process. I'm
not saying that's a sensible benchmark _either_, but at least it's
targeted to just execve().

Another option would be to not time the whole "parent clock_gettime ->
child clock_gettime" sequience that makes no sense, but to just time
the whole "fork-execve-exit-wait" sequence (which you can do in the
parent).

Because at that point, you're not timing the difference between two
random points (where scheduling decisions will change what happens
between them), you're actually timing the cost of the *whole*
sequence. Sure, scheduling will still matter for the details, but at
least you've timed the whole work, rather than timed a random *part*
of the work where other things are then ignored entirely.

For example, once you time the whole thing, it's no longer a "did the
parent of the child do the COW copy"? We don't care. One or the other
has to take the cost, and it's part of the *whole* cost of the
operation. Sure, scheduling decisions will still end up mattering, so
it's not a pure execve() benchmark, but at least now it's a benchmark
for the whole load, not just a random part of it.

              Linus
  
Josh Triplett Jan. 10, 2024, 2:21 a.m. UTC | #9
I'm not going to spend a lot of time and energy attempting to argue that
spawnbench is a representative benchmark, or that scheduling overhead is
a relevant part of program launch.

Instead, here are some numbers from Linus's suggested benchmark
(modified to use execvpe, and to count down rather than up so it doesn't
need two arguments; modified version and benchmark driver script
attached; compiled with `musl-gcc -Wall -O3 -s -static`):


With no patch (for-next/execve before the top two patches,
21ca59b365c091d583f36ac753eaa8baf947be6f):

=== With only PATH ===
0.32user 4.08system 0:04.55elapsed 96%CPU (0avgtext+0avgdata 1280maxresident)k
0inputs+0outputs (0major+1294599minor)pagefaults 0swaps

=== With 64 extra environment variables ===
0.29user 5.33system 0:05.76elapsed 97%CPU (0avgtext+0avgdata 1280maxresident)k
0inputs+0outputs (0major+1312477minor)pagefaults 0swaps


With my fastpath patch (for-next/execve,
0a8a952a75f2c5c140939c1616423e240677666c):

=== With only PATH ===
0.27user 2.40system 0:02.73elapsed 98%CPU (0avgtext+0avgdata 1152maxresident)k
0inputs+0outputs (0major+695002minor)pagefaults 0swaps

=== With 64 extra environment variables ===
0.29user 2.59system 0:02.94elapsed 98%CPU (0avgtext+0avgdata 1152maxresident)k
0inputs+0outputs (0major+712606minor)pagefaults 0swaps


With Linus's fastpath patch ("no patch" with Linus's applied, and the
followup -ENOMEM fix applied):

=== With only PATH ===
0.28user 2.44system 0:02.80elapsed 97%CPU (0avgtext+0avgdata 1152maxresident)k
0inputs+0outputs (0major+694706minor)pagefaults 0swaps

=== With 64 extra environment variables ===
0.29user 2.68system 0:03.06elapsed 97%CPU (0avgtext+0avgdata 1152maxresident)k
0inputs+0outputs (0major+712431minor)pagefaults 0swaps


I can reliably reproduce the differences between these three kernels;
the differences are well outside the noise. Both fastpaths are *much*
faster than the baseline, but the double-lookup version is still
consistently faster than the single-lookup version.

I'm sure it's *possible* to create a benchmark in which the
single-lookup version is faster. But in this benchmark of *just*
execvpe, it's still the case that double-lookup is faster, for *some*
reason.

I agree that it *shouldn't* be, and yet.
#!/bin/sh

echo === With only PATH ===
/usr/bin/time env -i PATH=/home/josh/.local/bin:/usr/lib/ccache:/sbin:/bin:/usr/sbin:/usr/bin:. t 100000

echo
echo === With 64 extra environment variables ===
/usr/bin/time env -i $(seq 1 64 | sed 's/.*/x&=&/') PATH=/home/josh/.local/bin:/usr/lib/ccache:/sbin:/bin:/usr/sbin:/usr/bin:. t 100000
  
Linus Torvalds Jan. 10, 2024, 3:54 a.m. UTC | #10
On Tue, 9 Jan 2024 at 18:21, Josh Triplett <josh@joshtriplett.org> wrote:
>
> Instead, here are some numbers from Linus's suggested benchmark
> (modified to use execvpe, and to count down rather than up so it doesn't
> need two arguments; modified version and benchmark driver script
> attached; compiled with `musl-gcc -Wall -O3 -s -static`):

Yeah, that's better. I had actually only benchmarked the success case.

And I see what the problem is: the "sane" way that only opens the
pathname once does so using

        file = do_filp_open(fd, name, &open_exec_flags);

and the "open path twice" does the first open with

        retval = filename_lookup(AT_FDCWD, filename, 0, &path, NULL);

and guess what the difference is?

The difference is that path_openat() starts out with

        file = alloc_empty_file(op->open_flag, current_cred());

and when the open fails, it will free the file with

        fput(file);

So if there are a lot of failures (because "." is at the end of the
path), it will have done a fair amount of those useless file
allocations and frees.

And - not surprisingly - the "open once" is then faster if there are
*not* a lot of failures, when the executable is found early in the
PATH.

Now, there's no fundamental *reason* to do that alloc_empty_file()
early, except for how the code is structured.

It partly makes the error handling simpler and since all the cases
want the filp in the end, doing it at the top means that it's only
done once.

And we occasionally do use the file pointer early (ie lookup_open()
will clear/set FMODE_CREATED in it even if it doesn't otherwise touch
the file pointer) even before the final lookup - and at creation time,
atomic_open() will actually want it for the final lookup.

Anyway, the real fix is definitely to just fix do_filp_open() to at
least not be *too* eager to allocate a file pointer.

In fact, that code is oddly non-optimal for another reason: it does
that "allocate and free file" not just when the path lookup fails, but
it does it for things like RCU lookup failures too.

So what happens is that if RCU lookup fails, do_filp_open() will call
path_openat() twice: first with LOOKUP_RCU, and then without it. And
path_openat() will allocate that "struct file *" twice.

On NFS - or other filesystems that can return ESTALE - it will in fact
do it three times.

That's pretty disgusting.

Al, comments? We *could* just special-case the execve() code not to
use do_filp_open() and avoid this issue that way, but it does feel
like even the regular open() case is pessimal with that whole RCU
situation.

                Linus
  
Kees Cook Jan. 10, 2024, 7:24 p.m. UTC | #11
On Tue, Jan 09, 2024 at 06:21:26PM -0800, Josh Triplett wrote:
> With Linus's fastpath patch ("no patch" with Linus's applied, and the
> followup -ENOMEM fix applied):
> 
> === With only PATH ===
> 0.28user 2.44system 0:02.80elapsed 97%CPU (0avgtext+0avgdata 1152maxresident)k
> 0inputs+0outputs (0major+694706minor)pagefaults 0swaps
> 
> === With 64 extra environment variables ===
> 0.29user 2.68system 0:03.06elapsed 97%CPU (0avgtext+0avgdata 1152maxresident)k
> 0inputs+0outputs (0major+712431minor)pagefaults 0swaps

Thanks for digging into this!

I've been trying to figure out how to measure only the execve portion of
a workload (with perf)[1] to get a more real-world measurement, but the
above does show improvements for the "open once early". I'll get the
behavior landed in -next after the merge window closes, and we can
continue examining if we can make do_filp_open() better...

-Kees

[1] https://lore.kernel.org/linux-perf-users/ZZ32p0LRSt5-vFPX@kernel.org/
  
Linus Torvalds Jan. 10, 2024, 8:12 p.m. UTC | #12
On Wed, 10 Jan 2024 at 11:24, Kees Cook <keescook@chromium.org> wrote:
>
> I've been trying to figure out how to measure only the execve portion of
> a workload (with perf)[1] to get a more real-world measurement, but the
> above does show improvements for the "open once early". I'll get the
> behavior landed in -next after the merge window closes, and we can
> continue examining if we can make do_filp_open() better...

Well, so the workload that shows the "early open" issue most is
actually something disgusting like this:

    #include <unistd.h>

    int main(int argc, char **argv, char **envp)
    {
        for (int i = 0; i < 10000000; i++)
                execve("nonexistent", argv, envp);
        return 0;
    }

and it's trivial to run under perf. You'll get something like this
with my patch:

   8.65%  [k] strncpy_from_user
   8.37%  [k] kmem_cache_alloc
   7.71%  [k] kmem_cache_free
   5.14%  [k] mod_objcg_state
   4.84%  [k] link_path_walk
   4.36%  [k] memset_orig
   3.93%  [k] percpu_counter_add_batch
   3.66%  [k] do_syscall_64
   3.63%  [k] path_openat

and with the hacky "open twice" you'll see that kmem_cache_alloc/free
should be much lower - it still does a kmem_cache_alloc/free() pair
for the pathname, but the 'struct file *' allocation/free goes away.

Anyway, you can see a very similar issue by replacing the "execve()" line with

                open("nonexistent", O_RDONLY);

instead, and for exactly the same reason. Just to show that this issue
isn't execve-related.

I really think that the "open twice" is wrong. It will look
artificially good in this "does not exist" case, but it will penalize
other cases, and it just hides this issue.

Without either of the patches, you'll see that execve case spend all
its time counting environment variables, and be much slower as a
result. Instead of that "strncpy_from_user()", you'll see
"strnlen_user()" and ccopy_from_user() shoot up because of that.

The above perf profile is actually quote good in general: the slab
alloc/free is a big issue only because nothing else is.

Oh, and the above profile depends *heavily* on your particular
microarchitecture and which mitigations you have in place. System call
overhead might be at the top, for example.

And the cost of "strncpy_from_user()" is so high above not because we
do a lot of copies (it's just that shortish filename), but simply
mainly because user copies are so insanely expensive on some uarchs
due to CLAC/STAC being expensive.

So even a short filename copy can end up taking more than the whole path walk.

So your exact mileage will vary, but you should see that pattern of
"kmem_cache_alloc/free" (and the "strnlen_user()" issue with none of
the patches being applied) etc etc.

                     Linus
  
Al Viro Jan. 11, 2024, 9:47 a.m. UTC | #13
On Tue, Jan 09, 2024 at 07:54:30PM -0800, Linus Torvalds wrote:

> Al, comments? We *could* just special-case the execve() code not to
> use do_filp_open() and avoid this issue that way, but it does feel
> like even the regular open() case is pessimal with that whole RCU
> situation.

Two things, both related to ->atomic_open():
	* we pass struct file to ->atomic_open(), so that it either opens
the sucker _or_ stores the resulting dentry in it (if ->atomic_open() bails
and tells us to use such-and-such dentry, other than the one it had been
given).
	* cleanup logics becomes interesting if we try to keep struct
file from pass to pass.  Sure, if it had never been touched _or_ if it had
only been fed to finish_no_open() (i.e. ->atomic_open() bailout path) -
no problem, we can reuse it.  But once it has hit do_dentry_open(), the
things get fishy.  We *must* fput() it if we got to setting FMODE_OPENED -
no plausible way around that.  But what about the case when we fail
e.g. inside ->open()?  Currently we undo just enough for fput() to do
the right thing without FMODE_OPENED, but e.g. security_file_open() has
no undoing for it.  Having it called twice on the same struct file might
or might not work on all LSMs, but they hadn't been exposed to that until
now.

We could pass struct file ** to path_openat(), with
	file = *fp;
	if (!file) {
		file = alloc_empty_file(op->open_flag, current_cred());
		if (IS_ERR(file))
			return file;
		*fp = file;
	}
in the beginning and have an extra flag that would be
set as soon as we hit do_dentry_open().  Then we could make the fput()
in path_openat() failure handling conditional upon that flag.

Doable, but really not pretty, especially since we'd need to massage
the caller as well...  Less painful variant is
	if (error == -ECHILD && (flags & LOOKUP_RCU))
		return ERR_PTR(-ECHILD); // keep file for non-rcu pass
	*fp = NULL;
	fput(file);
	...
on the way out; that won't help with -ESTALE side of things, but if we
hit *that*, struct file allocation overhead is really noise.

PS: apologies for late reply - had been sick since Saturday, just got more
or less back to normal.
  
Al Viro Jan. 11, 2024, 10:05 a.m. UTC | #14
On Thu, Jan 11, 2024 at 09:47:11AM +0000, Al Viro wrote:
> Doable, but really not pretty, especially since we'd need to massage
> the caller as well...  Less painful variant is
> 	if (error == -ECHILD && (flags & LOOKUP_RCU))
> 		return ERR_PTR(-ECHILD); // keep file for non-rcu pass
> 	*fp = NULL;
> 	fput(file);
> 	...
> on the way out; that won't help with -ESTALE side of things, but if we
> hit *that*, struct file allocation overhead is really noise.

Something like (completely untested) delta below, perhaps?

diff --git a/fs/namei.c b/fs/namei.c
index 5c318d657503..de770be9bb16 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3765,15 +3765,17 @@ static int do_o_path(struct nameidata *nd, unsigned flags, struct file *file)
 	return error;
 }
 
-static struct file *path_openat(struct nameidata *nd,
+static int path_openat(struct nameidata *nd, struct file **fp,
 			const struct open_flags *op, unsigned flags)
 {
-	struct file *file;
+	struct file *file = *fp;
 	int error;
 
-	file = alloc_empty_file(op->open_flag, current_cred());
-	if (IS_ERR(file))
-		return file;
+	if (!file) {
+		file = alloc_empty_file(op->open_flag, current_cred());
+		if (IS_ERR(file))
+			return PTR_ERR(file);
+	}
 
 	if (unlikely(file->f_flags & __O_TMPFILE)) {
 		error = do_tmpfile(nd, flags, op, file);
@@ -3789,11 +3791,17 @@ static struct file *path_openat(struct nameidata *nd,
 		terminate_walk(nd);
 	}
 	if (likely(!error)) {
-		if (likely(file->f_mode & FMODE_OPENED))
-			return file;
+		if (likely(file->f_mode & FMODE_OPENED)) {
+			*fp = file;
+			return 0;
+		}
 		WARN_ON(1);
 		error = -EINVAL;
 	}
+	if (error == -ECHILD && (flags & LOOKUP_RCU)) {
+		*fp = file; // reuse on the next pass
+		return -ECHILD;
+	}
 	fput(file);
 	if (error == -EOPENSTALE) {
 		if (flags & LOOKUP_RCU)
@@ -3801,7 +3809,7 @@ static struct file *path_openat(struct nameidata *nd,
 		else
 			error = -ESTALE;
 	}
-	return ERR_PTR(error);
+	return error;
 }
 
 struct file *do_filp_open(int dfd, struct filename *pathname,
@@ -3809,25 +3817,27 @@ struct file *do_filp_open(int dfd, struct filename *pathname,
 {
 	struct nameidata nd;
 	int flags = op->lookup_flags;
-	struct file *filp;
+	struct file *file = NULL;
+	int err;
 
 	set_nameidata(&nd, dfd, pathname, NULL);
-	filp = path_openat(&nd, op, flags | LOOKUP_RCU);
-	if (unlikely(filp == ERR_PTR(-ECHILD)))
-		filp = path_openat(&nd, op, flags);
-	if (unlikely(filp == ERR_PTR(-ESTALE)))
-		filp = path_openat(&nd, op, flags | LOOKUP_REVAL);
+	err = path_openat(&nd, &file, op, flags | LOOKUP_RCU);
+	if (unlikely(err == -ECHILD))
+		err = path_openat(&nd, &file, op, flags);
+	if (unlikely(err == -ESTALE))
+		err = path_openat(&nd, &file, op, flags | LOOKUP_REVAL);
 	restore_nameidata();
-	return filp;
+	return unlikely(err) ? ERR_PTR(err) : file;
 }
 
 struct file *do_file_open_root(const struct path *root,
 		const char *name, const struct open_flags *op)
 {
 	struct nameidata nd;
-	struct file *file;
+	struct file *file = NULL;
 	struct filename *filename;
 	int flags = op->lookup_flags;
+	int err;
 
 	if (d_is_symlink(root->dentry) && op->intent & LOOKUP_OPEN)
 		return ERR_PTR(-ELOOP);
@@ -3837,14 +3847,14 @@ struct file *do_file_open_root(const struct path *root,
 		return ERR_CAST(filename);
 
 	set_nameidata(&nd, -1, filename, root);
-	file = path_openat(&nd, op, flags | LOOKUP_RCU);
-	if (unlikely(file == ERR_PTR(-ECHILD)))
-		file = path_openat(&nd, op, flags);
-	if (unlikely(file == ERR_PTR(-ESTALE)))
-		file = path_openat(&nd, op, flags | LOOKUP_REVAL);
+	err = path_openat(&nd, &file, op, flags | LOOKUP_RCU);
+	if (unlikely(err == -ECHILD))
+		err = path_openat(&nd, &file, op, flags);
+	if (unlikely(err == -ESTALE))
+		err = path_openat(&nd, &file, op, flags | LOOKUP_REVAL);
 	restore_nameidata();
 	putname(filename);
-	return file;
+	return unlikely(err) ? ERR_PTR(err) : file;
 }
 
 static struct dentry *filename_create(int dfd, struct filename *name,
  
Linus Torvalds Jan. 11, 2024, 5:37 p.m. UTC | #15
On Thu, 11 Jan 2024 at 01:47, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Two things, both related to ->atomic_open():

Yeah, I was staring at the atomic_open() cases, and just thought that
we could allocate the filp early for that.

It wouldn't matter for normal filesystems, so from a performance
standpoint it would be ok.

My handwavy thinking was that we'd remove 'filp' from the arguments we
pass around, and instead make it be a member of 'struct nameidata',
and then the different codepaths could decide that "now I need the
filp, so I'll instantiate it".

But then I looked more at the code, and it seemed to get quite messy,
quite fast.

               Linus
  
Linus Torvalds Jan. 11, 2024, 5:42 p.m. UTC | #16
On Thu, 11 Jan 2024 at 02:05, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Something like (completely untested) delta below, perhaps?

No, this looks horrible.

This doesn't actually get rid of the early filp allocation for
execve(), it only seems to get rid of the repeated allocation for when
the RCU lookup fails.

And *that* is much easier to get rid of differently: just do the file
allocation in do_filp_open(), instead of path_openat. We'd need to
have some way to make sure that there is no left-over crud from the
RCU path into the next stage, but that doesn't look bad.

So the "path_openat() allocates filp on each invocation" looks fairly easy.

It's the "don't allocate filp until you actually need it" that looks
nasty. And yes, atomic_open() is part of the problem, but so is the
fact that wee end up saving some flags in the filp early.

                  Linus
  
Linus Torvalds Jan. 20, 2024, 10:18 p.m. UTC | #17
On Thu, 11 Jan 2024 at 09:42, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> It's the "don't allocate filp until you actually need it" that looks
> nasty. And yes, atomic_open() is part of the problem, but so is the
> fact that wee end up saving some flags in the filp early.

So just an update on this, since I came back to it.

It turns out that the bulk of the cost of the 'struct file *'
allocation is actually the exact same thing that was discussed the
other day about file locking: it's the fact that the file allocation
is done with SLAB_ACCOUNT. See

    https://lore.kernel.org/all/CAHk-=wg_CoTOfkREgaQQA6oJ5nM9ZKYrTn=E1r-JnvmQcgWpSg@mail.gmail.com/

and that thread on the recent file locking accounting discussion.

Now, the allocation itself isn't free either, but the SLAB_ACCOUNT
really does make it noticeable more expensive than it should be.

It's a bit more spread out: the cost of the slab allocation itself is
mainly the (optimized) path that does a cmpxchg and the memset, but
the SLAB_ACCOUNT cost is spread out in mod_objcg_state,
__memcg_slab_post_alloc_hook, obj_cgroup_charge,
__memcg_slab_free_hook).

And that actually opens the door up for a _somewhat_ simple partial
workaround: instead of using SLAB_ACCOUNT, we could just do the memcg
accounting when we set FMODE_OPEN instead, and de-account it when we
free the filp (which checks FMODE_OPEN since other cleanup is
dependent on that anyway).

That would help not just the execve() case, but the whole "open
non-existent file" case too.

And I suspect "open()" with ENOENT is actually way more common than
execve() is. All those open->fstat paths for various perfectly normal
loads.

Anyway, I didn't actually write that patch, but I did want to mention
it as a smaller-scale fix (because getting rid of the 'struct file'
allocation entirely does look somewhat painful).

End result: I committed my "move do_open_execat() to the beginning of
execve()" patch, since it's clearly an improvement on the existing
behavior, and that whole "struct file allocations are unnecessarily
expensive" issue is a separate thing.

              Linus
  
Kees Cook Jan. 21, 2024, 8:05 a.m. UTC | #18
On January 20, 2024 2:18:36 PM PST, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>End result: I committed my "move do_open_execat() to the beginning of
>execve()" patch, since it's clearly an improvement on the existing
>behavior, and that whole "struct file allocations are unnecessarily
>expensive" issue is a separate thing.

Thanks! I'll add the other bits of refactoring I did in my version of the clean-up (I created do_close_execat() for the repeated "allow_write_access(file); fput(file);" calls, along with some comments):
https://lore.kernel.org/lkml/202209161637.9EDAF6B18@keescook/

I like your removal of the "out" label! :)

-Kees