Message ID | 202401081028.0E908F9E0A@keescook
---|---
State | New
Series | [GIT,PULL] execve updates for v6.8-rc1
Pull-request
https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git tags/execve-v6.8-rc1

Message
Kees Cook
Jan. 8, 2024, 6:35 p.m. UTC
Hi Linus,

Please pull these execve updates for v6.8-rc1. A fast-fail check has been
added to dramatically speed up execve-based PATH searches, and it has been
in -next for the entire development window. A minor conflict with netdev
exists due to neighboring MAINTAINERS entries:
https://lore.kernel.org/linux-next/20231218161704.05c25766@canb.auug.org.au/

Thanks!

-Kees

The following changes since commit 21ca59b365c091d583f36ac753eaa8baf947be6f:

  binfmt_misc: enable sandboxed mounts (2023-10-11 08:46:01 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git tags/execve-v6.8-rc1

for you to fetch changes up to 0a8a952a75f2c5c140939c1616423e240677666c:

  ELF, MAINTAINERS: specifically mention ELF (2023-12-06 14:55:31 -0800)

----------------------------------------------------------------
execve updates for v6.8-rc1

- Update MAINTAINERS entry to explicitly mention ELF (Alexey Dobriyan)

- Add a fail-fast check to speed up execve-based PATH searches (Josh Triplett)

----------------------------------------------------------------
Alexey Dobriyan (1):
      ELF, MAINTAINERS: specifically mention ELF

Josh Triplett (1):
      fs/exec.c: Add fast path for ENOENT on PATH search before allocating mm

 MAINTAINERS |  3 ++-
 fs/exec.c   | 13 +++++++++++++
 2 files changed, 15 insertions(+), 1 deletion(-)
Comments
On Mon, 8 Jan 2024 at 10:35, Kees Cook <keescook@chromium.org> wrote:
>
> Josh Triplett (1):
>       fs/exec.c: Add fast path for ENOENT on PATH search before allocating mm

No, we're not doing this.

If you want to open the file before the allocations, then dammit, do
exactly that. Don't look up the path twice.

Something (ENTIRELY UNTESTED) like this patch that just moves the open
from "bprm_execve()" to "alloc_bprm()". It actually cleans up the odd
BINPRM_FLAGS_PATH_INACCESSIBLE case too, by setting it where it makes
sense.

Anyway, I want to repeat: this patch is UNTESTED. It compiles for me.
But that is literally all the testing it has gotten apart from a
cursory "this patch looks sane".

There might be something seriously wrong with this patch, but it at
least makes sense, unlike that horror that will look up the filename
twice.

I bet whatever benchmark did the original was not using long filenames
with lots of components, or was only testing the ENOENT case.

Linus
On Mon, 8 Jan 2024 at 16:19, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Anyway, I want to repeat: this patch is UNTESTED. It compiles for me.

Actually, I take that back.

I did a clang build, and clang noted that my "remove the retval
initialization as unnecessary" was wrong, because the

        if (!bprm->fdpath)
                goto out_free;

code path in alloc_bprm() still wanted that initial -ENOMEM
initialization.

So you need to fix the

        int retval;

in alloc_bprm() to be back to the original

        int retval = -ENOMEM;

but then it might all work.

Again - note the "might". Somebody needs to actually test it. I may
try to do that in between pulls.

Linus
On Mon, 8 Jan 2024 at 16:30, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Again - note the "might". Somebody needs to actually test it. I may
> try to do that in between pulls.

It boots. It builds a kernel. It must be perfect.

Linus
On January 8, 2024 4:19:45 PM PST, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Mon, 8 Jan 2024 at 10:35, Kees Cook <keescook@chromium.org> wrote:
>>
>> Josh Triplett (1):
>>       fs/exec.c: Add fast path for ENOENT on PATH search before allocating mm
>
> No, we're not doing this.
>
> If you want to open the file before the allocations, then dammit, do
> exactly that.

This was exactly the feedback I had originally, and I wrote almost what
you suggest:
https://lore.kernel.org/lkml/202209161637.9EDAF6B18@keescook/

> Anyway, I want to repeat: this patch is UNTESTED. It compiles for me.
> But that is literally all the testing it has gotten apart from a
> cursory "this patch looks sane".
>
> There might be something seriously wrong with this patch, but it at
> least makes sense, unlike that horror that will look up the filename
> twice.
>
> I bet whatever benchmark did the original was not using long filenames
> with lots of components, or was only testing the ENOENT case.

But the perf testing of my proposed "look it up once" patch showed a net
loss for the successful execs, which no one could explain. In the end we
went with the original proposal.

If you think this is too much of a hack, I'm happy to drop it. My very
first reaction was "fix userspace; shells use access(), not execve()",
but it seems enough other runtimes (Python?) use execve-based PATH
searches that it would make a measurable real-world difference.

-Kees
On Mon, 8 Jan 2024 at 17:48, Kees Cook <kees@kernel.org> wrote:
>
> This was exactly the feedback I had originally, and I wrote almost what
> you suggest:
>
> https://lore.kernel.org/lkml/202209161637.9EDAF6B18@keescook/
>
> But the perf testing of my proposed "look it up once" patch showed a
> net loss for the successful execs, which no one could explain. In the
> end we went with the original proposal.

Basing things on one random benchmark which must clearly have some very
particular cache effects or something is not ok.

End result: I'm not taking a random "look up filename twice because we
can't understand what is going on".

Because I *guarantee* that we can trivially write another benchmark
that shows that looking up the pathname twice is worse.

Linus
On Mon, 8 Jan 2024 at 17:53, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Because I *guarantee* that we can trivially write another benchmark
> that shows that looking up the pathname twice is worse.

Ok, so I just took a look at the alleged benchmark that was used for
the "look up twice" argument.

It looks quite broken.

What it seems to do is "fork+execve" on a small file, and do
clock_gettime() in the parent and in the child, and add up the
differences between the times. But that's just testing random scheduler
interactions, not the speed of fork/exec.

IOW, that one improves performance if you always run the child first
after the fork(), so that the child runs immediately, finishes the
work, and when the parent then resumes, it reads the completed result
from the pipe. It will give big behavior changes for any scheduling
behavior - like trying to run children concurrently on another CPU vs
running them immediately on the same CPU, etc.

Using "vfork()" instead of "fork()" will remove *one* variable, in that
it will force the "child runs first" behavior that you want, and would
likely help performance a lot. But even then you'll end up with a
scheduling benchmark: when the child does "execve()" that will now wake
up the parent again, and the *optimal* behavior is probably to run the
child fully until it does "exit" (well, at least until it runs
"clock_gettime()") before scheduling the parent. You might get that by
just forcing it all to run on one single CPU, unless the wakeup by the
execve() synchronously wakes up the parent.

IOW, you can probably get closer to the numbers you want with vfork(),
but even then it's a crap-shoot and depends on scheduling.

If you want to actually test execve() itself, you shouldn't use fork()
at all - you should literally execve() in a loop, using the execve()
argument as the "loop variable". That will actually test execve(), not
the scheduling of the child, which will be pretty random.

IOW, something (truly stupid) like the attached, and then you do

    $ gcc -O2 --static t.c
    $ time ./a.out 100000 1

to time a hundred thousand execve() calls. Look ma, no fork, vfork, or
scheduler interactions.

Of course, if you then want to check the pathname lookup failure cost,
you'd need to change the "execve()" into an "execvpe()" and play around
with the PATH variable, putting "." in different places, etc. And you
might want to write your own PATH lookup, to make sure it actually uses
the "execve()" system call and not "stat()" to find the executable.

And do you then want to check "execveat()" (new model) vs "path string
created by appending in user space" (classic model)? Tons of variables.

For example, modern "execveat()" behavior is *probably* using a small
pathname that is looked up by opening the different directories in
$PATH, but the old-school thing that creates pathnames all in user
space and then does "execve()" on them will probably have fairly heavy
path lookup costs. So now the whole "look up path twice" might be very
differently expensive depending on just how you ended up dealing with
the $PATH components. It *could* be cheap. Or it might be looking up a
long path.

End result: there's a million interactions here. You need to decide
what you want to test. But you *definitely* shouldn't decide to test
some random scheduler behavior and call it "execve cost".

Linus
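(The t.c Linus attached is not preserved in this archive. The following
is a reconstruction based only on his description above; the exact
argument convention - counting up from argv[2] toward argv[1] - and the
re-exec via argv[0] are assumptions.)

/*
 * Reconstructed sketch of the attached t.c (the attachment is not in
 * the archive): re-exec yourself in a loop, with the loop variable
 * carried in argv, so no fork() or scheduler interaction is involved.
 * Invoked as: ./a.out 100000 1
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv, char **envp)
{
	char buf[32];
	long limit, count;

	if (argc != 3)
		return 1;
	limit = atol(argv[1]);
	count = atol(argv[2]);
	if (count >= limit)
		return 0;

	/* Bump the counter and exec ourselves again. */
	snprintf(buf, sizeof(buf), "%ld", count + 1);
	argv[2] = buf;
	execve(argv[0], argv, envp);
	perror("execve");	/* reached only if execve() fails */
	return 1;
}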
On Mon, Jan 08, 2024 at 05:48:38PM -0800, Kees Cook wrote:
> If you think this is too much of a hack, I'm happy to drop it. My very
> first reaction was "fix userspace; shells use access(), not execve()",
> but it seems enough other runtimes (Python?) use execve-based PATH
> searches that it would make a measurable real-world difference.

In particular, execvpe and all the p variants of the exec functions in
both glibc and musl have this exact behavior, and thus anything that
uses those functions will have the same behavior.

If someone wants to try other variations on this patch that only look
up the path once, and show via benchmarks that they're faster, I'm all
for it. I would *prefer* the approach of only looking up the path once,
if it's actually faster rather than slower.

But I do think the spawnbench benchmark I provided (which has
fork-execvpe, vfork-execvpe, and posix_spawnp variants) is
representative of real-world patterns for how programs execute other
programs on $PATH. Doing a microbenchmark on just execvpe chaining from
a program to itself is also valid, but I thought it would be preferable
to benchmark real-world patterns and measure the actual
time-to-first-instruction of the executed program as closely as
possible.
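(For context on "this exact behavior": the sketch below is not glibc or
musl source, only a conceptual illustration of how the exec*p variants
walk $PATH, issuing one execve() per component and treating ENOENT as
"keep searching". The helper name and the simplified error handling are
illustrative.)

/*
 * Conceptual sketch of an exec*p-style PATH walk (NOT the actual glibc
 * or musl implementation): one execve() syscall per $PATH component,
 * with ENOENT meaning "try the next component". Every miss pays the
 * full execve() entry cost, which is what the kernel fast path targets.
 * Real implementations also handle names containing '/', empty PATH
 * components, and deferred EACCES reporting.
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static int execvpe_sketch(const char *name, char *const argv[],
			  char *const envp[])
{
	const char *path = getenv("PATH");
	char buf[4096];

	if (!path)
		path = "/usr/local/bin:/bin:/usr/bin";

	while (*path) {
		size_t len = strcspn(path, ":");

		snprintf(buf, sizeof(buf), "%.*s/%s", (int)len, path, name);
		execve(buf, argv, envp);	/* returns only on failure */
		if (errno != ENOENT && errno != EACCES)
			return -1;		/* hard error: stop searching */
		path += len;
		if (*path == ':')
			path++;
	}
	errno = ENOENT;
	return -1;
}

int main(void)
{
	char *args[] = { "true", NULL };
	char *env[] = { NULL };

	execvpe_sketch("true", args, env);	/* only returns on failure */
	perror("execvpe_sketch");
	return 1;
}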
On Tue, 9 Jan 2024 at 10:57, Josh Triplett <josh@joshtriplett.org> wrote:
>
> But I do think the spawnbench benchmark I provided (which has
> fork-execvpe, vfork-execvpe, and posix_spawnp variants) is
> representative of real-world patterns for how programs execute other
> programs on $PATH.

No, it really isn't. I already pointed out that the benchmark is
entirely broken, because what it times is broken.

It basically shows the difference in times between the parent and child
both doing a clock_gettime(), but what happens in between those is

 - the parent doing a fork
 - the child doing an execve
 - the child reading the time

and that makes it an "at which point did we happen to schedule"
benchmark, not an execve() benchmark.

Just as an example: imagine that we schedule the child immediately
after the fork, and the parent doesn't run at all. That obviously gives
the minimum time difference - what your benchmark then treats as
"best". Agreed?

Except does it? You have random details like "who happens to take the
page fault and has to copy the stack page that has been marked COW,
which both the parent and child have mapped and will write to
immediately after the fork()".

If the child runs first, the child will take that page fault, and the
child will have to do the page copy and insert it into its own VM.

So maybe it's better if the parent runs first and takes the page fault
and does the copy, and the child runs on another CPU just a little bit
later, and sees that it now has an exclusive page and doesn't need to
copy it at all? Maybe it gets to the execve() faster that way, and gets
a lower time difference by pure luck? Or at least has a CPU core of its
own while the parent does something else?

Now, with "fork()" *something* has to do the page copy before the
execve() happens, unless it's all very lucky and the child happens to
run with the stack just at a page boundary and gets its own page that
way.

I suspect you'll get the best performance if you run everything on just
one CPU and don't try to spread things out - at least if your L2 caches
are big enough to fit both - just for the best cache utilization.
Because if you try to run the two loads on different CPU cores (and
maybe avoid HT siblings too, to get the best throughput), you'll have
to push all the cached contents from the parent to the child.

And maybe that's ok on this one. It's almost certainly a good thing on
*some* loads, particularly if the child then ends up having more work
to do longer-term. And yes, our scheduler tries to take cache affinity
etc. into account, although the interaction with fork() may or may not
be optimal.

But my point is that what you are testing isn't actually the execve()
cycle - you're basically testing all these scheduler interactions on a
benchmark that doesn't actually match any real load.

Now, using vfork() instead of fork() will improve things, both from a
performance standpoint and from a "not as much a scheduler benchmark"
standpoint. At least we don't have the issue with COW pages and trying
to aim for cache re-use, because there will be no overlap in execution
of the child and parent while they share the same VM. The parent is
going to stop in vfork(), the child is obviously best run on the same
CPU until it does an execve() and releases the parent, and at that
point it's *probably* best to run the new child on a different CPU and
bring the parent back on the original CPU.

Except that behavior (which sounds to me like the best option in
general) is not AT ALL what your benchmark would consider the best
option - because all your spawnbench thing looks at is how quickly the
child gets to run, so things like "cache placement for the parent" no
longer matter at all for spawnbench.

So for that benchmark, instead of maybe trying to keep the parent local
to its own caches and run the (new) child with no cache footprint on
another CPU, the best numbers probably come from running the new
execve() on the same CPU (where the process already is, in the kernel)
and not running the parent at all until later - even when running the
parent would make sense for the overall load - because all that matters
by then is that you want to run the child asap.

See? Your benchmark doesn't actually even *attempt* to time how good
our fork-execve sequence is. It times something entirely different. It
basically gives the best score to a scheduler decision that probably
doesn't even make sense.

Or maybe it does. Who knows? Maybe we *should* change the scheduler to
do what is best for spawnbench. But do you see why I think it's at
least as much a scheduler benchmark as it is an execve() one, and why I
think it's likely not a very good benchmark at all - because I
_suspect_ that the best numbers come from doing things that may or may
not make sense.

Now, I sent out that other benchmark, which at least avoids the whole
scheduler thing, because it does everything as one single process. I'm
not saying that's a sensible benchmark _either_, but at least it's
targeted at just execve().

Another option would be to not time the whole "parent clock_gettime ->
child clock_gettime" sequence, which makes no sense, but to just time
the whole "fork-execve-exit-wait" sequence (which you can do in the
parent). Because at that point, you're not timing the difference
between two random points (where scheduling decisions will change what
happens between them); you're actually timing the cost of the *whole*
sequence.

Sure, scheduling will still matter for the details, but at least you've
timed the whole work, rather than timed a random *part* of the work
where other things are then ignored entirely.

For example, once you time the whole thing, it's no longer a question
of "did the parent or the child do the COW copy?" We don't care. One or
the other has to take the cost, and it's part of the *whole* cost of
the operation. Sure, scheduling decisions will still end up mattering,
so it's not a pure execve() benchmark, but at least now it's a
benchmark for the whole load, not just a random part of it.

Linus
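(A minimal sketch of the "time the whole fork-execve-exit-wait sequence
in the parent" approach suggested above; the target binary /bin/true
and the iteration count are arbitrary choices.)

/*
 * Time the complete fork+execve+exit+wait cycle from the parent, so
 * the measurement covers the whole cost of launching a program rather
 * than the parent-to-child clock_gettime() delta spawnbench uses.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	struct timespec start, end;
	char *argv[] = { "/bin/true", NULL };
	char *envp[] = { NULL };
	int iterations = 10000;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (int i = 0; i < iterations; i++) {
		pid_t pid = fork();

		if (pid < 0)
			return 1;
		if (pid == 0) {
			execve(argv[0], argv, envp);
			_exit(127);	/* execve failed */
		}
		waitpid(pid, NULL, 0);	/* whole cost: fork+exec+exit+wait */
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	double ns = (end.tv_sec - start.tv_sec) * 1e9 +
		    (end.tv_nsec - start.tv_nsec);
	printf("%.0f ns per fork/exec/exit/wait\n", ns / iterations);
	return 0;
}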
I'm not going to spend a lot of time and energy attempting to argue
that spawnbench is a representative benchmark, or that scheduling
overhead is a relevant part of program launch. Instead, here are some
numbers from Linus's suggested benchmark (modified to use execvpe, and
to count down rather than up so it doesn't need two arguments; modified
version and benchmark driver script attached; compiled with
`musl-gcc -Wall -O3 -s -static`):

With no patch (for-next/execve before the top two patches,
21ca59b365c091d583f36ac753eaa8baf947be6f):

=== With only PATH ===
0.32user 4.08system 0:04.55elapsed 96%CPU (0avgtext+0avgdata 1280maxresident)k
0inputs+0outputs (0major+1294599minor)pagefaults 0swaps

=== With 64 extra environment variables ===
0.29user 5.33system 0:05.76elapsed 97%CPU (0avgtext+0avgdata 1280maxresident)k
0inputs+0outputs (0major+1312477minor)pagefaults 0swaps

With my fastpath patch (for-next/execve,
0a8a952a75f2c5c140939c1616423e240677666c):

=== With only PATH ===
0.27user 2.40system 0:02.73elapsed 98%CPU (0avgtext+0avgdata 1152maxresident)k
0inputs+0outputs (0major+695002minor)pagefaults 0swaps

=== With 64 extra environment variables ===
0.29user 2.59system 0:02.94elapsed 98%CPU (0avgtext+0avgdata 1152maxresident)k
0inputs+0outputs (0major+712606minor)pagefaults 0swaps

With Linus's fastpath patch ("no patch" with Linus's patch applied, and
the followup -ENOMEM fix applied):

=== With only PATH ===
0.28user 2.44system 0:02.80elapsed 97%CPU (0avgtext+0avgdata 1152maxresident)k
0inputs+0outputs (0major+694706minor)pagefaults 0swaps

=== With 64 extra environment variables ===
0.29user 2.68system 0:03.06elapsed 97%CPU (0avgtext+0avgdata 1152maxresident)k
0inputs+0outputs (0major+712431minor)pagefaults 0swaps

I can reliably reproduce the differences between these three kernels;
the differences are well outside the noise. Both fastpaths are *much*
faster than the baseline, but the double-lookup version is still
consistently faster than the single-lookup version.

I'm sure it's *possible* to create a benchmark in which the
single-lookup version is faster. But in this benchmark of *just*
execvpe, it's still the case that double-lookup is faster, for *some*
reason. I agree that it *shouldn't* be, and yet.

The driver script:

#!/bin/sh
echo === With only PATH ===
/usr/bin/time env -i PATH=/home/josh/.local/bin:/usr/lib/ccache:/sbin:/bin:/usr/sbin:/usr/bin:. t 100000
echo
echo === With 64 extra environment variables ===
/usr/bin/time env -i $(seq 1 64 | sed 's/.*/x&=&/') PATH=/home/josh/.local/bin:/usr/lib/ccache:/sbin:/bin:/usr/sbin:/usr/bin:. t 100000
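(Josh's modified t.c is likewise not preserved in the archive. A
plausible reconstruction from his description - execvpe(), a single
count-down argument, re-exec by the bare name "t" so the $PATH walk in
the driver script above is exercised:)

/*
 * Reconstructed sketch of the modified benchmark: count down so only
 * one argument is needed, and use execvpe() so every iteration does a
 * full PATH search. With "." last in $PATH, every earlier component
 * fails with ENOENT, hammering the failure path being discussed.
 */
#define _GNU_SOURCE		/* for execvpe() in glibc; musl has it too */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

extern char **environ;

int main(int argc, char **argv)
{
	char buf[32];
	long count;

	if (argc != 2)
		return 1;
	count = atol(argv[1]);
	if (count <= 0)
		return 0;

	/* Count down and re-exec by name, forcing a PATH search. */
	snprintf(buf, sizeof(buf), "%ld", count - 1);
	argv[1] = buf;
	execvpe("t", argv, environ);
	perror("execvpe");
	return 1;
}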
On Tue, 9 Jan 2024 at 18:21, Josh Triplett <josh@joshtriplett.org> wrote:
>
> Instead, here are some numbers from Linus's suggested benchmark
> (modified to use execvpe, and to count down rather than up so it
> doesn't need two arguments; modified version and benchmark driver
> script attached; compiled with `musl-gcc -Wall -O3 -s -static`):

Yeah, that's better. I had actually only benchmarked the success case.

And I see what the problem is: the "sane" way that only opens the
pathname once does so using

        file = do_filp_open(fd, name, &open_exec_flags);

and the "open path twice" does the first open with

        retval = filename_lookup(AT_FDCWD, filename, 0, &path, NULL);

and guess what the difference is?

The difference is that path_openat() starts out with

        file = alloc_empty_file(op->open_flag, current_cred());

and when the open fails, it will free the file with

        fput(file);

So if there are a lot of failures (because "." is at the end of the
path), it will have done a fair amount of those useless file
allocations and frees. And - not surprisingly - the "open once" is then
faster if there are *not* a lot of failures, i.e. when the executable
is found early in the PATH.

Now, there's no fundamental *reason* to do that alloc_empty_file()
early, except for how the code is structured. It partly makes the error
handling simpler, and since all the cases want the filp in the end,
doing it at the top means that it's only done once.

And we do occasionally use the file pointer early (ie lookup_open()
will clear/set FMODE_CREATED in it even if it doesn't otherwise touch
the file pointer) before the final lookup - and at creation time,
atomic_open() will actually want it for the final lookup.

Anyway, the real fix is definitely to just fix do_filp_open() to at
least not be *too* eager to allocate a file pointer.

In fact, that code is oddly non-optimal for another reason: it does
that "allocate and free file" not just when the path lookup fails, but
for things like RCU lookup failures too.

So what happens is that if RCU lookup fails, do_filp_open() will call
path_openat() twice: first with LOOKUP_RCU, and then without it. And
path_openat() will allocate that "struct file *" twice. On NFS - or
other filesystems that can return ESTALE - it will in fact do it three
times.

That's pretty disgusting.

Al, comments? We *could* just special-case the execve() code not to use
do_filp_open() and avoid this issue that way, but it does feel like
even the regular open() case is pessimal with that whole RCU situation.

Linus
On Tue, Jan 09, 2024 at 06:21:26PM -0800, Josh Triplett wrote:
> With Linus's fastpath patch ("no patch" with Linus's patch applied,
> and the followup -ENOMEM fix applied):
>
> === With only PATH ===
> 0.28user 2.44system 0:02.80elapsed 97%CPU (0avgtext+0avgdata 1152maxresident)k
> 0inputs+0outputs (0major+694706minor)pagefaults 0swaps
>
> === With 64 extra environment variables ===
> 0.29user 2.68system 0:03.06elapsed 97%CPU (0avgtext+0avgdata 1152maxresident)k
> 0inputs+0outputs (0major+712431minor)pagefaults 0swaps

Thanks for digging into this! I've been trying to figure out how to
measure only the execve portion of a workload (with perf)[1] to get a
more real-world measurement, but the above does show improvements for
the "open once early" approach. I'll get the behavior landed in -next
after the merge window closes, and we can continue examining whether we
can make do_filp_open() better...

-Kees

[1] https://lore.kernel.org/linux-perf-users/ZZ32p0LRSt5-vFPX@kernel.org/
On Wed, 10 Jan 2024 at 11:24, Kees Cook <keescook@chromium.org> wrote:
>
> I've been trying to figure out how to measure only the execve portion
> of a workload (with perf)[1] to get a more real-world measurement, but
> the above does show improvements for the "open once early" approach.
> I'll get the behavior landed in -next after the merge window closes,
> and we can continue examining whether we can make do_filp_open()
> better...

Well, the workload that shows the "early open" issue most is actually
something disgusting like this:

#include <unistd.h>

int main(int argc, char **argv, char **envp)
{
	for (int i = 0; i < 10000000; i++)
		execve("nonexistent", argv, envp);
	return 0;
}

and it's trivial to run under perf. You'll get something like this with
my patch:

    8.65%  [k] strncpy_from_user
    8.37%  [k] kmem_cache_alloc
    7.71%  [k] kmem_cache_free
    5.14%  [k] mod_objcg_state
    4.84%  [k] link_path_walk
    4.36%  [k] memset_orig
    3.93%  [k] percpu_counter_add_batch
    3.66%  [k] do_syscall_64
    3.63%  [k] path_openat

and with the hacky "open twice" you'll see that kmem_cache_alloc/free
should be much lower - it still does a kmem_cache_alloc/free() pair for
the pathname, but the 'struct file *' allocation/free goes away.

Anyway, you can see a very similar issue by replacing the "execve()"
line with

	open("nonexistent", O_RDONLY);

instead, and for exactly the same reason. Just to show that this issue
isn't execve-related.

I really think that the "open twice" approach is wrong. It will look
artificially good in this "does not exist" case, but it will penalize
other cases, and it just hides this issue.

Without either of the patches, you'll see that the execve case spends
all its time counting environment variables, and is much slower as a
result: instead of that "strncpy_from_user()", you'll see
"strnlen_user()" and "copy_from_user()" shoot up.

The above perf profile is actually quite good in general: the slab
alloc/free is a big issue only because nothing else is.

Oh, and the above profile depends *heavily* on your particular
microarchitecture and which mitigations you have in place. System call
overhead might be at the top, for example. And the cost of
"strncpy_from_user()" is so high above not because we do a lot of
copies (it's just that shortish filename), but mainly because user
copies are so insanely expensive on some uarchs due to CLAC/STAC being
expensive. So even a short filename copy can end up taking more than
the whole path walk.

So your exact mileage will vary, but you should see that pattern of
"kmem_cache_alloc/free" (and the "strnlen_user()" issue with none of
the patches applied) etc.

Linus
On Tue, Jan 09, 2024 at 07:54:30PM -0800, Linus Torvalds wrote:
> Al, comments? We *could* just special-case the execve() code not to
> use do_filp_open() and avoid this issue that way, but it does feel
> like even the regular open() case is pessimal with that whole RCU
> situation.

Two things, both related to ->atomic_open():

* we pass struct file to ->atomic_open(), so that it either opens the
sucker _or_ stores the resulting dentry in it (if ->atomic_open() bails
and tells us to use such-and-such dentry, other than the one it had
been given).

* cleanup logic becomes interesting if we try to keep struct file from
pass to pass. Sure, if it had never been touched _or_ if it had only
been fed to finish_no_open() (i.e. the ->atomic_open() bailout path) -
no problem, we can reuse it. But once it has hit do_dentry_open(),
things get fishy. We *must* fput() it if we got to setting
FMODE_OPENED - no plausible way around that. But what about the case
when we fail e.g. inside ->open()? Currently we undo just enough for
fput() to do the right thing without FMODE_OPENED, but e.g.
security_file_open() has no undoing for it. Having it called twice on
the same struct file might or might not work on all LSMs, but they
hadn't been exposed to that until now.

We could pass struct file ** to path_openat(), with

	file = *fp;
	if (!file) {
		file = alloc_empty_file(op->open_flag, current_cred());
		if (IS_ERR(file))
			return file;
		*fp = file;
	}

in the beginning, and have an extra flag that would be set as soon as
we hit do_dentry_open(). Then we could make the fput() in path_openat()
failure handling conditional upon that flag. Doable, but really not
pretty, especially since we'd need to massage the caller as well... A
less painful variant is

	if (error == -ECHILD && (flags & LOOKUP_RCU))
		return ERR_PTR(-ECHILD); // keep file for non-rcu pass
	*fp = NULL;
	fput(file);
	...

on the way out; that won't help with the -ESTALE side of things, but if
we hit *that*, struct file allocation overhead is really noise.

PS: apologies for the late reply - had been sick since Saturday, just
got more or less back to normal.
On Thu, Jan 11, 2024 at 09:47:11AM +0000, Al Viro wrote:
> Doable, but really not pretty, especially since we'd need to massage
> the caller as well... A less painful variant is
> 	if (error == -ECHILD && (flags & LOOKUP_RCU))
> 		return ERR_PTR(-ECHILD); // keep file for non-rcu pass
> 	*fp = NULL;
> 	fput(file);
> 	...
> on the way out; that won't help with the -ESTALE side of things, but
> if we hit *that*, struct file allocation overhead is really noise.

Something like the (completely untested) delta below, perhaps?

diff --git a/fs/namei.c b/fs/namei.c
index 5c318d657503..de770be9bb16 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3765,15 +3765,17 @@ static int do_o_path(struct nameidata *nd, unsigned flags, struct file *file)
 	return error;
 }
 
-static struct file *path_openat(struct nameidata *nd,
+static int path_openat(struct nameidata *nd, struct file **fp,
 		const struct open_flags *op, unsigned flags)
 {
-	struct file *file;
+	struct file *file = *fp;
 	int error;
 
-	file = alloc_empty_file(op->open_flag, current_cred());
-	if (IS_ERR(file))
-		return file;
+	if (!file) {
+		file = alloc_empty_file(op->open_flag, current_cred());
+		if (IS_ERR(file))
+			return PTR_ERR(file);
+	}
 
 	if (unlikely(file->f_flags & __O_TMPFILE)) {
 		error = do_tmpfile(nd, flags, op, file);
@@ -3789,11 +3791,17 @@ static struct file *path_openat(struct nameidata *nd,
 		terminate_walk(nd);
 	}
 	if (likely(!error)) {
-		if (likely(file->f_mode & FMODE_OPENED))
-			return file;
+		if (likely(file->f_mode & FMODE_OPENED)) {
+			*fp = file;
+			return 0;
+		}
 		WARN_ON(1);
 		error = -EINVAL;
 	}
+	if (error == -ECHILD && (flags & LOOKUP_RCU)) {
+		*fp = file;	// reuse on the next pass
+		return -ECHILD;
+	}
 	fput(file);
 	if (error == -EOPENSTALE) {
 		if (flags & LOOKUP_RCU)
@@ -3801,7 +3809,7 @@ static struct file *path_openat(struct nameidata *nd,
 		else
 			error = -ESTALE;
 	}
-	return ERR_PTR(error);
+	return error;
 }
 
 struct file *do_filp_open(int dfd, struct filename *pathname,
@@ -3809,25 +3817,27 @@ struct file *do_filp_open(int dfd, struct filename *pathname,
 {
 	struct nameidata nd;
 	int flags = op->lookup_flags;
-	struct file *filp;
+	struct file *file = NULL;
+	int err;
 
 	set_nameidata(&nd, dfd, pathname, NULL);
-	filp = path_openat(&nd, op, flags | LOOKUP_RCU);
-	if (unlikely(filp == ERR_PTR(-ECHILD)))
-		filp = path_openat(&nd, op, flags);
-	if (unlikely(filp == ERR_PTR(-ESTALE)))
-		filp = path_openat(&nd, op, flags | LOOKUP_REVAL);
+	err = path_openat(&nd, &file, op, flags | LOOKUP_RCU);
+	if (unlikely(err == -ECHILD))
+		err = path_openat(&nd, &file, op, flags);
+	if (unlikely(err == -ESTALE))
+		err = path_openat(&nd, &file, op, flags | LOOKUP_REVAL);
 	restore_nameidata();
-	return filp;
+	return unlikely(err) ? ERR_PTR(err) : file;
 }
 
 struct file *do_file_open_root(const struct path *root, const char *name,
 		const struct open_flags *op)
 {
 	struct nameidata nd;
-	struct file *file;
+	struct file *file = NULL;
 	struct filename *filename;
 	int flags = op->lookup_flags;
+	int err;
 
 	if (d_is_symlink(root->dentry) && op->intent & LOOKUP_OPEN)
 		return ERR_PTR(-ELOOP);
@@ -3837,14 +3847,14 @@ struct file *do_file_open_root(const struct path *root,
 		return ERR_CAST(filename);
 
 	set_nameidata(&nd, -1, filename, root);
-	file = path_openat(&nd, op, flags | LOOKUP_RCU);
-	if (unlikely(file == ERR_PTR(-ECHILD)))
-		file = path_openat(&nd, op, flags);
-	if (unlikely(file == ERR_PTR(-ESTALE)))
-		file = path_openat(&nd, op, flags | LOOKUP_REVAL);
+	err = path_openat(&nd, &file, op, flags | LOOKUP_RCU);
+	if (unlikely(err == -ECHILD))
+		err = path_openat(&nd, &file, op, flags);
+	if (unlikely(err == -ESTALE))
+		err = path_openat(&nd, &file, op, flags | LOOKUP_REVAL);
 	restore_nameidata();
 	putname(filename);
-	return file;
+	return unlikely(err) ? ERR_PTR(err) : file;
 }
 
 static struct dentry *filename_create(int dfd, struct filename *name,
On Thu, 11 Jan 2024 at 01:47, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Two things, both related to ->atomic_open():

Yeah, I was staring at the atomic_open() cases, and just thought that
we could allocate the filp early for that. It wouldn't matter for
normal filesystems, so from a performance standpoint it would be ok.

My handwavy thinking was that we'd remove 'filp' from the arguments we
pass around, and instead make it be a member of 'struct nameidata', and
then the different codepaths could decide that "now I need the filp, so
I'll instantiate it".

But then I looked more at the code, and it seemed to get quite messy,
quite fast.

Linus
On Thu, 11 Jan 2024 at 02:05, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Something like the (completely untested) delta below, perhaps?

No, this looks horrible.

This doesn't actually get rid of the early filp allocation for
execve(); it only seems to get rid of the repeated allocation for when
the RCU lookup fails.

And *that* is much easier to get rid of differently: just do the file
allocation in do_filp_open() instead of in path_openat(). We'd need
some way to make sure that there is no left-over crud from the RCU pass
into the next stage, but that doesn't look bad.

So the "path_openat() allocates a filp on each invocation" part looks
fairly easy.

It's the "don't allocate the filp until you actually need it" that
looks nasty. And yes, atomic_open() is part of the problem, but so is
the fact that we end up saving some flags in the filp early.

Linus
On Thu, 11 Jan 2024 at 09:42, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> It's the "don't allocate the filp until you actually need it" that
> looks nasty. And yes, atomic_open() is part of the problem, but so is
> the fact that we end up saving some flags in the filp early.

So just an update on this, since I came back to it.

It turns out that the bulk of the cost of the 'struct file *'
allocation is actually the exact same thing that was discussed the
other day about file locking: the fact that the file allocation is done
with SLAB_ACCOUNT.

See

  https://lore.kernel.org/all/CAHk-=wg_CoTOfkREgaQQA6oJ5nM9ZKYrTn=E1r-JnvmQcgWpSg@mail.gmail.com/

and that thread on the recent file locking accounting discussion.

Now, the allocation itself isn't free either, but SLAB_ACCOUNT really
does make it noticeably more expensive than it should be. It's a bit
more spread out: the cost of the slab allocation itself is mainly in
the (optimized) path that does a cmpxchg and the memset, but the
SLAB_ACCOUNT cost is spread out in mod_objcg_state,
__memcg_slab_post_alloc_hook, obj_cgroup_charge, and
__memcg_slab_free_hook.

And that actually opens the door for a _somewhat_ simple partial
workaround: instead of using SLAB_ACCOUNT, we could just do the memcg
accounting when we set FMODE_OPENED, and de-account it when we free the
filp (which checks FMODE_OPENED anyway, since other cleanup depends on
that).

That would help not just the execve() case, but the whole "open
non-existent file" case too. And I suspect open() with ENOENT is
actually way more common than execve() is - all those open->fstat paths
for various perfectly normal loads.

Anyway, I didn't actually write that patch, but I did want to mention
it as a smaller-scale fix (because getting rid of the 'struct file'
allocation entirely does look somewhat painful).

End result: I committed my "move do_open_execat() to the beginning of
execve()" patch, since it's clearly an improvement on the existing
behavior, and that whole "struct file allocations are unnecessarily
expensive" issue is a separate thing.

Linus
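(A standalone illustration, not from the thread, of the "open
non-existent file" pattern mentioned above: it is essentially Linus's
earlier execve() microbenchmark with the execve() call swapped for
open(), and under perf it should show the same struct-file allocation
overhead.)

/*
 * Illustrative only: hammer the ENOENT open() path. As of this thread,
 * each failed open() still allocates and frees a struct file inside
 * path_openat(), so the profile is dominated by the same
 * kmem_cache_alloc/free and memcg-accounting costs discussed above.
 */
#include <fcntl.h>

int main(void)
{
	for (int i = 0; i < 10000000; i++)
		open("nonexistent", O_RDONLY);	/* always fails: ENOENT */
	return 0;
}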
On January 20, 2024 2:18:36 PM PST, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> End result: I committed my "move do_open_execat() to the beginning of
> execve()" patch, since it's clearly an improvement on the existing
> behavior, and that whole "struct file allocations are unnecessarily
> expensive" issue is a separate thing.

Thanks! I'll add the other bits of refactoring I did in my version of
the clean-up (I created do_close_execat() for the repeated
"allow_write_access(file); fput(file);" calls, along with some
comments):
https://lore.kernel.org/lkml/202209161637.9EDAF6B18@keescook/

I like your removal of the "out" label! :)

-Kees