Message ID | 20230116212105.1840362-2-mjguzik@gmail.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp1408112wrn; Mon, 16 Jan 2023 13:26:08 -0800 (PST) X-Google-Smtp-Source: AMrXdXtQn89+vZQOUmja17gISqh/xhPvXwDRfzIJOrf7mML6yHHIk3vFx7cn0glSWgB8mZPH03ao X-Received: by 2002:aa7:d887:0:b0:499:1ed2:6461 with SMTP id u7-20020aa7d887000000b004991ed26461mr769988edq.17.1673904368336; Mon, 16 Jan 2023 13:26:08 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1673904368; cv=none; d=google.com; s=arc-20160816; b=Gj0SvYnY2vP3IOb3PrBAe4dtPchSIHbhbnZlASAPWhkqdsLG7XrZ4bvlba5Ucm/0OQ P7H0gxtxysvn6GufpPTCfMxXhnv4+Imd7P4Yjxhe8OJw8qJoIV2ezwdWqvsCY1CqvDD5 kvZkFLSmNeAaGL30L3DzYrLGyhXn8mooySL1qY0fSGJ77/+VAvqhfc4wtbmKHg9FK+Cr Bn86fon9Xgn8lUugE55IfdplU1fR7gpjRJdFjderY8rMasvVp0SfgTtKoQvttnZuEu57 HsPJUuxxnM113YUTrRnu5GMT2FGTCW8gtn7PSi3DjjqStXYGh9DwKt8yabxXgn9L0keH 5GLg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=CCEk4uyfLQ6ixPIMnmhnLIi8+lwOJlnHpr806bX8wOs=; b=x99U3WKAQRGAp2drkzCLwdcPJ3KbYW2aCiwkDMa49qVcubQYTKG7zm6Eiufh17/Q9S Ecok1+o/GH9xwWxcvAcZ41qqnEIdygcOZ19O2Ul1foHzrm6YNkp7EhuU8vVK5h5sg+aF 5oOCm2kbTr30ipyeXPSIN4/NleqOeSn/HK/Qx+BvdVYeDgzcWT639A/t0dsGwKQ5Iyqg qLjpoANxqo4rhU6EoEunzY3bn5CTHRX37Pq+Uwtuj4Jrzny2xE/wIBiQhZweA7ySZzsh 1Nkwn54y4Xzn7idxYSSLp1XKJFw5z4qbS+gCFaVe4aTfmnBYXdAzRmlLtl46Ngo9OD+k wqcw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20210112 header.b=dKk3o918; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id eg24-20020a056402289800b00475a6378758si27911670edb.554.2023.01.16.13.25.45; Mon, 16 Jan 2023 13:26:08 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20210112 header.b=dKk3o918; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234358AbjAPVVZ (ORCPT <rfc822;stefanalexe802@gmail.com> + 99 others); Mon, 16 Jan 2023 16:21:25 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33030 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234171AbjAPVVU (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Mon, 16 Jan 2023 16:21:20 -0500 Received: from mail-ej1-x62c.google.com (mail-ej1-x62c.google.com [IPv6:2a00:1450:4864:20::62c]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D337723C42; Mon, 16 Jan 2023 13:21:18 -0800 (PST) Received: by mail-ej1-x62c.google.com with SMTP id qx13so12520078ejb.13; Mon, 16 Jan 2023 13:21:18 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=CCEk4uyfLQ6ixPIMnmhnLIi8+lwOJlnHpr806bX8wOs=; b=dKk3o918oQTVmJd2rJ4ABOiqX1kqDqrlSHZtmQLo6RYGqNuOLgtT5JEMpWL5PN6PJP BWhNz+cb5ml7oDjTeksuAsATTZBOFDhw1kOuxYp86+NP0YJfZOXkDyfaJvYezuwiCxP1 O57S6AgkwHZXQeZAvsZbhRBLJC7Q7Mew1C4Ob/6RPHSWrHme2kGi05F4wnpaus3s4mJk 303ds1cmYyy7acXtYUoW7UsqzJcw5FRt5bxaGMfRQLjN2hQgHnKcTkZD+mR0Jf1hUogh vA5Vranx4BtvnLIyqxlDSc5AmOArrF4kOUliPw7vmghQlnus2HE77ZI7ARub2w4esy32 n2xA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=CCEk4uyfLQ6ixPIMnmhnLIi8+lwOJlnHpr806bX8wOs=; b=PBX6qb/fxu25JuKMNai0bUHJt9XLd6CqFmbnTlj1ktZYk0NhZG+31+tSsLZYUN7Onv Keu1xwhBqjlPajWzusay5yGElpWlOH72uEE30jk84LF58zj0HDcDTULzrnyjRhm6/UFG yiXBx0wseKKwX/Ru7eRAMhd0MXhCdQWnJ+nd4j0uEV0Pv/+CU/OntQTv+jAljZjyGkW7 NZSQUo7bjpQgmSw+wItACF3bOmBYxzR7e6YHUwB5Luwu8oiSwZ7nNSRozJAFjTr2fjVa /LIAGuof7tcERDaf2+8GKvFBliWiabVbrnLhpLb2PobiCei6GYIG/2P6I4FbmvqhxnmP Ok1g== X-Gm-Message-State: AFqh2kqCoEQdPLyo9qlury7fm5WwijTHU/r+6tL9bEgMVDEYN2FSc/EX l+VL1ovQcVGcq+FshUwpVHo= X-Received: by 2002:a17:906:eb8e:b0:871:6b9d:dbc with SMTP id mh14-20020a170906eb8e00b008716b9d0dbcmr470750ejb.21.1673904077291; Mon, 16 Jan 2023 13:21:17 -0800 (PST) Received: from f.. (cst-prg-72-175.cust.vodafone.cz. [46.135.72.175]) by smtp.gmail.com with ESMTPSA id q18-20020a17090676d200b00857c2c29553sm7961721ejn.197.2023.01.16.13.21.16 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 16 Jan 2023 13:21:16 -0800 (PST) From: Mateusz Guzik <mjguzik@gmail.com> To: viro@zeniv.linux.org.uk Cc: serge@hallyn.com, torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-security-module@vger.kernel.org, Mateusz Guzik <mjguzik@gmail.com> Subject: [PATCH v2 2/2] vfs: avoid duplicating creds in faccessat if possible Date: Mon, 16 Jan 2023 22:21:05 +0100 Message-Id: <20230116212105.1840362-2-mjguzik@gmail.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230116212105.1840362-1-mjguzik@gmail.com> References: <20230116212105.1840362-1-mjguzik@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM, RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1755215947241352681?= X-GMAIL-MSGID: =?utf-8?q?1755215947241352681?= |
Series |
[v2,1/2] capability: add cap_isidentical
|
|
Commit Message
Mateusz Guzik
Jan. 16, 2023, 9:21 p.m. UTC
access(2) remains commonly used, for example on exec:
access("/etc/ld.so.preload", R_OK)
or when running gcc: strace -c gcc empty.c
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
0.00 0.000000 0 42 26 access
It falls down to do_faccessat without the AT_EACCESS flag, which in turn
results in allocation of new creds in order to modify fsuid/fsgid and
caps. This is a very expensive process single-threaded and most notably
multi-threaded, with numerous structures getting refed and unrefed on
imminent new cred destruction.
Turns out for typical consumers the resulting creds would be identical
and this can be checked upfront, avoiding the hard work.
An access benchmark plugged into will-it-scale running on Cascade Lake
shows:
test proc before after
access1 1 1310582 2908735 (+121%) # distinct files
access1 24 4716491 63822173 (+1353%) # distinct files
access2 24 2378041 5370335 (+125%) # same file
The above benchmarks are not integrated into will-it-scale, but can be
found in a pull request:
https://github.com/antonblanchard/will-it-scale/pull/36/files
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
v2:
- fix current->cred usage warn reported by the kernel test robot
Link: https://lore.kernel.org/all/202301150709.9EC6UKBT-lkp@intel.com/
---
fs/open.c | 32 +++++++++++++++++++++++++++++++-
1 file changed, 31 insertions(+), 1 deletion(-)
Comments
On Mon, Jan 16, 2023 at 4:21 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > access(2) remains commonly used, for example on exec: > access("/etc/ld.so.preload", R_OK) > > or when running gcc: strace -c gcc empty.c > % time seconds usecs/call calls errors syscall > ------ ----------- ----------- --------- --------- ---------------- > 0.00 0.000000 0 42 26 access > > It falls down to do_faccessat without the AT_EACCESS flag, which in turn > results in allocation of new creds in order to modify fsuid/fsgid and > caps. This is a very expensive process single-threaded and most notably > multi-threaded, with numerous structures getting refed and unrefed on > imminent new cred destruction. > > Turns out for typical consumers the resulting creds would be identical > and this can be checked upfront, avoiding the hard work. > > An access benchmark plugged into will-it-scale running on Cascade Lake > shows: > test proc before after > access1 1 1310582 2908735 (+121%) # distinct files > access1 24 4716491 63822173 (+1353%) # distinct files > access2 24 2378041 5370335 (+125%) # same file Out of curiosity, do you have any measurements of the impact this patch has on the AT_EACCESS case when the creds do need to be modified? > The above benchmarks are not integrated into will-it-scale, but can be > found in a pull request: > https://github.com/antonblanchard/will-it-scale/pull/36/files > > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> > > v2: > - fix current->cred usage warn reported by the kernel test robot > Link: https://lore.kernel.org/all/202301150709.9EC6UKBT-lkp@intel.com/ > --- > fs/open.c | 32 +++++++++++++++++++++++++++++++- > 1 file changed, 31 insertions(+), 1 deletion(-) > > diff --git a/fs/open.c b/fs/open.c > index 82c1a28b3308..3c068a38044c 100644 > --- a/fs/open.c > +++ b/fs/open.c > @@ -367,7 +367,37 @@ COMPAT_SYSCALL_DEFINE6(fallocate, int, fd, int, mode, compat_arg_u64_dual(offset > * access() needs to use the real uid/gid, not the effective uid/gid. > * We do this by temporarily clearing all FS-related capabilities and > * switching the fsuid/fsgid around to the real ones. > + * > + * Creating new credentials is expensive, so we try to skip doing it, > + * which we can if the result would match what we already got. > */ > +static bool access_need_override_creds(int flags) > +{ > + const struct cred *cred; > + > + if (flags & AT_EACCESS) > + return false; > + > + cred = current_cred(); > + if (!uid_eq(cred->fsuid, cred->uid) || > + !gid_eq(cred->fsgid, cred->gid)) > + return true; > + > + if (!issecure(SECURE_NO_SETUID_FIXUP)) { > + kuid_t root_uid = make_kuid(cred->user_ns, 0); > + if (!uid_eq(cred->uid, root_uid)) { > + if (!cap_isclear(cred->cap_effective)) > + return true; > + } else { > + if (!cap_isidentical(cred->cap_effective, > + cred->cap_permitted)) > + return true; > + } > + } > + > + return false; > +} I worry a little that with nothing connecting access_need_override_creds() to access_override_creds() there is a bug waiting to happen if/when only one of the functions is updated. Given the limited credential changes in access_override_creds(), I wonder if a better solution would be to see if we could create a light(er)weight prepare_creds()/override_creds() that would avoid some of the prepare_creds() hotspots (I'm assuming that is where most of the time is being spent). It's possible this could help improve the performance of other, similar operations that need to modify task creds for a brief, and synchronous, period of time. Have you done any profiling inside of access_override_creds() to see where the most time is spent? Looking at the code I have some gut feelings on the hotspots, but it would be good to see some proper data before jumping to any conclusions. > static const struct cred *access_override_creds(void) > { > const struct cred *old_cred; > @@ -436,7 +466,7 @@ static long do_faccessat(int dfd, const char __user *filename, int mode, int fla > if (flags & AT_EMPTY_PATH) > lookup_flags |= LOOKUP_EMPTY; > > - if (!(flags & AT_EACCESS)) { > + if (access_need_override_creds(flags)) { > old_cred = access_override_creds(); > if (!old_cred) > return -ENOMEM; > -- > 2.34.1
On 1/20/23, Paul Moore <paul@paul-moore.com> wrote: > On Mon, Jan 16, 2023 at 4:21 PM Mateusz Guzik <mjguzik@gmail.com> wrote: >> >> access(2) remains commonly used, for example on exec: >> access("/etc/ld.so.preload", R_OK) >> >> or when running gcc: strace -c gcc empty.c >> % time seconds usecs/call calls errors syscall >> ------ ----------- ----------- --------- --------- ---------------- >> 0.00 0.000000 0 42 26 access >> >> It falls down to do_faccessat without the AT_EACCESS flag, which in turn >> results in allocation of new creds in order to modify fsuid/fsgid and >> caps. This is a very expensive process single-threaded and most notably >> multi-threaded, with numerous structures getting refed and unrefed on >> imminent new cred destruction. >> >> Turns out for typical consumers the resulting creds would be identical >> and this can be checked upfront, avoiding the hard work. >> >> An access benchmark plugged into will-it-scale running on Cascade Lake >> shows: >> test proc before after >> access1 1 1310582 2908735 (+121%) # distinct files >> access1 24 4716491 63822173 (+1353%) # distinct files >> access2 24 2378041 5370335 (+125%) # same file > > Out of curiosity, do you have any measurements of the impact this > patch has on the AT_EACCESS case when the creds do need to be > modified? > I could not be arsed to bench that. I'm not saying there is literally 0 impact, but it should not be high and the massive win in the case I patched imho justifies it. Last week I got a private reply from Linus suggesting the new checks only happen once at commit_cred() time, which would mean there would be at most one extra branch for the case you are concerned with. However, this quickly turn out to be rather hairy as there are games being played for example in copy_creds() which assigns them *without* calling commit_creds(). I was not comfortable pre-computing without sorting out the mess first and he conceded the new branchfest is not necessarily a big deal. That said, if you want some performance recovered for this case, here is an easy one: static const struct cred *access_override_creds(void) [..] old_cred = override_creds(override_cred); /* override_cred() gets its own ref */ put_cred(override_cred); As in the new creds get refed only to get unrefed immediately after. Whacking the 2 atomics should make up for it no problem on x86-64. Also see below. >> The above benchmarks are not integrated into will-it-scale, but can be >> found in a pull request: >> https://github.com/antonblanchard/will-it-scale/pull/36/files >> >> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> >> >> v2: >> - fix current->cred usage warn reported by the kernel test robot >> Link: https://lore.kernel.org/all/202301150709.9EC6UKBT-lkp@intel.com/ >> --- >> fs/open.c | 32 +++++++++++++++++++++++++++++++- >> 1 file changed, 31 insertions(+), 1 deletion(-) >> >> diff --git a/fs/open.c b/fs/open.c >> index 82c1a28b3308..3c068a38044c 100644 >> --- a/fs/open.c >> +++ b/fs/open.c >> @@ -367,7 +367,37 @@ COMPAT_SYSCALL_DEFINE6(fallocate, int, fd, int, mode, >> compat_arg_u64_dual(offset >> * access() needs to use the real uid/gid, not the effective uid/gid. >> * We do this by temporarily clearing all FS-related capabilities and >> * switching the fsuid/fsgid around to the real ones. >> + * >> + * Creating new credentials is expensive, so we try to skip doing it, >> + * which we can if the result would match what we already got. >> */ >> +static bool access_need_override_creds(int flags) >> +{ >> + const struct cred *cred; >> + >> + if (flags & AT_EACCESS) >> + return false; >> + >> + cred = current_cred(); >> + if (!uid_eq(cred->fsuid, cred->uid) || >> + !gid_eq(cred->fsgid, cred->gid)) >> + return true; >> + >> + if (!issecure(SECURE_NO_SETUID_FIXUP)) { >> + kuid_t root_uid = make_kuid(cred->user_ns, 0); >> + if (!uid_eq(cred->uid, root_uid)) { >> + if (!cap_isclear(cred->cap_effective)) >> + return true; >> + } else { >> + if (!cap_isidentical(cred->cap_effective, >> + cred->cap_permitted)) >> + return true; >> + } >> + } >> + >> + return false; >> +} > > I worry a little that with nothing connecting > access_need_override_creds() to access_override_creds() there is a bug > waiting to happen if/when only one of the functions is updated. > These funcs are literally next to each other, I don't think that is easy to miss. I concede a comment in access_override_creds to take a look at access_need_override_creds would not hurt, but I don't know if a resend to add it is justified. > Given the limited credential changes in access_override_creds(), I > wonder if a better solution would be to see if we could create a > light(er)weight prepare_creds()/override_creds() that would avoid some > of the prepare_creds() hotspots (I'm assuming that is where most of > the time is being spent). It's possible this could help improve the > performance of other, similar operations that need to modify task > creds for a brief, and synchronous, period of time. > So the fundamental problem here is that several refs need to be grabbed to make fully-fledged credentials. Then, as you are done, you have to undo them. This clearly sucks single-threaded. As other threads copying the same creds do the same atomics on the same vars, this sucks even more multithreaded. As a cop out one may notice several of these are probably always the same for creds derived from given base, so perhaps there can be an obj which wraps them and then you only have to ref *that* obj to implicitly hold on to the entire thing. As in this (and more): get_uid(new->user); get_user_ns(new->user_ns); get_group_info(new->group_info); would get replaced with: new->fancy_obj = getref_fancy_obj(new->fancy_obj); /* populate pointers here */ ... conceptually at least. But even then, while less shafted both multi and single-threaded, there is still a bottleneck. For a Real Solution(tm) for a general case I think has to start with an observartion creds either persist for a long time *OR* keep getting recreated. This would suggest holding on to them and looking them up instead just allocating, but all this opens another can of worms and I don't believe is worth the effort at this stage. But maybe someone has a better idea. That said, for the case of access(), I had the following in mind but once more considered it not justified at this stage. pseudocode-wise: struct cred *prepare_shallow_creds(void) new = kmem_cache_alloc(cred_jar, GFP_KERNEL); old = task->cred; memcpy(new, old, sizeof(struct cred)); here new creds have all the same pointers as old, but the target objs are only kept alive by the old creds still refing them. So by API contract you are required to keep them around. after you temporarily assign them you call revert_shallow_creds(): if (tempcred->usage == 1) /* nobody refed them, do the non_rcu check */ ... else /* somebody grabbed them, legitimize creds by * grabbing the missing refs */ get_uid(new->user); get_user_ns(new->user_ns); get_group_info(new->group_info); /* and so on */ So this would shave work from the case you are concerned with probably for all calls. I do think this is an ok idea overall, but I felt like overengineering for the problem at hand *at the time*. For some context, I'm looking at performance of certain VFS stuff and there is some serious fish to fry in the fstat department. The patch I posted is definitely worthwhile perf-wise and easy enough to reason about that I did no expect much opposition to it. If anything I expected opposition to the idea outlined above. > Have you done any profiling inside of access_override_creds() to see > where the most time is spent? Looking at the code I have some gut > feelings on the hotspots, but it would be good to see some proper data > before jumping to any conclusions. > See above. It's the atomics all the way down.
On 1/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote: > On 1/20/23, Paul Moore <paul@paul-moore.com> wrote: >> On Mon, Jan 16, 2023 at 4:21 PM Mateusz Guzik <mjguzik@gmail.com> wrote: >>> >>> access(2) remains commonly used, for example on exec: >>> access("/etc/ld.so.preload", R_OK) >>> >>> or when running gcc: strace -c gcc empty.c >>> % time seconds usecs/call calls errors syscall >>> ------ ----------- ----------- --------- --------- ---------------- >>> 0.00 0.000000 0 42 26 access >>> >>> It falls down to do_faccessat without the AT_EACCESS flag, which in turn >>> results in allocation of new creds in order to modify fsuid/fsgid and >>> caps. This is a very expensive process single-threaded and most notably >>> multi-threaded, with numerous structures getting refed and unrefed on >>> imminent new cred destruction. >>> >>> Turns out for typical consumers the resulting creds would be identical >>> and this can be checked upfront, avoiding the hard work. >>> >>> An access benchmark plugged into will-it-scale running on Cascade Lake >>> shows: >>> test proc before after >>> access1 1 1310582 2908735 (+121%) # distinct files >>> access1 24 4716491 63822173 (+1353%) # distinct files >>> access2 24 2378041 5370335 (+125%) # same file >> >> Out of curiosity, do you have any measurements of the impact this >> patch has on the AT_EACCESS case when the creds do need to be >> modified? >> > > I could not be arsed to bench that. I'm not saying there is literally 0 > impact, but it should not be high and the massive win in the case I > patched imho justifies it. > > Last week I got a private reply from Linus suggesting the new checks > only happen once at commit_cred() time, which would mean there would be > at most one extra branch for the case you are concerned with. However, > this quickly turn out to be rather hairy as there are games being > played for example in copy_creds() which assigns them *without* calling > commit_creds(). I was not comfortable pre-computing without sorting out > the mess first and he conceded the new branchfest is not necessarily a > big deal. > > That said, if you want some performance recovered for this case, here > is an easy one: > > static const struct cred *access_override_creds(void) > [..] > old_cred = override_creds(override_cred); > > /* override_cred() gets its own ref */ > put_cred(override_cred); > > As in the new creds get refed only to get unrefed immediately after. > Whacking the 2 atomics should make up for it no problem on x86-64. > > Also see below. > >>> The above benchmarks are not integrated into will-it-scale, but can be >>> found in a pull request: >>> https://github.com/antonblanchard/will-it-scale/pull/36/files >>> >>> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> >>> >>> v2: >>> - fix current->cred usage warn reported by the kernel test robot >>> Link: https://lore.kernel.org/all/202301150709.9EC6UKBT-lkp@intel.com/ >>> --- >>> fs/open.c | 32 +++++++++++++++++++++++++++++++- >>> 1 file changed, 31 insertions(+), 1 deletion(-) >>> >>> diff --git a/fs/open.c b/fs/open.c >>> index 82c1a28b3308..3c068a38044c 100644 >>> --- a/fs/open.c >>> +++ b/fs/open.c >>> @@ -367,7 +367,37 @@ COMPAT_SYSCALL_DEFINE6(fallocate, int, fd, int, >>> mode, >>> compat_arg_u64_dual(offset >>> * access() needs to use the real uid/gid, not the effective uid/gid. >>> * We do this by temporarily clearing all FS-related capabilities and >>> * switching the fsuid/fsgid around to the real ones. >>> + * >>> + * Creating new credentials is expensive, so we try to skip doing it, >>> + * which we can if the result would match what we already got. >>> */ >>> +static bool access_need_override_creds(int flags) >>> +{ >>> + const struct cred *cred; >>> + >>> + if (flags & AT_EACCESS) >>> + return false; >>> + >>> + cred = current_cred(); >>> + if (!uid_eq(cred->fsuid, cred->uid) || >>> + !gid_eq(cred->fsgid, cred->gid)) >>> + return true; >>> + >>> + if (!issecure(SECURE_NO_SETUID_FIXUP)) { >>> + kuid_t root_uid = make_kuid(cred->user_ns, 0); >>> + if (!uid_eq(cred->uid, root_uid)) { >>> + if (!cap_isclear(cred->cap_effective)) >>> + return true; >>> + } else { >>> + if (!cap_isidentical(cred->cap_effective, >>> + cred->cap_permitted)) >>> + return true; >>> + } >>> + } >>> + >>> + return false; >>> +} >> >> I worry a little that with nothing connecting >> access_need_override_creds() to access_override_creds() there is a bug >> waiting to happen if/when only one of the functions is updated. >> > > These funcs are literally next to each other, I don't think that is easy > to miss. I concede a comment in access_override_creds to take a look at > access_need_override_creds would not hurt, but I don't know if a resend > to add it is justified. > >> Given the limited credential changes in access_override_creds(), I >> wonder if a better solution would be to see if we could create a >> light(er)weight prepare_creds()/override_creds() that would avoid some >> of the prepare_creds() hotspots (I'm assuming that is where most of >> the time is being spent). It's possible this could help improve the >> performance of other, similar operations that need to modify task >> creds for a brief, and synchronous, period of time. >> > > So the fundamental problem here is that several refs need to be grabbed > to make fully-fledged credentials. Then, as you are done, you have to > undo them. This clearly sucks single-threaded. As other threads copying > the same creds do the same atomics on the same vars, this sucks even > more multithreaded. > > As a cop out one may notice several of these are probably always the > same for creds derived from given base, so perhaps there can be an obj > which wraps them and then you only have to ref *that* obj to implicitly > hold on to the entire thing. > > As in this (and more): > get_uid(new->user); > get_user_ns(new->user_ns); > get_group_info(new->group_info); > > would get replaced with: > new->fancy_obj = getref_fancy_obj(new->fancy_obj); > /* populate pointers here */ > > ... conceptually at least. > > But even then, while less shafted both multi and single-threaded, there > is still a bottleneck. > > For a Real Solution(tm) for a general case I think has to start with an > observartion creds either persist for a long time *OR* keep getting > recreated. This would suggest holding on to them and looking them up > instead just allocating, but all this opens another can of worms and > I don't believe is worth the effort at this stage. But maybe someone > has a better idea. > > That said, for the case of access(), I had the following in mind but > once more considered it not justified at this stage. > > pseudocode-wise: > struct cred *prepare_shallow_creds(void) > new = kmem_cache_alloc(cred_jar, GFP_KERNEL); > old = task->cred; > memcpy(new, old, sizeof(struct cred)); > > here new creds have all the same pointers as old, but the target objs > are only kept alive by the old creds still refing them. So by API > contract you are required to keep them around. > > after you temporarily assign them you call revert_shallow_creds(): > if (tempcred->usage == 1) > /* nobody refed them, do the non_rcu check */ > ... > else > /* somebody grabbed them, legitimize creds by > * grabbing the missing refs > */ > get_uid(new->user); > get_user_ns(new->user_ns); > get_group_info(new->group_info); > /* and so on */ > > So this would shave work from the case you are concerned with probably > for all calls. > > I do think this is an ok idea overall, but I felt like overengineering for > the > problem at hand *at the time*. > > For some context, I'm looking at performance of certain VFS stuff and > there is some serious fish to fry in the fstat department. The patch I > posted is definitely worthwhile perf-wise and easy enough to reason > about that I did no expect much opposition to it. If anything I expected > opposition to the idea outlined above. > So just in case, if ultimately the consensus is that shallow copy or similar is the way to go, I can take a stab some time next week. >> Have you done any profiling inside of access_override_creds() to see >> where the most time is spent? Looking at the code I have some gut >> feelings on the hotspots, but it would be good to see some proper data >> before jumping to any conclusions. >> > > See above. It's the atomics all the way down. > > -- > Mateusz Guzik <mjguzik gmail.com> >
On Fri, Jan 20, 2023 at 7:50 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > On 1/20/23, Paul Moore <paul@paul-moore.com> wrote: > > On Mon, Jan 16, 2023 at 4:21 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > >> > >> access(2) remains commonly used, for example on exec: > >> access("/etc/ld.so.preload", R_OK) > >> > >> or when running gcc: strace -c gcc empty.c > >> % time seconds usecs/call calls errors syscall > >> ------ ----------- ----------- --------- --------- ---------------- > >> 0.00 0.000000 0 42 26 access > >> > >> It falls down to do_faccessat without the AT_EACCESS flag, which in turn > >> results in allocation of new creds in order to modify fsuid/fsgid and > >> caps. This is a very expensive process single-threaded and most notably > >> multi-threaded, with numerous structures getting refed and unrefed on > >> imminent new cred destruction. > >> > >> Turns out for typical consumers the resulting creds would be identical > >> and this can be checked upfront, avoiding the hard work. > >> > >> An access benchmark plugged into will-it-scale running on Cascade Lake > >> shows: > >> test proc before after > >> access1 1 1310582 2908735 (+121%) # distinct files > >> access1 24 4716491 63822173 (+1353%) # distinct files > >> access2 24 2378041 5370335 (+125%) # same file > > > > Out of curiosity, do you have any measurements of the impact this > > patch has on the AT_EACCESS case when the creds do need to be > > modified? > > I could not be arsed to bench that. I'm not saying there is literally 0 > impact, but it should not be high and the massive win in the case I > patched imho justifies it. That's one way to respond to an honest question asking if you've done any tests on the other side of the change. I agree the impact should be less than the advantage you've shown, but sometimes it's nice to see these things. > Last week I got a private reply from Linus suggesting the new checks > only happen once at commit_cred() time, which would mean there would be > at most one extra branch for the case you are concerned with. However, > this quickly turn out to be rather hairy as there are games being > played for example in copy_creds() which assigns them *without* calling > commit_creds(). I was not comfortable pre-computing without sorting out > the mess first and he conceded the new branchfest is not necessarily a > big deal. > > That said, if you want some performance recovered for this case, here > is an easy one: > > static const struct cred *access_override_creds(void) > [..] > old_cred = override_creds(override_cred); > > /* override_cred() gets its own ref */ > put_cred(override_cred); > > As in the new creds get refed only to get unrefed immediately after. > Whacking the 2 atomics should make up for it no problem on x86-64. > > Also see below. > > >> The above benchmarks are not integrated into will-it-scale, but can be > >> found in a pull request: > >> https://github.com/antonblanchard/will-it-scale/pull/36/files > >> > >> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> > >> > >> v2: > >> - fix current->cred usage warn reported by the kernel test robot > >> Link: https://lore.kernel.org/all/202301150709.9EC6UKBT-lkp@intel.com/ > >> --- > >> fs/open.c | 32 +++++++++++++++++++++++++++++++- > >> 1 file changed, 31 insertions(+), 1 deletion(-) > >> > >> diff --git a/fs/open.c b/fs/open.c > >> index 82c1a28b3308..3c068a38044c 100644 > >> --- a/fs/open.c > >> +++ b/fs/open.c > >> @@ -367,7 +367,37 @@ COMPAT_SYSCALL_DEFINE6(fallocate, int, fd, int, mode, > >> compat_arg_u64_dual(offset > >> * access() needs to use the real uid/gid, not the effective uid/gid. > >> * We do this by temporarily clearing all FS-related capabilities and > >> * switching the fsuid/fsgid around to the real ones. > >> + * > >> + * Creating new credentials is expensive, so we try to skip doing it, > >> + * which we can if the result would match what we already got. > >> */ > >> +static bool access_need_override_creds(int flags) > >> +{ > >> + const struct cred *cred; > >> + > >> + if (flags & AT_EACCESS) > >> + return false; > >> + > >> + cred = current_cred(); > >> + if (!uid_eq(cred->fsuid, cred->uid) || > >> + !gid_eq(cred->fsgid, cred->gid)) > >> + return true; > >> + > >> + if (!issecure(SECURE_NO_SETUID_FIXUP)) { > >> + kuid_t root_uid = make_kuid(cred->user_ns, 0); > >> + if (!uid_eq(cred->uid, root_uid)) { > >> + if (!cap_isclear(cred->cap_effective)) > >> + return true; > >> + } else { > >> + if (!cap_isidentical(cred->cap_effective, > >> + cred->cap_permitted)) > >> + return true; > >> + } > >> + } > >> + > >> + return false; > >> +} > > > > I worry a little that with nothing connecting > > access_need_override_creds() to access_override_creds() there is a bug > > waiting to happen if/when only one of the functions is updated. > > These funcs are literally next to each other, I don't think that is easy > to miss. I concede a comment in access_override_creds to take a look at > access_need_override_creds would not hurt, but I don't know if a resend > to add it is justified. Perhaps it's because I have to deal with a fair amount of code getting changed in one place and not another, but I would think that a comment would be the least one could do here and would justify a respin. > > Given the limited credential changes in access_override_creds(), I > > wonder if a better solution would be to see if we could create a > > light(er)weight prepare_creds()/override_creds() that would avoid some > > of the prepare_creds() hotspots (I'm assuming that is where most of > > the time is being spent). It's possible this could help improve the > > performance of other, similar operations that need to modify task > > creds for a brief, and synchronous, period of time. ... > For a Real Solution(tm) for a general case I think has to start with an > observartion creds either persist for a long time *OR* keep getting > recreated. This would suggest holding on to them and looking them up > instead just allocating, but all this opens another can of worms and > I don't believe is worth the effort at this stage. But maybe someone > has a better idea. > > That said, for the case of access(), I had the following in mind but > once more considered it not justified at this stage. > > pseudocode-wise: > struct cred *prepare_shallow_creds(void) > new = kmem_cache_alloc(cred_jar, GFP_KERNEL); > old = task->cred; > memcpy(new, old, sizeof(struct cred)); > > here new creds have all the same pointers as old, but the target objs > are only kept alive by the old creds still refing them. So by API > contract you are required to keep them around. > > after you temporarily assign them you call revert_shallow_creds(): > if (tempcred->usage == 1) > /* nobody refed them, do the non_rcu check */ > ... > else > /* somebody grabbed them, legitimize creds by > * grabbing the missing refs > */ > get_uid(new->user); > get_user_ns(new->user_ns); > get_group_info(new->group_info); > /* and so on */ > > So this would shave work from the case you are concerned with probably > for all calls. > > I do think this is an ok idea overall, but I felt like overengineering for the > problem at hand *at the time*. In my opinion a generalized shallow copy approach has more value than a one-off solution that has the potential to fall out of sync and cause a problem in the future (I recognize that you disagree on the likelihood of this happening). > For some context, I'm looking at performance of certain VFS stuff and > there is some serious fish to fry in the fstat department. I assumed it was part of some larger perf work, but I'm not sure the context is that important for this particular discussion. > The patch I > posted is definitely worthwhile perf-wise and easy enough to reason > about that I did no expect much opposition to it. If anything I expected > opposition to the idea outlined above. Ultimately it's a call for the VFS folks as they are responsible for the access() code. -- paul-moore.com
On 1/23/23, Paul Moore <paul@paul-moore.com> wrote: > On Fri, Jan 20, 2023 at 7:50 PM Mateusz Guzik <mjguzik@gmail.com> wrote: >> On 1/20/23, Paul Moore <paul@paul-moore.com> wrote: >> > On Mon, Jan 16, 2023 at 4:21 PM Mateusz Guzik <mjguzik@gmail.com> >> > wrote: >> >> >> >> access(2) remains commonly used, for example on exec: >> >> access("/etc/ld.so.preload", R_OK) >> >> >> >> or when running gcc: strace -c gcc empty.c >> >> % time seconds usecs/call calls errors syscall >> >> ------ ----------- ----------- --------- --------- ---------------- >> >> 0.00 0.000000 0 42 26 access >> >> >> >> It falls down to do_faccessat without the AT_EACCESS flag, which in >> >> turn >> >> results in allocation of new creds in order to modify fsuid/fsgid and >> >> caps. This is a very expensive process single-threaded and most >> >> notably >> >> multi-threaded, with numerous structures getting refed and unrefed on >> >> imminent new cred destruction. >> >> >> >> Turns out for typical consumers the resulting creds would be identical >> >> and this can be checked upfront, avoiding the hard work. >> >> >> >> An access benchmark plugged into will-it-scale running on Cascade Lake >> >> shows: >> >> test proc before after >> >> access1 1 1310582 2908735 (+121%) # distinct files >> >> access1 24 4716491 63822173 (+1353%) # distinct files >> >> access2 24 2378041 5370335 (+125%) # same file >> > >> > Out of curiosity, do you have any measurements of the impact this >> > patch has on the AT_EACCESS case when the creds do need to be >> > modified? >> >> I could not be arsed to bench that. I'm not saying there is literally 0 >> impact, but it should not be high and the massive win in the case I >> patched imho justifies it. > > That's one way to respond to an honest question asking if you've done > any tests on the other side of the change. I agree the impact should > be less than the advantage you've shown, but sometimes it's nice to > see these things. > So reading this now I do think it was worded in quite a poor manner, so apologies for that. Wording aside, I don't know whether this is just a passing remark or are you genuinely concerned about the other case. If you are, I noted there is an immediately achievable speed up by eliminating the get/put ref cycle on creds coming from override_creds + put_cred to backpedal from it. This should be enough to cover it, but there are cosmetic problems around it I don't want to flame over. Say override_creds_noref gets added doing the usual work, except for get_new_cred. Then override_creds would be: validate_creds(new); get_new_cred((struct cred *)new); override_creds_noref(new); But override_creds_noref would retain validate_creds new/old and the above would repeat it which would preferably be avoided. Not a problem if it is deemed ok to get_new_cred without validate_creds. >> These funcs are literally next to each other, I don't think that is easy >> to miss. I concede a comment in access_override_creds to take a look at >> access_need_override_creds would not hurt, but I don't know if a resend >> to add it is justified. > > Perhaps it's because I have to deal with a fair amount of code getting > changed in one place and not another, but I would think that a comment > would be the least one could do here and would justify a respin. > I'm not going to *insist* on not adding that comment. Would this work for you? diff --git a/fs/open.c b/fs/open.c index 3c068a38044c..756177b94b04 100644 --- a/fs/open.c +++ b/fs/open.c @@ -407,6 +407,11 @@ static const struct cred *access_override_creds(void) if (!override_cred) return NULL; + /* + * XXX access_need_override_creds performs checks in hopes of + * skipping this work. Make sure it stays in sync if making any + * changes here. + */ override_cred->fsuid = override_cred->uid; override_cred->fsgid = override_cred->gid; if not, can you phrase it however you see fit for me to copy-paste? > In my opinion a generalized shallow copy approach has more value than > a one-off solution that has the potential to fall out of sync and > cause a problem in the future (I recognize that you disagree on the > likelihood of this happening). > To reiterate my stance, the posted patch is trivial to reason about and it provides a marked improvement for the most commonly seen case. It comes with some extra branches for the less common case, which I don't consider to be a big deal. From the quick toor I took around kernel/cred.c I think the cred code is messy and it would be best to sort it out before doing anything fancy. I have no interest in doing the clean up. The shallow copy idea I outlined above looks very simple, but I can't help the feeling there are surprises there, so I'm reluctant to roll with it as is. More importantly I can't show any workload which runs into the other case, thus if someone asks me to justify the complexity I will have nothing, which is mostly why I did not go for it. That said, if powers to be declare this is the way forward, I can spend some time getting it done. > Ultimately it's a call for the VFS folks as they are responsible for > the access() code. > Well let's wait and see. :)
On Tue, Jan 24, 2023 at 5:16 AM Mateusz Guzik <mjguzik@gmail.com> wrote: > On 1/23/23, Paul Moore <paul@paul-moore.com> wrote: > > On Fri, Jan 20, 2023 at 7:50 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > >> On 1/20/23, Paul Moore <paul@paul-moore.com> wrote: > >> > On Mon, Jan 16, 2023 at 4:21 PM Mateusz Guzik <mjguzik@gmail.com> > >> > wrote: > >> >> > >> >> access(2) remains commonly used, for example on exec: > >> >> access("/etc/ld.so.preload", R_OK) > >> >> > >> >> or when running gcc: strace -c gcc empty.c > >> >> % time seconds usecs/call calls errors syscall > >> >> ------ ----------- ----------- --------- --------- ---------------- > >> >> 0.00 0.000000 0 42 26 access > >> >> > >> >> It falls down to do_faccessat without the AT_EACCESS flag, which in > >> >> turn > >> >> results in allocation of new creds in order to modify fsuid/fsgid and > >> >> caps. This is a very expensive process single-threaded and most > >> >> notably > >> >> multi-threaded, with numerous structures getting refed and unrefed on > >> >> imminent new cred destruction. > >> >> > >> >> Turns out for typical consumers the resulting creds would be identical > >> >> and this can be checked upfront, avoiding the hard work. > >> >> > >> >> An access benchmark plugged into will-it-scale running on Cascade Lake > >> >> shows: > >> >> test proc before after > >> >> access1 1 1310582 2908735 (+121%) # distinct files > >> >> access1 24 4716491 63822173 (+1353%) # distinct files > >> >> access2 24 2378041 5370335 (+125%) # same file > >> > > >> > Out of curiosity, do you have any measurements of the impact this > >> > patch has on the AT_EACCESS case when the creds do need to be > >> > modified? > >> > >> I could not be arsed to bench that. I'm not saying there is literally 0 > >> impact, but it should not be high and the massive win in the case I > >> patched imho justifies it. > > > > That's one way to respond to an honest question asking if you've done > > any tests on the other side of the change. I agree the impact should > > be less than the advantage you've shown, but sometimes it's nice to > > see these things. > > So reading this now I do think it was worded in quite a poor manner, so > apologies for that. Thanks, but no worries. Work in this space long enough and everyone eventually ends up sending a mail or two that could have been worded better, myself included. > Wording aside, I don't know whether this is just a passing remark or > are you genuinely concerned about the other case. My main concern is the duplication between the cred check and the cred override functions leading to a bug at some unknown point in the future. Changes to credential checking, and access control in general, always gets my attention and due to past bruises I'm very sensitive to out-of-sync issues due to code duplication; so your patch was a bit of a "perfect storm" of concern for me :) The profiling questions were mainly there as a curiosity since it looked like this was part of a larger performance oriented effort and I thought you might have more data that didn't make it into the commit description. > >> These funcs are literally next to each other, I don't think that is easy > >> to miss. I concede a comment in access_override_creds to take a look at > >> access_need_override_creds would not hurt, but I don't know if a resend > >> to add it is justified. > > > > Perhaps it's because I have to deal with a fair amount of code getting > > changed in one place and not another, but I would think that a comment > > would be the least one could do here and would justify a respin. > > I'm not going to *insist* on not adding that comment. > > Would this work for you? > > diff --git a/fs/open.c b/fs/open.c > index 3c068a38044c..756177b94b04 100644 > --- a/fs/open.c > +++ b/fs/open.c > @@ -407,6 +407,11 @@ static const struct cred *access_override_creds(void) > if (!override_cred) > return NULL; > > + /* > + * XXX access_need_override_creds performs checks in hopes of > + * skipping this work. Make sure it stays in sync if making any > + * changes here. > + */ > override_cred->fsuid = override_cred->uid; > override_cred->fsgid = override_cred->gid; > > if not, can you phrase it however you see fit for me to copy-paste? That wording looks good to me and would help me feel a bit better about this change, thank you.
On Tue, Jan 24, 2023 at 9:00 AM Paul Moore <paul@paul-moore.com> wrote: > > My main concern is the duplication between the cred check and the cred > override functions leading to a bug at some unknown point in the > future. Yeah, it might be good to try to have some common logic for this, although it's kind of messy. The access_override_creds() logic is fairly different from the "do I need to create new creds" decision, since instead of *testing* whether the fs[ug]id and [ug]id matches, it just sets the fs[ug]id to the expected values. So that part of the test doesn't really exist. And the same is true of the !SECURE_NO_SETUID_FIXUP logic case - the current access() override doesn't _test_ those variables for equality, it just sets them. So Mateusz' patch doesn't really duplicate any actual logic, it just has similarities in that it checks "would that new cred that access_override_creds() would create be the same as the old one". So sharing code is hard, because the code is fundamentally not the same. The new access_need_override_creds() function is right next to the pre-existing access_override_creds() one, so at least they are close to each other. That may be the best that can be done. Maybe some of the "is it the root uid" logic could be shared, though. Both cases do have this part in common: if (!issecure(SECURE_NO_SETUID_FIXUP)) { /* Clear the capabilities if we switch to a non-root user */ kuid_t root_uid = make_kuid(override_cred->user_ns, 0); if (!uid_eq(override_cred->uid, root_uid)) and that is arguably the nastiest part of it all. I don't think it's all that likely to change in the future, though (except for possible changes due to user_ns re-orgs, but then changing both would be very natural). Linus
On 1/24/23, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, Jan 24, 2023 at 9:00 AM Paul Moore <paul@paul-moore.com> wrote: >> >> My main concern is the duplication between the cred check and the cred >> override functions leading to a bug at some unknown point in the >> future. > > Yeah, it might be good to try to have some common logic for this, > although it's kind of messy. > > The access_override_creds() logic is fairly different from the "do I > need to create new creds" decision, since instead of *testing* whether > the fs[ug]id and [ug]id matches, it just sets the fs[ug]id to the > expected values. > > So that part of the test doesn't really exist. > > And the same is true of the !SECURE_NO_SETUID_FIXUP logic case - the > current access() override doesn't _test_ those variables for equality, > it just sets them. > > So Mateusz' patch doesn't really duplicate any actual logic, it just > has similarities in that it checks "would that new cred that > access_override_creds() would create be the same as the old one". > > So sharing code is hard, because the code is fundamentally not the same. > > The new access_need_override_creds() function is right next to the > pre-existing access_override_creds() one, so at least they are close > to each other. That may be the best that can be done. > > Maybe some of the "is it the root uid" logic could be shared, though. > Both cases do have this part in common: > > if (!issecure(SECURE_NO_SETUID_FIXUP)) { > /* Clear the capabilities if we switch to a non-root user > */ > kuid_t root_uid = make_kuid(override_cred->user_ns, 0); > if (!uid_eq(override_cred->uid, root_uid)) > > and that is arguably the nastiest part of it all. > > I don't think it's all that likely to change in the future, though > (except for possible changes due to user_ns re-orgs, but then changing > both would be very natural). > You could dedup make_kuid + uid_eq check, but does it really buy anything? ns changes which break compilation will find both spots. Similarly any grep used to find one should also automagically find the other one. I think this patch generated way more discussion than it warrants, especially since I deliberately went for the trivial approach in hopes of avoiding this kind of stuff. So how about I simply respin with the comment I mailed earlier, repasted here for reference (with a slight tweak): diff --git a/fs/open.c b/fs/open.c index 3c068a38044c..756177b94b04 100644 --- a/fs/open.c +++ b/fs/open.c @@ -407,6 +407,11 @@ static const struct cred *access_override_creds(void) if (!override_cred) return NULL; + /* + * XXX access_need_override_creds performs checks in hopes of + * skipping this work. Make sure it stays in sync if making any + * changes in this routine. + */ override_cred->fsuid = override_cred->uid; override_cred->fsgid = override_cred->gid; sounds like a plan?
On Tue, Jan 24, 2023 at 10:42 AM Mateusz Guzik <mjguzik@gmail.com> wrote: > > So how about I simply respin with the comment I mailed earlier, > repasted here for reference (with a slight tweak): Ack, that's probably the way to go, Linus
On Tue, Jan 24, 2023 at 12:14 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, Jan 24, 2023 at 9:00 AM Paul Moore <paul@paul-moore.com> wrote: > > > > My main concern is the duplication between the cred check and the cred > > override functions leading to a bug at some unknown point in the > > future. > > Yeah, it might be good to try to have some common logic for this, > although it's kind of messy. > > The access_override_creds() logic is fairly different from the "do I > need to create new creds" decision, since instead of *testing* whether > the fs[ug]id and [ug]id matches, it just sets the fs[ug]id to the > expected values. > > So that part of the test doesn't really exist. > > And the same is true of the !SECURE_NO_SETUID_FIXUP logic case - the > current access() override doesn't _test_ those variables for equality, > it just sets them. > > So Mateusz' patch doesn't really duplicate any actual logic, it just > has similarities in that it checks "would that new cred that > access_override_creds() would create be the same as the old one". Perhaps I didn't do a very good job explaining my concern, or it got a little twisted as the thread went on (likely due to my use of "duplication"), but my concern wasn't so much that access_override_creds() or the proposed access_need_override_creds() duplicates code elsewhere, it was that the proposed access_need_override_creds() function is a separate check from the code which is actually responsible for doing the credential fixup for AT_EACCESS. I'm worried about a subtle change in one function not being reflected in the other and causing an access control bug. > The new access_need_override_creds() function is right next to the > pre-existing access_override_creds() one, so at least they are close > to each other. That may be the best that can be done. Possibly, and the comment should help. Although I'm looking at this again and realized that only do_faccessat() calls access_override_creds(), so why not just fold the new access_need_override_creds() logic into access_override_creds()? Just have one function that takes the flag value, and returns an old_cred/NULL pointer (or pass old_cred to the function by reference and return an error code); that should still provide the performance win Mateusz is looking for while providing additional safety against out-of-sync changes. I would guess the code would be smaller too. > Maybe some of the "is it the root uid" logic could be shared, though. > Both cases do have this part in common: > > if (!issecure(SECURE_NO_SETUID_FIXUP)) { > /* Clear the capabilities if we switch to a non-root user */ > kuid_t root_uid = make_kuid(override_cred->user_ns, 0); > if (!uid_eq(override_cred->uid, root_uid)) > > and that is arguably the nastiest part of it all. > > I don't think it's all that likely to change in the future, though > (except for possible changes due to user_ns re-orgs, but then changing > both would be very natural). You're probably right. As I said earlier, I'm just really sensitive to code that is vulnerable to going out of sync like this and I try to avoid it whenever possible.
On 1/24/23, Paul Moore <paul@paul-moore.com> wrote: > Although I'm looking at this again and realized that only > do_faccessat() calls access_override_creds(), so why not just fold the > new access_need_override_creds() logic into access_override_creds()? > Just have one function that takes the flag value, and returns an > old_cred/NULL pointer (or pass old_cred to the function by reference > and return an error code); that should still provide the performance > win Mateusz is looking for while providing additional safety against > out-of-sync changes. I would guess the code would be smaller too. > It is unclear from the description if you are arguing for moving the new func into access_override_creds almost as is just put prior to existing code *or* mixing checks with assignments. static bool *access_override_creds(struct cred **ptr) [snip] if (!uid_eq(cred->fsuid, cred->uid) || !gid_eq(cred->fsgid, cred->gid)) return false; /* remaining checks go here as well */ [snip] override_cred = prepare_creds(); if (!override_cred) { *ptr = NULL; return true; } override_cred->fsuid = override_cred->uid; override_cred->fsgid = override_cred->gid; [snip] If this is what you had in mind, I note it retains all the duplication except in one func body which I'm confident does not buy anything, provided the warning comment is added. At the same time the downside is that it uglifies error handling at the callsite, so I would say a net loss. Alternatively, if you want to somehow keep tests aroung assignments the code gets super hairy. But maybe you wanted something else? As I noted in another email this already got more discussion than it warrants. Addition of the warning comment makes sense, but concerns after that don't sound legitimate to me.
On 1/25/23, Mateusz Guzik <mjguzik@gmail.com> wrote: > On 1/24/23, Paul Moore <paul@paul-moore.com> wrote: >> Although I'm looking at this again and realized that only >> do_faccessat() calls access_override_creds(), so why not just fold the >> new access_need_override_creds() logic into access_override_creds()? >> Just have one function that takes the flag value, and returns an >> old_cred/NULL pointer (or pass old_cred to the function by reference >> and return an error code); that should still provide the performance >> win Mateusz is looking for while providing additional safety against >> out-of-sync changes. I would guess the code would be smaller too. >> > > It is unclear from the description if you are arguing for moving the new > func into access_override_creds almost as is just put prior to existing > code *or* mixing checks with assignments. > > static bool *access_override_creds(struct cred **ptr) > [snip] > if (!uid_eq(cred->fsuid, cred->uid) || > !gid_eq(cred->fsgid, cred->gid)) > return false; > /* remaining checks go here as well */ > [snip] > > override_cred = prepare_creds(); > if (!override_cred) { > *ptr = NULL; > return true; > } > > override_cred->fsuid = override_cred->uid; > override_cred->fsgid = override_cred->gid; > [snip] > > If this is what you had in mind, I note it retains all the duplication > except in one func body which I'm confident does not buy anything, > provided the warning comment is added. > > At the same time the downside is that it uglifies error handling at the > callsite, so I would say a net loss. > > Alternatively, if you want to somehow keep tests aroung assignments the > code gets super hairy. > > But maybe you wanted something else? > > As I noted in another email this already got more discussion than it > warrants. > > Addition of the warning comment makes sense, but concerns after that > don't sound legitimate to me. > So I posted v3 with the comment, you are CC'ed. I'm not going to further argue about the patch. If you want to write your own variant that's fine with me, feel free to take my bench results and denote they come from a similar version. There is other stuff I want to post and unslowed access() helps making the case. The CONFIG_INIT_ON_ALLOC_DEFAULT_ON option is enabled on Debian, Ubuntu and Arch and as a side effect it zeroes bufs allocated every time a path lookup is performed. As these are 4096 bytes in size it is not pretty whatsoever and I'm confident not worth the hardening promised by mandatory zeroing. Got patches to add exemption support for caches to sort it out without disabling the opt.
On Wed, Jan 25, 2023 at 10:00 AM Mateusz Guzik <mjguzik@gmail.com> wrote: > On 1/24/23, Paul Moore <paul@paul-moore.com> wrote: > > Although I'm looking at this again and realized that only > > do_faccessat() calls access_override_creds(), so why not just fold the > > new access_need_override_creds() logic into access_override_creds()? > > Just have one function that takes the flag value, and returns an > > old_cred/NULL pointer (or pass old_cred to the function by reference > > and return an error code); that should still provide the performance > > win Mateusz is looking for while providing additional safety against > > out-of-sync changes. I would guess the code would be smaller too. > > It is unclear from the description if you are arguing for moving the new > func into access_override_creds almost as is just put prior to existing > code *or* mixing checks with assignments. "arguing" is a bit strong of a word for what I was thinking, it was more of just tossing out an idea to see if it has any merit with you, the VFS folks, and possibly Linus. > static bool *access_override_creds(struct cred **ptr) > [snip] > if (!uid_eq(cred->fsuid, cred->uid) || > !gid_eq(cred->fsgid, cred->gid)) > return false; > /* remaining checks go here as well */ > [snip] > > override_cred = prepare_creds(); > if (!override_cred) { > *ptr = NULL; > return true; > } > > override_cred->fsuid = override_cred->uid; > override_cred->fsgid = override_cred->gid; > [snip] > > If this is what you had in mind, I note it retains all the duplication > except in one func body which I'm confident does not buy anything, > provided the warning comment is added. > > At the same time the downside is that it uglifies error handling at the > callsite, so I would say a net loss. Yes, I was thinking of combining the two functions into one to better link the cred checks with the cred adjustments. > Addition of the warning comment makes sense, but concerns after that > don't sound legitimate to me. Well, as we talked about earlier, it's really up to the VFS folks to pick what they want, and they have been suspiciously quiet thus far. -- paul-moore.com
On Wed, Jan 25, 2023 at 11:17 AM Mateusz Guzik <mjguzik@gmail.com> wrote: > So I posted v3 with the comment, you are CC'ed. > > I'm not going to further argue about the patch. If you want to write > your own variant that's fine with me, feel free to take my bench results > and denote they come from a similar version. Once again, I see the back-and-forth as more of a discussion, not really an argument in the combative sense, but everyone reads their email differently I guess. The comment is an improvement, thanks for that, now it's just a matter of hearing from the VFS folks.
diff --git a/fs/open.c b/fs/open.c index 82c1a28b3308..3c068a38044c 100644 --- a/fs/open.c +++ b/fs/open.c @@ -367,7 +367,37 @@ COMPAT_SYSCALL_DEFINE6(fallocate, int, fd, int, mode, compat_arg_u64_dual(offset * access() needs to use the real uid/gid, not the effective uid/gid. * We do this by temporarily clearing all FS-related capabilities and * switching the fsuid/fsgid around to the real ones. + * + * Creating new credentials is expensive, so we try to skip doing it, + * which we can if the result would match what we already got. */ +static bool access_need_override_creds(int flags) +{ + const struct cred *cred; + + if (flags & AT_EACCESS) + return false; + + cred = current_cred(); + if (!uid_eq(cred->fsuid, cred->uid) || + !gid_eq(cred->fsgid, cred->gid)) + return true; + + if (!issecure(SECURE_NO_SETUID_FIXUP)) { + kuid_t root_uid = make_kuid(cred->user_ns, 0); + if (!uid_eq(cred->uid, root_uid)) { + if (!cap_isclear(cred->cap_effective)) + return true; + } else { + if (!cap_isidentical(cred->cap_effective, + cred->cap_permitted)) + return true; + } + } + + return false; +} + static const struct cred *access_override_creds(void) { const struct cred *old_cred; @@ -436,7 +466,7 @@ static long do_faccessat(int dfd, const char __user *filename, int mode, int fla if (flags & AT_EMPTY_PATH) lookup_flags |= LOOKUP_EMPTY; - if (!(flags & AT_EACCESS)) { + if (access_need_override_creds(flags)) { old_cred = access_override_creds(); if (!old_cred) return -ENOMEM;