[v2,2/2] vfs: avoid duplicating creds in faccessat if possible

Message ID 20230116212105.1840362-2-mjguzik@gmail.com
State New
Headers
Series [v2,1/2] capability: add cap_isidentical |

Commit Message

Mateusz Guzik Jan. 16, 2023, 9:21 p.m. UTC
  access(2) remains commonly used, for example on exec:
access("/etc/ld.so.preload", R_OK)

or when running gcc: strace -c gcc empty.c
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0        42        26 access

It falls down to do_faccessat without the AT_EACCESS flag, which in turn
results in allocation of new creds in order to modify fsuid/fsgid and
caps. This is a very expensive process single-threaded and most notably
multi-threaded, with numerous structures getting refed and unrefed on
imminent new cred destruction.

Turns out for typical consumers the resulting creds would be identical
and this can be checked upfront, avoiding the hard work.

An access benchmark plugged into will-it-scale running on Cascade Lake
shows:
test	proc	before	after
access1	1	1310582	2908735	 (+121%)  # distinct files
access1	24	4716491	63822173 (+1353%) # distinct files
access2	24	2378041	5370335	 (+125%)  # same file

The above benchmarks are not integrated into will-it-scale, but can be
found in a pull request:
https://github.com/antonblanchard/will-it-scale/pull/36/files

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>

v2:
- fix current->cred usage warn reported by the kernel test robot
Link: https://lore.kernel.org/all/202301150709.9EC6UKBT-lkp@intel.com/
---
 fs/open.c | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)
  

Comments

Paul Moore Jan. 20, 2023, 8:45 p.m. UTC | #1
On Mon, Jan 16, 2023 at 4:21 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> access(2) remains commonly used, for example on exec:
> access("/etc/ld.so.preload", R_OK)
>
> or when running gcc: strace -c gcc empty.c
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>   0.00    0.000000           0        42        26 access
>
> It falls down to do_faccessat without the AT_EACCESS flag, which in turn
> results in allocation of new creds in order to modify fsuid/fsgid and
> caps. This is a very expensive process single-threaded and most notably
> multi-threaded, with numerous structures getting refed and unrefed on
> imminent new cred destruction.
>
> Turns out for typical consumers the resulting creds would be identical
> and this can be checked upfront, avoiding the hard work.
>
> An access benchmark plugged into will-it-scale running on Cascade Lake
> shows:
> test    proc    before  after
> access1 1       1310582 2908735  (+121%)  # distinct files
> access1 24      4716491 63822173 (+1353%) # distinct files
> access2 24      2378041 5370335  (+125%)  # same file

Out of curiosity, do you have any measurements of the impact this
patch has on the AT_EACCESS case when the creds do need to be
modified?

> The above benchmarks are not integrated into will-it-scale, but can be
> found in a pull request:
> https://github.com/antonblanchard/will-it-scale/pull/36/files
>
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
>
> v2:
> - fix current->cred usage warn reported by the kernel test robot
> Link: https://lore.kernel.org/all/202301150709.9EC6UKBT-lkp@intel.com/
> ---
>  fs/open.c | 32 +++++++++++++++++++++++++++++++-
>  1 file changed, 31 insertions(+), 1 deletion(-)
>
> diff --git a/fs/open.c b/fs/open.c
> index 82c1a28b3308..3c068a38044c 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -367,7 +367,37 @@ COMPAT_SYSCALL_DEFINE6(fallocate, int, fd, int, mode, compat_arg_u64_dual(offset
>   * access() needs to use the real uid/gid, not the effective uid/gid.
>   * We do this by temporarily clearing all FS-related capabilities and
>   * switching the fsuid/fsgid around to the real ones.
> + *
> + * Creating new credentials is expensive, so we try to skip doing it,
> + * which we can if the result would match what we already got.
>   */
> +static bool access_need_override_creds(int flags)
> +{
> +       const struct cred *cred;
> +
> +       if (flags & AT_EACCESS)
> +               return false;
> +
> +       cred = current_cred();
> +       if (!uid_eq(cred->fsuid, cred->uid) ||
> +           !gid_eq(cred->fsgid, cred->gid))
> +               return true;
> +
> +       if (!issecure(SECURE_NO_SETUID_FIXUP)) {
> +               kuid_t root_uid = make_kuid(cred->user_ns, 0);
> +               if (!uid_eq(cred->uid, root_uid)) {
> +                       if (!cap_isclear(cred->cap_effective))
> +                               return true;
> +               } else {
> +                       if (!cap_isidentical(cred->cap_effective,
> +                           cred->cap_permitted))
> +                               return true;
> +               }
> +       }
> +
> +       return false;
> +}

I worry a little that with nothing connecting
access_need_override_creds() to access_override_creds() there is a bug
waiting to happen if/when only one of the functions is updated.

Given the limited credential changes in access_override_creds(), I
wonder if a better solution would be to see if we could create a
light(er)weight prepare_creds()/override_creds() that would avoid some
of the prepare_creds() hotspots (I'm assuming that is where most of
the time is being spent).  It's possible this could help improve the
performance of other, similar operations that need to modify task
creds for a brief, and synchronous, period of time.

Have you done any profiling inside of access_override_creds() to see
where the most time is spent?  Looking at the code I have some gut
feelings on the hotspots, but it would be good to see some proper data
before jumping to any conclusions.

>  static const struct cred *access_override_creds(void)
>  {
>         const struct cred *old_cred;
> @@ -436,7 +466,7 @@ static long do_faccessat(int dfd, const char __user *filename, int mode, int fla
>         if (flags & AT_EMPTY_PATH)
>                 lookup_flags |= LOOKUP_EMPTY;
>
> -       if (!(flags & AT_EACCESS)) {
> +       if (access_need_override_creds(flags)) {
>                 old_cred = access_override_creds();
>                 if (!old_cred)
>                         return -ENOMEM;
> --
> 2.34.1
  
Mateusz Guzik Jan. 21, 2023, 12:50 a.m. UTC | #2
On 1/20/23, Paul Moore <paul@paul-moore.com> wrote:
> On Mon, Jan 16, 2023 at 4:21 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>>
>> access(2) remains commonly used, for example on exec:
>> access("/etc/ld.so.preload", R_OK)
>>
>> or when running gcc: strace -c gcc empty.c
>> % time     seconds  usecs/call     calls    errors syscall
>> ------ ----------- ----------- --------- --------- ----------------
>>   0.00    0.000000           0        42        26 access
>>
>> It falls down to do_faccessat without the AT_EACCESS flag, which in turn
>> results in allocation of new creds in order to modify fsuid/fsgid and
>> caps. This is a very expensive process single-threaded and most notably
>> multi-threaded, with numerous structures getting refed and unrefed on
>> imminent new cred destruction.
>>
>> Turns out for typical consumers the resulting creds would be identical
>> and this can be checked upfront, avoiding the hard work.
>>
>> An access benchmark plugged into will-it-scale running on Cascade Lake
>> shows:
>> test    proc    before  after
>> access1 1       1310582 2908735  (+121%)  # distinct files
>> access1 24      4716491 63822173 (+1353%) # distinct files
>> access2 24      2378041 5370335  (+125%)  # same file
>
> Out of curiosity, do you have any measurements of the impact this
> patch has on the AT_EACCESS case when the creds do need to be
> modified?
>

I could not be arsed to bench that. I'm not saying there is literally 0
impact, but it should not be high and the massive win in the case I
patched imho justifies it.

Last week I got a private reply from Linus suggesting the new checks
only happen once at commit_cred() time, which would mean there would be
at most one extra branch for the case you are concerned with. However,
this quickly turn out to be rather hairy as there are games being
played for example in copy_creds() which assigns them *without* calling
commit_creds(). I was not comfortable pre-computing without sorting out
the mess first and he conceded the new branchfest is not necessarily a
big deal.

That said, if you want some performance recovered for this case, here
is an easy one:

static const struct cred *access_override_creds(void)
[..]
        old_cred = override_creds(override_cred);

        /* override_cred() gets its own ref */
        put_cred(override_cred);

As in the new creds get refed only to get unrefed immediately after.
Whacking the 2 atomics should make up for it no problem on x86-64.

Also see below.

>> The above benchmarks are not integrated into will-it-scale, but can be
>> found in a pull request:
>> https://github.com/antonblanchard/will-it-scale/pull/36/files
>>
>> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
>>
>> v2:
>> - fix current->cred usage warn reported by the kernel test robot
>> Link: https://lore.kernel.org/all/202301150709.9EC6UKBT-lkp@intel.com/
>> ---
>>  fs/open.c | 32 +++++++++++++++++++++++++++++++-
>>  1 file changed, 31 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/open.c b/fs/open.c
>> index 82c1a28b3308..3c068a38044c 100644
>> --- a/fs/open.c
>> +++ b/fs/open.c
>> @@ -367,7 +367,37 @@ COMPAT_SYSCALL_DEFINE6(fallocate, int, fd, int, mode,
>> compat_arg_u64_dual(offset
>>   * access() needs to use the real uid/gid, not the effective uid/gid.
>>   * We do this by temporarily clearing all FS-related capabilities and
>>   * switching the fsuid/fsgid around to the real ones.
>> + *
>> + * Creating new credentials is expensive, so we try to skip doing it,
>> + * which we can if the result would match what we already got.
>>   */
>> +static bool access_need_override_creds(int flags)
>> +{
>> +       const struct cred *cred;
>> +
>> +       if (flags & AT_EACCESS)
>> +               return false;
>> +
>> +       cred = current_cred();
>> +       if (!uid_eq(cred->fsuid, cred->uid) ||
>> +           !gid_eq(cred->fsgid, cred->gid))
>> +               return true;
>> +
>> +       if (!issecure(SECURE_NO_SETUID_FIXUP)) {
>> +               kuid_t root_uid = make_kuid(cred->user_ns, 0);
>> +               if (!uid_eq(cred->uid, root_uid)) {
>> +                       if (!cap_isclear(cred->cap_effective))
>> +                               return true;
>> +               } else {
>> +                       if (!cap_isidentical(cred->cap_effective,
>> +                           cred->cap_permitted))
>> +                               return true;
>> +               }
>> +       }
>> +
>> +       return false;
>> +}
>
> I worry a little that with nothing connecting
> access_need_override_creds() to access_override_creds() there is a bug
> waiting to happen if/when only one of the functions is updated.
>

These funcs are literally next to each other, I don't think that is easy
to miss. I concede a comment in access_override_creds to take a look at
access_need_override_creds would not hurt, but I don't know if a resend
to add it is justified.

> Given the limited credential changes in access_override_creds(), I
> wonder if a better solution would be to see if we could create a
> light(er)weight prepare_creds()/override_creds() that would avoid some
> of the prepare_creds() hotspots (I'm assuming that is where most of
> the time is being spent).  It's possible this could help improve the
> performance of other, similar operations that need to modify task
> creds for a brief, and synchronous, period of time.
>

So the fundamental problem here is that several refs need to be grabbed
to make fully-fledged credentials. Then, as you are done, you have to
undo them. This clearly sucks single-threaded. As other threads copying
the same creds do the same atomics on the same vars, this sucks even
more multithreaded.

As a cop out one may notice several of these are probably always the
same for creds derived from given base, so perhaps there can be an obj
which wraps them and then you only have to ref *that* obj to implicitly
hold on to the entire thing.

As in this (and more):
        get_uid(new->user);
        get_user_ns(new->user_ns);
        get_group_info(new->group_info);

would get replaced with:
        new->fancy_obj = getref_fancy_obj(new->fancy_obj);
        /* populate pointers here */

... conceptually at least.

But even then, while less shafted both multi and single-threaded, there
is still a bottleneck.

For a Real Solution(tm) for a general case I think has to start with an
observartion creds either persist for a long time *OR* keep getting
recreated. This would suggest holding on to them and looking them up
instead just allocating, but all this opens another can of worms and
I don't believe is worth the effort at this stage. But maybe someone
has a better idea.

That said, for the case of access(), I had the following in mind but
once more considered it not justified at this stage.

pseudocode-wise:
struct cred *prepare_shallow_creds(void)
        new = kmem_cache_alloc(cred_jar, GFP_KERNEL);
        old = task->cred;
        memcpy(new, old, sizeof(struct cred));

here new creds have all the same pointers as old, but the target objs
are only kept alive by the old creds still refing them. So by API
contract you are required to keep them around.

after you temporarily assign them you call revert_shallow_creds():
        if (tempcred->usage == 1)
                /* nobody refed them, do the non_rcu check */
                ...
        else
                /* somebody grabbed them, legitimize creds by
                 * grabbing the missing refs
                 */
                 get_uid(new->user);
                 get_user_ns(new->user_ns);
                 get_group_info(new->group_info);
                 /* and so on */

So this would shave work from the case you are concerned with probably
for all calls.

I do think this is an ok idea overall, but I felt like overengineering for the
problem at hand *at the time*.

For some context, I'm looking at performance of certain VFS stuff and
there is some serious fish to fry in the fstat department. The patch I
posted is definitely worthwhile perf-wise and easy enough to reason
about that I did no expect much opposition to it. If anything I expected
opposition to the idea outlined above.

> Have you done any profiling inside of access_override_creds() to see
> where the most time is spent?  Looking at the code I have some gut
> feelings on the hotspots, but it would be good to see some proper data
> before jumping to any conclusions.
>

See above. It's the atomics all the way down.
  
Mateusz Guzik Jan. 21, 2023, 1:18 a.m. UTC | #3
On 1/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> On 1/20/23, Paul Moore <paul@paul-moore.com> wrote:
>> On Mon, Jan 16, 2023 at 4:21 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>>>
>>> access(2) remains commonly used, for example on exec:
>>> access("/etc/ld.so.preload", R_OK)
>>>
>>> or when running gcc: strace -c gcc empty.c
>>> % time     seconds  usecs/call     calls    errors syscall
>>> ------ ----------- ----------- --------- --------- ----------------
>>>   0.00    0.000000           0        42        26 access
>>>
>>> It falls down to do_faccessat without the AT_EACCESS flag, which in turn
>>> results in allocation of new creds in order to modify fsuid/fsgid and
>>> caps. This is a very expensive process single-threaded and most notably
>>> multi-threaded, with numerous structures getting refed and unrefed on
>>> imminent new cred destruction.
>>>
>>> Turns out for typical consumers the resulting creds would be identical
>>> and this can be checked upfront, avoiding the hard work.
>>>
>>> An access benchmark plugged into will-it-scale running on Cascade Lake
>>> shows:
>>> test    proc    before  after
>>> access1 1       1310582 2908735  (+121%)  # distinct files
>>> access1 24      4716491 63822173 (+1353%) # distinct files
>>> access2 24      2378041 5370335  (+125%)  # same file
>>
>> Out of curiosity, do you have any measurements of the impact this
>> patch has on the AT_EACCESS case when the creds do need to be
>> modified?
>>
>
> I could not be arsed to bench that. I'm not saying there is literally 0
> impact, but it should not be high and the massive win in the case I
> patched imho justifies it.
>
> Last week I got a private reply from Linus suggesting the new checks
> only happen once at commit_cred() time, which would mean there would be
> at most one extra branch for the case you are concerned with. However,
> this quickly turn out to be rather hairy as there are games being
> played for example in copy_creds() which assigns them *without* calling
> commit_creds(). I was not comfortable pre-computing without sorting out
> the mess first and he conceded the new branchfest is not necessarily a
> big deal.
>
> That said, if you want some performance recovered for this case, here
> is an easy one:
>
> static const struct cred *access_override_creds(void)
> [..]
>         old_cred = override_creds(override_cred);
>
>         /* override_cred() gets its own ref */
>         put_cred(override_cred);
>
> As in the new creds get refed only to get unrefed immediately after.
> Whacking the 2 atomics should make up for it no problem on x86-64.
>
> Also see below.
>
>>> The above benchmarks are not integrated into will-it-scale, but can be
>>> found in a pull request:
>>> https://github.com/antonblanchard/will-it-scale/pull/36/files
>>>
>>> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
>>>
>>> v2:
>>> - fix current->cred usage warn reported by the kernel test robot
>>> Link: https://lore.kernel.org/all/202301150709.9EC6UKBT-lkp@intel.com/
>>> ---
>>>  fs/open.c | 32 +++++++++++++++++++++++++++++++-
>>>  1 file changed, 31 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/open.c b/fs/open.c
>>> index 82c1a28b3308..3c068a38044c 100644
>>> --- a/fs/open.c
>>> +++ b/fs/open.c
>>> @@ -367,7 +367,37 @@ COMPAT_SYSCALL_DEFINE6(fallocate, int, fd, int,
>>> mode,
>>> compat_arg_u64_dual(offset
>>>   * access() needs to use the real uid/gid, not the effective uid/gid.
>>>   * We do this by temporarily clearing all FS-related capabilities and
>>>   * switching the fsuid/fsgid around to the real ones.
>>> + *
>>> + * Creating new credentials is expensive, so we try to skip doing it,
>>> + * which we can if the result would match what we already got.
>>>   */
>>> +static bool access_need_override_creds(int flags)
>>> +{
>>> +       const struct cred *cred;
>>> +
>>> +       if (flags & AT_EACCESS)
>>> +               return false;
>>> +
>>> +       cred = current_cred();
>>> +       if (!uid_eq(cred->fsuid, cred->uid) ||
>>> +           !gid_eq(cred->fsgid, cred->gid))
>>> +               return true;
>>> +
>>> +       if (!issecure(SECURE_NO_SETUID_FIXUP)) {
>>> +               kuid_t root_uid = make_kuid(cred->user_ns, 0);
>>> +               if (!uid_eq(cred->uid, root_uid)) {
>>> +                       if (!cap_isclear(cred->cap_effective))
>>> +                               return true;
>>> +               } else {
>>> +                       if (!cap_isidentical(cred->cap_effective,
>>> +                           cred->cap_permitted))
>>> +                               return true;
>>> +               }
>>> +       }
>>> +
>>> +       return false;
>>> +}
>>
>> I worry a little that with nothing connecting
>> access_need_override_creds() to access_override_creds() there is a bug
>> waiting to happen if/when only one of the functions is updated.
>>
>
> These funcs are literally next to each other, I don't think that is easy
> to miss. I concede a comment in access_override_creds to take a look at
> access_need_override_creds would not hurt, but I don't know if a resend
> to add it is justified.
>
>> Given the limited credential changes in access_override_creds(), I
>> wonder if a better solution would be to see if we could create a
>> light(er)weight prepare_creds()/override_creds() that would avoid some
>> of the prepare_creds() hotspots (I'm assuming that is where most of
>> the time is being spent).  It's possible this could help improve the
>> performance of other, similar operations that need to modify task
>> creds for a brief, and synchronous, period of time.
>>
>
> So the fundamental problem here is that several refs need to be grabbed
> to make fully-fledged credentials. Then, as you are done, you have to
> undo them. This clearly sucks single-threaded. As other threads copying
> the same creds do the same atomics on the same vars, this sucks even
> more multithreaded.
>
> As a cop out one may notice several of these are probably always the
> same for creds derived from given base, so perhaps there can be an obj
> which wraps them and then you only have to ref *that* obj to implicitly
> hold on to the entire thing.
>
> As in this (and more):
>         get_uid(new->user);
>         get_user_ns(new->user_ns);
>         get_group_info(new->group_info);
>
> would get replaced with:
>         new->fancy_obj = getref_fancy_obj(new->fancy_obj);
>         /* populate pointers here */
>
> ... conceptually at least.
>
> But even then, while less shafted both multi and single-threaded, there
> is still a bottleneck.
>
> For a Real Solution(tm) for a general case I think has to start with an
> observartion creds either persist for a long time *OR* keep getting
> recreated. This would suggest holding on to them and looking them up
> instead just allocating, but all this opens another can of worms and
> I don't believe is worth the effort at this stage. But maybe someone
> has a better idea.
>
> That said, for the case of access(), I had the following in mind but
> once more considered it not justified at this stage.
>
> pseudocode-wise:
> struct cred *prepare_shallow_creds(void)
>         new = kmem_cache_alloc(cred_jar, GFP_KERNEL);
>         old = task->cred;
>         memcpy(new, old, sizeof(struct cred));
>
> here new creds have all the same pointers as old, but the target objs
> are only kept alive by the old creds still refing them. So by API
> contract you are required to keep them around.
>
> after you temporarily assign them you call revert_shallow_creds():
>         if (tempcred->usage == 1)
>                 /* nobody refed them, do the non_rcu check */
>                 ...
>         else
>                 /* somebody grabbed them, legitimize creds by
>                  * grabbing the missing refs
>                  */
>                  get_uid(new->user);
>                  get_user_ns(new->user_ns);
>                  get_group_info(new->group_info);
>                  /* and so on */
>
> So this would shave work from the case you are concerned with probably
> for all calls.
>
> I do think this is an ok idea overall, but I felt like overengineering for
> the
> problem at hand *at the time*.
>
> For some context, I'm looking at performance of certain VFS stuff and
> there is some serious fish to fry in the fstat department. The patch I
> posted is definitely worthwhile perf-wise and easy enough to reason
> about that I did no expect much opposition to it. If anything I expected
> opposition to the idea outlined above.
>

So just in case, if ultimately the consensus is that shallow copy or similar
is the way to go, I can take a stab some time next week.

>> Have you done any profiling inside of access_override_creds() to see
>> where the most time is spent?  Looking at the code I have some gut
>> feelings on the hotspots, but it would be good to see some proper data
>> before jumping to any conclusions.
>>
>
> See above. It's the atomics all the way down.
>
> --
> Mateusz Guzik <mjguzik gmail.com>
>
  
Paul Moore Jan. 23, 2023, 9:29 p.m. UTC | #4
On Fri, Jan 20, 2023 at 7:50 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> On 1/20/23, Paul Moore <paul@paul-moore.com> wrote:
> > On Mon, Jan 16, 2023 at 4:21 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >>
> >> access(2) remains commonly used, for example on exec:
> >> access("/etc/ld.so.preload", R_OK)
> >>
> >> or when running gcc: strace -c gcc empty.c
> >> % time     seconds  usecs/call     calls    errors syscall
> >> ------ ----------- ----------- --------- --------- ----------------
> >>   0.00    0.000000           0        42        26 access
> >>
> >> It falls down to do_faccessat without the AT_EACCESS flag, which in turn
> >> results in allocation of new creds in order to modify fsuid/fsgid and
> >> caps. This is a very expensive process single-threaded and most notably
> >> multi-threaded, with numerous structures getting refed and unrefed on
> >> imminent new cred destruction.
> >>
> >> Turns out for typical consumers the resulting creds would be identical
> >> and this can be checked upfront, avoiding the hard work.
> >>
> >> An access benchmark plugged into will-it-scale running on Cascade Lake
> >> shows:
> >> test    proc    before  after
> >> access1 1       1310582 2908735  (+121%)  # distinct files
> >> access1 24      4716491 63822173 (+1353%) # distinct files
> >> access2 24      2378041 5370335  (+125%)  # same file
> >
> > Out of curiosity, do you have any measurements of the impact this
> > patch has on the AT_EACCESS case when the creds do need to be
> > modified?
>
> I could not be arsed to bench that. I'm not saying there is literally 0
> impact, but it should not be high and the massive win in the case I
> patched imho justifies it.

That's one way to respond to an honest question asking if you've done
any tests on the other side of the change.  I agree the impact should
be less than the advantage you've shown, but sometimes it's nice to
see these things.

> Last week I got a private reply from Linus suggesting the new checks
> only happen once at commit_cred() time, which would mean there would be
> at most one extra branch for the case you are concerned with. However,
> this quickly turn out to be rather hairy as there are games being
> played for example in copy_creds() which assigns them *without* calling
> commit_creds(). I was not comfortable pre-computing without sorting out
> the mess first and he conceded the new branchfest is not necessarily a
> big deal.
>
> That said, if you want some performance recovered for this case, here
> is an easy one:
>
> static const struct cred *access_override_creds(void)
> [..]
>         old_cred = override_creds(override_cred);
>
>         /* override_cred() gets its own ref */
>         put_cred(override_cred);
>
> As in the new creds get refed only to get unrefed immediately after.
> Whacking the 2 atomics should make up for it no problem on x86-64.
>
> Also see below.
>
> >> The above benchmarks are not integrated into will-it-scale, but can be
> >> found in a pull request:
> >> https://github.com/antonblanchard/will-it-scale/pull/36/files
> >>
> >> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> >>
> >> v2:
> >> - fix current->cred usage warn reported by the kernel test robot
> >> Link: https://lore.kernel.org/all/202301150709.9EC6UKBT-lkp@intel.com/
> >> ---
> >>  fs/open.c | 32 +++++++++++++++++++++++++++++++-
> >>  1 file changed, 31 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/fs/open.c b/fs/open.c
> >> index 82c1a28b3308..3c068a38044c 100644
> >> --- a/fs/open.c
> >> +++ b/fs/open.c
> >> @@ -367,7 +367,37 @@ COMPAT_SYSCALL_DEFINE6(fallocate, int, fd, int, mode,
> >> compat_arg_u64_dual(offset
> >>   * access() needs to use the real uid/gid, not the effective uid/gid.
> >>   * We do this by temporarily clearing all FS-related capabilities and
> >>   * switching the fsuid/fsgid around to the real ones.
> >> + *
> >> + * Creating new credentials is expensive, so we try to skip doing it,
> >> + * which we can if the result would match what we already got.
> >>   */
> >> +static bool access_need_override_creds(int flags)
> >> +{
> >> +       const struct cred *cred;
> >> +
> >> +       if (flags & AT_EACCESS)
> >> +               return false;
> >> +
> >> +       cred = current_cred();
> >> +       if (!uid_eq(cred->fsuid, cred->uid) ||
> >> +           !gid_eq(cred->fsgid, cred->gid))
> >> +               return true;
> >> +
> >> +       if (!issecure(SECURE_NO_SETUID_FIXUP)) {
> >> +               kuid_t root_uid = make_kuid(cred->user_ns, 0);
> >> +               if (!uid_eq(cred->uid, root_uid)) {
> >> +                       if (!cap_isclear(cred->cap_effective))
> >> +                               return true;
> >> +               } else {
> >> +                       if (!cap_isidentical(cred->cap_effective,
> >> +                           cred->cap_permitted))
> >> +                               return true;
> >> +               }
> >> +       }
> >> +
> >> +       return false;
> >> +}
> >
> > I worry a little that with nothing connecting
> > access_need_override_creds() to access_override_creds() there is a bug
> > waiting to happen if/when only one of the functions is updated.
>
> These funcs are literally next to each other, I don't think that is easy
> to miss. I concede a comment in access_override_creds to take a look at
> access_need_override_creds would not hurt, but I don't know if a resend
> to add it is justified.

Perhaps it's because I have to deal with a fair amount of code getting
changed in one place and not another, but I would think that a comment
would be the least one could do here and would justify a respin.

> > Given the limited credential changes in access_override_creds(), I
> > wonder if a better solution would be to see if we could create a
> > light(er)weight prepare_creds()/override_creds() that would avoid some
> > of the prepare_creds() hotspots (I'm assuming that is where most of
> > the time is being spent).  It's possible this could help improve the
> > performance of other, similar operations that need to modify task
> > creds for a brief, and synchronous, period of time.

...

> For a Real Solution(tm) for a general case I think has to start with an
> observartion creds either persist for a long time *OR* keep getting
> recreated. This would suggest holding on to them and looking them up
> instead just allocating, but all this opens another can of worms and
> I don't believe is worth the effort at this stage. But maybe someone
> has a better idea.
>
> That said, for the case of access(), I had the following in mind but
> once more considered it not justified at this stage.
>
> pseudocode-wise:
> struct cred *prepare_shallow_creds(void)
>         new = kmem_cache_alloc(cred_jar, GFP_KERNEL);
>         old = task->cred;
>         memcpy(new, old, sizeof(struct cred));
>
> here new creds have all the same pointers as old, but the target objs
> are only kept alive by the old creds still refing them. So by API
> contract you are required to keep them around.
>
> after you temporarily assign them you call revert_shallow_creds():
>         if (tempcred->usage == 1)
>                 /* nobody refed them, do the non_rcu check */
>                 ...
>         else
>                 /* somebody grabbed them, legitimize creds by
>                  * grabbing the missing refs
>                  */
>                  get_uid(new->user);
>                  get_user_ns(new->user_ns);
>                  get_group_info(new->group_info);
>                  /* and so on */
>
> So this would shave work from the case you are concerned with probably
> for all calls.
>
> I do think this is an ok idea overall, but I felt like overengineering for the
> problem at hand *at the time*.

In my opinion a generalized shallow copy approach has more value than
a one-off solution that has the potential to fall out of sync and
cause a problem in the future (I recognize that you disagree on the
likelihood of this happening).

> For some context, I'm looking at performance of certain VFS stuff and
> there is some serious fish to fry in the fstat department.

I assumed it was part of some larger perf work, but I'm not sure the
context is that important for this particular discussion.

> The patch I
> posted is definitely worthwhile perf-wise and easy enough to reason
> about that I did no expect much opposition to it. If anything I expected
> opposition to the idea outlined above.

Ultimately it's a call for the VFS folks as they are responsible for
the access() code.

--
paul-moore.com
  
Mateusz Guzik Jan. 24, 2023, 10:16 a.m. UTC | #5
On 1/23/23, Paul Moore <paul@paul-moore.com> wrote:
> On Fri, Jan 20, 2023 at 7:50 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>> On 1/20/23, Paul Moore <paul@paul-moore.com> wrote:
>> > On Mon, Jan 16, 2023 at 4:21 PM Mateusz Guzik <mjguzik@gmail.com>
>> > wrote:
>> >>
>> >> access(2) remains commonly used, for example on exec:
>> >> access("/etc/ld.so.preload", R_OK)
>> >>
>> >> or when running gcc: strace -c gcc empty.c
>> >> % time     seconds  usecs/call     calls    errors syscall
>> >> ------ ----------- ----------- --------- --------- ----------------
>> >>   0.00    0.000000           0        42        26 access
>> >>
>> >> It falls down to do_faccessat without the AT_EACCESS flag, which in
>> >> turn
>> >> results in allocation of new creds in order to modify fsuid/fsgid and
>> >> caps. This is a very expensive process single-threaded and most
>> >> notably
>> >> multi-threaded, with numerous structures getting refed and unrefed on
>> >> imminent new cred destruction.
>> >>
>> >> Turns out for typical consumers the resulting creds would be identical
>> >> and this can be checked upfront, avoiding the hard work.
>> >>
>> >> An access benchmark plugged into will-it-scale running on Cascade Lake
>> >> shows:
>> >> test    proc    before  after
>> >> access1 1       1310582 2908735  (+121%)  # distinct files
>> >> access1 24      4716491 63822173 (+1353%) # distinct files
>> >> access2 24      2378041 5370335  (+125%)  # same file
>> >
>> > Out of curiosity, do you have any measurements of the impact this
>> > patch has on the AT_EACCESS case when the creds do need to be
>> > modified?
>>
>> I could not be arsed to bench that. I'm not saying there is literally 0
>> impact, but it should not be high and the massive win in the case I
>> patched imho justifies it.
>
> That's one way to respond to an honest question asking if you've done
> any tests on the other side of the change.  I agree the impact should
> be less than the advantage you've shown, but sometimes it's nice to
> see these things.
>

So reading this now I do think it was worded in quite a poor manner, so
apologies for that.

Wording aside, I don't know whether this is just a passing remark or
are you genuinely concerned about the other case.

If you are, I noted there is an immediately achievable speed up by
eliminating the get/put ref cycle on creds coming from override_creds +
put_cred to backpedal from it. This should be enough to cover it, but
there are cosmetic problems around it I don't want to flame over.

Say override_creds_noref gets added doing the usual work, except for
get_new_cred.

Then override_creds would be:
        validate_creds(new);
        get_new_cred((struct cred *)new);
        override_creds_noref(new);

But override_creds_noref would retain validate_creds new/old and the
above would repeat it which would preferably be avoided. Not a problem
if it is deemed ok to get_new_cred without validate_creds.

>> These funcs are literally next to each other, I don't think that is easy
>> to miss. I concede a comment in access_override_creds to take a look at
>> access_need_override_creds would not hurt, but I don't know if a resend
>> to add it is justified.
>
> Perhaps it's because I have to deal with a fair amount of code getting
> changed in one place and not another, but I would think that a comment
> would be the least one could do here and would justify a respin.
>

I'm not going to *insist* on not adding that comment.

Would this work for you?

diff --git a/fs/open.c b/fs/open.c
index 3c068a38044c..756177b94b04 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -407,6 +407,11 @@ static const struct cred *access_override_creds(void)
        if (!override_cred)
                return NULL;

+       /*
+        * XXX access_need_override_creds performs checks in hopes of
+        * skipping this work. Make sure it stays in sync if making any
+        * changes here.
+        */
        override_cred->fsuid = override_cred->uid;
        override_cred->fsgid = override_cred->gid;

if not, can you phrase it however you see fit for me to copy-paste?

> In my opinion a generalized shallow copy approach has more value than
> a one-off solution that has the potential to fall out of sync and
> cause a problem in the future (I recognize that you disagree on the
> likelihood of this happening).
>

To reiterate my stance, the posted patch is trivial to reason about
and it provides a marked improvement for the most commonly seen case.
It comes with some extra branches for the less common case, which I
don't consider to be a big deal.

From the quick toor I took around kernel/cred.c I think the cred code
is messy and it would be best to sort it out before doing anything
fancy. I have no interest in doing the clean up.

The shallow copy idea I outlined above looks very simple, but I can't
help the feeling there are surprises there, so I'm reluctant to roll
with it as is.

More importantly I can't show any workload which runs into the other
case, thus if someone asks me to justify the complexity I will have
nothing, which is mostly why I did not go for it.

That said, if powers to be declare this is the way forward, I can
spend some time getting it done.

> Ultimately it's a call for the VFS folks as they are responsible for
> the access() code.
>

Well let's wait and see. :)
  
Paul Moore Jan. 24, 2023, 5 p.m. UTC | #6
On Tue, Jan 24, 2023 at 5:16 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> On 1/23/23, Paul Moore <paul@paul-moore.com> wrote:
> > On Fri, Jan 20, 2023 at 7:50 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >> On 1/20/23, Paul Moore <paul@paul-moore.com> wrote:
> >> > On Mon, Jan 16, 2023 at 4:21 PM Mateusz Guzik <mjguzik@gmail.com>
> >> > wrote:
> >> >>
> >> >> access(2) remains commonly used, for example on exec:
> >> >> access("/etc/ld.so.preload", R_OK)
> >> >>
> >> >> or when running gcc: strace -c gcc empty.c
> >> >> % time     seconds  usecs/call     calls    errors syscall
> >> >> ------ ----------- ----------- --------- --------- ----------------
> >> >>   0.00    0.000000           0        42        26 access
> >> >>
> >> >> It falls down to do_faccessat without the AT_EACCESS flag, which in
> >> >> turn
> >> >> results in allocation of new creds in order to modify fsuid/fsgid and
> >> >> caps. This is a very expensive process single-threaded and most
> >> >> notably
> >> >> multi-threaded, with numerous structures getting refed and unrefed on
> >> >> imminent new cred destruction.
> >> >>
> >> >> Turns out for typical consumers the resulting creds would be identical
> >> >> and this can be checked upfront, avoiding the hard work.
> >> >>
> >> >> An access benchmark plugged into will-it-scale running on Cascade Lake
> >> >> shows:
> >> >> test    proc    before  after
> >> >> access1 1       1310582 2908735  (+121%)  # distinct files
> >> >> access1 24      4716491 63822173 (+1353%) # distinct files
> >> >> access2 24      2378041 5370335  (+125%)  # same file
> >> >
> >> > Out of curiosity, do you have any measurements of the impact this
> >> > patch has on the AT_EACCESS case when the creds do need to be
> >> > modified?
> >>
> >> I could not be arsed to bench that. I'm not saying there is literally 0
> >> impact, but it should not be high and the massive win in the case I
> >> patched imho justifies it.
> >
> > That's one way to respond to an honest question asking if you've done
> > any tests on the other side of the change.  I agree the impact should
> > be less than the advantage you've shown, but sometimes it's nice to
> > see these things.
>
> So reading this now I do think it was worded in quite a poor manner, so
> apologies for that.

Thanks, but no worries.  Work in this space long enough and everyone
eventually ends up sending a mail or two that could have been worded
better, myself included.

> Wording aside, I don't know whether this is just a passing remark or
> are you genuinely concerned about the other case.

My main concern is the duplication between the cred check and the cred
override functions leading to a bug at some unknown point in the
future.  Changes to credential checking, and access control in
general, always gets my attention and due to past bruises I'm very
sensitive to out-of-sync issues due to code duplication; so your patch
was a bit of a "perfect storm" of concern for me :)

The profiling questions were mainly there as a curiosity since it
looked like this was part of a larger performance oriented effort and
I thought you might have more data that didn't make it into the commit
description.

> >> These funcs are literally next to each other, I don't think that is easy
> >> to miss. I concede a comment in access_override_creds to take a look at
> >> access_need_override_creds would not hurt, but I don't know if a resend
> >> to add it is justified.
> >
> > Perhaps it's because I have to deal with a fair amount of code getting
> > changed in one place and not another, but I would think that a comment
> > would be the least one could do here and would justify a respin.
>
> I'm not going to *insist* on not adding that comment.
>
> Would this work for you?
>
> diff --git a/fs/open.c b/fs/open.c
> index 3c068a38044c..756177b94b04 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -407,6 +407,11 @@ static const struct cred *access_override_creds(void)
>         if (!override_cred)
>                 return NULL;
>
> +       /*
> +        * XXX access_need_override_creds performs checks in hopes of
> +        * skipping this work. Make sure it stays in sync if making any
> +        * changes here.
> +        */
>         override_cred->fsuid = override_cred->uid;
>         override_cred->fsgid = override_cred->gid;
>
> if not, can you phrase it however you see fit for me to copy-paste?

That wording looks good to me and would help me feel a bit better
about this change, thank you.
  
Linus Torvalds Jan. 24, 2023, 5:14 p.m. UTC | #7
On Tue, Jan 24, 2023 at 9:00 AM Paul Moore <paul@paul-moore.com> wrote:
>
> My main concern is the duplication between the cred check and the cred
> override functions leading to a bug at some unknown point in the
> future.

Yeah, it might be good to try to have some common logic for this,
although it's kind of messy.

The access_override_creds() logic is fairly different from the "do I
need to create new creds" decision, since instead of *testing* whether
the fs[ug]id and [ug]id matches, it just sets the fs[ug]id to the
expected values.

So that part of the test doesn't really exist.

And the same is true of the !SECURE_NO_SETUID_FIXUP logic case - the
current access() override doesn't _test_ those variables for equality,
it just sets them.

So Mateusz' patch doesn't really duplicate any actual logic, it just
has similarities in that it checks "would that new cred that
access_override_creds() would create be the same as the old one".

So sharing code is hard, because the code is fundamentally not the same.

The new access_need_override_creds() function is right next to the
pre-existing access_override_creds() one, so at least they are close
to each other. That may be the best that can be done.

Maybe some of the "is it the root uid" logic could be shared, though.
Both cases do have this part in common:

        if (!issecure(SECURE_NO_SETUID_FIXUP)) {
                /* Clear the capabilities if we switch to a non-root user */
                kuid_t root_uid = make_kuid(override_cred->user_ns, 0);
                if (!uid_eq(override_cred->uid, root_uid))

and that is arguably the nastiest part of it all.

I don't think it's all that likely to change in the future, though
(except for possible changes due to user_ns re-orgs, but then changing
both would be very natural).

               Linus
  
Mateusz Guzik Jan. 24, 2023, 6:42 p.m. UTC | #8
On 1/24/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Tue, Jan 24, 2023 at 9:00 AM Paul Moore <paul@paul-moore.com> wrote:
>>
>> My main concern is the duplication between the cred check and the cred
>> override functions leading to a bug at some unknown point in the
>> future.
>
> Yeah, it might be good to try to have some common logic for this,
> although it's kind of messy.
>
> The access_override_creds() logic is fairly different from the "do I
> need to create new creds" decision, since instead of *testing* whether
> the fs[ug]id and [ug]id matches, it just sets the fs[ug]id to the
> expected values.
>
> So that part of the test doesn't really exist.
>
> And the same is true of the !SECURE_NO_SETUID_FIXUP logic case - the
> current access() override doesn't _test_ those variables for equality,
> it just sets them.
>
> So Mateusz' patch doesn't really duplicate any actual logic, it just
> has similarities in that it checks "would that new cred that
> access_override_creds() would create be the same as the old one".
>
> So sharing code is hard, because the code is fundamentally not the same.
>
> The new access_need_override_creds() function is right next to the
> pre-existing access_override_creds() one, so at least they are close
> to each other. That may be the best that can be done.
>
> Maybe some of the "is it the root uid" logic could be shared, though.
> Both cases do have this part in common:
>
>         if (!issecure(SECURE_NO_SETUID_FIXUP)) {
>                 /* Clear the capabilities if we switch to a non-root user
> */
>                 kuid_t root_uid = make_kuid(override_cred->user_ns, 0);
>                 if (!uid_eq(override_cred->uid, root_uid))
>
> and that is arguably the nastiest part of it all.
>
> I don't think it's all that likely to change in the future, though
> (except for possible changes due to user_ns re-orgs, but then changing
> both would be very natural).
>

You could dedup make_kuid + uid_eq check, but does it really buy
anything?

ns changes which break compilation will find both spots. Similarly
any grep used to find one should also automagically find the other
one.

I think this patch generated way more discussion than it warrants,
especially since I deliberately went for the trivial approach in
hopes of avoiding this kind of stuff.

So how about I simply respin with the comment I mailed earlier,
repasted here for reference (with a slight tweak):

diff --git a/fs/open.c b/fs/open.c
index 3c068a38044c..756177b94b04 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -407,6 +407,11 @@ static const struct cred *access_override_creds(void)
        if (!override_cred)
                return NULL;

+       /*
+        * XXX access_need_override_creds performs checks in hopes of
+        * skipping this work. Make sure it stays in sync if making any
+        * changes in this routine.
+        */
        override_cred->fsuid = override_cred->uid;
        override_cred->fsgid = override_cred->gid;

sounds like a plan?
  
Linus Torvalds Jan. 24, 2023, 8:37 p.m. UTC | #9
On Tue, Jan 24, 2023 at 10:42 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> So how about I simply respin with the comment I mailed earlier,
> repasted here for reference (with a slight tweak):

Ack, that's probably the way to go,

         Linus
  
Paul Moore Jan. 24, 2023, 9:33 p.m. UTC | #10
On Tue, Jan 24, 2023 at 12:14 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Jan 24, 2023 at 9:00 AM Paul Moore <paul@paul-moore.com> wrote:
> >
> > My main concern is the duplication between the cred check and the cred
> > override functions leading to a bug at some unknown point in the
> > future.
>
> Yeah, it might be good to try to have some common logic for this,
> although it's kind of messy.
>
> The access_override_creds() logic is fairly different from the "do I
> need to create new creds" decision, since instead of *testing* whether
> the fs[ug]id and [ug]id matches, it just sets the fs[ug]id to the
> expected values.
>
> So that part of the test doesn't really exist.
>
> And the same is true of the !SECURE_NO_SETUID_FIXUP logic case - the
> current access() override doesn't _test_ those variables for equality,
> it just sets them.
>
> So Mateusz' patch doesn't really duplicate any actual logic, it just
> has similarities in that it checks "would that new cred that
> access_override_creds() would create be the same as the old one".

Perhaps I didn't do a very good job explaining my concern, or it got a
little twisted as the thread went on (likely due to my use of
"duplication"), but my concern wasn't so much that
access_override_creds() or the proposed access_need_override_creds()
duplicates code elsewhere, it was that the proposed
access_need_override_creds() function is a separate check from the
code which is actually responsible for doing the credential fixup for
AT_EACCESS.  I'm worried about a subtle change in one function not
being reflected in the other and causing an access control bug.

> The new access_need_override_creds() function is right next to the
> pre-existing access_override_creds() one, so at least they are close
> to each other. That may be the best that can be done.

Possibly, and the comment should help.

Although I'm looking at this again and realized that only
do_faccessat() calls access_override_creds(), so why not just fold the
new access_need_override_creds() logic into access_override_creds()?
Just have one function that takes the flag value, and returns an
old_cred/NULL pointer (or pass old_cred to the function by reference
and return an error code); that should still provide the performance
win Mateusz is looking for while providing additional safety against
out-of-sync changes.  I would guess the code would be smaller too.

> Maybe some of the "is it the root uid" logic could be shared, though.
> Both cases do have this part in common:
>
>         if (!issecure(SECURE_NO_SETUID_FIXUP)) {
>                 /* Clear the capabilities if we switch to a non-root user */
>                 kuid_t root_uid = make_kuid(override_cred->user_ns, 0);
>                 if (!uid_eq(override_cred->uid, root_uid))
>
> and that is arguably the nastiest part of it all.
>
> I don't think it's all that likely to change in the future, though
> (except for possible changes due to user_ns re-orgs, but then changing
> both would be very natural).

You're probably right.  As I said earlier, I'm just really sensitive
to code that is vulnerable to going out of sync like this and I try to
avoid it whenever possible.
  
Mateusz Guzik Jan. 25, 2023, 3 p.m. UTC | #11
On 1/24/23, Paul Moore <paul@paul-moore.com> wrote:
> Although I'm looking at this again and realized that only
> do_faccessat() calls access_override_creds(), so why not just fold the
> new access_need_override_creds() logic into access_override_creds()?
> Just have one function that takes the flag value, and returns an
> old_cred/NULL pointer (or pass old_cred to the function by reference
> and return an error code); that should still provide the performance
> win Mateusz is looking for while providing additional safety against
> out-of-sync changes.  I would guess the code would be smaller too.
>

It is unclear from the description if you are arguing for moving the new
func into access_override_creds almost as is just put prior to existing
code *or* mixing checks with assignments.

static bool *access_override_creds(struct cred **ptr)
        [snip]
        if (!uid_eq(cred->fsuid, cred->uid) ||
            !gid_eq(cred->fsgid, cred->gid))
                return false;
        /* remaining checks go here as well */
        [snip]

        override_cred = prepare_creds();
        if (!override_cred) {
                *ptr = NULL;
                return true;
        }

        override_cred->fsuid = override_cred->uid;
        override_cred->fsgid = override_cred->gid;
        [snip]

If this is what you had in mind, I note it retains all the duplication
except in one func body which I'm confident does not buy anything,
provided the warning comment is added.

At the same time the downside is that it uglifies error handling at the
callsite, so I would say a net loss.

Alternatively, if you want to somehow keep tests aroung assignments the
code gets super hairy.

But maybe you wanted something else?

As I noted in another email this already got more discussion than it
warrants.

Addition of the warning comment makes sense, but concerns after that
don't sound legitimate to me.
  
Mateusz Guzik Jan. 25, 2023, 4:17 p.m. UTC | #12
On 1/25/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> On 1/24/23, Paul Moore <paul@paul-moore.com> wrote:
>> Although I'm looking at this again and realized that only
>> do_faccessat() calls access_override_creds(), so why not just fold the
>> new access_need_override_creds() logic into access_override_creds()?
>> Just have one function that takes the flag value, and returns an
>> old_cred/NULL pointer (or pass old_cred to the function by reference
>> and return an error code); that should still provide the performance
>> win Mateusz is looking for while providing additional safety against
>> out-of-sync changes.  I would guess the code would be smaller too.
>>
>
> It is unclear from the description if you are arguing for moving the new
> func into access_override_creds almost as is just put prior to existing
> code *or* mixing checks with assignments.
>
> static bool *access_override_creds(struct cred **ptr)
>         [snip]
>         if (!uid_eq(cred->fsuid, cred->uid) ||
>             !gid_eq(cred->fsgid, cred->gid))
>                 return false;
>         /* remaining checks go here as well */
>         [snip]
>
>         override_cred = prepare_creds();
>         if (!override_cred) {
>                 *ptr = NULL;
>                 return true;
>         }
>
>         override_cred->fsuid = override_cred->uid;
>         override_cred->fsgid = override_cred->gid;
>         [snip]
>
> If this is what you had in mind, I note it retains all the duplication
> except in one func body which I'm confident does not buy anything,
> provided the warning comment is added.
>
> At the same time the downside is that it uglifies error handling at the
> callsite, so I would say a net loss.
>
> Alternatively, if you want to somehow keep tests aroung assignments the
> code gets super hairy.
>
> But maybe you wanted something else?
>
> As I noted in another email this already got more discussion than it
> warrants.
>
> Addition of the warning comment makes sense, but concerns after that
> don't sound legitimate to me.
>

So I posted v3 with the comment, you are CC'ed.

I'm not going to further argue about the patch. If you want to write
your own variant that's fine with me, feel free to take my bench results
and denote they come from a similar version.

There is other stuff I want to post and unslowed access() helps making
the case. The CONFIG_INIT_ON_ALLOC_DEFAULT_ON option is enabled on
Debian, Ubuntu and Arch and as a side effect it zeroes bufs allocated
every time a path lookup is performed. As these are 4096 bytes in size
it is not pretty whatsoever and I'm confident not worth the hardening
promised by mandatory zeroing. Got patches to add exemption support for
caches to sort it out without disabling the opt.
  
Paul Moore Jan. 25, 2023, 5:07 p.m. UTC | #13
On Wed, Jan 25, 2023 at 10:00 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> On 1/24/23, Paul Moore <paul@paul-moore.com> wrote:
> > Although I'm looking at this again and realized that only
> > do_faccessat() calls access_override_creds(), so why not just fold the
> > new access_need_override_creds() logic into access_override_creds()?
> > Just have one function that takes the flag value, and returns an
> > old_cred/NULL pointer (or pass old_cred to the function by reference
> > and return an error code); that should still provide the performance
> > win Mateusz is looking for while providing additional safety against
> > out-of-sync changes.  I would guess the code would be smaller too.
>
> It is unclear from the description if you are arguing for moving the new
> func into access_override_creds almost as is just put prior to existing
> code *or* mixing checks with assignments.

"arguing" is a bit strong of a word for what I was thinking, it was
more of just tossing out an idea to see if it has any merit with you,
the VFS folks, and possibly Linus.

> static bool *access_override_creds(struct cred **ptr)
>         [snip]
>         if (!uid_eq(cred->fsuid, cred->uid) ||
>             !gid_eq(cred->fsgid, cred->gid))
>                 return false;
>         /* remaining checks go here as well */
>         [snip]
>
>         override_cred = prepare_creds();
>         if (!override_cred) {
>                 *ptr = NULL;
>                 return true;
>         }
>
>         override_cred->fsuid = override_cred->uid;
>         override_cred->fsgid = override_cred->gid;
>         [snip]
>
> If this is what you had in mind, I note it retains all the duplication
> except in one func body which I'm confident does not buy anything,
> provided the warning comment is added.
>
> At the same time the downside is that it uglifies error handling at the
> callsite, so I would say a net loss.

Yes, I was thinking of combining the two functions into one to better
link the cred checks with the cred adjustments.

> Addition of the warning comment makes sense, but concerns after that
> don't sound legitimate to me.

Well, as we talked about earlier, it's really up to the VFS folks to
pick what they want, and they have been suspiciously quiet thus far.

--
paul-moore.com
  
Paul Moore Jan. 25, 2023, 5:11 p.m. UTC | #14
On Wed, Jan 25, 2023 at 11:17 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> So I posted v3 with the comment, you are CC'ed.
>
> I'm not going to further argue about the patch. If you want to write
> your own variant that's fine with me, feel free to take my bench results
> and denote they come from a similar version.

Once again, I see the back-and-forth as more of a discussion, not
really an argument in the combative sense, but everyone reads their
email differently I guess.

The comment is an improvement, thanks for that, now it's just a matter
of hearing from the VFS folks.
  

Patch

diff --git a/fs/open.c b/fs/open.c
index 82c1a28b3308..3c068a38044c 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -367,7 +367,37 @@  COMPAT_SYSCALL_DEFINE6(fallocate, int, fd, int, mode, compat_arg_u64_dual(offset
  * access() needs to use the real uid/gid, not the effective uid/gid.
  * We do this by temporarily clearing all FS-related capabilities and
  * switching the fsuid/fsgid around to the real ones.
+ *
+ * Creating new credentials is expensive, so we try to skip doing it,
+ * which we can if the result would match what we already got.
  */
+static bool access_need_override_creds(int flags)
+{
+	const struct cred *cred;
+
+	if (flags & AT_EACCESS)
+		return false;
+
+	cred = current_cred();
+	if (!uid_eq(cred->fsuid, cred->uid) ||
+	    !gid_eq(cred->fsgid, cred->gid))
+		return true;
+
+	if (!issecure(SECURE_NO_SETUID_FIXUP)) {
+		kuid_t root_uid = make_kuid(cred->user_ns, 0);
+		if (!uid_eq(cred->uid, root_uid)) {
+			if (!cap_isclear(cred->cap_effective))
+				return true;
+		} else {
+			if (!cap_isidentical(cred->cap_effective,
+			    cred->cap_permitted))
+				return true;
+		}
+	}
+
+	return false;
+}
+
 static const struct cred *access_override_creds(void)
 {
 	const struct cred *old_cred;
@@ -436,7 +466,7 @@  static long do_faccessat(int dfd, const char __user *filename, int mode, int fla
 	if (flags & AT_EMPTY_PATH)
 		lookup_flags |= LOOKUP_EMPTY;
 
-	if (!(flags & AT_EACCESS)) {
+	if (access_need_override_creds(flags)) {
 		old_cred = access_override_creds();
 		if (!old_cred)
 			return -ENOMEM;