diff mbox series

vfs: shave work on failed file open

Message ID	20230925205545.4135472-1-mjguzik@gmail.com
State	New
Headers	Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.35 as permitted sender) client-ip=23.128.96.35; From: Mateusz Guzik <mjguzik@gmail.com> To: brauner@kernel.org Cc: viro@zeniv.linux.org.uk, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, torvalds@linux-foundation.org, Mateusz Guzik <mjguzik@gmail.com> Subject: [PATCH] vfs: shave work on failed file open Date: Mon, 25 Sep 2023 22:55:45 +0200 Message-Id: <20230925205545.4135472-1-mjguzik@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	vfs: shave work on failed file open \| vfs: shave work on failed file open

Commit Message

Mateusz Guzik Sept. 25, 2023, 8:55 p.m. UTC

  Failed opens (mostly ENOENT) legitimately happen a lot, for example here
are stats from stracing kernel build for few seconds (strace -fc make):

  % time     seconds  usecs/call     calls    errors syscall
  ------ ----------- ----------- --------- --------- ------------------
    0.76    0.076233           5     15040      3688 openat

(this is tons of header files tried in different paths)

Apart from a rare corner case where the file object is fully constructed
and we need to abort, there is a lot of overhead which can be avoided.

Most notably delegation of freeing to task_work, which comes with an
enormous cost (see 021a160abf62 ("fs: use __fput_sync in close(2)" for
an example).

Benched with will-it-scale with a custom testcase based on
tests/open1.c:
[snip]
        while (1) {
                int fd = open("/tmp/nonexistent", O_RDONLY);
                assert(fd == -1);

                (*iterations)++;
        }
[/snip]

Sapphire Rapids, one worker in single-threaded case (ops/s):
before:	1950013
after:	2914973 (+49%)

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 fs/file_table.c      | 39 +++++++++++++++++++++++++++++++++++++++
 fs/namei.c           |  2 +-
 include/linux/file.h |  1 +
 3 files changed, 41 insertions(+), 1 deletion(-)

Comments

Christian Brauner Sept. 26, 2023, 2:01 p.m. UTC | #1

> +void fput_badopen(struct file *file)
> +{
> +	if (unlikely(file->f_mode & (FMODE_BACKING | FMODE_OPENED))) {
> +		fput(file);
> +		return;
> +	}
> +
> +	if (WARN_ON(atomic_long_read(&file->f_count) != 1)) {
> +		fput(file);
> +		return;
> +	}
> +
> +	/* zero out the ref count to appease possible asserts */
> +	atomic_long_set(&file->f_count, 0);

Afaict this could just be:

if (WARN_ON_ONCE(atomic_long_cmpxchg(&file->f_count, 1, 0) != 1)) {

> +	file_free_badopen(file);
> +}
> +EXPORT_SYMBOL(fput_badopen);

Should definitely not be exported and only be available to core vfs
code. So this should go into fs/internal.h.

Mateusz Guzik Sept. 26, 2023, 2:07 p.m. UTC | #2

On 9/26/23, Christian Brauner <brauner@kernel.org> wrote:
>> +void fput_badopen(struct file *file)
>> +{
>> +	if (unlikely(file->f_mode & (FMODE_BACKING | FMODE_OPENED))) {
>> +		fput(file);
>> +		return;
>> +	}
>> +
>> +	if (WARN_ON(atomic_long_read(&file->f_count) != 1)) {
>> +		fput(file);
>> +		return;
>> +	}
>> +
>> +	/* zero out the ref count to appease possible asserts */
>> +	atomic_long_set(&file->f_count, 0);
>
> Afaict this could just be:
>
> if (WARN_ON_ONCE(atomic_long_cmpxchg(&file->f_count, 1, 0) != 1)) {
>

This would bring back one of the atomics, but I'm not going to die on this hill.

If you insist on this change I'm going to have tweak some comments and
bench again.

>> +	file_free_badopen(file);
>> +}
>> +EXPORT_SYMBOL(fput_badopen);
>
> Should definitely not be exported and only be available to core vfs
> code. So this should go into fs/internal.h.
>

Ok

Christian Brauner Sept. 26, 2023, 2:24 p.m. UTC | #3

> > if (WARN_ON_ONCE(atomic_long_cmpxchg(&file->f_count, 1, 0) != 1)) {

> bench again.

Can you see how much of a difference it makes because imho it really
looks a lot nicer then this ugly atomic_read followed by atomic_set...

Mateusz Guzik Sept. 26, 2023, 3:40 p.m. UTC | #4

On 9/26/23, Christian Brauner <brauner@kernel.org> wrote:
>> > if (WARN_ON_ONCE(atomic_long_cmpxchg(&file->f_count, 1, 0) != 1)) {
>
>> bench again.
>
> Can you see how much of a difference it makes because imho it really
> looks a lot nicer then this ugly atomic_read followed by atomic_set...
>

Huh, turns out to be in the noise here and I see why.

Immediately following this there are several atomic ops anyway,
notably to unref creds and apparmor labels. So happens top of the
profile is the allocator(!).

These can be fixed but that's perhaps for another time.

If going this route then perhaps atomic_long_dec_and_test just like
fput? Although one could argue if that cmpxchg failed then something
fishy is going on and "real" fput would be safer. Ultimately there is
a lot of handwaving possible whichever way, so just pick something and
I'll send a v2.

John Stoffel Sept. 26, 2023, 8:59 p.m. UTC | #5

>>>>> "Mateusz" == Mateusz Guzik <mjguzik@gmail.com> writes:

> Failed opens (mostly ENOENT) legitimately happen a lot, for example here
> are stats from stracing kernel build for few seconds (strace -fc make):

>   % time     seconds  usecs/call     calls    errors syscall
>   ------ ----------- ----------- --------- --------- ------------------
>     0.76    0.076233           5     15040      3688 openat

> (this is tons of header files tried in different paths)

> Apart from a rare corner case where the file object is fully constructed
> and we need to abort, there is a lot of overhead which can be avoided.

> Most notably delegation of freeing to task_work, which comes with an
> enormous cost (see 021a160abf62 ("fs: use __fput_sync in close(2)" for
> an example).

> Benched with will-it-scale with a custom testcase based on
> tests/open1.c:
> [snip]
>         while (1) {
>                 int fd = open("/tmp/nonexistent", O_RDONLY);
>                 assert(fd == -1);

>                 (*iterations)++;
>         }
> [/snip]

> Sapphire Rapids, one worker in single-threaded case (ops/s):
> before:	1950013
> after:	2914973 (+49%)


So what are the times in a multi-threaded case?  Just wondering what
happens if you have a bunch of makes or other jobs like that all
running at once. 


> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> ---
>  fs/file_table.c      | 39 +++++++++++++++++++++++++++++++++++++++
>  fs/namei.c           |  2 +-
>  include/linux/file.h |  1 +
>  3 files changed, 41 insertions(+), 1 deletion(-)

> diff --git a/fs/file_table.c b/fs/file_table.c
> index ee21b3da9d08..320dc1f9aa0e 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -82,6 +82,16 @@ static inline void file_free(struct file *f)
>  	call_rcu(&f->f_rcuhead, file_free_rcu);
>  }
 
> +static inline void file_free_badopen(struct file *f)
> +{
> +	BUG_ON(f->f_mode & (FMODE_BACKING | FMODE_OPENED));

eww... what a BUG_ON() here?  This seems *way* overkill to crash the
system here, and you don't even check if f exists first as well, since
I assume the caller checks it or already knows it?  

Why not just return an error here and keep going?  What happens if you do?


> +	security_file_free(f);
> +	put_cred(f->f_cred);
> +	if (likely(!(f->f_mode & FMODE_NOACCOUNT)))
> +		percpu_counter_dec(&nr_files);
> +	kmem_cache_free(filp_cachep, f);
> +}
> +
>  /*
>   * Return the total number of open files in the system
>   */
> @@ -468,6 +478,35 @@ void __fput_sync(struct file *file)
>  EXPORT_SYMBOL(fput);
>  EXPORT_SYMBOL(__fput_sync);
 
> +/*
> + * Clean up after failing to open (e.g., open(2) returns with -ENOENT).
> + *
> + * This represents opportunities to shave on work in the common case compared
> + * to the usual fput:
> + * 1. vast majority of the time FMODE_OPENED is not set, meaning there is no
> + *    need to delegate to task_work
> + * 2. if the above holds then we are guaranteed we have the only reference with
> + *    nobody else seeing the file, thus no need to use atomics to release it
> + * 3. then there is no need to delegate freeing to RCU
> + */
> +void fput_badopen(struct file *file)
> +{
> +	if (unlikely(file->f_mode & (FMODE_BACKING | FMODE_OPENED))) {
> +		fput(file);
> +		return;
> +	}
> +
> +	if (WARN_ON(atomic_long_read(&file->f_count) != 1)) {
> +		fput(file);
> +		return;
> +	}
> +
> +	/* zero out the ref count to appease possible asserts */
> +	atomic_long_set(&file->f_count, 0);
> +	file_free_badopen(file);
> +}
> +EXPORT_SYMBOL(fput_badopen);
> +
>  void __init files_init(void)
>  {
>  	filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
> diff --git a/fs/namei.c b/fs/namei.c
> index 567ee547492b..67579fe30b28 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3802,7 +3802,7 @@ static struct file *path_openat(struct nameidata *nd,
>  		WARN_ON(1);
>  		error = -EINVAL;
>  	}
> -	fput(file);
> +	fput_badopen(file);
>  	if (error == -EOPENSTALE) {
>  		if (flags & LOOKUP_RCU)
>  			error = -ECHILD;
> diff --git a/include/linux/file.h b/include/linux/file.h
> index 6e9099d29343..96300e27d9a8 100644
> --- a/include/linux/file.h
> +++ b/include/linux/file.h
> @@ -15,6 +15,7 @@
>  struct file;
 
>  extern void fput(struct file *);
> +extern void fput_badopen(struct file *);
 
>  struct file_operations;
>  struct task_struct;
> -- 
> 2.39.2

Mateusz Guzik Sept. 26, 2023, 9:07 p.m. UTC | #6

On 9/26/23, John Stoffel <john@stoffel.org> wrote:
>>>>>> "Mateusz" == Mateusz Guzik <mjguzik@gmail.com> writes:
>
>> Failed opens (mostly ENOENT) legitimately happen a lot, for example here
>> are stats from stracing kernel build for few seconds (strace -fc make):
>
>>   % time     seconds  usecs/call     calls    errors syscall
>>   ------ ----------- ----------- --------- --------- ------------------
>>     0.76    0.076233           5     15040      3688 openat
>
>> (this is tons of header files tried in different paths)
>
>> Apart from a rare corner case where the file object is fully constructed
>> and we need to abort, there is a lot of overhead which can be avoided.
>
>> Most notably delegation of freeing to task_work, which comes with an
>> enormous cost (see 021a160abf62 ("fs: use __fput_sync in close(2)" for
>> an example).
>
>> Benched with will-it-scale with a custom testcase based on
>> tests/open1.c:
>> [snip]
>>         while (1) {
>>                 int fd = open("/tmp/nonexistent", O_RDONLY);
>>                 assert(fd == -1);
>
>>                 (*iterations)++;
>>         }
>> [/snip]
>
>> Sapphire Rapids, one worker in single-threaded case (ops/s):
>> before:	1950013
>> after:	2914973 (+49%)
>
>
> So what are the times in a multi-threaded case?  Just wondering what
> happens if you have a bunch of makes or other jobs like that all
> running at once.
>

On my kernel they heavily bottleneck on apparmor, I already mailed the author:
https://lore.kernel.org/all/CAGudoHFfG7mARwSqcoLNwV81-KX4Bici5FQHjoNG4f9m83oLyg@mail.gmail.com/

maybe i'll hack up the fix

When running without that LSM and on *stock* kernel it heavily
bottlenecks somewhere in bowels of SLUB + RCU.

Without LSM and with the patch it scales almost perfectly, as one would expect.

I don't have numbers nor perf output handy.

>
>> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
>> ---
>>  fs/file_table.c      | 39 +++++++++++++++++++++++++++++++++++++++
>>  fs/namei.c           |  2 +-
>>  include/linux/file.h |  1 +
>>  3 files changed, 41 insertions(+), 1 deletion(-)
>
>> diff --git a/fs/file_table.c b/fs/file_table.c
>> index ee21b3da9d08..320dc1f9aa0e 100644
>> --- a/fs/file_table.c
>> +++ b/fs/file_table.c
>> @@ -82,6 +82,16 @@ static inline void file_free(struct file *f)
>>  	call_rcu(&f->f_rcuhead, file_free_rcu);
>>  }
>
>> +static inline void file_free_badopen(struct file *f)
>> +{
>> +	BUG_ON(f->f_mode & (FMODE_BACKING | FMODE_OPENED));
>
> eww... what a BUG_ON() here?  This seems *way* overkill to crash the
> system here, and you don't even check if f exists first as well, since
> I assume the caller checks it or already knows it?
>
> Why not just return an error here and keep going?  What happens if you do?
>

The only caller already checked these flags, so I think BUGing out is prudent.

>
>> +	security_file_free(f);
>> +	put_cred(f->f_cred);
>> +	if (likely(!(f->f_mode & FMODE_NOACCOUNT)))
>> +		percpu_counter_dec(&nr_files);
>> +	kmem_cache_free(filp_cachep, f);
>> +}
>> +
>>  /*
>>   * Return the total number of open files in the system
>>   */
>> @@ -468,6 +478,35 @@ void __fput_sync(struct file *file)
>>  EXPORT_SYMBOL(fput);
>>  EXPORT_SYMBOL(__fput_sync);
>
>> +/*
>> + * Clean up after failing to open (e.g., open(2) returns with -ENOENT).
>> + *
>> + * This represents opportunities to shave on work in the common case
>> compared
>> + * to the usual fput:
>> + * 1. vast majority of the time FMODE_OPENED is not set, meaning there is
>> no
>> + *    need to delegate to task_work
>> + * 2. if the above holds then we are guaranteed we have the only
>> reference with
>> + *    nobody else seeing the file, thus no need to use atomics to release
>> it
>> + * 3. then there is no need to delegate freeing to RCU
>> + */
>> +void fput_badopen(struct file *file)
>> +{
>> +	if (unlikely(file->f_mode & (FMODE_BACKING | FMODE_OPENED))) {
>> +		fput(file);
>> +		return;
>> +	}
>> +
>> +	if (WARN_ON(atomic_long_read(&file->f_count) != 1)) {
>> +		fput(file);
>> +		return;
>> +	}
>> +
>> +	/* zero out the ref count to appease possible asserts */
>> +	atomic_long_set(&file->f_count, 0);
>> +	file_free_badopen(file);
>> +}
>> +EXPORT_SYMBOL(fput_badopen);
>> +
>>  void __init files_init(void)
>>  {
>>  	filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
>> diff --git a/fs/namei.c b/fs/namei.c
>> index 567ee547492b..67579fe30b28 100644
>> --- a/fs/namei.c
>> +++ b/fs/namei.c
>> @@ -3802,7 +3802,7 @@ static struct file *path_openat(struct nameidata
>> *nd,
>>  		WARN_ON(1);
>>  		error = -EINVAL;
>>  	}
>> -	fput(file);
>> +	fput_badopen(file);
>>  	if (error == -EOPENSTALE) {
>>  		if (flags & LOOKUP_RCU)
>>  			error = -ECHILD;
>> diff --git a/include/linux/file.h b/include/linux/file.h
>> index 6e9099d29343..96300e27d9a8 100644
>> --- a/include/linux/file.h
>> +++ b/include/linux/file.h
>> @@ -15,6 +15,7 @@
>>  struct file;
>
>>  extern void fput(struct file *);
>> +extern void fput_badopen(struct file *);
>
>>  struct file_operations;
>>  struct task_struct;
>> --
>> 2.39.2
>
>

John Stoffel Sept. 27, 2023, 6:43 p.m. UTC | #7

>>>>> "Mateusz" == Mateusz Guzik <mjguzik@gmail.com> writes:

> On 9/26/23, John Stoffel <john@stoffel.org> wrote:

>> 
>>> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
>>> ---
>>> fs/file_table.c      | 39 +++++++++++++++++++++++++++++++++++++++
>>> fs/namei.c           |  2 +-
>>> include/linux/file.h |  1 +
>>> 3 files changed, 41 insertions(+), 1 deletion(-)
>> 
>>> diff --git a/fs/file_table.c b/fs/file_table.c
>>> index ee21b3da9d08..320dc1f9aa0e 100644
>>> --- a/fs/file_table.c
>>> +++ b/fs/file_table.c
>>> @@ -82,6 +82,16 @@ static inline void file_free(struct file *f)
>>> call_rcu(&f->f_rcuhead, file_free_rcu);
>>> }
>> 
>>> +static inline void file_free_badopen(struct file *f)
>>> +{
>>> +	BUG_ON(f->f_mode & (FMODE_BACKING | FMODE_OPENED));
>> 
>> eww... what a BUG_ON() here?  This seems *way* overkill to crash the
>> system here, and you don't even check if f exists first as well, since
>> I assume the caller checks it or already knows it?
>> 
>> Why not just return an error here and keep going?  What happens if you do?
>> 

> The only caller already checked these flags, so I think BUGing out is prudent.

So how would the flags change if they had been checked before?  And if
they are wrong, why not just exit without doing anything?  Crashing
the system just because you can't free some memory seems like a
horrible thing to do.  

Linus has said multiple times that BUG_ON() isn't the answer.  You
should just do a WARN_ON() instead.  Or WARN_ONCE(), don't just kill
the entire system like this.

John

diff mbox series

Patch

diff --git a/fs/file_table.c b/fs/file_table.c
index ee21b3da9d08..320dc1f9aa0e 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -82,6 +82,16 @@  static inline void file_free(struct file *f)
 	call_rcu(&f->f_rcuhead, file_free_rcu);
 }
 
+static inline void file_free_badopen(struct file *f)
+{
+	BUG_ON(f->f_mode & (FMODE_BACKING | FMODE_OPENED));
+	security_file_free(f);
+	put_cred(f->f_cred);
+	if (likely(!(f->f_mode & FMODE_NOACCOUNT)))
+		percpu_counter_dec(&nr_files);
+	kmem_cache_free(filp_cachep, f);
+}
+
 /*
  * Return the total number of open files in the system
  */
@@ -468,6 +478,35 @@  void __fput_sync(struct file *file)
 EXPORT_SYMBOL(fput);
 EXPORT_SYMBOL(__fput_sync);
 
+/*
+ * Clean up after failing to open (e.g., open(2) returns with -ENOENT).
+ *
+ * This represents opportunities to shave on work in the common case compared
+ * to the usual fput:
+ * 1. vast majority of the time FMODE_OPENED is not set, meaning there is no
+ *    need to delegate to task_work
+ * 2. if the above holds then we are guaranteed we have the only reference with
+ *    nobody else seeing the file, thus no need to use atomics to release it
+ * 3. then there is no need to delegate freeing to RCU
+ */
+void fput_badopen(struct file *file)
+{
+	if (unlikely(file->f_mode & (FMODE_BACKING | FMODE_OPENED))) {
+		fput(file);
+		return;
+	}
+
+	if (WARN_ON(atomic_long_read(&file->f_count) != 1)) {
+		fput(file);
+		return;
+	}
+
+	/* zero out the ref count to appease possible asserts */
+	atomic_long_set(&file->f_count, 0);
+	file_free_badopen(file);
+}
+EXPORT_SYMBOL(fput_badopen);
+
 void __init files_init(void)
 {
 	filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
diff --git a/fs/namei.c b/fs/namei.c
index 567ee547492b..67579fe30b28 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3802,7 +3802,7 @@  static struct file *path_openat(struct nameidata *nd,
 		WARN_ON(1);
 		error = -EINVAL;
 	}
-	fput(file);
+	fput_badopen(file);
 	if (error == -EOPENSTALE) {
 		if (flags & LOOKUP_RCU)
 			error = -ECHILD;
diff --git a/include/linux/file.h b/include/linux/file.h
index 6e9099d29343..96300e27d9a8 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -15,6 +15,7 @@ 
 struct file;
 
 extern void fput(struct file *);
+extern void fput_badopen(struct file *);
 
 struct file_operations;
 struct task_struct;