[0/8] make unregistration of super_block shrinker more faster

Message ID

20230531095742.2480623-1-qi.zheng@linux.dev

Headers

Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 2620:137:e000::1:20 as permitted sender)
 client-ip=2620:137:e000::1:20;
From: Qi Zheng <qi.zheng@linux.dev>
To: akpm@linux-foundation.org, tkhai@ya.ru, roman.gushchin@linux.dev,
        vbabka@suse.cz, viro@zeniv.linux.org.uk, brauner@kernel.org,
        djwong@kernel.org, hughd@google.com, paulmck@kernel.org,
        muchun.song@linux.dev
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
        linux-xfs@vger.kernel.org, linux-kernel@vger.kernel.org,
        Qi Zheng <zhengqi.arch@bytedance.com>
Subject: [PATCH 0/8] make unregistration of super_block shrinker more faster
Date: Wed, 31 May 2023 09:57:34 +0000
Message-Id: <20230531095742.2480623-1-qi.zheng@linux.dev>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

make unregistration of super_block shrinker more faster

[0/8] make unregistration of super_block shrinker more faster
[1/8] mm: vmscan: move shrinker_debugfs_remove() before synchronize_srcu()
[2/8] mm: vmscan: split unregister_shrinker()
[3/8] fs: move list_lru_destroy() to destroy_super_work()
[4/8] fs: shrink only (SB_ACTIVE|SB_BORN) superblocks in super_cache_scan()
[5/8] fs: introduce struct super_operations::destroy_super() callback
[6/8] xfs: introduce xfs_fs_destroy_super()
[7/8] shmem: implement shmem_destroy_super()
[8/8] fs: use unregister_shrinker_delayed_{initiate, finalize} for super_block shrinker

Message

Qi Zheng May 31, 2023, 9:57 a.m. UTC

  From: Qi Zheng <zhengqi.arch@bytedance.com>

Hi all,

This patch series aims to make unregistration of the super_block shrinker
faster.

1. Background
=============

The kernel test robot reported an 88.8% regression in stress-ng.ramfs.ops_per_sec
with commit f95bdb700bc6 ("mm: vmscan: make global slab shrink lockless"). More
details are in the report linked at [1] below.

[1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/

We can just use the following command to reproduce the result:

stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &

1) before commit f95bdb700bc6b:

stress-ng: info:  [11023] dispatching hogs: 9 ramfs
stress-ng: info:  [11023] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: info:  [11023]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [11023] ramfs            774966     60.00     10.18    169.45     12915.89        4314.26
stress-ng: info:  [11023] for a 60.00s run time:
stress-ng: info:  [11023]    1920.11s available CPU time
stress-ng: info:  [11023]      10.18s user time   (  0.53%)
stress-ng: info:  [11023]     169.44s system time (  8.82%)
stress-ng: info:  [11023]     179.62s total time  (  9.35%)
stress-ng: info:  [11023] load average: 8.99 2.69 0.93
stress-ng: info:  [11023] successful run completed in 60.00s (1 min, 0.00 secs)

2) after commit f95bdb700bc6b:

stress-ng: info:  [37676] dispatching hogs: 9 ramfs
stress-ng: info:  [37676] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: info:  [37676]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [37676] ramfs            168673     60.00      1.61     39.66      2811.08        4087.47
stress-ng: info:  [37676] for a 60.10s run time:
stress-ng: info:  [37676]    1923.36s available CPU time
stress-ng: info:  [37676]       1.60s user time   (  0.08%)
stress-ng: info:  [37676]      39.66s system time (  2.06%)
stress-ng: info:  [37676]      41.26s total time  (  2.15%)
stress-ng: info:  [37676] load average: 7.69 3.63 2.36
stress-ng: info:  [37676] successful run completed in 60.10s (1 min, 0.10 secs)
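To make the size of the drop in this local reproduction explicit, here is a small sketch that computes the relative change in "bogo ops/s (real time)" between the two runs above. The function names are only illustrative. With these numbers the change works out to roughly -78%; the robot's -88.8% comes from its own report[1], so the two figures need not match.

```c
#include <assert.h>
#include <stdio.h>

/* Relative change from `before` to `after`, in percent (negative = drop). */
double percent_change(double before, double after)
{
    assert(before != 0.0); /* a zero baseline has no relative change */
    return (after - before) / before * 100.0;
}

/* Print the change in real-time bogo ops/s between the two runs above. */
void print_ramfs_change(void)
{
    const double before = 12915.89; /* bogo ops/s (real time), run 1) */
    const double after  = 2811.08;  /* bogo ops/s (real time), run 2) */

    printf("bogo ops/s (real time): %.1f%%\n",
           percent_change(before, after));
}
```

A higher bogo ops/s value means more work done per second, so a negative result here is a slowdown.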

The root cause is that SRCU must be careful not to check too frequently for
SRCU read-side critical section exits. Paul E. McKenney gave a detailed explanation:

```
In practice, the act of checking to see if there is anyone in an SRCU
read-side critical section is a heavy-weight operation, involving at
least one cache miss per CPU along with a number of full memory barriers.
```

Therefore, even if no one is currently in the SRCU read-side critical section,
synchronize_srcu() cannot return quickly. That's why unregister_shrinker() has
become slower.

2. Idea
=======

2.1 use synchronize_srcu_expedited() ?
--------------------------------------

synchronize_srcu_expedited() makes SRCU much more aggressive.
If we use it to replace synchronize_srcu() in unregister_shrinker(), the
ops/s returns to the previous level:

stress-ng: info:  [13159] dispatching hogs: 9 ramfs
stress-ng: info:  [13159] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: info:  [13159]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [13159] ramfs            710062     60.00      9.63    157.26     11834.18        4254.75
stress-ng: info:  [13159] for a 60.00s run time:
stress-ng: info:  [13159]    1920.14s available CPU time
stress-ng: info:  [13159]       9.62s user time   (  0.50%)
stress-ng: info:  [13159]     157.26s system time (  8.19%)
stress-ng: info:  [13159]     166.88s total time  (  8.69%)
stress-ng: info:  [13159] load average: 9.49 4.02 1.65
stress-ng: info:  [13159] successful run completed in 60.00s (1 min, 0.00 secs)

However, SRCU (Sleepable RCU) readers are allowed to sleep inside the read-side
critical section, so synchronize_srcu_expedited() may consume a lot of CPU time.
This is therefore not a good choice.

2.2 move synchronize_srcu() to the asynchronous delayed work
------------------------------------------------------------

Kirill Tkhai proposed a better idea[2] in 2018: move synchronize_srcu() to
asynchronous delayed work, so that it no longer affects the user-visible
unregistration speed.

[2]. https://lore.kernel.org/lkml/153365636747.19074.12610817307548583381.stgit@localhost.localdomain/
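As a rough illustration of the idea, here is a userspace toy model of a two-phase unregistration: a fast "initiate" step that runs in the caller's context, and a slow "finalize" step that waits for the grace period from deferred work. This is only a sketch under stated assumptions, not the kernel code. Every toy_* name is hypothetical, toy_synchronize() merely stands in for synchronize_srcu(), and the fixed-size array stands in for a workqueue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PENDING 16

/* Toy stand-in for a shrinker; NOT the kernel's struct shrinker. */
struct toy_shrinker {
    bool registered; /* still visible to reclaim walkers */
    bool freed;      /* grace period elapsed, resources released */
};

static struct toy_shrinker *toy_pending[TOY_MAX_PENDING];
static size_t toy_npending;
int toy_grace_periods_waited;

/* Stand-in for synchronize_srcu(): the slow part we want off the hot path. */
void toy_synchronize(void)
{
    toy_grace_periods_waited++;
}

/* Phase 1 (fast): unpublish the shrinker so new walkers cannot find it. */
void toy_unregister_initiate(struct toy_shrinker *s)
{
    assert(s->registered);
    s->registered = false;
}

/* Phase 2 (slow): wait for existing readers, then release resources. */
void toy_unregister_finalize(struct toy_shrinker *s)
{
    assert(!s->registered);
    toy_synchronize();
    s->freed = true;
}

/* Hypothetical umount-like path: do phase 1 now, queue phase 2 for later. */
void toy_kill_super(struct toy_shrinker *s)
{
    toy_unregister_initiate(s);
    assert(toy_npending < TOY_MAX_PENDING);
    toy_pending[toy_npending++] = s;
}

/* Hypothetical deferred worker: runs phase 2 for everything queued. */
size_t toy_run_deferred_work(void)
{
    size_t done = toy_npending;

    for (size_t i = 0; i < toy_npending; i++)
        toy_unregister_finalize(toy_pending[i]);
    toy_npending = 0;
    return done;
}
```

In this model toy_kill_super() returns without waiting for any grace period; the wait happens only when toy_run_deferred_work() runs, which is the property the series is after for the user-visible path.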

After applying his patches ([PATCH RFC 04/10]~[PATCH RFC 10/10], with a few
conflicts), the ops/s is back to the previous level:

stress-ng: info:  [11506] setting to a 60 second run per stressor
stress-ng: info:  [11506] dispatching hogs: 9 ramfs
stress-ng: info:  [11506] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: info:  [11506]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [11506] ramfs            829462     60.00     10.81    174.25     13824.14        4482.08
stress-ng: info:  [11506] for a 60.00s run time:
stress-ng: info:  [11506]    1920.12s available CPU time
stress-ng: info:  [11506]      10.81s user time   (  0.56%)
stress-ng: info:  [11506]     174.25s system time (  9.07%)
stress-ng: info:  [11506]     185.06s total time  (  9.64%)
stress-ng: info:  [11506] load average: 8.96 2.60 0.89
stress-ng: info:  [11506] successful run completed in 60.00s (1 min, 0.00 secs)

To move this patch set forward, I rebased these patches onto
next-20230525. Any comments and suggestions are welcome.

Note: This patch series covers only the super_block shrinker; other places where
unregistration is time-critical can be converted in the same way.

Thanks,
Qi

Kirill Tkhai (7):
  mm: vmscan: split unregister_shrinker()
  fs: move list_lru_destroy() to destroy_super_work()
  fs: shrink only (SB_ACTIVE|SB_BORN) superblocks in super_cache_scan()
  fs: introduce struct super_operations::destroy_super() callback
  xfs: introduce xfs_fs_destroy_super()
  shmem: implement shmem_destroy_super()
  fs: use unregister_shrinker_delayed_{initiate, finalize} for
    super_block shrinker

Qi Zheng (1):
  mm: vmscan: move shrinker_debugfs_remove() before synchronize_srcu()

 fs/super.c               | 32 ++++++++++++++++++--------------
 fs/xfs/xfs_super.c       | 25 ++++++++++++++++++++++---
 include/linux/fs.h       |  6 ++++++
 include/linux/shrinker.h |  2 ++
 mm/shmem.c               |  8 ++++++++
 mm/vmscan.c              | 26 ++++++++++++++++++++------
 6 files changed, 76 insertions(+), 23 deletions(-)

Comments

Andrew Morton May 31, 2023, 6:40 p.m. UTC | #1

On Wed, 31 May 2023 09:57:34 +0000 Qi Zheng <qi.zheng@linux.dev> wrote:

> From: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> Hi all,
> 
> This patch series aims to make unregistration of super_block shrinker more
> faster.
> 
> 1. Background
> =============
> 
> The kernel test robot noticed a -88.8% regression of stress-ng.ramfs.ops_per_sec
> on commit f95bdb700bc6 ("mm: vmscan: make global slab shrink lockless"). More
> details can be seen from the link[1] below.
> 
> [1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
> 
> We can just use the following command to reproduce the result:
> 
> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
> 
> 1) before commit f95bdb700bc6b:
> 
> stress-ng: info:  [11023] dispatching hogs: 9 ramfs
> stress-ng: info:  [11023] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
> stress-ng: info:  [11023]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
> stress-ng: info:  [11023] ramfs            774966     60.00     10.18    169.45     12915.89        4314.26
> stress-ng: info:  [11023] for a 60.00s run time:
> stress-ng: info:  [11023]    1920.11s available CPU time
> stress-ng: info:  [11023]      10.18s user time   (  0.53%)
> stress-ng: info:  [11023]     169.44s system time (  8.82%)
> stress-ng: info:  [11023]     179.62s total time  (  9.35%)
> stress-ng: info:  [11023] load average: 8.99 2.69 0.93
> stress-ng: info:  [11023] successful run completed in 60.00s (1 min, 0.00 secs)
> 
> 2) after commit f95bdb700bc6b:
> 
> stress-ng: info:  [37676] dispatching hogs: 9 ramfs
> stress-ng: info:  [37676] stressor       bogo ops real time  usrtime  sys time   bogo ops/s     bogo ops/s
> stress-ng: info:  [37676]                           (secs)    (secs)   (secs)   (real time) (usr+sys time)
> stress-ng: info:  [37676] ramfs            168673     60.00     1.61    39.66      2811.08        4087.47
> stress-ng: info:  [37676] for a 60.10s run time:
> stress-ng: info:  [37676]    1923.36s available CPU time
> stress-ng: info:  [37676]       1.60s user time   (  0.08%)
> stress-ng: info:  [37676]      39.66s system time (  2.06%)
> stress-ng: info:  [37676]      41.26s total time  (  2.15%)
> stress-ng: info:  [37676] load average: 7.69 3.63 2.36
> stress-ng: info:  [37676] successful run completed in 60.10s (1 min, 0.10 secs)

Is this comparison reversed?  It appears to demonstrate that
f95bdb700bc6b made the operation faster.

Qi Zheng June 1, 2023, 8:46 a.m. UTC | #2

On 2023/6/1 02:40, Andrew Morton wrote:
> On Wed, 31 May 2023 09:57:34 +0000 Qi Zheng <qi.zheng@linux.dev> wrote:
> 
>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>
>> Hi all,
>>
>> This patch series aims to make unregistration of super_block shrinker more
>> faster.
>>
>> 1. Background
>> =============
>>
>> The kernel test robot noticed a -88.8% regression of stress-ng.ramfs.ops_per_sec
>> on commit f95bdb700bc6 ("mm: vmscan: make global slab shrink lockless"). More
>> details can be seen from the link[1] below.
>>
>> [1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
>>
>> We can just use the following command to reproduce the result:
>>
>> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
>>
>> 1) before commit f95bdb700bc6b:
>>
>> stress-ng: info:  [11023] dispatching hogs: 9 ramfs
>> stress-ng: info:  [11023] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
>> stress-ng: info:  [11023]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
>> stress-ng: info:  [11023] ramfs            774966     60.00     10.18    169.45     12915.89        4314.26
>> stress-ng: info:  [11023] for a 60.00s run time:
>> stress-ng: info:  [11023]    1920.11s available CPU time
>> stress-ng: info:  [11023]      10.18s user time   (  0.53%)
>> stress-ng: info:  [11023]     169.44s system time (  8.82%)
>> stress-ng: info:  [11023]     179.62s total time  (  9.35%)
>> stress-ng: info:  [11023] load average: 8.99 2.69 0.93
>> stress-ng: info:  [11023] successful run completed in 60.00s (1 min, 0.00 secs)
>>
>> 2) after commit f95bdb700bc6b:
>>
>> stress-ng: info:  [37676] dispatching hogs: 9 ramfs
>> stress-ng: info:  [37676] stressor       bogo ops real time  usrtime  sys time   bogo ops/s     bogo ops/s
>> stress-ng: info:  [37676]                           (secs)    (secs)   (secs)   (real time) (usr+sys time)
>> stress-ng: info:  [37676] ramfs            168673     60.00     1.61    39.66      2811.08        4087.47
>> stress-ng: info:  [37676] for a 60.10s run time:
>> stress-ng: info:  [37676]    1923.36s available CPU time
>> stress-ng: info:  [37676]       1.60s user time   (  0.08%)
>> stress-ng: info:  [37676]      39.66s system time (  2.06%)
>> stress-ng: info:  [37676]      41.26s total time  (  2.15%)
>> stress-ng: info:  [37676] load average: 7.69 3.63 2.36
>> stress-ng: info:  [37676] successful run completed in 60.10s (1 min, 0.10 secs)
> 
> Is this comparison reversed?  It appears to demonstrate that
> f95bdb700bc6b made the operation faster.

Maybe not. IIUC, for bogo ops/s (real time), bigger is better.

Thanks,
Qi
