Message ID | 20230807110936.21819-46-zhengqi.arch@bytedance.com |
---|---|
State | New |
Headers |
From: Qi Zheng <zhengqi.arch@bytedance.com> Subject: [PATCH v4 45/48] mm: shrinker: make global slab shrink lockless Date: Mon, 7 Aug 2023 19:09:33 +0800 Message-Id: <20230807110936.21819-46-zhengqi.arch@bytedance.com> In-Reply-To: <20230807110936.21819-1-zhengqi.arch@bytedance.com> References: <20230807110936.21819-1-zhengqi.arch@bytedance.com> |
Series |
use refcount+RCU method to implement lockless slab shrink
|
|
Commit Message
Qi Zheng
Aug. 7, 2023, 11:09 a.m. UTC
The shrinker_rwsem is a global read-write lock in the shrinker subsystem
that protects most operations, such as slab shrink and the registration
and unregistration of shrinkers. This can easily cause problems in the
following cases.
1) When memory pressure is high and many filesystems are being mounted
or unmounted at the same time, slab shrink is affected
(down_read_trylock() fails).
One example is the real workload mentioned by Kirill Tkhai:
```
One of the real workloads from my experience is start
of an overcommitted node containing many starting
containers after node crash (or many resuming containers
after reboot for kernel update). In these cases memory
pressure is huge, and the node goes round in long reclaim.
```
2) If a shrinker is blocked (such as the case mentioned
in [1]) and a writer comes in (for example, mounting a filesystem),
then the writer is blocked as well, which in turn blocks all
subsequent shrinker-related operations.
Even if there is no contention when shrinking slab, there may still be a
problem. down_read_trylock() can become a perf hotspot when shrink_slab()
is called frequently. Because atomic operations scale poorly across
cores, this can lead to a significant drop in IPC
(instructions per cycle).
We previously implemented lockless slab shrink with SRCU [2], but the
kernel test robot then reported a -88.8% regression in the
stress-ng.ramfs.ops_per_sec test case [3], so we reverted it [4].
This commit uses the refcount+RCU method [5] proposed by Dave Chinner
to re-implement the lockless global slab shrink. The memcg slab shrink is
handled in the subsequent patch.
All shrinker instances are now dynamically allocated and are freed via
call_rcu(), so we can use rcu_read_{lock,unlock}() to ensure that the
shrinker instance is valid.
In addition, a shrinker instance will not be run again after
unregistration, so the structure that records the pointer to the shrinker
instance can be safely freed without waiting for the RCU read-side
critical section to end.
In this way, we implement lockless slab shrink without having to be
blocked in unregister_shrinker().
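The "take a reference only if the count is non-zero" step that the refcount+RCU scheme relies on can be sketched in user space with C11 atomics. This is only an illustration: `toy_shrinker`, `toy_try_get()` and `toy_put()` are hypothetical names, and the kernel patch uses `refcount_inc_not_zero()` and `refcount_dec_and_test()` rather than an open-coded loop like this one.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in, not the kernel's struct shrinker. */
struct toy_shrinker {
	atomic_int refcount;	/* 0 means lookups must not use the object */
};

/* Take a reference only if the count is currently non-zero. */
static inline bool toy_try_get(struct toy_shrinker *s)
{
	int old = atomic_load(&s->refcount);

	while (old != 0) {
		/* On failure 'old' is reloaded, so zero is re-checked. */
		if (atomic_compare_exchange_weak(&s->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the last one. */
static inline bool toy_put(struct toy_shrinker *s)
{
	int old = atomic_fetch_sub(&s->refcount, 1);

	assert(old > 0);	/* dropping a reference nobody holds is a bug */
	return old == 1;
}
```

Once the count has reached zero, `toy_try_get()` can never succeed again, which is the property that lets the freeing side know that no new user will appear.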
The following are the test results:
stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
1) Before applying this patchset:
setting to a 60 second run per stressor
dispatching hogs: 9 ramfs
stressor  bogo ops  real time  usr time  sys time  bogo ops/s   bogo ops/s
                    (secs)     (secs)    (secs)    (real time)  (usr+sys time)
ramfs     735238    60.00      12.37     363.70    12253.05     1955.08
for a 60.01s run time:
1440.27s available CPU time
12.36s user time ( 0.86%)
363.70s system time ( 25.25%)
376.06s total time ( 26.11%)
load average: 10.79 4.47 1.69
passed: 9: ramfs (9)
failed: 0
skipped: 0
successful run completed in 60.01s (1 min, 0.01 secs)
2) After applying this patchset:
setting to a 60 second run per stressor
dispatching hogs: 9 ramfs
stressor  bogo ops  real time  usr time  sys time  bogo ops/s   bogo ops/s
                    (secs)     (secs)    (secs)    (real time)  (usr+sys time)
ramfs     746698    60.00      12.45     376.16    12444.02     1921.47
for a 60.01s run time:
1440.28s available CPU time
12.44s user time ( 0.86%)
376.16s system time ( 26.12%)
388.60s total time ( 26.98%)
load average: 9.01 3.85 1.49
passed: 9: ramfs (9)
failed: 0
skipped: 0
successful run completed in 60.01s (1 min, 0.01 secs)
We can see that the ops/s has hardly changed (12253.05 before vs.
12444.02 after in real time).
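To put a number on "hardly changed", the relative difference between the two runs can be computed from the figures in the tables above. The helper below is a hypothetical convenience, not part of the patch; with the real-time bogo ops/s values it gives a change of roughly +1.56%.

```c
#include <assert.h>
#include <stdio.h>

/* Relative change from 'before' to 'after', in percent. */
static inline double percent_change(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}

/* Print the change for both bogo ops/s columns of the stress-ng runs. */
void print_ops_change(void)
{
	/* Values copied from the "before" and "after" tables above. */
	printf("real time:    %+.2f%%\n", percent_change(12253.05, 12444.02));
	printf("usr+sys time: %+.2f%%\n", percent_change(1955.08, 1921.47));
}
```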
[1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
[2]. https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.arch@bytedance.com/
[3]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
[4]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zheng@linux.dev/
[5]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
include/linux/shrinker.h | 17 ++++++++++
mm/shrinker.c | 70 +++++++++++++++++++++++++++++-----------
2 files changed, 68 insertions(+), 19 deletions(-)
Comments
On Mon, Aug 07, 2023 at 07:09:33PM +0800, Qi Zheng wrote:
> The shrinker_rwsem is a global read-write lock in shrinkers subsystem,
> which protects most operations such as slab shrink, registration and
> unregistration of shrinkers, etc. This can easily cause problems in the
> following cases.
....
> This commit uses the refcount+RCU method [5] proposed by Dave Chinner
> to re-implement the lockless global slab shrink. The memcg slab shrink is
> handled in the subsequent patch.
....
> ---
>  include/linux/shrinker.h | 17 ++++++++++
>  mm/shrinker.c            | 70 +++++++++++++++++++++++++++++-----------
>  2 files changed, 68 insertions(+), 19 deletions(-)

There's no documentation in the code explaining how the lockless
shrinker algorithm works. It's left to the reader to work out how
this all goes together....

> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index eb342994675a..f06225f18531 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -4,6 +4,8 @@
>
>  #include <linux/atomic.h>
>  #include <linux/types.h>
> +#include <linux/refcount.h>
> +#include <linux/completion.h>
>
>  #define SHRINKER_UNIT_BITS	BITS_PER_LONG
>
> @@ -87,6 +89,10 @@ struct shrinker {
>  	int seeks;	/* seeks to recreate an obj */
>  	unsigned flags;
>
> +	refcount_t refcount;
> +	struct completion done;
> +	struct rcu_head rcu;

What does the refcount protect, why do we need the completion, etc?

> +
>  	void *private_data;
>
>  	/* These are for internal use */
> @@ -120,6 +126,17 @@ struct shrinker *shrinker_alloc(unsigned int flags, const char *fmt, ...);
>  void shrinker_register(struct shrinker *shrinker);
>  void shrinker_free(struct shrinker *shrinker);
>
> +static inline bool shrinker_try_get(struct shrinker *shrinker)
> +{
> +	return refcount_inc_not_zero(&shrinker->refcount);
> +}
> +
> +static inline void shrinker_put(struct shrinker *shrinker)
> +{
> +	if (refcount_dec_and_test(&shrinker->refcount))
> +		complete(&shrinker->done);
> +}
> +
>  #ifdef CONFIG_SHRINKER_DEBUG
>  extern int __printf(2, 3) shrinker_debugfs_rename(struct shrinker *shrinker,
>  						  const char *fmt, ...);
> diff --git a/mm/shrinker.c b/mm/shrinker.c
> index 1911c06b8af5..d318f5621862 100644
> --- a/mm/shrinker.c
> +++ b/mm/shrinker.c
> @@ -2,6 +2,7 @@
>  #include <linux/memcontrol.h>
>  #include <linux/rwsem.h>
>  #include <linux/shrinker.h>
> +#include <linux/rculist.h>
>  #include <trace/events/vmscan.h>
>
>  #include "internal.h"
> @@ -577,33 +578,42 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
>  	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>  		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>
> -	if (!down_read_trylock(&shrinker_rwsem))
> -		goto out;
> -
> -	list_for_each_entry(shrinker, &shrinker_list, list) {
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>  		struct shrink_control sc = {
>  			.gfp_mask = gfp_mask,
>  			.nid = nid,
>  			.memcg = memcg,
>  		};
>
> +		if (!shrinker_try_get(shrinker))
> +			continue;
> +
> +		/*
> +		 * We can safely unlock the RCU lock here since we already
> +		 * hold the refcount of the shrinker.
> +		 */
> +		rcu_read_unlock();
> +
>  		ret = do_shrink_slab(&sc, shrinker, priority);
>  		if (ret == SHRINK_EMPTY)
>  			ret = 0;
>  		freed += ret;
> +
>  		/*
> -		 * Bail out if someone want to register a new shrinker to
> -		 * prevent the registration from being stalled for long periods
> -		 * by parallel ongoing shrinking.
> +		 * This shrinker may be deleted from shrinker_list and freed
> +		 * after the shrinker_put() below, but this shrinker is still
> +		 * used for the next traversal. So it is necessary to hold the
> +		 * RCU lock first to prevent this shrinker from being freed,
> +		 * which also ensures that the next shrinker that is traversed
> +		 * will not be freed (even if it is deleted from shrinker_list
> +		 * at the same time).
>  		 */

This comment really should be at the head of the function,
describing the algorithm used within the function itself. i.e. how
reference counts are used w.r.t. the rcu_read_lock() usage to
guarantee existence of the shrinker and the validity of the list
walk.

I'm not going to remember all these little details when I look at
this code in another 6 months time, and having to work it out from
first principles every time I look at the code will waste a lot
of time...

-Dave.
On Mon, Aug 07, 2023 at 07:09:33PM +0800, Qi Zheng wrote:
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index eb342994675a..f06225f18531 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -4,6 +4,8 @@
>
>  #include <linux/atomic.h>
>  #include <linux/types.h>
> +#include <linux/refcount.h>
> +#include <linux/completion.h>
>
>  #define SHRINKER_UNIT_BITS	BITS_PER_LONG
>
> @@ -87,6 +89,10 @@ struct shrinker {
>  	int seeks;	/* seeks to recreate an obj */
>  	unsigned flags;
>
> +	refcount_t refcount;
> +	struct completion done;
> +	struct rcu_head rcu;

Documentation, please. What does the refcount protect, what does the
completion provide, etc.

> +
>  	void *private_data;
>
>  	/* These are for internal use */
> @@ -120,6 +126,17 @@ struct shrinker *shrinker_alloc(unsigned int flags, const char *fmt, ...);
>  void shrinker_register(struct shrinker *shrinker);
>  void shrinker_free(struct shrinker *shrinker);
>
> +static inline bool shrinker_try_get(struct shrinker *shrinker)
> +{
> +	return refcount_inc_not_zero(&shrinker->refcount);
> +}
> +
> +static inline void shrinker_put(struct shrinker *shrinker)
> +{
> +	if (refcount_dec_and_test(&shrinker->refcount))
> +		complete(&shrinker->done);
> +}
> +
>  #ifdef CONFIG_SHRINKER_DEBUG
>  extern int __printf(2, 3) shrinker_debugfs_rename(struct shrinker *shrinker,
>  						  const char *fmt, ...);
> diff --git a/mm/shrinker.c b/mm/shrinker.c
> index 1911c06b8af5..d318f5621862 100644
> --- a/mm/shrinker.c
> +++ b/mm/shrinker.c
> @@ -2,6 +2,7 @@
>  #include <linux/memcontrol.h>
>  #include <linux/rwsem.h>
>  #include <linux/shrinker.h>
> +#include <linux/rculist.h>
>  #include <trace/events/vmscan.h>
>
>  #include "internal.h"
> @@ -577,33 +578,42 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
>  	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>  		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>
> -	if (!down_read_trylock(&shrinker_rwsem))
> -		goto out;
> -
> -	list_for_each_entry(shrinker, &shrinker_list, list) {
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>  		struct shrink_control sc = {
>  			.gfp_mask = gfp_mask,
>  			.nid = nid,
>  			.memcg = memcg,
>  		};
>
> +		if (!shrinker_try_get(shrinker))
> +			continue;
> +
> +		/*
> +		 * We can safely unlock the RCU lock here since we already
> +		 * hold the refcount of the shrinker.
> +		 */
> +		rcu_read_unlock();
> +
>  		ret = do_shrink_slab(&sc, shrinker, priority);
>  		if (ret == SHRINK_EMPTY)
>  			ret = 0;
>  		freed += ret;
> +
>  		/*
> -		 * Bail out if someone want to register a new shrinker to
> -		 * prevent the registration from being stalled for long periods
> -		 * by parallel ongoing shrinking.
> +		 * This shrinker may be deleted from shrinker_list and freed
> +		 * after the shrinker_put() below, but this shrinker is still
> +		 * used for the next traversal. So it is necessary to hold the
> +		 * RCU lock first to prevent this shrinker from being freed,
> +		 * which also ensures that the next shrinker that is traversed
> +		 * will not be freed (even if it is deleted from shrinker_list
> +		 * at the same time).
>  		 */

This needs to be moved to the head of the function, and document
the whole list walk, get, put and completion parts of the algorithm
that make it safe. There's more to this than "we hold a reference
count", especially the tricky "we might see the shrinker before it
is fully initialised" case....

.....
>  void shrinker_free(struct shrinker *shrinker)
>  {
>  	struct dentry *debugfs_entry = NULL;
> @@ -686,9 +712,18 @@ void shrinker_free(struct shrinker *shrinker)
>  	if (!shrinker)
>  		return;
>
> +	if (shrinker->flags & SHRINKER_REGISTERED) {
> +		shrinker_put(shrinker);
> +		wait_for_completion(&shrinker->done);
> +	}

Needs a comment explaining why we need to wait here...

> +
>  	down_write(&shrinker_rwsem);
>  	if (shrinker->flags & SHRINKER_REGISTERED) {
> -		list_del(&shrinker->list);
> +		/*
> +		 * Lookups on the shrinker are over and will fail in the future,
> +		 * so we can now remove it from the lists and free it.
> +		 */

.... rather than here after the wait has been done and provided the
guarantee that no shrinker is running or will run again...

-Dave.
Hi Dave,

On 2023/8/8 10:24, Dave Chinner wrote:
> On Mon, Aug 07, 2023 at 07:09:33PM +0800, Qi Zheng wrote:
>> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
>> index eb342994675a..f06225f18531 100644
>> --- a/include/linux/shrinker.h
>> +++ b/include/linux/shrinker.h
>> @@ -4,6 +4,8 @@
>>
>>  #include <linux/atomic.h>
>>  #include <linux/types.h>
>> +#include <linux/refcount.h>
>> +#include <linux/completion.h>
>>
>>  #define SHRINKER_UNIT_BITS	BITS_PER_LONG
>>
>> @@ -87,6 +89,10 @@ struct shrinker {
>>  	int seeks;	/* seeks to recreate an obj */
>>  	unsigned flags;
>>
>> +	refcount_t refcount;
>> +	struct completion done;
>> +	struct rcu_head rcu;
>
> Documentation, please. What does the refcount protect, what does the
> completion provide, etc.

How about the following:

	/*
	 * reference count of this shrinker, holding this can guarantee
	 * that the shrinker will not be released.
	 */
	refcount_t refcount;
	/*
	 * Wait for shrinker::refcount to reach 0, that is, no shrinker
	 * is running or will run again.
	 */
	struct completion done;

>
>> +
>>  	void *private_data;
>>
>>  	/* These are for internal use */
>> @@ -120,6 +126,17 @@ struct shrinker *shrinker_alloc(unsigned int flags, const char *fmt, ...);
>>  void shrinker_register(struct shrinker *shrinker);
>>  void shrinker_free(struct shrinker *shrinker);
>>
>> +static inline bool shrinker_try_get(struct shrinker *shrinker)
>> +{
>> +	return refcount_inc_not_zero(&shrinker->refcount);
>> +}
>> +
>> +static inline void shrinker_put(struct shrinker *shrinker)
>> +{
>> +	if (refcount_dec_and_test(&shrinker->refcount))
>> +		complete(&shrinker->done);
>> +}
>> +
>>  #ifdef CONFIG_SHRINKER_DEBUG
>>  extern int __printf(2, 3) shrinker_debugfs_rename(struct shrinker *shrinker,
>>  						  const char *fmt, ...);
>> diff --git a/mm/shrinker.c b/mm/shrinker.c
>> index 1911c06b8af5..d318f5621862 100644
>> --- a/mm/shrinker.c
>> +++ b/mm/shrinker.c
>> @@ -2,6 +2,7 @@
>>  #include <linux/memcontrol.h>
>>  #include <linux/rwsem.h>
>>  #include <linux/shrinker.h>
>> +#include <linux/rculist.h>
>>  #include <trace/events/vmscan.h>
>>
>>  #include "internal.h"
>> @@ -577,33 +578,42 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
>>  	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>  		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>
>> -	if (!down_read_trylock(&shrinker_rwsem))
>> -		goto out;
>> -
>> -	list_for_each_entry(shrinker, &shrinker_list, list) {
>> +	rcu_read_lock();
>> +	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>>  		struct shrink_control sc = {
>>  			.gfp_mask = gfp_mask,
>>  			.nid = nid,
>>  			.memcg = memcg,
>>  		};
>>
>> +		if (!shrinker_try_get(shrinker))
>> +			continue;
>> +
>> +		/*
>> +		 * We can safely unlock the RCU lock here since we already
>> +		 * hold the refcount of the shrinker.
>> +		 */
>> +		rcu_read_unlock();
>> +
>>  		ret = do_shrink_slab(&sc, shrinker, priority);
>>  		if (ret == SHRINK_EMPTY)
>>  			ret = 0;
>>  		freed += ret;
>> +
>>  		/*
>> -		 * Bail out if someone want to register a new shrinker to
>> -		 * prevent the registration from being stalled for long periods
>> -		 * by parallel ongoing shrinking.
>> +		 * This shrinker may be deleted from shrinker_list and freed
>> +		 * after the shrinker_put() below, but this shrinker is still
>> +		 * used for the next traversal. So it is necessary to hold the
>> +		 * RCU lock first to prevent this shrinker from being freed,
>> +		 * which also ensures that the next shrinker that is traversed
>> +		 * will not be freed (even if it is deleted from shrinker_list
>> +		 * at the same time).
>>  		 */
>
> This needs to be moved to the head of the function, and document
> the whole list walk, get, put and completion parts of the algorithm
> that make it safe. There's more to this than "we hold a reference
> count", especially the tricky "we might see the shrinker before it
> is fully initialised" case....

How about moving these documents to before list_for_each_entry_rcu(),
and then go to the head of shrink_slab_memcg() to explain the memcg
slab shrink case.

>
>
> .....
>>  void shrinker_free(struct shrinker *shrinker)
>>  {
>>  	struct dentry *debugfs_entry = NULL;
>> @@ -686,9 +712,18 @@ void shrinker_free(struct shrinker *shrinker)
>>  	if (!shrinker)
>>  		return;
>>
>> +	if (shrinker->flags & SHRINKER_REGISTERED) {
>> +		shrinker_put(shrinker);
>> +		wait_for_completion(&shrinker->done);
>> +	}
>
> Needs a comment explaining why we need to wait here...

	/*
	 * Wait for all lookups of the shrinker to complete, after that, no
	 * shrinker is running or will run again, then we can safely free
	 * the structure where the shrinker is located, such as super_block
	 * etc.
	 */

>> +
>>  	down_write(&shrinker_rwsem);
>>  	if (shrinker->flags & SHRINKER_REGISTERED) {
>> -		list_del(&shrinker->list);
>> +		/*
>> +		 * Lookups on the shrinker are over and will fail in the future,
>> +		 * so we can now remove it from the lists and free it.
>> +		 */
>
> .... rather than here after the wait has been done and provided the
> guarantee that no shrinker is running or will run again...

With the above comment, how about simplifying the comment here to the
following:

	/*
	 * Now we can safely remove it from the shrinker_list and free it.
	 */

Thanks,
Qi

>
> -Dave.
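The wait discussed in this exchange can be illustrated with a small single-threaded model of the reference lifecycle. It is a sketch under simplifying assumptions, not kernel code: `model_shrinker` and the `model_*()` helpers are hypothetical, a plain `int` stands in for `refcount_t`, and a `bool` stands in for `struct completion` (so "waiting" is represented by a return value instead of sleeping).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded model; not the kernel's struct shrinker. */
struct model_shrinker {
	int refcount;		/* stands in for refcount_t */
	bool done;		/* stands in for struct completion */
	bool registered;	/* stands in for SHRINKER_REGISTERED */
};

static inline bool model_try_get(struct model_shrinker *s)
{
	if (s->refcount == 0)
		return false;	/* not registered yet, or being freed */
	s->refcount++;
	return true;
}

static inline void model_put(struct model_shrinker *s)
{
	assert(s->refcount > 0);
	if (--s->refcount == 0)
		s->done = true;	/* complete(&shrinker->done) in the patch */
}

/* Mirrors the end of shrinker_register(): publish the initial reference. */
static inline void model_register(struct model_shrinker *s)
{
	s->done = false;
	s->registered = true;
	s->refcount = 1;
}

/*
 * Mirrors the first step of shrinker_free(): drop the initial reference.
 * Returns true if teardown may continue right away, false if the real code
 * would sleep in wait_for_completion() until in-flight users call put.
 */
static inline bool model_free_begin(struct model_shrinker *s)
{
	if (s->registered)
		model_put(s);
	return !s->registered || s->done;
}
```

In this model, a walker that took a reference before `model_free_begin()` keeps `done` false until its own `model_put()`, and any later `model_try_get()` fails. That is the guarantee the proposed comment describes: once the wait returns, no shrinker is running or will run again.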
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index eb342994675a..f06225f18531 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -4,6 +4,8 @@
 
 #include <linux/atomic.h>
 #include <linux/types.h>
+#include <linux/refcount.h>
+#include <linux/completion.h>
 
 #define SHRINKER_UNIT_BITS	BITS_PER_LONG
 
@@ -87,6 +89,10 @@ struct shrinker {
 	int seeks;	/* seeks to recreate an obj */
 	unsigned flags;
 
+	refcount_t refcount;
+	struct completion done;
+	struct rcu_head rcu;
+
 	void *private_data;
 
 	/* These are for internal use */
@@ -120,6 +126,17 @@ struct shrinker *shrinker_alloc(unsigned int flags, const char *fmt, ...);
 void shrinker_register(struct shrinker *shrinker);
 void shrinker_free(struct shrinker *shrinker);
 
+static inline bool shrinker_try_get(struct shrinker *shrinker)
+{
+	return refcount_inc_not_zero(&shrinker->refcount);
+}
+
+static inline void shrinker_put(struct shrinker *shrinker)
+{
+	if (refcount_dec_and_test(&shrinker->refcount))
+		complete(&shrinker->done);
+}
+
 #ifdef CONFIG_SHRINKER_DEBUG
 extern int __printf(2, 3) shrinker_debugfs_rename(struct shrinker *shrinker,
 						  const char *fmt, ...);
diff --git a/mm/shrinker.c b/mm/shrinker.c
index 1911c06b8af5..d318f5621862 100644
--- a/mm/shrinker.c
+++ b/mm/shrinker.c
@@ -2,6 +2,7 @@
 #include <linux/memcontrol.h>
 #include <linux/rwsem.h>
 #include <linux/shrinker.h>
+#include <linux/rculist.h>
 #include <trace/events/vmscan.h>
 
 #include "internal.h"
@@ -577,33 +578,42 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
 	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
 		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
 
-	if (!down_read_trylock(&shrinker_rwsem))
-		goto out;
-
-	list_for_each_entry(shrinker, &shrinker_list, list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
 		struct shrink_control sc = {
 			.gfp_mask = gfp_mask,
 			.nid = nid,
 			.memcg = memcg,
 		};
 
+		if (!shrinker_try_get(shrinker))
+			continue;
+
+		/*
+		 * We can safely unlock the RCU lock here since we already
+		 * hold the refcount of the shrinker.
+		 */
+		rcu_read_unlock();
+
 		ret = do_shrink_slab(&sc, shrinker, priority);
 		if (ret == SHRINK_EMPTY)
 			ret = 0;
 		freed += ret;
+
 		/*
-		 * Bail out if someone want to register a new shrinker to
-		 * prevent the registration from being stalled for long periods
-		 * by parallel ongoing shrinking.
+		 * This shrinker may be deleted from shrinker_list and freed
+		 * after the shrinker_put() below, but this shrinker is still
+		 * used for the next traversal. So it is necessary to hold the
+		 * RCU lock first to prevent this shrinker from being freed,
+		 * which also ensures that the next shrinker that is traversed
+		 * will not be freed (even if it is deleted from shrinker_list
+		 * at the same time).
 		 */
-		if (rwsem_is_contended(&shrinker_rwsem)) {
-			freed = freed ? : 1;
-			break;
-		}
+		rcu_read_lock();
+		shrinker_put(shrinker);
 	}
 
-	up_read(&shrinker_rwsem);
-out:
+	rcu_read_unlock();
 	cond_resched();
 	return freed;
 }
@@ -671,13 +681,29 @@ void shrinker_register(struct shrinker *shrinker)
 	}
 
 	down_write(&shrinker_rwsem);
-	list_add_tail(&shrinker->list, &shrinker_list);
+	list_add_tail_rcu(&shrinker->list, &shrinker_list);
 	shrinker->flags |= SHRINKER_REGISTERED;
 	shrinker_debugfs_add(shrinker);
 	up_write(&shrinker_rwsem);
+
+	init_completion(&shrinker->done);
+	/*
+	 * Now the shrinker is fully set up, take the first reference to it to
+	 * indicate that lookup operations are now allowed to use it via
+	 * shrinker_try_get().
+	 */
+	refcount_set(&shrinker->refcount, 1);
 }
 EXPORT_SYMBOL_GPL(shrinker_register);
 
+static void shrinker_free_rcu_cb(struct rcu_head *head)
+{
+	struct shrinker *shrinker = container_of(head, struct shrinker, rcu);
+
+	kfree(shrinker->nr_deferred);
+	kfree(shrinker);
+}
+
 void shrinker_free(struct shrinker *shrinker)
 {
 	struct dentry *debugfs_entry = NULL;
@@ -686,9 +712,18 @@ void shrinker_free(struct shrinker *shrinker)
 	if (!shrinker)
 		return;
 
+	if (shrinker->flags & SHRINKER_REGISTERED) {
+		shrinker_put(shrinker);
+		wait_for_completion(&shrinker->done);
+	}
+
 	down_write(&shrinker_rwsem);
 	if (shrinker->flags & SHRINKER_REGISTERED) {
-		list_del(&shrinker->list);
+		/*
+		 * Lookups on the shrinker are over and will fail in the future,
+		 * so we can now remove it from the lists and free it.
+		 */
+		list_del_rcu(&shrinker->list);
 		debugfs_entry = shrinker_debugfs_detach(shrinker, &debugfs_id);
 		shrinker->flags &= ~SHRINKER_REGISTERED;
 	} else {
@@ -702,9 +737,6 @@ void shrinker_free(struct shrinker *shrinker)
 	if (debugfs_entry)
 		shrinker_debugfs_remove(debugfs_entry, debugfs_id);
 
-	kfree(shrinker->nr_deferred);
-	shrinker->nr_deferred = NULL;
-
-	kfree(shrinker);
+	call_rcu(&shrinker->rcu, shrinker_free_rcu_cb);
 }
 EXPORT_SYMBOL_GPL(shrinker_free);
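The lock/get/unlock/work/lock/put ordering of the new shrink_slab() loop can be checked with a single-threaded model in which the RCU read-side section is reduced to a nesting counter. Everything here is an assumption made for illustration: `walk_shrinker`, `model_shrink_slab()` and the stubbed `model_rcu_read_*()` helpers are hypothetical, there is no concurrency, and a plain linked list replaces `list_for_each_entry_rcu()`. An entry with a zero refcount models a shrinker that is already on the list but whose initial reference has not been set yet (assuming the structure starts out zeroed), or one that is being freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static int rcu_depth;	/* model of the RCU read-side nesting level */

static inline void model_rcu_read_lock(void)
{
	rcu_depth++;
}

static inline void model_rcu_read_unlock(void)
{
	assert(rcu_depth > 0);
	rcu_depth--;
}

struct walk_shrinker {
	int refcount;		/* 0: skipped by lookups */
	unsigned long objects;	/* what the stub below pretends to free */
	struct walk_shrinker *next;
};

static inline bool walk_try_get(struct walk_shrinker *s)
{
	if (s->refcount == 0)
		return false;
	s->refcount++;
	return true;
}

static inline void walk_put(struct walk_shrinker *s)
{
	assert(s->refcount > 0);
	s->refcount--;
}

/* Stub for do_shrink_slab(): may sleep, so it runs outside the section. */
static inline unsigned long walk_do_shrink(struct walk_shrinker *s)
{
	assert(rcu_depth == 0);		/* read-side section was left */
	assert(s->refcount > 0);	/* the reference keeps 's' alive */
	return s->objects;
}

unsigned long model_shrink_slab(struct walk_shrinker *head)
{
	unsigned long freed = 0;
	struct walk_shrinker *s;

	model_rcu_read_lock();
	for (s = head; s; s = s->next) {
		if (!walk_try_get(s))
			continue;	/* still inside the read-side section */
		model_rcu_read_unlock();

		freed += walk_do_shrink(s);

		/* Re-enter before put so 's->next' is read under protection. */
		model_rcu_read_lock();
		walk_put(s);
	}
	model_rcu_read_unlock();
	return freed;
}
```

The detail the model makes visible is that the read-side section is re-entered before the reference is dropped, so the step to the next list entry always happens either under the reference or inside the section.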