Message ID | 20231017154439.3036608-1-chengming.zhou@linux.dev |
---|---|
Headers |
From: chengming.zhou@linux.dev
To: cl@linux.com, penberg@kernel.org
Cc: rientjes@google.com, iamjoonsoo.kim@lge.com, akpm@linux-foundation.org, vbabka@suse.cz, roman.gushchin@linux.dev, 42.hyeyoo@gmail.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, chengming.zhou@linux.dev, Chengming Zhou <zhouchengming@bytedance.com>
Subject: [RFC PATCH 0/5] slub: Delay freezing of CPU partial slabs
Date: Tue, 17 Oct 2023 15:44:34 +0000
Message-Id: <20231017154439.3036608-1-chengming.zhou@linux.dev>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Series | slub: Delay freezing of CPU partial slabs | |
Message
Chengming Zhou
Oct. 17, 2023, 3:44 p.m. UTC
From: Chengming Zhou <zhouchengming@bytedance.com>
1. Problem
==========
Currently we have to freeze the slab when taking it from the node partial
list, and unfreeze the slab when putting it back on the node partial list,
because we rely on the node list_lock to synchronize changes to the
"frozen" bit.
This implementation has some drawbacks:
- Alloc path: two cmpxchg_double operations.
When the allocator has used up the CPU partial slabs, it has to get some
partial slabs from the node. It freezes each slab (one cmpxchg_double)
with the node list_lock held, then puts those frozen slabs on its CPU
partial list. Later, ___slab_alloc() runs the cmpxchg_double try-loop
again if one of those slabs is picked for use.
- Alloc path: amplified contention on the node list_lock.
Since we have to synchronize "frozen" bit changes under the node
list_lock, contention on a slab (struct page) can be transferred to the
node list_lock. On a machine with many CPUs in one node, list_lock
contention is amplified by the alloc paths of all CPUs. The current code
works around this by avoiding the cmpxchg_double try-loop: it simply
breaks out and returns when contention on the page is encountered and the
first cmpxchg_double fails. But this workaround has problems of its own.
- Free path: redundant unfreeze.
__slab_free() freezes and caches some slabs on its CPU partial list, and
flushes them to the node partial list when the list grows too long, which
requires unfreezing those slabs again under the node list_lock. We don't
actually need slabs on the CPU partial list to be frozen, in which case
we could save the unfreeze cmpxchg_double operations in the flush path.
2. Solution
===========
We solve these problems by leaving slabs unfrozen when they move off the
node partial list and onto the CPU partial list, so the "frozen" bit
stays 0.
These partial slabs are not manipulated concurrently by the alloc path;
the only racer is the free path, which may manipulate a slab's list
linkage when the slab becomes !inuse. So we need another way to
synchronize against that: we use a bit in slab->flags to indicate whether
the slab is on the node partial list, and only in that case may the
slab's list be manipulated.
The freeze is delayed until the slab is picked for active use by a CPU,
at which point it also becomes full, so we still rely on the "frozen" bit
to keep its list from being manipulated. Thus a slab is frozen only when
it is activated for use and unfrozen only when it is deactivated.
3. Patches
==========
Patch-1 introduces a new bit in slab->flags to indicate whether the slab
is on the node partial list; the bit is protected by the node list_lock.
Patch-2 changes the free path to check whether the slab is on the node
partial list, since only in that case may it manipulate the slab's list.
This lets us keep unfrozen partial slabs off the node partial list,
because the free path will no longer manipulate them concurrently.
Patch-3 optimizes the deactivate path: we can unfreeze the slab directly
(since the node list_lock is no longer needed to synchronize the "frozen"
bit) and then grab the node list_lock only if the slab needs to be put on
the node partial list.
Patch-4 changes the code to not freeze slabs when moving them off the
node partial list or putting them on the CPU partial list, so these slabs
no longer need to be unfrozen when flushed back from the CPU partial list
to the node partial list.
Patch-5 changes the alloc path to freeze a CPU partial slab only when it
is picked for use.
4. Testing
==========
For now, we have only done some simple testing on a server with 128 CPUs
(2 nodes) to compare performance.
- perf bench sched messaging -g 5 -t -l 100000
baseline RFC
7.042s 6.966s
7.022s 7.045s
7.054s 6.985s
- stress-ng --rawpkt 128 --rawpkt-ops 100000000
baseline RFC
2.42s 2.15s
2.45s 2.16s
2.44s 2.17s
The numbers above show about a 10% improvement on the stress-ng rawpkt
testcase, though not much improvement on the perf sched bench testcase.
Thanks for any comment and code review!
Chengming Zhou (5):
slub: Introduce on_partial()
slub: Don't manipulate slab list when used by cpu
slub: Optimize deactivate_slab()
slub: Don't freeze slabs for cpu partial
slub: Introduce get_cpu_partial()
mm/slab.h | 2 +-
mm/slub.c | 257 +++++++++++++++++++++++++++++++-----------------------
2 files changed, 150 insertions(+), 109 deletions(-)
Comments
On Wed, Oct 18, 2023 at 12:45 AM <chengming.zhou@linux.dev> wrote:
> 4. Testing
> ==========
> We just did some simple testing on a server with 128 CPUs (2 nodes) to
> compare performance for now.
>
> - perf bench sched messaging -g 5 -t -l 100000
> baseline RFC
> 7.042s 6.966s
> 7.022s 7.045s
> 7.054s 6.985s
>
> - stress-ng --rawpkt 128 --rawpkt-ops 100000000
> baseline RFC
> 2.42s 2.15s
> 2.45s 2.16s
> 2.44s 2.17s
>
> It shows above there is about 10% improvement on stress-ng rawpkt
> testcase, although no much improvement on perf sched bench testcase.
>
> Thanks for any comment and code review!

Hi Chengming, this is the kerneltesting.org test report for your patch series.

I applied this series on my slab-experimental tree [1] for testing,
and I observed several kernel panics [2] [3] [4] on kernels without
CONFIG_SLUB_CPU_PARTIAL.

To verify that this series caused kernel panics, I tested before and after
applying it on Vlastimil's slab/for-next and yeah, this series was the cause.

System is deadlocked on memory and the OOM-killer says there is a
huge amount of slab memory. So maybe there is a memory leak or it makes
slab memory grow unboundedly?

[1] https://git.kerneltesting.org/slab-experimental/
[2] https://lava.kerneltesting.org/scheduler/job/127#bottom
[3] https://lava.kerneltesting.org/scheduler/job/131#bottom
[4] https://lava.kerneltesting.org/scheduler/job/134#bottom

> Chengming Zhou (5):
> slub: Introduce on_partial()
> slub: Don't manipulate slab list when used by cpu
> slub: Optimize deactivate_slab()
> slub: Don't freeze slabs for cpu partial
> slub: Introduce get_cpu_partial()
>
> mm/slab.h | 2 +-
> mm/slub.c | 257 +++++++++++++++++++++++++++++++-----------------------
> 2 files changed, 150 insertions(+), 109 deletions(-)
>
> --
> 2.40.1
On 2023/10/18 14:34, Hyeonggon Yoo wrote:
> Hi Chengming, this is the kerneltesting.org test report for your patch series.
>
> I applied this series on my slab-experimental tree [1] for testing,
> and I observed several kernel panics [2] [3] [4] on kernels without
> CONFIG_SLUB_CPU_PARTIAL.
>
> To verify that this series caused kernel panics, I tested before and after
> applying it on Vlastimil's slab/for-next and yeah, this series was the cause.
>
> System is deadlocked on memory and the OOM-killer says there is a
> huge amount of slab memory. So maybe there is a memory leak or it makes
> slab memory grow unboundedly?

Thanks for the testing! I can reproduce the OOM locally without
CONFIG_SLUB_CPU_PARTIAL. I made a quick fix below (will need to get
another better fix). The root cause is in patch-4, which wrongly put some
partial slabs onto the CPU partial list even without
CONFIG_SLUB_CPU_PARTIAL. So these partial slabs are leaked.
diff --git a/mm/slub.c b/mm/slub.c
index d58eaf8447fd..b7ba6c008122 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2339,12 +2339,12 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 		}
 	}

+#ifdef CONFIG_SLUB_CPU_PARTIAL
 		remove_partial(n, slab);
 		put_cpu_partial(s, slab, 0);
 		stat(s, CPU_PARTIAL_NODE);
 		partial_slabs++;

-#ifdef CONFIG_SLUB_CPU_PARTIAL
 		if (!kmem_cache_has_cpu_partial(s)
 		    || partial_slabs > s->cpu_partial_slabs / 2)
 			break;

> [1] https://git.kerneltesting.org/slab-experimental/
> [2] https://lava.kerneltesting.org/scheduler/job/127#bottom
> [3] https://lava.kerneltesting.org/scheduler/job/131#bottom
> [4] https://lava.kerneltesting.org/scheduler/job/134#bottom
>
>> Chengming Zhou (5):
>> slub: Introduce on_partial()
>> slub: Don't manipulate slab list when used by cpu
>> slub: Optimize deactivate_slab()
>> slub: Don't freeze slabs for cpu partial
>> slub: Introduce get_cpu_partial()
>>
>> mm/slab.h | 2 +-
>> mm/slub.c | 257 +++++++++++++++++++++++++++++++-----------------------
>> 2 files changed, 150 insertions(+), 109 deletions(-)
>>
>> --
>> 2.40.1