Message ID: 20230226144655.79778-1-zhengqi.arch@bytedance.com
From: Qi Zheng <zhengqi.arch@bytedance.com>
To: akpm@linux-foundation.org, tkhai@ya.ru, hannes@cmpxchg.org, shakeelb@google.com, mhocko@kernel.org, roman.gushchin@linux.dev, muchun.song@linux.dev, david@redhat.com, shy828301@gmail.com
Cc: sultan@kerneltoast.com, dave@stgolabs.net, penguin-kernel@I-love.SAKURA.ne.jp, paulmck@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Qi Zheng <zhengqi.arch@bytedance.com>
Subject: [PATCH v3 0/8] make slab shrink lockless
Date: Sun, 26 Feb 2023 22:46:47 +0800
Series: make slab shrink lockless
Message
Qi Zheng
Feb. 26, 2023, 2:46 p.m. UTC
Hi all,

This patch series aims to make slab shrink lockless.

1. Background
=============

On our servers, we often find the following system cpu hotspots:

  44.16%  [kernel]  [k] down_read_trylock
  14.12%  [kernel]  [k] up_read
  13.43%  [kernel]  [k] shrink_slab
   5.25%  [kernel]  [k] count_shadow_nodes
   3.42%  [kernel]  [k] idr_find

Then we used bpftrace to capture its calltrace as follows:

@[
    down_read_trylock+5
    shrink_slab+292
    shrink_node+640
    do_try_to_free_pages+211
    try_to_free_mem_cgroup_pages+266
    try_charge_memcg+386
    charge_memcg+51
    __mem_cgroup_charge+44
    __handle_mm_fault+1416
    handle_mm_fault+260
    do_user_addr_fault+459
    exc_page_fault+104
    asm_exc_page_fault+38
    clear_user_rep_good+18
    read_zero+100
    vfs_read+176
    ksys_read+93
    do_syscall_64+62
    entry_SYSCALL_64_after_hwframe+114
]: 1868979

It is easy to see that this is caused by the frequent failure to obtain
the read lock of shrinker_rwsem when reclaiming slab memory.

Currently, the shrinker_rwsem is a global lock. And the following cases
may cause the above system cpu hotspots:

a. the write lock of shrinker_rwsem was held for too long. For example,
   there are many memcgs in the system, which causes some paths to hold
   the lock while traversing all of them for too long
   (e.g. expand_shrinker_info()).

b. the read lock of shrinker_rwsem was held for too long, and a writer
   came at this time. Then this writer will be forced to wait and block
   all subsequent readers. For example:
   - being scheduled out while the read lock of shrinker_rwsem is held
     in do_shrink_slab()
   - some shrinkers are blocked for too long, like the case mentioned
     in the patchset [1].

[1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/

All the down_read_trylock() hotspots caused by the above cases can be
solved by replacing the shrinker_rwsem trylocks with SRCU.

2. Survey
=========

Before doing the code implementation, I found that there were many
similar submissions in the community:

a. Davidlohr Bueso submitted a patch in 2015.
   Subject: [PATCH -next v2] mm: srcu-ify shrinkers
   Link: https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
   Result: It was finally merged into the linux-next branch, but failed
   on arm allnoconfig (without CONFIG_SRCU).

b. Tetsuo Handa submitted a patchset in 2017.
   Subject: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
   Link: https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
   Result: The current simple approach (break when rwsem_is_contended())
   was chosen in the end. Christoph Hellwig suggested using SRCU, but
   SRCU was not unconditionally enabled at the time.

c. Kirill Tkhai submitted a patchset in 2018.
   Subject: [PATCH RFC 00/10] Introduce lockless shrink_slab()
   Link: https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
   Result: At that time, SRCU was not unconditionally enabled, and there
   were some objections to enabling it. Later, Kirill's focus moved to
   other things, so the patchset was not updated further.

d. Sultan Alsawaf submitted a patch in 2021.
   Subject: [PATCH] mm: vmscan: Replace shrinker_rwsem trylocks with SRCU protection
   Link: https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
   Result: Rejected because SRCU was not unconditionally enabled.

We can see that almost all of these historical submissions were
abandoned because SRCU was not unconditionally enabled. But SRCU was
unconditionally enabled by Paul E. McKenney in 2023 [2], so it's time to
replace the shrinker_rwsem trylocks with SRCU.
[2]. https://lore.kernel.org/lkml/20230105003759.GA1769545@paulmck-ThinkPad-P17-Gen-1/

3. Reproduction and testing
===========================

We can reproduce the down_read_trylock() hotspot with the following
script:

```
#!/bin/bash

DIR="/root/shrinker/memcg/mnt"

do_create()
{
    mkdir /sys/fs/cgroup/memory/test
    echo 200M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    for i in `seq 0 $1`;
    do
        mkdir /sys/fs/cgroup/memory/test/$i;
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        mkdir -p $DIR/$i;
    done
}

do_mount()
{
    for i in `seq $1 $2`;
    do
        mount -t tmpfs $i $DIR/$i;
    done
}

do_touch()
{
    for i in `seq $1 $2`;
    do
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
    done
}

do_create 2000
do_mount 0 2000
do_touch 0 1000
```

Save the above script and execute it, and we get the following perf
hotspots:

  46.60%  [kernel]  [k] down_read_trylock
  18.70%  [kernel]  [k] up_read
  15.44%  [kernel]  [k] shrink_slab
   4.37%  [kernel]  [k] _find_next_bit
   2.75%  [kernel]  [k] xa_load
   2.07%  [kernel]  [k] idr_find
   1.73%  [kernel]  [k] do_shrink_slab
   1.42%  [kernel]  [k] shrink_lruvec
   0.74%  [kernel]  [k] shrink_node
   0.60%  [kernel]  [k] list_lru_count_one

After applying this patchset, the hotspots become the following:

  19.53%  [kernel]  [k] _find_next_bit
  14.63%  [kernel]  [k] do_shrink_slab
  14.58%  [kernel]  [k] shrink_slab
  11.83%  [kernel]  [k] shrink_lruvec
   9.33%  [kernel]  [k] __blk_flush_plug
   6.67%  [kernel]  [k] mem_cgroup_iter
   3.73%  [kernel]  [k] list_lru_count_one
   2.43%  [kernel]  [k] shrink_node
   1.96%  [kernel]  [k] super_cache_count
   1.78%  [kernel]  [k] __rcu_read_unlock
   1.38%  [kernel]  [k] __srcu_read_lock
   1.30%  [kernel]  [k] xas_descend

We can see that slab reclaim is no longer blocked by the shrinker_rwsem
trylock, which realizes lockless slab reclaim.

This series is based on next-20230217.

Comments and suggestions are welcome.

Thanks,
Qi.

Changelog in v2 -> v3:
 - fix bug in [PATCH v2 1/7] (per Kirill)
 - add Kirill's patch which restores a check similar to the
   rwsem_is_contended() check by adding shrinker_srcu_generation

Changelog in v1 -> v2:
 - add a map_nr_max field to shrinker_info (suggested by Kirill)
 - use shrinker_mutex in reparent_shrinker_deferred() (pointed out by
   Kirill)

Kirill Tkhai (1):
  mm: vmscan: add shrinker_srcu_generation

Qi Zheng (7):
  mm: vmscan: add a map_nr_max field to shrinker_info
  mm: vmscan: make global slab shrink lockless
  mm: vmscan: make memcg slab shrink lockless
  mm: shrinkers: make count and scan in shrinker debugfs lockless
  mm: vmscan: hold write lock to reparent shrinker nr_deferred
  mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()
  mm: shrinkers: convert shrinker_rwsem to mutex

 drivers/md/dm-cache-metadata.c |   2 +-
 drivers/md/dm-thin-metadata.c  |   2 +-
 fs/super.c                     |   2 +-
 include/linux/memcontrol.h     |   1 +
 mm/shrinker_debug.c            |  38 +++-----
 mm/vmscan.c                    | 166 +++++++++++++++++++--------------
 6 files changed, 112 insertions(+), 99 deletions(-)
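For reference before the discussion below, the core of the conversion is
small. The following is a condensed, illustrative sketch of the change to
the global shrink_slab() path, not the verbatim patch; nr_deferred
handling and the memcg path are omitted (see the mm/vmscan.c changes in
the patches for the real code):

```c
/*
 * Condensed, illustrative sketch of the shrink_slab() conversion;
 * not the verbatim patch. Error paths and nr_deferred handling are
 * omitted.
 */

/* Before: every reclaimer must win a trylock on a global rwsem. */
if (!down_read_trylock(&shrinker_rwsem))
	goto out;				/* slab reclaim silently skipped */
list_for_each_entry(shrinker, &shrinker_list, list)
	freed += do_shrink_slab(&sc, shrinker, priority);
up_read(&shrinker_rwsem);

/* After: readers take an SRCU read lock, which never fails and only
 * touches a per-CPU counter; writers (register/unregister paths) wait
 * for in-flight readers via synchronize_srcu(&shrinker_srcu).
 */
srcu_idx = srcu_read_lock(&shrinker_srcu);
list_for_each_entry_srcu(shrinker, &shrinker_list, list,
			 srcu_read_lock_held(&shrinker_srcu))
	freed += do_shrink_slab(&sc, shrinker, priority);
srcu_read_unlock(&shrinker_srcu, srcu_idx);
```

The key property is that the SRCU read side never fails and never
contends on a shared cache line, while registration and unregistration
wait for readers instead of excluding them up front.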
Comments
On Sun, 26 Feb 2023 22:46:47 +0800 Qi Zheng <zhengqi.arch@bytedance.com> wrote:

> Hi all,
>
> This patch series aims to make slab shrink lockless.

What an awesome changelog.

> 2. Survey
> =========

Especially this part.

Looking through all the prior efforts and at this patchset I am not
immediately seeing any statements about the overall effect upon
real-world workloads. For a good example, does this patchset measurably
improve throughput or energy consumption on your servers?
On 2023/2/27 03:51, Andrew Morton wrote:
> On Sun, 26 Feb 2023 22:46:47 +0800 Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>
>> Hi all,
>>
>> This patch series aims to make slab shrink lockless.
>
> What an awesome changelog.
>
>> 2. Survey
>> =========
>
> Especially this part.
>
> Looking through all the prior efforts and at this patchset I am not
> immediately seeing any statements about the overall effect upon
> real-world workloads. For a good example, does this patchset
> measurably improve throughput or energy consumption on your servers?

Hi Andrew,

I re-tested with the following physical machine:

Architecture:        x86_64
CPU(s):              96
On-line CPU(s) list: 0-95
Model name:          Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz

I found that the explanation for the hotspot I gave in the cover letter
is wrong. The down_read_trylock() hotspot is not caused by trylock
failures, but simply by the atomic operation (cmpxchg) itself. And this
leads to a significant reduction in IPC (instructions per cycle).

To verify this, I did the following tests:

1. Run the following script to create down_read_trylock() hotspots:

```
#!/bin/bash

DIR="/root/shrinker/memcg/mnt"

do_create()
{
    mkdir -p /sys/fs/cgroup/memory/test
    mkdir -p /sys/fs/cgroup/perf_event/test
    echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    for i in `seq 0 $1`;
    do
        mkdir -p /sys/fs/cgroup/memory/test/$i;
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
        mkdir -p $DIR/$i;
    done
}

do_mount()
{
    for i in `seq $1 $2`;
    do
        mount -t tmpfs $i $DIR/$i;
    done
}

do_touch()
{
    for i in `seq $1 $2`;
    do
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
        dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
    done
}

case "$1" in
  touch)
    do_touch $2 $3
    ;;
  test)
    do_create 4000
    do_mount 0 4000
    do_touch 0 3000
    ;;
  *)
    exit 1
    ;;
esac
```

Save the above script, then run the test and touch commands.

Then we can use the following perf command to view hotspots:

perf top -U -F 999

1) Before applying this patchset:

  32.31%  [kernel]  [k] down_read_trylock
  19.40%  [kernel]  [k] pv_native_safe_halt
  16.24%  [kernel]  [k] up_read
  15.70%  [kernel]  [k] shrink_slab
   4.69%  [kernel]  [k] _find_next_bit
   2.62%  [kernel]  [k] shrink_node
   1.78%  [kernel]  [k] shrink_lruvec
   0.76%  [kernel]  [k] do_shrink_slab

2) After applying this patchset:

  27.83%  [kernel]  [k] _find_next_bit
  16.97%  [kernel]  [k] shrink_slab
  15.82%  [kernel]  [k] pv_native_safe_halt
   9.58%  [kernel]  [k] shrink_node
   8.31%  [kernel]  [k] shrink_lruvec
   5.64%  [kernel]  [k] do_shrink_slab
   3.88%  [kernel]  [k] mem_cgroup_iter

2. At the same time, use the following perf command to capture IPC
information:

perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10

1) Before applying this patchset:

 Performance counter stats for 'system wide' (5 runs):

     454187219766      cycles        test                            ( +-  1.84% )
      78896433101      instructions  test  # 0.17 insn per cycle     ( +-  0.44% )

       10.0020430 +- 0.0000366 seconds time elapsed  ( +-  0.00% )

2) After applying this patchset:

 Performance counter stats for 'system wide' (5 runs):

     841954709443      cycles        test                            ( +- 15.80% )  (98.69%)
     527258677936      instructions  test  # 0.63 insn per cycle     ( +- 15.11% )  (98.68%)

         10.01064 +- 0.00831 seconds time elapsed  ( +-  0.08% )

We can see that IPC drops very seriously when calling
down_read_trylock() at high frequency. After using SRCU, the IPC is at
a normal level.

Thanks,
Qi
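The cmpxchg cost described above comes from the rwsem reader fast path.
As a rough sketch (abridged and simplified from kernel/locking/rwsem.c),
down_read_trylock() boils down to:

```c
/*
 * Simplified sketch of the rwsem reader-trylock fast path, abridged
 * from kernel/locking/rwsem.c; slowpath details are omitted.
 */
static inline int __down_read_trylock(struct rw_semaphore *sem)
{
	long tmp = atomic_long_read(&sem->count);

	while (!(tmp & RWSEM_READ_FAILED_MASK)) {
		/* cmpxchg on a single shared cache line: with many CPUs
		 * doing this at high frequency, the line bounces between
		 * cores even when every trylock succeeds. */
		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
						    tmp + RWSEM_READER_BIAS))
			return 1;
	}
	return 0;
}
```

Even when every trylock succeeds, the cmpxchg bounces the sem->count
cache line between the 96 cores, which is what shows up as the IPC
collapse; srcu_read_lock() instead only increments a per-CPU counter.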
Hi,

On Mon, Feb 27, 2023 at 09:31:51PM +0800, Qi Zheng wrote:
> I found that the explanation for the hotspot I gave in the cover letter
> is wrong. The down_read_trylock() hotspot is not caused by trylock
> failures, but simply by the atomic operation (cmpxchg) itself. And this
> leads to a significant reduction in IPC (instructions per cycle).

...

> We can see that IPC drops very seriously when calling
> down_read_trylock() at high frequency. After using SRCU, the IPC is at
> a normal level.

The results you present do show improvement in IPC for an artificial
test script. But more interesting would be to see how real world
workloads benefit from your changes.
On Mon, Feb 27, 2023 at 09:31:51PM +0800, Qi Zheng wrote:
> Save the above script, then run the test and touch commands.
>
> Then we can use the following perf command to view hotspots:
>
> perf top -U -F 999
>
> 1) Before applying this patchset:
>
>   32.31%  [kernel]  [k] down_read_trylock
>   19.40%  [kernel]  [k] pv_native_safe_halt
>   16.24%  [kernel]  [k] up_read
>   15.70%  [kernel]  [k] shrink_slab
>    4.69%  [kernel]  [k] _find_next_bit
>    2.62%  [kernel]  [k] shrink_node
>    1.78%  [kernel]  [k] shrink_lruvec
>    0.76%  [kernel]  [k] do_shrink_slab
>
> 2) After applying this patchset:
>
>   27.83%  [kernel]  [k] _find_next_bit
>   16.97%  [kernel]  [k] shrink_slab
>   15.82%  [kernel]  [k] pv_native_safe_halt
>    9.58%  [kernel]  [k] shrink_node
>    8.31%  [kernel]  [k] shrink_lruvec
>    5.64%  [kernel]  [k] do_shrink_slab
>    3.88%  [kernel]  [k] mem_cgroup_iter

Not opposing the intention of the patchset in any way (I actually think
it's a good idea to make the shrinkers list lockless), but looking at
both outputs above I think that the main problem is not the contention
on the semaphore, but the reason for this contention.

It seems like often there is a long list of shrinkers which can barely
reclaim any memory, but we're calling them again and again. In order to
achieve real wins with real-life workloads, I guess that's what we
should optimize.

Thanks!
On 27.02.2023 18:08, Mike Rapoport wrote:
> On Mon, Feb 27, 2023 at 09:31:51PM +0800, Qi Zheng wrote:
>> We can see that IPC drops very seriously when calling
>> down_read_trylock() at high frequency. After using SRCU, the IPC is
>> at a normal level.
>
> The results you present do show improvement in IPC for an artificial
> test script. But more interesting would be to see how real world
> workloads benefit from your changes.

One of the real workloads from my experience is the start of an
overcommitted node containing many starting containers after a node
crash (or many resuming containers after a reboot for a kernel update).
In these cases memory pressure is huge, and the node goes round in long
reclaim.

This patchset makes prealloc_memcg_shrinker() independent of
do_shrink_slab(), so prealloc_memcg_shrinker() won't have to wait until
shrink_slab_memcg() completes its current bit iteration, sees
rwsem_is_contended(), and breaks out of the iteration.

Also, it's important to mention that currently we have this strange
behavior:

  prealloc_memcg_shrinker()
    down_write(&shrinker_rwsem)
      idr_alloc()
        reclaim
          for each child memcg
            shrink_slab_memcg()
              down_read_trylock(&shrinker_rwsem) -> fail

All the slab reclaim in this scenario is just parasitic work: it only
wastes our cpu time, which does not look like a good design.

Kirill
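For context, the pattern Kirill describes lives in the current
shrink_slab_memcg(). The sketch below is condensed from mm/vmscan.c
and is illustrative rather than verbatim (the SHRINKER_NONSLAB check,
bit clearing, and nr_deferred handling are omitted):

```c
/* Condensed sketch of the pre-series shrink_slab_memcg();
 * illustrative, not the verbatim kernel code. */
static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
				       struct mem_cgroup *memcg, int priority)
{
	struct shrinker_info *info;
	unsigned long freed = 0;
	int i;

	/* A writer (e.g. prealloc_memcg_shrinker()) holds shrinker_rwsem:
	 * the whole memcg slab reclaim pass is silently skipped. */
	if (!down_read_trylock(&shrinker_rwsem))
		return 0;

	info = shrinker_info_protected(memcg, nid);
	for_each_set_bit(i, info->map, shrinker_nr_max) {
		struct shrink_control sc = {
			.gfp_mask = gfp_mask,
			.nid = nid,
			.memcg = memcg,
		};
		struct shrinker *shrinker = idr_find(&shrinker_idr, i);

		if (shrinker)
			freed += do_shrink_slab(&sc, shrinker, priority);

		/* A writer showed up mid-walk: report token progress and
		 * bail out so the writer can proceed. */
		if (rwsem_is_contended(&shrinker_rwsem)) {
			freed = freed ? : 1;
			break;
		}
	}

	up_read(&shrinker_rwsem);
	return freed;
}
```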
On Mon, Feb 27, 2023 at 10:20:59PM +0300, Kirill Tkhai wrote:
> One of the real workloads from my experience is the start of an
> overcommitted node containing many starting containers after a node
> crash (or many resuming containers after a reboot for a kernel
> update). In these cases memory pressure is huge, and the node goes
> round in long reclaim.
>
> This patchset makes prealloc_memcg_shrinker() independent of
> do_shrink_slab(), so prealloc_memcg_shrinker() won't have to wait
> until shrink_slab_memcg() completes its current bit iteration, sees
> rwsem_is_contended(), and breaks out of the iteration.
>
> Also, it's important to mention that currently we have this strange
> behavior:
>
>   prealloc_memcg_shrinker()
>     down_write(&shrinker_rwsem)
>       idr_alloc()
>         reclaim
>           for each child memcg
>             shrink_slab_memcg()
>               down_read_trylock(&shrinker_rwsem) -> fail

But this can happen only if we get -ENOMEM in idr_alloc()?
Doesn't seem to be a very hot path.

Thanks!
On 27.02.2023 22:32, Roman Gushchin wrote:
> On Mon, Feb 27, 2023 at 10:20:59PM +0300, Kirill Tkhai wrote:
>> Also, it's important to mention that currently we have this strange
>> behavior:
>>
>>   prealloc_memcg_shrinker()
>>     down_write(&shrinker_rwsem)
>>       idr_alloc()
>>         reclaim
>>           for each child memcg
>>             shrink_slab_memcg()
>>               down_read_trylock(&shrinker_rwsem) -> fail
>
> But this can happen only if we get -ENOMEM in idr_alloc()?
> Doesn't seem to be a very hot path.

There is not only idr_alloc(), but expand_shrinker_info() too. The
latter is heavier. But despite that, yes, it's not a hot path.

The memory pressure on overcommitted node start I described above is a
regular situation. There are lots of register_shrinker() calls
contending with reclaim.
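To make the write-side cost concrete, here is a similarly condensed
sketch of the pre-series prealloc_memcg_shrinker() (simplified from
mm/vmscan.c; the exact error unwinding may differ). The write lock is
held across both idr_alloc() and expand_shrinker_info(), and the latter
walks every memcg in the system to grow its shrinker bitmap:

```c
/* Condensed sketch of the pre-series prealloc_memcg_shrinker();
 * illustrative, not the verbatim kernel code. */
static int prealloc_memcg_shrinker(struct shrinker *shrinker)
{
	int id, ret = -ENOMEM;

	down_write(&shrinker_rwsem);	/* all reader trylocks now fail */

	/* GFP_KERNEL allocation: under pressure this may enter direct
	 * reclaim while the write lock is held (the loop in the
	 * diagram above). */
	id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
	if (id < 0)
		goto unlock;

	if (id >= shrinker_nr_max) {
		/* Reallocates the shrinker bitmap of every memcg in the
		 * system; the "heavier" part with thousands of memcgs. */
		if (expand_shrinker_info(id)) {
			idr_remove(&shrinker_idr, id);
			goto unlock;
		}
		shrinker_nr_max = id + 1;
	}
	shrinker->id = id;
	ret = 0;
unlock:
	up_write(&shrinker_rwsem);
	return ret;
}
```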
On 2023/2/27 23:08, Mike Rapoport wrote:
> On Mon, Feb 27, 2023 at 09:31:51PM +0800, Qi Zheng wrote:
>> We can see that IPC drops very seriously when calling
>> down_read_trylock() at high frequency. After using SRCU, the IPC is
>> at a normal level.
>
> The results you present do show improvement in IPC for an artificial
> test script. But more interesting would be to see how real world
> workloads benefit from your changes.

Hi Mike and Andrew,

I did encounter this problem under a real workload on our online
servers. At the end of this email, I have posted another call stack and
hotspot that I found before.

I scanned the hotspots of all our online servers yesterday and today,
but unfortunately did not catch the hotspot live.

Some of our servers have a large number of containers, and each
container mounts some file systems. This is likely to trigger
down_read_trylock() hotspots when the memory pressure of the whole
machine or the memory pressure of a memcg is high.

So yesterday I found a physical server with a configuration similar to
the online servers for a simulation test. The call stack and the
hotspot in the simulation test are almost exactly the same, so in
theory, when such a hotspot appears on an online server, we can also
enjoy the improvement in IPC. This will improve the performance of the
server in memory exhaustion scenarios (at the memcg or global level).

And the above scenario is only one aspect; the other aspect is the lock
contention scenario mentioned by Kirill. After applying this patchset,
slab shrink and register_shrinker() can be completely parallelized,
which fixes that problem.

These are the two main benefits for real workloads that I consider.

Thanks,
Qi

call stack
----------

@[
    down_read_trylock+1
    shrink_slab+128
    shrink_node+371
    do_try_to_free_pages+232
    try_to_free_pages+243
    __alloc_pages_slowpath+771
    __alloc_pages_nodemask+702
    pagecache_get_page+255
    filemap_fault+1361
    ext4_filemap_fault+44
    __do_fault+76
    handle_mm_fault+3543
    do_user_addr_fault+442
    do_page_fault+48
    page_fault+62
]: 1161690
@[
    down_read_trylock+1
    shrink_slab+128
    shrink_node+371
    balance_pgdat+690
    kswapd+389
    kthread+246
    ret_from_fork+31
]: 8424884
@[
    down_read_trylock+1
    shrink_slab+128
    shrink_node+371
    do_try_to_free_pages+232
    try_to_free_pages+243
    __alloc_pages_slowpath+771
    __alloc_pages_nodemask+702
    __do_page_cache_readahead+244
    filemap_fault+1674
    ext4_filemap_fault+44
    __do_fault+76
    handle_mm_fault+3543
    do_user_addr_fault+442
    do_page_fault+48
    page_fault+62
]: 20917631

hotspot
-------

  52.22%  [kernel]  [k] down_read_trylock
  19.60%  [kernel]  [k] up_read
   8.86%  [kernel]  [k] shrink_slab
   2.44%  [kernel]  [k] idr_find
   1.25%  [kernel]  [k] count_shadow_nodes
   1.18%  [kernel]  [k] shrink_lruvec
   0.71%  [kernel]  [k] mem_cgroup_iter
   0.71%  [kernel]  [k] shrink_node
   0.55%  [kernel]  [k] find_next_bit
On 2023/2/28 03:20, Kirill Tkhai wrote:
> One of the real workloads from my experience is the start of an
> overcommitted node containing many starting containers after a node
> crash (or many resuming containers after a reboot for a kernel
> update). In these cases memory pressure is huge, and the node goes
> round in long reclaim.

Thanks a lot for providing this real workload! :)

> This patchset makes prealloc_memcg_shrinker() independent of
> do_shrink_slab(), so prealloc_memcg_shrinker() won't have to wait
> until shrink_slab_memcg() completes its current bit iteration, sees
> rwsem_is_contended(), and breaks out of the iteration.
>
> Also, it's important to mention that currently we have this strange
> behavior:
>
>   prealloc_memcg_shrinker()
>     down_write(&shrinker_rwsem)
>       idr_alloc()
>         reclaim
>           for each child memcg
>             shrink_slab_memcg()
>               down_read_trylock(&shrinker_rwsem) -> fail
>
> All the slab reclaim in this scenario is just parasitic work: it only
> wastes our cpu time, which does not look like a good design.
>
> Kirill
On 2023/2/28 03:02, Roman Gushchin wrote:
> Not opposing the intention of the patchset in any way (I actually
> think it's a good idea to make the shrinkers list lockless), but
> looking at both outputs above I think that the main problem is not
> the contention on the semaphore, but the reason for this contention.

Yes, in the above scenario, there is indeed no lock contention problem.

> It seems like often there is a long list of shrinkers which can
> barely reclaim any memory, but we're calling them again and again.
> In order to achieve real wins with real-life workloads, I guess
> that's what we should optimize.
>
> Thanks!
On 2023/2/28 18:04, Qi Zheng wrote:
> On 2023/2/27 23:08, Mike Rapoport wrote:
>> The results you present do show improvement in IPC for an artificial
>> test script. But more interesting would be to see how real world
>> workloads benefit from your changes.
>
> I scanned the hotspots of all our online servers yesterday and today,
> but unfortunately did not catch the hotspot live.
>
> Some of our servers have a large number of containers, and each
> container mounts some file systems. This is likely to trigger
> down_read_trylock() hotspots when the memory pressure of the whole
> machine or the memory pressure of a memcg is high.

And the servers where this hotspot has happened (we have a hotspot
alarm record) basically have 96 cores, 128 cores, or even more.
On Mon 27-02-23 17:08:30, Mike Rapoport wrote:
[...]
> The results you present do show improvement in IPC for an artificial
> test script. But more interesting would be to see how real world
> workloads benefit from your changes.

It's been quite some time ago (2018ish) when we saw a bug report where
mount got stalled when racing with memory reclaim. This was nasty
because the said mount was a part of the login chain and users simply
had to wait a long time to get logged in in that particular deployment.

The mount was blocked on a shrinker registration and the reclaim was
stalled in a slab shrinker IIRC. I do not remember all the details, but
the underlying problem was that a shrinker callback took a long time,
because there were too many objects to scan or it had to sync with some
other fs operation. I believe we ended up using Minchan's break-out
from slab shrinking when the shrinker semaphore is contended, and that
helped to some degree, but there were still some corner cases where a
single slab shrinker could take a noticeable amount of time.

In general, using a "big" lock like shrinker_rwsem from the reclaim
path, potentially blocking many unrelated subsystems that just want to
register or unregister shrinkers, is a source of hard-to-predict
problems. So this is a very welcome change.
On 2023/3/1 02:40, Michal Hocko wrote:
> On Mon 27-02-23 17:08:30, Mike Rapoport wrote:
> [...]
>
> It's been quite some time ago (2018ish) when we saw a bug report where
> mount got stalled when racing with memory reclaim. This was nasty
> because the said mount was a part of the login chain and users simply
> had to wait a long time to get logged in in that particular
> deployment.
>
> [...]
>
> In general, using a "big" lock like shrinker_rwsem from the reclaim
> path, potentially blocking many unrelated subsystems that just want
> to register or unregister shrinkers, is a source of hard-to-predict
> problems. So this is a very welcome change.

Totally agree. :)

Thanks,
Qi