Message ID | 20231010032117.1577496-4-yosryahmed@google.com |
State | New |
Subject: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg
From: Yosry Ahmed <yosryahmed@google.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@kernel.org>,
	Roman Gushchin <roman.gushchin@linux.dev>, Shakeel Butt <shakeelb@google.com>,
	Muchun Song <muchun.song@linux.dev>, Ivan Babrou <ivan@cloudflare.com>,
	Tejun Heo <tj@kernel.org>, Michal Koutný <mkoutny@suse.com>,
	Waiman Long <longman@redhat.com>, kernel-team@cloudflare.com,
	Wei Xu <weixugc@google.com>, Greg Thelen <gthelen@google.com>,
	linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Date: Tue, 10 Oct 2023 03:21:14 +0000
In-Reply-To: <20231010032117.1577496-1-yosryahmed@google.com>
References: <20231010032117.1577496-1-yosryahmed@google.com>
Series: mm: memcg: subtree stats flushing and thresholds
Commit Message
Yosry Ahmed
Oct. 10, 2023, 3:21 a.m. UTC
A global counter for the magnitude of memcg stats updates is maintained
on the memcg side to avoid invoking rstat flushes when the pending
updates are not significant. This avoids unnecessary flushes, which are
not very cheap even if there isn't a lot of stats to flush. It also
avoids unnecessary lock contention on the underlying global rstat lock.
Make this threshold per-memcg. The same scheme is kept: percpu (now
also per-memcg) counters are incremented in the update path, and only
propagated to per-memcg atomics when they exceed a certain threshold.
This provides two benefits:
(a) On large machines with a lot of memcgs, the global threshold can be
reached relatively fast, so guarding the underlying lock becomes less
effective. Making the threshold per-memcg avoids this.
(b) Having a global threshold makes it hard to do subtree flushes, as we
cannot reset the global counter except for a full flush. Per-memcg
counters remove this blocker to subtree flushes, which helps avoid
unnecessary work when the stats of only a small subtree are needed.
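The update-path scheme described above can be sketched in plain C. This is a hypothetical userspace model, not the actual mm/memcontrol.c code; the names (memcg_rstat_updated, stats_updates, percpu_updates) and the threshold constants are illustrative stand-ins:

```c
#include <stdlib.h>

#define NCPUS            4      /* stand-in for NR_CPUS */
#define UPDATE_THRESHOLD 64     /* stand-in for MEMCG_CHARGE_BATCH */

struct memcg {
    struct memcg *parent;
    int percpu_updates[NCPUS];  /* per-cpu, per-memcg pending update magnitude */
    long stats_updates;         /* per-memcg "atomic" total */
};

/* Called on every stat update of magnitude |val| on a given cpu. */
void memcg_rstat_updated(struct memcg *memcg, int cpu, int val)
{
    /* Propagate up the ancestor chain: work scales with tree depth. */
    for (; memcg; memcg = memcg->parent) {
        memcg->percpu_updates[cpu] += abs(val);
        /* Common case: only the cheap percpu counter is touched. */
        if (memcg->percpu_updates[cpu] < UPDATE_THRESHOLD)
            continue;
        /* Rare case: fold into the per-memcg counter and reset. */
        memcg->stats_updates += memcg->percpu_updates[cpu];
        memcg->percpu_updates[cpu] = 0;
    }
}

/* Flushing |memcg|'s subtree is only worthwhile past its own threshold. */
int memcg_should_flush(struct memcg *memcg)
{
    return memcg->stats_updates > (long)NCPUS * UPDATE_THRESHOLD;
}
```

Because each memcg carries its own stats_updates, a subtree flush can reset just that memcg's counter, which a single global counter cannot do short of a full flush.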
Nothing is free, of course. This comes at a cost:
(a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4
bytes. The extra memory usage is insignificant.
(b) More work on the update side, although in the common case it will
only be percpu counter updates. The amount of work scales with the
number of ancestors (i.e. tree depth). This is not a new concept; adding
a cgroup to the rstat tree involves a parent loop, and so does charging.
Testing results below show no significant regressions.
(c) The error margin in the stats for the system as a whole increases
from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH *
NR_MEMCGS. This is probably fine because we have a similar per-memcg
error in charges coming from percpu stocks, and we have a periodic
flusher that makes sure we always flush all the stats every 2s anyway.
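For concreteness, costs (a) and (c) can be put into numbers with a small back-of-the-envelope helper. The inputs here (128 CPUs, 10,000 memcgs) are assumed example values, not measurements; 64 stands in for MEMCG_CHARGE_BATCH:

```c
/* Cost (a): one 4-byte percpu counter per cpu per memcg. */
long extra_percpu_bytes(long nr_cpus, long nr_memcgs)
{
    return nr_cpus * nr_memcgs * 4;
}

/*
 * Cost (c): worst-case pending (unflushed) update events.
 * Before the patch the bound is global, i.e. nr_memcgs == 1 here;
 * after the patch every memcg can independently sit just below
 * its threshold.
 */
long max_unflushed_events(long nr_cpus, long batch, long nr_memcgs)
{
    return nr_cpus * batch * nr_memcgs;
}
```

With 128 CPUs and 10,000 memcgs this works out to about 5 MiB of extra counters, and an error bound growing from 8,192 pending events to roughly 82 million, which is why the 2s periodic flusher matters as a backstop.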
This patch was tested to make sure no significant regressions are
introduced on the update path. The following benchmarks were run in a
cgroup that is 4 levels deep (/sys/fs/cgroup/a/b/c/d), which is deeper
than a usual setup:
(a) neper [1] with 1000 flows and 100 threads (single machine). The
values in the table are the average of server and client throughputs in
mbps after 30 iterations, each running for 30s:
                     tcp_rr       tcp_stream
Base                 9504218.56   357366.84
Patched              9656205.68   356978.39
Delta                +1.6%        -0.1%
Standard Deviation   0.95%        1.03%
An increase in the performance of tcp_rr doesn't really make sense, but
it's probably in the noise. The same tests were run with 1 flow and 1
thread, but the throughput was too noisy to make any conclusions (the
averages did not show regressions nonetheless).
Looking at perf for one iteration of the above test, __mod_memcg_state()
(which is where memcg_rstat_updated() is called) does not show up at all
without this patch, but it shows up with this patch as 1.06% for tcp_rr
and 0.36% for tcp_stream.
(b) "stress-ng --vm 0 -t 1m --times --perf". I don't understand
stress-ng very well, so I am not sure that's the best way to test this,
but it spawns 384 workers and spits out a lot of metrics, which looks
nice :) I picked a few that seem relevant to the stats update path. I
also included cache misses, as this patch introduces more atomics that
may bounce between cpu caches:
Metric                  Base           Patched        Delta
Cache Misses            3.394 B/sec    3.433 B/sec    +1.14%
Cache L1D Read          0.148 T/sec    0.154 T/sec    +4.05%
Cache L1D Read Miss     20.430 B/sec   21.820 B/sec   +6.8%
Page Faults Total       4.304 M/sec    4.535 M/sec    +5.4%
Page Faults Minor       4.304 M/sec    4.535 M/sec    +5.4%
Page Faults Major       18.794 /sec    0.000 /sec
Kmalloc                 0.153 M/sec    0.152 M/sec    -0.65%
Kfree                   0.152 M/sec    0.153 M/sec    +0.65%
MM Page Alloc           4.640 M/sec    4.898 M/sec    +5.56%
MM Page Free            4.639 M/sec    4.897 M/sec    +5.56%
Lock Contention Begin   0.362 M/sec    0.479 M/sec    +32.32%
Lock Contention End     0.362 M/sec    0.479 M/sec    +32.32%
page-cache add          238.057 /sec   0.000 /sec
page-cache del          6.265 /sec     6.267 /sec     -0.03%
This is only using a single run in each case. I am not sure what to
make of most of these numbers, but they mostly seem to be in the noise
(some better, some worse). The lock contention numbers are interesting.
I am not sure if higher is better or worse here. No new locks or lock
sections are introduced by this patch either way.
Looking at perf, __mod_memcg_state() shows up as 0.00% with and without
this patch. This is suspicious, but I verified while stress-ng is
running that all the threads are in the right cgroup.
(c) will-it-scale page_fault tests. These tests (specifically
per_process_ops in the page_fault3 test) previously detected a 25.9%
regression for a change in the stats update path [2]. These are the
numbers from 30 runs (+ is good):
LABEL | MEAN | MEDIAN | STDDEV |
------------------------------+-------------+-------------+-------------
page_fault1_per_process_ops | | | |
(A) base | 265207.738 | 262941.000 | 12112.379 |
(B) patched | 249249.191 | 248781.000 | 8767.457 |
| -6.02% | -5.39% | |
page_fault1_per_thread_ops | | | |
(A) base | 241618.484 | 240209.000 | 10162.207 |
(B) patched | 229820.671 | 229108.000 | 7506.582 |
| -4.88% | -4.62% | |
page_fault1_scalability       |             |             |             |
(A) base | 0.03545 | 0.035705 | 0.0015837 |
(B) patched | 0.029952 | 0.029957 | 0.0013551 |
| -9.29% | -9.35% | |
page_fault2_per_process_ops   |             |             |             |
(A) base | 203916.148 | 203496.000 | 2908.331 |
(B) patched | 186975.419 | 187023.000 | 1991.100 |
| -6.85% | -6.90% | |
page_fault2_per_thread_ops    |             |             |             |
(A) base | 170604.972 | 170532.000 | 1624.834 |
(B) patched | 163100.260 | 163263.000 | 1517.967 |
| -4.40% | -4.26% | |
page_fault2_scalability       |             |             |             |
(A) base | 0.054603 | 0.054693 | 0.00080196 |
(B) patched | 0.044882 | 0.044957 | 0.0011766 |
| -0.05% | +0.33% | |
page_fault3_per_process_ops   |             |             |             |
(A) base | 1299821.099 | 1297918.000 | 9882.872 |
(B) patched | 1248700.839 | 1247168.000 | 8454.891 |
| -3.93% | -3.91% | |
page_fault3_per_thread_ops    |             |             |             |
(A) base | 387216.963 | 387115.000 | 1605.760 |
(B) patched | 368538.213 | 368826.000 | 1852.594 |
| -4.82% | -4.72% | |
page_fault3_scalability       |             |             |             |
(A) base | 0.59909 | 0.59367 | 0.01256 |
(B) patched | 0.59995 | 0.59769 | 0.010088 |
| +0.14% | +0.68% | |
There are some microbenchmark regressions (and some minute improvements),
but nothing outside the normal variance of this benchmark between kernel
versions. The fix for [2] assumed that 3% is noise (and there were no
further practical complaints), so hopefully this means that such variations
in these microbenchmarks do not reflect on practical workloads.
[1] https://github.com/google/neper
[2] https://lore.kernel.org/all/20190520063534.GB19312@shao2-debian/
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
mm/memcontrol.c | 49 +++++++++++++++++++++++++++++++++----------------
1 file changed, 33 insertions(+), 16 deletions(-)
Comments
On Mon, Oct 9, 2023 at 8:21 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> [...]
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

Johannes, as I mentioned in a reply to v1, I think this might be what
you suggested in our previous discussion [1], but I am not sure this is
what you meant for the update path, so I did not add a Suggested-by.
Please let me know if this is what you meant and I can amend the tag as
such.

[1] https://lore.kernel.org/lkml/20230913153758.GB45543@cmpxchg.org/
On Mon, Oct 9, 2023 at 8:21 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> [...]
> page_fault1_scalability       |             |             |             |
> (A) base                      | 0.03545     | 0.035705    | 0.0015837   |
> (B) patched                   | 0.029952    | 0.029957    | 0.0013551   |
>                               | -9.29%      | -9.35%      |             |

This much regression is not acceptable.

In addition, I ran netperf with the same 4 level hierarchy as you have
run and I am seeing ~11% regression.

More specifically on a machine with 44 CPUs (HT disabled ixion machine):

# for server
$ netserver -6

# 22 instances of netperf clients
$ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

(averaged over 4 runs)

base (next-20231009): 33081 MBPS
patched: 29267 MBPS

So, this series is not acceptable unless this regression is resolved.
On Tue, Oct 10, 2023 at 1:45 PM Shakeel Butt <shakeelb@google.com> wrote: > > On Mon, Oct 9, 2023 at 8:21 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > A global counter for the magnitude of memcg stats update is maintained > > on the memcg side to avoid invoking rstat flushes when the pending > > updates are not significant. This avoids unnecessary flushes, which are > > not very cheap even if there isn't a lot of stats to flush. It also > > avoids unnecessary lock contention on the underlying global rstat lock. > > > > Make this threshold per-memcg. The scheme is followed where percpu (now > > also per-memcg) counters are incremented in the update path, and only > > propagated to per-memcg atomics when they exceed a certain threshold. > > > > This provides two benefits: > > (a) On large machines with a lot of memcgs, the global threshold can be > > reached relatively fast, so guarding the underlying lock becomes less > > effective. Making the threshold per-memcg avoids this. > > > > (b) Having a global threshold makes it hard to do subtree flushes, as we > > cannot reset the global counter except for a full flush. Per-memcg > > counters removes this as a blocker from doing subtree flushes, which > > helps avoid unnecessary work when the stats of a small subtree are > > needed. > > > > Nothing is free, of course. This comes at a cost: > > (a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4 > > bytes. The extra memory usage is insigificant. > > > > (b) More work on the update side, although in the common case it will > > only be percpu counter updates. The amount of work scales with the > > number of ancestors (i.e. tree depth). This is not a new concept, adding > > a cgroup to the rstat tree involves a parent loop, so is charging. > > Testing results below show no significant regressions. 
> > > > (c) The error margin in the stats for the system as a whole increases > > from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH * > > NR_MEMCGS. This is probably fine because we have a similar per-memcg > > error in charges coming from percpu stocks, and we have a periodic > > flusher that makes sure we always flush all the stats every 2s anyway. > > > > This patch was tested to make sure no significant regressions are > > introduced on the update path as follows. The following benchmarks were > > ran in a cgroup that is 4 levels deep (/sys/fs/cgroup/a/b/c/d), which is > > deeper than a usual setup: > > > > (a) neper [1] with 1000 flows and 100 threads (single machine). The > > values in the table are the average of server and client throughputs in > > mbps after 30 iterations, each running for 30s: > > > > tcp_rr tcp_stream > > Base 9504218.56 357366.84 > > Patched 9656205.68 356978.39 > > Delta +1.6% -0.1% > > Standard Deviation 0.95% 1.03% > > > > An increase in the performance of tcp_rr doesn't really make sense, but > > it's probably in the noise. The same tests were ran with 1 flow and 1 > > thread but the throughput was too noisy to make any conclusions (the > > averages did not show regressions nonetheless). > > > > Looking at perf for one iteration of the above test, __mod_memcg_state() > > (which is where memcg_rstat_updated() is called) does not show up at all > > without this patch, but it shows up with this patch as 1.06% for tcp_rr > > and 0.36% for tcp_stream. > > > > (b) "stress-ng --vm 0 -t 1m --times --perf". I don't understand > > stress-ng very well, so I am not sure that's the best way to test this, > > but it spawns 384 workers and spits a lot of metrics which looks nice :) > > I picked a few ones that seem to be relevant to the stats update path. 
I > > also included cache misses as this patch introduce more atomics that may > > bounce between cpu caches: > > > > Metric Base Patched Delta > > Cache Misses 3.394 B/sec 3.433 B/sec +1.14% > > Cache L1D Read 0.148 T/sec 0.154 T/sec +4.05% > > Cache L1D Read Miss 20.430 B/sec 21.820 B/sec +6.8% > > Page Faults Total 4.304 M/sec 4.535 M/sec +5.4% > > Page Faults Minor 4.304 M/sec 4.535 M/sec +5.4% > > Page Faults Major 18.794 /sec 0.000 /sec > > Kmalloc 0.153 M/sec 0.152 M/sec -0.65% > > Kfree 0.152 M/sec 0.153 M/sec +0.65% > > MM Page Alloc 4.640 M/sec 4.898 M/sec +5.56% > > MM Page Free 4.639 M/sec 4.897 M/sec +5.56% > > Lock Contention Begin 0.362 M/sec 0.479 M/sec +32.32% > > Lock Contention End 0.362 M/sec 0.479 M/sec +32.32% > > page-cache add 238.057 /sec 0.000 /sec > > page-cache del 6.265 /sec 6.267 /sec -0.03% > > > > This is only using a single run in each case. I am not sure what to > > make out of most of these numbers, but they mostly seem in the noise > > (some better, some worse). The lock contention numbers are interesting. > > I am not sure if higher is better or worse here. No new locks or lock > > sections are introduced by this patch either way. > > > > Looking at perf, __mod_memcg_state() shows up as 0.00% with and without > > this patch. This is suspicious, but I verified while stress-ng is > > running that all the threads are in the right cgroup. > > > > (3) will-it-scale page_fault tests. These tests (specifically > > per_process_ops in page_fault3 test) detected a 25.9% regression before > > for a change in the stats update path [2]. 
These are the > > numbers from 30 runs (+ is good): > > > > LABEL | MEAN | MEDIAN | STDDEV | > > ------------------------------+-------------+-------------+------------- > > page_fault1_per_process_ops | | | | > > (A) base | 265207.738 | 262941.000 | 12112.379 | > > (B) patched | 249249.191 | 248781.000 | 8767.457 | > > | -6.02% | -5.39% | | > > page_fault1_per_thread_ops | | | | > > (A) base | 241618.484 | 240209.000 | 10162.207 | > > (B) patched | 229820.671 | 229108.000 | 7506.582 | > > | -4.88% | -4.62% | | > > page_fault1_scalability | | | > > (A) base | 0.03545 | 0.035705 | 0.0015837 | > > (B) patched | 0.029952 | 0.029957 | 0.0013551 | > > | -9.29% | -9.35% | | > > This much regression is not acceptable. > > In addition, I ran netperf with the same 4 level hierarchy as you have > run and I am seeing ~11% regression. Interesting, I thought neper and netperf should be similar. Let me try to reproduce this. Thanks for testing! > > More specifically on a machine with 44 CPUs (HT disabled ixion machine): > > # for server > $ netserver -6 > > # 22 instances of netperf clients > $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K > > (averaged over 4 runs) > > base (next-20231009): 33081 MBPS > patched: 29267 MBPS > > So, this series is not acceptable unless this regression is resolved.
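The error-margin point in (c) above is easy to quantify. A back-of-the-envelope sketch (the `max_pending` helper is illustrative naming, not from the thread; MEMCG_CHARGE_BATCH was 64 in kernels of this era, and the CPU/memcg counts below are only example values):

```shell
#!/bin/bash
# Worst-case number of stat updates that can sit unflushed at once.
# Before the patch: one global counter, bounded by NR_CPUS * MEMCG_CHARGE_BATCH.
# After the patch: one counter per memcg, so the system-wide bound also
# scales with the number of memcgs.
MEMCG_CHARGE_BATCH=64   # in-tree value around the time of this thread

max_pending() {         # max_pending NR_CPUS [NR_MEMCGS]
    echo $(( $1 * MEMCG_CHARGE_BATCH * ${2:-1} ))
}

max_pending 72          # global bound on a 72-cpu machine: 4608
max_pending 72 1000     # per-memcg bound with 1000 memcgs: 1000x larger
```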
On Tue, Oct 10, 2023 at 2:02 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Tue, Oct 10, 2023 at 1:45 PM Shakeel Butt <shakeelb@google.com> wrote: > > > > On Mon, Oct 9, 2023 at 8:21 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > A global counter for the magnitude of memcg stats updates is maintained > > > on the memcg side to avoid invoking rstat flushes when the pending > > > updates are not significant. This avoids unnecessary flushes, which are > > > not very cheap even if there aren't many stats to flush. It also > > > avoids unnecessary lock contention on the underlying global rstat lock. > > > > > > Make this threshold per-memcg. The usual scheme is followed: percpu (now > > > also per-memcg) counters are incremented in the update path, and only > > > propagated to per-memcg atomics when they exceed a certain threshold. > > > > > > This provides two benefits: > > > (a) On large machines with a lot of memcgs, the global threshold can be > > > reached relatively fast, so guarding the underlying lock becomes less > > > effective. Making the threshold per-memcg avoids this. > > > > > > (b) Having a global threshold makes it hard to do subtree flushes, as we > > > cannot reset the global counter except for a full flush. Per-memcg > > > counters remove this blocker to subtree flushes, which > > > helps avoid unnecessary work when the stats of a small subtree are > > > needed. > > > > > > Nothing is free, of course. This comes at a cost: > > > (a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4 > > > bytes. The extra memory usage is insignificant. > > > > > > (b) More work on the update side, although in the common case it will > > > only be percpu counter updates. The amount of work scales with the > > > number of ancestors (i.e. tree depth). This is not a new concept; adding > > > a cgroup to the rstat tree involves a parent loop, as does charging. > > > Testing results below show no significant regressions.
> > > > > > (c) The error margin in the stats for the system as a whole increases > > > from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH * > > > NR_MEMCGS. This is probably fine because we have a similar per-memcg > > > error in charges coming from percpu stocks, and we have a periodic > > > flusher that makes sure we always flush all the stats every 2s anyway. > > > > > > This patch was tested to make sure no significant regressions are > > > introduced on the update path as follows. The following benchmarks were > > > run in a cgroup that is 4 levels deep (/sys/fs/cgroup/a/b/c/d), which is > > > deeper than a usual setup: > > > > > > (a) neper [1] with 1000 flows and 100 threads (single machine). The > > > values in the table are the average of server and client throughputs in > > > mbps after 30 iterations, each running for 30s: > > > > > > tcp_rr tcp_stream > > > Base 9504218.56 357366.84 > > > Patched 9656205.68 356978.39 > > > Delta +1.6% -0.1% > > > Standard Deviation 0.95% 1.03% > > > > > > An increase in the performance of tcp_rr doesn't really make sense, but > > > it's probably in the noise. The same tests were run with 1 flow and 1 > > > thread, but the throughput was too noisy to draw any conclusions (the > > > averages did not show regressions nonetheless). > > > > > > Looking at perf for one iteration of the above test, __mod_memcg_state() > > > (which is where memcg_rstat_updated() is called) does not show up at all > > > without this patch, but it shows up with this patch as 1.06% for tcp_rr > > > and 0.36% for tcp_stream. > > > > > > (b) "stress-ng --vm 0 -t 1m --times --perf". I don't understand > > > stress-ng very well, so I am not sure that's the best way to test this, > > > but it spawns 384 workers and spits out a lot of metrics, which looks nice :) > > > I picked a few that seem to be relevant to the stats update path.
I > > > also included cache misses as this patch introduces more atomics that may > > > bounce between cpu caches: > > > > > > Metric Base Patched Delta > > > Cache Misses 3.394 B/sec 3.433 B/sec +1.14% > > > Cache L1D Read 0.148 T/sec 0.154 T/sec +4.05% > > > Cache L1D Read Miss 20.430 B/sec 21.820 B/sec +6.8% > > > Page Faults Total 4.304 M/sec 4.535 M/sec +5.4% > > > Page Faults Minor 4.304 M/sec 4.535 M/sec +5.4% > > > Page Faults Major 18.794 /sec 0.000 /sec > > > Kmalloc 0.153 M/sec 0.152 M/sec -0.65% > > > Kfree 0.152 M/sec 0.153 M/sec +0.65% > > > MM Page Alloc 4.640 M/sec 4.898 M/sec +5.56% > > > MM Page Free 4.639 M/sec 4.897 M/sec +5.56% > > > Lock Contention Begin 0.362 M/sec 0.479 M/sec +32.32% > > > Lock Contention End 0.362 M/sec 0.479 M/sec +32.32% > > > page-cache add 238.057 /sec 0.000 /sec > > > page-cache del 6.265 /sec 6.267 /sec -0.03% > > > > > > This is only using a single run in each case. I am not sure what to > > > make of most of these numbers, but they mostly seem in the noise > > > (some better, some worse). The lock contention numbers are interesting. > > > I am not sure if higher is better or worse here. No new locks or lock > > > sections are introduced by this patch either way. > > > > > > Looking at perf, __mod_memcg_state() shows up as 0.00% with and without > > > this patch. This is suspicious, but I verified while stress-ng is > > > running that all the threads are in the right cgroup. > > > > > > (c) will-it-scale page_fault tests. These tests (specifically > > > per_process_ops in page_fault3 test) detected a 25.9% regression before > > > for a change in the stats update path [2].
These are the > > > numbers from 30 runs (+ is good): > > > > > > LABEL | MEAN | MEDIAN | STDDEV | > > > ------------------------------+-------------+-------------+------------- > > > page_fault1_per_process_ops | | | | > > > (A) base | 265207.738 | 262941.000 | 12112.379 | > > > (B) patched | 249249.191 | 248781.000 | 8767.457 | > > > | -6.02% | -5.39% | | > > > page_fault1_per_thread_ops | | | | > > > (A) base | 241618.484 | 240209.000 | 10162.207 | > > > (B) patched | 229820.671 | 229108.000 | 7506.582 | > > > | -4.88% | -4.62% | | > > > page_fault1_scalability | | | > > > (A) base | 0.03545 | 0.035705 | 0.0015837 | > > > (B) patched | 0.029952 | 0.029957 | 0.0013551 | > > > | -9.29% | -9.35% | | > > > > This much regression is not acceptable. > > > > In addition, I ran netperf with the same 4 level hierarchy as you have > > run and I am seeing ~11% regression. > > Interesting, I thought neper and netperf should be similar. Let me try > to reproduce this. > > Thanks for testing! > > > > > More specifically on a machine with 44 CPUs (HT disabled ixion machine): > > > > # for server > > $ netserver -6 > > > > # 22 instances of netperf clients > > $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K > > > > (averaged over 4 runs) > > > > base (next-20231009): 33081 MBPS > > patched: 29267 MBPS > > > > So, this series is not acceptable unless this regression is resolved. 
I tried this on a machine with 72 cpus (also ixion), running both netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows: # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control # mkdir /sys/fs/cgroup/a # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control # mkdir /sys/fs/cgroup/a/b # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control # mkdir /sys/fs/cgroup/a/b/c # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control # mkdir /sys/fs/cgroup/a/b/c/d # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs # ./netserver -6 # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K; done Base: 540000 262144 10240 60.00 54613.89 540000 262144 10240 60.00 54940.52 540000 262144 10240 60.00 55168.86 540000 262144 10240 60.00 54800.15 540000 262144 10240 60.00 54452.55 540000 262144 10240 60.00 54501.60 540000 262144 10240 60.00 55036.11 540000 262144 10240 60.00 52018.91 540000 262144 10240 60.00 54877.78 540000 262144 10240 60.00 55342.38 Average: 54575.275 Patched: 540000 262144 10240 60.00 53694.86 540000 262144 10240 60.00 54807.68 540000 262144 10240 60.00 54782.89 540000 262144 10240 60.00 51404.91 540000 262144 10240 60.00 55024.00 540000 262144 10240 60.00 54725.84 540000 262144 10240 60.00 51400.40 540000 262144 10240 60.00 54212.63 540000 262144 10240 60.00 51951.47 540000 262144 10240 60.00 51978.27 Average: 53398.295 That's ~2% regression. Did I do anything incorrectly?
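For what it's worth, the manual mkdir/echo sequence above can be scripted; a minimal sketch (the `setup_hierarchy` name is mine, not from the thread; pass /sys/fs/cgroup as root for a real cgroup v2 mount, or any scratch directory for a dry run):

```shell
#!/bin/bash
# Build the a/b/c/d hierarchy under $1 with +memory enabled at every
# non-leaf level, mirroring the manual steps above. The leaf is left
# without subtree control so processes can be moved into it.
setup_hierarchy() {
    local base="$1" level
    echo "+memory" > "$base/cgroup.subtree_control"
    for level in a a/b a/b/c a/b/c/d; do
        mkdir -p "$base/$level"
        if [ "$level" != "a/b/c/d" ]; then
            echo "+memory" > "$base/$level/cgroup.subtree_control"
        fi
    done
}
```

Usage: `setup_hierarchy /sys/fs/cgroup`, then `echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs` as in the thread.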
On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote: [...] > > I tried this on a machine with 72 cpus (also ixion), running both > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows: > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control > # mkdir /sys/fs/cgroup/a > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control > # mkdir /sys/fs/cgroup/a/b > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control > # mkdir /sys/fs/cgroup/a/b/c > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control > # mkdir /sys/fs/cgroup/a/b/c/d > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs > # ./netserver -6 > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- > -m 10K; done You are missing '&' at the end. Use something like below: #!/bin/bash for i in {1..22} do /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K & done wait
On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <shakeelb@google.com> wrote: > > On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote: > [...] > > > > I tried this on a machine with 72 cpus (also ixion), running both > > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows: > > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control > > # mkdir /sys/fs/cgroup/a > > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control > > # mkdir /sys/fs/cgroup/a/b > > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control > > # mkdir /sys/fs/cgroup/a/b/c > > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control > > # mkdir /sys/fs/cgroup/a/b/c/d > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs > > # ./netserver -6 > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs > > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- > > -m 10K; done > > You are missing '&' at the end. Use something like below: > > #!/bin/bash > for i in {1..22} > do > /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K & > done > wait > Oh sorry I missed the fact that you are running instances in parallel, my bad. So I ran 36 instances on a machine with 72 cpus. I did this 10 times and got an average from all instances for all runs to reduce noise: #!/bin/bash ITER=10 NR_INSTANCES=36 for i in $(seq $ITER); do echo "iteration $i" for j in $(seq $NR_INSTANCES); do echo "iteration $i" >> "out$j" ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" & done wait done cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}' Base: 22169 mbps Patched: 21331.9 mbps The difference is ~3.7% in my runs. I am not sure what's different. Perhaps it's the number of runs?
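Since run-to-run noise is the open question here, reporting a standard deviation next to the mean would help; a small extension of the awk one-liner above (the `summarize` name is mine; it parses throughput from field 5 of the netperf result lines, and the sample values in the usage note are taken from numbers quoted earlier in the thread):

```shell
#!/bin/bash
# Mean and sample stddev of the netperf throughput column (field 5)
# across one or more output files, extending the awk one-liner above.
summarize() {
    grep ^540000 "$@" | awk '
        { n++; s += $5; ss += $5 * $5 }
        END { m = s / n; sd = sqrt((ss - n * m * m) / (n - 1));
              printf "mean %.2f stddev %.2f\n", m, sd }'
}
```

For example, feeding it three of the base throughputs quoted above (54613.89, 54940.52, 52018.91) reports a mean near 53857.77 with a stddev around 1600, which gives a sense of how wide the per-run spread already is.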
On Tue, Oct 10, 2023 at 6:48 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <shakeelb@google.com> wrote: > > > > On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote: > > [...] > > > > > > I tried this on a machine with 72 cpus (also ixion), running both > > > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows: > > > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control > > > # mkdir /sys/fs/cgroup/a > > > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control > > > # mkdir /sys/fs/cgroup/a/b > > > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control > > > # mkdir /sys/fs/cgroup/a/b/c > > > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control > > > # mkdir /sys/fs/cgroup/a/b/c/d > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs > > > # ./netserver -6 > > > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs > > > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- > > > -m 10K; done > > > > You are missing '&' at the end. Use something like below: > > > > #!/bin/bash > > for i in {1..22} > > do > > /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K & > > done > > wait > > > > Oh sorry I missed the fact that you are running instances in parallel, my bad. > > So I ran 36 instances on a machine with 72 cpus. I did this 10 times > and got an average from all instances for all runs to reduce noise: > > #!/bin/bash > > ITER=10 > NR_INSTANCES=36 > > for i in $(seq $ITER); do > echo "iteration $i" > for j in $(seq $NR_INSTANCES); do > echo "iteration $i" >> "out$j" > ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" & > done > wait > done > > cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}' > > Base: 22169 mbps > Patched: 21331.9 mbps > > The difference is ~3.7% in my runs. I am not sure what's different. > Perhaps it's the number of runs? My base kernel is next-20231009 and I am running experiments with hyperthreading disabled.
On Wed, Oct 11, 2023 at 5:46 AM Shakeel Butt <shakeelb@google.com> wrote: > > On Tue, Oct 10, 2023 at 6:48 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote: > > > [...] > > > > > > > > I tried this on a machine with 72 cpus (also ixion), running both > > > > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows: > > > > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control > > > > # mkdir /sys/fs/cgroup/a > > > > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control > > > > # mkdir /sys/fs/cgroup/a/b > > > > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control > > > > # mkdir /sys/fs/cgroup/a/b/c > > > > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control > > > > # mkdir /sys/fs/cgroup/a/b/c/d > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs > > > > # ./netserver -6 > > > > > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs > > > > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- > > > > -m 10K; done > > > > > > You are missing '&' at the end. Use something like below: > > > > > > #!/bin/bash > > > for i in {1..22} > > > do > > > /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K & > > > done > > > wait > > > > > > > Oh sorry I missed the fact that you are running instances in parallel, my bad. > > > > So I ran 36 instances on a machine with 72 cpus. 
I did this 10 times > > and got an average from all instances for all runs to reduce noise: > > > > #!/bin/bash > > > > ITER=10 > > NR_INSTANCES=36 > > > > for i in $(seq $ITER); do > > echo "iteration $i" > > for j in $(seq $NR_INSTANCES); do > > echo "iteration $i" >> "out$j" > > ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" & > > done > > wait > > done > > > > cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}' > > > > Base: 22169 mbps > > Patched: 21331.9 mbps > > > > The difference is ~3.7% in my runs. I am not sure what's different. > > Perhaps it's the number of runs? > > My base kernel is next-20231009 and I am running experiments with > hyperthreading disabled. Using next-20231009 and a similar 44 core machine with hyperthreading disabled, I ran 22 instances of netperf in parallel and got the following numbers from averaging 20 runs: Base: 33076.5 mbps Patched: 31410.1 mbps That's about 5% diff. I guess the number of iterations helps reduce the noise? I am not sure. Please also keep in mind that in this case all netperf instances are in the same cgroup and at a 4-level depth. I imagine in a practical setup processes would be a little more spread out, which means less common ancestors, so less contended atomic operations.
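The percentage deltas being traded back and forth can be checked mechanically; a trivial helper (the name is mine), applied to the means quoted above:

```shell
#!/bin/bash
# Percent change from a base mean to a patched mean, e.g. for the
# netperf throughput averages quoted in this thread.
pct_delta() {
    awk -v b="$1" -v p="$2" 'BEGIN { printf "%.1f\n", 100 * (p - b) / b }'
}

pct_delta 33076.5 31410.1   # 44-core run above: about -5.0
pct_delta 22169 21331.9     # 72-cpu run above: about -3.8
```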
On Wed, Oct 11, 2023 at 8:13 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Wed, Oct 11, 2023 at 5:46 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > On Tue, Oct 10, 2023 at 6:48 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > > > On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote: > > > > [...] > > > > > > > > > > I tried this on a machine with 72 cpus (also ixion), running both > > > > > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows: > > > > > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control > > > > > # mkdir /sys/fs/cgroup/a > > > > > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control > > > > > # mkdir /sys/fs/cgroup/a/b > > > > > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control > > > > > # mkdir /sys/fs/cgroup/a/b/c > > > > > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control > > > > > # mkdir /sys/fs/cgroup/a/b/c/d > > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs > > > > > # ./netserver -6 > > > > > > > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs > > > > > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- > > > > > -m 10K; done > > > > > > > > You are missing '&' at the end. Use something like below: > > > > > > > > #!/bin/bash > > > > for i in {1..22} > > > > do > > > > /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K & > > > > done > > > > wait > > > > > > > > > > Oh sorry I missed the fact that you are running instances in parallel, my bad. > > > > > > So I ran 36 instances on a machine with 72 cpus. 
I did this 10 times > > > and got an average from all instances for all runs to reduce noise: > > > > > > #!/bin/bash > > > > > > ITER=10 > > > NR_INSTANCES=36 > > > > > > for i in $(seq $ITER); do > > > echo "iteration $i" > > > for j in $(seq $NR_INSTANCES); do > > > echo "iteration $i" >> "out$j" > > > ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" & > > > done > > > wait > > > done > > > > > > cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}' > > > > > > Base: 22169 mbps > > > Patched: 21331.9 mbps > > > > > > The difference is ~3.7% in my runs. I am not sure what's different. > > > Perhaps it's the number of runs? > > > > My base kernel is next-20231009 and I am running experiments with > > hyperthreading disabled. > > Using next-20231009 and a similar 44 core machine with hyperthreading > disabled, I ran 22 instances of netperf in parallel and got the > following numbers from averaging 20 runs: > > Base: 33076.5 mbps > Patched: 31410.1 mbps > > That's about 5% diff. I guess the number of iterations helps reduce > the noise? I am not sure. > > Please also keep in mind that in this case all netperf instances are > in the same cgroup and at a 4-level depth. I imagine in a practical > setup processes would be a little more spread out, which means less > common ancestors, so less contended atomic operations. (Resending the reply as I messed up the last one, was not in plain text) I was curious, so I ran the same testing in a cgroup 2 levels deep (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my experience. Here are the numbers: Base: 40198.0 mbps Patched: 38629.7 mbps The regression is reduced to ~3.9%. What's more interesting is that going from a level 2 cgroup to a level 4 cgroup is already a big hit with or without this patch: Base: 40198.0 -> 33076.5 mbps (~17.7% regression) Patched: 38629.7 -> 31410.1 (~18.7% regression) So going from level 2 to 4 is already a significant regression for other reasons (e.g. 
hierarchical charging). This patch only makes it marginally worse. This puts the numbers more into perspective imo than comparing values at level 4. What do you think?
On Thu, Oct 12, 2023 at 01:04:03AM -0700, Yosry Ahmed wrote: > On Wed, Oct 11, 2023 at 8:13 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Wed, Oct 11, 2023 at 5:46 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > On Tue, Oct 10, 2023 at 6:48 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > > > > > On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote: > > > > > [...] > > > > > > > > > > > > I tried this on a machine with 72 cpus (also ixion), running both > > > > > > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows: > > > > > > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control > > > > > > # mkdir /sys/fs/cgroup/a > > > > > > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control > > > > > > # mkdir /sys/fs/cgroup/a/b > > > > > > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control > > > > > > # mkdir /sys/fs/cgroup/a/b/c > > > > > > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control > > > > > > # mkdir /sys/fs/cgroup/a/b/c/d > > > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs > > > > > > # ./netserver -6 > > > > > > > > > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs > > > > > > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- > > > > > > -m 10K; done > > > > > > > > > > You are missing '&' at the end. Use something like below: > > > > > > > > > > #!/bin/bash > > > > > for i in {1..22} > > > > > do > > > > > /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K & > > > > > done > > > > > wait > > > > > > > > > > > > > Oh sorry I missed the fact that you are running instances in parallel, my bad. > > > > > > > > So I ran 36 instances on a machine with 72 cpus. 
I did this 10 times > > > > and got an average from all instances for all runs to reduce noise: > > > > > > > > #!/bin/bash > > > > > > > > ITER=10 > > > > NR_INSTANCES=36 > > > > > > > > for i in $(seq $ITER); do > > > > echo "iteration $i" > > > > for j in $(seq $NR_INSTANCES); do > > > > echo "iteration $i" >> "out$j" > > > > ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" & > > > > done > > > > wait > > > > done > > > > > > > > cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}' > > > > > > > > Base: 22169 mbps > > > > Patched: 21331.9 mbps > > > > > > > > The difference is ~3.7% in my runs. I am not sure what's different. > > > > Perhaps it's the number of runs? > > > > > > My base kernel is next-20231009 and I am running experiments with > > > hyperthreading disabled. > > > > Using next-20231009 and a similar 44 core machine with hyperthreading > > disabled, I ran 22 instances of netperf in parallel and got the > > following numbers from averaging 20 runs: > > > > Base: 33076.5 mbps > > Patched: 31410.1 mbps > > > > That's about 5% diff. I guess the number of iterations helps reduce > > the noise? I am not sure. > > > > Please also keep in mind that in this case all netperf instances are > > in the same cgroup and at a 4-level depth. I imagine in a practical > > setup processes would be a little more spread out, which means less > > common ancestors, so less contended atomic operations. > > > (Resending the reply as I messed up the last one, was not in plain text) > > I was curious, so I ran the same testing in a cgroup 2 levels deep > (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my > experience. Here are the numbers: > > Base: 40198.0 mbps > Patched: 38629.7 mbps > > The regression is reduced to ~3.9%. 
> > What's more interesting is that going from a level 2 cgroup to a level > 4 cgroup is already a big hit with or without this patch: > > Base: 40198.0 -> 33076.5 mbps (~17.7% regression) > Patched: 38629.7 -> 31410.1 (~18.7% regression) > > So going from level 2 to 4 is already a significant regression for > other reasons (e.g. hierarchical charging). This patch only makes it > marginally worse. This puts the numbers more into perspective imo than > comparing values at level 4. What do you think? I think it's reasonable. Especially comparing to how many cachelines we used to touch on the write side when all flushing happened there. This looks like a good trade-off to me.
On Thu, Oct 12, 2023 at 1:04 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Wed, Oct 11, 2023 at 8:13 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Wed, Oct 11, 2023 at 5:46 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > On Tue, Oct 10, 2023 at 6:48 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > > > > > On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote: > > > > > [...] > > > > > > > > > > > > I tried this on a machine with 72 cpus (also ixion), running both > > > > > > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows: > > > > > > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control > > > > > > # mkdir /sys/fs/cgroup/a > > > > > > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control > > > > > > # mkdir /sys/fs/cgroup/a/b > > > > > > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control > > > > > > # mkdir /sys/fs/cgroup/a/b/c > > > > > > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control > > > > > > # mkdir /sys/fs/cgroup/a/b/c/d > > > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs > > > > > > # ./netserver -6 > > > > > > > > > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs > > > > > > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- > > > > > > -m 10K; done > > > > > > > > > > You are missing '&' at the end. Use something like below: > > > > > > > > > > #!/bin/bash > > > > > for i in {1..22} > > > > > do > > > > > /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K & > > > > > done > > > > > wait > > > > > > > > > > > > > Oh sorry I missed the fact that you are running instances in parallel, my bad. > > > > > > > > So I ran 36 instances on a machine with 72 cpus. 
I did this 10 times > > > > and got an average from all instances for all runs to reduce noise: > > > > > > > > #!/bin/bash > > > > > > > > ITER=10 > > > > NR_INSTANCES=36 > > > > > > > > for i in $(seq $ITER); do > > > > echo "iteration $i" > > > > for j in $(seq $NR_INSTANCES); do > > > > echo "iteration $i" >> "out$j" > > > > ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" & > > > > done > > > > wait > > > > done > > > > > > > > cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}' > > > > > > > > Base: 22169 mbps > > > > Patched: 21331.9 mbps > > > > > > > > The difference is ~3.7% in my runs. I am not sure what's different. > > > > Perhaps it's the number of runs? > > > > > > My base kernel is next-20231009 and I am running experiments with > > > hyperthreading disabled. > > > > Using next-20231009 and a similar 44 core machine with hyperthreading > > disabled, I ran 22 instances of netperf in parallel and got the > > following numbers from averaging 20 runs: > > > > Base: 33076.5 mbps > > Patched: 31410.1 mbps > > > > That's about 5% diff. I guess the number of iterations helps reduce > > the noise? I am not sure. > > > > Please also keep in mind that in this case all netperf instances are > > in the same cgroup and at a 4-level depth. I imagine in a practical > > setup processes would be a little more spread out, which means less > > common ancestors, so less contended atomic operations. > > > (Resending the reply as I messed up the last one, was not in plain text) > > I was curious, so I ran the same testing in a cgroup 2 levels deep > (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my > experience. Here are the numbers: > > Base: 40198.0 mbps > Patched: 38629.7 mbps > > The regression is reduced to ~3.9%. 
> > What's more interesting is that going from a level 2 cgroup to a level > 4 cgroup is already a big hit with or without this patch: > > Base: 40198.0 -> 33076.5 mbps (~17.7% regression) > Patched: 38629.7 -> 31410.1 (~18.7% regression) > > So going from level 2 to 4 is already a significant regression for > other reasons (e.g. hierarchical charging). This patch only makes it > marginally worse. This puts the numbers more into perspective imo than > comparing values at level 4. What do you think? This is weird as we are running the experiments on the same machine. I will rerun with 2 levels as well. Also can you rerun the page fault benchmark as well which was showing 9% regression in your original commit message?
On Thu, Oct 12, 2023 at 6:35 AM Shakeel Butt <shakeelb@google.com> wrote: > > On Thu, Oct 12, 2023 at 1:04 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Wed, Oct 11, 2023 at 8:13 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > On Wed, Oct 11, 2023 at 5:46 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > > > On Tue, Oct 10, 2023 at 6:48 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > > > On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > > > > > > > On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote: > > > > > > [...] > > > > > > > > > > > > > > I tried this on a machine with 72 cpus (also ixion), running both > > > > > > > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows: > > > > > > > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control > > > > > > > # mkdir /sys/fs/cgroup/a > > > > > > > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control > > > > > > > # mkdir /sys/fs/cgroup/a/b > > > > > > > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control > > > > > > > # mkdir /sys/fs/cgroup/a/b/c > > > > > > > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control > > > > > > > # mkdir /sys/fs/cgroup/a/b/c/d > > > > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs > > > > > > > # ./netserver -6 > > > > > > > > > > > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs > > > > > > > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- > > > > > > > -m 10K; done > > > > > > > > > > > > You are missing '&' at the end. Use something like below: > > > > > > > > > > > > #!/bin/bash > > > > > > for i in {1..22} > > > > > > do > > > > > > /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K & > > > > > > done > > > > > > wait > > > > > > > > > > > > > > > > Oh sorry I missed the fact that you are running instances in parallel, my bad. > > > > > > > > > > So I ran 36 instances on a machine with 72 cpus. 
I did this 10 times > > > > > and got an average from all instances for all runs to reduce noise: > > > > > > > > > > #!/bin/bash > > > > > > > > > > ITER=10 > > > > > NR_INSTANCES=36 > > > > > > > > > > for i in $(seq $ITER); do > > > > > echo "iteration $i" > > > > > for j in $(seq $NR_INSTANCES); do > > > > > echo "iteration $i" >> "out$j" > > > > > ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" & > > > > > done > > > > > wait > > > > > done > > > > > > > > > > cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}' > > > > > > > > > > Base: 22169 mbps > > > > > Patched: 21331.9 mbps > > > > > > > > > > The difference is ~3.7% in my runs. I am not sure what's different. > > > > > Perhaps it's the number of runs? > > > > > > > > My base kernel is next-20231009 and I am running experiments with > > > > hyperthreading disabled. > > > > > > Using next-20231009 and a similar 44 core machine with hyperthreading > > > disabled, I ran 22 instances of netperf in parallel and got the > > > following numbers from averaging 20 runs: > > > > > > Base: 33076.5 mbps > > > Patched: 31410.1 mbps > > > > > > That's about 5% diff. I guess the number of iterations helps reduce > > > the noise? I am not sure. > > > > > > Please also keep in mind that in this case all netperf instances are > > > in the same cgroup and at a 4-level depth. I imagine in a practical > > > setup processes would be a little more spread out, which means less > > > common ancestors, so less contended atomic operations. > > > > > > (Resending the reply as I messed up the last one, was not in plain text) > > > > I was curious, so I ran the same testing in a cgroup 2 levels deep > > (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my > > experience. Here are the numbers: > > > > Base: 40198.0 mbps > > Patched: 38629.7 mbps > > > > The regression is reduced to ~3.9%. 
> > > > What's more interesting is that going from a level 2 cgroup to a level > > 4 cgroup is already a big hit with or without this patch: > > > > Base: 40198.0 -> 33076.5 mbps (~17.7% regression) > > Patched: 38629.7 -> 31410.1 (~18.7% regression) > > > > So going from level 2 to 4 is already a significant regression for > > other reasons (e.g. hierarchical charging). This patch only makes it > > marginally worse. This puts the numbers more into perspective imo than > > comparing values at level 4. What do you think? > > This is weird as we are running the experiments on the same machine. I > will rerun with 2 levels as well. Also can you rerun the page fault > benchmark as well which was showing 9% regression in your original > commit message? Thanks. I will re-run the page_fault tests, but keep in mind that the page fault benchmarks in will-it-scale are highly variable. We run them between kernel versions internally, and I think we ignore any changes below 10% as the benchmark is naturally noisy. I have a couple of runs for page_fault3_scalability showing a 2-3% improvement with this patch :)
[..] > > > > > > > > Using next-20231009 and a similar 44 core machine with hyperthreading > > > > disabled, I ran 22 instances of netperf in parallel and got the > > > > following numbers from averaging 20 runs: > > > > > > > > Base: 33076.5 mbps > > > > Patched: 31410.1 mbps > > > > > > > > That's about 5% diff. I guess the number of iterations helps reduce > > > > the noise? I am not sure. > > > > > > > > Please also keep in mind that in this case all netperf instances are > > > > in the same cgroup and at a 4-level depth. I imagine in a practical > > > > setup processes would be a little more spread out, which means less > > > > common ancestors, so less contended atomic operations. > > > > > > > > > (Resending the reply as I messed up the last one, was not in plain text) > > > > > > I was curious, so I ran the same testing in a cgroup 2 levels deep > > > (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my > > > experience. Here are the numbers: > > > > > > Base: 40198.0 mbps > > > Patched: 38629.7 mbps > > > > > > The regression is reduced to ~3.9%. > > > > > > What's more interesting is that going from a level 2 cgroup to a level > > > 4 cgroup is already a big hit with or without this patch: > > > > > > Base: 40198.0 -> 33076.5 mbps (~17.7% regression) > > > Patched: 38629.7 -> 31410.1 (~18.7% regression) > > > > > > So going from level 2 to 4 is already a significant regression for > > > other reasons (e.g. hierarchical charging). This patch only makes it > > > marginally worse. This puts the numbers more into perspective imo than > > > comparing values at level 4. What do you think? > > > > This is weird as we are running the experiments on the same machine. I > > will rerun with 2 levels as well. Also can you rerun the page fault > > benchmark as well which was showing 9% regression in your original > > commit message? > > Thanks. 
I will re-run the page_fault tests, but keep in mind that the
> page fault benchmarks in will-it-scale are highly variable. We run
> them between kernel versions internally, and I think we ignore any
> changes below 10% as the benchmark is naturally noisy.
>
> I have a couple of runs for page_fault3_scalability showing a 2-3%
> improvement with this patch :)

I ran the page_fault tests for 10 runs on a machine with 256 cpus in a
level 2 cgroup, here are the results (the results in the original
commit message are for 384 cpus in a level 4 cgroup):

LABEL                         |     MEAN    |    MEDIAN   |    STDDEV   |
------------------------------+-------------+-------------+-------------
page_fault1_per_process_ops   |             |             |             |
(A) base                      |  270249.164 |  265437.000 |   13451.836 |
(B) patched                   |  261368.709 |  255725.000 |   13394.767 |
                              |      -3.29% |      -3.66% |             |
page_fault1_per_thread_ops    |             |             |             |
(A) base                      |  242111.345 |  239737.000 |   10026.031 |
(B) patched                   |  237057.109 |  235305.000 |    9769.687 |
                              |      -2.09% |      -1.85% |             |
page_fault1_scalability       |             |             |             |
(A) base                      |    0.034387 |    0.035168 |   0.0018283 |
(B) patched                   |    0.033988 |    0.034573 |   0.0018056 |
                              |      -1.16% |      -1.69% |             |
page_fault2_per_process_ops   |             |             |             |
(A) base                      |  203561.836 |  203301.000 |    2550.764 |
(B) patched                   |  197195.945 |  197746.000 |    2264.263 |
                              |      -3.13% |      -2.73% |             |
page_fault2_per_thread_ops    |             |             |             |
(A) base                      |  171046.473 |  170776.000 |    1509.679 |
(B) patched                   |  166626.327 |  166406.000 |     768.753 |
                              |      -2.58% |      -2.56% |             |
page_fault2_scalability       |             |             |             |
(A) base                      |    0.054026 |    0.053821 |  0.00062121 |
(B) patched                   |    0.053329 |    0.05306  |  0.00048394 |
                              |      -1.29% |      -1.41% |             |
page_fault3_per_process_ops   |             |             |             |
(A) base                      | 1295807.782 | 1297550.000 |    5907.585 |
(B) patched                   | 1275579.873 | 1273359.000 |    8759.160 |
                              |      -1.56% |      -1.86% |             |
page_fault3_per_thread_ops    |             |             |             |
(A) base                      |  391234.164 |  390860.000 |    1760.720 |
(B) patched                   |  377231.273 |  376369.000 |    1874.971 |
                              |      -3.58% |      -3.71% |             |
page_fault3_scalability       |             |             |             |
(A) base                      |     0.60369 |     0.60072 |   0.0083029 |
(B) patched                   |     0.61733 |     0.61544 |    0.009855 |
                              |      +2.26% |      +2.45% |             |

The numbers are much better. I can modify the commit log to include the
testing in the replies instead of what's currently there if this helps
(22 netperf instances on 44 cpus and will-it-scale page_fault on 256
cpus -- all in a level 2 cgroup).
On Thu, Oct 12, 2023 at 2:06 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > [..] > > > > > > > > > > Using next-20231009 and a similar 44 core machine with hyperthreading > > > > > disabled, I ran 22 instances of netperf in parallel and got the > > > > > following numbers from averaging 20 runs: > > > > > > > > > > Base: 33076.5 mbps > > > > > Patched: 31410.1 mbps > > > > > > > > > > That's about 5% diff. I guess the number of iterations helps reduce > > > > > the noise? I am not sure. > > > > > > > > > > Please also keep in mind that in this case all netperf instances are > > > > > in the same cgroup and at a 4-level depth. I imagine in a practical > > > > > setup processes would be a little more spread out, which means less > > > > > common ancestors, so less contended atomic operations. > > > > > > > > > > > > (Resending the reply as I messed up the last one, was not in plain text) > > > > > > > > I was curious, so I ran the same testing in a cgroup 2 levels deep > > > > (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my > > > > experience. Here are the numbers: > > > > > > > > Base: 40198.0 mbps > > > > Patched: 38629.7 mbps > > > > > > > > The regression is reduced to ~3.9%. > > > > > > > > What's more interesting is that going from a level 2 cgroup to a level > > > > 4 cgroup is already a big hit with or without this patch: > > > > > > > > Base: 40198.0 -> 33076.5 mbps (~17.7% regression) > > > > Patched: 38629.7 -> 31410.1 (~18.7% regression) > > > > > > > > So going from level 2 to 4 is already a significant regression for > > > > other reasons (e.g. hierarchical charging). This patch only makes it > > > > marginally worse. This puts the numbers more into perspective imo than > > > > comparing values at level 4. What do you think? > > > > > > This is weird as we are running the experiments on the same machine. I > > > will rerun with 2 levels as well. 
Also can you rerun the page fault > > > benchmark as well which was showing 9% regression in your original > > > commit message? > > > > Thanks. I will re-run the page_fault tests, but keep in mind that the > > page fault benchmarks in will-it-scale are highly variable. We run > > them between kernel versions internally, and I think we ignore any > > changes below 10% as the benchmark is naturally noisy. > > > > I have a couple of runs for page_fault3_scalability showing a 2-3% > > improvement with this patch :) > > I ran the page_fault tests for 10 runs on a machine with 256 cpus in a > level 2 cgroup, here are the results (the results in the original > commit message are for 384 cpus in a level 4 cgroup): > > LABEL | MEAN | MEDIAN | STDDEV | > ------------------------------+-------------+-------------+------------- > page_fault1_per_process_ops | | | | > (A) base | 270249.164 | 265437.000 | 13451.836 | > (B) patched | 261368.709 | 255725.000 | 13394.767 | > | -3.29% | -3.66% | | > page_fault1_per_thread_ops | | | | > (A) base | 242111.345 | 239737.000 | 10026.031 | > (B) patched | 237057.109 | 235305.000 | 9769.687 | > | -2.09% | -1.85% | | > page_fault1_scalability | | | > (A) base | 0.034387 | 0.035168 | 0.0018283 | > (B) patched | 0.033988 | 0.034573 | 0.0018056 | > | -1.16% | -1.69% | | > page_fault2_per_process_ops | | | > (A) base | 203561.836 | 203301.000 | 2550.764 | > (B) patched | 197195.945 | 197746.000 | 2264.263 | > | -3.13% | -2.73% | | > page_fault2_per_thread_ops | | | > (A) base | 171046.473 | 170776.000 | 1509.679 | > (B) patched | 166626.327 | 166406.000 | 768.753 | > | -2.58% | -2.56% | | > page_fault2_scalability | | | > (A) base | 0.054026 | 0.053821 | 0.00062121 | > (B) patched | 0.053329 | 0.05306 | 0.00048394 | > | -1.29% | -1.41% | | > page_fault3_per_process_ops | | | > (A) base | 1295807.782 | 1297550.000 | 5907.585 | > (B) patched | 1275579.873 | 1273359.000 | 8759.160 | > | -1.56% | -1.86% | | > page_fault3_per_thread_ops | | | > (A) 
base | 391234.164 | 390860.000 | 1760.720 | > (B) patched | 377231.273 | 376369.000 | 1874.971 | > | -3.58% | -3.71% | | > page_fault3_scalability | | | > (A) base | 0.60369 | 0.60072 | 0.0083029 | > (B) patched | 0.61733 | 0.61544 | 0.009855 | > | +2.26% | +2.45% | | > > The numbers are much better. I can modify the commit log to include > the testing in the replies instead of what's currently there if this > helps (22 netperf instances on 44 cpus and will-it-scale page_fault on > 256 cpus -- all in a level 2 cgroup). Yes this looks better. I think we should also ask intel perf and phoronix folks to run their benchmarks as well (but no need to block on them).
On Thu, Oct 12, 2023 at 2:16 PM Shakeel Butt <shakeelb@google.com> wrote: > > On Thu, Oct 12, 2023 at 2:06 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > [..] > > > > > > > > > > > > Using next-20231009 and a similar 44 core machine with hyperthreading > > > > > > disabled, I ran 22 instances of netperf in parallel and got the > > > > > > following numbers from averaging 20 runs: > > > > > > > > > > > > Base: 33076.5 mbps > > > > > > Patched: 31410.1 mbps > > > > > > > > > > > > That's about 5% diff. I guess the number of iterations helps reduce > > > > > > the noise? I am not sure. > > > > > > > > > > > > Please also keep in mind that in this case all netperf instances are > > > > > > in the same cgroup and at a 4-level depth. I imagine in a practical > > > > > > setup processes would be a little more spread out, which means less > > > > > > common ancestors, so less contended atomic operations. > > > > > > > > > > > > > > > (Resending the reply as I messed up the last one, was not in plain text) > > > > > > > > > > I was curious, so I ran the same testing in a cgroup 2 levels deep > > > > > (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my > > > > > experience. Here are the numbers: > > > > > > > > > > Base: 40198.0 mbps > > > > > Patched: 38629.7 mbps > > > > > > > > > > The regression is reduced to ~3.9%. > > > > > > > > > > What's more interesting is that going from a level 2 cgroup to a level > > > > > 4 cgroup is already a big hit with or without this patch: > > > > > > > > > > Base: 40198.0 -> 33076.5 mbps (~17.7% regression) > > > > > Patched: 38629.7 -> 31410.1 (~18.7% regression) > > > > > > > > > > So going from level 2 to 4 is already a significant regression for > > > > > other reasons (e.g. hierarchical charging). This patch only makes it > > > > > marginally worse. This puts the numbers more into perspective imo than > > > > > comparing values at level 4. What do you think? 
> > > > > > > > This is weird as we are running the experiments on the same machine. I > > > > will rerun with 2 levels as well. Also can you rerun the page fault > > > > benchmark as well which was showing 9% regression in your original > > > > commit message? > > > > > > Thanks. I will re-run the page_fault tests, but keep in mind that the > > > page fault benchmarks in will-it-scale are highly variable. We run > > > them between kernel versions internally, and I think we ignore any > > > changes below 10% as the benchmark is naturally noisy. > > > > > > I have a couple of runs for page_fault3_scalability showing a 2-3% > > > improvement with this patch :) > > > > I ran the page_fault tests for 10 runs on a machine with 256 cpus in a > > level 2 cgroup, here are the results (the results in the original > > commit message are for 384 cpus in a level 4 cgroup): > > > > LABEL | MEAN | MEDIAN | STDDEV | > > ------------------------------+-------------+-------------+------------- > > page_fault1_per_process_ops | | | | > > (A) base | 270249.164 | 265437.000 | 13451.836 | > > (B) patched | 261368.709 | 255725.000 | 13394.767 | > > | -3.29% | -3.66% | | > > page_fault1_per_thread_ops | | | | > > (A) base | 242111.345 | 239737.000 | 10026.031 | > > (B) patched | 237057.109 | 235305.000 | 9769.687 | > > | -2.09% | -1.85% | | > > page_fault1_scalability | | | > > (A) base | 0.034387 | 0.035168 | 0.0018283 | > > (B) patched | 0.033988 | 0.034573 | 0.0018056 | > > | -1.16% | -1.69% | | > > page_fault2_per_process_ops | | | > > (A) base | 203561.836 | 203301.000 | 2550.764 | > > (B) patched | 197195.945 | 197746.000 | 2264.263 | > > | -3.13% | -2.73% | | > > page_fault2_per_thread_ops | | | > > (A) base | 171046.473 | 170776.000 | 1509.679 | > > (B) patched | 166626.327 | 166406.000 | 768.753 | > > | -2.58% | -2.56% | | > > page_fault2_scalability | | | > > (A) base | 0.054026 | 0.053821 | 0.00062121 | > > (B) patched | 0.053329 | 0.05306 | 0.00048394 | > > | -1.29% | -1.41% 
| | > > page_fault3_per_process_ops | | | > > (A) base | 1295807.782 | 1297550.000 | 5907.585 | > > (B) patched | 1275579.873 | 1273359.000 | 8759.160 | > > | -1.56% | -1.86% | | > > page_fault3_per_thread_ops | | | > > (A) base | 391234.164 | 390860.000 | 1760.720 | > > (B) patched | 377231.273 | 376369.000 | 1874.971 | > > | -3.58% | -3.71% | | > > page_fault3_scalability | | | > > (A) base | 0.60369 | 0.60072 | 0.0083029 | > > (B) patched | 0.61733 | 0.61544 | 0.009855 | > > | +2.26% | +2.45% | | > > > > The numbers are much better. I can modify the commit log to include > > the testing in the replies instead of what's currently there if this > > helps (22 netperf instances on 44 cpus and will-it-scale page_fault on > > 256 cpus -- all in a level 2 cgroup). > > Yes this looks better. I think we should also ask intel perf and > phoronix folks to run their benchmarks as well (but no need to block > on them). Anything I need to do for this to happen? (I thought such testing is already done on linux-next) Also, any further comments on the patch (or the series in general)? If not, I can send a new commit message for this patch in-place.
On Thu, Oct 12, 2023 at 2:20 PM Yosry Ahmed <yosryahmed@google.com> wrote: > [...] > > > > Yes this looks better. I think we should also ask intel perf and > > phoronix folks to run their benchmarks as well (but no need to block > > on them). > > Anything I need to do for this to happen? (I thought such testing is > already done on linux-next) Just Cced the relevant folks. Michael, Oliver & Feng, if you have some time/resource available, please do trigger your performance benchmarks on the following series (but nothing urgent): https://lore.kernel.org/all/20231010032117.1577496-1-yosryahmed@google.com/ > > Also, any further comments on the patch (or the series in general)? If > not, I can send a new commit message for this patch in-place. Sorry, I haven't taken a look yet but will try in a week or so.
On Thu, Oct 12, 2023 at 2:39 PM Shakeel Butt <shakeelb@google.com> wrote: > > On Thu, Oct 12, 2023 at 2:20 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > [...] > > > > > > Yes this looks better. I think we should also ask intel perf and > > > phoronix folks to run their benchmarks as well (but no need to block > > > on them). > > > > Anything I need to do for this to happen? (I thought such testing is > > already done on linux-next) > > Just Cced the relevant folks. > > Michael, Oliver & Feng, if you have some time/resource available, > please do trigger your performance benchmarks on the following series > (but nothing urgent): > > https://lore.kernel.org/all/20231010032117.1577496-1-yosryahmed@google.com/ Thanks for that. > > > > > Also, any further comments on the patch (or the series in general)? If > > not, I can send a new commit message for this patch in-place. > > Sorry, I haven't taken a look yet but will try in a week or so. Sounds good, thanks. Meanwhile, Andrew, could you please replace the commit log of this patch as follows for more updated testing info: Subject: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg A global counter for the magnitude of memcg stats update is maintained on the memcg side to avoid invoking rstat flushes when the pending updates are not significant. This avoids unnecessary flushes, which are not very cheap even if there isn't a lot of stats to flush. It also avoids unnecessary lock contention on the underlying global rstat lock. Make this threshold per-memcg. The scheme is followed where percpu (now also per-memcg) counters are incremented in the update path, and only propagated to per-memcg atomics when they exceed a certain threshold. This provides two benefits: (a) On large machines with a lot of memcgs, the global threshold can be reached relatively fast, so guarding the underlying lock becomes less effective. Making the threshold per-memcg avoids this. 
(b) Having a global threshold makes it hard to do subtree flushes, as we
cannot reset the global counter except for a full flush. Per-memcg
counters remove this as a blocker from doing subtree flushes, which
helps avoid unnecessary work when the stats of a small subtree are
needed.

Nothing is free, of course. This comes at a cost:

(a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4
bytes. The extra memory usage is insignificant.

(b) More work on the update side, although in the common case it will
only be percpu counter updates. The amount of work scales with the
number of ancestors (i.e. tree depth). This is not a new concept: adding
a cgroup to the rstat tree involves a parent loop, and so does charging.
Testing results below show no significant regressions.

(c) The error margin in the stats for the system as a whole increases
from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH *
NR_MEMCGS. This is probably fine because we have a similar per-memcg
error in charges coming from percpu stocks, and we have a periodic
flusher that makes sure we always flush all the stats every 2s anyway.

This patch was tested to make sure no significant regressions are
introduced on the update path as follows. The following benchmarks were
run in a cgroup that is 2 levels deep (/sys/fs/cgroup/a/b/):

(1) Running 22 instances of netperf on a 44 cpu machine with
hyperthreading disabled. All instances are run in a level 2 cgroup, as
well as netserver:
  # netserver -6
  # netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Averaging 20 runs, the numbers are as follows:
  Base: 40198.0 mbps
  Patched: 38629.7 mbps (-3.9%)

The regression is minimal, especially for 22 instances in the same
cgroup sharing all ancestors (so updating the same atomics).

(2) will-it-scale page_fault tests. These tests (specifically
per_process_ops in page_fault3 test) detected a 25.9% regression before
for a change in the stats update path [1].
These are the numbers from 10 runs (+ is good) on a machine with 256
cpus:

LABEL                         |     MEAN    |    MEDIAN   |    STDDEV   |
------------------------------+-------------+-------------+-------------
page_fault1_per_process_ops   |             |             |             |
(A) base                      |  270249.164 |  265437.000 |   13451.836 |
(B) patched                   |  261368.709 |  255725.000 |   13394.767 |
                              |      -3.29% |      -3.66% |             |
page_fault1_per_thread_ops    |             |             |             |
(A) base                      |  242111.345 |  239737.000 |   10026.031 |
(B) patched                   |  237057.109 |  235305.000 |    9769.687 |
                              |      -2.09% |      -1.85% |             |
page_fault1_scalability       |             |             |             |
(A) base                      |    0.034387 |    0.035168 |   0.0018283 |
(B) patched                   |    0.033988 |    0.034573 |   0.0018056 |
                              |      -1.16% |      -1.69% |             |
page_fault2_per_process_ops   |             |             |             |
(A) base                      |  203561.836 |  203301.000 |    2550.764 |
(B) patched                   |  197195.945 |  197746.000 |    2264.263 |
                              |      -3.13% |      -2.73% |             |
page_fault2_per_thread_ops    |             |             |             |
(A) base                      |  171046.473 |  170776.000 |    1509.679 |
(B) patched                   |  166626.327 |  166406.000 |     768.753 |
                              |      -2.58% |      -2.56% |             |
page_fault2_scalability       |             |             |             |
(A) base                      |    0.054026 |    0.053821 |  0.00062121 |
(B) patched                   |    0.053329 |    0.05306  |  0.00048394 |
                              |      -1.29% |      -1.41% |             |
page_fault3_per_process_ops   |             |             |             |
(A) base                      | 1295807.782 | 1297550.000 |    5907.585 |
(B) patched                   | 1275579.873 | 1273359.000 |    8759.160 |
                              |      -1.56% |      -1.86% |             |
page_fault3_per_thread_ops    |             |             |             |
(A) base                      |  391234.164 |  390860.000 |    1760.720 |
(B) patched                   |  377231.273 |  376369.000 |    1874.971 |
                              |      -3.58% |      -3.71% |             |
page_fault3_scalability       |             |             |             |
(A) base                      |     0.60369 |     0.60072 |   0.0083029 |
(B) patched                   |     0.61733 |     0.61544 |    0.009855 |
                              |      +2.26% |      +2.45% |             |

All regressions seem to be minimal, and within the normal variance for
the benchmark. The fix for [1] assumes that 3% is noise (and there were
no further practical complaints), so hopefully this means that such
variations in these microbenchmarks do not reflect on practical
workloads.

(3) I also ran stress-ng in a nested cgroup and did not observe any
obvious regressions.

[1] https://lore.kernel.org/all/20190520063534.GB19312@shao2-debian/
[..] > > > > > > Using next-20231009 and a similar 44 core machine with hyperthreading > > > disabled, I ran 22 instances of netperf in parallel and got the > > > following numbers from averaging 20 runs: > > > > > > Base: 33076.5 mbps > > > Patched: 31410.1 mbps > > > > > > That's about 5% diff. I guess the number of iterations helps reduce > > > the noise? I am not sure. > > > > > > Please also keep in mind that in this case all netperf instances are > > > in the same cgroup and at a 4-level depth. I imagine in a practical > > > setup processes would be a little more spread out, which means less > > > common ancestors, so less contended atomic operations. > > > > > > (Resending the reply as I messed up the last one, was not in plain text) > > > > I was curious, so I ran the same testing in a cgroup 2 levels deep > > (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my > > experience. Here are the numbers: > > > > Base: 40198.0 mbps > > Patched: 38629.7 mbps > > > > The regression is reduced to ~3.9%. > > > > What's more interesting is that going from a level 2 cgroup to a level > > 4 cgroup is already a big hit with or without this patch: > > > > Base: 40198.0 -> 33076.5 mbps (~17.7% regression) > > Patched: 38629.7 -> 31410.1 (~18.7% regression) > > > > So going from level 2 to 4 is already a significant regression for > > other reasons (e.g. hierarchical charging). This patch only makes it > > marginally worse. This puts the numbers more into perspective imo than > > comparing values at level 4. What do you think? > > I think it's reasonable. > > Especially comparing to how many cachelines we used to touch on the > write side when all flushing happened there. This looks like a good > trade-off to me. Thanks. Still wanting to figure out if this patch is what you suggested in our previous discussion [1], to add a Suggested-by if appropriate :) [1]https://lore.kernel.org/lkml/20230913153758.GB45543@cmpxchg.org/
On Thu, Oct 12, 2023 at 04:28:49PM -0700, Yosry Ahmed wrote: > [..] > > > > > > > > Using next-20231009 and a similar 44 core machine with hyperthreading > > > > disabled, I ran 22 instances of netperf in parallel and got the > > > > following numbers from averaging 20 runs: > > > > > > > > Base: 33076.5 mbps > > > > Patched: 31410.1 mbps > > > > > > > > That's about 5% diff. I guess the number of iterations helps reduce > > > > the noise? I am not sure. > > > > > > > > Please also keep in mind that in this case all netperf instances are > > > > in the same cgroup and at a 4-level depth. I imagine in a practical > > > > setup processes would be a little more spread out, which means less > > > > common ancestors, so less contended atomic operations. > > > > > > > > > (Resending the reply as I messed up the last one, was not in plain text) > > > > > > I was curious, so I ran the same testing in a cgroup 2 levels deep > > > (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my > > > experience. Here are the numbers: > > > > > > Base: 40198.0 mbps > > > Patched: 38629.7 mbps > > > > > > The regression is reduced to ~3.9%. > > > > > > What's more interesting is that going from a level 2 cgroup to a level > > > 4 cgroup is already a big hit with or without this patch: > > > > > > Base: 40198.0 -> 33076.5 mbps (~17.7% regression) > > > Patched: 38629.7 -> 31410.1 (~18.7% regression) > > > > > > So going from level 2 to 4 is already a significant regression for > > > other reasons (e.g. hierarchical charging). This patch only makes it > > > marginally worse. This puts the numbers more into perspective imo than > > > comparing values at level 4. What do you think? > > > > I think it's reasonable. > > > > Especially comparing to how many cachelines we used to touch on the > > write side when all flushing happened there. This looks like a good > > trade-off to me. > > Thanks. 
> > Still wanting to figure out if this patch is what you suggested in our > previous discussion [1], to add a > Suggested-by if appropriate :) > > [1]https://lore.kernel.org/lkml/20230913153758.GB45543@cmpxchg.org/ Haha, sort of. I suggested the cgroup-level flush-batching, but my proposal was missing the clever upward propagation of the pending stat updates that you added. You can add the tag if you're feeling generous, but I wouldn't be mad if you don't!
On Thu, Oct 12, 2023 at 7:33 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Thu, Oct 12, 2023 at 04:28:49PM -0700, Yosry Ahmed wrote: > > [..] > > > > > > > > > > Using next-20231009 and a similar 44 core machine with hyperthreading > > > > > disabled, I ran 22 instances of netperf in parallel and got the > > > > > following numbers from averaging 20 runs: > > > > > > > > > > Base: 33076.5 mbps > > > > > Patched: 31410.1 mbps > > > > > > > > > > That's about 5% diff. I guess the number of iterations helps reduce > > > > > the noise? I am not sure. > > > > > > > > > > Please also keep in mind that in this case all netperf instances are > > > > > in the same cgroup and at a 4-level depth. I imagine in a practical > > > > > setup processes would be a little more spread out, which means less > > > > > common ancestors, so less contended atomic operations. > > > > > > > > > > > > (Resending the reply as I messed up the last one, was not in plain text) > > > > > > > > I was curious, so I ran the same testing in a cgroup 2 levels deep > > > > (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my > > > > experience. Here are the numbers: > > > > > > > > Base: 40198.0 mbps > > > > Patched: 38629.7 mbps > > > > > > > > The regression is reduced to ~3.9%. > > > > > > > > What's more interesting is that going from a level 2 cgroup to a level > > > > 4 cgroup is already a big hit with or without this patch: > > > > > > > > Base: 40198.0 -> 33076.5 mbps (~17.7% regression) > > > > Patched: 38629.7 -> 31410.1 (~18.7% regression) > > > > > > > > So going from level 2 to 4 is already a significant regression for > > > > other reasons (e.g. hierarchical charging). This patch only makes it > > > > marginally worse. This puts the numbers more into perspective imo than > > > > comparing values at level 4. What do you think? > > > > > > I think it's reasonable. 
> > > > > > Especially comparing to how many cachelines we used to touch on the > > > write side when all flushing happened there. This looks like a good > > > trade-off to me. > > > > Thanks. > > > > Still wanting to figure out if this patch is what you suggested in our > > previous discussion [1], to add a > > Suggested-by if appropriate :) > > > > [1]https://lore.kernel.org/lkml/20230913153758.GB45543@cmpxchg.org/ > > Haha, sort of. I suggested the cgroup-level flush-batching, but my > proposal was missing the clever upward propagation of the pending stat > updates that you added. > > You can add the tag if you're feeling generous, but I wouldn't be mad > if you don't! I like to think that I am a generous person :) Will add it in the next respin.
On Thu, 12 Oct 2023 15:23:06 -0700 Yosry Ahmed <yosryahmed@google.com> wrote: > Meanwhile, Andrew, could you please replace the commit log of this > patch as follows for more updated testing info: Done.
On Sat, Oct 14, 2023 at 4:08 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Thu, 12 Oct 2023 15:23:06 -0700 Yosry Ahmed <yosryahmed@google.com> wrote: > > > Meanwhile, Andrew, could you please replace the commit log of this > > patch as follows for more updated testing info: > > Done. Thanks!
On Sat, Oct 14, 2023 at 4:08 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Thu, 12 Oct 2023 15:23:06 -0700 Yosry Ahmed <yosryahmed@google.com> wrote: > > > Meanwhile, Andrew, could you please replace the commit log of this > > patch as follows for more updated testing info: > > Done. Sorry Andrew, but could you please also take this fixlet? From: Yosry Ahmed <yosryahmed@google.com> Date: Tue, 17 Oct 2023 23:07:59 +0000 Subject: [PATCH] mm: memcg: clear percpu stats_pending during stats flush When flushing memcg stats, we clear the per-memcg count of pending stat updates, as they are captured by the flush. Also clear the percpu count for the cpu being flushed. Suggested-by: Wei Xu <weixugc@google.com> Signed-off-by: Yosry Ahmed <yosryahmed@google.com> --- mm/memcontrol.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 0b1377b16b3e0..fa92de780ac89 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5653,6 +5653,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu) } } } + statc->stats_updates = 0; /* We are in a per-cpu loop here, only do the atomic write once */ if (atomic64_read(&memcg->vmstats->stats_updates)) atomic64_set(&memcg->vmstats->stats_updates, 0); -- 2.42.0.655.g421f12c284-goog
hi, Yosry Ahmed, hi, Shakeel Butt, On Thu, Oct 12, 2023 at 03:23:06PM -0700, Yosry Ahmed wrote: > On Thu, Oct 12, 2023 at 2:39 PM Shakeel Butt <shakeelb@google.com> wrote: > > > > On Thu, Oct 12, 2023 at 2:20 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > [...] > > > > > > > > Yes this looks better. I think we should also ask intel perf and > > > > phoronix folks to run their benchmarks as well (but no need to block > > > > on them). > > > > > > Anything I need to do for this to happen? (I thought such testing is > > > already done on linux-next) > > > > Just Cced the relevant folks. > > > > Michael, Oliver & Feng, if you have some time/resource available, > > please do trigger your performance benchmarks on the following series > > (but nothing urgent): > > > > https://lore.kernel.org/all/20231010032117.1577496-1-yosryahmed@google.com/ > > Thanks for that.

we (0day team) have already applied the patch-set as:

c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything  <---- the base our tool picked for the patch set

these are already in our so-called hourly kernels, which are under various functional and performance tests. our 0day test logic is: if we find any regression in these hourly kernels compared to a base (e.g. a milestone release), auto-bisect will be triggered, and we then report only when we capture a first bad commit for the regression. based on this, if you don't receive any report in the following 2-3 weeks, you can assume 0day did not capture any regression from your patch-set.
*However*, please be aware that 0day is not a traditional CI system, and
also due to resource constraints, we cannot guarantee coverage, and we
cannot trigger specific tests for your patchset, either.
(sorry if this is not your expectation)

> > > Also, any further comments on the patch (or the series in general)? If
> > > not, I can send a new commit message for this patch in-place.
> >
> > Sorry, I haven't taken a look yet but will try in a week or so.
>
> Sounds good, thanks.
>
> Meanwhile, Andrew, could you please replace the commit log of this
> patch as follows for more updated testing info:
>
> Subject: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg
>
> A global counter for the magnitude of memcg stats update is maintained
> on the memcg side to avoid invoking rstat flushes when the pending
> updates are not significant. This avoids unnecessary flushes, which are
> not very cheap even if there isn't a lot of stats to flush. It also
> avoids unnecessary lock contention on the underlying global rstat lock.
>
> Make this threshold per-memcg. The same scheme is followed where percpu
> (now also per-memcg) counters are incremented in the update path, and
> only propagated to per-memcg atomics when they exceed a certain
> threshold.
>
> This provides two benefits:
> (a) On large machines with a lot of memcgs, the global threshold can be
> reached relatively fast, so guarding the underlying lock becomes less
> effective. Making the threshold per-memcg avoids this.
>
> (b) Having a global threshold makes it hard to do subtree flushes, as we
> cannot reset the global counter except for a full flush. Per-memcg
> counters remove this as a blocker from doing subtree flushes, which
> helps avoid unnecessary work when the stats of a small subtree are
> needed.
>
> Nothing is free, of course. This comes at a cost:
> (a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4
> bytes. The extra memory usage is insignificant.
>
> (b) More work on the update side, although in the common case it will
> only be percpu counter updates. The amount of work scales with the
> number of ancestors (i.e. tree depth). This is not a new concept: adding
> a cgroup to the rstat tree involves a parent loop, and so does charging.
> Testing results below show no significant regressions.
>
> (c) The error margin in the stats for the system as a whole increases
> from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH *
> NR_MEMCGS. This is probably fine because we have a similar per-memcg
> error in charges coming from percpu stocks, and we have a periodic
> flusher that makes sure we always flush all the stats every 2s anyway.
>
> This patch was tested to make sure no significant regressions are
> introduced on the update path as follows. The following benchmarks were
> run in a cgroup that is 2 levels deep (/sys/fs/cgroup/a/b/):
>
> (1) Running 22 instances of netperf on a 44 cpu machine with
> hyperthreading disabled. All instances are run in a level 2 cgroup, as
> well as netserver:
>   # netserver -6
>   # netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> Averaging 20 runs, the numbers are as follows:
>   Base: 40198.0 mbps
>   Patched: 38629.7 mbps (-3.9%)
>
> The regression is minimal, especially for 22 instances in the same
> cgroup sharing all ancestors (so updating the same atomics).
>
> (2) will-it-scale page_fault tests. These tests (specifically
> per_process_ops in page_fault3 test) previously detected a 25.9%
> regression for a change in the stats update path [1].
> These are the numbers from 10 runs (+ is good) on a machine with 256 cpus:
>
> LABEL                        |     MEAN    |    MEDIAN   |    STDDEV
> -----------------------------+-------------+-------------+-------------
> page_fault1_per_process_ops  |             |             |
> (A) base                     |  270249.164 |  265437.000 |   13451.836
> (B) patched                  |  261368.709 |  255725.000 |   13394.767
>                              |      -3.29% |      -3.66% |
> page_fault1_per_thread_ops   |             |             |
> (A) base                     |  242111.345 |  239737.000 |   10026.031
> (B) patched                  |  237057.109 |  235305.000 |    9769.687
>                              |      -2.09% |      -1.85% |
> page_fault1_scalability      |             |             |
> (A) base                     |    0.034387 |    0.035168 |   0.0018283
> (B) patched                  |    0.033988 |    0.034573 |   0.0018056
>                              |      -1.16% |      -1.69% |
> page_fault2_per_process_ops  |             |             |
> (A) base                     |  203561.836 |  203301.000 |    2550.764
> (B) patched                  |  197195.945 |  197746.000 |    2264.263
>                              |      -3.13% |      -2.73% |
> page_fault2_per_thread_ops   |             |             |
> (A) base                     |  171046.473 |  170776.000 |    1509.679
> (B) patched                  |  166626.327 |  166406.000 |     768.753
>                              |      -2.58% |      -2.56% |
> page_fault2_scalability      |             |             |
> (A) base                     |    0.054026 |    0.053821 |  0.00062121
> (B) patched                  |    0.053329 |     0.05306 |  0.00048394
>                              |      -1.29% |      -1.41% |
> page_fault3_per_process_ops  |             |             |
> (A) base                     | 1295807.782 | 1297550.000 |    5907.585
> (B) patched                  | 1275579.873 | 1273359.000 |    8759.160
>                              |      -1.56% |      -1.86% |
> page_fault3_per_thread_ops   |             |             |
> (A) base                     |  391234.164 |  390860.000 |    1760.720
> (B) patched                  |  377231.273 |  376369.000 |    1874.971
>                              |      -3.58% |      -3.71% |
> page_fault3_scalability      |             |             |
> (A) base                     |     0.60369 |     0.60072 |   0.0083029
> (B) patched                  |     0.61733 |     0.61544 |    0.009855
>                              |      +2.26% |      +2.45% |
>
> All regressions seem to be minimal, and within the normal variance for
> the benchmark. The fix for [1] assumed that 3% is noise (and there were
> no further practical complaints), so hopefully this means that such
> variations in these microbenchmarks do not reflect on practical
> workloads.
>
> (3) I also ran stress-ng in a nested cgroup and did not observe any
> obvious regressions.
>
> [1] https://lore.kernel.org/all/20190520063534.GB19312@shao2-debian/
On Wed, Oct 18, 2023 at 1:22 AM Oliver Sang <oliver.sang@intel.com> wrote:
>
> hi, Yosry Ahmed, hi, Shakeel Butt,
>
> On Thu, Oct 12, 2023 at 03:23:06PM -0700, Yosry Ahmed wrote:
> > On Thu, Oct 12, 2023 at 2:39 PM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > On Thu, Oct 12, 2023 at 2:20 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > [...]
> > > > > Yes this looks better. I think we should also ask intel perf and
> > > > > phoronix folks to run their benchmarks as well (but no need to block
> > > > > on them).
> > > >
> > > > Anything I need to do for this to happen? (I thought such testing is
> > > > already done on linux-next)
> > >
> > > Just Cced the relevant folks.
> > >
> > > Michael, Oliver & Feng, if you have some time/resource available,
> > > please do trigger your performance benchmarks on the following series
> > > (but nothing urgent):
> > >
> > > https://lore.kernel.org/all/20231010032117.1577496-1-yosryahmed@google.com/
> >
> > Thanks for that.
>
> we (0day team) have already applied the patch-set as:
>
> c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
> ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
> 51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
> 130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
> 26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
> 25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything  <---- the base our tool picked for the patch set
>
> they're already in our so-called hourly kernels, which undergo various
> functional and performance tests.
>
> our 0day test logic is: if we find any regression in these hourly kernels
> compared to a base (e.g. a milestone release), auto-bisect is triggered.
> we then report only when we capture a first bad commit for a regression.
>
> based on this, if you don't receive any report in the following 2-3 weeks,
> you could think 0day cannot capture any regression from your patch-set.
>
> *However*, please be aware that 0day is not a traditional CI system, and
> also due to resource constraints, we cannot guarantee coverage, and we
> cannot trigger specific tests for your patchset, either.
> (sorry if this is not your expectation)

Thanks for taking a look and clarifying this, much appreciated. Fingers
crossed for not getting any reports :)
Hello,

kernel test robot noticed a -25.8% regression of will-it-scale.per_thread_ops on:

commit: 51d74c18a9c61e7ee33bc90b522dd7f6e5b80bb5 ("[PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg")
url: https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257
base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/all/20231010032117.1577496-4-yosryahmed@google.com/
patch subject: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg

testcase: will-it-scale
test machine: 104 threads 2 sockets (Skylake) with 192G memory
parameters:

	nr_task: 100%
	mode: thread
	test: fallocate1
	cpufreq_governor: performance

In addition to that, the commit also has significant impact on the following tests:

+------------------+---------------------------------------------------------------+
| testcase: change | will-it-scale: will-it-scale.per_thread_ops -30.0% regression |
| test machine     | 104 threads 2 sockets (Skylake) with 192G memory              |
| test parameters  | cpufreq_governor=performance                                  |
|                  | mode=thread                                                   |
|                  | nr_task=50%                                                   |
|                  | test=fallocate1                                               |
+------------------+---------------------------------------------------------------+

If you fix the issue in a separate patch/commit (i.e.
not just a new version of the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202310202303.c68e7639-oliver.sang@intel.com

Details are as below:
-------------------------------------------------------------------------------------------------->

The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231020/202310202303.c68e7639-oliver.sang@intel.com

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-12/performance/x86_64-rhel-8.3/thread/100%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale

commit:
  130617edc1 ("mm: memcg: move vmstats structs definition above flushing code")
  51d74c18a9 ("mm: memcg: make stats flushing threshold per-memcg")

130617edc1cd1ba1  51d74c18a9c61e7ee33bc90b522
----------------  ---------------------------
       %stddev      %change          %stddev
           \           |                 \
      2.09             -0.5         1.61 ±  2%   mpstat.cpu.all.usr%
     27.58             +3.7%       28.59         turbostat.RAMWatt
      3324            -10.0%        2993         vmstat.system.cs
      1056           -100.0%        0.00         numa-meminfo.node0.Inactive(file)
      6.67 ±141%   +15799.3%        1059         numa-meminfo.node1.Inactive(file)
    120.83 ± 11%      +79.6%      217.00 ±  9%   perf-c2c.DRAM.local
    594.50 ±  6%      +43.8%      854.83 ±  5%   perf-c2c.DRAM.remote
   3797041            -25.8%     2816352         will-it-scale.104.threads
     36509            -25.8%       27079         will-it-scale.per_thread_ops
   3797041            -25.8%     2816352         will-it-scale.workload
 1.142e+09            -26.2%   8.437e+08         numa-numastat.node0.local_node
 1.143e+09            -26.1%   8.439e+08         numa-numastat.node0.numa_hit
 1.148e+09            -25.4%   8.563e+08 ±  2%   numa-numastat.node1.local_node
 1.149e+09            -25.4%   8.564e+08 ±  2%   numa-numastat.node1.numa_hit
     32933             -2.6%       32068         proc-vmstat.nr_slab_reclaimable
 2.291e+09            -25.8%     1.7e+09         proc-vmstat.numa_hit
 2.291e+09            -25.8%     1.7e+09         proc-vmstat.numa_local
  2.29e+09            -25.8%   1.699e+09         proc-vmstat.pgalloc_normal
 2.289e+09            -25.8%   1.699e+09
proc-vmstat.pgfree 1.00 ± 93% +154.2% 2.55 ± 16% perf-sched.sch_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64 191.10 ± 2% +18.0% 225.55 ± 2% perf-sched.wait_and_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm 385.50 ± 14% +39.6% 538.17 ± 12% perf-sched.wait_and_delay.count.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate 118.67 ± 11% -62.6% 44.33 ±100% perf-sched.wait_and_delay.count.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt 5043 ± 2% -13.0% 4387 ± 6% perf-sched.wait_and_delay.count.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm 167.12 ±222% +200.1% 501.48 ± 99% perf-sched.wait_and_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64 191.09 ± 2% +18.0% 225.53 ± 2% perf-sched.wait_time.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm 293.46 ± 4% +12.8% 330.98 ± 6% perf-sched.wait_time.max.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm 199.33 -100.0% 0.00 numa-vmstat.node0.nr_active_file 264.00 -100.0% 0.00 numa-vmstat.node0.nr_inactive_file 199.33 -100.0% 0.00 numa-vmstat.node0.nr_zone_active_file 264.00 -100.0% 0.00 numa-vmstat.node0.nr_zone_inactive_file 1.143e+09 -26.1% 8.439e+08 numa-vmstat.node0.numa_hit 1.142e+09 -26.2% 8.437e+08 numa-vmstat.node0.numa_local 1.67 ±141% +15799.3% 264.99 numa-vmstat.node1.nr_inactive_file 1.67 ±141% +15799.3% 264.99 numa-vmstat.node1.nr_zone_inactive_file 1.149e+09 -25.4% 8.564e+08 ± 2% numa-vmstat.node1.numa_hit 1.148e+09 -25.4% 8.563e+08 ± 2% numa-vmstat.node1.numa_local 0.59 ± 3% +125.2% 1.32 ± 2% perf-stat.i.MPKI 9.027e+09 -17.9% 7.408e+09 perf-stat.i.branch-instructions 0.64 -0.0 0.60 perf-stat.i.branch-miss-rate% 58102855 -23.3% 44580037 ± 2% perf-stat.i.branch-misses 15.28 +7.0 22.27 perf-stat.i.cache-miss-rate% 25155306 ± 2% 
+82.7% 45953601 ± 3% perf-stat.i.cache-misses 1.644e+08 +25.4% 2.062e+08 ± 2% perf-stat.i.cache-references 3258 -10.3% 2921 perf-stat.i.context-switches 6.73 +23.3% 8.30 perf-stat.i.cpi 145.97 -1.3% 144.13 perf-stat.i.cpu-migrations 11519 ± 3% -45.4% 6293 ± 3% perf-stat.i.cycles-between-cache-misses 0.04 -0.0 0.03 perf-stat.i.dTLB-load-miss-rate% 3921408 -25.3% 2929564 perf-stat.i.dTLB-load-misses 1.098e+10 -18.1% 8.993e+09 perf-stat.i.dTLB-loads 0.00 ± 2% +0.0 0.00 ± 4% perf-stat.i.dTLB-store-miss-rate% 5.606e+09 -23.2% 4.304e+09 perf-stat.i.dTLB-stores 95.65 -1.2 94.49 perf-stat.i.iTLB-load-miss-rate% 3876741 -25.0% 2905764 perf-stat.i.iTLB-load-misses 4.286e+10 -18.9% 3.477e+10 perf-stat.i.instructions 11061 +8.2% 11969 perf-stat.i.instructions-per-iTLB-miss 0.15 -18.9% 0.12 perf-stat.i.ipc 48.65 ± 2% +46.2% 71.11 ± 2% perf-stat.i.metric.K/sec 247.84 -18.9% 201.05 perf-stat.i.metric.M/sec 3138385 ± 2% +77.7% 5578401 ± 2% perf-stat.i.node-load-misses 375827 ± 3% +69.2% 635857 ± 11% perf-stat.i.node-loads 1343194 -26.8% 983668 perf-stat.i.node-store-misses 51550 ± 3% -19.0% 41748 ± 7% perf-stat.i.node-stores 0.59 ± 3% +125.1% 1.32 ± 2% perf-stat.overall.MPKI 0.64 -0.0 0.60 perf-stat.overall.branch-miss-rate% 15.30 +7.0 22.28 perf-stat.overall.cache-miss-rate% 6.73 +23.3% 8.29 perf-stat.overall.cpi 11470 ± 2% -45.3% 6279 ± 2% perf-stat.overall.cycles-between-cache-misses 0.04 -0.0 0.03 perf-stat.overall.dTLB-load-miss-rate% 0.00 ± 2% +0.0 0.00 ± 4% perf-stat.overall.dTLB-store-miss-rate% 95.56 -1.4 94.17 perf-stat.overall.iTLB-load-miss-rate% 11059 +8.2% 11967 perf-stat.overall.instructions-per-iTLB-miss 0.15 -18.9% 0.12 perf-stat.overall.ipc 3396437 +9.5% 3718021 perf-stat.overall.path-length 8.997e+09 -17.9% 7.383e+09 perf-stat.ps.branch-instructions 57910417 -23.3% 44426577 ± 2% perf-stat.ps.branch-misses 25075498 ± 2% +82.7% 45803186 ± 3% perf-stat.ps.cache-misses 1.639e+08 +25.4% 2.056e+08 ± 2% perf-stat.ps.cache-references 3247 -10.3% 2911 
perf-stat.ps.context-switches 145.47 -1.3% 143.61 perf-stat.ps.cpu-migrations 3908900 -25.3% 2920218 perf-stat.ps.dTLB-load-misses 1.094e+10 -18.1% 8.963e+09 perf-stat.ps.dTLB-loads 5.587e+09 -23.2% 4.289e+09 perf-stat.ps.dTLB-stores 3863663 -25.0% 2895895 perf-stat.ps.iTLB-load-misses 4.272e+10 -18.9% 3.466e+10 perf-stat.ps.instructions 3128132 ± 2% +77.7% 5559939 ± 2% perf-stat.ps.node-load-misses 375403 ± 3% +69.0% 634300 ± 11% perf-stat.ps.node-loads 1338688 -26.8% 980311 perf-stat.ps.node-store-misses 51546 ± 3% -19.1% 41692 ± 7% perf-stat.ps.node-stores 1.29e+13 -18.8% 1.047e+13 perf-stat.total.instructions 0.96 -0.3 0.70 ± 2% perf-profile.calltrace.cycles-pp.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate 0.97 -0.3 0.72 perf-profile.calltrace.cycles-pp.syscall_return_via_sysret.fallocate64 0.76 ± 2% -0.2 0.54 ± 3% perf-profile.calltrace.cycles-pp.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate 0.82 -0.2 0.60 ± 2% perf-profile.calltrace.cycles-pp.alloc_pages_mpol.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate 0.91 -0.2 0.72 perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64 0.68 +0.1 0.76 ± 2% perf-profile.calltrace.cycles-pp.lru_add_fn.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp 1.67 +0.1 1.77 perf-profile.calltrace.cycles-pp.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate 1.78 ± 2% +0.1 1.92 ± 2% perf-profile.calltrace.cycles-pp.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr.notify_change 0.69 ± 5% +0.1 0.84 ± 4% perf-profile.calltrace.cycles-pp.get_mem_cgroup_from_mm.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate 1.56 ± 2% +0.2 1.76 ± 2% 
perf-profile.calltrace.cycles-pp.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr 0.85 ± 4% +0.4 1.23 ± 2% perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate 0.78 ± 4% +0.4 1.20 ± 3% perf-profile.calltrace.cycles-pp.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range 0.73 ± 4% +0.4 1.17 ± 3% perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio 48.39 +0.8 49.14 perf-profile.calltrace.cycles-pp.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64 0.00 +0.8 0.77 ± 4% perf-profile.calltrace.cycles-pp.mem_cgroup_commit_charge.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate 40.24 +0.8 41.03 perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp 40.22 +0.8 41.01 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio 0.00 +0.8 0.79 ± 3% perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp 40.19 +0.8 40.98 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru 1.33 ± 5% +0.8 2.13 ± 4% perf-profile.calltrace.cycles-pp.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate 48.16 +0.8 48.98 perf-profile.calltrace.cycles-pp.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64 0.00 +0.9 0.88 ± 2% 
perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio 47.92 +0.9 48.81 perf-profile.calltrace.cycles-pp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe 47.07 +0.9 48.01 perf-profile.calltrace.cycles-pp.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64 46.59 +1.1 47.64 perf-profile.calltrace.cycles-pp.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate 0.99 -0.3 0.73 ± 2% perf-profile.children.cycles-pp.syscall_return_via_sysret 0.96 -0.3 0.70 ± 2% perf-profile.children.cycles-pp.shmem_alloc_folio 0.78 ± 2% -0.2 0.56 ± 3% perf-profile.children.cycles-pp.shmem_inode_acct_blocks 0.83 -0.2 0.61 ± 2% perf-profile.children.cycles-pp.alloc_pages_mpol 0.92 -0.2 0.73 perf-profile.children.cycles-pp.syscall_exit_to_user_mode 0.74 ± 2% -0.2 0.55 ± 2% perf-profile.children.cycles-pp.xas_store 0.67 -0.2 0.50 ± 3% perf-profile.children.cycles-pp.__alloc_pages 0.43 -0.1 0.31 ± 2% perf-profile.children.cycles-pp.__entry_text_start 0.41 ± 2% -0.1 0.30 ± 3% perf-profile.children.cycles-pp.free_unref_page_list 0.35 -0.1 0.25 ± 2% perf-profile.children.cycles-pp.xas_load 0.35 ± 2% -0.1 0.25 ± 4% perf-profile.children.cycles-pp.__mod_lruvec_state 0.39 -0.1 0.30 ± 2% perf-profile.children.cycles-pp.get_page_from_freelist 0.27 ± 2% -0.1 0.19 ± 4% perf-profile.children.cycles-pp.__mod_node_page_state 0.32 ± 3% -0.1 0.24 ± 3% perf-profile.children.cycles-pp.find_lock_entries 0.23 ± 2% -0.1 0.15 ± 4% perf-profile.children.cycles-pp.xas_descend 0.28 ± 3% -0.1 0.20 ± 3% perf-profile.children.cycles-pp._raw_spin_lock 0.25 ± 3% -0.1 0.18 ± 3% perf-profile.children.cycles-pp.__dquot_alloc_space 0.16 ± 3% -0.1 0.10 ± 5% perf-profile.children.cycles-pp.xas_find_conflict 0.26 ± 2% -0.1 0.20 ± 3% perf-profile.children.cycles-pp.filemap_get_entry 0.26 -0.1 0.20 ± 2% 
perf-profile.children.cycles-pp.rmqueue 0.20 ± 3% -0.1 0.14 ± 3% perf-profile.children.cycles-pp.truncate_cleanup_folio 0.19 ± 5% -0.1 0.14 ± 4% perf-profile.children.cycles-pp.xas_clear_mark 0.17 ± 5% -0.0 0.12 ± 4% perf-profile.children.cycles-pp.xas_init_marks 0.15 ± 4% -0.0 0.10 ± 4% perf-profile.children.cycles-pp.free_unref_page_commit 0.18 ± 3% -0.0 0.14 ± 3% perf-profile.children.cycles-pp.__cond_resched 0.07 ± 5% -0.0 0.02 ± 99% perf-profile.children.cycles-pp.xas_find 0.13 ± 2% -0.0 0.09 perf-profile.children.cycles-pp.security_vm_enough_memory_mm 0.14 ± 4% -0.0 0.10 ± 7% perf-profile.children.cycles-pp.__fget_light 0.06 ± 6% -0.0 0.02 ± 99% perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack 0.12 ± 4% -0.0 0.08 ± 4% perf-profile.children.cycles-pp.xas_start 0.08 ± 5% -0.0 0.05 perf-profile.children.cycles-pp.__folio_throttle_swaprate 0.12 -0.0 0.08 ± 5% perf-profile.children.cycles-pp.folio_unlock 0.14 ± 3% -0.0 0.11 ± 3% perf-profile.children.cycles-pp.try_charge_memcg 0.12 ± 6% -0.0 0.08 ± 5% perf-profile.children.cycles-pp.free_unref_page_prepare 0.12 ± 3% -0.0 0.09 ± 4% perf-profile.children.cycles-pp.noop_dirty_folio 0.20 ± 2% -0.0 0.17 ± 5% perf-profile.children.cycles-pp.page_counter_uncharge 0.10 -0.0 0.07 ± 5% perf-profile.children.cycles-pp.cap_vm_enough_memory 0.09 ± 6% -0.0 0.06 ± 6% perf-profile.children.cycles-pp._raw_spin_trylock 0.09 ± 5% -0.0 0.06 ± 7% perf-profile.children.cycles-pp.inode_add_bytes 0.06 ± 6% -0.0 0.03 ± 70% perf-profile.children.cycles-pp.filemap_free_folio 0.06 ± 6% -0.0 0.03 ± 70% perf-profile.children.cycles-pp.percpu_counter_add_batch 0.12 ± 3% -0.0 0.09 ± 5% perf-profile.children.cycles-pp.__folio_cancel_dirty 0.12 ± 3% -0.0 0.10 ± 5% perf-profile.children.cycles-pp.shmem_recalc_inode 0.09 ± 5% -0.0 0.07 ± 7% perf-profile.children.cycles-pp.__vm_enough_memory 0.08 ± 5% -0.0 0.06 perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack 0.08 ± 5% -0.0 0.06 
perf-profile.children.cycles-pp.security_file_permission 0.08 ± 6% -0.0 0.05 ± 7% perf-profile.children.cycles-pp.apparmor_file_permission 0.09 ± 4% -0.0 0.07 ± 8% perf-profile.children.cycles-pp.__percpu_counter_limited_add 0.08 ± 6% -0.0 0.06 ± 8% perf-profile.children.cycles-pp.__list_add_valid_or_report 0.07 ± 8% -0.0 0.05 perf-profile.children.cycles-pp.get_pfnblock_flags_mask 0.14 ± 3% -0.0 0.12 ± 6% perf-profile.children.cycles-pp.cgroup_rstat_updated 0.07 ± 5% -0.0 0.05 perf-profile.children.cycles-pp.policy_nodemask 0.24 ± 2% -0.0 0.22 ± 2% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt 0.08 -0.0 0.07 ± 7% perf-profile.children.cycles-pp.xas_create 0.69 +0.1 0.78 perf-profile.children.cycles-pp.lru_add_fn 1.72 ± 2% +0.1 1.80 perf-profile.children.cycles-pp.shmem_add_to_page_cache 1.79 ± 2% +0.1 1.93 ± 2% perf-profile.children.cycles-pp.filemap_remove_folio 0.13 ± 5% +0.1 0.28 perf-profile.children.cycles-pp.file_modified 0.69 ± 5% +0.1 0.84 ± 3% perf-profile.children.cycles-pp.get_mem_cgroup_from_mm 0.09 ± 7% +0.2 0.24 ± 2% perf-profile.children.cycles-pp.inode_needs_update_time 1.58 ± 3% +0.2 1.77 ± 2% perf-profile.children.cycles-pp.__filemap_remove_folio 0.15 ± 3% +0.4 0.50 ± 3% perf-profile.children.cycles-pp.__count_memcg_events 0.79 ± 4% +0.4 1.20 ± 3% perf-profile.children.cycles-pp.filemap_unaccount_folio 0.36 ± 5% +0.4 0.77 ± 4% perf-profile.children.cycles-pp.mem_cgroup_commit_charge 98.33 +0.5 98.78 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe 97.74 +0.6 98.34 perf-profile.children.cycles-pp.do_syscall_64 48.39 +0.8 49.15 perf-profile.children.cycles-pp.__x64_sys_fallocate 1.34 ± 5% +0.8 2.14 ± 4% perf-profile.children.cycles-pp.__mem_cgroup_charge 1.61 ± 4% +0.8 2.42 ± 2% perf-profile.children.cycles-pp.__mod_lruvec_page_state 48.17 +0.8 48.98 perf-profile.children.cycles-pp.vfs_fallocate 47.94 +0.9 48.82 perf-profile.children.cycles-pp.shmem_fallocate 47.10 +0.9 48.04 
perf-profile.children.cycles-pp.shmem_get_folio_gfp 84.34 +0.9 85.28 perf-profile.children.cycles-pp.folio_lruvec_lock_irqsave 84.31 +0.9 85.26 perf-profile.children.cycles-pp._raw_spin_lock_irqsave 84.24 +1.0 85.21 perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath 46.65 +1.1 47.70 perf-profile.children.cycles-pp.shmem_alloc_and_add_folio 1.23 ± 4% +1.4 2.58 ± 2% perf-profile.children.cycles-pp.__mod_memcg_lruvec_state 0.98 -0.3 0.73 ± 2% perf-profile.self.cycles-pp.syscall_return_via_sysret 0.88 -0.2 0.70 perf-profile.self.cycles-pp.syscall_exit_to_user_mode 0.60 -0.2 0.45 perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe 0.41 ± 3% -0.1 0.27 ± 3% perf-profile.self.cycles-pp.release_pages 0.41 -0.1 0.30 ± 3% perf-profile.self.cycles-pp.xas_store 0.41 ± 3% -0.1 0.29 ± 2% perf-profile.self.cycles-pp.folio_batch_move_lru 0.30 ± 3% -0.1 0.18 ± 5% perf-profile.self.cycles-pp.shmem_add_to_page_cache 0.38 ± 2% -0.1 0.27 ± 2% perf-profile.self.cycles-pp.__entry_text_start 0.30 ± 3% -0.1 0.20 ± 6% perf-profile.self.cycles-pp.lru_add_fn 0.28 ± 2% -0.1 0.20 ± 5% perf-profile.self.cycles-pp.shmem_fallocate 0.26 ± 2% -0.1 0.18 ± 5% perf-profile.self.cycles-pp.__mod_node_page_state 0.27 ± 3% -0.1 0.20 ± 2% perf-profile.self.cycles-pp._raw_spin_lock 0.21 ± 2% -0.1 0.15 ± 4% perf-profile.self.cycles-pp.__alloc_pages 0.20 ± 2% -0.1 0.14 ± 3% perf-profile.self.cycles-pp.xas_descend 0.26 ± 3% -0.1 0.20 ± 4% perf-profile.self.cycles-pp.find_lock_entries 0.18 ± 4% -0.0 0.13 ± 5% perf-profile.self.cycles-pp.xas_clear_mark 0.15 ± 7% -0.0 0.10 ± 11% perf-profile.self.cycles-pp.shmem_inode_acct_blocks 0.16 ± 4% -0.0 0.12 ± 4% perf-profile.self.cycles-pp.__dquot_alloc_space 0.13 ± 4% -0.0 0.09 ± 5% perf-profile.self.cycles-pp.free_unref_page_commit 0.13 -0.0 0.09 ± 5% perf-profile.self.cycles-pp._raw_spin_lock_irq 0.16 ± 4% -0.0 0.12 ± 4% perf-profile.self.cycles-pp.shmem_alloc_and_add_folio 0.13 ± 5% -0.0 0.09 ± 7% perf-profile.self.cycles-pp.__filemap_remove_folio 
0.13 ± 2% -0.0 0.09 ± 5% perf-profile.self.cycles-pp.get_page_from_freelist 0.12 ± 4% -0.0 0.09 ± 5% perf-profile.self.cycles-pp.vfs_fallocate 0.06 ± 7% -0.0 0.02 ± 99% perf-profile.self.cycles-pp.apparmor_file_permission 0.13 ± 3% -0.0 0.10 ± 5% perf-profile.self.cycles-pp.fallocate64 0.11 ± 4% -0.0 0.07 perf-profile.self.cycles-pp.xas_start 0.07 ± 5% -0.0 0.03 ± 70% perf-profile.self.cycles-pp.shmem_alloc_folio 0.14 ± 4% -0.0 0.10 ± 7% perf-profile.self.cycles-pp.__fget_light 0.10 ± 4% -0.0 0.06 ± 7% perf-profile.self.cycles-pp.rmqueue 0.12 ± 3% -0.0 0.09 ± 4% perf-profile.self.cycles-pp.xas_load 0.11 ± 4% -0.0 0.08 ± 7% perf-profile.self.cycles-pp.folio_unlock 0.10 ± 4% -0.0 0.07 ± 8% perf-profile.self.cycles-pp.alloc_pages_mpol 0.15 ± 2% -0.0 0.12 ± 5% perf-profile.self.cycles-pp.shmem_get_folio_gfp 0.10 -0.0 0.07 perf-profile.self.cycles-pp.cap_vm_enough_memory 0.16 ± 2% -0.0 0.13 ± 6% perf-profile.self.cycles-pp.page_counter_uncharge 0.12 ± 5% -0.0 0.09 ± 4% perf-profile.self.cycles-pp.__cond_resched 0.06 ± 6% -0.0 0.03 ± 70% perf-profile.self.cycles-pp.filemap_free_folio 0.12 ± 3% -0.0 0.10 ± 5% perf-profile.self.cycles-pp.free_unref_page_list 0.12 -0.0 0.09 ± 4% perf-profile.self.cycles-pp.noop_dirty_folio 0.10 ± 3% -0.0 0.07 ± 5% perf-profile.self.cycles-pp.filemap_remove_folio 0.10 ± 5% -0.0 0.07 ± 5% perf-profile.self.cycles-pp.try_charge_memcg 0.12 ± 3% -0.0 0.10 ± 8% perf-profile.self.cycles-pp.cgroup_rstat_updated 0.09 ± 4% -0.0 0.07 ± 7% perf-profile.self.cycles-pp.__folio_cancel_dirty 0.08 ± 4% -0.0 0.06 ± 8% perf-profile.self.cycles-pp._raw_spin_lock_irqsave 0.08 ± 5% -0.0 0.06 perf-profile.self.cycles-pp._raw_spin_trylock 0.08 -0.0 0.06 ± 6% perf-profile.self.cycles-pp.folio_add_lru 0.08 ± 8% -0.0 0.06 ± 6% perf-profile.self.cycles-pp.__mod_lruvec_state 0.07 ± 5% -0.0 0.05 perf-profile.self.cycles-pp.xas_find_conflict 0.08 ± 10% -0.0 0.06 ± 9% perf-profile.self.cycles-pp.truncate_cleanup_folio 0.07 ± 10% -0.0 0.05 
perf-profile.self.cycles-pp.xas_init_marks 0.08 ± 4% -0.0 0.06 ± 7% perf-profile.self.cycles-pp.__percpu_counter_limited_add 0.07 ± 7% -0.0 0.05 perf-profile.self.cycles-pp.get_pfnblock_flags_mask 0.07 ± 5% -0.0 0.06 ± 8% perf-profile.self.cycles-pp.__list_add_valid_or_report 0.02 ±141% +0.0 0.06 ± 8% perf-profile.self.cycles-pp.uncharge_batch 0.21 ± 9% +0.1 0.31 ± 7% perf-profile.self.cycles-pp.mem_cgroup_commit_charge 0.69 ± 5% +0.1 0.83 ± 4% perf-profile.self.cycles-pp.get_mem_cgroup_from_mm 0.06 ± 6% +0.2 0.22 ± 2% perf-profile.self.cycles-pp.inode_needs_update_time 0.14 ± 8% +0.3 0.42 ± 7% perf-profile.self.cycles-pp.__mem_cgroup_charge 0.13 ± 7% +0.4 0.49 ± 3% perf-profile.self.cycles-pp.__count_memcg_events 84.24 +1.0 85.21 perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath 1.12 ± 5% +1.4 2.50 ± 2% perf-profile.self.cycles-pp.__mod_memcg_lruvec_state *************************************************************************************************** lkp-skl-fpga01: 104 threads 2 sockets (Skylake) with 192G memory ========================================================================================= compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase: gcc-12/performance/x86_64-rhel-8.3/thread/50%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale commit: 130617edc1 ("mm: memcg: move vmstats structs definition above flushing code") 51d74c18a9 ("mm: memcg: make stats flushing threshold per-memcg") 130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522 ---------------- --------------------------- %stddev %change %stddev \ | \ 1.87 -0.4 1.43 ± 3% mpstat.cpu.all.usr% 3171 -5.3% 3003 ± 2% vmstat.system.cs 84.83 ± 9% +55.8% 132.17 ± 16% perf-c2c.DRAM.local 484.17 ± 3% +37.1% 663.67 ± 10% perf-c2c.DRAM.remote 72763 ± 5% +14.4% 83212 ± 12% turbostat.C1 0.08 -25.0% 0.06 turbostat.IPC 27.90 +4.6% 29.18 turbostat.RAMWatt 3982212 -30.0% 2785941 will-it-scale.52.threads 76580 -30.0% 53575 will-it-scale.per_thread_ops 
3982212 -30.0% 2785941  will-it-scale.workload

1.175e+09 ± 2% -28.6% 8.392e+08 ± 2%  numa-numastat.node0.local_node
1.175e+09 ± 2% -28.6% 8.394e+08 ± 2%  numa-numastat.node0.numa_hit
1.231e+09 ± 2% -31.3% 8.463e+08 ± 3%  numa-numastat.node1.local_node
1.232e+09 ± 2% -31.3% 8.466e+08 ± 3%  numa-numastat.node1.numa_hit

1.175e+09 ± 2% -28.6% 8.394e+08 ± 2%  numa-vmstat.node0.numa_hit
1.175e+09 ± 2% -28.6% 8.392e+08 ± 2%  numa-vmstat.node0.numa_local
1.232e+09 ± 2% -31.3% 8.466e+08 ± 3%  numa-vmstat.node1.numa_hit
1.231e+09 ± 2% -31.3% 8.463e+08 ± 3%  numa-vmstat.node1.numa_local

2.408e+09 -30.0% 1.686e+09  proc-vmstat.numa_hit
2.406e+09 -30.0% 1.685e+09  proc-vmstat.numa_local
2.404e+09 -29.9% 1.684e+09  proc-vmstat.pgalloc_normal
2.404e+09 -29.9% 1.684e+09  proc-vmstat.pgfree

0.04 ± 9% -19.3% 0.03 ± 6%  perf-sched.wait_and_delay.avg.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
0.04 ± 8% -22.3% 0.03 ± 5%  perf-sched.wait_and_delay.avg.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
0.91 ± 2% +11.3% 1.01 ± 5%  perf-sched.wait_and_delay.avg.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
0.04 ± 13% -90.3% 0.00 ±223%  perf-sched.wait_and_delay.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
1.14 +15.1% 1.31  perf-sched.wait_and_delay.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
189.94 ± 3% +18.3% 224.73 ± 4%  perf-sched.wait_and_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
1652 ± 4% -13.4% 1431 ± 4%  perf-sched.wait_and_delay.count.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
83.67 ± 7% -87.6% 10.33 ±223%  perf-sched.wait_and_delay.count.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
3827 ± 4% -13.0% 3328 ± 3%  perf-sched.wait_and_delay.count.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
1.71 ±165% -83.4% 0.28 ± 21%  perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
0.43 ± 17% -43.8% 0.24 ± 26%  perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.46 ± 17% -36.7% 0.29 ± 12%  perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
0.30 ± 34% -90.7% 0.03 ±223%  perf-sched.wait_and_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
0.04 ± 9% -19.3% 0.03 ± 6%  perf-sched.wait_time.avg.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
0.04 ± 8% -22.3% 0.03 ± 5%  perf-sched.wait_time.avg.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
0.04 ± 11% -33.1% 0.03 ± 17%  perf-sched.wait_time.avg.ms.__cond_resched.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.90 ± 2% +11.5% 1.00 ± 5%  perf-sched.wait_time.avg.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
0.04 ± 13% -26.6% 0.03 ± 12%  perf-sched.wait_time.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
1.13 +15.2% 1.30  perf-sched.wait_time.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
189.93 ± 3% +18.3% 224.72 ± 4%  perf-sched.wait_time.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
1.71 ±165% -83.4% 0.28 ± 21%  perf-sched.wait_time.max.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
0.43 ± 17% -43.8% 0.24 ± 26%  perf-sched.wait_time.max.ms.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.46 ± 17% -36.7% 0.29 ± 12%  perf-sched.wait_time.max.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate

0.75 +142.0% 1.83 ± 2%  perf-stat.i.MPKI
8.47e+09 -24.4% 6.407e+09  perf-stat.i.branch-instructions
0.66 -0.0 0.63  perf-stat.i.branch-miss-rate%
56364992 -28.3% 40421603 ± 3%  perf-stat.i.branch-misses
14.64 +6.7 21.30  perf-stat.i.cache-miss-rate%
30868184 +81.3% 55977240 ± 3%  perf-stat.i.cache-misses
2.107e+08 +24.7% 2.627e+08 ± 2%  perf-stat.i.cache-references
3106 -5.5% 2934 ± 2%  perf-stat.i.context-switches
3.55 +33.4% 4.74  perf-stat.i.cpi
4722 -44.8% 2605 ± 3%  perf-stat.i.cycles-between-cache-misses
0.04 -0.0 0.04  perf-stat.i.dTLB-load-miss-rate%
4117232 -29.1% 2917107  perf-stat.i.dTLB-load-misses
1.051e+10 -24.1% 7.979e+09  perf-stat.i.dTLB-loads
0.00 ± 3% +0.0 0.00 ± 6%  perf-stat.i.dTLB-store-miss-rate%
5.886e+09 -27.5% 4.269e+09  perf-stat.i.dTLB-stores
78.16 -6.6 71.51  perf-stat.i.iTLB-load-miss-rate%
4131074 ± 3% -30.0% 2891515  perf-stat.i.iTLB-load-misses
4.098e+10 -25.0% 3.072e+10  perf-stat.i.instructions
9929 ± 2% +7.0% 10627  perf-stat.i.instructions-per-iTLB-miss
0.28 -25.0% 0.21  perf-stat.i.ipc
63.49 +43.8% 91.27 ± 3%  perf-stat.i.metric.K/sec
241.12 -24.6% 181.87  perf-stat.i.metric.M/sec
3735316 +78.6% 6669641 ± 3%  perf-stat.i.node-load-misses
377465 ± 4% +86.1% 702512 ± 11%  perf-stat.i.node-loads
1322217 -27.6% 957081 ± 5%  perf-stat.i.node-store-misses
37459 ± 3% -23.0% 28826 ± 5%  perf-stat.i.node-stores
0.75 +141.8% 1.82 ± 2%  perf-stat.overall.MPKI
0.67 -0.0 0.63  perf-stat.overall.branch-miss-rate%
14.65 +6.7 21.30  perf-stat.overall.cache-miss-rate%
3.55 +33.4% 4.73  perf-stat.overall.cpi
4713 -44.8% 2601 ± 3%  perf-stat.overall.cycles-between-cache-misses
0.04 -0.0 0.04  perf-stat.overall.dTLB-load-miss-rate%
0.00 ± 3% +0.0 0.00 ± 5%  perf-stat.overall.dTLB-store-miss-rate%
78.14 -6.7 71.47  perf-stat.overall.iTLB-load-miss-rate%
9927 ± 2% +7.0% 10624  perf-stat.overall.instructions-per-iTLB-miss
0.28 -25.0% 0.21  perf-stat.overall.ipc
3098901 +7.1% 3318983  perf-stat.overall.path-length
8.441e+09 -24.4% 6.385e+09  perf-stat.ps.branch-instructions
56179581 -28.3% 40286337 ± 3%  perf-stat.ps.branch-misses
30759982 +81.3% 55777812 ± 3%  perf-stat.ps.cache-misses
2.1e+08 +24.6% 2.618e+08 ± 2%  perf-stat.ps.cache-references
3095 -5.5% 2923 ± 2%  perf-stat.ps.context-switches
4103292 -29.1% 2907270  perf-stat.ps.dTLB-load-misses
1.048e+10 -24.1% 7.952e+09  perf-stat.ps.dTLB-loads
5.866e+09 -27.5% 4.255e+09  perf-stat.ps.dTLB-stores
4117020 ± 3% -30.0% 2881750  perf-stat.ps.iTLB-load-misses
4.084e+10 -25.0% 3.062e+10  perf-stat.ps.instructions
3722149 +78.5% 6645867 ± 3%  perf-stat.ps.node-load-misses
376240 ± 4% +86.1% 700053 ± 11%  perf-stat.ps.node-loads
1317772 -27.6% 953773 ± 5%  perf-stat.ps.node-store-misses
37408 ± 3% -23.2% 28748 ± 5%  perf-stat.ps.node-stores
1.234e+13 -25.1% 9.246e+12  perf-stat.total.instructions

1.28 -0.4 0.90 ± 2%  perf-profile.calltrace.cycles-pp.syscall_return_via_sysret.fallocate64
1.26 ± 2% -0.4 0.90 ± 3%  perf-profile.calltrace.cycles-pp.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
1.08 ± 2% -0.3 0.77 ± 3%  perf-profile.calltrace.cycles-pp.alloc_pages_mpol.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.92 ± 2% -0.3 0.62 ± 3%  perf-profile.calltrace.cycles-pp.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
0.84 ± 3% -0.2 0.61 ± 3%  perf-profile.calltrace.cycles-pp.__alloc_pages.alloc_pages_mpol.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp
1.26 -0.2 1.08  perf-profile.calltrace.cycles-pp.folio_batch_move_lru.lru_add_drain_cpu.__folio_batch_release.shmem_undo_range.shmem_setattr
1.26 -0.2 1.08  perf-profile.calltrace.cycles-pp.lru_add_drain_cpu.__folio_batch_release.shmem_undo_range.shmem_setattr.notify_change
1.24 -0.2 1.06  perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu.__folio_batch_release.shmem_undo_range
1.24 -0.2 1.06  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu.__folio_batch_release
1.23 -0.2 1.06  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu
1.20 -0.2 1.04 ± 2%  perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
0.68 ± 3% +0.0 0.72 ± 4%  perf-profile.calltrace.cycles-pp.__mem_cgroup_uncharge_list.release_pages.__folio_batch_release.shmem_undo_range.shmem_setattr
1.08 +0.1 1.20  perf-profile.calltrace.cycles-pp.lru_add_fn.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp
2.91 +0.3 3.18 ± 2%  perf-profile.calltrace.cycles-pp.truncate_inode_folio.shmem_undo_range.shmem_setattr.notify_change.do_truncate
2.56 +0.4 2.92 ± 2%  perf-profile.calltrace.cycles-pp.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr.notify_change
1.36 ± 3% +0.4 1.76 ± 9%  perf-profile.calltrace.cycles-pp.get_mem_cgroup_from_mm.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
2.22 +0.5 2.68 ± 2%  perf-profile.calltrace.cycles-pp.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr
0.00 +0.6 0.60 ± 2%  perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.release_pages.__folio_batch_release.shmem_undo_range.shmem_setattr
2.33 +0.6 2.94  perf-profile.calltrace.cycles-pp.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
0.00 +0.7 0.72 ± 2%  perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.lru_add_fn.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio
0.69 ± 4% +0.8 1.47 ± 3%  perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio
1.24 ± 2% +0.8 2.04 ± 2%  perf-profile.calltrace.cycles-pp.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range
0.00 +0.8 0.82 ± 4%  perf-profile.calltrace.cycles-pp.__count_memcg_events.mem_cgroup_commit_charge.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp
1.17 ± 2% +0.8 2.00 ± 2%  perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio
0.59 ± 4% +0.9 1.53  perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp
1.38 +1.0 2.33 ± 2%  perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.62 ± 3% +1.0 1.66 ± 5%  perf-profile.calltrace.cycles-pp.mem_cgroup_commit_charge.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
38.70 +1.2 39.90  perf-profile.calltrace.cycles-pp.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
38.34 +1.3 39.65  perf-profile.calltrace.cycles-pp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe
37.24 +1.6 38.86  perf-profile.calltrace.cycles-pp.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
36.64 +1.8 38.40  perf-profile.calltrace.cycles-pp.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate
2.47 ± 2% +2.1 4.59 ± 8%  perf-profile.calltrace.cycles-pp.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate

1.30 -0.4 0.92 ± 2%  perf-profile.children.cycles-pp.syscall_return_via_sysret
1.28 ± 2% -0.4 0.90 ± 3%  perf-profile.children.cycles-pp.shmem_alloc_folio
1.10 ± 2% -0.3 0.78 ± 3%  perf-profile.children.cycles-pp.alloc_pages_mpol
0.96 ± 2% -0.3 0.64 ± 3%  perf-profile.children.cycles-pp.shmem_inode_acct_blocks
0.88 -0.3 0.58 ± 2%  perf-profile.children.cycles-pp.xas_store
0.88 ± 3% -0.2 0.64 ± 3%  perf-profile.children.cycles-pp.__alloc_pages
0.61 ± 2% -0.2 0.43 ± 3%  perf-profile.children.cycles-pp.__entry_text_start
1.26 -0.2 1.09  perf-profile.children.cycles-pp.lru_add_drain_cpu
0.56 -0.2 0.39 ± 4%  perf-profile.children.cycles-pp.free_unref_page_list
1.22 -0.2 1.06 ± 2%  perf-profile.children.cycles-pp.syscall_exit_to_user_mode
0.46 -0.1 0.32 ± 3%  perf-profile.children.cycles-pp.__mod_lruvec_state
0.41 ± 3% -0.1 0.28 ± 4%  perf-profile.children.cycles-pp.xas_load
0.44 ± 4% -0.1 0.31 ± 4%  perf-profile.children.cycles-pp.find_lock_entries
0.50 ± 3% -0.1 0.37 ± 2%  perf-profile.children.cycles-pp.get_page_from_freelist
0.24 ± 7% -0.1 0.12 ± 5%  perf-profile.children.cycles-pp.__list_add_valid_or_report
0.34 ± 2% -0.1 0.24 ± 4%  perf-profile.children.cycles-pp.__mod_node_page_state
0.38 ± 3% -0.1 0.28 ± 4%  perf-profile.children.cycles-pp._raw_spin_lock
0.32 ± 2% -0.1 0.22 ± 5%  perf-profile.children.cycles-pp.__dquot_alloc_space
0.26 ± 2% -0.1 0.17 ± 2%  perf-profile.children.cycles-pp.xas_descend
0.22 ± 3% -0.1 0.14 ± 4%  perf-profile.children.cycles-pp.free_unref_page_commit
0.25 -0.1 0.17 ± 3%  perf-profile.children.cycles-pp.xas_clear_mark
0.32 ± 4% -0.1 0.25 ± 3%  perf-profile.children.cycles-pp.rmqueue
0.23 ± 2% -0.1 0.16 ± 2%  perf-profile.children.cycles-pp.xas_init_marks
0.24 ± 2% -0.1 0.17 ± 5%  perf-profile.children.cycles-pp.__cond_resched
0.25 ± 4% -0.1 0.18 ± 2%  perf-profile.children.cycles-pp.truncate_cleanup_folio
0.30 ± 3% -0.1 0.23 ± 4%  perf-profile.children.cycles-pp.filemap_get_entry
0.20 ± 2% -0.1 0.13 ± 5%  perf-profile.children.cycles-pp.folio_unlock
0.16 ± 4% -0.1 0.10 ± 5%  perf-profile.children.cycles-pp.xas_find_conflict
0.19 ± 3% -0.1 0.13 ± 5%  perf-profile.children.cycles-pp._raw_spin_lock_irq
0.17 ± 5% -0.1 0.12 ± 3%  perf-profile.children.cycles-pp.noop_dirty_folio
0.13 ± 4% -0.1 0.08 ± 9%  perf-profile.children.cycles-pp.security_vm_enough_memory_mm
0.18 ± 8% -0.1 0.13 ± 4%  perf-profile.children.cycles-pp.shmem_recalc_inode
0.16 ± 2% -0.1 0.11 ± 3%  perf-profile.children.cycles-pp.free_unref_page_prepare
0.09 ± 5% -0.1 0.04 ± 45%  perf-profile.children.cycles-pp.mem_cgroup_update_lru_size
0.10 ± 7% -0.0 0.05 ± 45%  perf-profile.children.cycles-pp.cap_vm_enough_memory
0.14 ± 5% -0.0 0.10  perf-profile.children.cycles-pp.__folio_cancel_dirty
0.14 ± 5% -0.0 0.10 ± 4%  perf-profile.children.cycles-pp.security_file_permission
0.10 ± 5% -0.0 0.06 ± 6%  perf-profile.children.cycles-pp.xas_find
0.15 ± 4% -0.0 0.11 ± 3%  perf-profile.children.cycles-pp.__fget_light
0.14 ± 5% -0.0 0.11 ± 3%  perf-profile.children.cycles-pp.file_modified
0.12 ± 3% -0.0 0.09 ± 7%  perf-profile.children.cycles-pp.__vm_enough_memory
0.12 ± 3% -0.0 0.09 ± 4%  perf-profile.children.cycles-pp.apparmor_file_permission
0.12 ± 3% -0.0 0.08 ± 5%  perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
0.12 ± 4% -0.0 0.08 ± 4%  perf-profile.children.cycles-pp.xas_start
0.09 -0.0 0.06 ± 8%  perf-profile.children.cycles-pp.__folio_throttle_swaprate
0.12 ± 6% -0.0 0.08 ± 8%  perf-profile.children.cycles-pp._raw_spin_trylock
0.12 ± 4% -0.0 0.08 ± 4%  perf-profile.children.cycles-pp.__percpu_counter_limited_add
0.12 ± 4% -0.0 0.09 ± 4%  perf-profile.children.cycles-pp.inode_add_bytes
0.20 ± 2% -0.0 0.17 ± 7%  perf-profile.children.cycles-pp.try_charge_memcg
0.10 ± 5% -0.0 0.07 ± 7%  perf-profile.children.cycles-pp.policy_nodemask
0.09 ± 6% -0.0 0.06 ± 6%  perf-profile.children.cycles-pp.get_pfnblock_flags_mask
0.09 ± 6% -0.0 0.06 ± 7%  perf-profile.children.cycles-pp.filemap_free_folio
0.07 ± 6% -0.0 0.05 ± 7%  perf-profile.children.cycles-pp.down_write
0.08 ± 4% -0.0 0.06 ± 8%  perf-profile.children.cycles-pp.get_task_policy
0.09 ± 5% -0.0 0.07 ± 5%  perf-profile.children.cycles-pp.xas_create
0.09 ± 7% -0.0 0.07  perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
0.09 ± 7% -0.0 0.07  perf-profile.children.cycles-pp.inode_needs_update_time
0.16 ± 2% -0.0 0.14 ± 5%  perf-profile.children.cycles-pp.cgroup_rstat_updated
0.08 ± 7% -0.0 0.06 ± 9%  perf-profile.children.cycles-pp.percpu_counter_add_batch
0.07 ± 5% -0.0 0.05 ± 7%  perf-profile.children.cycles-pp.folio_mark_dirty
0.08 ± 10% -0.0 0.06 ± 6%  perf-profile.children.cycles-pp.shmem_is_huge
0.07 ± 6% +0.0 0.09 ± 10%  perf-profile.children.cycles-pp.propagate_protected_usage
0.43 ± 3% +0.0 0.46 ± 5%  perf-profile.children.cycles-pp.uncharge_batch
0.68 ± 3% +0.0 0.73 ± 4%  perf-profile.children.cycles-pp.__mem_cgroup_uncharge_list
1.11 +0.1 1.22  perf-profile.children.cycles-pp.lru_add_fn
2.91 +0.3 3.18 ± 2%  perf-profile.children.cycles-pp.truncate_inode_folio
2.56 +0.4 2.92 ± 2%  perf-profile.children.cycles-pp.filemap_remove_folio
1.37 ± 3% +0.4 1.76 ± 9%  perf-profile.children.cycles-pp.get_mem_cgroup_from_mm
2.24 +0.5 2.70 ± 2%  perf-profile.children.cycles-pp.__filemap_remove_folio
2.38 +0.6 2.97  perf-profile.children.cycles-pp.shmem_add_to_page_cache
0.18 ± 4% +0.7 0.91 ± 4%  perf-profile.children.cycles-pp.__count_memcg_events
1.26 +0.8 2.04 ± 2%  perf-profile.children.cycles-pp.filemap_unaccount_folio
0.63 ± 2% +1.0 1.67 ± 5%  perf-profile.children.cycles-pp.mem_cgroup_commit_charge
38.71 +1.2 39.91  perf-profile.children.cycles-pp.vfs_fallocate
38.37 +1.3 39.66  perf-profile.children.cycles-pp.shmem_fallocate
37.28 +1.6 38.89  perf-profile.children.cycles-pp.shmem_get_folio_gfp
36.71 +1.7 38.45  perf-profile.children.cycles-pp.shmem_alloc_and_add_folio
2.58 +1.8 4.36 ± 2%  perf-profile.children.cycles-pp.__mod_lruvec_page_state
2.48 ± 2% +2.1 4.60 ± 8%  perf-profile.children.cycles-pp.__mem_cgroup_charge
1.93 ± 3% +2.4 4.36 ± 2%  perf-profile.children.cycles-pp.__mod_memcg_lruvec_state

1.30 -0.4 0.92 ± 2%  perf-profile.self.cycles-pp.syscall_return_via_sysret
0.73 -0.2 0.52 ± 2%  perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
0.54 ± 2% -0.2 0.36 ± 3%  perf-profile.self.cycles-pp.release_pages
0.48 -0.2 0.30 ± 3%  perf-profile.self.cycles-pp.xas_store
0.54 ± 2% -0.2 0.38 ± 3%  perf-profile.self.cycles-pp.__entry_text_start
1.17 -0.1 1.03 ± 2%  perf-profile.self.cycles-pp.syscall_exit_to_user_mode
0.36 ± 2% -0.1 0.22 ± 3%  perf-profile.self.cycles-pp.shmem_add_to_page_cache
0.43 ± 5% -0.1 0.30 ± 7%  perf-profile.self.cycles-pp.lru_add_fn
0.24 ± 7% -0.1 0.12 ± 6%  perf-profile.self.cycles-pp.__list_add_valid_or_report
0.38 ± 4% -0.1 0.27 ± 4%  perf-profile.self.cycles-pp._raw_spin_lock
0.52 ± 3% -0.1 0.41  perf-profile.self.cycles-pp.folio_batch_move_lru
0.32 ± 2% -0.1 0.22 ± 4%  perf-profile.self.cycles-pp.__mod_node_page_state
0.36 ± 4% -0.1 0.26 ± 4%  perf-profile.self.cycles-pp.find_lock_entries
0.36 ± 2% -0.1 0.26 ± 2%  perf-profile.self.cycles-pp.shmem_fallocate
0.28 ± 3% -0.1 0.20 ± 5%  perf-profile.self.cycles-pp.__alloc_pages
0.24 ± 2% -0.1 0.16 ± 4%  perf-profile.self.cycles-pp.xas_descend
0.23 ± 2% -0.1 0.16 ± 3%  perf-profile.self.cycles-pp.xas_clear_mark
0.18 ± 3% -0.1 0.11 ± 6%  perf-profile.self.cycles-pp.free_unref_page_commit
0.18 ± 3% -0.1 0.12 ± 4%  perf-profile.self.cycles-pp.shmem_inode_acct_blocks
0.21 ± 3% -0.1 0.15 ± 2%  perf-profile.self.cycles-pp.shmem_alloc_and_add_folio
0.18 ± 2% -0.1 0.12 ± 3%  perf-profile.self.cycles-pp.__filemap_remove_folio
0.18 ± 7% -0.1 0.12 ± 7%  perf-profile.self.cycles-pp.vfs_fallocate
0.20 ± 2% -0.1 0.14 ± 6%  perf-profile.self.cycles-pp.__dquot_alloc_space
0.18 ± 2% -0.1 0.13 ± 3%  perf-profile.self.cycles-pp.folio_unlock
0.18 ± 2% -0.1 0.12 ± 3%  perf-profile.self.cycles-pp.get_page_from_freelist
0.15 ± 3% -0.1 0.10 ± 7%  perf-profile.self.cycles-pp.xas_load
0.17 ± 3% -0.1 0.12 ± 8%  perf-profile.self.cycles-pp.__cond_resched
0.17 ± 2% -0.1 0.12 ± 3%  perf-profile.self.cycles-pp._raw_spin_lock_irq
0.17 ± 5% -0.1 0.12 ± 3%  perf-profile.self.cycles-pp.noop_dirty_folio
0.10 ± 7% -0.0 0.05 ± 45%  perf-profile.self.cycles-pp.cap_vm_enough_memory
0.12 ± 3% -0.0 0.08 ± 4%  perf-profile.self.cycles-pp.rmqueue
0.07 ± 5% -0.0 0.02 ± 99%  perf-profile.self.cycles-pp.xas_find
0.13 ± 3% -0.0 0.09 ± 6%  perf-profile.self.cycles-pp.alloc_pages_mpol
0.07 ± 6% -0.0 0.03 ± 70%  perf-profile.self.cycles-pp.xas_find_conflict
0.16 ± 2% -0.0 0.12 ± 6%  perf-profile.self.cycles-pp.free_unref_page_list
0.12 ± 5% -0.0 0.08 ± 4%  perf-profile.self.cycles-pp.fallocate64
0.20 ± 4% -0.0 0.16 ± 3%  perf-profile.self.cycles-pp.shmem_get_folio_gfp
0.06 ± 7% -0.0 0.02 ± 99%  perf-profile.self.cycles-pp.shmem_recalc_inode
0.13 ± 3% -0.0 0.09  perf-profile.self.cycles-pp._raw_spin_lock_irqsave
0.22 ± 3% -0.0 0.19 ± 6%  perf-profile.self.cycles-pp.page_counter_uncharge
0.14 ± 3% -0.0 0.10 ± 6%  perf-profile.self.cycles-pp.filemap_remove_folio
0.15 ± 5% -0.0 0.11 ± 3%  perf-profile.self.cycles-pp.__fget_light
0.12 ± 4% -0.0 0.08  perf-profile.self.cycles-pp.__folio_cancel_dirty
0.11 ± 4% -0.0 0.08 ± 7%  perf-profile.self.cycles-pp._raw_spin_trylock
0.12 ± 3% -0.0 0.09 ± 5%  perf-profile.self.cycles-pp.__mod_lruvec_state
0.11 ± 5% -0.0 0.08 ± 4%  perf-profile.self.cycles-pp.truncate_cleanup_folio
0.11 ± 3% -0.0 0.08 ± 6%  perf-profile.self.cycles-pp.__percpu_counter_limited_add
0.11 ± 3% -0.0 0.08 ± 6%  perf-profile.self.cycles-pp.xas_start
0.10 ± 6% -0.0 0.07 ± 5%  perf-profile.self.cycles-pp.xas_init_marks
0.09 ± 6% -0.0 0.06 ± 6%  perf-profile.self.cycles-pp.get_pfnblock_flags_mask
0.11 -0.0 0.08 ± 5%  perf-profile.self.cycles-pp.folio_add_lru
0.09 ± 6% -0.0 0.06 ± 7%  perf-profile.self.cycles-pp.filemap_free_folio
0.09 ± 4% -0.0 0.06 ± 6%  perf-profile.self.cycles-pp.shmem_alloc_folio
0.14 ± 5% -0.0 0.12 ± 5%  perf-profile.self.cycles-pp.cgroup_rstat_updated
0.10 ± 4% -0.0 0.08 ± 4%  perf-profile.self.cycles-pp.apparmor_file_permission
0.07 ± 7% -0.0 0.04 ± 44%  perf-profile.self.cycles-pp.policy_nodemask
0.07 ± 11% -0.0 0.04 ± 45%  perf-profile.self.cycles-pp.shmem_is_huge
0.08 ± 4% -0.0 0.06 ± 8%  perf-profile.self.cycles-pp.get_task_policy
0.08 ± 6% -0.0 0.05 ± 8%  perf-profile.self.cycles-pp.__x64_sys_fallocate
0.12 ± 3% -0.0 0.10 ± 6%  perf-profile.self.cycles-pp.try_charge_memcg
0.07 -0.0 0.05  perf-profile.self.cycles-pp.free_unref_page_prepare
0.07 ± 6% -0.0 0.06 ± 9%  perf-profile.self.cycles-pp.percpu_counter_add_batch
0.08 ± 4% -0.0 0.06  perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
0.09 ± 7% -0.0 0.07 ± 5%  perf-profile.self.cycles-pp.filemap_get_entry
0.07 ± 9% +0.0 0.09 ± 10%  perf-profile.self.cycles-pp.propagate_protected_usage
0.96 ± 2% +0.2 1.12 ± 7%  perf-profile.self.cycles-pp.__mod_lruvec_page_state
0.45 ± 4% +0.4 0.82 ± 8%  perf-profile.self.cycles-pp.mem_cgroup_commit_charge
1.36 ± 3% +0.4 1.75 ± 9%  perf-profile.self.cycles-pp.get_mem_cgroup_from_mm
0.29 +0.7 1.00 ± 10%  perf-profile.self.cycles-pp.__mem_cgroup_charge
0.16 ± 4% +0.7 0.90 ± 4%  perf-profile.self.cycles-pp.__count_memcg_events
1.80 ± 2% +2.5 4.26 ± 2%  perf-profile.self.cycles-pp.__mod_memcg_lruvec_state

Disclaimer: Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
On Fri, Oct 20, 2023 at 9:18 AM kernel test robot <oliver.sang@intel.com> wrote:
>
> Hello,
>
> kernel test robot noticed a -25.8% regression of will-it-scale.per_thread_ops on:
>
> commit: 51d74c18a9c61e7ee33bc90b522dd7f6e5b80bb5 ("[PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg")
> url: https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257
> base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/all/20231010032117.1577496-4-yosryahmed@google.com/
> patch subject: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg
>
> testcase: will-it-scale
> test machine: 104 threads 2 sockets (Skylake) with 192G memory
> parameters:
>
>         nr_task: 100%
>         mode: thread
>         test: fallocate1
>         cpufreq_governor: performance
>
> In addition to that, the commit also has significant impact on the following tests:
>
> +------------------+---------------------------------------------------------------+
> | testcase: change | will-it-scale: will-it-scale.per_thread_ops -30.0% regression |
> | test machine     | 104 threads 2 sockets (Skylake) with 192G memory              |
> | test parameters  | cpufreq_governor=performance                                  |
> |                  | mode=thread                                                   |
> |                  | nr_task=50%                                                   |
> |                  | test=fallocate1                                               |
> +------------------+---------------------------------------------------------------+

Yosry, I don't think a 25% to 30% regression can be ignored. Unless there is a quick fix, IMO this series should be skipped for the upcoming merge window.
On Fri, Oct 20, 2023 at 10:23 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Fri, Oct 20, 2023 at 9:18 AM kernel test robot <oliver.sang@intel.com> wrote:
> >
> > kernel test robot noticed a -25.8% regression of will-it-scale.per_thread_ops on:
> >
> > commit: 51d74c18a9c61e7ee33bc90b522dd7f6e5b80bb5 ("[PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg")
> > [...]
>
> Yosry, I don't think 25% to 30% regression can be ignored. Unless
> there is a quick fix, IMO this series should be skipped for the
> upcoming kernel open window.

I am currently looking into it. It's reasonable to skip the next merge window if a quick fix isn't found soon.

I am surprised by the size of the regression given the following:

    1.12 ± 5%   +1.4   2.50 ± 2%   perf-profile.self.cycles-pp.__mod_memcg_lruvec_state

IIUC we are only spending 1% more time in __mod_memcg_lruvec_state().
On Sat, Oct 21, 2023 at 01:42:58AM +0800, Yosry Ahmed wrote:
> On Fri, Oct 20, 2023 at 10:23 AM Shakeel Butt <shakeelb@google.com> wrote:
> > [...]
> > Yosry, I don't think 25% to 30% regression can be ignored. Unless
> > there is a quick fix, IMO this series should be skipped for the
> > upcoming kernel open window.
>
> I am currently looking into it. It's reasonable to skip the next merge
> window if a quick fix isn't found soon.
>
> I am surprised by the size of the regression given the following:
>
>     1.12 ± 5%   +1.4   2.50 ± 2%   perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
>
> IIUC we are only spending 1% more time in __mod_memcg_lruvec_state().

Yes, this is kind of confusing, and we have seen similar cases before. Especially for microbenchmarks like will-it-scale, stress-ng, netperf, etc., a change to functions in the hot path is greatly amplified in the final benchmark score.

In a netperf case, https://lore.kernel.org/lkml/20220619150456.GB34471@xsang-OptiPlex-9020/, the affected functions showed only around a 10% change in perf's cpu-cycles, yet triggered a 69% regression. IIRC, microbenchmarks are very sensitive to such statistics updates, like memcg's and vmstat's.

Thanks,
Feng
On Sun, Oct 22, 2023 at 6:34 PM Feng Tang <feng.tang@intel.com> wrote:
>
> On Sat, Oct 21, 2023 at 01:42:58AM +0800, Yosry Ahmed wrote:
> > [...]
> > I am surprised by the size of the regression given the following:
> >
> >     1.12 ± 5%   +1.4   2.50 ± 2%   perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
> >
> > IIUC we are only spending 1% more time in __mod_memcg_lruvec_state().
>
> Yes, this is kind of confusing. And we have seen similar cases before,
> especially for microbenchmarks like will-it-scale, stress-ng, netperf,
> etc., the change to those functions in the hot path was greatly
> amplified in the final benchmark score.
>
> In a netperf case, https://lore.kernel.org/lkml/20220619150456.GB34471@xsang-OptiPlex-9020/
> the affected functions have around 10% change in perf's cpu-cycles,
> and trigger 69% regression. IIRC, microbenchmarks are very sensitive
> to those statistics updates, like memcg's and vmstat.

Thanks for clarifying. I am still trying to reproduce locally but I am running into some quirks with tooling. I may have to run a modified version of the fallocate test manually. Meanwhile, I noticed that the patch was tested without the fixlet that I posted [1] for it, understandably. Would it be possible to get some numbers with that fixlet? It should reduce the total number of contended atomic operations, so it may help.

[1] https://lore.kernel.org/lkml/CAJD7tkZDarDn_38ntFg5bK2fAmFdSe+Rt6DKOZA7Sgs_kERoVA@mail.gmail.com/

I am also wondering if aligning the stats_updates atomic will help. Right now it may share a cacheline with some items of the events_pending array. The latter may be dirtied during a flush and unnecessarily dirty the former, but the chances are slim to be honest. If it's easy to test such a diff, that would be nice, but I don't expect a lot of difference:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7cbc7d94eb65..a35fce653262 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -646,7 +646,7 @@ struct memcg_vmstats {
 	unsigned long		events_pending[NR_MEMCG_EVENTS];
 
 	/* Stats updates since the last flush */
-	atomic64_t		stats_updates;
+	atomic64_t		stats_updates ____cacheline_aligned_in_smp;
 };
 
 /*
On Mon, Oct 23, 2023 at 11:25 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> [...]
> Thanks for clarifying. I am still trying to reproduce locally but I am
> running into some quirks with tooling. I may have to run a modified
> version of the fallocate test manually.
> [...]
> I am also wondering if aligning the stats_updates atomic will help.
> Right now it may share a cacheline with some items of the
> events_pending array. The latter may be dirtied during a flush and
> unnecessarily dirty the former, but the chances are slim to be honest.

I still could not run the benchmark, but I used a version of fallocate1.c that does 1 million iterations. I ran 100 in parallel. This showed a ~13% regression with the patch, so not the same as the will-it-scale version, but it could be an indicator.

With that, I did not see any improvement with the fixlet above or with ____cacheline_aligned_in_smp. So you can scratch that.

I did, however, see some improvement from reducing the indirection layers by moving stats_updates directly into struct mem_cgroup. The regression in my manual testing went down to 9%. Still not great, but I am wondering how this reflects on the benchmark. If you're able to test it that would be great, the diff is below. Meanwhile I am still looking for other improvements that can be made.
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f64ac140083e..b4dfcd8b9cc1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -270,6 +270,9 @@ struct mem_cgroup {
 
 	CACHELINE_PADDING(_pad1_);
 
+	/* Stats updates since the last flush */
+	atomic64_t stats_updates;
+
 	/* memory.stat */
 	struct memcg_vmstats *vmstats;
 
@@ -309,6 +312,7 @@ struct mem_cgroup {
 	atomic_t		moving_account;
 	struct task_struct	*move_lock_task;
 
+	unsigned int __percpu *stats_updates_percpu;
 	struct memcg_vmstats_percpu __percpu *vmstats_percpu;
 
 #ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7cbc7d94eb65..e5d2f3d4d874 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -627,9 +627,6 @@ struct memcg_vmstats_percpu {
 	/* Cgroup1: threshold notifications & softlimit tree updates */
 	unsigned long		nr_page_events;
 	unsigned long		targets[MEM_CGROUP_NTARGETS];
-
-	/* Stats updates since the last flush */
-	unsigned int		stats_updates;
 };
 
 struct memcg_vmstats {
@@ -644,9 +641,6 @@ struct memcg_vmstats {
 	/* Pending child counts during tree propagation */
 	long state_pending[MEMCG_NR_STAT];
 	unsigned long events_pending[NR_MEMCG_EVENTS];
-
-	/* Stats updates since the last flush */
-	atomic64_t stats_updates;
 };
 
 /*
@@ -695,14 +689,14 @@ static void memcg_stats_unlock(void)
 
 static bool memcg_should_flush_stats(struct mem_cgroup *memcg)
 {
-	return atomic64_read(&memcg->vmstats->stats_updates) >
+	return atomic64_read(&memcg->stats_updates) >
 		MEMCG_CHARGE_BATCH * num_online_cpus();
 }
 
 static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 {
 	int cpu = smp_processor_id();
-	unsigned int x;
+	unsigned int *stats_updates_percpu;
 
 	if (!val)
 		return;
@@ -710,10 +704,10 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 	cgroup_rstat_updated(memcg->css.cgroup, cpu);
 
 	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
-		x = __this_cpu_add_return(memcg->vmstats_percpu->stats_updates,
-					  abs(val));
+		stats_updates_percpu = this_cpu_ptr(memcg->stats_updates_percpu);
 
-		if (x < MEMCG_CHARGE_BATCH)
+		*stats_updates_percpu += abs(val);
+		if (*stats_updates_percpu < MEMCG_CHARGE_BATCH)
 			continue;
 
 		/*
@@ -721,8 +715,8 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 		 * redundant. Avoid the overhead of the atomic update.
 		 */
 		if (!memcg_should_flush_stats(memcg))
-			atomic64_add(x, &memcg->vmstats->stats_updates);
-		__this_cpu_write(memcg->vmstats_percpu->stats_updates, 0);
+			atomic64_add(*stats_updates_percpu, &memcg->stats_updates);
+		*stats_updates_percpu = 0;
 	}
 }
 
@@ -5467,6 +5461,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 		free_mem_cgroup_per_node_info(memcg, node);
 	kfree(memcg->vmstats);
 	free_percpu(memcg->vmstats_percpu);
+	free_percpu(memcg->stats_updates_percpu);
 	kfree(memcg);
 }
 
@@ -5504,6 +5499,11 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	if (!memcg->vmstats_percpu)
 		goto fail;
 
+	memcg->stats_updates_percpu = alloc_percpu_gfp(unsigned int,
+						       GFP_KERNEL_ACCOUNT);
+	if (!memcg->stats_updates_percpu)
+		goto fail;
+
 	for_each_node(node)
 		if (alloc_mem_cgroup_per_node_info(memcg, node))
 			goto fail;
@@ -5735,10 +5735,12 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
 	struct memcg_vmstats_percpu *statc;
+	int *stats_updates_percpu;
 	long delta, delta_cpu, v;
 	int i, nid;
 
 	statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
+	stats_updates_percpu = per_cpu_ptr(memcg->stats_updates_percpu, cpu);
 
 	for (i = 0; i < MEMCG_NR_STAT; i++) {
 		/*
@@ -5826,10 +5828,10 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 			}
 		}
 	}
-	statc->stats_updates = 0;
+	*stats_updates_percpu = 0;
 	/* We are in a per-cpu loop here, only do the atomic write once */
-	if (atomic64_read(&memcg->vmstats->stats_updates))
-		atomic64_set(&memcg->vmstats->stats_updates, 0);
+	if (atomic64_read(&memcg->stats_updates))
+		atomic64_set(&memcg->stats_updates, 0);
}
 
 #ifdef CONFIG_MMU
hi, Yosry Ahmed,

On Mon, Oct 23, 2023 at 07:13:50PM -0700, Yosry Ahmed wrote:

...

> I still could not run the benchmark, but I used a version of
> fallocate1.c that does 1 million iterations. I ran 100 in parallel.
> This showed ~13% regression with the patch, so not the same as the
> will-it-scale version, but it could be an indicator.
>
> With that, I did not see any improvement with the fixlet above or
> ___cacheline_aligned_in_smp. So you can scratch that.
>
> I did, however, see some improvement with reducing the indirection
> layers by moving stats_updates directly into struct mem_cgroup. The
> regression in my manual testing went down to 9%. Still not great, but
> I am wondering how this reflects on the benchmark. If you're able to
> test it that would be great, the diff is below. Meanwhile I am still
> looking for other improvements that can be made.

we applied previous patch-set as below:

c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything   <---- the base our tool picked for the patch set

I tried to apply below patch to either 51d74c18a9c61 or c5f50d8b23c79,
but failed. could you guide how to apply this patch?
Thanks > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index f64ac140083e..b4dfcd8b9cc1 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -270,6 +270,9 @@ struct mem_cgroup { > > CACHELINE_PADDING(_pad1_); > > + /* Stats updates since the last flush */ > + atomic64_t stats_updates; > + > /* memory.stat */ > struct memcg_vmstats *vmstats; > > @@ -309,6 +312,7 @@ struct mem_cgroup { > atomic_t moving_account; > struct task_struct *move_lock_task; > > + unsigned int __percpu *stats_updates_percpu; > struct memcg_vmstats_percpu __percpu *vmstats_percpu; > > #ifdef CONFIG_CGROUP_WRITEBACK > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 7cbc7d94eb65..e5d2f3d4d874 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -627,9 +627,6 @@ struct memcg_vmstats_percpu { > /* Cgroup1: threshold notifications & softlimit tree updates */ > unsigned long nr_page_events; > unsigned long targets[MEM_CGROUP_NTARGETS]; > - > - /* Stats updates since the last flush */ > - unsigned int stats_updates; > }; > > struct memcg_vmstats { > @@ -644,9 +641,6 @@ struct memcg_vmstats { > /* Pending child counts during tree propagation */ > long state_pending[MEMCG_NR_STAT]; > unsigned long events_pending[NR_MEMCG_EVENTS]; > - > - /* Stats updates since the last flush */ > - atomic64_t stats_updates; > }; > > /* > @@ -695,14 +689,14 @@ static void memcg_stats_unlock(void) > > static bool memcg_should_flush_stats(struct mem_cgroup *memcg) > { > - return atomic64_read(&memcg->vmstats->stats_updates) > > + return atomic64_read(&memcg->stats_updates) > > MEMCG_CHARGE_BATCH * num_online_cpus(); > } > > static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val) > { > int cpu = smp_processor_id(); > - unsigned int x; > + unsigned int *stats_updates_percpu; > > if (!val) > return; > @@ -710,10 +704,10 @@ static inline void memcg_rstat_updated(struct > mem_cgroup *memcg, int val) > cgroup_rstat_updated(memcg->css.cgroup, 
cpu); > > for (; memcg; memcg = parent_mem_cgroup(memcg)) { > - x = __this_cpu_add_return(memcg->vmstats_percpu->stats_updates, > - abs(val)); > + stats_updates_percpu = > this_cpu_ptr(memcg->stats_updates_percpu); > > - if (x < MEMCG_CHARGE_BATCH) > + *stats_updates_percpu += abs(val); > + if (*stats_updates_percpu < MEMCG_CHARGE_BATCH) > continue; > > /* > @@ -721,8 +715,8 @@ static inline void memcg_rstat_updated(struct > mem_cgroup *memcg, int val) > * redundant. Avoid the overhead of the atomic update. > */ > if (!memcg_should_flush_stats(memcg)) > - atomic64_add(x, &memcg->vmstats->stats_updates); > - __this_cpu_write(memcg->vmstats_percpu->stats_updates, 0); > + atomic64_add(*stats_updates_percpu, > &memcg->stats_updates); > + *stats_updates_percpu = 0; > } > } > > @@ -5467,6 +5461,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) > free_mem_cgroup_per_node_info(memcg, node); > kfree(memcg->vmstats); > free_percpu(memcg->vmstats_percpu); > + free_percpu(memcg->stats_updates_percpu); > kfree(memcg); > } > > @@ -5504,6 +5499,11 @@ static struct mem_cgroup *mem_cgroup_alloc(void) > if (!memcg->vmstats_percpu) > goto fail; > > + memcg->stats_updates_percpu = alloc_percpu_gfp(unsigned int, > + GFP_KERNEL_ACCOUNT); > + if (!memcg->stats_updates_percpu) > + goto fail; > + > for_each_node(node) > if (alloc_mem_cgroup_per_node_info(memcg, node)) > goto fail; > @@ -5735,10 +5735,12 @@ static void mem_cgroup_css_rstat_flush(struct > cgroup_subsys_state *css, int cpu) > struct mem_cgroup *memcg = mem_cgroup_from_css(css); > struct mem_cgroup *parent = parent_mem_cgroup(memcg); > struct memcg_vmstats_percpu *statc; > + int *stats_updates_percpu; > long delta, delta_cpu, v; > int i, nid; > > statc = per_cpu_ptr(memcg->vmstats_percpu, cpu); > + stats_updates_percpu = per_cpu_ptr(memcg->stats_updates_percpu, cpu); > > for (i = 0; i < MEMCG_NR_STAT; i++) { > /* > @@ -5826,10 +5828,10 @@ static void mem_cgroup_css_rstat_flush(struct > cgroup_subsys_state *css, int 
cpu) > } > } > } > - statc->stats_updates = 0; > + *stats_updates_percpu = 0; > /* We are in a per-cpu loop here, only do the atomic write once */ > - if (atomic64_read(&memcg->vmstats->stats_updates)) > - atomic64_set(&memcg->vmstats->stats_updates, 0); > + if (atomic64_read(&memcg->stats_updates)) > + atomic64_set(&memcg->stats_updates, 0); > } > > #ifdef CONFIG_MMU >
On Mon, Oct 23, 2023 at 11:56 PM Oliver Sang <oliver.sang@intel.com> wrote:
>
> hi, Yosry Ahmed,
>
> On Mon, Oct 23, 2023 at 07:13:50PM -0700, Yosry Ahmed wrote:
>
> ...
>
> > I still could not run the benchmark, but I used a version of
> > fallocate1.c that does 1 million iterations. I ran 100 in parallel.
> > This showed ~13% regression with the patch, so not the same as the
> > will-it-scale version, but it could be an indicator.
> >
> > With that, I did not see any improvement with the fixlet above or
> > ___cacheline_aligned_in_smp. So you can scratch that.
> >
> > I did, however, see some improvement with reducing the indirection
> > layers by moving stats_updates directly into struct mem_cgroup. The
> > regression in my manual testing went down to 9%. Still not great, but
> > I am wondering how this reflects on the benchmark. If you're able to
> > test it that would be great, the diff is below. Meanwhile I am still
> > looking for other improvements that can be made.
>
> we applied previous patch-set as below:
>
> c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
> ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
> 51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
> 130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
> 26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
> 25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything   <---- the base our tool picked for the patch set
>
> I tried to apply below patch to either 51d74c18a9c61 or c5f50d8b23c79,
> but failed. could you guide how to apply this patch?
> Thanks
>

Thanks for looking into this. I rebased the diff on top of
c5f50d8b23c79. Please find it attached.
hi, Yosry Ahmed, On Tue, Oct 24, 2023 at 12:14:42AM -0700, Yosry Ahmed wrote: > On Mon, Oct 23, 2023 at 11:56 PM Oliver Sang <oliver.sang@intel.com> wrote: > > > > hi, Yosry Ahmed, > > > > On Mon, Oct 23, 2023 at 07:13:50PM -0700, Yosry Ahmed wrote: > > > > ... > > > > > > > > I still could not run the benchmark, but I used a version of > > > fallocate1.c that does 1 million iterations. I ran 100 in parallel. > > > This showed ~13% regression with the patch, so not the same as the > > > will-it-scale version, but it could be an indicator. > > > > > > With that, I did not see any improvement with the fixlet above or > > > ___cacheline_aligned_in_smp. So you can scratch that. > > > > > > I did, however, see some improvement with reducing the indirection > > > layers by moving stats_updates directly into struct mem_cgroup. The > > > regression in my manual testing went down to 9%. Still not great, but > > > I am wondering how this reflects on the benchmark. If you're able to > > > test it that would be great, the diff is below. Meanwhile I am still > > > looking for other improvements that can be made. > > > > we applied previous patch-set as below: > > > > c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing > > ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent() > > 51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg > > 130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code > > 26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time > > 25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything <---- the base our tool picked for the patch set > > > > I tried to apply below patch to either 51d74c18a9c61 or c5f50d8b23c79, > > but failed. could you guide how to apply this patch? > > Thanks > > > > Thanks for looking into this. I rebased the diff on top of > c5f50d8b23c79. 
> Please find it attached.

from our tests, this patch has little impact. it was applied as ac6a9444dec85 below:

ac6a9444dec85 (linux-devel/fixup-c5f50d8b23c79) memcg: move stats_updates to struct mem_cgroup
c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything

for the first regression reported in the original report, data are very
close for 51d74c18a9c61, c5f50d8b23c79 (patch-set tip, parent of
ac6a9444dec85), and ac6a9444dec85. full comparison is as [1]

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-12/performance/x86_64-rhel-8.3/thread/100%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale

130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522 c5f50d8b23c7982ac875791755b ac6a9444dec85dc50c6bfbc4ee7
---------------- --------------------------- --------------------------- ---------------------------
         %stddev     %change %stddev     %change %stddev     %change %stddev
             \          |      \              |      \              |      \
     36509            -25.8%      27079       -25.2%      27305       -25.0%      27383        will-it-scale.per_thread_ops

for the second regression reported in the original report,
ac6a9444dec85 seems to have a small impact.
full comparison is as [2] ========================================================================================= compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase: gcc-12/performance/x86_64-rhel-8.3/thread/50%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale 130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522 c5f50d8b23c7982ac875791755b ac6a9444dec85dc50c6bfbc4ee7 ---------------- --------------------------- --------------------------- --------------------------- %stddev %change %stddev %change %stddev %change %stddev \ | \ | \ | \ 76580 -30.0% 53575 -28.9% 54415 -26.7% 56152 will-it-scale.per_thread_ops [1] ========================================================================================= compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase: gcc-12/performance/x86_64-rhel-8.3/thread/100%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale 130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522 c5f50d8b23c7982ac875791755b ac6a9444dec85dc50c6bfbc4ee7 ---------------- --------------------------- --------------------------- --------------------------- %stddev %change %stddev %change %stddev %change %stddev \ | \ | \ | \ 2.09 -0.5 1.61 ± 2% -0.5 1.61 -0.5 1.60 mpstat.cpu.all.usr% 3324 -10.0% 2993 +3.6% 3444 ± 20% -6.2% 3118 ± 4% vmstat.system.cs 120.83 ± 11% +79.6% 217.00 ± 9% +105.8% 248.67 ± 10% +115.2% 260.00 ± 10% perf-c2c.DRAM.local 594.50 ± 6% +43.8% 854.83 ± 5% +56.6% 931.17 ± 10% +21.2% 720.67 ± 7% perf-c2c.DRAM.remote -16.64 +39.7% -23.25 +177.3% -46.14 +13.9% -18.94 sched_debug.cpu.nr_uninterruptible.min 6.59 ± 13% +6.5% 7.02 ± 11% +84.7% 12.18 ± 51% -6.6% 6.16 ± 10% sched_debug.cpu.nr_uninterruptible.stddev 0.04 -20.8% 0.03 ± 11% -20.8% 0.03 ± 11% -25.0% 0.03 turbostat.IPC 27.58 +3.7% 28.59 +4.2% 28.74 +3.8% 28.63 turbostat.RAMWatt 71000 ± 68% +66.4% 118174 ± 60% -49.8% 35634 ± 13% -59.9% 28485 ± 10% numa-meminfo.node0.AnonHugePages 1056 -100.0% 0.00 +1.9% 1076 -12.6% 
923.33 ± 44% numa-meminfo.node0.Inactive(file) 6.67 ±141% +15799.3% 1059 -100.0% 0.00 +2669.8% 184.65 ±223% numa-meminfo.node1.Inactive(file) 3797041 -25.8% 2816352 -25.2% 2839803 -25.0% 2847955 will-it-scale.104.threads 36509 -25.8% 27079 -25.2% 27305 -25.0% 27383 will-it-scale.per_thread_ops 3797041 -25.8% 2816352 -25.2% 2839803 -25.0% 2847955 will-it-scale.workload 1.142e+09 -26.2% 8.437e+08 -26.6% 8.391e+08 -25.7% 8.489e+08 numa-numastat.node0.local_node 1.143e+09 -26.1% 8.439e+08 -26.6% 8.392e+08 -25.7% 8.491e+08 numa-numastat.node0.numa_hit 1.148e+09 -25.4% 8.563e+08 ± 2% -23.7% 8.756e+08 ± 2% -24.2% 8.702e+08 numa-numastat.node1.local_node 1.149e+09 -25.4% 8.564e+08 ± 2% -23.8% 8.758e+08 ± 2% -24.2% 8.707e+08 numa-numastat.node1.numa_hit 10842 +0.9% 10941 +2.9% 11153 ± 2% +0.3% 10873 proc-vmstat.nr_mapped 32933 -2.6% 32068 +0.1% 32956 ± 2% -1.5% 32450 ± 2% proc-vmstat.nr_slab_reclaimable 2.291e+09 -25.8% 1.7e+09 -25.1% 1.715e+09 -24.9% 1.72e+09 proc-vmstat.numa_hit 2.291e+09 -25.8% 1.7e+09 -25.1% 1.715e+09 -25.0% 1.719e+09 proc-vmstat.numa_local 2.29e+09 -25.8% 1.699e+09 -25.1% 1.714e+09 -24.9% 1.718e+09 proc-vmstat.pgalloc_normal 2.289e+09 -25.8% 1.699e+09 -25.1% 1.714e+09 -24.9% 1.718e+09 proc-vmstat.pgfree 199.33 -100.0% 0.00 -0.3% 198.66 -16.4% 166.67 ± 44% numa-vmstat.node0.nr_active_file 264.00 -100.0% 0.00 +1.9% 269.00 -12.6% 230.83 ± 44% numa-vmstat.node0.nr_inactive_file 199.33 -100.0% 0.00 -0.3% 198.66 -16.4% 166.67 ± 44% numa-vmstat.node0.nr_zone_active_file 264.00 -100.0% 0.00 +1.9% 269.00 -12.6% 230.83 ± 44% numa-vmstat.node0.nr_zone_inactive_file 1.143e+09 -26.1% 8.439e+08 -26.6% 8.392e+08 -25.7% 8.491e+08 numa-vmstat.node0.numa_hit 1.142e+09 -26.2% 8.437e+08 -26.6% 8.391e+08 -25.7% 8.489e+08 numa-vmstat.node0.numa_local 1.67 ±141% +15799.3% 264.99 -100.0% 0.00 +2669.8% 46.16 ±223% numa-vmstat.node1.nr_inactive_file 1.67 ±141% +15799.3% 264.99 -100.0% 0.00 +2669.8% 46.16 ±223% numa-vmstat.node1.nr_zone_inactive_file 1.149e+09 -25.4% 8.564e+08 ± 
2% -23.8% 8.758e+08 ± 2% -24.2% 8.707e+08 numa-vmstat.node1.numa_hit 1.148e+09 -25.4% 8.563e+08 ± 2% -23.7% 8.756e+08 ± 2% -24.2% 8.702e+08 numa-vmstat.node1.numa_local 0.04 ±108% -76.2% 0.01 ± 23% +154.8% 0.10 ± 34% +110.0% 0.08 ± 88% perf-sched.sch_delay.avg.ms.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll 1.00 ± 93% +154.2% 2.55 ± 16% +133.4% 2.34 ± 39% +174.6% 2.76 ± 22% perf-sched.sch_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64 0.71 ±131% -91.3% 0.06 ± 74% +184.4% 2.02 ± 40% +122.6% 1.58 ± 98% perf-sched.sch_delay.max.ms.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll 1.84 ± 45% +35.2% 2.48 ± 31% +66.1% 3.05 ± 25% +61.9% 2.98 ± 10% perf-sched.sch_delay.max.ms.schedule_hrtimeout_range_clock.do_select.core_sys_select.kern_select 191.10 ± 2% +18.0% 225.55 ± 2% +18.9% 227.22 ± 4% +19.8% 228.89 ± 4% perf-sched.wait_and_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm 3484 -7.8% 3211 ± 6% -7.3% 3230 ± 7% -11.0% 3101 ± 3% perf-sched.wait_and_delay.count.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64 385.50 ± 14% +39.6% 538.17 ± 12% +104.5% 788.17 ± 54% +30.9% 504.67 ± 41% perf-sched.wait_and_delay.count.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate 3784 -7.5% 3500 ± 6% -7.1% 3516 ± 2% -10.6% 3383 ± 4% perf-sched.wait_and_delay.count.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate 118.67 ± 11% -62.6% 44.33 ±100% -45.9% 64.17 ± 71% -64.9% 41.67 ±100% perf-sched.wait_and_delay.count.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt 5043 ± 2% -13.0% 4387 ± 6% -14.7% 4301 ± 3% -16.5% 4210 ± 4% perf-sched.wait_and_delay.count.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm 167.12 ±222% +200.1% 501.48 ± 99% +2.9% 171.99 ±215% +399.7% 835.05 ± 44% 
perf-sched.wait_and_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64 2.17 ± 21% +8.9% 2.36 ± 16% +94.3% 4.21 ± 36% +40.4% 3.04 ± 21% perf-sched.wait_time.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone 191.09 ± 2% +18.0% 225.53 ± 2% +18.9% 227.21 ± 4% +19.8% 228.88 ± 4% perf-sched.wait_time.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm 293.46 ± 4% +12.8% 330.98 ± 6% +21.0% 355.18 ± 16% +7.1% 314.31 ± 26% perf-sched.wait_time.max.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm 30.33 ±105% -35.1% 19.69 ±138% +494.1% 180.18 ± 79% +135.5% 71.43 ± 76% perf-sched.wait_time.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone 0.59 ± 3% +125.2% 1.32 ± 2% +139.3% 1.41 +128.6% 1.34 perf-stat.i.MPKI 9.027e+09 -17.9% 7.408e+09 -17.5% 7.446e+09 -17.3% 7.465e+09 perf-stat.i.branch-instructions 0.64 -0.0 0.60 -0.0 0.60 -0.0 0.60 perf-stat.i.branch-miss-rate% 58102855 -23.3% 44580037 ± 2% -23.4% 44524712 ± 2% -22.9% 44801374 perf-stat.i.branch-misses 15.28 +7.0 22.27 +7.9 23.14 +7.2 22.50 perf-stat.i.cache-miss-rate% 25155306 ± 2% +82.7% 45953601 ± 3% +95.2% 49105558 ± 2% +87.7% 47212483 perf-stat.i.cache-misses 1.644e+08 +25.4% 2.062e+08 ± 2% +29.0% 2.12e+08 +27.6% 2.098e+08 perf-stat.i.cache-references 3258 -10.3% 2921 +2.5% 3341 ± 19% -6.7% 3041 ± 4% perf-stat.i.context-switches 6.73 +23.3% 8.30 +22.7% 8.26 +21.8% 8.20 perf-stat.i.cpi 145.97 -1.3% 144.13 -1.4% 143.89 -1.2% 144.29 perf-stat.i.cpu-migrations 11519 ± 3% -45.4% 6293 ± 3% -48.9% 5892 ± 2% -46.9% 6118 perf-stat.i.cycles-between-cache-misses 0.04 -0.0 0.03 -0.0 0.03 -0.0 0.03 perf-stat.i.dTLB-load-miss-rate% 3921408 -25.3% 2929564 -24.6% 2957991 -24.5% 2961168 perf-stat.i.dTLB-load-misses 1.098e+10 -18.1% 8.993e+09 -17.6% 9.045e+09 -16.3% 9.185e+09 perf-stat.i.dTLB-loads 0.00 ± 2% +0.0 0.00 ± 4% +0.0 0.00 ± 5% +0.0 0.00 ± 3% perf-stat.i.dTLB-store-miss-rate% 
5.606e+09 -23.2% 4.304e+09 -22.6% 4.338e+09 -22.4% 4.349e+09 perf-stat.i.dTLB-stores 95.65 -1.2 94.49 -0.9 94.74 -0.8 94.87 perf-stat.i.iTLB-load-miss-rate% 3876741 -25.0% 2905764 -24.8% 2915184 -25.0% 2909099 perf-stat.i.iTLB-load-misses 4.286e+10 -18.9% 3.477e+10 -18.4% 3.496e+10 -17.9% 3.517e+10 perf-stat.i.instructions 11061 +8.2% 11969 +8.4% 11996 +9.3% 12091 perf-stat.i.instructions-per-iTLB-miss 0.15 -18.9% 0.12 -18.5% 0.12 -17.9% 0.12 perf-stat.i.ipc 0.01 ± 96% -8.9% 0.01 ± 96% +72.3% 0.01 ± 73% +174.6% 0.02 ± 32% perf-stat.i.major-faults 48.65 ± 2% +46.2% 71.11 ± 2% +57.0% 76.37 ± 2% +45.4% 70.72 perf-stat.i.metric.K/sec 247.84 -18.9% 201.05 -18.4% 202.30 -17.7% 203.92 perf-stat.i.metric.M/sec 89.33 +0.5 89.79 -0.7 88.67 -2.1 87.23 perf-stat.i.node-load-miss-rate% 3138385 ± 2% +77.7% 5578401 ± 2% +89.9% 5958861 ± 2% +70.9% 5363943 perf-stat.i.node-load-misses 375827 ± 3% +69.2% 635857 ± 11% +102.6% 761334 ± 4% +109.3% 786773 ± 5% perf-stat.i.node-loads 1343194 -26.8% 983668 -22.6% 1039799 ± 2% -23.6% 1026076 perf-stat.i.node-store-misses 51550 ± 3% -19.0% 41748 ± 7% -22.5% 39954 ± 4% -20.6% 40921 ± 7% perf-stat.i.node-stores 0.59 ± 3% +125.1% 1.32 ± 2% +139.2% 1.40 +128.7% 1.34 perf-stat.overall.MPKI 0.64 -0.0 0.60 -0.0 0.60 -0.0 0.60 perf-stat.overall.branch-miss-rate% 15.30 +7.0 22.28 +7.9 23.16 +7.2 22.50 perf-stat.overall.cache-miss-rate% 6.73 +23.3% 8.29 +22.6% 8.25 +21.9% 8.20 perf-stat.overall.cpi 11470 ± 2% -45.3% 6279 ± 2% -48.8% 5875 ± 2% -46.7% 6108 perf-stat.overall.cycles-between-cache-misses 0.04 -0.0 0.03 -0.0 0.03 -0.0 0.03 perf-stat.overall.dTLB-load-miss-rate% 0.00 ± 2% +0.0 0.00 ± 4% +0.0 0.00 ± 5% +0.0 0.00 ± 4% perf-stat.overall.dTLB-store-miss-rate% 95.56 -1.4 94.17 -1.0 94.56 -0.9 94.66 perf-stat.overall.iTLB-load-miss-rate% 11059 +8.2% 11967 +8.5% 11994 +9.3% 12091 perf-stat.overall.instructions-per-iTLB-miss 0.15 -18.9% 0.12 -18.4% 0.12 -17.9% 0.12 perf-stat.overall.ipc 89.29 +0.5 89.78 -0.6 88.67 -2.1 87.20 
perf-stat.overall.node-load-miss-rate%
3396437 +9.5% 3718021 +9.1% 3705386 +9.6% 3721307  perf-stat.overall.path-length
8.997e+09 -17.9% 7.383e+09 -17.5% 7.421e+09 -17.3% 7.44e+09  perf-stat.ps.branch-instructions
57910417 -23.3% 44426577 ± 2% -23.4% 44376780 ± 2% -22.9% 44649215  perf-stat.ps.branch-misses
25075498 ± 2% +82.7% 45803186 ± 3% +95.2% 48942749 ± 2% +87.7% 47057228  perf-stat.ps.cache-misses
1.639e+08 +25.4% 2.056e+08 ± 2% +28.9% 2.113e+08 +27.6% 2.091e+08  perf-stat.ps.cache-references
3247 -10.3% 2911 +2.5% 3329 ± 19% -6.7% 3030 ± 4%  perf-stat.ps.context-switches
145.47 -1.3% 143.61 -1.4% 143.38 -1.2% 143.70  perf-stat.ps.cpu-migrations
3908900 -25.3% 2920218 -24.6% 2949112 -24.5% 2951633  perf-stat.ps.dTLB-load-misses
1.094e+10 -18.1% 8.963e+09 -17.6% 9.014e+09 -16.3% 9.154e+09  perf-stat.ps.dTLB-loads
5.587e+09 -23.2% 4.289e+09 -22.6% 4.324e+09 -22.4% 4.335e+09  perf-stat.ps.dTLB-stores
3863663 -25.0% 2895895 -24.8% 2905355 -25.0% 2899323  perf-stat.ps.iTLB-load-misses
4.272e+10 -18.9% 3.466e+10 -18.4% 3.484e+10 -17.9% 3.505e+10  perf-stat.ps.instructions
3128132 ± 2% +77.7% 5559939 ± 2% +89.9% 5938929 ± 2% +70.9% 5346027  perf-stat.ps.node-load-misses
375403 ± 3% +69.0% 634300 ± 11% +102.3% 759484 ± 4% +109.1% 784913 ± 5%  perf-stat.ps.node-loads
1338688 -26.8% 980311 -22.6% 1036279 ± 2% -23.6% 1022618  perf-stat.ps.node-store-misses
51546 ± 3% -19.1% 41692 ± 7% -22.6% 39921 ± 4% -20.7% 40875 ± 7%  perf-stat.ps.node-stores
1.29e+13 -18.8% 1.047e+13 -18.4% 1.052e+13 -17.8% 1.06e+13  perf-stat.total.instructions
0.96 -0.3 0.70 ± 2% -0.3 0.70 ± 2% -0.3 0.70  perf-profile.calltrace.cycles-pp.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
0.97 -0.3 0.72 -0.2 0.72 -0.2 0.72  perf-profile.calltrace.cycles-pp.syscall_return_via_sysret.fallocate64
0.76 ± 2% -0.2 0.54 ± 3% -0.2 0.59 ± 3% -0.1 0.68  perf-profile.calltrace.cycles-pp.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
0.82 -0.2 0.60 ± 2% -0.2 0.60 ± 2% -0.2 0.60  perf-profile.calltrace.cycles-pp.alloc_pages_mpol.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.91 -0.2 0.72 -0.2 0.72 -0.2 0.70 ± 2%  perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
51.50 -0.0 51.47 -0.5 50.99 -0.3 51.21  perf-profile.calltrace.cycles-pp.fallocate64
48.31 +0.0 48.35 +0.5 48.83 +0.3 48.61  perf-profile.calltrace.cycles-pp.ftruncate64
48.29 +0.0 48.34 +0.5 48.81 +0.3 48.60  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.ftruncate64
48.28 +0.0 48.33 +0.5 48.80 +0.3 48.59  perf-profile.calltrace.cycles-pp.do_sys_ftruncate.do_syscall_64.entry_SYSCALL_64_after_hwframe.ftruncate64
48.29 +0.1 48.34 +0.5 48.82 +0.3 48.60  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.ftruncate64
48.28 +0.1 48.33 +0.5 48.80 +0.3 48.58  perf-profile.calltrace.cycles-pp.do_truncate.do_sys_ftruncate.do_syscall_64.entry_SYSCALL_64_after_hwframe.ftruncate64
48.27 +0.1 48.33 +0.5 48.80 +0.3 48.58  perf-profile.calltrace.cycles-pp.notify_change.do_truncate.do_sys_ftruncate.do_syscall_64.entry_SYSCALL_64_after_hwframe
48.27 +0.1 48.32 +0.5 48.80 +0.3 48.58  perf-profile.calltrace.cycles-pp.shmem_setattr.notify_change.do_truncate.do_sys_ftruncate.do_syscall_64
48.25 +0.1 48.31 +0.5 48.78 +0.3 48.57  perf-profile.calltrace.cycles-pp.shmem_undo_range.shmem_setattr.notify_change.do_truncate.do_sys_ftruncate
2.06 ± 2% +0.1 2.13 ± 2% +0.1 2.16 +0.0 2.09  perf-profile.calltrace.cycles-pp.truncate_inode_folio.shmem_undo_range.shmem_setattr.notify_change.do_truncate
0.68 +0.1 0.76 ± 2% +0.1 0.75 +0.1 0.74  perf-profile.calltrace.cycles-pp.lru_add_fn.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp
1.67 +0.1 1.77 +0.1 1.81 ± 2% +0.0 1.71  perf-profile.calltrace.cycles-pp.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
45.76 +0.1 45.86 +0.5 46.29 +0.4 46.13  perf-profile.calltrace.cycles-pp.__folio_batch_release.shmem_undo_range.shmem_setattr.notify_change.do_truncate
1.78 ± 2% +0.1 1.92 ± 2% +0.2 1.95 +0.1 1.88  perf-profile.calltrace.cycles-pp.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr.notify_change
0.69 ± 5% +0.1 0.84 ± 4% +0.2 0.86 ± 5% +0.1 0.79 ± 2%  perf-profile.calltrace.cycles-pp.get_mem_cgroup_from_mm.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
1.56 ± 2% +0.2 1.76 ± 2% +0.2 1.79 +0.2 1.71  perf-profile.calltrace.cycles-pp.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr
0.85 ± 4% +0.4 1.23 ± 2% +0.4 1.26 ± 3% +0.3 1.14  perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.78 ± 4% +0.4 1.20 ± 3% +0.4 1.22 +0.3 1.11  perf-profile.calltrace.cycles-pp.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range
0.73 ± 4% +0.4 1.17 ± 3% +0.5 1.19 ± 2% +0.4 1.08  perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio
41.60 +0.7 42.30 +0.1 41.73 +0.5 42.06  perf-profile.calltrace.cycles-pp.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
41.50 +0.7 42.23 +0.2 41.66 +0.5 41.99  perf-profile.calltrace.cycles-pp.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
48.39 +0.8 49.14 +0.2 48.64 +0.5 48.89  perf-profile.calltrace.cycles-pp.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
0.00 +0.8 0.77 ± 4% +0.8 0.80 ± 2% +0.8 0.78 ± 2%  perf-profile.calltrace.cycles-pp.mem_cgroup_commit_charge.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
40.24 +0.8 41.03 +0.2 40.48 +0.6 40.80  perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp
40.22 +0.8 41.01 +0.2 40.47 +0.6 40.79  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio
0.00 +0.8 0.79 ± 3% +0.8 0.82 ± 3% +0.8 0.79 ± 2%  perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp
40.19 +0.8 40.98 +0.3 40.44 +0.6 40.76  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru
1.33 ± 5% +0.8 2.13 ± 4% +0.9 2.21 ± 4% +0.8 2.09 ± 2%  perf-profile.calltrace.cycles-pp.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
48.16 +0.8 48.98 +0.3 48.48 +0.6 48.72  perf-profile.calltrace.cycles-pp.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
0.00 +0.9 0.88 ± 2% +0.9 0.91 +0.9 0.86  perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio
47.92 +0.9 48.81 +0.4 48.30 +0.6 48.56  perf-profile.calltrace.cycles-pp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe
47.07 +0.9 48.01 +0.5 47.60 +0.7 47.79  perf-profile.calltrace.cycles-pp.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
46.59 +1.1 47.64 +0.7 47.24 +0.8 47.44  perf-profile.calltrace.cycles-pp.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate
0.99 -0.3 0.73 ± 2% -0.3 0.74 -0.3 0.74  perf-profile.children.cycles-pp.syscall_return_via_sysret
0.96 -0.3 0.70 ± 2% -0.3 0.70 ± 2% -0.3 0.71  perf-profile.children.cycles-pp.shmem_alloc_folio
0.78 ± 2% -0.2 0.56 ± 3% -0.2 0.61 ± 3% -0.1 0.69 ± 2%  perf-profile.children.cycles-pp.shmem_inode_acct_blocks
0.83 -0.2 0.61 ± 2% -0.2 0.61 ± 2% -0.2 0.62  perf-profile.children.cycles-pp.alloc_pages_mpol
0.92 -0.2 0.73 -0.2 0.73 -0.2 0.71 ± 2%  perf-profile.children.cycles-pp.syscall_exit_to_user_mode
0.74 ± 2% -0.2 0.55 ± 2% -0.2 0.56 ± 2% -0.2 0.58 ± 3%  perf-profile.children.cycles-pp.xas_store
0.67 -0.2 0.50 ± 3% -0.2 0.50 ± 2% -0.2 0.50  perf-profile.children.cycles-pp.__alloc_pages
0.43 -0.1 0.31 ± 2% -0.1 0.31 -0.1 0.31  perf-profile.children.cycles-pp.__entry_text_start
0.41 ± 2% -0.1 0.30 ± 3% -0.1 0.31 ± 2% -0.1 0.31 ± 2%  perf-profile.children.cycles-pp.free_unref_page_list
0.35 -0.1 0.25 ± 2% -0.1 0.25 ± 2% -0.1 0.25  perf-profile.children.cycles-pp.xas_load
0.35 ± 2% -0.1 0.25 ± 4% -0.1 0.25 ± 2% -0.1 0.26 ± 2%  perf-profile.children.cycles-pp.__mod_lruvec_state
0.39 -0.1 0.30 ± 2% -0.1 0.29 ± 3% -0.1 0.30  perf-profile.children.cycles-pp.get_page_from_freelist
0.27 ± 2% -0.1 0.19 ± 4% -0.1 0.19 ± 5% -0.1 0.19 ± 3%  perf-profile.children.cycles-pp.__mod_node_page_state
0.32 ± 3% -0.1 0.24 ± 3% -0.1 0.25 -0.1 0.26 ± 4%  perf-profile.children.cycles-pp.find_lock_entries
0.23 ± 2% -0.1 0.15 ± 4% -0.1 0.16 ± 3% -0.1 0.16 ± 5%  perf-profile.children.cycles-pp.xas_descend
0.25 ± 3% -0.1 0.18 ± 3% -0.1 0.18 ± 3% -0.1 0.18 ± 2%  perf-profile.children.cycles-pp.__dquot_alloc_space
0.28 ± 3% -0.1 0.20 ± 3% -0.1 0.21 ± 2% -0.1 0.20 ± 2%  perf-profile.children.cycles-pp._raw_spin_lock
0.16 ± 3% -0.1 0.10 ± 5% -0.1 0.10 ± 4% -0.1 0.10 ± 4%  perf-profile.children.cycles-pp.xas_find_conflict
0.26 ± 2% -0.1 0.20 ± 3% -0.1 0.19 ± 3% -0.1 0.19  perf-profile.children.cycles-pp.filemap_get_entry
0.26 -0.1 0.20 ± 2% -0.1 0.20 ± 4% -0.1 0.20 ± 2%  perf-profile.children.cycles-pp.rmqueue
0.20 ± 3% -0.1 0.14 ± 3% -0.0 0.15 ± 3% -0.0 0.16 ± 3%  perf-profile.children.cycles-pp.truncate_cleanup_folio
0.19 ± 5% -0.1 0.14 ± 4% -0.0 0.15 ± 5% -0.0 0.15 ± 4%  perf-profile.children.cycles-pp.xas_clear_mark
0.17 ± 5% -0.0 0.12 ± 4% -0.0 0.12 ± 6% -0.0 0.13 ± 3%  perf-profile.children.cycles-pp.xas_init_marks
0.15 ± 4% -0.0 0.10 ± 4% -0.0 0.10 ± 4% -0.0 0.11 ± 3%  perf-profile.children.cycles-pp.free_unref_page_commit
0.15 ± 12% -0.0 0.10 ± 20% -0.1 0.10 ± 15% -0.1 0.10 ± 14%  perf-profile.children.cycles-pp._raw_spin_lock_irq
51.56 -0.0 51.51 -0.5 51.03 -0.3 51.26  perf-profile.children.cycles-pp.fallocate64
0.18 ± 3% -0.0 0.14 ± 3% -0.0 0.13 ± 5% -0.0 0.14 ± 2%  perf-profile.children.cycles-pp.__cond_resched
0.07 ± 5% -0.0 0.02 ± 99% -0.0 0.04 ± 44% -0.0 0.04 ± 44%  perf-profile.children.cycles-pp.xas_find
0.13 ± 2% -0.0 0.09 -0.0 0.10 ± 5% -0.0 0.12 ± 4%  perf-profile.children.cycles-pp.security_vm_enough_memory_mm
0.14 ± 4% -0.0 0.10 ± 7% -0.0 0.10 ± 6% -0.0 0.10 ± 3%  perf-profile.children.cycles-pp.__fget_light
0.06 ± 6% -0.0 0.02 ± 99% -0.0 0.05 -0.0 0.05  perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
0.12 ± 4% -0.0 0.08 ± 4% -0.0 0.08 ± 4% -0.0 0.08  perf-profile.children.cycles-pp.xas_start
0.08 ± 5% -0.0 0.05 -0.0 0.05 -0.0 0.05 ± 7%  perf-profile.children.cycles-pp.__folio_throttle_swaprate
0.12 -0.0 0.08 ± 5% -0.0 0.08 ± 5% -0.0 0.08 ± 5%  perf-profile.children.cycles-pp.folio_unlock
0.14 ± 3% -0.0 0.11 ± 3% -0.0 0.11 ± 4% -0.0 0.12 ± 3%  perf-profile.children.cycles-pp.try_charge_memcg
0.12 ± 6% -0.0 0.08 ± 5% -0.0 0.09 ± 5% -0.0 0.09 ± 7%  perf-profile.children.cycles-pp.free_unref_page_prepare
0.12 ± 3% -0.0 0.09 ± 4% -0.0 0.09 ± 7% -0.0 0.09  perf-profile.children.cycles-pp.noop_dirty_folio
0.20 ± 2% -0.0 0.17 ± 5% -0.0 0.18 -0.0 0.19 ± 2%  perf-profile.children.cycles-pp.page_counter_uncharge
0.10 -0.0 0.07 ± 5% -0.0 0.08 ± 8% +0.0 0.10 ± 4%  perf-profile.children.cycles-pp.cap_vm_enough_memory
0.09 ± 6% -0.0 0.06 ± 6% -0.0 0.06 ± 7% -0.0 0.06 ± 7%  perf-profile.children.cycles-pp._raw_spin_trylock
0.09 ± 5% -0.0 0.06 ± 7% -0.0 0.06 ± 7% -0.0 0.07 ± 7%  perf-profile.children.cycles-pp.inode_add_bytes
0.06 ± 6% -0.0 0.03 ± 70% -0.0 0.04 ± 44% -0.0 0.05 ± 7%  perf-profile.children.cycles-pp.filemap_free_folio
0.06 ± 6% -0.0 0.03 ± 70% +0.0 0.07 ± 7% +0.1 0.14 ± 6%  perf-profile.children.cycles-pp.percpu_counter_add_batch
0.12 ± 3% -0.0 0.10 ± 5% -0.0 0.09 ± 4% -0.0 0.09  perf-profile.children.cycles-pp.shmem_recalc_inode
0.12 ± 3% -0.0 0.09 ± 5% -0.0 0.09 ± 5% -0.0 0.10 ± 4%  perf-profile.children.cycles-pp.__folio_cancel_dirty
0.09 ± 5% -0.0 0.07 ± 7% -0.0 0.09 ± 4% +0.1 0.16 ± 7%  perf-profile.children.cycles-pp.__vm_enough_memory
0.08 ± 5% -0.0 0.06 -0.0 0.06 ± 6% -0.0 0.06 ± 6%  perf-profile.children.cycles-pp.security_file_permission
0.08 ± 5% -0.0 0.06 -0.0 0.06 ± 6% -0.0 0.06 ± 6%  perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
0.08 ± 6% -0.0 0.05 ± 7% -0.0 0.05 ± 8% -0.0 0.05 ± 7%  perf-profile.children.cycles-pp.apparmor_file_permission
0.09 ± 4% -0.0 0.07 ± 8% -0.0 0.09 ± 8% -0.0 0.07 ± 6%  perf-profile.children.cycles-pp.__percpu_counter_limited_add
0.08 ± 6% -0.0 0.06 ± 8% -0.0 0.06 -0.0 0.06 ± 6%  perf-profile.children.cycles-pp.__list_add_valid_or_report
0.07 ± 8% -0.0 0.05 -0.0 0.05 ± 7% -0.0 0.06 ± 9%  perf-profile.children.cycles-pp.get_pfnblock_flags_mask
0.14 ± 3% -0.0 0.12 ± 6% -0.0 0.12 ± 3% -0.0 0.13 ± 3%  perf-profile.children.cycles-pp.cgroup_rstat_updated
0.07 ± 5% -0.0 0.05 -0.0 0.05 -0.0 0.05  perf-profile.children.cycles-pp.policy_nodemask
0.24 ± 2% -0.0 0.22 ± 2% -0.0 0.22 ± 2% -0.0 0.22 ± 2%  perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
0.08 -0.0 0.07 ± 7% -0.0 0.06 ± 6% -0.0 0.07 ± 6%  perf-profile.children.cycles-pp.xas_create
0.08 ± 8% -0.0 0.06 ± 7% -0.0 0.06 ± 7% -0.0 0.07 ± 7%  perf-profile.children.cycles-pp.mem_cgroup_update_lru_size
0.00 +0.0 0.00 +0.0 0.00 +0.1 0.08 ± 8%  perf-profile.children.cycles-pp.__file_remove_privs
0.28 ± 2% +0.0 0.28 ± 4% +0.0 0.30 +0.0 0.30  perf-profile.children.cycles-pp.uncharge_batch
0.14 ± 5% +0.0 0.17 ± 4% +0.0 0.17 ± 2% +0.0 0.16  perf-profile.children.cycles-pp.uncharge_folio
0.43 +0.0 0.46 ± 4% +0.0 0.48 +0.0 0.47  perf-profile.children.cycles-pp.__mem_cgroup_uncharge_list
48.31 +0.0 48.35 +0.5 48.83 +0.3 48.61  perf-profile.children.cycles-pp.ftruncate64
48.28 +0.0 48.33 +0.5 48.80 +0.3 48.59  perf-profile.children.cycles-pp.do_sys_ftruncate
48.28 +0.1 48.33 +0.5 48.80 +0.3 48.58  perf-profile.children.cycles-pp.do_truncate
48.27 +0.1 48.33 +0.5 48.80 +0.3 48.58  perf-profile.children.cycles-pp.notify_change
48.27 +0.1 48.32 +0.5 48.80 +0.3 48.58  perf-profile.children.cycles-pp.shmem_setattr
48.26 +0.1 48.32 +0.5 48.79 +0.3 48.57  perf-profile.children.cycles-pp.shmem_undo_range
2.06 ± 2% +0.1 2.13 ± 2% +0.1 2.16 +0.0 2.10  perf-profile.children.cycles-pp.truncate_inode_folio
0.69 +0.1 0.78 +0.1 0.77 +0.1 0.76  perf-profile.children.cycles-pp.lru_add_fn
1.72 ± 2% +0.1 1.80 +0.1 1.83 ± 2% +0.0 1.74  perf-profile.children.cycles-pp.shmem_add_to_page_cache
45.77 +0.1 45.86 +0.5 46.29 +0.4 46.13  perf-profile.children.cycles-pp.__folio_batch_release
1.79 ± 2% +0.1 1.93 ± 2% +0.2 1.96 +0.1 1.88  perf-profile.children.cycles-pp.filemap_remove_folio
0.13 ± 5% +0.1 0.28 +0.1 0.19 ± 5% +0.1 0.24 ± 2%  perf-profile.children.cycles-pp.file_modified
0.69 ± 5% +0.1 0.84 ± 3% +0.2 0.86 ± 5% +0.1 0.79 ± 2%  perf-profile.children.cycles-pp.get_mem_cgroup_from_mm
0.09 ± 7% +0.2 0.24 ± 2% +0.1 0.15 ± 3% +0.0 0.14 ± 4%  perf-profile.children.cycles-pp.inode_needs_update_time
1.58 ± 3% +0.2 1.77 ± 2% +0.2 1.80 +0.1 1.72  perf-profile.children.cycles-pp.__filemap_remove_folio
0.15 ± 3% +0.4 0.50 ± 3% +0.4 0.52 ± 2% +0.4 0.52 ± 2%  perf-profile.children.cycles-pp.__count_memcg_events
0.79 ± 4% +0.4 1.20 ± 3% +0.4 1.22 +0.3 1.12  perf-profile.children.cycles-pp.filemap_unaccount_folio
0.36 ± 5% +0.4 0.77 ± 4% +0.4 0.81 ± 2% +0.4 0.78 ± 2%  perf-profile.children.cycles-pp.mem_cgroup_commit_charge
98.33 +0.5 98.78 +0.4 98.77 +0.4 98.77  perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
97.74 +0.6 98.34 +0.6 98.32 +0.6 98.33  perf-profile.children.cycles-pp.do_syscall_64
41.62 +0.7 42.33 +0.1 41.76 +0.5 42.08  perf-profile.children.cycles-pp.folio_add_lru
43.91 +0.7 44.64 +0.2 44.09 +0.5 44.40  perf-profile.children.cycles-pp.folio_batch_move_lru
48.39 +0.8 49.15 +0.2 48.64 +0.5 48.89  perf-profile.children.cycles-pp.__x64_sys_fallocate
1.34 ± 5% +0.8 2.14 ± 4% +0.9 2.22 ± 4% +0.8 2.10 ± 2%  perf-profile.children.cycles-pp.__mem_cgroup_charge
1.61 ± 4% +0.8 2.42 ± 2% +0.9 2.47 ± 2% +0.6 2.24  perf-profile.children.cycles-pp.__mod_lruvec_page_state
48.17 +0.8 48.98 +0.3 48.48 +0.6 48.72  perf-profile.children.cycles-pp.vfs_fallocate
47.94 +0.9 48.82 +0.4 48.32 +0.6 48.56  perf-profile.children.cycles-pp.shmem_fallocate
47.10 +0.9 48.04 +0.5 47.64 +0.7 47.83  perf-profile.children.cycles-pp.shmem_get_folio_gfp
84.34 +0.9 85.28 +0.8 85.11 +0.9 85.28  perf-profile.children.cycles-pp.folio_lruvec_lock_irqsave
84.31 +0.9 85.26 +0.8 85.08 +0.9 85.26  perf-profile.children.cycles-pp._raw_spin_lock_irqsave
84.24 +1.0 85.21 +0.8 85.04 +1.0 85.21  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
46.65 +1.1 47.70 +0.7 47.30 +0.8 47.48  perf-profile.children.cycles-pp.shmem_alloc_and_add_folio
1.23 ± 4% +1.4 2.58 ± 2% +1.4 2.63 ± 2% +1.3 2.52  perf-profile.children.cycles-pp.__mod_memcg_lruvec_state
0.98 -0.3 0.73 ± 2% -0.2 0.74 -0.2 0.74  perf-profile.self.cycles-pp.syscall_return_via_sysret
0.88 -0.2 0.70 -0.2 0.70 -0.2 0.69 ± 2%  perf-profile.self.cycles-pp.syscall_exit_to_user_mode
0.60 -0.2 0.45 -0.1 0.46 ± 2% -0.2 0.46 ± 3%  perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
0.41 ± 3% -0.1 0.27 ± 3% -0.1 0.27 ± 2% -0.1 0.28 ± 2%  perf-profile.self.cycles-pp.release_pages
0.41 ± 3% -0.1 0.29 ± 2% -0.1 0.28 ± 3% -0.1 0.29 ± 2%  perf-profile.self.cycles-pp.folio_batch_move_lru
0.41 -0.1 0.30 ± 3% -0.1 0.30 ± 2% -0.1 0.32 ± 4%  perf-profile.self.cycles-pp.xas_store
0.30 ± 3% -0.1 0.18 ± 5% -0.1 0.19 ± 2% -0.1 0.19 ± 2%  perf-profile.self.cycles-pp.shmem_add_to_page_cache
0.38 ± 2% -0.1 0.27 ± 2% -0.1 0.27 ± 2% -0.1 0.27  perf-profile.self.cycles-pp.__entry_text_start
0.30 ± 3% -0.1 0.20 ± 6% -0.1 0.20 ± 5% -0.1 0.21 ± 2%  perf-profile.self.cycles-pp.lru_add_fn
0.28 ± 2% -0.1 0.20 ± 5% -0.1 0.20 ± 2% -0.1 0.20 ± 3%  perf-profile.self.cycles-pp.shmem_fallocate
0.26 ± 2% -0.1 0.18 ± 5% -0.1 0.18 ± 4% -0.1 0.19 ± 3%  perf-profile.self.cycles-pp.__mod_node_page_state
0.27 ± 3% -0.1 0.20 ± 2% -0.1 0.20 ± 3% -0.1 0.20 ± 3%  perf-profile.self.cycles-pp._raw_spin_lock
0.21 ± 2% -0.1 0.15 ± 4% -0.1 0.15 ± 4% -0.1 0.16 ± 2%  perf-profile.self.cycles-pp.__alloc_pages
0.20 ± 2% -0.1 0.14 ± 3% -0.1 0.14 ± 2% -0.1 0.14 ± 5%  perf-profile.self.cycles-pp.xas_descend
0.26 ± 3% -0.1 0.20 ± 4% -0.1 0.21 ± 3% -0.0 0.22 ± 4%  perf-profile.self.cycles-pp.find_lock_entries
0.06 ± 6% -0.1 0.00 +0.0 0.06 ± 7% +0.1 0.13 ± 6%  perf-profile.self.cycles-pp.percpu_counter_add_batch
0.18 ± 4% -0.0 0.13 ± 5% -0.0 0.13 ± 3% -0.0 0.14 ± 4%  perf-profile.self.cycles-pp.xas_clear_mark
0.15 ± 7% -0.0 0.10 ± 11% -0.0 0.11 ± 8% -0.0 0.10 ± 6%  perf-profile.self.cycles-pp.shmem_inode_acct_blocks
0.13 ± 4% -0.0 0.09 ± 5% -0.0 0.08 ± 5% -0.0 0.09  perf-profile.self.cycles-pp.free_unref_page_commit
0.13 -0.0 0.09 ± 5% -0.0 0.09 ± 5% -0.0 0.09 ± 6%  perf-profile.self.cycles-pp._raw_spin_lock_irq
0.16 ± 4% -0.0 0.12 ± 4% -0.0 0.12 ± 3% -0.0 0.12 ± 4%  perf-profile.self.cycles-pp.__dquot_alloc_space
0.16 ± 4% -0.0 0.12 ± 4% -0.0 0.11 ± 6% -0.0 0.11  perf-profile.self.cycles-pp.shmem_alloc_and_add_folio
0.13 ± 5% -0.0 0.09 ± 7% -0.0 0.09 -0.0 0.10 ± 7%  perf-profile.self.cycles-pp.__filemap_remove_folio
0.13 ± 2% -0.0 0.09 ± 5% -0.0 0.09 ± 4% -0.0 0.09 ± 4%  perf-profile.self.cycles-pp.get_page_from_freelist
0.06 ± 7% -0.0 0.02 ± 99% -0.0 0.02 ± 99% -0.0 0.02 ±141%  perf-profile.self.cycles-pp.apparmor_file_permission
0.12 ± 4% -0.0 0.09 ± 5% -0.0 0.09 ± 5% -0.0 0.08 ± 8%  perf-profile.self.cycles-pp.vfs_fallocate
0.13 ± 3% -0.0 0.10 ± 5% -0.0 0.10 ± 4% -0.0 0.10 ± 4%  perf-profile.self.cycles-pp.fallocate64
0.11 ± 4% -0.0 0.07 -0.0 0.08 ± 6% -0.0 0.08 ± 6%  perf-profile.self.cycles-pp.xas_start
0.07 ± 5% -0.0 0.03 ± 70% -0.0 0.04 ± 44% -0.1 0.02 ±141%  perf-profile.self.cycles-pp.shmem_alloc_folio
0.14 ± 4% -0.0 0.10 ± 7% -0.0 0.10 ± 5% -0.0 0.10 ± 3%  perf-profile.self.cycles-pp.__fget_light
0.10 ± 4% -0.0 0.06 ± 7% -0.0 0.06 ± 7% -0.0 0.06 ± 6%  perf-profile.self.cycles-pp.rmqueue
0.10 ± 4% -0.0 0.07 ± 8% -0.0 0.07 ± 5% -0.0 0.07 ± 5%  perf-profile.self.cycles-pp.alloc_pages_mpol
0.12 ± 3% -0.0 0.09 ± 4% -0.0 0.09 ± 4% -0.0 0.09 ± 4%  perf-profile.self.cycles-pp.xas_load
0.11 ± 4% -0.0 0.08 ± 7% -0.0 0.08 ± 5% -0.0 0.08 ± 4%  perf-profile.self.cycles-pp.folio_unlock
0.15 ± 2% -0.0 0.12 ± 5% -0.0 0.12 ± 4% -0.0 0.12 ± 4%  perf-profile.self.cycles-pp.shmem_get_folio_gfp
0.10 -0.0 0.07 -0.0 0.08 ± 7% +0.0 0.10 ± 4%  perf-profile.self.cycles-pp.cap_vm_enough_memory
0.16 ± 2% -0.0 0.13 ± 6% -0.0 0.14 -0.0 0.14  perf-profile.self.cycles-pp.page_counter_uncharge
0.12 ± 5% -0.0 0.09 ± 4% -0.0 0.09 ± 7% -0.0 0.09 ± 5%  perf-profile.self.cycles-pp.__cond_resched
0.06 ± 6% -0.0 0.03 ± 70% -0.0 0.04 ± 44% -0.0 0.05  perf-profile.self.cycles-pp.filemap_free_folio
0.12 -0.0 0.09 ± 4% -0.0 0.09 ± 4% -0.0 0.09  perf-profile.self.cycles-pp.noop_dirty_folio
0.12 ± 3% -0.0 0.10 ± 5% -0.0 0.10 ± 7% -0.0 0.10 ± 5%  perf-profile.self.cycles-pp.free_unref_page_list
0.10 ± 3% -0.0 0.07 ± 5% -0.0 0.07 ± 5% -0.0 0.08 ± 6%  perf-profile.self.cycles-pp.filemap_remove_folio
0.10 ± 5% -0.0 0.07 ± 5% -0.0 0.07 -0.0 0.08 ± 4%  perf-profile.self.cycles-pp.try_charge_memcg
0.12 ± 3% -0.0 0.10 ± 8% -0.0 0.10 -0.0 0.10 ± 4%  perf-profile.self.cycles-pp.cgroup_rstat_updated
0.09 ± 4% -0.0 0.07 ± 7% -0.0 0.07 ± 5% -0.0 0.07 ± 7%  perf-profile.self.cycles-pp.__folio_cancel_dirty
0.08 ± 4% -0.0 0.06 ± 8% -0.0 0.06 ± 6% -0.0 0.06 ± 8%  perf-profile.self.cycles-pp._raw_spin_lock_irqsave
0.08 ± 5% -0.0 0.06 -0.0 0.06 -0.0 0.06  perf-profile.self.cycles-pp._raw_spin_trylock
0.08 -0.0 0.06 ± 6% -0.0 0.06 ± 8% -0.0 0.06  perf-profile.self.cycles-pp.folio_add_lru
0.07 ± 5% -0.0 0.05 -0.0 0.05 -0.0 0.04 ± 44%  perf-profile.self.cycles-pp.xas_find_conflict
0.08 ± 8% -0.0 0.06 ± 6% -0.0 0.06 ± 6% -0.0 0.06 ± 7%  perf-profile.self.cycles-pp.__mod_lruvec_state
0.56 ± 6% -0.0 0.54 ± 9% -0.0 0.55 ± 5% -0.2 0.40 ± 3%  perf-profile.self.cycles-pp.__mod_lruvec_page_state
0.08 ± 10% -0.0 0.06 ± 9% -0.0 0.06 -0.0 0.06  perf-profile.self.cycles-pp.truncate_cleanup_folio
0.07 ± 10% -0.0 0.05 -0.0 0.05 ± 7% -0.0 0.05 ± 8%  perf-profile.self.cycles-pp.xas_init_marks
0.08 ± 4% -0.0 0.06 ± 7% +0.0 0.08 ± 4% -0.0 0.07 ± 10%  perf-profile.self.cycles-pp.__percpu_counter_limited_add
0.07 ± 7% -0.0 0.05 -0.0 0.05 ± 7% -0.0 0.06 ± 9%  perf-profile.self.cycles-pp.get_pfnblock_flags_mask
0.07 ± 5% -0.0 0.06 ± 8% -0.0 0.06 ± 6% -0.0 0.05 ± 7%  perf-profile.self.cycles-pp.__list_add_valid_or_report
0.07 ± 5% -0.0 0.06 ± 9% -0.0 0.06 ± 7% -0.0 0.06  perf-profile.self.cycles-pp.mem_cgroup_update_lru_size
0.08 ± 4% -0.0 0.07 ± 5% -0.0 0.06 -0.0 0.06 ± 6%  perf-profile.self.cycles-pp.filemap_get_entry
0.00 +0.0 0.00 +0.0 0.00 +0.1 0.08 ± 8%  perf-profile.self.cycles-pp.__file_remove_privs
0.14 ± 2% +0.0 0.16 ± 6% +0.0 0.17 ± 3% +0.0 0.16  perf-profile.self.cycles-pp.uncharge_folio
0.02 ±141% +0.0 0.06 ± 8% +0.0 0.06 +0.0 0.06 ± 9%  perf-profile.self.cycles-pp.uncharge_batch
0.21 ± 9% +0.1 0.31 ± 7% +0.1 0.32 ± 5% +0.1 0.30 ± 4%  perf-profile.self.cycles-pp.mem_cgroup_commit_charge
0.69 ± 5% +0.1 0.83 ± 4% +0.2 0.86 ± 5% +0.1 0.79 ± 2%  perf-profile.self.cycles-pp.get_mem_cgroup_from_mm
0.06 ± 6% +0.2 0.22 ± 2% +0.1 0.13 ± 5% +0.1 0.11 ± 4%  perf-profile.self.cycles-pp.inode_needs_update_time
0.14 ± 8% +0.3 0.42 ± 7% +0.3 0.44 ± 6% +0.3 0.40 ± 3%  perf-profile.self.cycles-pp.__mem_cgroup_charge
0.13 ± 7% +0.4 0.49 ± 3% +0.4 0.51 ± 2% +0.4 0.51 ± 2%  perf-profile.self.cycles-pp.__count_memcg_events
84.24 +1.0 85.21 +0.8 85.04 +1.0 85.21  perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
1.12 ± 5% +1.4 2.50 ± 2% +1.4 2.55 ± 2% +1.3 2.43  perf-profile.self.cycles-pp.__mod_memcg_lruvec_state

[2]
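As a reading aid for the comparison tables in this report: each row lists the base-commit value followed by three patched-commit values, and each %change column is the relative delta of the patched value against the base. A minimal sketch of that arithmetic, using the headline will-it-scale numbers from the table below (all values taken from this report):

```python
# Sketch: how the %change columns in the LKP comparison tables are derived.
# Values are the will-it-scale.workload numbers from this report
# (base commit 130617edc1cd1ba1 vs. the first patched commit).
def pct_change(base: float, patched: float) -> float:
    """Percent change of `patched` relative to `base`."""
    return (patched - base) / base * 100.0

base_ops = 3982212      # will-it-scale.workload, base commit
patched_ops = 2785941   # will-it-scale.workload, first patched commit

print(f"{pct_change(base_ops, patched_ops):+.1f}%")  # -30.0%, matching the table
```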
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-12/performance/x86_64-rhel-8.3/thread/50%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale

130617edc1cd1ba1  51d74c18a9c61e7ee33bc90b522  c5f50d8b23c7982ac875791755b  ac6a9444dec85dc50c6bfbc4ee7
----------------  ---------------------------  ---------------------------  ---------------------------
  %stddev  %change  %stddev  %change  %stddev  %change  %stddev
      \        |        \        |        \        |        \
10544810 ± 11% +1.7% 10720938 ± 4% +1.7% 10719232 ± 4% +24.8% 13160448  meminfo.DirectMap2M
1.87 -0.4 1.43 ± 3% -0.4 1.47 ± 2% -0.4 1.46  mpstat.cpu.all.usr%
3171 -5.3% 3003 ± 2% +17.4% 3725 ± 30% +2.6% 3255 ± 5%  vmstat.system.cs
93.97 ±130% +360.8% 433.04 ± 83% +5204.4% 4984 ±150% +1540.1% 1541 ± 56%  boot-time.boot
6762 ±101% +96.3% 13275 ± 75% +3212.0% 223971 ±150% +752.6% 57655 ± 60%  boot-time.idle
84.83 ± 9% +55.8% 132.17 ± 16% +75.6% 149.00 ± 11% +98.0% 168.00 ± 6%  perf-c2c.DRAM.local
484.17 ± 3% +37.1% 663.67 ± 10% +44.1% 697.67 ± 7% -0.2% 483.00 ± 5%  perf-c2c.DRAM.remote
72763 ± 5% +14.4% 83212 ± 12% +141.5% 175744 ± 83% +55.7% 113321 ± 21%  turbostat.C1
0.08 -25.0% 0.06 -27.1% 0.06 ± 6% -25.0% 0.06  turbostat.IPC
27.90 +4.6% 29.18 +4.9% 29.27 +3.9% 29.00  turbostat.RAMWatt
3982212 -30.0% 2785941 -28.9% 2829631 -26.7% 2919929  will-it-scale.52.threads
76580 -30.0% 53575 -28.9% 54415 -26.7% 56152  will-it-scale.per_thread_ops
3982212 -30.0% 2785941 -28.9% 2829631 -26.7% 2919929  will-it-scale.workload
1.175e+09 ± 2% -28.6% 8.392e+08 ± 2% -28.2% 8.433e+08 ± 2% -25.4% 8.762e+08  numa-numastat.node0.local_node
1.175e+09 ± 2% -28.6% 8.394e+08 ± 2% -28.3% 8.434e+08 ± 2% -25.4% 8.766e+08  numa-numastat.node0.numa_hit
1.231e+09 ± 2% -31.3% 8.463e+08 ± 3% -29.5% 8.683e+08 ± 3% -27.7% 8.901e+08  numa-numastat.node1.local_node
1.232e+09 ± 2% -31.3% 8.466e+08 ± 3% -29.5% 8.688e+08 ± 3% -27.7% 8.907e+08  numa-numastat.node1.numa_hit
2.408e+09 -30.0% 1.686e+09 -28.9% 1.712e+09 -26.6% 1.767e+09  proc-vmstat.numa_hit
2.406e+09 -30.0% 1.685e+09 -28.9% 1.712e+09 -26.6% 1.766e+09  proc-vmstat.numa_local
2.404e+09 -29.9% 1.684e+09 -28.8% 1.71e+09 -26.6% 1.765e+09  proc-vmstat.pgalloc_normal
2.404e+09 -29.9% 1.684e+09 -28.8% 1.71e+09 -26.6% 1.765e+09  proc-vmstat.pgfree
2302080 -0.9% 2280448 -0.5% 2290432 -1.2% 2274688  proc-vmstat.unevictable_pgs_scanned
83444 ± 71% +34.2% 111978 ± 65% -9.1% 75877 ± 86% -76.2% 19883 ± 12%  numa-meminfo.node0.AnonHugePages
150484 ± 55% +9.3% 164434 ± 46% -9.3% 136435 ± 53% -62.4% 56548 ± 18%  numa-meminfo.node0.AnonPages
167427 ± 50% +8.2% 181159 ± 41% -8.3% 153613 ± 47% -56.1% 73487 ± 14%  numa-meminfo.node0.Inactive
166720 ± 50% +8.7% 181159 ± 41% -8.3% 152902 ± 48% -56.6% 72379 ± 14%  numa-meminfo.node0.Inactive(anon)
111067 ± 62% -13.7% 95819 ± 59% +14.6% 127294 ± 60% +86.1% 206693 ± 8%  numa-meminfo.node1.AnonHugePages
179594 ± 47% -4.2% 172027 ± 43% +9.3% 196294 ± 39% +55.8% 279767 ± 3%  numa-meminfo.node1.AnonPages
257406 ± 30% -2.1% 251990 ± 32% +9.9% 282766 ± 26% +42.2% 366131 ± 8%  numa-meminfo.node1.AnonPages.max
196741 ± 43% -3.6% 189753 ± 39% +8.1% 212645 ± 36% +50.9% 296827 ± 3%  numa-meminfo.node1.Inactive
196385 ± 43% -3.9% 188693 ± 39% +8.1% 212288 ± 36% +51.1% 296827 ± 3%  numa-meminfo.node1.Inactive(anon)
37621 ± 55% +9.3% 41115 ± 46% -9.3% 34116 ± 53% -62.4% 14141 ± 18%  numa-vmstat.node0.nr_anon_pages
41664 ± 50% +8.6% 45233 ± 41% -8.2% 38240 ± 47% -56.6% 18079 ± 14%  numa-vmstat.node0.nr_inactive_anon
41677 ± 50% +8.6% 45246 ± 41% -8.2% 38250 ± 47% -56.6% 18092 ± 14%  numa-vmstat.node0.nr_zone_inactive_anon
1.175e+09 ± 2% -28.6% 8.394e+08 ± 2% -28.3% 8.434e+08 ± 2% -25.4% 8.766e+08  numa-vmstat.node0.numa_hit
1.175e+09 ± 2% -28.6% 8.392e+08 ± 2% -28.2% 8.433e+08 ± 2% -25.4% 8.762e+08  numa-vmstat.node0.numa_local
44903 ± 47% -4.2% 43015 ± 43% +9.3% 49079 ± 39% +55.8% 69957 ± 3%  numa-vmstat.node1.nr_anon_pages
49030 ± 43% -3.9% 47139 ± 39% +8.3% 53095 ± 36% +51.4% 74210 ± 3%  numa-vmstat.node1.nr_inactive_anon
49035 ± 43% -3.9% 47135 ± 39% +8.3% 53098 ± 36% +51.3% 74212 ± 3%  numa-vmstat.node1.nr_zone_inactive_anon
1.232e+09 ± 2% -31.3% 8.466e+08 ± 3% -29.5% 8.688e+08 ± 3% -27.7% 8.907e+08  numa-vmstat.node1.numa_hit
1.231e+09 ± 2% -31.3% 8.463e+08 ± 3% -29.5% 8.683e+08 ± 3% -27.7% 8.901e+08  numa-vmstat.node1.numa_local
5256095 ± 59% +557.5% 34561019 ± 89% +4549.1% 2.444e+08 ±146% +1646.7% 91810708 ± 50%  sched_debug.cfs_rq:/.avg_vruntime.avg
8288083 ± 52% +365.0% 38543329 ± 81% +3020.3% 2.586e+08 ±145% +1133.9% 1.023e+08 ± 49%  sched_debug.cfs_rq:/.avg_vruntime.max
1364475 ± 40% +26.7% 1728262 ± 29% +346.8% 6096205 ±118% +180.4% 3826288 ± 41%  sched_debug.cfs_rq:/.avg_vruntime.stddev
161.62 ± 99% -42.4% 93.09 ±144% -57.3% 69.01 ± 74% -86.6% 21.73 ± 10%  sched_debug.cfs_rq:/.load_avg.avg
902.70 ±107% -46.8% 480.28 ±171% -57.3% 385.28 ±120% -94.8% 47.03 ± 8%  sched_debug.cfs_rq:/.load_avg.stddev
5256095 ± 59% +557.5% 34561019 ± 89% +4549.1% 2.444e+08 ±146% +1646.7% 91810708 ± 50%  sched_debug.cfs_rq:/.min_vruntime.avg
8288083 ± 52% +365.0% 38543329 ± 81% +3020.3% 2.586e+08 ±145% +1133.9% 1.023e+08 ± 49%  sched_debug.cfs_rq:/.min_vruntime.max
1364475 ± 40% +26.7% 1728262 ± 29% +346.8% 6096205 ±118% +180.4% 3826288 ± 41%  sched_debug.cfs_rq:/.min_vruntime.stddev
31.84 ±161% -71.8% 8.98 ± 44% -84.0% 5.10 ± 43% -79.0% 6.68 ± 24%  sched_debug.cfs_rq:/.removed.load_avg.avg
272.14 ±192% -84.9% 41.10 ± 29% -89.7% 28.08 ± 21% -87.8% 33.19 ± 12%  sched_debug.cfs_rq:/.removed.load_avg.stddev
334.70 ± 17% +32.4% 443.13 ± 19% +34.3% 449.66 ± 11% +14.6% 383.66 ± 24%  sched_debug.cfs_rq:/.util_est_enqueued.avg
322.95 ± 23% +12.5% 363.30 ± 19% +27.9% 412.92 ± 6% +11.2% 359.17 ± 18%  sched_debug.cfs_rq:/.util_est_enqueued.stddev
240924 ± 52% +136.5% 569868 ± 62% +2031.9% 5136297 ±145% +600.7% 1688103 ± 51%  sched_debug.cpu.clock.avg
240930 ± 52% +136.5% 569874 ± 62% +2031.9% 5136304 ±145% +600.7% 1688109 ± 51%  sched_debug.cpu.clock.max
240917 ± 52% +136.5% 569861 ± 62% +2032.0% 5136290 ±145% +600.7% 1688095 ± 51%  sched_debug.cpu.clock.min
239307 ± 52% +136.6% 566140 ± 62% +2009.9% 5049095 ±145% +600.7% 1676912 ± 51%  sched_debug.cpu.clock_task.avg
239479 ± 52% +136.5% 566334 ± 62% +2014.9% 5064818 ±145% +600.4% 1677208 ± 51%  sched_debug.cpu.clock_task.max
232462 ± 53% +140.6% 559281 ± 63% +2064.0% 5030381 ±146% +617.9% 1668793 ± 52%  sched_debug.cpu.clock_task.min
683.22 ± 3% +0.7% 688.14 ± 4% +1762.4% 12724 ±138% +19.2% 814.55 ± 8%  sched_debug.cpu.clock_task.stddev
3267 ± 57% +146.0% 8040 ± 63% +2127.2% 72784 ±146% +652.5% 24591 ± 52%  sched_debug.cpu.curr->pid.avg
10463 ± 39% +101.0% 21030 ± 54% +1450.9% 162275 ±143% +448.5% 57391 ± 49%  sched_debug.cpu.curr->pid.max
3373 ± 57% +149.1% 8403 ± 64% +2141.6% 75621 ±146% +657.7% 25561 ± 52%  sched_debug.cpu.curr->pid.stddev
58697 ± 14% +1.6% 59612 ± 7% +1.9e+05% 1.142e+08 ±156% +105.4% 120565 ± 32%  sched_debug.cpu.nr_switches.max
6023 ± 10% +13.6% 6843 ± 11% +2.9e+05% 17701514 ±151% +124.8% 13541 ± 32%  sched_debug.cpu.nr_switches.stddev
240917 ± 52% +136.5% 569862 ± 62% +2032.0% 5136291 ±145% +600.7% 1688096 ± 51%  sched_debug.cpu_clk
240346 ± 52% +136.9% 569288 ± 62% +2036.8% 5135723 ±145% +602.1% 1687529 ± 51%  sched_debug.ktime
241481 ± 51% +136.2% 570443 ± 62% +2027.2% 5136856 ±145% +599.3% 1688672 ± 51%  sched_debug.sched_clk
0.04 ± 9% -19.3% 0.03 ± 6% -19.7% 0.03 ± 6% -14.3% 0.03 ± 8%  perf-sched.wait_and_delay.avg.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
0.04 ± 11% -18.0% 0.03 ± 13% -22.8% 0.03 ± 10% -14.0% 0.04 ± 15%  perf-sched.wait_and_delay.avg.ms.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.04 ± 8% -22.3% 0.03 ± 5% -19.4% 0.03 ± 3% -12.6% 0.04 ± 9%  perf-sched.wait_and_delay.avg.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
0.91 ± 2% +11.3% 1.01 ± 5% +65.3% 1.51 ± 53% +28.8% 1.17 ± 11%  perf-sched.wait_and_delay.avg.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
0.04 ± 13% -90.3% 0.00 ±223% -66.4% 0.01 ±101% -83.8% 0.01 ±223%  perf-sched.wait_and_delay.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
24.11 ± 3% -8.5% 22.08 ± 11% -25.2% 18.04 ± 50% -29.5% 17.01 ± 21%  perf-sched.wait_and_delay.avg.ms.pipe_read.vfs_read.ksys_read.do_syscall_64
1.14 +15.1% 1.31 -24.1% 0.86 ± 70% +13.7% 1.29  perf-sched.wait_and_delay.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
189.94 ± 3% +18.3% 224.73 ± 4% +20.3% 228.52 ± 3% +22.1% 231.82 ± 3%  perf-sched.wait_and_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
1652 ± 4% -13.4% 1431 ± 4% -13.4% 1431 ± 2% -14.3% 1416 ± 6%  perf-sched.wait_and_delay.count.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
1628 ± 8% -15.0% 1383 ± 9% -16.6% 1357 ± 2% -16.6% 1358 ± 7%  perf-sched.wait_and_delay.count.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
83.67 ± 7% -87.6% 10.33 ±223% -59.2% 34.17 ±100% -85.5% 12.17 ±223%  perf-sched.wait_and_delay.count.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
2835 ± 3% +10.6% 3135 ± 10% +123.8% 6345 ± 80% +48.4% 4207 ± 19%  perf-sched.wait_and_delay.count.pipe_read.vfs_read.ksys_read.do_syscall_64
3827 ± 4% -13.0% 3328 ± 3% -12.9% 3335 ± 2% -14.7% 3264 ± 2%  perf-sched.wait_and_delay.count.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
1.71 ±165% -83.4% 0.28 ± 21% -82.3% 0.30 ± 16% -74.6% 0.43 ± 60%  perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
0.43 ± 17% -43.8% 0.24 ± 26% -44.4% 0.24 ± 27% -32.9% 0.29 ± 23%  perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.46 ± 17% -36.7% 0.29 ± 12% -35.7% 0.30 ± 19% -35.3% 0.30 ± 21%  perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
45.41 ± 4% +13.4% 51.51 ± 12% +148.6% 112.88 ± 86% +56.7% 71.18 ± 21%  perf-sched.wait_and_delay.max.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
0.30 ± 34% -90.7% 0.03 ±223% -66.0% 0.10 ±110% -88.2% 0.04 ±223%  perf-sched.wait_and_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
2.39 +10.7% 2.65 ± 2% -24.3% 1.81 ± 70% +12.1% 2.68 ± 2%  perf-sched.wait_and_delay.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
0.04 ± 9% -19.3% 0.03 ± 6% -19.7% 0.03 ± 6% -14.3% 0.03 ± 8%  perf-sched.wait_time.avg.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
0.04 ± 11% -18.0% 0.03 ± 13% -22.8% 0.03 ± 10% -14.0% 0.04 ± 15%  perf-sched.wait_time.avg.ms.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.04 ± 8% -22.3% 0.03 ± 5% -19.4% 0.03 ± 3% -12.6% 0.04 ± 9%  perf-sched.wait_time.avg.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
0.04 ± 11% -33.1% 0.03 ± 17% -32.3% 0.03 ± 22% -16.3% 0.04 ± 12%  perf-sched.wait_time.avg.ms.__cond_resched.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.90 ± 2% +11.5% 1.00 ± 5% +66.1% 1.50 ± 53% +29.2% 1.16 ± 11%  perf-sched.wait_time.avg.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
0.04 ± 13% -26.6% 0.03 ± 12% -33.6% 0.03 ± 11% -18.1% 0.04 ± 16%  perf-sched.wait_time.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
24.05 ± 3% -9.0% 21.90 ± 10% -25.0% 18.04 ± 50% -29.4% 16.97 ± 21%  perf-sched.wait_time.avg.ms.pipe_read.vfs_read.ksys_read.do_syscall_64
1.13 +15.2% 1.30 +15.0% 1.30 +13.7% 1.29  perf-sched.wait_time.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
189.93 ± 3% +18.3% 224.72 ± 4% +20.3% 228.50 ± 3% +22.1% 231.81 ± 3%  perf-sched.wait_time.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
1.71 ±165% -83.4% 0.28 ± 21% -82.3% 0.30 ± 16% -74.6% 0.43 ± 60%  perf-sched.wait_time.max.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
0.43 ± 17% -43.8% 0.24 ± 26% -44.4% 0.24 ± 27% -32.9% 0.29 ± 23%  perf-sched.wait_time.max.ms.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.46 ± 17% -36.7% 0.29 ± 12% -35.7% 0.30 ± 19% -35.3% 0.30 ± 21%  perf-sched.wait_time.max.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
0.31 ± 26% -42.1% 0.18 ± 58% -64.1% 0.11 ± 40% -28.5% 0.22 ± 30%  perf-sched.wait_time.max.ms.__cond_resched.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe
45.41 ± 4% +13.4% 51.50 ± 12% +148.6% 112.87 ± 86% +56.8% 71.18 ± 21%  perf-sched.wait_time.max.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
2.39 +10.7% 2.64 ± 2% +12.9% 2.69 ± 2% +12.1% 2.68 ± 2%  perf-sched.wait_time.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
0.75 +142.0% 1.83 ± 2% +146.9% 1.86 +124.8% 1.70  perf-stat.i.MPKI
8.47e+09 -24.4% 6.407e+09 -23.2% 6.503e+09 -21.2% 6.674e+09  perf-stat.i.branch-instructions
0.66 -0.0 0.63 -0.0 0.64 -0.0 0.63  perf-stat.i.branch-miss-rate%
56364992 -28.3% 40421603 ± 3% -26.0% 41734061 ± 2% -25.8% 41829975  perf-stat.i.branch-misses
14.64 +6.7 21.30 +6.9 21.54 +6.5 21.10  perf-stat.i.cache-miss-rate%
30868184 +81.3% 55977240 ± 3% +87.7% 57950237 +76.2% 54404466  perf-stat.i.cache-misses
2.107e+08 +24.7% 2.627e+08 ± 2% +27.6% 2.69e+08 +22.3% 2.578e+08  perf-stat.i.cache-references
3106 -5.5% 2934 ± 2% +16.4% 3615 ± 29% +2.4% 3181 ± 5%  perf-stat.i.context-switches
3.55 +33.4% 4.74 +31.5% 4.67 +27.4% 4.52  perf-stat.i.cpi
4722 -44.8% 2605 ± 3% -46.7% 2515 -43.3% 2675  perf-stat.i.cycles-between-cache-misses
0.04 -0.0 0.04 -0.0 0.04 -0.0 0.04  perf-stat.i.dTLB-load-miss-rate%
4117232 -29.1% 2917107 -28.1% 2961876 -25.8% 3056956  perf-stat.i.dTLB-load-misses
1.051e+10 -24.1% 7.979e+09 -23.0% 8.1e+09 -19.7% 8.44e+09  perf-stat.i.dTLB-loads
0.00 ± 3% +0.0 0.00 ± 6% +0.0 0.00 ± 5% +0.0 0.00 ± 4%  perf-stat.i.dTLB-store-miss-rate%
5.886e+09 -27.5% 4.269e+09 -26.3% 4.34e+09 -24.1% 4.467e+09  perf-stat.i.dTLB-stores
78.16 -6.6 71.51 -6.4 71.75 -5.9 72.23  perf-stat.i.iTLB-load-miss-rate%
4131074 ± 3% -30.0% 2891515 -29.2% 2922789 -26.2% 3048227  perf-stat.i.iTLB-load-misses
4.098e+10 -25.0% 3.072e+10 -23.9% 3.119e+10 -21.6% 3.214e+10  perf-stat.i.instructions
9929 ± 2% +7.0% 10627 +7.5% 10673 +6.2% 10547  perf-stat.i.instructions-per-iTLB-miss
0.28 -25.0% 0.21 -23.9% 0.21 -21.5% 0.22  perf-stat.i.ipc
63.49 +43.8% 91.27 ± 3% +48.2% 94.07 +38.6% 87.97  perf-stat.i.metric.K/sec
241.12 -24.6% 181.87 -23.4% 184.70 -20.9% 190.75  perf-stat.i.metric.M/sec
90.84 -0.4 90.49 -0.9 89.98 -2.9 87.93  perf-stat.i.node-load-miss-rate%
3735316 +78.6% 6669641 ± 3% +83.1% 6839047 +62.4% 6067727  perf-stat.i.node-load-misses
377465 ± 4% +86.1% 702512 ± 11% +101.7% 761510 ± 4% +120.8% 833359  perf-stat.i.node-loads
1322217 -27.6% 957081 ± 5% -22.9% 1019779 ± 2% -19.4% 1066178  perf-stat.i.node-store-misses
37459 ± 3% -23.0% 28826 ± 5% -19.2% 30253 ± 6% -23.4% 28682 ± 3%  perf-stat.i.node-stores
0.75 +141.8% 1.82 ± 2% +146.6% 1.86 +124.7% 1.69  perf-stat.overall.MPKI
0.67 -0.0 0.63 -0.0 0.64 -0.0 0.63  perf-stat.overall.branch-miss-rate%
14.65 +6.7 21.30 +6.9 21.54 +6.5 21.11  perf-stat.overall.cache-miss-rate%
3.55 +33.4% 4.73 +31.4% 4.66 +27.4% 4.52  perf-stat.overall.cpi
4713 -44.8% 2601 ± 3% -46.7% 2511 -43.3% 2671  perf-stat.overall.cycles-between-cache-misses
0.04 -0.0 0.04 -0.0 0.04 -0.0 0.04  perf-stat.overall.dTLB-load-miss-rate%
0.00 ± 3% +0.0 0.00 ± 5% +0.0 0.00 ± 5% +0.0 0.00  perf-stat.overall.dTLB-store-miss-rate%
78.14 -6.7 71.47 -6.4 71.70 -5.9 72.20  perf-stat.overall.iTLB-load-miss-rate%
9927 ± 2% +7.0% 10624 +7.5% 10672 +6.2% 10547  perf-stat.overall.instructions-per-iTLB-miss
0.28 -25.0% 0.21 -23.9% 0.21 -21.5% 0.22  perf-stat.overall.ipc
90.82 -0.3 90.49 -0.8 89.98 -2.9 87.92  perf-stat.overall.node-load-miss-rate%
3098901 +7.1% 3318983 +6.9% 3313112 +7.0% 3316044  perf-stat.overall.path-length
8.441e+09 -24.4% 6.385e+09 -23.2% 6.48e+09 -21.2% 6.652e+09  perf-stat.ps.branch-instructions
56179581 -28.3% 40286337 ± 3% -26.0% 41593521 ± 2% -25.8% 41687151  perf-stat.ps.branch-misses
30759982 +81.3% 55777812 ± 3% +87.7% 57746279 +76.3% 54217757  perf-stat.ps.cache-misses
2.1e+08 +24.6% 2.618e+08 ± 2% +27.6% 2.68e+08 +22.3% 2.569e+08  perf-stat.ps.cache-references
3095 -5.5% 2923 ± 2% +16.2% 3597 ± 29% +2.3% 3167 ± 5%  perf-stat.ps.context-switches
135.89 -0.8% 134.84 -0.7% 134.99 -1.0% 134.55  perf-stat.ps.cpu-migrations
4103292 -29.1% 2907270 -28.1% 2951746 -25.7% 3046739  perf-stat.ps.dTLB-load-misses
1.048e+10 -24.1% 7.952e+09 -23.0% 8.072e+09 -19.7% 8.412e+09  perf-stat.ps.dTLB-loads
5.866e+09 -27.5% 4.255e+09 -26.3% 4.325e+09 -24.1% 4.452e+09  perf-stat.ps.dTLB-stores
4117020 ± 3% -30.0% 2881750 -29.3% 2912744 -26.2% 3037970  perf-stat.ps.iTLB-load-misses
4.084e+10 -25.0% 3.062e+10 -23.9% 3.109e+10 -21.6% 3.203e+10  perf-stat.ps.instructions
3722149 +78.5% 6645867 ± 3% +83.1% 6814976 +62.5% 6046854  perf-stat.ps.node-load-misses
376240 ± 4% +86.1% 700053 ± 11% +101.7% 758898 ± 4% +120.8% 830575  perf-stat.ps.node-loads
1317772 -27.6% 953773 ± 5% -22.9% 1016183 ± 2% -19.4% 1062457  perf-stat.ps.node-store-misses
37408 ± 3% -23.2% 28748 ± 5% -19.3% 30192 ± 6% -23.5% 28607 ± 3%  perf-stat.ps.node-stores
1.234e+13 -25.1% 9.246e+12 -24.0% 9.375e+12 -21.5% 9.683e+12  perf-stat.total.instructions
1.28 -0.4 0.90 ± 2% -0.4 0.91 -0.3 0.94 ± 2%  perf-profile.calltrace.cycles-pp.syscall_return_via_sysret.fallocate64
1.26 ± 2% -0.4 0.90 ± 3% -0.3 0.92 ± 2% -0.3 0.94 ± 2%  perf-profile.calltrace.cycles-pp.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
1.08 ± 2% -0.3 0.77 ± 3%
-0.3 0.79 ± 2% -0.3 0.81 ± 2% perf-profile.calltrace.cycles-pp.alloc_pages_mpol.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate 0.92 ± 2% -0.3 0.62 ± 3% -0.3 0.63 -0.3 0.66 ± 2% perf-profile.calltrace.cycles-pp.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate 0.84 ± 3% -0.2 0.61 ± 3% -0.2 0.63 ± 2% -0.2 0.65 ± 2% perf-profile.calltrace.cycles-pp.__alloc_pages.alloc_pages_mpol.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp 29.27 -0.2 29.09 -1.0 28.32 -0.2 29.04 perf-profile.calltrace.cycles-pp.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate 1.26 -0.2 1.08 -0.2 1.07 -0.2 1.10 perf-profile.calltrace.cycles-pp.folio_batch_move_lru.lru_add_drain_cpu.__folio_batch_release.shmem_undo_range.shmem_setattr 1.26 -0.2 1.08 -0.2 1.07 -0.2 1.10 perf-profile.calltrace.cycles-pp.lru_add_drain_cpu.__folio_batch_release.shmem_undo_range.shmem_setattr.notify_change 1.24 -0.2 1.06 -0.2 1.05 -0.2 1.08 perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu.__folio_batch_release.shmem_undo_range 1.23 -0.2 1.06 -0.2 1.05 -0.2 1.08 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu 1.24 -0.2 1.06 -0.2 1.05 -0.2 1.08 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu.__folio_batch_release 29.15 -0.2 28.99 -0.9 28.23 -0.2 28.94 perf-profile.calltrace.cycles-pp.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate 1.20 -0.2 1.04 ± 2% -0.2 1.05 -0.2 1.02 ± 2% perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64 27.34 -0.1 27.22 ± 2% -0.9 26.49 -0.1 27.20 
perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio 27.36 -0.1 27.24 ± 2% -0.9 26.51 -0.1 27.22 perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp 27.28 -0.1 27.17 ± 2% -0.8 26.44 -0.1 27.16 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru 25.74 -0.1 25.67 ± 2% +0.2 25.98 +0.9 26.62 perf-profile.calltrace.cycles-pp.release_pages.__folio_batch_release.shmem_undo_range.shmem_setattr.notify_change 23.43 +0.0 23.43 ± 2% +0.3 23.70 +0.9 24.34 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.release_pages.__folio_batch_release.shmem_undo_range 23.45 +0.0 23.45 ± 2% +0.3 23.73 +0.9 24.35 perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.release_pages.__folio_batch_release.shmem_undo_range.shmem_setattr 23.37 +0.0 23.39 ± 2% +0.3 23.67 +0.9 24.30 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.release_pages.__folio_batch_release 0.68 ± 3% +0.0 0.72 ± 4% +0.1 0.73 ± 3% +0.1 0.74 perf-profile.calltrace.cycles-pp.__mem_cgroup_uncharge_list.release_pages.__folio_batch_release.shmem_undo_range.shmem_setattr 1.08 +0.1 1.20 +0.1 1.17 +0.1 1.15 ± 2% perf-profile.calltrace.cycles-pp.lru_add_fn.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp 2.91 +0.3 3.18 ± 2% +0.3 3.23 +0.1 3.02 perf-profile.calltrace.cycles-pp.truncate_inode_folio.shmem_undo_range.shmem_setattr.notify_change.do_truncate 2.56 +0.4 2.92 ± 2% +0.4 2.98 +0.2 2.75 perf-profile.calltrace.cycles-pp.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr.notify_change 1.36 ± 3% +0.4 1.76 ± 9% +0.4 1.75 ± 5% +0.3 1.68 ± 3% 
perf-profile.calltrace.cycles-pp.get_mem_cgroup_from_mm.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate 2.22 +0.5 2.68 ± 2% +0.5 2.73 +0.3 2.50 perf-profile.calltrace.cycles-pp.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr 0.00 +0.6 0.60 ± 2% +0.6 0.61 ± 2% +0.6 0.61 perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.release_pages.__folio_batch_release.shmem_undo_range.shmem_setattr 2.33 +0.6 2.94 +0.6 2.96 ± 3% +0.3 2.59 perf-profile.calltrace.cycles-pp.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate 0.00 +0.7 0.72 ± 2% +0.7 0.72 ± 2% +0.7 0.68 ± 2% perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.lru_add_fn.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio 0.69 ± 4% +0.8 1.47 ± 3% +0.8 1.48 ± 2% +0.7 1.42 perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio 1.24 ± 2% +0.8 2.04 ± 2% +0.8 2.07 ± 2% +0.6 1.82 perf-profile.calltrace.cycles-pp.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range 0.00 +0.8 0.82 ± 4% +0.8 0.85 ± 3% +0.8 0.78 ± 2% perf-profile.calltrace.cycles-pp.__count_memcg_events.mem_cgroup_commit_charge.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp 1.17 ± 2% +0.8 2.00 ± 2% +0.9 2.04 ± 2% +0.6 1.77 perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio 0.59 ± 4% +0.9 1.53 +0.9 1.53 ± 4% +0.8 1.37 ± 2% perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp 1.38 +1.0 2.33 ± 2% +1.0 2.34 ± 3% +0.6 1.94 ± 2% 
perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate 0.62 ± 3% +1.0 1.66 ± 5% +1.1 1.68 ± 4% +1.0 1.57 ± 2% perf-profile.calltrace.cycles-pp.mem_cgroup_commit_charge.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate 38.70 +1.2 39.90 +0.5 39.23 +0.7 39.45 perf-profile.calltrace.cycles-pp.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64 38.34 +1.3 39.65 +0.6 38.97 +0.9 39.20 perf-profile.calltrace.cycles-pp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe 37.24 +1.6 38.86 +0.9 38.17 +1.1 38.35 perf-profile.calltrace.cycles-pp.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64 36.64 +1.8 38.40 +1.1 37.72 +1.2 37.88 perf-profile.calltrace.cycles-pp.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate 2.47 ± 2% +2.1 4.59 ± 8% +2.1 4.61 ± 5% +1.9 4.37 ± 2% perf-profile.calltrace.cycles-pp.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate 1.30 -0.4 0.92 ± 2% -0.4 0.93 -0.4 0.96 perf-profile.children.cycles-pp.syscall_return_via_sysret 1.28 ± 2% -0.4 0.90 ± 3% -0.3 0.93 ± 2% -0.3 0.95 ± 2% perf-profile.children.cycles-pp.shmem_alloc_folio 30.44 -0.3 30.11 -1.1 29.33 -0.4 30.07 perf-profile.children.cycles-pp.folio_batch_move_lru 1.10 ± 2% -0.3 0.78 ± 3% -0.3 0.81 ± 2% -0.3 0.82 ± 2% perf-profile.children.cycles-pp.alloc_pages_mpol 0.96 ± 2% -0.3 0.64 ± 3% -0.3 0.65 -0.3 0.68 ± 2% perf-profile.children.cycles-pp.shmem_inode_acct_blocks 0.88 -0.3 0.58 ± 2% -0.3 0.60 ± 2% -0.3 0.62 ± 2% perf-profile.children.cycles-pp.xas_store 0.88 ± 3% -0.2 0.64 ± 3% -0.2 0.66 ± 2% -0.2 0.67 ± 2% perf-profile.children.cycles-pp.__alloc_pages 29.29 -0.2 29.10 -1.0 28.33 -0.2 29.06 perf-profile.children.cycles-pp.folio_add_lru 0.61 ± 2% -0.2 0.43 ± 3% -0.2 0.44 ± 2% 
-0.2 0.45 ± 3% perf-profile.children.cycles-pp.__entry_text_start 1.26 -0.2 1.09 -0.2 1.08 -0.2 1.10 perf-profile.children.cycles-pp.lru_add_drain_cpu 0.56 -0.2 0.39 ± 4% -0.2 0.40 ± 3% -0.2 0.40 ± 3% perf-profile.children.cycles-pp.free_unref_page_list 1.22 -0.2 1.06 ± 2% -0.2 1.06 -0.2 1.04 ± 2% perf-profile.children.cycles-pp.syscall_exit_to_user_mode 0.46 -0.1 0.32 ± 3% -0.1 0.32 -0.1 0.32 ± 3% perf-profile.children.cycles-pp.__mod_lruvec_state 0.41 ± 3% -0.1 0.28 ± 4% -0.1 0.28 ± 3% -0.1 0.29 ± 2% perf-profile.children.cycles-pp.xas_load 0.44 ± 4% -0.1 0.31 ± 4% -0.1 0.32 ± 2% -0.1 0.34 ± 3% perf-profile.children.cycles-pp.find_lock_entries 0.50 ± 3% -0.1 0.37 ± 2% -0.1 0.39 ± 4% -0.1 0.39 ± 2% perf-profile.children.cycles-pp.get_page_from_freelist 0.24 ± 7% -0.1 0.12 ± 5% -0.1 0.13 ± 2% -0.1 0.13 ± 3% perf-profile.children.cycles-pp.__list_add_valid_or_report 25.89 -0.1 25.78 ± 2% +0.2 26.08 +0.8 26.73 perf-profile.children.cycles-pp.release_pages 0.34 ± 2% -0.1 0.24 ± 4% -0.1 0.23 ± 2% -0.1 0.23 ± 4% perf-profile.children.cycles-pp.__mod_node_page_state 0.38 ± 3% -0.1 0.28 ± 4% -0.1 0.29 ± 3% -0.1 0.28 perf-profile.children.cycles-pp._raw_spin_lock 0.32 ± 2% -0.1 0.22 ± 5% -0.1 0.23 ± 2% -0.1 0.23 ± 2% perf-profile.children.cycles-pp.__dquot_alloc_space 0.26 ± 2% -0.1 0.17 ± 2% -0.1 0.18 ± 3% -0.1 0.18 ± 2% perf-profile.children.cycles-pp.xas_descend 0.22 ± 3% -0.1 0.14 ± 4% -0.1 0.14 ± 3% -0.1 0.14 ± 2% perf-profile.children.cycles-pp.free_unref_page_commit 0.25 -0.1 0.17 ± 3% -0.1 0.18 ± 4% -0.1 0.18 ± 4% perf-profile.children.cycles-pp.xas_clear_mark 0.32 ± 4% -0.1 0.25 ± 3% -0.1 0.26 ± 4% -0.1 0.26 ± 2% perf-profile.children.cycles-pp.rmqueue 0.23 ± 2% -0.1 0.16 ± 2% -0.1 0.16 ± 4% -0.1 0.16 ± 6% perf-profile.children.cycles-pp.xas_init_marks 0.24 ± 2% -0.1 0.17 ± 5% -0.1 0.17 ± 4% -0.1 0.18 ± 2% perf-profile.children.cycles-pp.__cond_resched 0.25 ± 4% -0.1 0.18 ± 2% -0.1 0.18 ± 2% -0.1 0.18 ± 4% perf-profile.children.cycles-pp.truncate_cleanup_folio 
0.30 ± 3% -0.1 0.23 ± 4% -0.1 0.22 ± 3% -0.1 0.22 ± 2% perf-profile.children.cycles-pp.filemap_get_entry 0.20 ± 2% -0.1 0.13 ± 5% -0.1 0.13 ± 3% -0.1 0.14 ± 4% perf-profile.children.cycles-pp.folio_unlock 0.16 ± 4% -0.1 0.10 ± 5% -0.1 0.10 ± 7% -0.1 0.11 ± 6% perf-profile.children.cycles-pp.xas_find_conflict 0.19 ± 3% -0.1 0.13 ± 5% -0.0 0.14 ± 12% -0.1 0.14 ± 5% perf-profile.children.cycles-pp._raw_spin_lock_irq 0.17 ± 5% -0.1 0.12 ± 3% -0.1 0.12 ± 4% -0.0 0.13 ± 3% perf-profile.children.cycles-pp.noop_dirty_folio 0.13 ± 4% -0.1 0.08 ± 9% -0.1 0.08 ± 8% -0.0 0.09 perf-profile.children.cycles-pp.security_vm_enough_memory_mm 0.18 ± 8% -0.1 0.13 ± 4% -0.0 0.13 ± 5% -0.0 0.13 ± 5% perf-profile.children.cycles-pp.shmem_recalc_inode 0.16 ± 2% -0.1 0.11 ± 3% -0.0 0.12 ± 4% -0.0 0.12 ± 6% perf-profile.children.cycles-pp.free_unref_page_prepare 0.09 ± 5% -0.1 0.04 ± 45% -0.0 0.05 -0.0 0.05 ± 7% perf-profile.children.cycles-pp.mem_cgroup_update_lru_size 0.10 ± 7% -0.0 0.05 ± 45% -0.0 0.06 ± 13% -0.0 0.06 ± 7% perf-profile.children.cycles-pp.cap_vm_enough_memory 0.14 ± 5% -0.0 0.10 -0.0 0.10 ± 4% -0.0 0.11 ± 5% perf-profile.children.cycles-pp.__folio_cancel_dirty 0.14 ± 5% -0.0 0.10 ± 4% -0.0 0.10 ± 3% -0.0 0.10 ± 6% perf-profile.children.cycles-pp.security_file_permission 0.10 ± 5% -0.0 0.06 ± 6% -0.0 0.06 ± 7% -0.0 0.07 ± 10% perf-profile.children.cycles-pp.xas_find 0.15 ± 4% -0.0 0.11 ± 3% -0.0 0.11 ± 6% -0.0 0.11 ± 3% perf-profile.children.cycles-pp.__fget_light 0.12 ± 3% -0.0 0.09 ± 7% -0.0 0.09 ± 7% -0.0 0.09 ± 6% perf-profile.children.cycles-pp.__vm_enough_memory 0.12 ± 3% -0.0 0.09 ± 4% -0.0 0.09 ± 4% -0.0 0.09 ± 6% perf-profile.children.cycles-pp.apparmor_file_permission 0.12 ± 3% -0.0 0.08 ± 5% -0.0 0.08 ± 5% -0.0 0.09 ± 5% perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack 0.14 ± 5% -0.0 0.11 ± 3% -0.0 0.11 ± 4% -0.0 0.12 ± 3% perf-profile.children.cycles-pp.file_modified 0.12 ± 4% -0.0 0.08 ± 4% -0.0 0.08 ± 7% -0.0 0.09 ± 5% 
perf-profile.children.cycles-pp.xas_start 0.09 -0.0 0.06 ± 8% -0.0 0.04 ± 45% -0.0 0.06 ± 6% perf-profile.children.cycles-pp.__folio_throttle_swaprate 0.12 ± 4% -0.0 0.08 ± 4% -0.0 0.08 ± 4% -0.0 0.09 ± 5% perf-profile.children.cycles-pp.__percpu_counter_limited_add 0.12 ± 6% -0.0 0.08 ± 8% -0.0 0.08 ± 8% -0.0 0.08 ± 4% perf-profile.children.cycles-pp._raw_spin_trylock 0.12 ± 4% -0.0 0.09 ± 4% -0.0 0.09 ± 4% -0.0 0.09 perf-profile.children.cycles-pp.inode_add_bytes 0.20 ± 2% -0.0 0.17 ± 7% -0.0 0.17 ± 4% -0.0 0.18 ± 3% perf-profile.children.cycles-pp.try_charge_memcg 0.10 ± 5% -0.0 0.07 ± 7% -0.0 0.07 ± 7% -0.0 0.06 ± 7% perf-profile.children.cycles-pp.policy_nodemask 0.09 ± 6% -0.0 0.06 ± 6% -0.0 0.06 ± 7% -0.0 0.06 ± 6% perf-profile.children.cycles-pp.get_pfnblock_flags_mask 0.09 ± 6% -0.0 0.06 ± 7% -0.0 0.06 ± 7% -0.0 0.07 ± 5% perf-profile.children.cycles-pp.filemap_free_folio 0.07 ± 6% -0.0 0.05 ± 7% -0.0 0.06 ± 9% -0.0 0.06 ± 8% perf-profile.children.cycles-pp.down_write 0.08 ± 4% -0.0 0.06 ± 8% -0.0 0.06 ± 9% -0.0 0.06 ± 8% perf-profile.children.cycles-pp.get_task_policy 0.09 ± 7% -0.0 0.07 -0.0 0.07 ± 7% -0.0 0.07 perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack 0.09 ± 7% -0.0 0.07 -0.0 0.07 ± 5% -0.0 0.08 ± 6% perf-profile.children.cycles-pp.inode_needs_update_time 0.09 ± 5% -0.0 0.07 ± 5% -0.0 0.08 ± 4% -0.0 0.08 ± 4% perf-profile.children.cycles-pp.xas_create 0.16 ± 2% -0.0 0.14 ± 5% -0.0 0.14 ± 2% -0.0 0.15 ± 4% perf-profile.children.cycles-pp.cgroup_rstat_updated 0.08 ± 7% -0.0 0.06 ± 9% -0.0 0.06 ± 6% -0.0 0.06 perf-profile.children.cycles-pp.percpu_counter_add_batch 0.07 ± 5% -0.0 0.05 ± 7% -0.0 0.03 ± 70% -0.0 0.06 ± 14% perf-profile.children.cycles-pp.folio_mark_dirty 0.08 ± 10% -0.0 0.06 ± 6% -0.0 0.06 ± 13% -0.0 0.05 perf-profile.children.cycles-pp.shmem_is_huge 0.07 ± 6% +0.0 0.09 ± 10% +0.0 0.09 ± 5% +0.0 0.09 ± 6% perf-profile.children.cycles-pp.propagate_protected_usage 0.43 ± 3% +0.0 0.46 ± 5% +0.0 0.47 ± 3% +0.0 0.48 ± 2% 
perf-profile.children.cycles-pp.uncharge_batch 0.68 ± 3% +0.0 0.73 ± 4% +0.0 0.74 ± 3% +0.1 0.74 perf-profile.children.cycles-pp.__mem_cgroup_uncharge_list 1.11 +0.1 1.22 +0.1 1.19 +0.1 1.17 ± 2% perf-profile.children.cycles-pp.lru_add_fn 2.91 +0.3 3.18 ± 2% +0.3 3.23 +0.1 3.02 perf-profile.children.cycles-pp.truncate_inode_folio 2.56 +0.4 2.92 ± 2% +0.4 2.98 +0.2 2.75 perf-profile.children.cycles-pp.filemap_remove_folio 1.37 ± 3% +0.4 1.76 ± 9% +0.4 1.76 ± 5% +0.3 1.69 ± 2% perf-profile.children.cycles-pp.get_mem_cgroup_from_mm 2.24 +0.5 2.70 ± 2% +0.5 2.75 +0.3 2.51 perf-profile.children.cycles-pp.__filemap_remove_folio 2.38 +0.6 2.97 +0.6 2.99 ± 3% +0.2 2.63 perf-profile.children.cycles-pp.shmem_add_to_page_cache 0.18 ± 4% +0.7 0.91 ± 4% +0.8 0.94 ± 4% +0.7 0.87 ± 2% perf-profile.children.cycles-pp.__count_memcg_events 1.26 +0.8 2.04 ± 2% +0.8 2.08 ± 2% +0.6 1.82 perf-profile.children.cycles-pp.filemap_unaccount_folio 0.63 ± 2% +1.0 1.67 ± 5% +1.1 1.68 ± 5% +1.0 1.58 ± 2% perf-profile.children.cycles-pp.mem_cgroup_commit_charge 38.71 +1.2 39.91 +0.5 39.23 +0.7 39.46 perf-profile.children.cycles-pp.vfs_fallocate 38.37 +1.3 39.66 +0.6 38.99 +0.8 39.21 perf-profile.children.cycles-pp.shmem_fallocate 37.28 +1.6 38.89 +0.9 38.20 +1.1 38.39 perf-profile.children.cycles-pp.shmem_get_folio_gfp 36.71 +1.7 38.45 +1.1 37.77 +1.2 37.94 perf-profile.children.cycles-pp.shmem_alloc_and_add_folio 2.58 +1.8 4.36 ± 2% +1.8 4.40 ± 3% +1.2 3.74 perf-profile.children.cycles-pp.__mod_lruvec_page_state 2.48 ± 2% +2.1 4.60 ± 8% +2.1 4.62 ± 5% +1.9 4.38 ± 2% perf-profile.children.cycles-pp.__mem_cgroup_charge 1.93 ± 3% +2.4 4.36 ± 2% +2.5 4.38 ± 3% +2.2 4.09 perf-profile.children.cycles-pp.__mod_memcg_lruvec_state 1.30 -0.4 0.92 ± 2% -0.4 0.93 -0.3 0.95 perf-profile.self.cycles-pp.syscall_return_via_sysret 0.73 -0.2 0.52 ± 2% -0.2 0.53 -0.2 0.54 ± 2% perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe 0.54 ± 2% -0.2 0.36 ± 3% -0.2 0.36 ± 3% -0.2 0.37 ± 2% 
perf-profile.self.cycles-pp.release_pages 0.48 -0.2 0.30 ± 3% -0.2 0.32 ± 3% -0.2 0.33 ± 2% perf-profile.self.cycles-pp.xas_store 0.54 ± 2% -0.2 0.38 ± 3% -0.1 0.39 ± 2% -0.1 0.39 ± 3% perf-profile.self.cycles-pp.__entry_text_start 1.17 -0.1 1.03 ± 2% -0.1 1.03 -0.2 1.00 ± 2% perf-profile.self.cycles-pp.syscall_exit_to_user_mode 0.36 ± 2% -0.1 0.22 ± 3% -0.1 0.22 ± 3% -0.1 0.24 ± 2% perf-profile.self.cycles-pp.shmem_add_to_page_cache 0.43 ± 5% -0.1 0.30 ± 7% -0.2 0.27 ± 7% -0.1 0.29 ± 2% perf-profile.self.cycles-pp.lru_add_fn 0.24 ± 7% -0.1 0.12 ± 6% -0.1 0.13 ± 2% -0.1 0.12 ± 6% perf-profile.self.cycles-pp.__list_add_valid_or_report 0.38 ± 4% -0.1 0.27 ± 4% -0.1 0.28 ± 3% -0.1 0.28 ± 2% perf-profile.self.cycles-pp._raw_spin_lock 0.52 ± 3% -0.1 0.41 -0.1 0.41 -0.1 0.43 ± 3% perf-profile.self.cycles-pp.folio_batch_move_lru 0.32 ± 2% -0.1 0.22 ± 4% -0.1 0.22 ± 3% -0.1 0.22 ± 5% perf-profile.self.cycles-pp.__mod_node_page_state 0.36 ± 2% -0.1 0.26 ± 2% -0.1 0.26 ± 2% -0.1 0.27 perf-profile.self.cycles-pp.shmem_fallocate 0.36 ± 4% -0.1 0.26 ± 4% -0.1 0.26 ± 3% -0.1 0.27 ± 3% perf-profile.self.cycles-pp.find_lock_entries 0.28 ± 3% -0.1 0.20 ± 5% -0.1 0.20 ± 2% -0.1 0.21 ± 3% perf-profile.self.cycles-pp.__alloc_pages 0.24 ± 2% -0.1 0.16 ± 4% -0.1 0.16 ± 4% -0.1 0.16 ± 3% perf-profile.self.cycles-pp.xas_descend 0.09 ± 5% -0.1 0.01 ±223% -0.1 0.03 ± 70% -0.1 0.03 ± 70% perf-profile.self.cycles-pp.mem_cgroup_update_lru_size 0.23 ± 2% -0.1 0.16 ± 3% -0.1 0.16 ± 2% -0.1 0.16 ± 4% perf-profile.self.cycles-pp.xas_clear_mark 0.18 ± 3% -0.1 0.11 ± 6% -0.1 0.12 ± 4% -0.1 0.11 ± 4% perf-profile.self.cycles-pp.free_unref_page_commit 0.18 ± 3% -0.1 0.12 ± 4% -0.1 0.12 ± 3% -0.0 0.13 ± 5% perf-profile.self.cycles-pp.shmem_inode_acct_blocks 0.21 ± 3% -0.1 0.15 ± 2% -0.1 0.15 ± 2% -0.1 0.16 ± 3% perf-profile.self.cycles-pp.shmem_alloc_and_add_folio 0.18 ± 2% -0.1 0.12 ± 3% -0.1 0.12 ± 4% -0.1 0.12 ± 3% perf-profile.self.cycles-pp.__filemap_remove_folio 0.18 ± 7% -0.1 0.12 ± 7% -0.0 0.13 
± 5% -0.1 0.12 ± 3% perf-profile.self.cycles-pp.vfs_fallocate 0.18 ± 2% -0.1 0.13 ± 3% -0.1 0.13 -0.1 0.13 ± 5% perf-profile.self.cycles-pp.folio_unlock 0.20 ± 2% -0.1 0.14 ± 6% -0.1 0.15 ± 3% -0.1 0.15 ± 6% perf-profile.self.cycles-pp.__dquot_alloc_space 0.18 ± 2% -0.1 0.12 ± 3% -0.1 0.13 ± 3% -0.0 0.13 ± 4% perf-profile.self.cycles-pp.get_page_from_freelist 0.15 ± 3% -0.1 0.10 ± 7% -0.0 0.10 ± 3% -0.0 0.10 ± 3% perf-profile.self.cycles-pp.xas_load 0.17 ± 3% -0.1 0.12 ± 8% -0.1 0.12 ± 3% -0.0 0.12 ± 4% perf-profile.self.cycles-pp.__cond_resched 0.17 ± 2% -0.1 0.12 ± 3% -0.1 0.12 ± 7% -0.0 0.13 ± 2% perf-profile.self.cycles-pp._raw_spin_lock_irq 0.17 ± 5% -0.1 0.12 ± 3% -0.0 0.12 ± 4% -0.0 0.12 ± 6% perf-profile.self.cycles-pp.noop_dirty_folio 0.10 ± 7% -0.0 0.05 ± 45% -0.0 0.06 ± 13% -0.0 0.06 ± 7% perf-profile.self.cycles-pp.cap_vm_enough_memory 0.12 ± 3% -0.0 0.08 ± 4% -0.0 0.08 -0.0 0.08 ± 4% perf-profile.self.cycles-pp.rmqueue 0.06 -0.0 0.02 ±141% -0.0 0.03 ± 70% -0.0 0.04 ± 44% perf-profile.self.cycles-pp.inode_needs_update_time 0.07 ± 5% -0.0 0.02 ± 99% -0.0 0.05 -0.0 0.05 ± 7% perf-profile.self.cycles-pp.xas_find 0.13 ± 3% -0.0 0.09 ± 6% -0.0 0.10 ± 5% -0.0 0.09 ± 7% perf-profile.self.cycles-pp.alloc_pages_mpol 0.07 ± 6% -0.0 0.03 ± 70% -0.0 0.04 ± 44% -0.0 0.05 perf-profile.self.cycles-pp.xas_find_conflict 0.16 ± 2% -0.0 0.12 ± 6% -0.0 0.12 ± 3% -0.0 0.13 ± 5% perf-profile.self.cycles-pp.free_unref_page_list 0.12 ± 5% -0.0 0.08 ± 4% -0.0 0.08 ± 4% -0.0 0.09 ± 7% perf-profile.self.cycles-pp.fallocate64 0.20 ± 4% -0.0 0.16 ± 3% -0.0 0.16 ± 3% -0.0 0.18 ± 4% perf-profile.self.cycles-pp.shmem_get_folio_gfp 0.06 ± 7% -0.0 0.02 ± 99% -0.0 0.02 ± 99% -0.0 0.03 ± 70% perf-profile.self.cycles-pp.shmem_recalc_inode 0.13 ± 3% -0.0 0.09 -0.0 0.09 ± 6% -0.0 0.09 ± 6% perf-profile.self.cycles-pp._raw_spin_lock_irqsave 0.22 ± 3% -0.0 0.19 ± 6% -0.0 0.20 ± 3% -0.0 0.21 ± 4% perf-profile.self.cycles-pp.page_counter_uncharge 0.14 ± 3% -0.0 0.10 ± 6% -0.0 0.10 ± 8% -0.0 0.10 
± 4% perf-profile.self.cycles-pp.filemap_remove_folio 0.15 ± 5% -0.0 0.11 ± 3% -0.0 0.11 ± 6% -0.0 0.11 ± 3% perf-profile.self.cycles-pp.__fget_light 0.12 ± 4% -0.0 0.08 -0.0 0.08 ± 5% -0.0 0.08 ± 4% perf-profile.self.cycles-pp.__folio_cancel_dirty 0.11 ± 4% -0.0 0.08 ± 7% -0.0 0.08 ± 8% -0.0 0.08 ± 4% perf-profile.self.cycles-pp._raw_spin_trylock 0.11 ± 3% -0.0 0.08 ± 6% -0.0 0.07 ± 9% -0.0 0.08 ± 6% perf-profile.self.cycles-pp.xas_start 0.11 ± 3% -0.0 0.08 ± 6% -0.0 0.08 ± 6% -0.0 0.08 ± 6% perf-profile.self.cycles-pp.__percpu_counter_limited_add 0.12 ± 3% -0.0 0.09 ± 5% -0.0 0.08 ± 5% -0.0 0.08 ± 4% perf-profile.self.cycles-pp.__mod_lruvec_state 0.11 ± 5% -0.0 0.08 ± 4% -0.0 0.08 ± 6% -0.0 0.08 ± 4% perf-profile.self.cycles-pp.truncate_cleanup_folio 0.10 ± 6% -0.0 0.07 ± 5% -0.0 0.07 ± 7% -0.0 0.07 ± 11% perf-profile.self.cycles-pp.xas_init_marks 0.09 ± 6% -0.0 0.06 ± 6% -0.0 0.06 ± 7% -0.0 0.06 ± 6% perf-profile.self.cycles-pp.get_pfnblock_flags_mask 0.11 -0.0 0.08 ± 5% -0.0 0.08 -0.0 0.09 ± 5% perf-profile.self.cycles-pp.folio_add_lru 0.09 ± 6% -0.0 0.06 ± 7% -0.0 0.06 ± 7% -0.0 0.07 ± 5% perf-profile.self.cycles-pp.filemap_free_folio 0.09 ± 4% -0.0 0.06 ± 6% -0.0 0.06 ± 6% -0.0 0.06 ± 6% perf-profile.self.cycles-pp.shmem_alloc_folio 0.10 ± 4% -0.0 0.08 ± 4% -0.0 0.08 ± 6% -0.0 0.08 ± 7% perf-profile.self.cycles-pp.apparmor_file_permission 0.14 ± 5% -0.0 0.12 ± 5% -0.0 0.12 ± 3% -0.0 0.13 ± 4% perf-profile.self.cycles-pp.cgroup_rstat_updated 0.07 ± 7% -0.0 0.04 ± 44% -0.0 0.04 ± 44% -0.0 0.04 ± 71% perf-profile.self.cycles-pp.policy_nodemask 0.07 ± 11% -0.0 0.04 ± 45% -0.0 0.05 ± 7% -0.0 0.03 ± 70% perf-profile.self.cycles-pp.shmem_is_huge 0.08 ± 4% -0.0 0.06 ± 8% -0.0 0.06 ± 9% -0.0 0.06 ± 9% perf-profile.self.cycles-pp.get_task_policy 0.08 ± 6% -0.0 0.05 ± 8% -0.0 0.06 ± 8% -0.0 0.05 ± 8% perf-profile.self.cycles-pp.__x64_sys_fallocate 0.12 ± 3% -0.0 0.10 ± 6% -0.0 0.10 ± 6% -0.0 0.10 ± 3% perf-profile.self.cycles-pp.try_charge_memcg 0.07 -0.0 0.05 -0.0 0.05 
-0.0 0.04 ± 45% perf-profile.self.cycles-pp.free_unref_page_prepare 0.07 ± 6% -0.0 0.06 ± 9% -0.0 0.06 ± 8% -0.0 0.06 ± 9% perf-profile.self.cycles-pp.percpu_counter_add_batch 0.08 ± 4% -0.0 0.06 -0.0 0.06 ± 6% -0.0 0.06 perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack 0.09 ± 7% -0.0 0.07 ± 5% -0.0 0.07 ± 5% -0.0 0.07 ± 7% perf-profile.self.cycles-pp.filemap_get_entry 0.07 ± 9% +0.0 0.09 ± 10% +0.0 0.09 ± 5% +0.0 0.09 ± 6% perf-profile.self.cycles-pp.propagate_protected_usage 0.96 ± 2% +0.2 1.12 ± 7% +0.2 1.16 ± 4% -0.2 0.72 ± 2% perf-profile.self.cycles-pp.__mod_lruvec_page_state 0.45 ± 4% +0.4 0.82 ± 8% +0.4 0.81 ± 6% +0.3 0.77 ± 3% perf-profile.self.cycles-pp.mem_cgroup_commit_charge 1.36 ± 3% +0.4 1.75 ± 9% +0.4 1.75 ± 5% +0.3 1.68 ± 2% perf-profile.self.cycles-pp.get_mem_cgroup_from_mm 0.29 +0.7 1.00 ± 10% +0.7 1.01 ± 7% +0.6 0.93 ± 2% perf-profile.self.cycles-pp.__mem_cgroup_charge 0.16 ± 4% +0.7 0.90 ± 4% +0.8 0.92 ± 4% +0.7 0.85 ± 2% perf-profile.self.cycles-pp.__count_memcg_events 1.80 ± 2% +2.5 4.26 ± 2% +2.5 4.28 ± 3% +2.2 3.98 perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
On Tue, Oct 24, 2023 at 11:09 PM Oliver Sang <oliver.sang@intel.com> wrote:
>
> hi, Yosry Ahmed,
>
> On Tue, Oct 24, 2023 at 12:14:42AM -0700, Yosry Ahmed wrote:
> > On Mon, Oct 23, 2023 at 11:56 PM Oliver Sang <oliver.sang@intel.com> wrote:
> > >
> > > hi, Yosry Ahmed,
> > >
> > > On Mon, Oct 23, 2023 at 07:13:50PM -0700, Yosry Ahmed wrote:
> > >
> > > ...
> > >
> > > >
> > > > I still could not run the benchmark, but I used a version of
> > > > fallocate1.c that does 1 million iterations. I ran 100 in parallel.
> > > > This showed ~13% regression with the patch, so not the same as the
> > > > will-it-scale version, but it could be an indicator.
> > > >
> > > > With that, I did not see any improvement with the fixlet above or
> > > > ___cacheline_aligned_in_smp. So you can scratch that.
> > > >
> > > > I did, however, see some improvement with reducing the indirection
> > > > layers by moving stats_updates directly into struct mem_cgroup. The
> > > > regression in my manual testing went down to 9%. Still not great, but
> > > > I am wondering how this reflects on the benchmark. If you're able to
> > > > test it that would be great, the diff is below. Meanwhile I am still
> > > > looking for other improvements that can be made.
> > >
> > > we applied previous patch-set as below:
> > >
> > > c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
> > > ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
> > > 51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
> > > 130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
> > > 26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
> > > 25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything    <---- the base our tool picked for the patch set
> > >
> > > I tried to apply below patch to either 51d74c18a9c61 or c5f50d8b23c79,
> > > but failed.
> > > could you guide how to apply this patch?
> > > Thanks
> >
> > Thanks for looking into this. I rebased the diff on top of
> > c5f50d8b23c79. Please find it attached.
>
> from our tests, this patch has little impact.
>
> it was applied as below ac6a9444dec85:
>
> ac6a9444dec85 (linux-devel/fixup-c5f50d8b23c79) memcg: move stats_updates to struct mem_cgroup
> c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
> ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
> 51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
> 130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
> 26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
> 25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything
>
> for the first regression reported in the original report, data are very close
> for 51d74c18a9c61, c5f50d8b23c79 (patch-set tip, parent of ac6a9444dec85),
> and ac6a9444dec85.
> full comparison is as [1]
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
>   gcc-12/performance/x86_64-rhel-8.3/thread/100%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale
>
> 130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522 c5f50d8b23c7982ac875791755b ac6a9444dec85dc50c6bfbc4ee7
> ---------------- --------------------------- --------------------------- ---------------------------
>          %stddev     %change %stddev     %change %stddev     %change %stddev
>              \          |                \          |                \          |                \
>      36509           -25.8%      27079           -25.2%      27305           -25.0%      27383        will-it-scale.per_thread_ops
>
> for the second regression reported in the original report, seems a small impact
> from ac6a9444dec85.
> full comparison is as [2]
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
>   gcc-12/performance/x86_64-rhel-8.3/thread/50%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale
>
> 130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522 c5f50d8b23c7982ac875791755b ac6a9444dec85dc50c6bfbc4ee7
> ---------------- --------------------------- --------------------------- ---------------------------
>          %stddev     %change %stddev     %change %stddev     %change %stddev
>              \          |                \          |                \          |                \
>      76580           -30.0%      53575           -28.9%      54415           -26.7%      56152        will-it-scale.per_thread_ops
>
> [1]

Thanks Oliver for running the numbers. If I understand correctly the
will-it-scale.fallocate1 microbenchmark is the only one showing
significant regression here, is this correct?

In my runs, other more representative microbenchmarks like netperf
and will-it-scale.page_fault* show minimal regression. I would expect
practical workloads to have high concurrency of page faults or
networking, but maybe not fallocate/ftruncate.

Oliver, in your experience, how often does such a regression in such a
microbenchmark translate to a real regression that people care about?
(or how often do people dismiss it?)

I tried optimizing this further for the fallocate/ftruncate case but
without luck. I even tried moving stats_updates into cgroup core
(struct cgroup_rstat_cpu) to reuse the existing loop in
cgroup_rstat_updated() -- but it somehow made it worse.

On the other hand, we do have some machines in production running this
series together with a previous optimization for non-hierarchical
stats [1] on an older kernel, and we do see significant reduction in
cpu time spent on reading the stats. Domenico did a similar experiment
with only this series and reported similar results [2].
Shakeel, Johannes (and other memcg folks), I personally think the
benefits here outweigh a regression in this particular benchmark, but
I am obviously biased. What do you think?

[1] https://lore.kernel.org/lkml/20230726153223.821757-2-yosryahmed@google.com/
[2] https://lore.kernel.org/lkml/CAFYChMv_kv_KXOMRkrmTN-7MrfgBHMcK3YXv0dPYEL7nK77e2A@mail.gmail.com/
On Tue, Oct 24, 2023 at 11:23 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
[...]
>
> Shakeel, Johannes (and other memcg folks), I personally think the
> benefits here outweigh a regression in this particular benchmark, but
> I am obviously biased. What do you think?
>
> [1] https://lore.kernel.org/lkml/20230726153223.821757-2-yosryahmed@google.com/
> [2] https://lore.kernel.org/lkml/CAFYChMv_kv_KXOMRkrmTN-7MrfgBHMcK3YXv0dPYEL7nK77e2A@mail.gmail.com/

I still am not convinced of the benefits outweighing the regression,
but I would not block this. So let's do this: skip this open window,
get the patch series reviewed, and hopefully we can work together on
fixing that regression so that we can make an informed decision about
accepting it for this series in the next cycle.
On Wed, Oct 25, 2023 at 10:06 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Tue, Oct 24, 2023 at 11:23 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> [...]
>
> I still am not convinced of the benefits outweighing the regression
> but I would not block this.
> So let's do this: skip this open window, get the patch series
> reviewed, and hopefully we can work together on fixing that
> regression so that we can make an informed decision about accepting
> it for this series in the next cycle.

Skipping this open window sounds okay to me.

FWIW, I think with this patch series we can keep the old behavior
(roughly) and hide the changes behind a tunable (config option or
sysfs file). I think the only changes that need to be made to the code
to approximate the previous behavior are:

- Use root when updating the pending stats in memcg_rstat_updated()
  instead of the passed memcg.
- Use root in mem_cgroup_flush_stats() instead of the passed memcg.
- Use mutex_trylock() instead of mutex_lock() in
  mem_cgroup_flush_stats().

So I think it should be doable to hide most changes behind a tunable,
but let's not do this unless necessary.
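For the mutex_trylock() point above, here is a hypothetical userspace C sketch of the idea: a contending flusher skips the work instead of sleeping, relying on the flush already in progress. The flag is loosely modeled on the patch's stats_flush_ongoing; none of this is actual kernel code.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int stats_flush_ongoing;	/* models the kernel-side flag */
static int flush_count;			/* counts flushes actually performed */

/*
 * Trylock-style flushing: only one caller at a time does the work.
 * A caller that loses the race returns immediately instead of
 * blocking, which is what mutex_trylock() buys over mutex_lock().
 */
static int do_flush_stats(void)
{
	int expected = 0;

	if (!atomic_compare_exchange_strong(&stats_flush_ongoing, &expected, 1))
		return 0;	/* another flusher is running; skip */

	flush_count++;		/* stand-in for cgroup_rstat_flush() */
	atomic_store(&stats_flush_ongoing, 0);
	return 1;
}
```

The trade-off is that a skipping reader may see stats that are slightly stale, which is the regression-vs-accuracy balance the thread is debating.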
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a393f1399a2b..9a586893bd3e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -627,6 +627,9 @@ struct memcg_vmstats_percpu {
 	/* Cgroup1: threshold notifications & softlimit tree updates */
 	unsigned long		nr_page_events;
 	unsigned long		targets[MEM_CGROUP_NTARGETS];
+
+	/* Stats updates since the last flush */
+	unsigned int		stats_updates;
 };
 
 struct memcg_vmstats {
@@ -641,6 +644,9 @@ struct memcg_vmstats {
 	/* Pending child counts during tree propagation */
 	long			state_pending[MEMCG_NR_STAT];
 	unsigned long		events_pending[NR_MEMCG_EVENTS];
+
+	/* Stats updates since the last flush */
+	atomic64_t		stats_updates;
 };
 
 /*
@@ -660,9 +666,7 @@ struct memcg_vmstats {
  */
 static void flush_memcg_stats_dwork(struct work_struct *w);
 static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
-static DEFINE_PER_CPU(unsigned int, stats_updates);
 static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
-static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
 static u64 flush_last_time;
 
 #define FLUSH_TIME (2UL*HZ)
@@ -689,26 +693,37 @@ static void memcg_stats_unlock(void)
 	preempt_enable_nested();
 }
 
+static bool memcg_should_flush_stats(struct mem_cgroup *memcg)
+{
+	return atomic64_read(&memcg->vmstats->stats_updates) >
+		MEMCG_CHARGE_BATCH * num_online_cpus();
+}
+
 static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 {
+	int cpu = smp_processor_id();
 	unsigned int x;
 
 	if (!val)
 		return;
 
-	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
+	cgroup_rstat_updated(memcg->css.cgroup, cpu);
+
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+		x = __this_cpu_add_return(memcg->vmstats_percpu->stats_updates,
+					  abs(val));
+
+		if (x < MEMCG_CHARGE_BATCH)
+			continue;
 
-	x = __this_cpu_add_return(stats_updates, abs(val));
-	if (x > MEMCG_CHARGE_BATCH) {
 		/*
-		 * If stats_flush_threshold exceeds the threshold
-		 * (>num_online_cpus()), cgroup stats update will be triggered
-		 * in __mem_cgroup_flush_stats(). Increasing this var further
-		 * is redundant and simply adds overhead in atomic update.
+		 * If @memcg is already flush-able, increasing stats_updates is
+		 * redundant. Avoid the overhead of the atomic update.
 		 */
-		if (atomic_read(&stats_flush_threshold) <= num_online_cpus())
-			atomic_add(x / MEMCG_CHARGE_BATCH, &stats_flush_threshold);
-		__this_cpu_write(stats_updates, 0);
+		if (!memcg_should_flush_stats(memcg))
+			atomic64_add(x, &memcg->vmstats->stats_updates);
+		__this_cpu_write(memcg->vmstats_percpu->stats_updates, 0);
 	}
 }
 
@@ -727,13 +742,12 @@ static void do_flush_stats(void)
 
 	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
 
-	atomic_set(&stats_flush_threshold, 0);
 	atomic_set(&stats_flush_ongoing, 0);
 }
 
 void mem_cgroup_flush_stats(void)
 {
-	if (atomic_read(&stats_flush_threshold) > num_online_cpus())
+	if (memcg_should_flush_stats(root_mem_cgroup))
 		do_flush_stats();
 }
 
@@ -747,8 +761,8 @@ void mem_cgroup_flush_stats_ratelimited(void)
 static void flush_memcg_stats_dwork(struct work_struct *w)
 {
 	/*
-	 * Always flush here so that flushing in latency-sensitive paths is
-	 * as cheap as possible.
+	 * Deliberately ignore memcg_should_flush_stats() here so that flushing
+	 * in latency-sensitive paths is as cheap as possible.
 	 */
 	do_flush_stats();
 	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
@@ -5803,6 +5817,9 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 			}
 		}
 	}
+	/* We are in a per-cpu loop here, only do the atomic write once */
+	if (atomic64_read(&memcg->vmstats->stats_updates))
+		atomic64_set(&memcg->vmstats->stats_updates, 0);
 }
 
 #ifdef CONFIG_MMU
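To make the control flow of the memcg_rstat_updated() hunk above easier to follow, here is a hypothetical single-threaded userspace model: plain arrays stand in for per-cpu variables and atomics, and the MEMCG_CHARGE_BATCH and NR_CPUS values are illustrative, not the kernel's.

```c
#include <assert.h>
#include <stdlib.h>

#define MEMCG_CHARGE_BATCH 64	/* illustrative batch size */
#define NR_CPUS 4		/* illustrative cpu count */

struct memcg {
	struct memcg *parent;
	unsigned int pcpu_updates[NR_CPUS];	/* models memcg_vmstats_percpu->stats_updates */
	long stats_updates;			/* models memcg_vmstats->stats_updates */
};

/* A memcg is worth flushing once its whole subtree has accumulated
 * more than a full batch per online cpu. */
static int memcg_should_flush_stats(struct memcg *memcg)
{
	return memcg->stats_updates > MEMCG_CHARGE_BATCH * NR_CPUS;
}

/* Models memcg_rstat_updated(): accumulate per cpu, and fold a full
 * batch into the hierarchical counter of this memcg and each ancestor. */
static void memcg_rstat_updated(struct memcg *memcg, int cpu, int val)
{
	unsigned int x;

	if (!val)
		return;

	for (; memcg; memcg = memcg->parent) {
		x = memcg->pcpu_updates[cpu] + abs(val);
		memcg->pcpu_updates[cpu] = x;
		if (x < MEMCG_CHARGE_BATCH)
			continue;
		/* already flush-able: skip the (modeled) atomic add */
		if (!memcg_should_flush_stats(memcg))
			memcg->stats_updates += x;
		memcg->pcpu_updates[cpu] = 0;
	}
}
```

The hierarchy walk in the loop is what replaces the old single global stats_flush_threshold, and it is also the extra per-update work that the will-it-scale.fallocate1 numbers above are measuring.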