Message ID | 20230719174613.3062124-1-yosryahmed@google.com |
---|---|
State | New |
Series | mm: memcg: use rstat for non-hierarchical stats |
Commit Message
Yosry Ahmed
July 19, 2023, 5:46 p.m. UTC
Currently, memcg uses rstat to maintain hierarchical stats. The rstat
framework keeps track of which cgroups have updates on which cpus.
For non-hierarchical stats, as memcg moved to rstat, they are no longer
readily available as counters. Instead, the percpu counters for a given
stat need to be summed to get the non-hierarchical stat value. This
causes a performance regression when reading non-hierarchical stats on
kernels where memcg moved to using rstat. This is especially visible
when reading memory.stat on cgroup v1. There are also some code paths
internal to the kernel that read such non-hierarchical stats.
It is inefficient to iterate and sum counters across all cpus when the
rstat framework knows exactly when a percpu counter has an update.
Instead, maintain cpu-aggregated non-hierarchical counters for each
stat. During an rstat flush, keep those updated as well. When reading
non-hierarchical stats, we no longer need to iterate cpus; we just need
to read the maintained counters, similar to hierarchical stats.
A caveat is that we now need a stats flush before reading
local/non-hierarchical stats through {memcg/lruvec}_page_state_local()
or memcg_events_local(), where we previously only needed a flush to
read hierarchical stats. Most contexts reading non-hierarchical stats
are already doing a flush; add a flush to the only missing context,
count_shadow_nodes().
With this patch, reading memory.stat from 1000 memcgs is 3x faster on a
machine with 256 cpus on cgroup v1:
# for i in $(seq 1000); do mkdir /sys/fs/cgroup/memory/cg$i; done
# time cat /dev/cgroup/memory/cg*/memory.stat > /dev/null
real 0m0.125s
user 0m0.005s
sys 0m0.120s
After:
real 0m0.032s
user 0m0.005s
sys 0m0.027s
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
include/linux/memcontrol.h | 7 ++++---
mm/memcontrol.c | 32 +++++++++++++++++++-------------
mm/workingset.c | 1 +
3 files changed, 24 insertions(+), 16 deletions(-)
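For readers skimming the thread, the core of the change is the bookkeeping added to the rstat flush path. The snippet below is a condensed, illustrative sketch only, not the actual kernel code: the helper name flush_one_counter() is invented, and the lruvec/NUMA loop, the events arrays, and the existing zero-delta short-circuit are all omitted. The field names mirror the diff at the bottom of the page.

/*
 * Illustrative sketch (assumed helper, simplified): fold each per-cpu
 * delta into two places during an rstat flush -- the hierarchical
 * total that is propagated to the parent, and the new CPU-aggregated
 * "local" counter that non-hierarchical readers can use directly.
 */
static void flush_one_counter(struct mem_cgroup *memcg,
			      struct mem_cgroup *parent,
			      struct memcg_vmstats_percpu *statc, int i)
{
	/* deltas previously propagated upward by our children */
	long delta = memcg->vmstats->state_pending[i];
	long delta_cpu = 0;
	long v = READ_ONCE(statc->state[i]);

	if (delta)
		memcg->vmstats->state_pending[i] = 0;

	/* changes made on this cpu since the last flush */
	if (v != statc->state_prev[i]) {
		delta_cpu = v - statc->state_prev[i];
		delta += delta_cpu;
		statc->state_prev[i] = v;
	}

	/* non-hierarchical counter: only this memcg's own per-cpu updates */
	memcg->vmstats->state_local[i] += delta_cpu;

	/* hierarchical counter: own updates plus the subtree's deltas */
	memcg->vmstats->state[i] += delta;
	if (parent)
		parent->vmstats->state_pending[i] += delta;
}

On the read side, memcg_page_state_local() and friends then become a single READ_ONCE() of the *_local array instead of a for_each_possible_cpu() sum, which is where the memory.stat speedup quoted above comes from.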
Comments
On Wed, 19 Jul 2023 17:46:13 +0000 Yosry Ahmed <yosryahmed@google.com> wrote: > Currently, memcg uses rstat to maintain hierarchical stats. The rstat > framework keeps track of which cgroups have updates on which cpus. > > For non-hierarchical stats, as memcg moved to rstat, they are no longer > readily available as counters. Instead, the percpu counters for a given > stat need to be summed to get the non-hierarchical stat value. This > causes a performance regression when reading non-hierarchical stats on > kernels where memcg moved to using rstat. This is especially visible > when reading memory.stat on cgroup v1. There are also some code paths > internal to the kernel that read such non-hierarchical stats. > > It is inefficient to iterate and sum counters in all cpus when the rstat > framework knows exactly when a percpu counter has an update. Instead, > maintain cpu-aggregated non-hierarchical counters for each stat. During > an rstat flush, keep those updated as well. When reading > non-hierarchical stats, we no longer need to iterate cpus, we just need > to read the maintainer counters, similar to hierarchical stats. > > A caveat is that we now a stats flush before reading > local/non-hierarchical stats through {memcg/lruvec}_page_state_local() > or memcg_events_local(), where we previously only needed a flush to > read hierarchical stats. Most contexts reading non-hierarchical stats > are already doing a flush, add a flush to the only missing context in > count_shadow_nodes(). > > With this patch, reading memory.stat from 1000 memcgs is 3x faster on a > machine with 256 cpus on cgroup v1: > # for i in $(seq 1000); do mkdir /sys/fs/cgroup/memory/cg$i; done > # time cat /dev/cgroup/memory/cg*/memory.stat > /dev/null > real 0m0.125s > user 0m0.005s > sys 0m0.120s > > After: > real 0m0.032s > user 0m0.005s > sys 0m0.027s > I'll queue this for some testing, pending reviewer input, please.
On Mon, Jul 24, 2023 at 11:31 AM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Wed, 19 Jul 2023 17:46:13 +0000 Yosry Ahmed <yosryahmed@google.com> wrote: > > > Currently, memcg uses rstat to maintain hierarchical stats. The rstat > > framework keeps track of which cgroups have updates on which cpus. > > > > For non-hierarchical stats, as memcg moved to rstat, they are no longer > > readily available as counters. Instead, the percpu counters for a given > > stat need to be summed to get the non-hierarchical stat value. This > > causes a performance regression when reading non-hierarchical stats on > > kernels where memcg moved to using rstat. This is especially visible > > when reading memory.stat on cgroup v1. There are also some code paths > > internal to the kernel that read such non-hierarchical stats. > > > > It is inefficient to iterate and sum counters in all cpus when the rstat > > framework knows exactly when a percpu counter has an update. Instead, > > maintain cpu-aggregated non-hierarchical counters for each stat. During > > an rstat flush, keep those updated as well. When reading > > non-hierarchical stats, we no longer need to iterate cpus, we just need > > to read the maintainer counters, similar to hierarchical stats. > > > > A caveat is that we now a stats flush before reading > > local/non-hierarchical stats through {memcg/lruvec}_page_state_local() > > or memcg_events_local(), where we previously only needed a flush to > > read hierarchical stats. Most contexts reading non-hierarchical stats > > are already doing a flush, add a flush to the only missing context in > > count_shadow_nodes(). > > > > With this patch, reading memory.stat from 1000 memcgs is 3x faster on a > > machine with 256 cpus on cgroup v1: > > # for i in $(seq 1000); do mkdir /sys/fs/cgroup/memory/cg$i; done > > # time cat /dev/cgroup/memory/cg*/memory.stat > /dev/null > > real 0m0.125s > > user 0m0.005s > > sys 0m0.120s > > > > After: > > real 0m0.032s > > user 0m0.005s > > sys 0m0.027s > > > > I'll queue this for some testing, pending reviewer input, please. Thanks Andrew! I am doing extra testing of my own as well as of now. Will report back if anything interesting pops up.
On Wed, Jul 19, 2023 at 05:46:13PM +0000, Yosry Ahmed wrote: > Currently, memcg uses rstat to maintain hierarchical stats. The rstat > framework keeps track of which cgroups have updates on which cpus. > > For non-hierarchical stats, as memcg moved to rstat, they are no longer > readily available as counters. Instead, the percpu counters for a given > stat need to be summed to get the non-hierarchical stat value. This > causes a performance regression when reading non-hierarchical stats on > kernels where memcg moved to using rstat. This is especially visible > when reading memory.stat on cgroup v1. There are also some code paths > internal to the kernel that read such non-hierarchical stats. It's actually not an rstat regression. It's always been this costly. Quick history: We used to maintain *all* stats in per-cpu counters at the local level. memory.stat reads would have to iterate and aggregate the entire subtree every time. This was obviously very costly, so we added batched upward propagation during stat updates to simplify reads: commit 42a300353577ccc17ecc627b8570a89fa1678bec Author: Johannes Weiner <hannes@cmpxchg.org> Date: Tue May 14 15:47:12 2019 -0700 mm: memcontrol: fix recursive statistics correctness & scalabilty However, that caused a regression in the stat write path, as the upward propagation would bottleneck on the cachelines in the shared parents. The fix for *that* re-introduced the per-cpu loops in the local stat reads: commit 815744d75152078cde5391fc1e3c2d4424323fb6 Author: Johannes Weiner <hannes@cmpxchg.org> Date: Thu Jun 13 15:55:46 2019 -0700 mm: memcontrol: don't batch updates of local VM stats and events So I wouldn't say it's a regression from rstat. Except for that short period between the two commits above, the read side for local stats was always expensive. rstat promises a shot at finally fixing it, with less risk to the write path. > It is inefficient to iterate and sum counters in all cpus when the rstat > framework knows exactly when a percpu counter has an update. Instead, > maintain cpu-aggregated non-hierarchical counters for each stat. During > an rstat flush, keep those updated as well. When reading > non-hierarchical stats, we no longer need to iterate cpus, we just need > to read the maintainer counters, similar to hierarchical stats. > > A caveat is that we now a stats flush before reading > local/non-hierarchical stats through {memcg/lruvec}_page_state_local() > or memcg_events_local(), where we previously only needed a flush to > read hierarchical stats. Most contexts reading non-hierarchical stats > are already doing a flush, add a flush to the only missing context in > count_shadow_nodes(). > > With this patch, reading memory.stat from 1000 memcgs is 3x faster on a > machine with 256 cpus on cgroup v1: > # for i in $(seq 1000); do mkdir /sys/fs/cgroup/memory/cg$i; done > # time cat /dev/cgroup/memory/cg*/memory.stat > /dev/null > real 0m0.125s > user 0m0.005s > sys 0m0.120s > > After: > real 0m0.032s > user 0m0.005s > sys 0m0.027s > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> But I want to be clear: this isn't a regression fix. It's a new performance optimization for the deprecated cgroup1 code. And it comes at the cost of higher memory footprint for both cgroup1 AND cgroup2. If this causes a regression, we should revert it again. But let's try.
On Tue, 25 Jul 2023 10:04:35 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote: > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > Acked-by: Johannes Weiner <hannes@cmpxchg.org> > > But I want to be clear: this isn't a regression fix. It's a new > performance optimization for the deprecated cgroup1 code. And it comes > at the cost of higher memory footprint for both cgroup1 AND cgroup2. Thanks. Yosry, could you please send along a suitably modified changelog?
Hey Johannes, On Tue, Jul 25, 2023 at 7:04 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Wed, Jul 19, 2023 at 05:46:13PM +0000, Yosry Ahmed wrote: > > Currently, memcg uses rstat to maintain hierarchical stats. The rstat > > framework keeps track of which cgroups have updates on which cpus. > > > > For non-hierarchical stats, as memcg moved to rstat, they are no longer > > readily available as counters. Instead, the percpu counters for a given > > stat need to be summed to get the non-hierarchical stat value. This > > causes a performance regression when reading non-hierarchical stats on > > kernels where memcg moved to using rstat. This is especially visible > > when reading memory.stat on cgroup v1. There are also some code paths > > internal to the kernel that read such non-hierarchical stats. > > It's actually not an rstat regression. It's always been this costly. > > Quick history: Thanks for the context. > > We used to maintain *all* stats in per-cpu counters at the local > level. memory.stat reads would have to iterate and aggregate the > entire subtree every time. This was obviously very costly, so we added > batched upward propagation during stat updates to simplify reads: > > commit 42a300353577ccc17ecc627b8570a89fa1678bec > Author: Johannes Weiner <hannes@cmpxchg.org> > Date: Tue May 14 15:47:12 2019 -0700 > > mm: memcontrol: fix recursive statistics correctness & scalabilty > > However, that caused a regression in the stat write path, as the > upward propagation would bottleneck on the cachelines in the shared > parents. The fix for *that* re-introduced the per-cpu loops in the > local stat reads: > > commit 815744d75152078cde5391fc1e3c2d4424323fb6 > Author: Johannes Weiner <hannes@cmpxchg.org> > Date: Thu Jun 13 15:55:46 2019 -0700 > > mm: memcontrol: don't batch updates of local VM stats and events > > So I wouldn't say it's a regression from rstat. Except for that short > period between the two commits above, the read side for local stats > was always expensive. I was comparing from an 4.15 kernel, so I assumed the major change was from rstat, but that was not accurate. Thanks for the history. However, in that 4.15 kernel the local (non-hierarchical) stats were readily available without iterating percpu counters. There is a regression that was introduced somewhere. Looking at the history you described, it seems like up until 815744d75152 we used to maintain "local" (aka non-hierarchical) counters, so reading local stats was reading one counter, and starting 815744d75152 we started having to loop percpu counters for that. So it is not a regression of rstat, but seemingly it is a regression of 815744d75152. Is my understanding incorrect? > > rstat promises a shot at finally fixing it, with less risk to the > write path. > > > It is inefficient to iterate and sum counters in all cpus when the rstat > > framework knows exactly when a percpu counter has an update. Instead, > > maintain cpu-aggregated non-hierarchical counters for each stat. During > > an rstat flush, keep those updated as well. When reading > > non-hierarchical stats, we no longer need to iterate cpus, we just need > > to read the maintainer counters, similar to hierarchical stats. > > > > A caveat is that we now a stats flush before reading > > local/non-hierarchical stats through {memcg/lruvec}_page_state_local() > > or memcg_events_local(), where we previously only needed a flush to > > read hierarchical stats. 
Most contexts reading non-hierarchical stats > > are already doing a flush, add a flush to the only missing context in > > count_shadow_nodes(). > > > > With this patch, reading memory.stat from 1000 memcgs is 3x faster on a > > machine with 256 cpus on cgroup v1: > > # for i in $(seq 1000); do mkdir /sys/fs/cgroup/memory/cg$i; done > > # time cat /dev/cgroup/memory/cg*/memory.stat > /dev/null > > real 0m0.125s > > user 0m0.005s > > sys 0m0.120s > > > > After: > > real 0m0.032s > > user 0m0.005s > > sys 0m0.027s > > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > Acked-by: Johannes Weiner <hannes@cmpxchg.org> Thanks! I will reformulate the commit log after we agree on the history. > > But I want to be clear: this isn't a regression fix. It's a new > performance optimization for the deprecated cgroup1 code. And it comes > at the cost of higher memory footprint for both cgroup1 AND cgroup2. I still think it is, but I can easily be wrong. I am hoping that the memory footprint is not a problem. There are *roughly* 80 per-memcg stats/events (MEMCG_NR_STAT + NR_MEMCG_EVENTS) and 55 per-lruvec stats (NR_VM_NODE_STAT_ITEMS). For each stat there is an extra 8 bytes, so on a two-node machine that's 8 * (80 + 55 * 2) ~= 1.5 KiB per memcg. Given that struct mem_cgroup is already large, and can easily be 100s of KiBs on a large machine with many cpus, I hope there won't be a noticeable regression. > > If this causes a regression, we should revert it again. But let's try. Of course. Fingers crossed.
On Tue, Jul 25, 2023 at 10:43 AM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Tue, 25 Jul 2023 10:04:35 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > > > Acked-by: Johannes Weiner <hannes@cmpxchg.org> > > > > But I want to be clear: this isn't a regression fix. It's a new > > performance optimization for the deprecated cgroup1 code. And it comes > > at the cost of higher memory footprint for both cgroup1 AND cgroup2. > > Thanks. > > Yosry, could you please send along a suitably modified changelog? I will send a v2 once we agree on the history story and whether or not this fixes a regression. Thanks!
On Tue, Jul 25, 2023 at 12:24:19PM -0700, Yosry Ahmed wrote: > On Tue, Jul 25, 2023 at 7:04 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > We used to maintain *all* stats in per-cpu counters at the local > > level. memory.stat reads would have to iterate and aggregate the > > entire subtree every time. This was obviously very costly, so we added > > batched upward propagation during stat updates to simplify reads: > > > > commit 42a300353577ccc17ecc627b8570a89fa1678bec > > Author: Johannes Weiner <hannes@cmpxchg.org> > > Date: Tue May 14 15:47:12 2019 -0700 > > > > mm: memcontrol: fix recursive statistics correctness & scalabilty > > > > However, that caused a regression in the stat write path, as the > > upward propagation would bottleneck on the cachelines in the shared > > parents. The fix for *that* re-introduced the per-cpu loops in the > > local stat reads: > > > > commit 815744d75152078cde5391fc1e3c2d4424323fb6 > > Author: Johannes Weiner <hannes@cmpxchg.org> > > Date: Thu Jun 13 15:55:46 2019 -0700 > > > > mm: memcontrol: don't batch updates of local VM stats and events > > > > So I wouldn't say it's a regression from rstat. Except for that short > > period between the two commits above, the read side for local stats > > was always expensive. > > I was comparing from an 4.15 kernel, so I assumed the major change was > from rstat, but that was not accurate. Thanks for the history. > > However, in that 4.15 kernel the local (non-hierarchical) stats were > readily available without iterating percpu counters. There is a > regression that was introduced somewhere. > > Looking at the history you described, it seems like up until > 815744d75152 we used to maintain "local" (aka non-hierarchical) > counters, so reading local stats was reading one counter, and starting > 815744d75152 we started having to loop percpu counters for that. > > So it is not a regression of rstat, but seemingly it is a regression > of 815744d75152. Is my understanding incorrect? Yes, it actually goes back further. Bear with me. For the longest time, it used to be local per-cpu counters. Every memory.stat read had to do nr_memcg * nr_cpu aggregation. You can imagine that this didn't scale in production. We added local atomics and turned the per-cpu counters into buffers: commit a983b5ebee57209c99f68c8327072f25e0e6e3da Author: Johannes Weiner <hannes@cmpxchg.org> Date: Wed Jan 31 16:16:45 2018 -0800 mm: memcontrol: fix excessive complexity in memory.stat reporting Local counts became a simple atomic_read(), but the hierarchy counts would still have to aggregate nr_memcg counters. That was of course still too much read-side complexity, so we switched to batched upward propagation during the stat updates: commit 42a300353577ccc17ecc627b8570a89fa1678bec Author: Johannes Weiner <hannes@cmpxchg.org> Date: Tue May 14 15:47:12 2019 -0700 mm: memcontrol: fix recursive statistics correctness & scalabilty This gave us two atomics at each level: one for local and one for hierarchical stats. However, that went too far the other direction: too many counters touched during stat updates, and we got a regression report over memcg cacheline contention during MM workloads. 
Instead of backing out 42a300353 - since all the previous versions were terrible too - we dropped write-side aggregation of *only* the local counters: commit 815744d75152078cde5391fc1e3c2d4424323fb6 Author: Johannes Weiner <hannes@cmpxchg.org> Date: Thu Jun 13 15:55:46 2019 -0700 mm: memcontrol: don't batch updates of local VM stats and events In effect, this kept all the stat optimizations for cgroup2 (which doesn't have local counters), and reverted cgroup1 back to how it was for the longest time: on-demand aggregated per-cpu counters. For about a year, cgroup1 didn't have to per-cpu the local stats on read. But for the recursive stats, it would either still have to do subtree aggregation on read, or too much upward flushing on write. So if I had to blame one commit for a cgroup1 regression, it would probably be 815744d. But it's kind of a stretch to say that it worked well before that commit. For the changelog, maybe just say that there was a lot of back and forth between read-side aggregation and write-side aggregation. Since with rstat we now have efficient read-side aggregation, attempt a conceptual revert of 815744d. > > But I want to be clear: this isn't a regression fix. It's a new > > performance optimization for the deprecated cgroup1 code. And it comes > > at the cost of higher memory footprint for both cgroup1 AND cgroup2. > > I still think it is, but I can easily be wrong. I am hoping that the > memory footprint is not a problem. There are *roughly* 80 per-memcg > stats/events (MEMCG_NR_STAT + NR_MEMCG_EVENTS) and 55 per-lruvec stats > (NR_VM_NODE_STAT_ITEMS). For each stat there is an extra 8 bytes, so > on a two-node machine that's 8 * (80 + 55 * 2) ~= 1.5 KiB per memcg. > > Given that struct mem_cgroup is already large, and can easily be 100s > of KiBs on a large machine with many cpus, I hope there won't be a > noticeable regression. Yes, the concern wasn't so much the memory consumption but the cachelines touched during hot paths. However, that was mostly because we either had a) write-side flushing, which is extremely hot during MM stress, or b) read-side flushing with huuuge cgroup subtrees due to zombie cgroups. A small cacheline difference would be enormously amplified by these factors. Rstat is very good at doing selective subtree flushing on reads, so the big coefficients from a) and b) are no longer such a big concern. A slightly bigger cache footprint is probably going to be okay.
On Tue, Jul 25, 2023 at 1:18 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Tue, Jul 25, 2023 at 12:24:19PM -0700, Yosry Ahmed wrote: > > On Tue, Jul 25, 2023 at 7:04 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > We used to maintain *all* stats in per-cpu counters at the local > > > level. memory.stat reads would have to iterate and aggregate the > > > entire subtree every time. This was obviously very costly, so we added > > > batched upward propagation during stat updates to simplify reads: > > > > > > commit 42a300353577ccc17ecc627b8570a89fa1678bec > > > Author: Johannes Weiner <hannes@cmpxchg.org> > > > Date: Tue May 14 15:47:12 2019 -0700 > > > > > > mm: memcontrol: fix recursive statistics correctness & scalabilty > > > > > > However, that caused a regression in the stat write path, as the > > > upward propagation would bottleneck on the cachelines in the shared > > > parents. The fix for *that* re-introduced the per-cpu loops in the > > > local stat reads: > > > > > > commit 815744d75152078cde5391fc1e3c2d4424323fb6 > > > Author: Johannes Weiner <hannes@cmpxchg.org> > > > Date: Thu Jun 13 15:55:46 2019 -0700 > > > > > > mm: memcontrol: don't batch updates of local VM stats and events > > > > > > So I wouldn't say it's a regression from rstat. Except for that short > > > period between the two commits above, the read side for local stats > > > was always expensive. > > > > I was comparing from an 4.15 kernel, so I assumed the major change was > > from rstat, but that was not accurate. Thanks for the history. > > > > However, in that 4.15 kernel the local (non-hierarchical) stats were > > readily available without iterating percpu counters. There is a > > regression that was introduced somewhere. > > > > Looking at the history you described, it seems like up until > > 815744d75152 we used to maintain "local" (aka non-hierarchical) > > counters, so reading local stats was reading one counter, and starting > > 815744d75152 we started having to loop percpu counters for that. > > > > So it is not a regression of rstat, but seemingly it is a regression > > of 815744d75152. Is my understanding incorrect? > > Yes, it actually goes back further. Bear with me. > > For the longest time, it used to be local per-cpu counters. Every > memory.stat read had to do nr_memcg * nr_cpu aggregation. You can > imagine that this didn't scale in production. > > We added local atomics and turned the per-cpu counters into buffers: > > commit a983b5ebee57209c99f68c8327072f25e0e6e3da > Author: Johannes Weiner <hannes@cmpxchg.org> > Date: Wed Jan 31 16:16:45 2018 -0800 > > mm: memcontrol: fix excessive complexity in memory.stat reporting > > Local counts became a simple atomic_read(), but the hierarchy counts > would still have to aggregate nr_memcg counters. > > That was of course still too much read-side complexity, so we switched > to batched upward propagation during the stat updates: > > commit 42a300353577ccc17ecc627b8570a89fa1678bec > Author: Johannes Weiner <hannes@cmpxchg.org> > Date: Tue May 14 15:47:12 2019 -0700 > > mm: memcontrol: fix recursive statistics correctness & scalabilty > > This gave us two atomics at each level: one for local and one for > hierarchical stats. > > However, that went too far the other direction: too many counters > touched during stat updates, and we got a regression report over memcg > cacheline contention during MM workloads. 
Instead of backing out > 42a300353 - since all the previous versions were terrible too - we > dropped write-side aggregation of *only* the local counters: > > commit 815744d75152078cde5391fc1e3c2d4424323fb6 > Author: Johannes Weiner <hannes@cmpxchg.org> > Date: Thu Jun 13 15:55:46 2019 -0700 > > mm: memcontrol: don't batch updates of local VM stats and events > > In effect, this kept all the stat optimizations for cgroup2 (which > doesn't have local counters), and reverted cgroup1 back to how it was > for the longest time: on-demand aggregated per-cpu counters. > > For about a year, cgroup1 didn't have to per-cpu the local stats on > read. But for the recursive stats, it would either still have to do > subtree aggregation on read, or too much upward flushing on write. > > So if I had to blame one commit for a cgroup1 regression, it would > probably be 815744d. But it's kind of a stretch to say that it worked > well before that commit. > > For the changelog, maybe just say that there was a lot of back and > forth between read-side aggregation and write-side aggregation. Since > with rstat we now have efficient read-side aggregation, attempt a > conceptual revert of 815744d. Now that's a much more complete picture. Thanks a lot for all the history, now it makes much more sense. I wouldn't blame 815744d then, as you said it's better framed as a conceptual revert of it. I will rewrite the commit log accordingly and send a v2. Thanks! > > > > But I want to be clear: this isn't a regression fix. It's a new > > > performance optimization for the deprecated cgroup1 code. And it comes > > > at the cost of higher memory footprint for both cgroup1 AND cgroup2. > > > > I still think it is, but I can easily be wrong. I am hoping that the > > memory footprint is not a problem. There are *roughly* 80 per-memcg > > stats/events (MEMCG_NR_STAT + NR_MEMCG_EVENTS) and 55 per-lruvec stats > > (NR_VM_NODE_STAT_ITEMS). For each stat there is an extra 8 bytes, so > > on a two-node machine that's 8 * (80 + 55 * 2) ~= 1.5 KiB per memcg. > > > > Given that struct mem_cgroup is already large, and can easily be 100s > > of KiBs on a large machine with many cpus, I hope there won't be a > > noticeable regression. > > Yes, the concern wasn't so much the memory consumption but the > cachelines touched during hot paths. > > However, that was mostly because we either had a) write-side flushing, > which is extremely hot during MM stress, or b) read-side flushing with > huuuge cgroup subtrees due to zombie cgroups. A small cacheline > difference would be enormously amplified by these factors. > > Rstat is very good at doing selective subtree flushing on reads, so > the big coefficients from a) and b) are no longer such a big concern. > A slightly bigger cache footprint is probably going to be okay. Agreed, maintaining the local counters with rstat is much easier than the previous attempts. I will try to bake most of this into the commit log.
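For reference, the ~1.5 KiB estimate quoted above can be reproduced from the rough counts in the thread. The snippet below is a standalone back-of-envelope check; the counter counts and node count are assumptions taken from the discussion, not the real MEMCG_NR_STAT, NR_MEMCG_EVENTS, or NR_VM_NODE_STAT_ITEMS values, which depend on the kernel config.

#include <stdio.h>

int main(void)
{
	/* assumed counts from the thread, not authoritative */
	const int memcg_counters = 80;   /* ~MEMCG_NR_STAT + NR_MEMCG_EVENTS */
	const int lruvec_counters = 55;  /* ~NR_VM_NODE_STAT_ITEMS */
	const int nodes = 2;             /* example two-node machine */

	/* one extra 8-byte "local" slot per counter, per memcg */
	long bytes = 8L * (memcg_counters + nodes * lruvec_counters);

	printf("~%ld bytes (%.1f KiB) extra per memcg\n",
	       bytes, bytes / 1024.0);
	return 0;
}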
On Wed, Jul 19, 2023 at 10:46 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > Currently, memcg uses rstat to maintain hierarchical stats. The rstat > framework keeps track of which cgroups have updates on which cpus. > > For non-hierarchical stats, as memcg moved to rstat, they are no longer > readily available as counters. Instead, the percpu counters for a given > stat need to be summed to get the non-hierarchical stat value. This > causes a performance regression when reading non-hierarchical stats on > kernels where memcg moved to using rstat. This is especially visible > when reading memory.stat on cgroup v1. There are also some code paths > internal to the kernel that read such non-hierarchical stats. > > It is inefficient to iterate and sum counters in all cpus when the rstat > framework knows exactly when a percpu counter has an update. Instead, > maintain cpu-aggregated non-hierarchical counters for each stat. During > an rstat flush, keep those updated as well. When reading > non-hierarchical stats, we no longer need to iterate cpus, we just need > to read the maintainer counters, similar to hierarchical stats. > > A caveat is that we now a stats flush before reading > local/non-hierarchical stats through {memcg/lruvec}_page_state_local() > or memcg_events_local(), where we previously only needed a flush to > read hierarchical stats. Most contexts reading non-hierarchical stats > are already doing a flush, add a flush to the only missing context in > count_shadow_nodes(). > > With this patch, reading memory.stat from 1000 memcgs is 3x faster on a > machine with 256 cpus on cgroup v1: > # for i in $(seq 1000); do mkdir /sys/fs/cgroup/memory/cg$i; done > # time cat /dev/cgroup/memory/cg*/memory.stat > /dev/null > real 0m0.125s > user 0m0.005s > sys 0m0.120s > > After: > real 0m0.032s > user 0m0.005s > sys 0m0.027s > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > --- > include/linux/memcontrol.h | 7 ++++--- > mm/memcontrol.c | 32 +++++++++++++++++++------------- > mm/workingset.c | 1 + > 3 files changed, 24 insertions(+), 16 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 5818af8eca5a..a9f2861a57a5 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -112,6 +112,9 @@ struct lruvec_stats { > /* Aggregated (CPU and subtree) state */ > long state[NR_VM_NODE_STAT_ITEMS]; > > + /* Non-hierarchical (CPU aggregated) state */ > + long state_local[NR_VM_NODE_STAT_ITEMS]; > + > /* Pending child counts during tree propagation */ > long state_pending[NR_VM_NODE_STAT_ITEMS]; > }; > @@ -1020,14 +1023,12 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec, > { > struct mem_cgroup_per_node *pn; > long x = 0; > - int cpu; > > if (mem_cgroup_disabled()) > return node_page_state(lruvec_pgdat(lruvec), idx); > > pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); > - for_each_possible_cpu(cpu) > - x += per_cpu(pn->lruvec_stats_percpu->state[idx], cpu); > + x = READ_ONCE(pn->lruvec_stats.state_local[idx]); > #ifdef CONFIG_SMP > if (x < 0) > x = 0; > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index e8ca4bdcb03c..90a22637818e 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -742,6 +742,10 @@ struct memcg_vmstats { > long state[MEMCG_NR_STAT]; > unsigned long events[NR_MEMCG_EVENTS]; > > + /* Non-hierarchical (CPU aggregated) page state & events */ > + long state_local[MEMCG_NR_STAT]; > + unsigned long events_local[NR_MEMCG_EVENTS]; > + > /* Pending child counts during 
tree propagation */ > long state_pending[MEMCG_NR_STAT]; > unsigned long events_pending[NR_MEMCG_EVENTS]; > @@ -775,11 +779,8 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val) > /* idx can be of type enum memcg_stat_item or node_stat_item. */ > static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx) > { > - long x = 0; > - int cpu; > + long x = READ_ONCE(memcg->vmstats->state_local[idx]); > > - for_each_possible_cpu(cpu) > - x += per_cpu(memcg->vmstats_percpu->state[idx], cpu); > #ifdef CONFIG_SMP > if (x < 0) > x = 0; > @@ -926,16 +927,12 @@ static unsigned long memcg_events(struct mem_cgroup *memcg, int event) > > static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event) > { > - long x = 0; > - int cpu; > int index = memcg_events_index(event); > > if (index < 0) > return 0; > > - for_each_possible_cpu(cpu) > - x += per_cpu(memcg->vmstats_percpu->events[index], cpu); > - return x; > + return READ_ONCE(memcg->vmstats->events_local[index]); > } > > static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, > @@ -5526,7 +5523,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu) > struct mem_cgroup *memcg = mem_cgroup_from_css(css); > struct mem_cgroup *parent = parent_mem_cgroup(memcg); > struct memcg_vmstats_percpu *statc; > - long delta, v; > + long delta, delta_cpu, v; > int i, nid; > > statc = per_cpu_ptr(memcg->vmstats_percpu, cpu); > @@ -5542,9 +5539,11 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu) > memcg->vmstats->state_pending[i] = 0; > > /* Add CPU changes on this level since the last flush */ > + delta_cpu = 0; > v = READ_ONCE(statc->state[i]); > if (v != statc->state_prev[i]) { > - delta += v - statc->state_prev[i]; > + delta_cpu = v - statc->state_prev[i]; > + delta += delta_cpu; > statc->state_prev[i] = v; > } > > @@ -5553,6 +5552,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu) > > /* Aggregate counts on this level and propagate upwards */ > memcg->vmstats->state[i] += delta; > + memcg->vmstats->state_local[i] += delta_cpu; I ran this through more testing. There is a subtle problem here. If delta == 0 and delta_cpu != 0, we will skip the update to the local stats. This happens in the very unlikely case where the delta on the flushed cpu is equal in value but of opposite sign to the delta coming from the children. IOW if (statc->state[i] - statc->state_prev[i]) == -memcg->vmstats->state_pending[i]. Very unlikely but I happened to stumble upon it. Will fix this for v2. 
> if (parent) > parent->vmstats->state_pending[i] += delta; > } > @@ -5562,9 +5562,11 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu) > if (delta) > memcg->vmstats->events_pending[i] = 0; > > + delta_cpu = 0; > v = READ_ONCE(statc->events[i]); > if (v != statc->events_prev[i]) { > - delta += v - statc->events_prev[i]; > + delta_cpu = v - statc->events_prev[i]; > + delta += delta_cpu; > statc->events_prev[i] = v; > } > > @@ -5572,6 +5574,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu) > continue; > > memcg->vmstats->events[i] += delta; > + memcg->vmstats->events_local[i] += delta_cpu; > if (parent) > parent->vmstats->events_pending[i] += delta; > } > @@ -5591,9 +5594,11 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu) > if (delta) > pn->lruvec_stats.state_pending[i] = 0; > > + delta_cpu = 0; > v = READ_ONCE(lstatc->state[i]); > if (v != lstatc->state_prev[i]) { > - delta += v - lstatc->state_prev[i]; > + delta_cpu = v - lstatc->state_prev[i]; > + delta += delta_cpu; > lstatc->state_prev[i] = v; > } > > @@ -5601,6 +5606,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu) > continue; > > pn->lruvec_stats.state[i] += delta; > + pn->lruvec_stats.state_local[i] += delta_cpu; > if (ppn) > ppn->lruvec_stats.state_pending[i] += delta; > } > diff --git a/mm/workingset.c b/mm/workingset.c > index 4686ae363000..da58a26d0d4d 100644 > --- a/mm/workingset.c > +++ b/mm/workingset.c > @@ -664,6 +664,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker, > struct lruvec *lruvec; > int i; > > + mem_cgroup_flush_stats(); > lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid)); > for (pages = 0, i = 0; i < NR_LRU_LISTS; i++) > pages += lruvec_page_state_local(lruvec, > -- > 2.41.0.255.g8b1d071c50-goog >
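The v1 diff follows below; the v2 fix Yosry alludes to above is not part of this thread. One plausible shape of it, sketched here as a hypothetical fragment of the flush loop body, is to gate the local and hierarchical updates separately so that a zero net delta (children cancelling out this cpu's change) can no longer swallow a non-zero delta_cpu:

		/* Add CPU changes on this level since the last flush */
		delta_cpu = 0;
		v = READ_ONCE(statc->state[i]);
		if (v != statc->state_prev[i]) {
			delta_cpu = v - statc->state_prev[i];
			delta += delta_cpu;
			statc->state_prev[i] = v;
		}

		/* local counter: update whenever this cpu changed,
		 * even if the net delta happens to be zero */
		if (delta_cpu)
			memcg->vmstats->state_local[i] += delta_cpu;

		/* hierarchical counter and upward propagation need only delta */
		if (delta) {
			memcg->vmstats->state[i] += delta;
			if (parent)
				parent->vmstats->state_pending[i] += delta;
		}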
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5818af8eca5a..a9f2861a57a5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -112,6 +112,9 @@ struct lruvec_stats {
 	/* Aggregated (CPU and subtree) state */
 	long state[NR_VM_NODE_STAT_ITEMS];
 
+	/* Non-hierarchical (CPU aggregated) state */
+	long state_local[NR_VM_NODE_STAT_ITEMS];
+
 	/* Pending child counts during tree propagation */
 	long state_pending[NR_VM_NODE_STAT_ITEMS];
 };
@@ -1020,14 +1023,12 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 {
 	struct mem_cgroup_per_node *pn;
 	long x = 0;
-	int cpu;
 
 	if (mem_cgroup_disabled())
 		return node_page_state(lruvec_pgdat(lruvec), idx);
 
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
-	for_each_possible_cpu(cpu)
-		x += per_cpu(pn->lruvec_stats_percpu->state[idx], cpu);
+	x = READ_ONCE(pn->lruvec_stats.state_local[idx]);
 #ifdef CONFIG_SMP
 	if (x < 0)
 		x = 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e8ca4bdcb03c..90a22637818e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -742,6 +742,10 @@ struct memcg_vmstats {
 	long state[MEMCG_NR_STAT];
 	unsigned long events[NR_MEMCG_EVENTS];
 
+	/* Non-hierarchical (CPU aggregated) page state & events */
+	long state_local[MEMCG_NR_STAT];
+	unsigned long events_local[NR_MEMCG_EVENTS];
+
 	/* Pending child counts during tree propagation */
 	long state_pending[MEMCG_NR_STAT];
 	unsigned long events_pending[NR_MEMCG_EVENTS];
@@ -775,11 +779,8 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
 /* idx can be of type enum memcg_stat_item or node_stat_item. */
 static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
 {
-	long x = 0;
-	int cpu;
+	long x = READ_ONCE(memcg->vmstats->state_local[idx]);
 
-	for_each_possible_cpu(cpu)
-		x += per_cpu(memcg->vmstats_percpu->state[idx], cpu);
 #ifdef CONFIG_SMP
 	if (x < 0)
 		x = 0;
@@ -926,16 +927,12 @@ static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
 
 static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
 {
-	long x = 0;
-	int cpu;
 	int index = memcg_events_index(event);
 
 	if (index < 0)
 		return 0;
 
-	for_each_possible_cpu(cpu)
-		x += per_cpu(memcg->vmstats_percpu->events[index], cpu);
-	return x;
+	return READ_ONCE(memcg->vmstats->events_local[index]);
 }
 
 static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
@@ -5526,7 +5523,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
 	struct memcg_vmstats_percpu *statc;
-	long delta, v;
+	long delta, delta_cpu, v;
 	int i, nid;
 
 	statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
@@ -5542,9 +5539,11 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 			memcg->vmstats->state_pending[i] = 0;
 
 		/* Add CPU changes on this level since the last flush */
+		delta_cpu = 0;
 		v = READ_ONCE(statc->state[i]);
 		if (v != statc->state_prev[i]) {
-			delta += v - statc->state_prev[i];
+			delta_cpu = v - statc->state_prev[i];
+			delta += delta_cpu;
 			statc->state_prev[i] = v;
 		}
 
@@ -5553,6 +5552,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 
 		/* Aggregate counts on this level and propagate upwards */
 		memcg->vmstats->state[i] += delta;
+		memcg->vmstats->state_local[i] += delta_cpu;
 		if (parent)
 			parent->vmstats->state_pending[i] += delta;
 	}
@@ -5562,9 +5562,11 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 		if (delta)
 			memcg->vmstats->events_pending[i] = 0;
 
+		delta_cpu = 0;
 		v = READ_ONCE(statc->events[i]);
 		if (v != statc->events_prev[i]) {
-			delta += v - statc->events_prev[i];
+			delta_cpu = v - statc->events_prev[i];
+			delta += delta_cpu;
 			statc->events_prev[i] = v;
 		}
 
@@ -5572,6 +5574,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 			continue;
 
 		memcg->vmstats->events[i] += delta;
+		memcg->vmstats->events_local[i] += delta_cpu;
 		if (parent)
 			parent->vmstats->events_pending[i] += delta;
 	}
@@ -5591,9 +5594,11 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 		if (delta)
 			pn->lruvec_stats.state_pending[i] = 0;
 
+		delta_cpu = 0;
 		v = READ_ONCE(lstatc->state[i]);
 		if (v != lstatc->state_prev[i]) {
-			delta += v - lstatc->state_prev[i];
+			delta_cpu = v - lstatc->state_prev[i];
+			delta += delta_cpu;
 			lstatc->state_prev[i] = v;
 		}
 
@@ -5601,6 +5606,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 			continue;
 
 		pn->lruvec_stats.state[i] += delta;
+		pn->lruvec_stats.state_local[i] += delta_cpu;
 		if (ppn)
 			ppn->lruvec_stats.state_pending[i] += delta;
 	}
diff --git a/mm/workingset.c b/mm/workingset.c
index 4686ae363000..da58a26d0d4d 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -664,6 +664,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 	struct lruvec *lruvec;
 	int i;
 
+	mem_cgroup_flush_stats();
 	lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
 	for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
 		pages += lruvec_page_state_local(lruvec,