mm: memcg: use rstat for non-hierarchical stats

Message ID 20230719174613.3062124-1-yosryahmed@google.com
State New
Series mm: memcg: use rstat for non-hierarchical stats

Commit Message

Yosry Ahmed July 19, 2023, 5:46 p.m. UTC
  Currently, memcg uses rstat to maintain hierarchical stats. The rstat
framework keeps track of which cgroups have updates on which cpus.

For non-hierarchical stats, as memcg moved to rstat, they are no longer
readily available as counters. Instead, the percpu counters for a given
stat need to be summed to get the non-hierarchical stat value. This
causes a performance regression when reading non-hierarchical stats on
kernels where memcg moved to using rstat. This is especially visible
when reading memory.stat on cgroup v1. There are also some code paths
internal to the kernel that read such non-hierarchical stats.

It is inefficient to iterate and sum counters in all cpus when the rstat
framework knows exactly when a percpu counter has an update. Instead,
maintain cpu-aggregated non-hierarchical counters for each stat. During
an rstat flush, keep those updated as well. When reading
non-hierarchical stats, we no longer need to iterate cpus, we just need
to read the maintained counters, similar to hierarchical stats.

A caveat is that we now need a stats flush before reading
local/non-hierarchical stats through {memcg/lruvec}_page_state_local()
or memcg_events_local(), where we previously only needed a flush to
read hierarchical stats. Most contexts reading non-hierarchical stats
are already doing a flush; add a flush to the only missing context in
count_shadow_nodes().
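
Condensed from the diff below, the read side goes from a per-cpu walk to a
single load of the flush-maintained counter (memcg stats shown here; lruvec
stats and events change the same way):

 /* Before: every local read sums the per-cpu counters */
 static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
 {
         long x = 0;
         int cpu;

         for_each_possible_cpu(cpu)
                 x += per_cpu(memcg->vmstats_percpu->state[idx], cpu);
 #ifdef CONFIG_SMP
         if (x < 0)
                 x = 0;
 #endif
         return x;
 }

 /* After: read the counter that the rstat flush keeps up to date */
 static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
 {
         long x = READ_ONCE(memcg->vmstats->state_local[idx]);

 #ifdef CONFIG_SMP
         if (x < 0)
                 x = 0;
 #endif
         return x;
 }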

With this patch, reading memory.stat from 1000 memcgs is 3x faster on a
machine with 256 cpus on cgroup v1:
 # for i in $(seq 1000); do mkdir /sys/fs/cgroup/memory/cg$i; done
 # time cat /dev/cgroup/memory/cg*/memory.stat > /dev/null
 real	 0m0.125s
 user	 0m0.005s
 sys	 0m0.120s

After:
 real	 0m0.032s
 user	 0m0.005s
 sys	 0m0.027s

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/memcontrol.h |  7 ++++---
 mm/memcontrol.c            | 32 +++++++++++++++++++-------------
 mm/workingset.c            |  1 +
 3 files changed, 24 insertions(+), 16 deletions(-)
  

Comments

Andrew Morton July 24, 2023, 6:31 p.m. UTC | #1
On Wed, 19 Jul 2023 17:46:13 +0000 Yosry Ahmed <yosryahmed@google.com> wrote:

> Currently, memcg uses rstat to maintain hierarchical stats. The rstat
> framework keeps track of which cgroups have updates on which cpus.
> 
> For non-hierarchical stats, as memcg moved to rstat, they are no longer
> readily available as counters. Instead, the percpu counters for a given
> stat need to be summed to get the non-hierarchical stat value. This
> causes a performance regression when reading non-hierarchical stats on
> kernels where memcg moved to using rstat. This is especially visible
> when reading memory.stat on cgroup v1. There are also some code paths
> internal to the kernel that read such non-hierarchical stats.
> 
> It is inefficient to iterate and sum counters in all cpus when the rstat
> framework knows exactly when a percpu counter has an update. Instead,
> maintain cpu-aggregated non-hierarchical counters for each stat. During
> an rstat flush, keep those updated as well. When reading
> non-hierarchical stats, we no longer need to iterate cpus, we just need
> to read the maintained counters, similar to hierarchical stats.
> 
> A caveat is that we now need a stats flush before reading
> local/non-hierarchical stats through {memcg/lruvec}_page_state_local()
> or memcg_events_local(), where we previously only needed a flush to
> read hierarchical stats. Most contexts reading non-hierarchical stats
> are already doing a flush; add a flush to the only missing context in
> count_shadow_nodes().
> 
> With this patch, reading memory.stat from 1000 memcgs is 3x faster on a
> machine with 256 cpus on cgroup v1:
>  # for i in $(seq 1000); do mkdir /sys/fs/cgroup/memory/cg$i; done
>  # time cat /dev/cgroup/memory/cg*/memory.stat > /dev/null
>  real	 0m0.125s
>  user	 0m0.005s
>  sys	 0m0.120s
> 
> After:
>  real	 0m0.032s
>  user	 0m0.005s
>  sys	 0m0.027s
> 

I'll queue this for some testing, pending reviewer input, please.
  
Yosry Ahmed July 24, 2023, 6:34 p.m. UTC | #2
On Mon, Jul 24, 2023 at 11:31 AM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Wed, 19 Jul 2023 17:46:13 +0000 Yosry Ahmed <yosryahmed@google.com> wrote:
>
> > Currently, memcg uses rstat to maintain hierarchical stats. The rstat
> > framework keeps track of which cgroups have updates on which cpus.
> >
> > For non-hierarchical stats, as memcg moved to rstat, they are no longer
> > readily available as counters. Instead, the percpu counters for a given
> > stat need to be summed to get the non-hierarchical stat value. This
> > causes a performance regression when reading non-hierarchical stats on
> > kernels where memcg moved to using rstat. This is especially visible
> > when reading memory.stat on cgroup v1. There are also some code paths
> > internal to the kernel that read such non-hierarchical stats.
> >
> > It is inefficient to iterate and sum counters in all cpus when the rstat
> > framework knows exactly when a percpu counter has an update. Instead,
> > maintain cpu-aggregated non-hierarchical counters for each stat. During
> > an rstat flush, keep those updated as well. When reading
> > non-hierarchical stats, we no longer need to iterate cpus, we just need
> > to read the maintained counters, similar to hierarchical stats.
> >
> > A caveat is that we now need a stats flush before reading
> > local/non-hierarchical stats through {memcg/lruvec}_page_state_local()
> > or memcg_events_local(), where we previously only needed a flush to
> > read hierarchical stats. Most contexts reading non-hierarchical stats
> > are already doing a flush; add a flush to the only missing context in
> > count_shadow_nodes().
> >
> > With this patch, reading memory.stat from 1000 memcgs is 3x faster on a
> > machine with 256 cpus on cgroup v1:
> >  # for i in $(seq 1000); do mkdir /sys/fs/cgroup/memory/cg$i; done
> >  # time cat /dev/cgroup/memory/cg*/memory.stat > /dev/null
> >  real  0m0.125s
> >  user  0m0.005s
> >  sys   0m0.120s
> >
> > After:
> >  real  0m0.032s
> >  user  0m0.005s
> >  sys   0m0.027s
> >
>
> I'll queue this for some testing, pending reviewer input, please.

Thanks Andrew!

I am doing some extra testing of my own as well right now. Will report
back if anything interesting pops up.
  
Johannes Weiner July 25, 2023, 2:04 p.m. UTC | #3
On Wed, Jul 19, 2023 at 05:46:13PM +0000, Yosry Ahmed wrote:
> Currently, memcg uses rstat to maintain hierarchical stats. The rstat
> framework keeps track of which cgroups have updates on which cpus.
> 
> For non-hierarchical stats, as memcg moved to rstat, they are no longer
> readily available as counters. Instead, the percpu counters for a given
> stat need to be summed to get the non-hierarchical stat value. This
> causes a performance regression when reading non-hierarchical stats on
> kernels where memcg moved to using rstat. This is especially visible
> when reading memory.stat on cgroup v1. There are also some code paths
> internal to the kernel that read such non-hierarchical stats.

It's actually not an rstat regression. It's always been this costly.

Quick history:

We used to maintain *all* stats in per-cpu counters at the local
level. memory.stat reads would have to iterate and aggregate the
entire subtree every time. This was obviously very costly, so we added
batched upward propagation during stat updates to simplify reads:

commit 42a300353577ccc17ecc627b8570a89fa1678bec
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Tue May 14 15:47:12 2019 -0700

    mm: memcontrol: fix recursive statistics correctness & scalabilty

However, that caused a regression in the stat write path, as the
upward propagation would bottleneck on the cachelines in the shared
parents. The fix for *that* re-introduced the per-cpu loops in the
local stat reads:

commit 815744d75152078cde5391fc1e3c2d4424323fb6
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Thu Jun 13 15:55:46 2019 -0700

    mm: memcontrol: don't batch updates of local VM stats and events

So I wouldn't say it's a regression from rstat. Except for that short
period between the two commits above, the read side for local stats
was always expensive.

rstat promises a shot at finally fixing it, with less risk to the
write path.

> It is inefficient to iterate and sum counters in all cpus when the rstat
> framework knows exactly when a percpu counter has an update. Instead,
> maintain cpu-aggregated non-hierarchical counters for each stat. During
> an rstat flush, keep those updated as well. When reading
> non-hierarchical stats, we no longer need to iterate cpus, we just need
> to read the maintained counters, similar to hierarchical stats.
> 
> A caveat is that we now need a stats flush before reading
> local/non-hierarchical stats through {memcg/lruvec}_page_state_local()
> or memcg_events_local(), where we previously only needed a flush to
> read hierarchical stats. Most contexts reading non-hierarchical stats
> are already doing a flush; add a flush to the only missing context in
> count_shadow_nodes().
> 
> With this patch, reading memory.stat from 1000 memcgs is 3x faster on a
> machine with 256 cpus on cgroup v1:
>  # for i in $(seq 1000); do mkdir /sys/fs/cgroup/memory/cg$i; done
>  # time cat /dev/cgroup/memory/cg*/memory.stat > /dev/null
>  real	 0m0.125s
>  user	 0m0.005s
>  sys	 0m0.120s
> 
> After:
>  real	 0m0.032s
>  user	 0m0.005s
>  sys	 0m0.027s
> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

But I want to be clear: this isn't a regression fix. It's a new
performance optimization for the deprecated cgroup1 code. And it comes
at the cost of higher memory footprint for both cgroup1 AND cgroup2.

If this causes a regression, we should revert it again. But let's try.
  
Andrew Morton July 25, 2023, 5:43 p.m. UTC | #4
On Tue, 25 Jul 2023 10:04:35 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> But I want to be clear: this isn't a regression fix. It's a new
> performance optimization for the deprecated cgroup1 code. And it comes
> at the cost of higher memory footprint for both cgroup1 AND cgroup2.

Thanks.

Yosry, could you please send along a suitably modified changelog?
  
Yosry Ahmed July 25, 2023, 7:24 p.m. UTC | #5
Hey Johannes,

On Tue, Jul 25, 2023 at 7:04 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Jul 19, 2023 at 05:46:13PM +0000, Yosry Ahmed wrote:
> > Currently, memcg uses rstat to maintain hierarchical stats. The rstat
> > framework keeps track of which cgroups have updates on which cpus.
> >
> > For non-hierarchical stats, as memcg moved to rstat, they are no longer
> > readily available as counters. Instead, the percpu counters for a given
> > stat need to be summed to get the non-hierarchical stat value. This
> > causes a performance regression when reading non-hierarchical stats on
> > kernels where memcg moved to using rstat. This is especially visible
> > when reading memory.stat on cgroup v1. There are also some code paths
> > internal to the kernel that read such non-hierarchical stats.
>
> It's actually not an rstat regression. It's always been this costly.
>
> Quick history:

Thanks for the context.

>
> We used to maintain *all* stats in per-cpu counters at the local
> level. memory.stat reads would have to iterate and aggregate the
> entire subtree every time. This was obviously very costly, so we added
> batched upward propagation during stat updates to simplify reads:
>
> commit 42a300353577ccc17ecc627b8570a89fa1678bec
> Author: Johannes Weiner <hannes@cmpxchg.org>
> Date:   Tue May 14 15:47:12 2019 -0700
>
>     mm: memcontrol: fix recursive statistics correctness & scalabilty
>
> However, that caused a regression in the stat write path, as the
> upward propagation would bottleneck on the cachelines in the shared
> parents. The fix for *that* re-introduced the per-cpu loops in the
> local stat reads:
>
> commit 815744d75152078cde5391fc1e3c2d4424323fb6
> Author: Johannes Weiner <hannes@cmpxchg.org>
> Date:   Thu Jun 13 15:55:46 2019 -0700
>
>     mm: memcontrol: don't batch updates of local VM stats and events
>
> So I wouldn't say it's a regression from rstat. Except for that short
> period between the two commits above, the read side for local stats
> was always expensive.

I was comparing from a 4.15 kernel, so I assumed the major change was
from rstat, but that was not accurate. Thanks for the history.

However, in that 4.15 kernel the local (non-hierarchical) stats were
readily available without iterating percpu counters. There is a
regression that was introduced somewhere.

Looking at the history you described, it seems like up until
815744d75152 we used to maintain "local" (aka non-hierarchical)
counters, so reading local stats was reading one counter, and starting
815744d75152 we started having to loop percpu counters for that.

So it is not a regression of rstat, but seemingly it is a regression
of 815744d75152. Is my understanding incorrect?

>
> rstat promises a shot at finally fixing it, with less risk to the
> write path.
>
> > It is inefficient to iterate and sum counters in all cpus when the rstat
> > framework knows exactly when a percpu counter has an update. Instead,
> > maintain cpu-aggregated non-hierarchical counters for each stat. During
> > an rstat flush, keep those updated as well. When reading
> > non-hierarchical stats, we no longer need to iterate cpus, we just need
> > to read the maintained counters, similar to hierarchical stats.
> >
> > A caveat is that we now need a stats flush before reading
> > local/non-hierarchical stats through {memcg/lruvec}_page_state_local()
> > or memcg_events_local(), where we previously only needed a flush to
> > read hierarchical stats. Most contexts reading non-hierarchical stats
> > are already doing a flush; add a flush to the only missing context in
> > count_shadow_nodes().
> >
> > With this patch, reading memory.stat from 1000 memcgs is 3x faster on a
> > machine with 256 cpus on cgroup v1:
> >  # for i in $(seq 1000); do mkdir /sys/fs/cgroup/memory/cg$i; done
> >  # time cat /dev/cgroup/memory/cg*/memory.stat > /dev/null
> >  real  0m0.125s
> >  user  0m0.005s
> >  sys   0m0.120s
> >
> > After:
> >  real  0m0.032s
> >  user  0m0.005s
> >  sys   0m0.027s
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks! I will reformulate the commit log after we agree on the history.

>
> But I want to be clear: this isn't a regression fix. It's a new
> performance optimization for the deprecated cgroup1 code. And it comes
> at the cost of higher memory footprint for both cgroup1 AND cgroup2.

I still think it is, but I can easily be wrong. I am hoping that the
memory footprint is not a problem. There are *roughly* 80 per-memcg
stats/events (MEMCG_NR_STAT + NR_MEMCG_EVENTS) and 55 per-lruvec stats
(NR_VM_NODE_STAT_ITEMS). For each stat there is an extra 8 bytes, so
on a two-node machine that's  8 * (80 + 55 * 2) ~= 1.5 KiB per memcg.

Given that struct mem_cgroup is already large, and can easily be 100s
of KiBs on a large machine with many cpus, I hope there won't be a
noticeable regression.
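
Spelling that estimate out as a quick userspace check (the 80 and 55 item
counts are rough and depend on the kernel config):

#include <stdio.h>

int main(void)
{
        int memcg_items = 80;   /* ~ MEMCG_NR_STAT + NR_MEMCG_EVENTS */
        int lruvec_items = 55;  /* ~ NR_VM_NODE_STAT_ITEMS */
        int nr_nodes = 2;       /* lruvec counters exist once per node */

        /* one extra 8-byte counter per item */
        printf("%d bytes per memcg\n",
               8 * (memcg_items + lruvec_items * nr_nodes)); /* 1520 ~= 1.5 KiB */
        return 0;
}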

>
> If this causes a regression, we should revert it again. But let's try.

Of course. Fingers crossed.
  
Yosry Ahmed July 25, 2023, 7:25 p.m. UTC | #6
On Tue, Jul 25, 2023 at 10:43 AM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Tue, 25 Jul 2023 10:04:35 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> >
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> >
> > But I want to be clear: this isn't a regression fix. It's a new
> > performance optimization for the deprecated cgroup1 code. And it comes
> > at the cost of higher memory footprint for both cgroup1 AND cgroup2.
>
> Thanks.
>
> Yosry, could you please send along a suitably modified changelog?

I will send a v2 once we agree on the history story and whether or not
this fixes a regression.

Thanks!
  
Johannes Weiner July 25, 2023, 8:18 p.m. UTC | #7
On Tue, Jul 25, 2023 at 12:24:19PM -0700, Yosry Ahmed wrote:
> On Tue, Jul 25, 2023 at 7:04 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > We used to maintain *all* stats in per-cpu counters at the local
> > level. memory.stat reads would have to iterate and aggregate the
> > entire subtree every time. This was obviously very costly, so we added
> > batched upward propagation during stat updates to simplify reads:
> >
> > commit 42a300353577ccc17ecc627b8570a89fa1678bec
> > Author: Johannes Weiner <hannes@cmpxchg.org>
> > Date:   Tue May 14 15:47:12 2019 -0700
> >
> >     mm: memcontrol: fix recursive statistics correctness & scalabilty
> >
> > However, that caused a regression in the stat write path, as the
> > upward propagation would bottleneck on the cachelines in the shared
> > parents. The fix for *that* re-introduced the per-cpu loops in the
> > local stat reads:
> >
> > commit 815744d75152078cde5391fc1e3c2d4424323fb6
> > Author: Johannes Weiner <hannes@cmpxchg.org>
> > Date:   Thu Jun 13 15:55:46 2019 -0700
> >
> >     mm: memcontrol: don't batch updates of local VM stats and events
> >
> > So I wouldn't say it's a regression from rstat. Except for that short
> > period between the two commits above, the read side for local stats
> > was always expensive.
> 
> I was comparing from a 4.15 kernel, so I assumed the major change was
> from rstat, but that was not accurate. Thanks for the history.
> 
> However, in that 4.15 kernel the local (non-hierarchical) stats were
> readily available without iterating percpu counters. There is a
> regression that was introduced somewhere.
> 
> Looking at the history you described, it seems like up until
> 815744d75152 we used to maintain "local" (aka non-hierarchical)
> counters, so reading local stats was reading one counter, and starting
> 815744d75152 we started having to loop percpu counters for that.
> 
> So it is not a regression of rstat, but seemingly it is a regression
> of 815744d75152. Is my understanding incorrect?

Yes, it actually goes back further. Bear with me.

For the longest time, it used to be local per-cpu counters. Every
memory.stat read had to do nr_memcg * nr_cpu aggregation. You can
imagine that this didn't scale in production.

We added local atomics and turned the per-cpu counters into buffers:

commit a983b5ebee57209c99f68c8327072f25e0e6e3da
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Wed Jan 31 16:16:45 2018 -0800

    mm: memcontrol: fix excessive complexity in memory.stat reporting

Local counts became a simple atomic_read(), but the hierarchy counts
would still have to aggregate nr_memcg counters.

That was of course still too much read-side complexity, so we switched
to batched upward propagation during the stat updates:

commit 42a300353577ccc17ecc627b8570a89fa1678bec
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Tue May 14 15:47:12 2019 -0700

    mm: memcontrol: fix recursive statistics correctness & scalabilty

This gave us two atomics at each level: one for local and one for
hierarchical stats.

However, that went too far the other direction: too many counters
touched during stat updates, and we got a regression report over memcg
cacheline contention during MM workloads. Instead of backing out
42a300353 - since all the previous versions were terrible too - we
dropped write-side aggregation of *only* the local counters:

commit 815744d75152078cde5391fc1e3c2d4424323fb6
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Thu Jun 13 15:55:46 2019 -0700

    mm: memcontrol: don't batch updates of local VM stats and events

In effect, this kept all the stat optimizations for cgroup2 (which
doesn't have local counters), and reverted cgroup1 back to how it was
for the longest time: on-demand aggregated per-cpu counters.

For about a year, cgroup1 didn't have to per-cpu the local stats on
read. But for the recursive stats, it would either still have to do
subtree aggregation on read, or too much upward flushing on write.

So if I had to blame one commit for a cgroup1 regression, it would
probably be 815744d. But it's kind of a stretch to say that it worked
well before that commit.

For the changelog, maybe just say that there was a lot of back and
forth between read-side aggregation and write-side aggregation. Since
with rstat we now have efficient read-side aggregation, attempt a
conceptual revert of 815744d.

> > But I want to be clear: this isn't a regression fix. It's a new
> > performance optimization for the deprecated cgroup1 code. And it comes
> > at the cost of higher memory footprint for both cgroup1 AND cgroup2.
> 
> I still think it is, but I can easily be wrong. I am hoping that the
> memory footprint is not a problem. There are *roughly* 80 per-memcg
> stats/events (MEMCG_NR_STAT + NR_MEMCG_EVENTS) and 55 per-lruvec stats
> (NR_VM_NODE_STAT_ITEMS). For each stat there is an extra 8 bytes, so
> on a two-node machine that's  8 * (80 + 55 * 2) ~= 1.5 KiB per memcg.
> 
> Given that struct mem_cgroup is already large, and can easily be 100s
> of KiBs on a large machine with many cpus, I hope there won't be a
> noticeable regression.

Yes, the concern wasn't so much the memory consumption but the
cachelines touched during hot paths.

However, that was mostly because we either had a) write-side flushing,
which is extremely hot during MM stress, or b) read-side flushing with
huuuge cgroup subtrees due to zombie cgroups. A small cacheline
difference would be enormously amplified by these factors.

Rstat is very good at doing selective subtree flushing on reads, so
the big coefficients from a) and b) are no longer such a big concern.
A slightly bigger cache footprint is probably going to be okay.
  
Yosry Ahmed July 25, 2023, 10 p.m. UTC | #8
On Tue, Jul 25, 2023 at 1:18 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Jul 25, 2023 at 12:24:19PM -0700, Yosry Ahmed wrote:
> > On Tue, Jul 25, 2023 at 7:04 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > We used to maintain *all* stats in per-cpu counters at the local
> > > level. memory.stat reads would have to iterate and aggregate the
> > > entire subtree every time. This was obviously very costly, so we added
> > > batched upward propagation during stat updates to simplify reads:
> > >
> > > commit 42a300353577ccc17ecc627b8570a89fa1678bec
> > > Author: Johannes Weiner <hannes@cmpxchg.org>
> > > Date:   Tue May 14 15:47:12 2019 -0700
> > >
> > >     mm: memcontrol: fix recursive statistics correctness & scalabilty
> > >
> > > However, that caused a regression in the stat write path, as the
> > > upward propagation would bottleneck on the cachelines in the shared
> > > parents. The fix for *that* re-introduced the per-cpu loops in the
> > > local stat reads:
> > >
> > > commit 815744d75152078cde5391fc1e3c2d4424323fb6
> > > Author: Johannes Weiner <hannes@cmpxchg.org>
> > > Date:   Thu Jun 13 15:55:46 2019 -0700
> > >
> > >     mm: memcontrol: don't batch updates of local VM stats and events
> > >
> > > So I wouldn't say it's a regression from rstat. Except for that short
> > > period between the two commits above, the read side for local stats
> > > was always expensive.
> >
> > I was comparing from a 4.15 kernel, so I assumed the major change was
> > from rstat, but that was not accurate. Thanks for the history.
> >
> > However, in that 4.15 kernel the local (non-hierarchical) stats were
> > readily available without iterating percpu counters. There is a
> > regression that was introduced somewhere.
> >
> > Looking at the history you described, it seems like up until
> > 815744d75152 we used to maintain "local" (aka non-hierarchical)
> > counters, so reading local stats was reading one counter, and starting
> > 815744d75152 we started having to loop percpu counters for that.
> >
> > So it is not a regression of rstat, but seemingly it is a regression
> > of 815744d75152. Is my understanding incorrect?
>
> Yes, it actually goes back further. Bear with me.
>
> For the longest time, it used to be local per-cpu counters. Every
> memory.stat read had to do nr_memcg * nr_cpu aggregation. You can
> imagine that this didn't scale in production.
>
> We added local atomics and turned the per-cpu counters into buffers:
>
> commit a983b5ebee57209c99f68c8327072f25e0e6e3da
> Author: Johannes Weiner <hannes@cmpxchg.org>
> Date:   Wed Jan 31 16:16:45 2018 -0800
>
>     mm: memcontrol: fix excessive complexity in memory.stat reporting
>
> Local counts became a simple atomic_read(), but the hierarchy counts
> would still have to aggregate nr_memcg counters.
>
> That was of course still too much read-side complexity, so we switched
> to batched upward propagation during the stat updates:
>
> commit 42a300353577ccc17ecc627b8570a89fa1678bec
> Author: Johannes Weiner <hannes@cmpxchg.org>
> Date:   Tue May 14 15:47:12 2019 -0700
>
>     mm: memcontrol: fix recursive statistics correctness & scalabilty
>
> This gave us two atomics at each level: one for local and one for
> hierarchical stats.
>
> However, that went too far the other direction: too many counters
> touched during stat updates, and we got a regression report over memcg
> cacheline contention during MM workloads. Instead of backing out
> 42a300353 - since all the previous versions were terrible too - we
> dropped write-side aggregation of *only* the local counters:
>
> commit 815744d75152078cde5391fc1e3c2d4424323fb6
> Author: Johannes Weiner <hannes@cmpxchg.org>
> Date:   Thu Jun 13 15:55:46 2019 -0700
>
>     mm: memcontrol: don't batch updates of local VM stats and events
>
> In effect, this kept all the stat optimizations for cgroup2 (which
> doesn't have local counters), and reverted cgroup1 back to how it was
> for the longest time: on-demand aggregated per-cpu counters.
>
> For about a year, cgroup1 didn't have to per-cpu the local stats on
> read. But for the recursive stats, it would either still have to do
> subtree aggregation on read, or too much upward flushing on write.
>
> So if I had to blame one commit for a cgroup1 regression, it would
> probably be 815744d. But it's kind of a stretch to say that it worked
> well before that commit.
>
> For the changelog, maybe just say that there was a lot of back and
> forth between read-side aggregation and write-side aggregation. Since
> with rstat we now have efficient read-side aggregation, attempt a
> conceptual revert of 815744d.

Now that's a much more complete picture. Thanks a lot for all the
history, now it makes much more sense. I wouldn't blame 815744d then,
as you said it's better framed as a conceptual revert of it. I will
rewrite the commit log accordingly and send a v2.

Thanks!

>
> > > But I want to be clear: this isn't a regression fix. It's a new
> > > performance optimization for the deprecated cgroup1 code. And it comes
> > > at the cost of higher memory footprint for both cgroup1 AND cgroup2.
> >
> > I still think it is, but I can easily be wrong. I am hoping that the
> > memory footprint is not a problem. There are *roughly* 80 per-memcg
> > stats/events (MEMCG_NR_STAT + NR_MEMCG_EVENTS) and 55 per-lruvec stats
> > (NR_VM_NODE_STAT_ITEMS). For each stat there is an extra 8 bytes, so
> > on a two-node machine that's  8 * (80 + 55 * 2) ~= 1.5 KiB per memcg.
> >
> > Given that struct mem_cgroup is already large, and can easily be 100s
> > of KiBs on a large machine with many cpus, I hope there won't be a
> > noticeable regression.
>
> Yes, the concern wasn't so much the memory consumption but the
> cachelines touched during hot paths.
>
> However, that was mostly because we either had a) write-side flushing,
> which is extremely hot during MM stress, or b) read-side flushing with
> huuuge cgroup subtrees due to zombie cgroups. A small cacheline
> difference would be enormously amplified by these factors.
>
> Rstat is very good at doing selective subtree flushing on reads, so
> the big coefficients from a) and b) are no longer such a big concern.
> A slightly bigger cache footprint is probably going to be okay.

Agreed, maintaining the local counters with rstat is much easier than
the previous attempts. I will try to bake most of this into the commit
log.
  
Yosry Ahmed July 25, 2023, 11:58 p.m. UTC | #9
On Wed, Jul 19, 2023 at 10:46 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> Currently, memcg uses rstat to maintain hierarchical stats. The rstat
> framework keeps track of which cgroups have updates on which cpus.
>
> For non-hierarchical stats, as memcg moved to rstat, they are no longer
> readily available as counters. Instead, the percpu counters for a given
> stat need to be summed to get the non-hierarchical stat value. This
> causes a performance regression when reading non-hierarchical stats on
> kernels where memcg moved to using rstat. This is especially visible
> when reading memory.stat on cgroup v1. There are also some code paths
> internal to the kernel that read such non-hierarchical stats.
>
> It is inefficient to iterate and sum counters in all cpus when the rstat
> framework knows exactly when a percpu counter has an update. Instead,
> maintain cpu-aggregated non-hierarchical counters for each stat. During
> an rstat flush, keep those updated as well. When reading
> non-hierarchical stats, we no longer need to iterate cpus, we just need
> to read the maintained counters, similar to hierarchical stats.
>
> A caveat is that we now need a stats flush before reading
> local/non-hierarchical stats through {memcg/lruvec}_page_state_local()
> or memcg_events_local(), where we previously only needed a flush to
> read hierarchical stats. Most contexts reading non-hierarchical stats
> are already doing a flush; add a flush to the only missing context in
> count_shadow_nodes().
>
> With this patch, reading memory.stat from 1000 memcgs is 3x faster on a
> machine with 256 cpus on cgroup v1:
>  # for i in $(seq 1000); do mkdir /sys/fs/cgroup/memory/cg$i; done
>  # time cat /dev/cgroup/memory/cg*/memory.stat > /dev/null
>  real    0m0.125s
>  user    0m0.005s
>  sys     0m0.120s
>
> After:
>  real    0m0.032s
>  user    0m0.005s
>  sys     0m0.027s
>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---
>  include/linux/memcontrol.h |  7 ++++---
>  mm/memcontrol.c            | 32 +++++++++++++++++++-------------
>  mm/workingset.c            |  1 +
>  3 files changed, 24 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 5818af8eca5a..a9f2861a57a5 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -112,6 +112,9 @@ struct lruvec_stats {
>         /* Aggregated (CPU and subtree) state */
>         long state[NR_VM_NODE_STAT_ITEMS];
>
> +       /* Non-hierarchical (CPU aggregated) state */
> +       long state_local[NR_VM_NODE_STAT_ITEMS];
> +
>         /* Pending child counts during tree propagation */
>         long state_pending[NR_VM_NODE_STAT_ITEMS];
>  };
> @@ -1020,14 +1023,12 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>  {
>         struct mem_cgroup_per_node *pn;
>         long x = 0;
> -       int cpu;
>
>         if (mem_cgroup_disabled())
>                 return node_page_state(lruvec_pgdat(lruvec), idx);
>
>         pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> -       for_each_possible_cpu(cpu)
> -               x += per_cpu(pn->lruvec_stats_percpu->state[idx], cpu);
> +       x = READ_ONCE(pn->lruvec_stats.state_local[idx]);
>  #ifdef CONFIG_SMP
>         if (x < 0)
>                 x = 0;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e8ca4bdcb03c..90a22637818e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -742,6 +742,10 @@ struct memcg_vmstats {
>         long                    state[MEMCG_NR_STAT];
>         unsigned long           events[NR_MEMCG_EVENTS];
>
> +       /* Non-hierarchical (CPU aggregated) page state & events */
> +       long                    state_local[MEMCG_NR_STAT];
> +       unsigned long           events_local[NR_MEMCG_EVENTS];
> +
>         /* Pending child counts during tree propagation */
>         long                    state_pending[MEMCG_NR_STAT];
>         unsigned long           events_pending[NR_MEMCG_EVENTS];
> @@ -775,11 +779,8 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
>  /* idx can be of type enum memcg_stat_item or node_stat_item. */
>  static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
>  {
> -       long x = 0;
> -       int cpu;
> +       long x = READ_ONCE(memcg->vmstats->state_local[idx]);
>
> -       for_each_possible_cpu(cpu)
> -               x += per_cpu(memcg->vmstats_percpu->state[idx], cpu);
>  #ifdef CONFIG_SMP
>         if (x < 0)
>                 x = 0;
> @@ -926,16 +927,12 @@ static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
>
>  static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
>  {
> -       long x = 0;
> -       int cpu;
>         int index = memcg_events_index(event);
>
>         if (index < 0)
>                 return 0;
>
> -       for_each_possible_cpu(cpu)
> -               x += per_cpu(memcg->vmstats_percpu->events[index], cpu);
> -       return x;
> +       return READ_ONCE(memcg->vmstats->events_local[index]);
>  }
>
>  static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
> @@ -5526,7 +5523,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>         struct mem_cgroup *memcg = mem_cgroup_from_css(css);
>         struct mem_cgroup *parent = parent_mem_cgroup(memcg);
>         struct memcg_vmstats_percpu *statc;
> -       long delta, v;
> +       long delta, delta_cpu, v;
>         int i, nid;
>
>         statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
> @@ -5542,9 +5539,11 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>                         memcg->vmstats->state_pending[i] = 0;
>
>                 /* Add CPU changes on this level since the last flush */
> +               delta_cpu = 0;
>                 v = READ_ONCE(statc->state[i]);
>                 if (v != statc->state_prev[i]) {
> -                       delta += v - statc->state_prev[i];
> +                       delta_cpu = v - statc->state_prev[i];
> +                       delta += delta_cpu;
>                         statc->state_prev[i] = v;
>                 }
>
> @@ -5553,6 +5552,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>
>                 /* Aggregate counts on this level and propagate upwards */
>                 memcg->vmstats->state[i] += delta;
> +               memcg->vmstats->state_local[i] += delta_cpu;

I ran this through more testing. There is a subtle problem here. If
delta == 0 and delta_cpu != 0, we will skip the update to the local
stats. This happens in the very unlikely case where the delta on the
flushed cpu is equal in value but of opposite sign to the delta coming
from the children. IOW if (statc->state[i] - statc->state_prev[i]) ==
-memcg->vmstats->state_pending[i].

Very unlikely but I happened to stumble upon it. Will fix this for v2.
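
Something like the following reshuffle of the flush loop would avoid it
(just a sketch, the actual v2 fix may end up looking different):

        /* Add CPU changes on this level since the last flush */
        delta_cpu = 0;
        v = READ_ONCE(statc->state[i]);
        if (v != statc->state_prev[i]) {
                delta_cpu = v - statc->state_prev[i];
                delta += delta_cpu;
                statc->state_prev[i] = v;
        }

        /*
         * Update the local counter before the !delta short-circuit, so a
         * zero combined delta cannot hide a non-zero per-cpu delta.
         */
        if (delta_cpu)
                memcg->vmstats->state_local[i] += delta_cpu;

        if (!delta)
                continue;

        /* Aggregate counts on this level and propagate upwards */
        memcg->vmstats->state[i] += delta;
        if (parent)
                parent->vmstats->state_pending[i] += delta;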


>                 if (parent)
>                         parent->vmstats->state_pending[i] += delta;
>         }
> @@ -5562,9 +5562,11 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>                 if (delta)
>                         memcg->vmstats->events_pending[i] = 0;
>
> +               delta_cpu = 0;
>                 v = READ_ONCE(statc->events[i]);
>                 if (v != statc->events_prev[i]) {
> -                       delta += v - statc->events_prev[i];
> +                       delta_cpu = v - statc->events_prev[i];
> +                       delta += delta_cpu;
>                         statc->events_prev[i] = v;
>                 }
>
> @@ -5572,6 +5574,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>                         continue;
>
>                 memcg->vmstats->events[i] += delta;
> +               memcg->vmstats->events_local[i] += delta_cpu;
>                 if (parent)
>                         parent->vmstats->events_pending[i] += delta;
>         }
> @@ -5591,9 +5594,11 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>                         if (delta)
>                                 pn->lruvec_stats.state_pending[i] = 0;
>
> +                       delta_cpu = 0;
>                         v = READ_ONCE(lstatc->state[i]);
>                         if (v != lstatc->state_prev[i]) {
> -                               delta += v - lstatc->state_prev[i];
> +                               delta_cpu = v - lstatc->state_prev[i];
> +                               delta += delta_cpu;
>                                 lstatc->state_prev[i] = v;
>                         }
>
> @@ -5601,6 +5606,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>                                 continue;
>
>                         pn->lruvec_stats.state[i] += delta;
> +                       pn->lruvec_stats.state_local[i] += delta_cpu;
>                         if (ppn)
>                                 ppn->lruvec_stats.state_pending[i] += delta;
>                 }
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 4686ae363000..da58a26d0d4d 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -664,6 +664,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
>                 struct lruvec *lruvec;
>                 int i;
>
> +               mem_cgroup_flush_stats();
>                 lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
>                 for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
>                         pages += lruvec_page_state_local(lruvec,
> --
> 2.41.0.255.g8b1d071c50-goog
>
  

Patch

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5818af8eca5a..a9f2861a57a5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -112,6 +112,9 @@  struct lruvec_stats {
 	/* Aggregated (CPU and subtree) state */
 	long state[NR_VM_NODE_STAT_ITEMS];
 
+	/* Non-hierarchical (CPU aggregated) state */
+	long state_local[NR_VM_NODE_STAT_ITEMS];
+
 	/* Pending child counts during tree propagation */
 	long state_pending[NR_VM_NODE_STAT_ITEMS];
 };
@@ -1020,14 +1023,12 @@  static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 {
 	struct mem_cgroup_per_node *pn;
 	long x = 0;
-	int cpu;
 
 	if (mem_cgroup_disabled())
 		return node_page_state(lruvec_pgdat(lruvec), idx);
 
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
-	for_each_possible_cpu(cpu)
-		x += per_cpu(pn->lruvec_stats_percpu->state[idx], cpu);
+	x = READ_ONCE(pn->lruvec_stats.state_local[idx]);
 #ifdef CONFIG_SMP
 	if (x < 0)
 		x = 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e8ca4bdcb03c..90a22637818e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -742,6 +742,10 @@  struct memcg_vmstats {
 	long			state[MEMCG_NR_STAT];
 	unsigned long		events[NR_MEMCG_EVENTS];
 
+	/* Non-hierarchical (CPU aggregated) page state & events */
+	long			state_local[MEMCG_NR_STAT];
+	unsigned long		events_local[NR_MEMCG_EVENTS];
+
 	/* Pending child counts during tree propagation */
 	long			state_pending[MEMCG_NR_STAT];
 	unsigned long		events_pending[NR_MEMCG_EVENTS];
@@ -775,11 +779,8 @@  void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
 /* idx can be of type enum memcg_stat_item or node_stat_item. */
 static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
 {
-	long x = 0;
-	int cpu;
+	long x = READ_ONCE(memcg->vmstats->state_local[idx]);
 
-	for_each_possible_cpu(cpu)
-		x += per_cpu(memcg->vmstats_percpu->state[idx], cpu);
 #ifdef CONFIG_SMP
 	if (x < 0)
 		x = 0;
@@ -926,16 +927,12 @@  static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
 
 static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
 {
-	long x = 0;
-	int cpu;
 	int index = memcg_events_index(event);
 
 	if (index < 0)
 		return 0;
 
-	for_each_possible_cpu(cpu)
-		x += per_cpu(memcg->vmstats_percpu->events[index], cpu);
-	return x;
+	return READ_ONCE(memcg->vmstats->events_local[index]);
 }
 
 static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
@@ -5526,7 +5523,7 @@  static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
 	struct memcg_vmstats_percpu *statc;
-	long delta, v;
+	long delta, delta_cpu, v;
 	int i, nid;
 
 	statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
@@ -5542,9 +5539,11 @@  static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 			memcg->vmstats->state_pending[i] = 0;
 
 		/* Add CPU changes on this level since the last flush */
+		delta_cpu = 0;
 		v = READ_ONCE(statc->state[i]);
 		if (v != statc->state_prev[i]) {
-			delta += v - statc->state_prev[i];
+			delta_cpu = v - statc->state_prev[i];
+			delta += delta_cpu;
 			statc->state_prev[i] = v;
 		}
 
@@ -5553,6 +5552,7 @@  static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 
 		/* Aggregate counts on this level and propagate upwards */
 		memcg->vmstats->state[i] += delta;
+		memcg->vmstats->state_local[i] += delta_cpu;
 		if (parent)
 			parent->vmstats->state_pending[i] += delta;
 	}
@@ -5562,9 +5562,11 @@  static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 		if (delta)
 			memcg->vmstats->events_pending[i] = 0;
 
+		delta_cpu = 0;
 		v = READ_ONCE(statc->events[i]);
 		if (v != statc->events_prev[i]) {
-			delta += v - statc->events_prev[i];
+			delta_cpu = v - statc->events_prev[i];
+			delta += delta_cpu;
 			statc->events_prev[i] = v;
 		}
 
@@ -5572,6 +5574,7 @@  static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 			continue;
 
 		memcg->vmstats->events[i] += delta;
+		memcg->vmstats->events_local[i] += delta_cpu;
 		if (parent)
 			parent->vmstats->events_pending[i] += delta;
 	}
@@ -5591,9 +5594,11 @@  static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 			if (delta)
 				pn->lruvec_stats.state_pending[i] = 0;
 
+			delta_cpu = 0;
 			v = READ_ONCE(lstatc->state[i]);
 			if (v != lstatc->state_prev[i]) {
-				delta += v - lstatc->state_prev[i];
+				delta_cpu = v - lstatc->state_prev[i];
+				delta += delta_cpu;
 				lstatc->state_prev[i] = v;
 			}
 
@@ -5601,6 +5606,7 @@  static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 				continue;
 
 			pn->lruvec_stats.state[i] += delta;
+			pn->lruvec_stats.state_local[i] += delta_cpu;
 			if (ppn)
 				ppn->lruvec_stats.state_pending[i] += delta;
 		}
diff --git a/mm/workingset.c b/mm/workingset.c
index 4686ae363000..da58a26d0d4d 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -664,6 +664,7 @@  static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 		struct lruvec *lruvec;
 		int i;
 
+		mem_cgroup_flush_stats();
 		lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
 		for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
 			pages += lruvec_page_state_local(lruvec,