[mm-unstable,RFC,3/5] memcg: calculate root usage from global state

Message ID 20230403220337.443510-4-yosryahmed@google.com
State New
Series cgroup: eliminate atomic rstat

Commit Message

Yosry Ahmed April 3, 2023, 10:03 p.m. UTC
Currently, we approximate the root usage by adding the memcg stats for
anon, file, and conditionally swap (for memsw). To read the memcg stats
we need to invoke an rstat flush. rstat flushes can be expensive because
they scale with the number of CPUs and cgroups on the system.

mem_cgroup_usage() is called by memcg_check_events()->mem_cgroup_threshold()
with irqs disabled, so such an expensive operation with irqs disabled
can cause problems.

Instead, approximate the root usage from global state. This is not 100%
accurate, but the root usage has always been ill-defined anyway.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 mm/memcontrol.c | 24 +++++-------------------
 1 file changed, 5 insertions(+), 19 deletions(-)
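
For readers who want to see the arithmetic in isolation, the following is a minimal user-space sketch of the approximation described above. It is not the kernel code: `struct global_state`, its field names, and `root_usage_sketch()` are hypothetical stand-ins for the values the patch reads through global_node_page_state(), total_swap_pages, and get_nr_swap_pages().

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical user-space model of the global counters the patch reads.
 * The struct and its field names are assumptions for illustration only.
 */
struct global_state {
	unsigned long nr_file_pages;  /* models global_node_page_state(NR_FILE_PAGES) */
	unsigned long nr_anon_mapped; /* models global_node_page_state(NR_ANON_MAPPED) */
	long total_swap_pages;        /* models total_swap_pages */
	long nr_free_swap_pages;      /* models get_nr_swap_pages() */
};

/*
 * Approximate the root usage in pages: file + anon, plus the swap pages
 * currently in use when swap (memsw) accounting is requested.
 */
unsigned long root_usage_sketch(const struct global_state *gs, bool swap)
{
	unsigned long val = gs->nr_file_pages + gs->nr_anon_mapped;

	if (swap) {
		/* Sanity check for the model: free swap cannot exceed total swap. */
		assert(gs->nr_free_swap_pages <= gs->total_swap_pages);
		val += (unsigned long)(gs->total_swap_pages - gs->nr_free_swap_pages);
	}
	return val;
}
```

With 1000 file pages, 500 mapped anon pages, and 200 swap pages of which 50 are free, the sketch yields 1500 pages without swap and 1650 pages with swap. No per-cgroup state is consulted, which is why no rstat flush is needed on this path.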
  

Comments

Michal Koutný April 11, 2023, 12:53 p.m. UTC | #1
On Mon, Apr 03, 2023 at 10:03:35PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> Instead, approximate the root usage from global state. This is not 100%
> accurate, but the root usage has always been ill-defined anyway.

Technically, this approximation should be closer to truth because global
counters aren't subject to flushing "delay".

> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---
>  mm/memcontrol.c | 24 +++++-------------------
>  1 file changed, 5 insertions(+), 19 deletions(-)

But feel free to add
Reviewed-by: Michal Koutný <mkoutny@suse.com>
  
Yosry Ahmed April 11, 2023, 4:59 p.m. UTC | #2
On Tue, Apr 11, 2023 at 5:53 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Mon, Apr 03, 2023 at 10:03:35PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> > Instead, approximate the root usage from global state. This is not 100%
> > accurate, but the root usage has always been ill-defined anyway.
>
> Technically, this approximation should be closer to truth because global
> counters aren't subject to flushing "delay".

It is a tiny bit different when some pages are in swap, probably
because of swap slot caching and other swap specifics. At least in
cgroup v1, the swap uncharging and freeing of the underlying swap
entry may happen at different times. I think it practically doesn't
really matter though.

>
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > ---
> >  mm/memcontrol.c | 24 +++++-------------------
> >  1 file changed, 5 insertions(+), 19 deletions(-)
>
> But feel free to add
> Reviewed-by: Michal Koutný <mkoutny@suse.com>

Thanks!
>
  
Shakeel Butt April 20, 2023, 6:57 p.m. UTC | #3
On Mon, Apr 3, 2023 at 3:03 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> Currently, we approximate the root usage by adding the memcg stats for
> anon, file, and conditionally swap (for memsw). To read the memcg stats
> we need to invoke an rstat flush. rstat flushes can be expensive, they
> scale with the number of cpus and cgroups on the system.
>
> mem_cgroup_usage() is called by memcg_events()->mem_cgroup_threshold()
> with irqs disabled, so such an expensive operation with irqs disabled
> can cause problems.
>
> Instead, approximate the root usage from global state. This is not 100%
> accurate, but the root usage has always been ill-defined anyway.
>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

Acked-by: Shakeel Butt <shakeelb@google.com>
  

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bdd52fe9e7e4b..e7fe18c0c0ef2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3698,27 +3698,13 @@  static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
 
 	if (mem_cgroup_is_root(memcg)) {
 		/*
-		 * We can reach here from irq context through:
-		 * uncharge_batch()
-		 * |--memcg_check_events()
-		 *    |--mem_cgroup_threshold()
-		 *       |--__mem_cgroup_threshold()
-		 *          |--mem_cgroup_usage
-		 *
-		 * rstat flushing is an expensive operation that should not be
-		 * done from irq context; use stale stats in this case.
-		 * Arguably, usage threshold events are not reliable on the root
-		 * memcg anyway since its usage is ill-defined.
-		 *
-		 * Additionally, other call paths through memcg_check_events()
-		 * disable irqs, so make sure we are flushing stats atomically.
+		 * Approximate root's usage from global state. This isn't
+		 * perfect, but the root usage was always an approximation.
 		 */
-		if (in_task())
-			mem_cgroup_flush_stats_atomic();
-		val = memcg_page_state(memcg, NR_FILE_PAGES) +
-			memcg_page_state(memcg, NR_ANON_MAPPED);
+		val = global_node_page_state(NR_FILE_PAGES) +
+			global_node_page_state(NR_ANON_MAPPED);
 		if (swap)
-			val += memcg_page_state(memcg, MEMCG_SWAP);
+			val += total_swap_pages - get_nr_swap_pages();
 	} else {
 		if (!swap)
 			val = page_counter_read(&memcg->memory);