[v2,2/2] mm: memcg: normalize the value passed into memcg_rstat_updated()

Message ID 20230922175741.635002-3-yosryahmed@google.com
State New
Headers
Series mm: memcg: fix tracking of pending stats updates values |

Commit Message

Yosry Ahmed Sept. 22, 2023, 5:57 p.m. UTC
  memcg_rstat_updated() uses the value of the state update to keep track
of the magnitude of pending updates, so that we only do a stats flush
when it's worth the work. Most values passed into memcg_rstat_updated()
are in pages, however, a few of them are actually in bytes or KBs.

To put this into perspective, a 512 byte slab allocation today would
look the same as allocating 512 pages. This may result in premature
flushes, which means unnecessary work and latency.

Normalize all the state values passed into memcg_rstat_updated() to
pages. Round up non-zero sub-page to 1 page, because
memcg_rstat_updated() ignores 0 page updates.

Fixes: 5b3be698a872 ("memcg: better bounds on the memcg stats updates")
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 mm/memcontrol.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)
  

Comments

Johannes Weiner Oct. 3, 2023, 1:13 p.m. UTC | #1
On Fri, Sep 22, 2023 at 05:57:40PM +0000, Yosry Ahmed wrote:
> memcg_rstat_updated() uses the value of the state update to keep track
> of the magnitude of pending updates, so that we only do a stats flush
> when it's worth the work. Most values passed into memcg_rstat_updated()
> are in pages, however, a few of them are actually in bytes or KBs.
> 
> To put this into perspective, a 512 byte slab allocation today would
> look the same as allocating 512 pages. This may result in premature
> flushes, which means unnecessary work and latency.

Yikes.

I'm somewhat less concerned about the performance as I am about the
variance in flushing cost that could be quite difficult to pinpoint.
IMO this is a correctness fix and a code cleanup, not a performance
thing.

> Normalize all the state values passed into memcg_rstat_updated() to
> pages. Round up non-zero sub-page to 1 page, because
> memcg_rstat_updated() ignores 0 page updates.
> 
> Fixes: 5b3be698a872 ("memcg: better bounds on the memcg stats updates")
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
  
Yosry Ahmed Oct. 3, 2023, 3:53 p.m. UTC | #2
On Tue, Oct 3, 2023 at 6:13 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Sep 22, 2023 at 05:57:40PM +0000, Yosry Ahmed wrote:
> > memcg_rstat_updated() uses the value of the state update to keep track
> > of the magnitude of pending updates, so that we only do a stats flush
> > when it's worth the work. Most values passed into memcg_rstat_updated()
> > are in pages, however, a few of them are actually in bytes or KBs.
> >
> > To put this into perspective, a 512 byte slab allocation today would
> > look the same as allocating 512 pages. This may result in premature
> > flushes, which means unnecessary work and latency.
>
> Yikes.
>
> I'm somewhat less concerned about the performance as I am about the
> variance in flushing cost that could be quite difficult to pinpoint.
> IMO this is a correctness fix and a code cleanup, not a performance
> thing.

Agreed, the code right now has a subtle mistake.

>
> > Normalize all the state values passed into memcg_rstat_updated() to
> > pages. Round up non-zero sub-page to 1 page, because
> > memcg_rstat_updated() ignores 0 page updates.
> >
> > Fixes: 5b3be698a872 ("memcg: better bounds on the memcg stats updates")
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks for taking a look!
  
Michal Koutný Oct. 3, 2023, 6:22 p.m. UTC | #3
On Fri, Sep 22, 2023 at 05:57:40PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> memcg_rstat_updated() uses the value of the state update to keep track
> of the magnitude of pending updates, so that we only do a stats flush
> when it's worth the work. Most values passed into memcg_rstat_updated()
> are in pages, however, a few of them are actually in bytes or KBs.
> 
> To put this into perspective, a 512 byte slab allocation today would
> look the same as allocating 512 pages. This may result in premature
> flushes, which means unnecessary work and latency.
> 
> Normalize all the state values passed into memcg_rstat_updated() to
> pages.

I've dreamed about such normalization since error estimates were
introduced :-)

(As touched in the previous patch) it makes me wonder whether it makes
sense to add up state and event counters (apples and oranges).

Shouldn't with this approach events: a) have a separate counter, b)
wight with zero and rely on time-based flushing only?

Thanks,
Michal
  
Yosry Ahmed Oct. 3, 2023, 7:51 p.m. UTC | #4
On Tue, Oct 3, 2023 at 11:22 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Fri, Sep 22, 2023 at 05:57:40PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> > memcg_rstat_updated() uses the value of the state update to keep track
> > of the magnitude of pending updates, so that we only do a stats flush
> > when it's worth the work. Most values passed into memcg_rstat_updated()
> > are in pages, however, a few of them are actually in bytes or KBs.
> >
> > To put this into perspective, a 512 byte slab allocation today would
> > look the same as allocating 512 pages. This may result in premature
> > flushes, which means unnecessary work and latency.
> >
> > Normalize all the state values passed into memcg_rstat_updated() to
> > pages.
>
> I've dreamed about such normalization since error estimates were
> introduced :-)
>
> (As touched in the previous patch) it makes me wonder whether it makes
> sense to add up state and event counters (apples and oranges).

I conceptually agree that we are adding apples and oranges, but in
practice if you look at memcg_vm_event_stat, most stat updates
correspond to 1 page worth of something. I am guessing the implicit
assumption here is that event updates correspond roughly to page-sized
state updates. It's not perfect, but perhaps it's acceptable.

>
> Shouldn't with this approach events: a) have a separate counter, b)
> wight with zero and rely on time-based flushing only?

(a) If we have separate counters we need to eventually sum them to
figure out if the total magnitude of pending updates is worth flushing
anyway, right?
(b) I would be more nervous to rely on time-based flushing only as I
don't know how this can affect workloads.

>
> Thanks,
> Michal
  

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 308cc7353ef0..d1a322a75172 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -763,6 +763,22 @@  unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
 	return x;
 }
 
+static int memcg_page_state_unit(int item);
+
+/*
+ * Normalize the value passed into memcg_rstat_updated() to be in pages. Round
+ * up non-zero sub-page updates to 1 page as zero page updates are ignored.
+ */
+static int memcg_state_val_in_pages(int idx, int val)
+{
+	int unit = memcg_page_state_unit(idx);
+
+	if (!val || unit == PAGE_SIZE)
+		return val;
+	else
+		return max(val * unit / PAGE_SIZE, 1UL);
+}
+
 /**
  * __mod_memcg_state - update cgroup memory statistics
  * @memcg: the memory cgroup
@@ -775,7 +791,7 @@  void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
 		return;
 
 	__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
-	memcg_rstat_updated(memcg, val);
+	memcg_rstat_updated(memcg, memcg_state_val_in_pages(idx, val));
 }
 
 /* idx can be of type enum memcg_stat_item or node_stat_item. */
@@ -826,7 +842,7 @@  void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	/* Update lruvec */
 	__this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
 
-	memcg_rstat_updated(memcg, val);
+	memcg_rstat_updated(memcg, memcg_state_val_in_pages(idx, val));
 	memcg_stats_unlock();
 }