Message ID | 20230922175741.635002-2-yosryahmed@google.com |
---|---|
State | New |
Headers |
Subject: [PATCH v2 1/2] mm: memcg: refactor page state unit helpers
From: Yosry Ahmed <yosryahmed@google.com>
To: Andrew Morton <akpm@linux-foundation.org>, Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@kernel.org>, Roman Gushchin <roman.gushchin@linux.dev>, Muchun Song <muchun.song@linux.dev>, Michal Koutný <mkoutny@suse.com>, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Yosry Ahmed <yosryahmed@google.com>
Date: Fri, 22 Sep 2023 17:57:39 +0000
Message-ID: <20230922175741.635002-2-yosryahmed@google.com>
In-Reply-To: <20230922175741.635002-1-yosryahmed@google.com>
References: <20230922175741.635002-1-yosryahmed@google.com> |
Series | mm: memcg: fix tracking of pending stats updates values |
Commit Message
Yosry Ahmed
Sept. 22, 2023, 5:57 p.m. UTC
memcg_page_state_unit() is currently used to identify the unit of a
memcg state item so that all stats in memory.stat are in bytes. However,
it lies about the units of WORKINGSET_* stats. These stats actually
represent pages, but we present them to userspace as a scalar number of
events. In retrospect, maybe those stats should have been memcg "events"
rather than memcg "state".
In preparation for using memcg_page_state_unit() for other purposes that
need to know the truthful units of different stat items, break it down
into two helpers:
- memcg_page_state_unit() returns the actual unit of the item.
- memcg_page_state_output_unit() returns the unit used for output.
Use the latter instead of the former in memcg_page_state_output() and
lruvec_page_state_output(). While we are at it, let's show cgroup v1
some love and add memcg_page_state_local_output() for consistency.
No functional change intended.
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
mm/memcontrol.c | 44 +++++++++++++++++++++++++++++++++-----------
1 file changed, 33 insertions(+), 11 deletions(-)
Comments
On Fri, Sep 22, 2023 at 05:57:39PM +0000, Yosry Ahmed wrote:
> memcg_page_state_unit() is currently used to identify the unit of a
> memcg state item so that all stats in memory.stat are in bytes. However,
> it lies about the units of WORKINGSET_* stats. These stats actually
> represent pages, but we present them to userspace as a scalar number of
> events. In retrospect, maybe those stats should have been memcg "events"
> rather than memcg "state".
>
> In preparation for using memcg_page_state_unit() for other purposes that
> need to know the truthful units of different stat items, break it down
> into two helpers:
> - memcg_page_state_unit() returns the actual unit of the item.
> - memcg_page_state_output_unit() returns the unit used for output.
>
> Use the latter instead of the former in memcg_page_state_output() and
> lruvec_page_state_output(). While we are at it, let's show cgroup v1
> some love and add memcg_page_state_local_output() for consistency.
>
> No functional change intended.
>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

That's a nice cleanup in itself.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
On Fri, Sep 22, 2023 at 05:57:39PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> memcg_page_state_unit() is currently used to identify the unit of a
> memcg state item so that all stats in memory.stat are in bytes. However,
> it lies about the units of WORKINGSET_* stats. These stats actually
> represent pages, but we present them to userspace as a scalar number of
> events. In retrospect, maybe those stats should have been memcg "events"
> rather than memcg "state".

Why isn't it possible to move WORKINGSET_* stats under the events now?
(Instead of using internal and external units.)

Thanks,
Michal
On Tue, Oct 3, 2023 at 11:11 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Fri, Sep 22, 2023 at 05:57:39PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> > memcg_page_state_unit() is currently used to identify the unit of a
> > memcg state item so that all stats in memory.stat are in bytes. However,
> > it lies about the units of WORKINGSET_* stats. These stats actually
> > represent pages, but we present them to userspace as a scalar number of
> > events. In retrospect, maybe those stats should have been memcg "events"
> > rather than memcg "state".
>
> Why isn't it possible to move WORKINGSET_* stats under the events now?
> (Instead of using internal and external units.)

Those constants are shared with code outside of memcg, namely enum
node_stat_item and enum vm_event_item, and IIUC they are used
differently outside of memcg. Did I miss something?

>
> Thanks,
> Michal
On Tue, Oct 03, 2023 at 12:47:25PM -0700, Yosry Ahmed <yosryahmed@google.com> wrote:
> Those constants are shared with code outside of memcg, namely enum
> node_stat_item and enum vm_event_item, and IIUC they are used
> differently outside of memcg. Did I miss something?

The difference is not big, e.g.
	mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
could be
	__count_memcg_events(
		container_of(lruvec, struct mem_cgroup_per_node, lruvec)->memcg,
		WORKINGSET_ACTIVATE_BASE + type, delta
	);

Yes, it would mean transferring WORKINGSET_* items from enum
node_stat_item to enum vm_event_item.
IOW, I don't know what is the effective difference between
mod_memcg_lruvec_state() and count_memcg_events().
Is it per-memcg vs per-memcg-per-node resolution?
(Is _that_ read by workingset mechanism?)

Thanks,
Michal
On Wed, Oct 4, 2023 at 2:02 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Tue, Oct 03, 2023 at 12:47:25PM -0700, Yosry Ahmed <yosryahmed@google.com> wrote:
> > Those constants are shared with code outside of memcg, namely enum
> > node_stat_item and enum vm_event_item, and IIUC they are used
> > differently outside of memcg. Did I miss something?
>
> The difference is not big, e.g.
> 	mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
> could be
> 	__count_memcg_events(
> 		container_of(lruvec, struct mem_cgroup_per_node, lruvec)->memcg,
> 		WORKINGSET_ACTIVATE_BASE + type, delta
> 	);
>
> Yes, it would mean transferring WORKINGSET_* items from enum
> node_stat_item to enum vm_event_item.
> IOW, I don't know what is the effective difference between
> mod_memcg_lruvec_state() and count_memcg_events().
> Is it per-memcg vs per-memcg-per-node resolution?
> (Is _that_ read by workingset mechanism?)

Even if it is not read, I think it is exposed in memory.numa_stat,
right?

Outside of memcg code, if you look at vmstat_start(), you will see
that the items in enum vm_event_item are handled differently (see
all_vm_events()) when reading vmstat. I don't think we can just move
it, unfortunately.

>
> Thanks,
> Michal
On Wed, Oct 04, 2023 at 11:02:11AM +0200, Michal Koutný wrote:
> On Tue, Oct 03, 2023 at 12:47:25PM -0700, Yosry Ahmed <yosryahmed@google.com> wrote:
> > Those constants are shared with code outside of memcg, namely enum
> > node_stat_item and enum vm_event_item, and IIUC they are used
> > differently outside of memcg. Did I miss something?
>
> The difference is not big, e.g.
> 	mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
> could be
> 	__count_memcg_events(
> 		container_of(lruvec, struct mem_cgroup_per_node, lruvec)->memcg,
> 		WORKINGSET_ACTIVATE_BASE + type, delta
> 	);
>
> Yes, it would mean transferring WORKINGSET_* items from enum
> node_stat_item to enum vm_event_item.
> IOW, I don't know what is the effective difference between
> mod_memcg_lruvec_state() and count_memcg_events().
> Is it per-memcg vs per-memcg-per-node resolution?

Yes, it's because of node resolution which event counters generally
don't have. Some of the refault events influence node-local reclaim
decisions, see mm/vmscan.c::snapshot_refaults().

There are a few other event counters in the stat array that people
thought would be useful to have split out in
/sys/devices/system/node/nodeN/vmstat to understand numa behavior
better.

It's a bit messy.

Some events would be useful to move to 'stats' for the numa awareness,
such as the allocator stats and reclaim activity.

Some events would be useful to move to 'stats' for the numa awareness,
but don't have the zone resolution required by them, such as
kswapd/kcompactd wakeups.

Some events aren't numa specific, such as oom kills, drop_pagecache.
On Wed, Oct 04, 2023 at 02:36:19PM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
> Yes, it's because of node resolution which event counters generally
> don't have. Some of the refault events influence node-local reclaim
> decisions, see mm/vmscan.c::snapshot_refaults().
>
> There are a few other event counters in the stat array that people
> thought would be useful to have split out in
> /sys/devices/system/node/nodeN/vmstat to understand numa behavior
> better.
>
> It's a bit messy.
>
> Some events would be useful to move to 'stats' for the numa awareness,
> such as the allocator stats and reclaim activity.
>
> Some events would be useful to move to 'stats' for the numa awareness,
> but don't have the zone resolution required by them, such as
> kswapd/kcompactd wakeups.

Thanks for the enlightenment.

> Some events aren't numa specific, such as oom kills, drop_pagecache.

These are oddballs indeed. As with the normalization patchset these are
counted as PAGE_SIZE^W 1 error but they should rather be an infinite
error (to warrant a flush).

So my feedback to this series is:
- patch 1/2 -- creating two classes of units is consequence of unclarity
  between state and events (as in event=Δstate/Δt) and resolution
  (global vs per-node), so the better approach would be to tidy this up,
- patch 2/2 -- it could use the single unit class that exists, it'll
  bound the error of printed numbers afterall (and can be changed later
  depending on how it affects internal consumers).

My 0.02€,
Michal
On Thu, Oct 5, 2023 at 2:06 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Wed, Oct 04, 2023 at 02:36:19PM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > Yes, it's because of node resolution which event counters generally
> > don't have. Some of the refault events influence node-local reclaim
> > decisions, see mm/vmscan.c::snapshot_refaults().
> >
> > There are a few other event counters in the stat array that people
> > thought would be useful to have split out in
> > /sys/devices/system/node/nodeN/vmstat to understand numa behavior
> > better.
> >
> > It's a bit messy.
> >
> > Some events would be useful to move to 'stats' for the numa awareness,
> > such as the allocator stats and reclaim activity.
> >
> > Some events would be useful to move to 'stats' for the numa awareness,
> > but don't have the zone resolution required by them, such as
> > kswapd/kcompactd wakeups.
>
> Thanks for the enlightenment.
>
> > Some events aren't numa specific, such as oom kills, drop_pagecache.
>
> These are oddballs indeed. As with the normalization patchset these are
> counted as PAGE_SIZE^W 1 error but they should rather be an infinite
> error (to warrant a flush).
>
> So my feedback to this series is:
> - patch 1/2 -- creating two classes of units is consequence of unclarity
>   between state and events (as in event=Δstate/Δt) and resolution
>   (global vs per-node), so the better approach would be to tidy this up,

I am not really sure what you mean here. I understand that this series
fixes the unit normalization for state but leaves events out of it.
Looking at the event items tracked by memcg in memcg_vm_event_stat it
looks to me that most of them correspond roughly to a page's worth of
updates (all but the THP_* events). We don't track things like OOM_KILL
and DROP_PAGECACHE per memcg as far as I can tell. Do you mean that we
should add something similar to memcg_page_state_unit() for events as
well to get all of them right?

If yes, I think that should be easy to add, it would only special case
THP_* events. Alternatively, I can add a comment above the call to
memcg_rstat_updated() in __count_memcg_events() explaining why we don't
normalize the event count for now.

> - patch 2/2 -- it could use the single unit class that exists, it'll
>   bound the error of printed numbers afterall (and can be changed later
>   depending on how it affects internal consumers).

This will mean that WORKINGSET_* state will become more stale. We will
need 4096 as many updates as today to get a flush. These are used by
internal flushers (reclaim), and are exposed to userspace. I am not
sure we want to do that.

>
> My 0.02€,
> Michal
On Thu, Oct 05, 2023 at 02:31:03AM -0700, Yosry Ahmed <yosryahmed@google.com> wrote:
> I am not really sure what you mean here.

My "vision" is to treat WORKINGSET_ entries as events.
That would mean implementing per-node tracking for vm_event_item
(costlier?).
That would mean node_stat_item and vm_event_item being effectively
equal, so they could be merged in one.
That would be situation to come up with new classification based on use
cases (e.g. precision/timeliness requirements, state vs change
semantics).

(Do not take this as blocker of the patch 1/2, I rather used the
opportunity to discuss a greater possible cleanup.)

> We don't track things like OOM_KILL and DROP_PAGECACHE per memcg as
> far as I can tell.

Ah, good. (I forgot only subset of entries is relevant for memcgs.)

> This will mean that WORKINGSET_* state will become more stale. We will
> need 4096 as many updates as today to get a flush. These are used by
> internal flushers (reclaim), and are exposed to userspace. I am not
> sure we want to do that.

snapshot_refaults() doesn't seem to follow after flush
and
workingset_refault()'s flush doesn't seem to precede readers

Is the flush misplaced or have I overlooked something?
(If the former, it seems to work good enough even with the current
flushing heuristics :-))

Michal
On Thu, Oct 5, 2023 at 9:30 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Thu, Oct 05, 2023 at 02:31:03AM -0700, Yosry Ahmed <yosryahmed@google.com> wrote:
> > I am not really sure what you mean here.
>
> My "vision" is to treat WORKINGSET_ entries as events.
> That would mean implementing per-node tracking for vm_event_item
> (costlier?).
> That would mean node_stat_item and vm_event_item being effectively
> equal, so they could be merged in one.
> That would be situation to come up with new classification based on use
> cases (e.g. precision/timeliness requirements, state vs change
> semantics).
>
> (Do not take this as blocker of the patch 1/2, I rather used the
> opportunity to discuss a greater possible cleanup.)

Yeah ideally we can clean this up separately. I would be careful about
userspace exposure though. It seems like CONFIG_VM_EVENT_COUNTERS is
used to control tracking events and displaying them in vmstat, so
moving items between node_stat_item and vm_event_item (or merging
them) won't be easy.

> > We don't track things like OOM_KILL and DROP_PAGECACHE per memcg as
> > far as I can tell.
>
> Ah, good. (I forgot only subset of entries is relevant for memcgs.)
>
> > This will mean that WORKINGSET_* state will become more stale. We will
> > need 4096 as many updates as today to get a flush. These are used by
> > internal flushers (reclaim), and are exposed to userspace. I am not
> > sure we want to do that.
>
> snapshot_refaults() doesn't seem to follow after flush
> and
> workingset_refault()'s flush doesn't seem to precede readers
>
> Is the flush misplaced or have I overlooked something?
> (If the former, it seems to work good enough even with the current
> flushing heuristics :-))

We flush in prepare_scan_count() before reading WORKINGSET_ACTIVATE_*
state. That flush also implicitly precedes every call to
snapshot_refaults(), which is unclear and not so robust, but we are
effectively flushing before snapshot_refaults() too.

>
> Michal
On Thu, 5 Oct 2023 10:30:36 -0700 Yosry Ahmed <yosryahmed@google.com> wrote:
> On Thu, Oct 5, 2023 at 9:30 AM Michal Koutný <mkoutny@suse.com> wrote:
> >
> > On Thu, Oct 05, 2023 at 02:31:03AM -0700, Yosry Ahmed <yosryahmed@google.com> wrote:
> > > I am not really sure what you mean here.
> >
> > My "vision" is to treat WORKINGSET_ entries as events.
> > That would mean implementing per-node tracking for vm_event_item
> > (costlier?).
> > That would mean node_stat_item and vm_event_item being effectively
> > equal, so they could be merged in one.
> > That would be situation to come up with new classification based on use
> > cases (e.g. precision/timeliness requirements, state vs change
> > semantics).
> >
> > (Do not take this as blocker of the patch 1/2, I rather used the
> > opportunity to discuss a greater possible cleanup.)
>
> Yeah ideally we can clean this up separately. I would be careful about
> userspace exposure though. It seems like CONFIG_VM_EVENT_COUNTERS is
> used to control tracking events and displaying them in vmstat, so
> moving items between node_stat_item and vm_event_item (or merging
> them) won't be easy.

I like the word "separately". This series has been in mm-unstable for
nearly a month, so I'll move it into mm-stable as-is,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 927c64d3cbcb..308cc7353ef0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1535,7 +1535,7 @@ static const struct memory_stat memory_stats[] = {
 	{ "workingset_nodereclaim",	WORKINGSET_NODERECLAIM },
 };
 
-/* Translate stat items to the correct unit for memory.stat output */
+/* The actual unit of the state item, not the same as the output unit */
 static int memcg_page_state_unit(int item)
 {
 	switch (item) {
@@ -1543,6 +1543,22 @@ static int memcg_page_state_unit(int item)
 	case MEMCG_ZSWAP_B:
 	case NR_SLAB_RECLAIMABLE_B:
 	case NR_SLAB_UNRECLAIMABLE_B:
+		return 1;
+	case NR_KERNEL_STACK_KB:
+		return SZ_1K;
+	default:
+		return PAGE_SIZE;
+	}
+}
+
+/* Translate stat items to the correct unit for memory.stat output */
+static int memcg_page_state_output_unit(int item)
+{
+	/*
+	 * Workingset state is actually in pages, but we export it to userspace
+	 * as a scalar count of events, so special case it here.
+	 */
+	switch (item) {
 	case WORKINGSET_REFAULT_ANON:
 	case WORKINGSET_REFAULT_FILE:
 	case WORKINGSET_ACTIVATE_ANON:
@@ -1551,17 +1567,23 @@ static int memcg_page_state_unit(int item)
 	case WORKINGSET_RESTORE_FILE:
 	case WORKINGSET_NODERECLAIM:
 		return 1;
-	case NR_KERNEL_STACK_KB:
-		return SZ_1K;
 	default:
-		return PAGE_SIZE;
+		return memcg_page_state_unit(item);
 	}
 }
 
 static inline unsigned long memcg_page_state_output(struct mem_cgroup *memcg,
 						    int item)
 {
-	return memcg_page_state(memcg, item) * memcg_page_state_unit(item);
+	return memcg_page_state(memcg, item) *
+		memcg_page_state_output_unit(item);
+}
+
+static inline unsigned long memcg_page_state_local_output(
+		struct mem_cgroup *memcg, int item)
+{
+	return memcg_page_state_local(memcg, item) *
+		memcg_page_state_output_unit(item);
 }
 
 static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
@@ -4106,9 +4128,8 @@ static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
 		unsigned long nr;
 
-		nr = memcg_page_state_local(memcg, memcg1_stats[i]);
-		seq_buf_printf(s, "%s %lu\n", memcg1_stat_names[i],
-			       nr * memcg_page_state_unit(memcg1_stats[i]));
+		nr = memcg_page_state_local_output(memcg, memcg1_stats[i]);
+		seq_buf_printf(s, "%s %lu\n", memcg1_stat_names[i], nr);
 	}
 
 	for (i = 0; i < ARRAY_SIZE(memcg1_events); i++)
@@ -4134,9 +4155,9 @@ static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
 		unsigned long nr;
 
-		nr = memcg_page_state(memcg, memcg1_stats[i]);
+		nr = memcg_page_state_output(memcg, memcg1_stats[i]);
 		seq_buf_printf(s, "total_%s %llu\n", memcg1_stat_names[i],
-			       (u64)nr * memcg_page_state_unit(memcg1_stats[i]));
+			       (u64)nr);
 	}
 
 	for (i = 0; i < ARRAY_SIZE(memcg1_events); i++)
@@ -6614,7 +6635,8 @@ static int memory_stat_show(struct seq_file *m, void *v)
 static inline unsigned long lruvec_page_state_output(struct lruvec *lruvec,
 						     int item)
 {
-	return lruvec_page_state(lruvec, item) * memcg_page_state_unit(item);
+	return lruvec_page_state(lruvec, item) *
+		memcg_page_state_output_unit(item);
 }
 
 static int memory_numa_stat_show(struct seq_file *m, void *v)