Message ID | 20221203011120.2361610-1-almasrymina@google.com |
---|---|
State | New |
Series | [v1,mm-unstable] mm: Fix memcg reclaim on memory tiered systems |
Commit Message
Mina Almasry
Dec. 3, 2022, 1:11 a.m. UTC
commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
reclaim"") enabled demotion in memcg reclaim, which is the right thing
to do; however, I suspect it introduced a regression in the behavior of
try_to_free_mem_cgroup_pages().
The callers of try_to_free_mem_cgroup_pages() expect it to attempt to
reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage
of the cgroup should reduce by nr_pages. The callers expect
try_to_free_mem_cgroup_pages() to also return the number of pages
reclaimed, not demoted.
However, try_to_free_mem_cgroup_pages() actually counts demoted pages
as reclaimed pages, unconditionally. So in practice, when it is called,
it will often demote nr_pages and return the number of demoted pages to
the caller. Demoted pages don't lower the memcg usage, so I think
try_to_free_mem_cgroup_pages() is not actually doing what its callers
want it to do.
I suspect various things work suboptimally on memory tiered systems, or
don't work at all, due to this:
- memory.high enforcement likely doesn't work (it just demotes nr_pages
instead of lowering the memcg usage by nr_pages).
- try_charge_memcg() will keep retrying the charge while
try_to_free_mem_cgroup_pages() is just demoting pages and not actually
making any room for the charge.
- memory.reclaim has a wonky interface. It advertises to the user that
it reclaims the provided amount, but it will actually demote that amount.
There may be other effects of this issue.
To fix these issues, I propose that shrink_folio_list() only count pages
demoted from inside of sc->nodemask to outside of sc->nodemask as
'reclaimed'.
For callers such as reclaim_high() or try_charge_memcg() that set
sc->nodemask to NULL, try_to_free_mem_cgroup_pages() will try to
actually reclaim nr_pages and return the number of pages reclaimed. No
demoted pages would count towards the nr_pages requirement.
For callers such as memory_reclaim() that set sc->nodemask,
try_to_free_mem_cgroup_pages() will free nr_pages from that nodemask
with either reclaim or demotion.
I tested this change using the memory.reclaim interface. With this
change,
echo "1m" > memory.reclaim
will free 1m of memory from the cgroup regardless of the demotions
happening inside.
echo "1m nodes=0" > memory.reclaim
will free 1m of node 0 by demotion if a demotion target is available,
and by reclaim if no demotion target is available.
Signed-off-by: Mina Almasry <almasrymina@google.com>
---
This is developed on top of mm-unstable largely because I need the
memory.reclaim nodes= arg to test it properly.
---
mm/vmscan.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2b42ac9ad755..8f6e993b870d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1653,6 +1653,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	LIST_HEAD(free_folios);
 	LIST_HEAD(demote_folios);
 	unsigned int nr_reclaimed = 0;
+	unsigned int nr_demoted = 0;
 	unsigned int pgactivate = 0;
 	bool do_demote_pass;
 	struct swap_iocb *plug = NULL;
@@ -2085,7 +2086,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	/* 'folio_list' is always empty here */

 	/* Migrate folios selected for demotion */
-	nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
+	nr_demoted = demote_folio_list(&demote_folios, pgdat);
+
+	/*
+	 * Only count demoted folios as reclaimed if we demoted them from
+	 * inside of the nodemask to outside of the nodemask, hence reclaiming
+	 * pages in the nodemask.
+	 */
+	if (sc->nodemask && node_isset(pgdat->node_id, *sc->nodemask) &&
+	    !node_isset(next_demotion_node(pgdat->node_id), *sc->nodemask))
+		nr_reclaimed += nr_demoted;
+
 	/* Folios that could not be demoted are still in @demote_folios */
 	if (!list_empty(&demote_folios)) {
 		/* Folios which weren't demoted go back on @folio_list */
--
2.39.0.rc0.267.gcb52ba06e7-goog
Comments
On Fri, Dec 2, 2022 at 5:11 PM Mina Almasry <almasrymina@google.com> wrote:
[...]
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2b42ac9ad755..8f6e993b870d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1653,6 +1653,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  	LIST_HEAD(free_folios);
>  	LIST_HEAD(demote_folios);
>  	unsigned int nr_reclaimed = 0;
> +	unsigned int nr_demoted = 0;
>  	unsigned int pgactivate = 0;
>  	bool do_demote_pass;
>  	struct swap_iocb *plug = NULL;
> @@ -2085,7 +2086,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  	/* 'folio_list' is always empty here */
>
>  	/* Migrate folios selected for demotion */
> -	nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
> +	nr_demoted = demote_folio_list(&demote_folios, pgdat);
> +
> +	/*
> +	 * Only count demoted folios as reclaimed if we demoted them from
> +	 * inside of the nodemask to outside of the nodemask, hence reclaiming
> +	 * pages in the nodemask.
> +	 */
> +	if (sc->nodemask && node_isset(pgdat->node_id, *sc->nodemask) &&
> +	    !node_isset(next_demotion_node(pgdat->node_id), *sc->nodemask))

next_demotion_node() is just the first demotion target node. Demotion
can fall back to other allowed target nodes returned by
node_get_allowed_targets(). When the page is demoted to a fallback
node and this fallback node is in sc->nodemask, nr_demoted should not
be added into nr_reclaimed, either.

One way to address this issue is to pass sc->nodemask into
demote_folio_list() and exclude sc->nodemask from the allowed target
demotion nodes.

> +		nr_reclaimed += nr_demoted;
> +
>  	/* Folios that could not be demoted are still in @demote_folios */
>  	if (!list_empty(&demote_folios)) {
>  		/* Folios which weren't demoted go back on @folio_list */
> --
> 2.39.0.rc0.267.gcb52ba06e7-goog
On Fri, Dec 2, 2022 at 8:14 PM Wei Xu <weixugc@google.com> wrote:
[...]
> next_demotion_node() is just the first demotion target node. Demotion
> can fall back to other allowed target nodes returned by
> node_get_allowed_targets(). When the page is demoted to a fallback
> node and this fallback node is in sc->nodemask, nr_demoted should not
> be added into nr_reclaimed, either.

Thanks for reviewing Wei, I did indeed miss this.

> One way to address this issue is to pass sc->nodemask into
> demote_folio_list() and exclude sc->nodemask from the allowed target
> demotion nodes.

This makes sense to me. Applied this change and uploaded v2:
https://lore.kernel.org/linux-mm/20221204093008.2620459-1-almasrymina@google.com/T/#u
Mina Almasry <almasrymina@google.com> writes:
[...]
> For callers such as reclaim_high() or try_charge_memcg() that set
> sc->nodemask to NULL, try_to_free_mem_cgroup_pages() will try to
> actually reclaim nr_pages and return the number of pages reclaimed. No
> demoted pages would count towards the nr_pages requirement.
>
> For callers such as memory_reclaim() that set sc->nodemask,
> try_to_free_mem_cgroup_pages() will free nr_pages from that nodemask
> with either reclaim or demotion.

Have you checked all callers? For example, IIUC, in
reclaim_clean_pages_from_list(), although sc.nodemask == NULL, the
demoted pages should be counted as reclaimed.

How about counting both "demoted" and "reclaimed" in struct
scan_control, and letting the callers determine how to use the numbers?

> Tested this change using memory.reclaim interface. With this change,
>
> echo "1m" > memory.reclaim
> [...]
> echo "1m nodes=0" > memory.reclaim

Have you tested these tests on the original kernel? If so, does the
issue you suspected above occur during testing?

Best Regards,
Huang, Ying
Wei Xu <weixugc@google.com> writes:
[...]
> One way to address this issue is to pass sc->nodemask into
> demote_folio_list() and exclude sc->nodemask from the allowed target
> demotion nodes.

I don't think this is a good idea, because it may break the
fast -> slow -> storage aging order. A warm page in a fast memory node
may be reclaimed to storage directly, instead of being demoted to the
slow memory node.

If necessary, we can account "nr_demoted" in alloc_demote_page() and a
to-be-added free_demote_page().

Best Regards,
Huang, Ying
On Sun, Dec 4, 2022 at 6:39 PM Huang, Ying <ying.huang@intel.com> wrote:
[...]
> Have you checked all callers? For example, IIUC, in
> reclaim_clean_pages_from_list(), although sc.nodemask == NULL, the
> demoted pages should be counted as reclaimed.

I checked all call stacks leading to shrink_folio_list() now (at least
I hope). Here is what I think they do and how I propose to handle them:

- reclaim_clean_pages_from_list() & __node_reclaim() & balance_pgdat():
  These try to free memory from a specific node, and both demotion and
  reclaim from that node should be counted. I propose these calls set
  sc->nodemask = pgdat.node_id to signal to shrink_folio_list() that
  both demotion and reclaim from this node should be counted.

- try_to_free_pages(): Tries to free pages from a specific nodemask. It
  sets sc->nodemask to ac->nodemask. In this case pages demoted within
  the nodemask should not count. Pages demoted outside of the nodemask
  should count, which this patch already tries to do.

- mem_cgroup_shrink_node(): This is memcg soft limit reclaim. AFAIU only
  reclaim should be counted. It already sets sc->nodemask = NULL to
  indicate that it requires reclaim from all nodes and that only
  reclaimed memory should be counted, which this patch already tries to
  do.

- try_to_free_mem_cgroup_pages(): This is covered in the commit message.
  Many callers set nodemask = NULL, indicating they want reclaim and
  demotion should not count. memory.reclaim sets the nodemask depending
  on the 'nodes=' arg and wants demotion and reclaim from that nodemask.

- reclaim_folio_list(): Sets no_demotion = 1. No ambiguity here; it only
  reclaims and counts reclaimed pages.

If agreeable, I can fix the reclaim_clean_pages_from_list() &
__node_reclaim() & balance_pgdat() call sites in v3.

> How about counting both "demoted" and "reclaimed" in struct
> scan_control, and letting the callers determine how to use the numbers?

I don't think this is by itself enough. Pages demoted between 2 nodes
that are both in sc->nodemask should not count, I think. So 'demoted'
needs to be specifically pages demoted outside of the nodemask. We can
do 2 things:

1. Only allow the kernel to demote outside the nodemask (which you
   don't prefer).
2. Allow the kernel to demote inside the nodemask but not count the
   demoted pages.

I will see if I can implement #2.

> Have you tested these tests on the original kernel? If so, does the
> issue you suspected above occur during testing?

Yes. I set up a test case where I allocate 500m in a cgroup, and then do:

echo "50m" > memory.reclaim

Without my fix, my kernel demotes 70mb and reclaims 4mb. With my v1
fix, my kernel demotes all memory possible and reclaims 60mb. I will
add this to the commit message in the next version.
Mina Almasry <almasrymina@google.com> writes:

> On Sun, Dec 4, 2022 at 6:39 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Mina Almasry <almasrymina@google.com> writes:
>>
>> > commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg reclaim"") enabled demotion in memcg reclaim, which is the right thing to do; however, I suspect it introduced a regression in the behavior of try_to_free_mem_cgroup_pages().

[snip]

>> Have you checked all callers? For example, IIUC, in reclaim_clean_pages_from_list(), although sc.nodemask == NULL, the demoted pages should be counted as reclaimed.
>
> I checked all call stacks leading to shrink_folio_list() now (at least I hope). Here is what I think they do and how I propose to handle them:
>
> - reclaim_clean_pages_from_list() & __node_reclaim() & balance_pgdat()
>   These try to free memory from a specific node, and both demotion and reclaim from that node should be counted. I propose these call sites set sc->nodemask = pgdat->node_id to signal to shrink_folio_list() that both demotion and reclaim from this node should be counted.
>
> - try_to_free_pages()
>   Tries to free pages from a specific nodemask. It sets sc->nodemask to ac->nodemask. In this case pages demoted within the nodemask should not count. Pages demoted outside of the nodemask should count, which this patch already tries to do.
>
> - mem_cgroup_shrink_node()
>   This is memcg soft limit reclaim. AFAIU only reclaim should be counted. It already sets sc->nodemask = NULL to indicate that it requires reclaim from all nodes and that only reclaimed memory should be counted, which this patch already tries to do.
>
> - try_to_free_mem_cgroup_pages()
>   This is covered in the commit message. Many callers set nodemask = NULL, indicating they want reclaim and demotion should not count. memory.reclaim sets nodemask depending on the 'nodes=' arg and wants demotion and reclaim from that nodemask.
>
> - reclaim_folio_list()
>   Sets no_demotion = 1. No ambiguity here; it only reclaims and counts reclaimed pages.
>
> If agreeable, I can fix the reclaim_clean_pages_from_list() & __node_reclaim() & balance_pgdat() call sites in v3.

Looks good to me, thanks!

>> How about count both "demoted" and "reclaimed" in struct scan_control, and let callers determine how to use the numbers?
>
> I don't think this is by itself enough. Pages demoted between 2 nodes that are both in sc->nodemask should not count, I think. So 'demoted' needs to be specifically pages demoted outside of the nodemask.

Yes. Maybe we can do that when we need it. I suggest changing the return value description in the comments of shrink_folio_list().

> We can do 2 things:
>
> 1. Only allow the kernel to demote outside the nodemask (which you don't prefer).
> 2. Allow the kernel to demote inside the nodemask but not count them.
>
> I will see if I can implement #2.

Thanks!

>> > Tested this change using the memory.reclaim interface. With this change,
>> >
>> > echo "1m" > memory.reclaim
>> >
>> > will cause freeing of 1m of memory from the cgroup regardless of the demotions happening inside.
>> >
>> > echo "1m nodes=0" > memory.reclaim
>>
>> Have you tested these tests in the original kernel? If so, does the issue you suspected above occur during testing?
>
> Yes. I set up a test case where I allocate 500m in a cgroup, and then do:
>
> echo "50m" > memory.reclaim
>
> Without my fix, my kernel demotes 70mb and reclaims 4mb.
>
> With my v1 fix, my kernel demotes all memory possible and reclaims 60mb.
>
> I will add this to the commit message in the next version.

Good! Thanks!

Best Regards,
Huang, Ying

[snip]
On Mon, Dec 5, 2022 at 5:26 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Mina Almasry <almasrymina@google.com> writes:
>
> > On Sun, Dec 4, 2022 at 6:39 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Mina Almasry <almasrymina@google.com> writes:
> >>
> >> > commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg reclaim"") enabled demotion in memcg reclaim, which is the right thing to do; however, I suspect it introduced a regression in the behavior of try_to_free_mem_cgroup_pages().

[snip]

> > If agreeable, I can fix the reclaim_clean_pages_from_list() & __node_reclaim() & balance_pgdat() call sites in v3.
>
> Looks good to me, thanks!

Thanks. Sent out v3 with the comments addressed. PTAL whenever convenient:
https://lore.kernel.org/linux-mm/20221206023406.3182800-1-almasrymina@google.com/T/#u

[snip]
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2b42ac9ad755..8f6e993b870d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1653,6 +1653,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	LIST_HEAD(free_folios);
 	LIST_HEAD(demote_folios);
 	unsigned int nr_reclaimed = 0;
+	unsigned int nr_demoted = 0;
 	unsigned int pgactivate = 0;
 	bool do_demote_pass;
 	struct swap_iocb *plug = NULL;
@@ -2085,7 +2086,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	/* 'folio_list' is always empty here */
 
 	/* Migrate folios selected for demotion */
-	nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
+	nr_demoted = demote_folio_list(&demote_folios, pgdat);
+
+	/*
+	 * Only count demoted folios as reclaimed if we demoted them from
+	 * inside of the nodemask to outside of the nodemask, hence reclaiming
+	 * pages in the nodemask.
+	 */
+	if (sc->nodemask && node_isset(pgdat->node_id, *sc->nodemask) &&
+	    !node_isset(next_demotion_node(pgdat->node_id), *sc->nodemask))
+		nr_reclaimed += nr_demoted;
+
 	/* Folios that could not be demoted are still in @demote_folios */
 	if (!list_empty(&demote_folios)) {
 		/* Folios which weren't demoted go back on @folio_list */