Message ID | 20221204093008.2620459-1-almasrymina@google.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:f944:0:0:0:0:0 with SMTP id q4csp1715169wrr; Sun, 4 Dec 2022 01:38:20 -0800 (PST) X-Google-Smtp-Source: AA0mqf5wqGwYSs0O059oZuYGfk/AFPy12sieQmaNqfigk0QPYye2lfU1Q9S4zpwyBD9rNyxommAJ X-Received: by 2002:a05:6402:2926:b0:46b:b45d:4431 with SMTP id ee38-20020a056402292600b0046bb45d4431mr16211348edb.4.1670146700192; Sun, 04 Dec 2022 01:38:20 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1670146700; cv=none; d=google.com; s=arc-20160816; b=SKwQGJVOs7y05za+mOJ++5o+fRHjpCM4r7pIbHAwj9ysMca3G2JOrOxjh+uYJ4Oe5K KMgESmUOjTahlp0qOV1/pVjA8KT8iIg9+a2wkQ8MRIoA0u1Ysdrf6Rw6HFO3+U+ElHIA /1fSZUlJdvLGrmNuVVS4DNelO1xPf9bdMKBf5rUB5OJ//N88viopWZptST32cXGcH2Jn sG37shT0Bpwv3RYduWCTnT9YRwrRzZZWn+TEIvgjNoxZ8ShMe2x5NAV2CdRwMDtU+Oo6 q3kqb7/cPDE8BTEeuf+S3MMpVjVci5QcA6Vi4CFYdknQdGDoxo/g1+zFAKQa9qfMYicS toOw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:message-id:mime-version:date :dkim-signature; bh=OaLjpR/QVt3hQbpdZlq+Q/65ck99KOIfezFgNLpv2/M=; b=JmXlp6gMoX9E0LFd7BRHLz/k3+h+rp89b5Xvp4gf9C7PSGXjw5H7LRhW3BiHPNGrKk uVAfHNOW+rm7CTj3/AgxQQTT9W4qlXgZwOgR6nWpkh11jaPKxcIkW4Ep1pzitsa23hd1 M+5LpU2a4Gz0oGrjzcC/Au+LD5KbhM2n9DMdUB6XlgmPcx6JRain4HOYXeNdA5Q5SgqP OhiVdLWzThQm8JMn1PF27FjJrTATasQq38tTuHLMwQCUwu9HFL7Cl4/GQviS6atiuLX9 ACtrBFHWBMHGbe0NLcD9O5OHE18JQUyDjQOdnkXkngOAjzR0M7IzYpyLvdpeVA21hRQW wICw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=jTqf8mCr; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id wy1-20020a170906fe0100b007ba713e241dsi9070384ejb.894.2022.12.04.01.37.54; Sun, 04 Dec 2022 01:38:20 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=jTqf8mCr; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229974AbiLDJaX (ORCPT <rfc822;jaysivo@gmail.com> + 99 others); Sun, 4 Dec 2022 04:30:23 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:48276 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229834AbiLDJaT (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Sun, 4 Dec 2022 04:30:19 -0500 Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com [IPv6:2607:f8b0:4864:20::1149]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C345112D29 for <linux-kernel@vger.kernel.org>; Sun, 4 Dec 2022 01:30:17 -0800 (PST) Received: by mail-yw1-x1149.google.com with SMTP id 00721157ae682-3c44ac82606so94520417b3.1 for <linux-kernel@vger.kernel.org>; Sun, 04 Dec 2022 01:30:17 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:mime-version:date:from:to:cc:subject :date:message-id:reply-to; bh=OaLjpR/QVt3hQbpdZlq+Q/65ck99KOIfezFgNLpv2/M=; b=jTqf8mCrAhYdbQXet7tnzjdA4IOXTGDFOyvIOTyDJDCAH0yNC5q95Z6bT5WYF8q2a0 GBkyNHGwVngkLcFbXsseKJl42kkpXy+cck1DIAKSpPpUgBmRrico0VnE4Gjwm4izpa6X wBxVDUYrRfPV4iwrFSYnwJBc3veHTwcBOMaXprk1cNKhTt2PkugHq2xNOmmcdtrIOctd VHcwn6p+I2JL2cKt5zz1+BOnmH5vrl+O2GKDU08xKZx79XPpljVqCJt1IIFDPRTVNDEn bYgovauUQXITwD1kAox3GoUvdzbqgWanMPS0Di0vgx8pt5wZM+WHpopY35RsbwM0/JtL wcdg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:mime-version:date:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=OaLjpR/QVt3hQbpdZlq+Q/65ck99KOIfezFgNLpv2/M=; b=O8sKyfxLq7Rd6mamggrQho5rPtp5guwlaDT3d33tKGidAxklNiKxVoDaZ8XCr2wlN0 XAPD9N3DWwGAdH8PnvfPiQem+sPYHT45YjyFEkQ6mUDz6SbLJQJpgVESZjEVf7vStAqd 3eFNCUXBtVQGzWL0Z21XIxsONNc3Hb0mi+fpREgCVpQmIDnrnrEiy3g9YaRBx8IMsDJN UqaB/wOGp9kVskdSjRlV/08lS3nXfLx1n08qoL6RGxm7/p6ONESvFbLiT6qN1XPxxnpK Kx5LXw48lOjCNHYOmdJyildTbFElxH89mJm42Vx6inWg9WG/6poHy6WtNDhlFyQXm3Et wFdA== X-Gm-Message-State: ANoB5pnbddAv+9Ul1ThSG579EHzMDuaVJs886qLqJdXKIs7FCi7Nw3uc KW5WPshTCyu+bpjX740M2GWohwKk8weonVVXAQ== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2d4:203:8120:ce77:bb35:7eba]) (user=almasrymina job=sendgmr) by 2002:a25:5d3:0:b0:6fe:f001:b029 with SMTP id 202-20020a2505d3000000b006fef001b029mr4468732ybf.324.1670146217039; Sun, 04 Dec 2022 01:30:17 -0800 (PST) Date: Sun, 4 Dec 2022 01:30:06 -0800 Mime-Version: 1.0 X-Mailer: git-send-email 2.39.0.rc0.267.gcb52ba06e7-goog Message-ID: <20221204093008.2620459-1-almasrymina@google.com> Subject: [PATCH v2] [mm-unstable] mm: Fix memcg reclaim on memory tiered systems From: Mina Almasry <almasrymina@google.com> To: Andrew Morton <akpm@linux-foundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@kernel.org>, Roman Gushchin <roman.gushchin@linux.dev>, Shakeel Butt <shakeelb@google.com>, Muchun Song <songmuchun@bytedance.com>, Huang Ying <ying.huang@intel.com>, Yang Shi <yang.shi@linux.alibaba.com>, Yosry Ahmed <yosryahmed@google.com>, weixugc@google.com, fvdl@google.com, Mina Almasry <almasrymina@google.com>, linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1751275745867424502?= X-GMAIL-MSGID: =?utf-8?q?1751275745867424502?= |
Series |
[v2,mm-unstable] mm: Fix memcg reclaim on memory tiered systems
|
|
Commit Message
Mina Almasry
Dec. 4, 2022, 9:30 a.m. UTC
commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
reclaim"") enabled demotion in memcg reclaim, which is the right thing
to do, but introduced a regression in the behavior of
try_to_free_mem_cgroup_pages().
The callers of try_to_free_mem_cgroup_pages() expect it to attempt to
reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage
of the cgroup should reduce by nr_pages. The callers expect
try_to_free_mem_cgroup_pages() to also return the number of pages
reclaimed, not demoted.
However, try_to_free_mem_cgroup_pages() actually unconditionally counts
demoted pages as reclaimed pages. So in practice when it is called it will
often demote nr_pages and return the number of demoted pages to the caller.
Demoted pages don't lower the memcg usage as the caller requested.
I suspect various things work suboptimally on memory systems or don't
work at all due to this:
- memory.high enforcement likely doesn't work (it just demotes nr_pages
instead of lowering the memcg usage by nr_pages).
- try_charge_memcg() will keep retrying the charge while
try_to_free_mem_cgroup_pages() is just demoting pages and not actually
making any room for the charge.
- memory.reclaim has a wonky interface. It advertises to the user it
reclaims the provided amount but it will actually demote that amount.
There may be more effects to this issue.
To fix these issues I propose shrink_folio_list() to only count pages
demoted from inside of sc->nodemask to outside of sc->nodemask as
'reclaimed'.
For callers such as reclaim_high() or try_charge_memcg() that set
sc->nodemask to NULL, try_to_free_mem_cgroup_pages() will try to
actually reclaim nr_pages and return the number of pages reclaimed. No
demoted pages would count towards the nr_pages requirement.
For callers such as memory_reclaim() that set sc->nodemask,
try_to_free_mem_cgroup_pages() will free nr_pages from that nodemask
with either demotion or reclaim.
Tested this change using memory.reclaim interface. With this change,
echo "1m" > memory.reclaim
Will cause freeing of 1m of memory from the cgroup regardless of the
demotions happening inside.
echo "1m nodes=0" > memory.reclaim
Will cause freeing of 1m of node 0 by demotion if a demotion target is
available, and by reclaim if no demotion target is available.
Signed-off-by: Mina Almasry <almasrymina@google.com>
---
This is developed on top of mm-unstable largely to test with memory.reclaim
nodes= arg and ensure the fix is compatible with that.
v2:
- Shortened the commit message a bit.
- Fixed issue when demotion falls back to other allowed target nodes returned by
node_get_allowed_targets() as Wei suggested.
Cc: weixugc@google.com
---
include/linux/memory-tiers.h | 7 +++++--
mm/memory-tiers.c | 10 +++++++++-
mm/vmscan.c | 20 +++++++++++++++++---
3 files changed, 31 insertions(+), 6 deletions(-)
--
2.39.0.rc0.267.gcb52ba06e7-goog
Comments
Hi Mina, Thank you for the patch! Yet something to improve: [auto build test ERROR on akpm-mm/mm-everything] url: https://github.com/intel-lab-lkp/linux/commits/Mina-Almasry/mm-Fix-memcg-reclaim-on-memory-tiered-systems/20221204-173146 base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything patch link: https://lore.kernel.org/r/20221204093008.2620459-1-almasrymina%40google.com patch subject: [PATCH v2] [mm-unstable] mm: Fix memcg reclaim on memory tiered systems config: alpha-buildonly-randconfig-r005-20221204 compiler: alpha-linux-gcc (GCC) 12.1.0 reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # https://github.com/intel-lab-lkp/linux/commit/332f41fcb1b1d6bc7dafd40e8c1e10a1a0952849 git remote add linux-review https://github.com/intel-lab-lkp/linux git fetch --no-tags linux-review Mina-Almasry/mm-Fix-memcg-reclaim-on-memory-tiered-systems/20221204-173146 git checkout 332f41fcb1b1d6bc7dafd40e8c1e10a1a0952849 # save the config file mkdir build_dir && cp config build_dir/.config COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=alpha SHELL=/bin/bash If you fix the issue, kindly add following tag where applicable | Reported-by: kernel test robot <lkp@intel.com> All errors (new ones prefixed by >>): mm/vmscan.c: In function 'demote_folio_list': >> mm/vmscan.c:1618:9: error: too many arguments to function 'node_get_allowed_targets' 1618 | node_get_allowed_targets(pgdat, &allowed_mask, demote_from_nodemask); | ^~~~~~~~~~~~~~~~~~~~~~~~ In file included from mm/vmscan.c:46: include/linux/memory-tiers.h:94:20: note: declared here 94 | static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets) | ^~~~~~~~~~~~~~~~~~~~~~~~ vim +/node_get_allowed_targets +1618 mm/vmscan.c 1587 1588 /* 1589 * Take folios on @demote_folios and attempt to demote them to another node. 1590 * Folios which are not demoted are left on @demote_folios. 1591 */ 1592 static unsigned int demote_folio_list(struct list_head *demote_folios, 1593 struct pglist_data *pgdat, 1594 nodemask_t *demote_from_nodemask) 1595 { 1596 int target_nid = next_demotion_node(pgdat->node_id); 1597 unsigned int nr_succeeded; 1598 nodemask_t allowed_mask; 1599 1600 struct migration_target_control mtc = { 1601 /* 1602 * Allocate from 'node', or fail quickly and quietly. 1603 * When this happens, 'page' will likely just be discarded 1604 * instead of migrated. 1605 */ 1606 .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN | 1607 __GFP_NOMEMALLOC | GFP_NOWAIT, 1608 .nid = target_nid, 1609 .nmask = &allowed_mask 1610 }; 1611 1612 if (list_empty(demote_folios)) 1613 return 0; 1614 1615 if (target_nid == NUMA_NO_NODE) 1616 return 0; 1617 > 1618 node_get_allowed_targets(pgdat, &allowed_mask, demote_from_nodemask); 1619 1620 /* Demotion ignores all cpuset and mempolicy settings */ 1621 migrate_pages(demote_folios, alloc_demote_page, NULL, 1622 (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION, 1623 &nr_succeeded); 1624 1625 __count_vm_events(PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded); 1626 1627 return nr_succeeded; 1628 } 1629
On Sun, Dec 4, 2022 at 2:32 AM kernel test robot <lkp@intel.com> wrote: > > Hi Mina, > > Thank you for the patch! Yet something to improve: > > [auto build test ERROR on akpm-mm/mm-everything] > > url: https://github.com/intel-lab-lkp/linux/commits/Mina-Almasry/mm-Fix-memcg-reclaim-on-memory-tiered-systems/20221204-173146 > base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything > patch link: https://lore.kernel.org/r/20221204093008.2620459-1-almasrymina%40google.com > patch subject: [PATCH v2] [mm-unstable] mm: Fix memcg reclaim on memory tiered systems > config: alpha-buildonly-randconfig-r005-20221204 > compiler: alpha-linux-gcc (GCC) 12.1.0 > reproduce (this is a W=1 build): > wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross > chmod +x ~/bin/make.cross > # https://github.com/intel-lab-lkp/linux/commit/332f41fcb1b1d6bc7dafd40e8c1e10a1a0952849 > git remote add linux-review https://github.com/intel-lab-lkp/linux > git fetch --no-tags linux-review Mina-Almasry/mm-Fix-memcg-reclaim-on-memory-tiered-systems/20221204-173146 > git checkout 332f41fcb1b1d6bc7dafd40e8c1e10a1a0952849 > # save the config file > mkdir build_dir && cp config build_dir/.config > COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=alpha SHELL=/bin/bash > > If you fix the issue, kindly add following tag where applicable > | Reported-by: kernel test robot <lkp@intel.com> > > All errors (new ones prefixed by >>): > > mm/vmscan.c: In function 'demote_folio_list': > >> mm/vmscan.c:1618:9: error: too many arguments to function 'node_get_allowed_targets' > 1618 | node_get_allowed_targets(pgdat, &allowed_mask, demote_from_nodemask); > | ^~~~~~~~~~~~~~~~~~~~~~~~ > In file included from mm/vmscan.c:46: > include/linux/memory-tiers.h:94:20: note: declared here > 94 | static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets) > | ^~~~~~~~~~~~~~~~~~~~~~~~ > I missed updating this function signature. Will fix it in v3. > > vim +/node_get_allowed_targets +1618 mm/vmscan.c > > 1587 > 1588 /* > 1589 * Take folios on @demote_folios and attempt to demote them to another node. > 1590 * Folios which are not demoted are left on @demote_folios. > 1591 */ > 1592 static unsigned int demote_folio_list(struct list_head *demote_folios, > 1593 struct pglist_data *pgdat, > 1594 nodemask_t *demote_from_nodemask) > 1595 { > 1596 int target_nid = next_demotion_node(pgdat->node_id); > 1597 unsigned int nr_succeeded; > 1598 nodemask_t allowed_mask; > 1599 > 1600 struct migration_target_control mtc = { > 1601 /* > 1602 * Allocate from 'node', or fail quickly and quietly. > 1603 * When this happens, 'page' will likely just be discarded > 1604 * instead of migrated. > 1605 */ > 1606 .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN | > 1607 __GFP_NOMEMALLOC | GFP_NOWAIT, > 1608 .nid = target_nid, > 1609 .nmask = &allowed_mask > 1610 }; > 1611 > 1612 if (list_empty(demote_folios)) > 1613 return 0; > 1614 > 1615 if (target_nid == NUMA_NO_NODE) > 1616 return 0; > 1617 > > 1618 node_get_allowed_targets(pgdat, &allowed_mask, demote_from_nodemask); > 1619 > 1620 /* Demotion ignores all cpuset and mempolicy settings */ > 1621 migrate_pages(demote_folios, alloc_demote_page, NULL, > 1622 (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION, > 1623 &nr_succeeded); > 1624 > 1625 __count_vm_events(PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded); > 1626 > 1627 return nr_succeeded; > 1628 } > 1629 > > -- > 0-DAY CI Kernel Test Service > https://01.org/lkp
Hi Mina, Thank you for the patch! Yet something to improve: [auto build test ERROR on akpm-mm/mm-everything] url: https://github.com/intel-lab-lkp/linux/commits/Mina-Almasry/mm-Fix-memcg-reclaim-on-memory-tiered-systems/20221204-173146 base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything patch link: https://lore.kernel.org/r/20221204093008.2620459-1-almasrymina%40google.com patch subject: [PATCH v2] [mm-unstable] mm: Fix memcg reclaim on memory tiered systems config: s390-randconfig-r013-20221204 compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project 6e4cea55f0d1104408b26ac574566a0e4de48036) reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # install s390 cross compiling tool for clang build # apt-get install binutils-s390x-linux-gnu # https://github.com/intel-lab-lkp/linux/commit/332f41fcb1b1d6bc7dafd40e8c1e10a1a0952849 git remote add linux-review https://github.com/intel-lab-lkp/linux git fetch --no-tags linux-review Mina-Almasry/mm-Fix-memcg-reclaim-on-memory-tiered-systems/20221204-173146 git checkout 332f41fcb1b1d6bc7dafd40e8c1e10a1a0952849 # save the config file mkdir build_dir && cp config build_dir/.config COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=s390 SHELL=/bin/bash If you fix the issue, kindly add following tag where applicable | Reported-by: kernel test robot <lkp@intel.com> All errors (new ones prefixed by >>): >> mm/vmscan.c:1618:49: error: too many arguments to function call, expected 2, have 3 node_get_allowed_targets(pgdat, &allowed_mask, demote_from_nodemask); ~~~~~~~~~~~~~~~~~~~~~~~~ ^~~~~~~~~~~~~~~~~~~~ include/linux/memory-tiers.h:94:20: note: 'node_get_allowed_targets' declared here static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets) ^ 1 error generated. vim +1618 mm/vmscan.c 1587 1588 /* 1589 * Take folios on @demote_folios and attempt to demote them to another node. 1590 * Folios which are not demoted are left on @demote_folios. 1591 */ 1592 static unsigned int demote_folio_list(struct list_head *demote_folios, 1593 struct pglist_data *pgdat, 1594 nodemask_t *demote_from_nodemask) 1595 { 1596 int target_nid = next_demotion_node(pgdat->node_id); 1597 unsigned int nr_succeeded; 1598 nodemask_t allowed_mask; 1599 1600 struct migration_target_control mtc = { 1601 /* 1602 * Allocate from 'node', or fail quickly and quietly. 1603 * When this happens, 'page' will likely just be discarded 1604 * instead of migrated. 1605 */ 1606 .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN | 1607 __GFP_NOMEMALLOC | GFP_NOWAIT, 1608 .nid = target_nid, 1609 .nmask = &allowed_mask 1610 }; 1611 1612 if (list_empty(demote_folios)) 1613 return 0; 1614 1615 if (target_nid == NUMA_NO_NODE) 1616 return 0; 1617 > 1618 node_get_allowed_targets(pgdat, &allowed_mask, demote_from_nodemask); 1619 1620 /* Demotion ignores all cpuset and mempolicy settings */ 1621 migrate_pages(demote_folios, alloc_demote_page, NULL, 1622 (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION, 1623 &nr_succeeded); 1624 1625 __count_vm_events(PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded); 1626 1627 return nr_succeeded; 1628 } 1629
Mina Almasry <almasrymina@google.com> writes: > commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg > reclaim"") enabled demotion in memcg reclaim, which is the right thing > to do, but introduced a regression in the behavior of > try_to_free_mem_cgroup_pages(). > > The callers of try_to_free_mem_cgroup_pages() expect it to attempt to > reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage > of the cgroup should reduce by nr_pages. The callers expect > try_to_free_mem_cgroup_pages() to also return the number of pages > reclaimed, not demoted. > > However, try_to_free_mem_cgroup_pages() actually unconditionally counts > demoted pages as reclaimed pages. So in practice when it is called it will > often demote nr_pages and return the number of demoted pages to the caller. > Demoted pages don't lower the memcg usage as the caller requested. > > I suspect various things work suboptimally on memory systems or don't > work at all due to this: > > - memory.high enforcement likely doesn't work (it just demotes nr_pages > instead of lowering the memcg usage by nr_pages). > - try_charge_memcg() will keep retrying the charge while > try_to_free_mem_cgroup_pages() is just demoting pages and not actually > making any room for the charge. > - memory.reclaim has a wonky interface. It advertises to the user it > reclaims the provided amount but it will actually demote that amount. > > There may be more effects to this issue. > > To fix these issues I propose shrink_folio_list() to only count pages > demoted from inside of sc->nodemask to outside of sc->nodemask as > 'reclaimed'. > > For callers such as reclaim_high() or try_charge_memcg() that set > sc->nodemask to NULL, try_to_free_mem_cgroup_pages() will try to > actually reclaim nr_pages and return the number of pages reclaimed. No > demoted pages would count towards the nr_pages requirement. > > For callers such as memory_reclaim() that set sc->nodemask, > try_to_free_mem_cgroup_pages() will free nr_pages from that nodemask > with either demotion or reclaim. > > Tested this change using memory.reclaim interface. With this change, > > echo "1m" > memory.reclaim > > Will cause freeing of 1m of memory from the cgroup regardless of the > demotions happening inside. > > echo "1m nodes=0" > memory.reclaim > > Will cause freeing of 1m of node 0 by demotion if a demotion target is > available, and by reclaim if no demotion target is available. > > Signed-off-by: Mina Almasry <almasrymina@google.com> > > --- > > This is developed on top of mm-unstable largely to test with memory.reclaim > nodes= arg and ensure the fix is compatible with that. > > v2: > - Shortened the commit message a bit. > - Fixed issue when demotion falls back to other allowed target nodes returned by > node_get_allowed_targets() as Wei suggested. > > Cc: weixugc@google.com > --- > include/linux/memory-tiers.h | 7 +++++-- > mm/memory-tiers.c | 10 +++++++++- > mm/vmscan.c | 20 +++++++++++++++++--- > 3 files changed, 31 insertions(+), 6 deletions(-) > > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h > index fc9647b1b4f9..f3f359760fd0 100644 > --- a/include/linux/memory-tiers.h > +++ b/include/linux/memory-tiers.h > @@ -38,7 +38,8 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type); > void clear_node_memory_type(int node, struct memory_dev_type *memtype); > #ifdef CONFIG_MIGRATION > int next_demotion_node(int node); > -void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets); > +void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets, > + nodemask_t *demote_from_targets); > bool node_is_toptier(int node); > #else > static inline int next_demotion_node(int node) > @@ -46,7 +47,9 @@ static inline int next_demotion_node(int node) > return NUMA_NO_NODE; > } > > -static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets) > +static inline void node_get_allowed_targets(pg_data_t *pgdat, > + nodemask_t *targets, > + nodemask_t *demote_from_targets) > { > *targets = NODE_MASK_NONE; > } > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c > index c734658c6242..7f8f0b5de2b3 100644 > --- a/mm/memory-tiers.c > +++ b/mm/memory-tiers.c > @@ -264,7 +264,8 @@ bool node_is_toptier(int node) > return toptier; > } > > -void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets) > +void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets, > + nodemask_t *demote_from_targets) > { > struct memory_tier *memtier; > > @@ -280,6 +281,13 @@ void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets) > else > *targets = NODE_MASK_NONE; > rcu_read_unlock(); > + > + /* > + * Exclude the demote_from_targets from the allowed targets if we're > + * trying to demote from a specific set of nodes. > + */ > + if (demote_from_targets) > + nodes_andnot(*targets, *targets, *demote_from_targets); > } Will this cause demotion to not work when we have memory policy like MPOL_BIND with nodemask including demotion targets? > > /** > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 2b42ac9ad755..97ca0445b5dc 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1590,7 +1590,8 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private) > * Folios which are not demoted are left on @demote_folios. > */ > static unsigned int demote_folio_list(struct list_head *demote_folios, > - struct pglist_data *pgdat) > + struct pglist_data *pgdat, > + nodemask_t *demote_from_nodemask) > { > int target_nid = next_demotion_node(pgdat->node_id); > unsigned int nr_succeeded; > @@ -1614,7 +1615,7 @@ static unsigned int demote_folio_list(struct list_head *demote_folios, > if (target_nid == NUMA_NO_NODE) > return 0; > > - node_get_allowed_targets(pgdat, &allowed_mask); > + node_get_allowed_targets(pgdat, &allowed_mask, demote_from_nodemask); > > /* Demotion ignores all cpuset and mempolicy settings */ > migrate_pages(demote_folios, alloc_demote_page, NULL, > @@ -1653,6 +1654,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > LIST_HEAD(free_folios); > LIST_HEAD(demote_folios); > unsigned int nr_reclaimed = 0; > + unsigned int nr_demoted = 0; > unsigned int pgactivate = 0; > bool do_demote_pass; > struct swap_iocb *plug = NULL; > @@ -2085,7 +2087,19 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > /* 'folio_list' is always empty here */ > > /* Migrate folios selected for demotion */ > - nr_reclaimed += demote_folio_list(&demote_folios, pgdat); > + nr_demoted = demote_folio_list(&demote_folios, pgdat, sc->nodemask); > + > + /* > + * Only count demoted folios as reclaimed if the caller has requested > + * demotion from a specific nodemask. In this case pages inside the > + * noedmask have been demoted to outside the nodemask and we can count > + * these pages as reclaimed. If no nodemask is passed, then the caller > + * is requesting reclaim from all memory, which should not count > + * demoted pages. > + */ > + if (sc->nodemask) > + nr_reclaimed += nr_demoted; > + > /* Folios that could not be demoted are still in @demote_folios */ > if (!list_empty(&demote_folios)) { > /* Folios which weren't demoted go back on @folio_list */ > -- > 2.39.0.rc0.267.gcb52ba06e7-goog
On Wed, Dec 7, 2022 at 12:07 AM Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> wrote: > > Mina Almasry <almasrymina@google.com> writes: > > > commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg > > reclaim"") enabled demotion in memcg reclaim, which is the right thing > > to do, but introduced a regression in the behavior of > > try_to_free_mem_cgroup_pages(). > > > > The callers of try_to_free_mem_cgroup_pages() expect it to attempt to > > reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage > > of the cgroup should reduce by nr_pages. The callers expect > > try_to_free_mem_cgroup_pages() to also return the number of pages > > reclaimed, not demoted. > > > > However, try_to_free_mem_cgroup_pages() actually unconditionally counts > > demoted pages as reclaimed pages. So in practice when it is called it will > > often demote nr_pages and return the number of demoted pages to the caller. > > Demoted pages don't lower the memcg usage as the caller requested. > > > > I suspect various things work suboptimally on memory systems or don't > > work at all due to this: > > > > - memory.high enforcement likely doesn't work (it just demotes nr_pages > > instead of lowering the memcg usage by nr_pages). > > - try_charge_memcg() will keep retrying the charge while > > try_to_free_mem_cgroup_pages() is just demoting pages and not actually > > making any room for the charge. > > - memory.reclaim has a wonky interface. It advertises to the user it > > reclaims the provided amount but it will actually demote that amount. > > > > There may be more effects to this issue. > > > > To fix these issues I propose shrink_folio_list() to only count pages > > demoted from inside of sc->nodemask to outside of sc->nodemask as > > 'reclaimed'. > > > > For callers such as reclaim_high() or try_charge_memcg() that set > > sc->nodemask to NULL, try_to_free_mem_cgroup_pages() will try to > > actually reclaim nr_pages and return the number of pages reclaimed. No > > demoted pages would count towards the nr_pages requirement. > > > > For callers such as memory_reclaim() that set sc->nodemask, > > try_to_free_mem_cgroup_pages() will free nr_pages from that nodemask > > with either demotion or reclaim. > > > > Tested this change using memory.reclaim interface. With this change, > > > > echo "1m" > memory.reclaim > > > > Will cause freeing of 1m of memory from the cgroup regardless of the > > demotions happening inside. > > > > echo "1m nodes=0" > memory.reclaim > > > > Will cause freeing of 1m of node 0 by demotion if a demotion target is > > available, and by reclaim if no demotion target is available. > > > > Signed-off-by: Mina Almasry <almasrymina@google.com> > > > > --- > > > > This is developed on top of mm-unstable largely to test with memory.reclaim > > nodes= arg and ensure the fix is compatible with that. > > > > v2: > > - Shortened the commit message a bit. > > - Fixed issue when demotion falls back to other allowed target nodes returned by > > node_get_allowed_targets() as Wei suggested. > > > > Cc: weixugc@google.com > > --- > > include/linux/memory-tiers.h | 7 +++++-- > > mm/memory-tiers.c | 10 +++++++++- > > mm/vmscan.c | 20 +++++++++++++++++--- > > 3 files changed, 31 insertions(+), 6 deletions(-) > > > > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h > > index fc9647b1b4f9..f3f359760fd0 100644 > > --- a/include/linux/memory-tiers.h > > +++ b/include/linux/memory-tiers.h > > @@ -38,7 +38,8 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type); > > void clear_node_memory_type(int node, struct memory_dev_type *memtype); > > #ifdef CONFIG_MIGRATION > > int next_demotion_node(int node); > > -void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets); > > +void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets, > > + nodemask_t *demote_from_targets); > > bool node_is_toptier(int node); > > #else > > static inline int next_demotion_node(int node) > > @@ -46,7 +47,9 @@ static inline int next_demotion_node(int node) > > return NUMA_NO_NODE; > > } > > > > -static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets) > > +static inline void node_get_allowed_targets(pg_data_t *pgdat, > > + nodemask_t *targets, > > + nodemask_t *demote_from_targets) > > { > > *targets = NODE_MASK_NONE; > > } > > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c > > index c734658c6242..7f8f0b5de2b3 100644 > > --- a/mm/memory-tiers.c > > +++ b/mm/memory-tiers.c > > @@ -264,7 +264,8 @@ bool node_is_toptier(int node) > > return toptier; > > } > > > > -void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets) > > +void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets, > > + nodemask_t *demote_from_targets) > > { > > struct memory_tier *memtier; > > > > @@ -280,6 +281,13 @@ void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets) > > else > > *targets = NODE_MASK_NONE; > > rcu_read_unlock(); > > + > > + /* > > + * Exclude the demote_from_targets from the allowed targets if we're > > + * trying to demote from a specific set of nodes. > > + */ > > + if (demote_from_targets) > > + nodes_andnot(*targets, *targets, *demote_from_targets); > > } > > Will this cause demotion to not work when we have memory policy like > MPOL_BIND with nodemask including demotion targets? > Hi Aneesh, You may want to review v3 of this patch that removed this bit: https://lore.kernel.org/linux-mm/202212070124.VxwbfKCK-lkp@intel.com/T/#t To answer your question though, it will disable demotion between the MPOL_BIND nodes I think, yes. That may be another reason not to do this (it's already removed in v3). > > > > > /** > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 2b42ac9ad755..97ca0445b5dc 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -1590,7 +1590,8 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private) > > * Folios which are not demoted are left on @demote_folios. > > */ > > static unsigned int demote_folio_list(struct list_head *demote_folios, > > - struct pglist_data *pgdat) > > + struct pglist_data *pgdat, > > + nodemask_t *demote_from_nodemask) > > { > > int target_nid = next_demotion_node(pgdat->node_id); > > unsigned int nr_succeeded; > > @@ -1614,7 +1615,7 @@ static unsigned int demote_folio_list(struct list_head *demote_folios, > > if (target_nid == NUMA_NO_NODE) > > return 0; > > > > - node_get_allowed_targets(pgdat, &allowed_mask); > > + node_get_allowed_targets(pgdat, &allowed_mask, demote_from_nodemask); > > > > /* Demotion ignores all cpuset and mempolicy settings */ > > migrate_pages(demote_folios, alloc_demote_page, NULL, > > @@ -1653,6 +1654,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > LIST_HEAD(free_folios); > > LIST_HEAD(demote_folios); > > unsigned int nr_reclaimed = 0; > > + unsigned int nr_demoted = 0; > > unsigned int pgactivate = 0; > > bool do_demote_pass; > > struct swap_iocb *plug = NULL; > > @@ -2085,7 +2087,19 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > /* 'folio_list' is always empty here */ > > > > /* Migrate folios selected for demotion */ > > - nr_reclaimed += demote_folio_list(&demote_folios, pgdat); > > + nr_demoted = demote_folio_list(&demote_folios, pgdat, sc->nodemask); > > + > > + /* > > + * Only count demoted folios as reclaimed if the caller has requested > > + * demotion from a specific nodemask. In this case pages inside the > > + * noedmask have been demoted to outside the nodemask and we can count > > + * these pages as reclaimed. If no nodemask is passed, then the caller > > + * is requesting reclaim from all memory, which should not count > > + * demoted pages. > > + */ > > + if (sc->nodemask) > > + nr_reclaimed += nr_demoted; > > + > > /* Folios that could not be demoted are still in @demote_folios */ > > if (!list_empty(&demote_folios)) { > > /* Folios which weren't demoted go back on @folio_list */ > > -- > > 2.39.0.rc0.267.gcb52ba06e7-goog
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index fc9647b1b4f9..f3f359760fd0 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -38,7 +38,8 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type); void clear_node_memory_type(int node, struct memory_dev_type *memtype); #ifdef CONFIG_MIGRATION int next_demotion_node(int node); -void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets); +void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets, + nodemask_t *demote_from_targets); bool node_is_toptier(int node); #else static inline int next_demotion_node(int node) @@ -46,7 +47,9 @@ static inline int next_demotion_node(int node) return NUMA_NO_NODE; } -static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets) +static inline void node_get_allowed_targets(pg_data_t *pgdat, + nodemask_t *targets, + nodemask_t *demote_from_targets) { *targets = NODE_MASK_NONE; } diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index c734658c6242..7f8f0b5de2b3 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -264,7 +264,8 @@ bool node_is_toptier(int node) return toptier; } -void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets) +void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets, + nodemask_t *demote_from_targets) { struct memory_tier *memtier; @@ -280,6 +281,13 @@ void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets) else *targets = NODE_MASK_NONE; rcu_read_unlock(); + + /* + * Exclude the demote_from_targets from the allowed targets if we're + * trying to demote from a specific set of nodes. + */ + if (demote_from_targets) + nodes_andnot(*targets, *targets, *demote_from_targets); } /** diff --git a/mm/vmscan.c b/mm/vmscan.c index 2b42ac9ad755..97ca0445b5dc 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1590,7 +1590,8 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private) * Folios which are not demoted are left on @demote_folios. */ static unsigned int demote_folio_list(struct list_head *demote_folios, - struct pglist_data *pgdat) + struct pglist_data *pgdat, + nodemask_t *demote_from_nodemask) { int target_nid = next_demotion_node(pgdat->node_id); unsigned int nr_succeeded; @@ -1614,7 +1615,7 @@ static unsigned int demote_folio_list(struct list_head *demote_folios, if (target_nid == NUMA_NO_NODE) return 0; - node_get_allowed_targets(pgdat, &allowed_mask); + node_get_allowed_targets(pgdat, &allowed_mask, demote_from_nodemask); /* Demotion ignores all cpuset and mempolicy settings */ migrate_pages(demote_folios, alloc_demote_page, NULL, @@ -1653,6 +1654,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, LIST_HEAD(free_folios); LIST_HEAD(demote_folios); unsigned int nr_reclaimed = 0; + unsigned int nr_demoted = 0; unsigned int pgactivate = 0; bool do_demote_pass; struct swap_iocb *plug = NULL; @@ -2085,7 +2087,19 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, /* 'folio_list' is always empty here */ /* Migrate folios selected for demotion */ - nr_reclaimed += demote_folio_list(&demote_folios, pgdat); + nr_demoted = demote_folio_list(&demote_folios, pgdat, sc->nodemask); + + /* + * Only count demoted folios as reclaimed if the caller has requested + * demotion from a specific nodemask. In this case pages inside the + * noedmask have been demoted to outside the nodemask and we can count + * these pages as reclaimed. If no nodemask is passed, then the caller + * is requesting reclaim from all memory, which should not count + * demoted pages. + */ + if (sc->nodemask) + nr_reclaimed += nr_demoted; + /* Folios that could not be demoted are still in @demote_folios */ if (!list_empty(&demote_folios)) { /* Folios which weren't demoted go back on @folio_list */