Message ID | 20221026074343.6517-1-feng.tang@intel.com |
---|---|
State | New |
Headers |
From: Feng Tang <feng.tang@intel.com>
To: Andrew Morton <akpm@linux-foundation.org>, Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@suse.com>, Tejun Heo <tj@kernel.org>, Zefan Li <lizefan.x@bytedance.com>, Waiman Long <longman@redhat.com>, ying.huang@intel.com, aneesh.kumar@linux.ibm.com, linux-mm@kvack.org, cgroups@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, dave.hansen@intel.com, tim.c.chen@intel.com, fengwei.yin@intel.com, Feng Tang <feng.tang@intel.com>
Subject: [PATCH] mm/vmscan: respect cpuset policy during page demotion
Date: Wed, 26 Oct 2022 15:43:43 +0800
Message-Id: <20221026074343.6517-1-feng.tang@intel.com> |
Series | mm/vmscan: respect cpuset policy during page demotion |
Commit Message
Feng Tang
Oct. 26, 2022, 7:43 a.m. UTC
In the page reclaim path, memory can be demoted from a faster memory tier
to a slower memory tier. Currently there is no check against the cpuset's
memory policy, so even if the target demotion node is not allowed by the
cpuset, the demotion still happens, which breaks the cpuset semantics.

So add a cpuset policy check in the demotion path and skip demotion
if the demotion targets are not allowed by the cpuset.
Signed-off-by: Feng Tang <feng.tang@intel.com>
---
Hi reviewers,
To keep it easily bisectable, I combined the cpuset change and the mm
change in one patch; if you prefer to have them separated, I can split
it into 2 patches.
Thanks,
Feng
include/linux/cpuset.h | 6 ++++++
kernel/cgroup/cpuset.c | 29 +++++++++++++++++++++++++++++
mm/vmscan.c | 35 ++++++++++++++++++++++++++++++++---
3 files changed, 67 insertions(+), 3 deletions(-)
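Since the diff itself is only readable inside the quoted replies below, here is a condensed view of the three pieces the patch adds, copied from the posted diff and trimmed (local declarations and the CONFIG_MEMCG/CONFIG_CPUSETS guards around the vmscan hunks are elided; the comments are editorial):

/* kernel/cgroup/cpuset.c: new helper mapping a cgroup to the cpuset's
 * effective memory nodemask (NODE_MASK_ALL on cgroup v1, where there is
 * no guaranteed cgroup -> cpuset association) */
void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, nodemask_t *nmask)
{
        struct cgroup_subsys_state *css;
        struct cpuset *cs;

        if (!is_in_v2_mode()) {
                *nmask = NODE_MASK_ALL;
                return;
        }

        rcu_read_lock();
        css = cgroup_e_css(cgroup, &cpuset_cgrp_subsys);
        if (css) {
                css_get(css);
                cs = css_cs(css);
                *nmask = cs->effective_mems;
                css_put(css);
        }
        rcu_read_unlock();
}

/* mm/vmscan.c, shrink_folio_list(): skip demotion of a folio whose memcg's
 * cpuset has no node in common with this node's allowed demotion targets */
node_get_allowed_targets(pgdat, &nmask);
memcg = folio_memcg(folio);
if (memcg)
        cpuset_get_allowed_mem_nodes(memcg->css.cgroup, &nmask1);
if (!nodes_intersects(nmask, nmask1))
        skip_this_demotion = true;

/* mm/vmscan.c, alloc_demote_page(): if the preferred demotion node is not
 * allowed by the cpuset, allocate with the cpuset-filtered fallback mask */
memcg = page_memcg(page);
cpuset_get_allowed_mem_nodes(memcg->css.cgroup, &cpuset_nmask);
if (!node_isset(mtc->nid, cpuset_nmask)) {
        if (mtc->nmask)
                nodes_and(*mtc->nmask, *mtc->nmask, cpuset_nmask);
        return alloc_migration_target(page, (unsigned long)mtc);
}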
Comments
On 10/26/22 1:13 PM, Feng Tang wrote:
> In page reclaim path, memory could be demoted from faster memory tier
> to slower memory tier. Currently, there is no check about cpuset's
> memory policy, that even if the target demotion node is not allowd
> by cpuset, the demotion will still happen, which breaks the cpuset
> semantics.
>
> So add cpuset policy check in the demotion path and skip demotion
> if the demotion targets are not allowed by cpuset.
>

What about the vma policy or the task memory policy? Shouldn't we respect
those memory policy restrictions while demoting the page?

-aneesh
On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote:
> On 10/26/22 1:13 PM, Feng Tang wrote:
> > In page reclaim path, memory could be demoted from faster memory tier
> > to slower memory tier. Currently, there is no check about cpuset's
> > memory policy, that even if the target demotion node is not allowd
> > by cpuset, the demotion will still happen, which breaks the cpuset
> > semantics.
> >
> > So add cpuset policy check in the demotion path and skip demotion
> > if the demotion targets are not allowed by cpuset.
> >
>
> What about the vma policy or the task memory policy? Shouldn't we respect
> those memory policy restrictions while demoting the page?

Good question! We have some basic patches to consider memory policy
in demotion path too, which are still under test, and will be posted
soon. And the basic idea is similar to this patch.

Thanks,
Feng

> -aneesh
Hi Feng, On 10/26/2022 3:43 PM, Feng Tang wrote: > In page reclaim path, memory could be demoted from faster memory tier > to slower memory tier. Currently, there is no check about cpuset's > memory policy, that even if the target demotion node is not allowd > by cpuset, the demotion will still happen, which breaks the cpuset > semantics. > > So add cpuset policy check in the demotion path and skip demotion > if the demotion targets are not allowed by cpuset. > > Signed-off-by: Feng Tang <feng.tang@intel.com> > --- > Hi reviewers, > > For easy bisectable, I combined the cpuset change and mm change > in one patch, if you prefer to separate them, I can turn it into > 2 patches. > > Thanks, > Feng > > include/linux/cpuset.h | 6 ++++++ > kernel/cgroup/cpuset.c | 29 +++++++++++++++++++++++++++++ > mm/vmscan.c | 35 ++++++++++++++++++++++++++++++++--- > 3 files changed, 67 insertions(+), 3 deletions(-) > > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h > index d58e0476ee8e..6fcce2bd2631 100644 > --- a/include/linux/cpuset.h > +++ b/include/linux/cpuset.h > @@ -178,6 +178,8 @@ static inline void set_mems_allowed(nodemask_t nodemask) > task_unlock(current); > } > > +extern void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, > + nodemask_t *nmask); > #else /* !CONFIG_CPUSETS */ > > static inline bool cpusets_enabled(void) { return false; } > @@ -299,6 +301,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq) > return false; > } > > +static inline void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, > + nodemask_t *nmask) > +{ > +} > #endif /* !CONFIG_CPUSETS */ > > #endif /* _LINUX_CPUSET_H */ > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c > index 3ea2e836e93e..cbb118c0502f 100644 > --- a/kernel/cgroup/cpuset.c > +++ b/kernel/cgroup/cpuset.c > @@ -3750,6 +3750,35 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk) > return mask; > } > > +/* > + * Retrieve the allowed memory nodemask for a cgroup. > + * > + * Set *nmask to cpuset's effective allowed nodemask for cgroup v2, > + * and NODE_MASK_ALL (means no constraint) for cgroup v1 where there > + * is no guaranteed association from a cgroup to a cpuset. > + */ > +void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, nodemask_t *nmask) > +{ > + struct cgroup_subsys_state *css; > + struct cpuset *cs; > + > + if (!is_in_v2_mode()) { > + *nmask = NODE_MASK_ALL; > + return; > + } > + > + rcu_read_lock(); > + css = cgroup_e_css(cgroup, &cpuset_cgrp_subsys); > + if (css) { > + css_get(css); > + cs = css_cs(css); > + *nmask = cs->effective_mems; > + css_put(css); > + } > + > + rcu_read_unlock(); > +} > + > /** > * cpuset_nodemask_valid_mems_allowed - check nodemask vs. 
current mems_allowed > * @nodemask: the nodemask to be checked > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 18f6497994ec..c205d98283bc 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1537,9 +1537,21 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private) > { > struct page *target_page; > nodemask_t *allowed_mask; > - struct migration_target_control *mtc; > + struct migration_target_control *mtc = (void *)private; > > - mtc = (struct migration_target_control *)private; > +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS) > + struct mem_cgroup *memcg; > + nodemask_t cpuset_nmask; > + > + memcg = page_memcg(page); > + cpuset_get_allowed_mem_nodes(memcg->css.cgroup, &cpuset_nmask); > + > + if (!node_isset(mtc->nid, cpuset_nmask)) { > + if (mtc->nmask) > + nodes_and(*mtc->nmask, *mtc->nmask, cpuset_nmask); > + return alloc_migration_target(page, (unsigned long)mtc); > + } > +#endif > > allowed_mask = mtc->nmask; > /* > @@ -1649,6 +1661,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > enum folio_references references = FOLIOREF_RECLAIM; > bool dirty, writeback; > unsigned int nr_pages; > + bool skip_this_demotion = false; > > cond_resched(); > > @@ -1658,6 +1671,22 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > if (!folio_trylock(folio)) > goto keep; > > +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS) > + if (do_demote_pass) { > + struct mem_cgroup *memcg; > + nodemask_t nmask, nmask1; > + > + node_get_allowed_targets(pgdat, &nmask); > + memcg = folio_memcg(folio); > + if (memcg) Doesn't check memcg in the change to alloc_demote_page(). What's the difference here? > + cpuset_get_allowed_mem_nodes(memcg->css.cgroup, > + &nmask1); > + > + if (!nodes_intersects(nmask, nmask1)) If memcg is NULL, nmask1 will have an uninitialized value. Thanks. Regards Yin, Fengwei > + skip_this_demotion = true; > + } > +#endif > + > VM_BUG_ON_FOLIO(folio_test_active(folio), folio); > > nr_pages = folio_nr_pages(folio); > @@ -1799,7 +1828,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > * Before reclaiming the folio, try to relocate > * its contents to another node. > */ > - if (do_demote_pass && > + if (do_demote_pass && !skip_this_demotion && > (thp_migration_supported() || !folio_test_large(folio))) { > list_add(&folio->lru, &demote_folios); > folio_unlock(folio);
On Wed, Oct 26, 2022 at 04:26:28PM +0800, Yin, Fengwei wrote: > Hi Feng, > > On 10/26/2022 3:43 PM, Feng Tang wrote: > > In page reclaim path, memory could be demoted from faster memory tier > > to slower memory tier. Currently, there is no check about cpuset's > > memory policy, that even if the target demotion node is not allowd > > by cpuset, the demotion will still happen, which breaks the cpuset > > semantics. > > > > So add cpuset policy check in the demotion path and skip demotion > > if the demotion targets are not allowed by cpuset. > > > > Signed-off-by: Feng Tang <feng.tang@intel.com> > > --- > > Hi reviewers, > > > > For easy bisectable, I combined the cpuset change and mm change > > in one patch, if you prefer to separate them, I can turn it into > > 2 patches. > > > > Thanks, > > Feng > > > > include/linux/cpuset.h | 6 ++++++ > > kernel/cgroup/cpuset.c | 29 +++++++++++++++++++++++++++++ > > mm/vmscan.c | 35 ++++++++++++++++++++++++++++++++--- > > 3 files changed, 67 insertions(+), 3 deletions(-) > > > > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h > > index d58e0476ee8e..6fcce2bd2631 100644 > > --- a/include/linux/cpuset.h > > +++ b/include/linux/cpuset.h > > @@ -178,6 +178,8 @@ static inline void set_mems_allowed(nodemask_t nodemask) > > task_unlock(current); > > } > > > > +extern void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, > > + nodemask_t *nmask); > > #else /* !CONFIG_CPUSETS */ > > > > static inline bool cpusets_enabled(void) { return false; } > > @@ -299,6 +301,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq) > > return false; > > } > > > > +static inline void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, > > + nodemask_t *nmask) > > +{ > > +} > > #endif /* !CONFIG_CPUSETS */ > > > > #endif /* _LINUX_CPUSET_H */ > > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c > > index 3ea2e836e93e..cbb118c0502f 100644 > > --- a/kernel/cgroup/cpuset.c > > +++ b/kernel/cgroup/cpuset.c > > @@ -3750,6 +3750,35 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk) > > return mask; > > } > > > > +/* > > + * Retrieve the allowed memory nodemask for a cgroup. > > + * > > + * Set *nmask to cpuset's effective allowed nodemask for cgroup v2, > > + * and NODE_MASK_ALL (means no constraint) for cgroup v1 where there > > + * is no guaranteed association from a cgroup to a cpuset. > > + */ > > +void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, nodemask_t *nmask) > > +{ > > + struct cgroup_subsys_state *css; > > + struct cpuset *cs; > > + > > + if (!is_in_v2_mode()) { > > + *nmask = NODE_MASK_ALL; > > + return; > > + } > > + > > + rcu_read_lock(); > > + css = cgroup_e_css(cgroup, &cpuset_cgrp_subsys); > > + if (css) { > > + css_get(css); > > + cs = css_cs(css); > > + *nmask = cs->effective_mems; > > + css_put(css); > > + } > > + > > + rcu_read_unlock(); > > +} > > + > > /** > > * cpuset_nodemask_valid_mems_allowed - check nodemask vs. 
current mems_allowed > > * @nodemask: the nodemask to be checked > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 18f6497994ec..c205d98283bc 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -1537,9 +1537,21 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private) > > { > > struct page *target_page; > > nodemask_t *allowed_mask; > > - struct migration_target_control *mtc; > > + struct migration_target_control *mtc = (void *)private; > > > > - mtc = (struct migration_target_control *)private; > > +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS) > > + struct mem_cgroup *memcg; > > + nodemask_t cpuset_nmask; > > + > > + memcg = page_memcg(page); > > + cpuset_get_allowed_mem_nodes(memcg->css.cgroup, &cpuset_nmask); > > + > > + if (!node_isset(mtc->nid, cpuset_nmask)) { > > + if (mtc->nmask) > > + nodes_and(*mtc->nmask, *mtc->nmask, cpuset_nmask); > > + return alloc_migration_target(page, (unsigned long)mtc); > > + } > > +#endif > > > > allowed_mask = mtc->nmask; > > /* > > @@ -1649,6 +1661,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > enum folio_references references = FOLIOREF_RECLAIM; > > bool dirty, writeback; > > unsigned int nr_pages; > > + bool skip_this_demotion = false; > > > > cond_resched(); > > > > @@ -1658,6 +1671,22 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > if (!folio_trylock(folio)) > > goto keep; > > > > +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS) > > + if (do_demote_pass) { > > + struct mem_cgroup *memcg; > > + nodemask_t nmask, nmask1; > > + > > + node_get_allowed_targets(pgdat, &nmask); > > + memcg = folio_memcg(folio); > > + if (memcg) > Doesn't check memcg in the change to alloc_demote_page(). What's the difference here? > > > + cpuset_get_allowed_mem_nodes(memcg->css.cgroup, > > + &nmask1); > > + > > + if (!nodes_intersects(nmask, nmask1)) > If memcg is NULL, nmask1 will have an uninitialized value. Thanks. Good catch! Yes, it should be initialized to NODE_MASK_ALL. Actually I was not sure if I need to check 'memcg == NULL' case, while I was under the impression that for page on LRU list, it's memcg is either a specific memcg or the 'root_mem_cgroup'. I will double check that. Thanks, Feng > Regards > Yin, Fengwei > > > + skip_this_demotion = true; > > + } > > +#endif > > + > > VM_BUG_ON_FOLIO(folio_test_active(folio), folio); > > > > nr_pages = folio_nr_pages(folio); > > @@ -1799,7 +1828,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > * Before reclaiming the folio, try to relocate > > * its contents to another node. > > */ > > - if (do_demote_pass && > > + if (do_demote_pass && !skip_this_demotion && > > (thp_migration_supported() || !folio_test_large(folio))) { > > list_add(&folio->lru, &demote_folios); > > folio_unlock(folio);
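A minimal sketch of the fix Feng agrees to above: initialize the cpuset mask to NODE_MASK_ALL so a folio without a memcg is treated as unconstrained. This is a hypothetical revision of the posted hunk, not code from the thread:

#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
        if (do_demote_pass) {
                struct mem_cgroup *memcg;
                /* no memcg (or memcg disabled) -> no cpuset constraint */
                nodemask_t nmask, nmask1 = NODE_MASK_ALL;

                node_get_allowed_targets(pgdat, &nmask);
                memcg = folio_memcg(folio);
                if (memcg)
                        cpuset_get_allowed_mem_nodes(memcg->css.cgroup,
                                                     &nmask1);

                if (!nodes_intersects(nmask, nmask1))
                        skip_this_demotion = true;
        }
#endif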
On Wed 26-10-22 16:00:13, Feng Tang wrote: > On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote: > > On 10/26/22 1:13 PM, Feng Tang wrote: > > > In page reclaim path, memory could be demoted from faster memory tier > > > to slower memory tier. Currently, there is no check about cpuset's > > > memory policy, that even if the target demotion node is not allowd > > > by cpuset, the demotion will still happen, which breaks the cpuset > > > semantics. > > > > > > So add cpuset policy check in the demotion path and skip demotion > > > if the demotion targets are not allowed by cpuset. > > > > > > > What about the vma policy or the task memory policy? Shouldn't we respect > > those memory policy restrictions while demoting the page? > > Good question! We have some basic patches to consider memory policy > in demotion path too, which are still under test, and will be posted > soon. And the basic idea is similar to this patch. For that you need to consult each vma and it's owning task(s) and that to me sounds like something to be done in folio_check_references. Relying on memcg to get a cpuset cgroup is really ugly and not really 100% correct. Memory controller might be disabled and then you do not have your association anymore. This all can get quite expensive so the primary question is, does the existing behavior generates any real issues or is this more of an correctness exercise? I mean it certainly is not great to demote to an incompatible numa node but are there any reasonable configurations when the demotion target node is explicitly excluded from memory policy/cpuset?
On 10/26/22 2:49 PM, Michal Hocko wrote: > On Wed 26-10-22 16:00:13, Feng Tang wrote: >> On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote: >>> On 10/26/22 1:13 PM, Feng Tang wrote: >>>> In page reclaim path, memory could be demoted from faster memory tier >>>> to slower memory tier. Currently, there is no check about cpuset's >>>> memory policy, that even if the target demotion node is not allowd >>>> by cpuset, the demotion will still happen, which breaks the cpuset >>>> semantics. >>>> >>>> So add cpuset policy check in the demotion path and skip demotion >>>> if the demotion targets are not allowed by cpuset. >>>> >>> >>> What about the vma policy or the task memory policy? Shouldn't we respect >>> those memory policy restrictions while demoting the page? >> >> Good question! We have some basic patches to consider memory policy >> in demotion path too, which are still under test, and will be posted >> soon. And the basic idea is similar to this patch. > > For that you need to consult each vma and it's owning task(s) and that > to me sounds like something to be done in folio_check_references. > Relying on memcg to get a cpuset cgroup is really ugly and not really > 100% correct. Memory controller might be disabled and then you do not > have your association anymore. > I was looking at this recently and I am wondering whether we should worry about VM_SHARE vmas. ie, page_to_policy() can just reverse lookup just one VMA and fetch the policy right? if it VM_SHARE it will be a shared policy we can find using vma->vm_file? For non anonymous and anon vma not having any policy set it is owning task vma->vm_mm->owner task policy ? We don't worry about multiple tasks that can be possibly sharing that page right? > This all can get quite expensive so the primary question is, does the > existing behavior generates any real issues or is this more of an > correctness exercise? I mean it certainly is not great to demote to an > incompatible numa node but are there any reasonable configurations when > the demotion target node is explicitly excluded from memory > policy/cpuset? I guess vma policy is important. Applications want to make sure that they don't have variable performance and they go to lengths to avoid that by using MEM_BIND. So if they access the memory they always know access is satisfied from a specific set of NUMA nodes. Swapin can cause performance impact but then all continued access will be from a specific NUMA nodes. With slow memory demotion that is not going to be the case. Large in-memory database applications are observed to be sensitive to these access latencies. -aneesh
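For readers unfamiliar with the MEM_BIND behaviour Aneesh describes, a minimal userspace sketch of how an application pins a working set to a fast node with a strict bind policy. It is illustrative only and not part of the patch; the node number, the region size, and the use of mbind(2) via libnuma's <numaif.h> (link with -lnuma) are assumptions of the example:

#include <numaif.h>
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
        size_t len = 1UL << 30;                 /* 1 GiB working set (example) */
        unsigned long fast_nodes = 1UL << 0;    /* assume node 0 is the fast tier */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED)
                return 1;

        /* MPOL_BIND: all faults in [buf, buf + len) must be satisfied from
         * node 0 only; the discussion above is about whether demotion
         * should honour restrictions like this one. */
        if (mbind(buf, len, MPOL_BIND, &fast_nodes, sizeof(fast_nodes) * 8, 0)) {
                perror("mbind");
                return 1;
        }

        return 0;
}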
On Wed 26-10-22 16:12:25, Aneesh Kumar K V wrote: > On 10/26/22 2:49 PM, Michal Hocko wrote: > > On Wed 26-10-22 16:00:13, Feng Tang wrote: > >> On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote: > >>> On 10/26/22 1:13 PM, Feng Tang wrote: > >>>> In page reclaim path, memory could be demoted from faster memory tier > >>>> to slower memory tier. Currently, there is no check about cpuset's > >>>> memory policy, that even if the target demotion node is not allowd > >>>> by cpuset, the demotion will still happen, which breaks the cpuset > >>>> semantics. > >>>> > >>>> So add cpuset policy check in the demotion path and skip demotion > >>>> if the demotion targets are not allowed by cpuset. > >>>> > >>> > >>> What about the vma policy or the task memory policy? Shouldn't we respect > >>> those memory policy restrictions while demoting the page? > >> > >> Good question! We have some basic patches to consider memory policy > >> in demotion path too, which are still under test, and will be posted > >> soon. And the basic idea is similar to this patch. > > > > For that you need to consult each vma and it's owning task(s) and that > > to me sounds like something to be done in folio_check_references. > > Relying on memcg to get a cpuset cgroup is really ugly and not really > > 100% correct. Memory controller might be disabled and then you do not > > have your association anymore. > > > > I was looking at this recently and I am wondering whether we should worry about VM_SHARE > vmas. > > ie, page_to_policy() can just reverse lookup just one VMA and fetch the policy right? How would that help for private mappings shared between parent/child? Also reducing this to a single VMA is not really necessary as folio_check_references already does most of that work. What is really missing is to check for other memory policies (i.e. cpusets and per-task mempolicy). The later is what can get quite expensive. > if it VM_SHARE it will be a shared policy we can find using vma->vm_file? > > For non anonymous and anon vma not having any policy set it is owning task vma->vm_mm->owner task policy ? Please note that mm can be shared even outside of the traditional thread group so you would need to go into something like mm_update_next_owner > We don't worry about multiple tasks that can be possibly sharing that page right? Why not? > > This all can get quite expensive so the primary question is, does the > > existing behavior generates any real issues or is this more of an > > correctness exercise? I mean it certainly is not great to demote to an > > incompatible numa node but are there any reasonable configurations when > > the demotion target node is explicitly excluded from memory > > policy/cpuset? > > I guess vma policy is important. Applications want to make sure that they don't > have variable performance and they go to lengths to avoid that by using MEM_BIND. > So if they access the memory they always know access is satisfied from a specific > set of NUMA nodes. Swapin can cause performance impact but then all continued > access will be from a specific NUMA nodes. With slow memory demotion that is > not going to be the case. Large in-memory database applications are observed to > be sensitive to these access latencies. Yes, I do understand that from the correctness POV this is a problem. My question is whether this is a _practical_ problem worth really being fixed as it is not really a cheap fix. 
If there are strong node locality assumptions by the userspace then I would expect demotion to be disabled in the first place.
On 10/26/22 4:32 PM, Michal Hocko wrote: > On Wed 26-10-22 16:12:25, Aneesh Kumar K V wrote: >> On 10/26/22 2:49 PM, Michal Hocko wrote: >>> On Wed 26-10-22 16:00:13, Feng Tang wrote: >>>> On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote: >>>>> On 10/26/22 1:13 PM, Feng Tang wrote: >>>>>> In page reclaim path, memory could be demoted from faster memory tier >>>>>> to slower memory tier. Currently, there is no check about cpuset's >>>>>> memory policy, that even if the target demotion node is not allowd >>>>>> by cpuset, the demotion will still happen, which breaks the cpuset >>>>>> semantics. >>>>>> >>>>>> So add cpuset policy check in the demotion path and skip demotion >>>>>> if the demotion targets are not allowed by cpuset. >>>>>> >>>>> >>>>> What about the vma policy or the task memory policy? Shouldn't we respect >>>>> those memory policy restrictions while demoting the page? >>>> >>>> Good question! We have some basic patches to consider memory policy >>>> in demotion path too, which are still under test, and will be posted >>>> soon. And the basic idea is similar to this patch. >>> >>> For that you need to consult each vma and it's owning task(s) and that >>> to me sounds like something to be done in folio_check_references. >>> Relying on memcg to get a cpuset cgroup is really ugly and not really >>> 100% correct. Memory controller might be disabled and then you do not >>> have your association anymore. >>> >> >> I was looking at this recently and I am wondering whether we should worry about VM_SHARE >> vmas. >> >> ie, page_to_policy() can just reverse lookup just one VMA and fetch the policy right? > > How would that help for private mappings shared between parent/child? this is MAP_PRIVATE | MAP_SHARED? and won't they get converted to shared policy for the backing shmfs? via } else if (vm_flags & VM_SHARED) { error = shmem_zero_setup(vma); if (error) goto free_vma; } else { vma_set_anonymous(vma); } > Also reducing this to a single VMA is not really necessary as > folio_check_references already does most of that work. What is really > missing is to check for other memory policies (i.e. cpusets and per-task > mempolicy). The later is what can get quite expensive. > I agree that walking all the related vma is already done in folio_check_references. I was checking do we really need to check all the vma in case of memory policy. >> if it VM_SHARE it will be a shared policy we can find using vma->vm_file? >> >> For non anonymous and anon vma not having any policy set it is owning task vma->vm_mm->owner task policy ? > > Please note that mm can be shared even outside of the traditional thread > group so you would need to go into something like mm_update_next_owner > >> We don't worry about multiple tasks that can be possibly sharing that page right? > > Why not? > On the page fault side for non anonymous vma we only respect the memory policy of the task faulting the page in. With that restrictions we could always say if demotion node is allowed by the policy of any task sharing this vma, we can demote the page to that specific node? >>> This all can get quite expensive so the primary question is, does the >>> existing behavior generates any real issues or is this more of an >>> correctness exercise? I mean it certainly is not great to demote to an >>> incompatible numa node but are there any reasonable configurations when >>> the demotion target node is explicitly excluded from memory >>> policy/cpuset? >> >> I guess vma policy is important. 
Applications want to make sure that they don't >> have variable performance and they go to lengths to avoid that by using MEM_BIND. >> So if they access the memory they always know access is satisfied from a specific >> set of NUMA nodes. Swapin can cause performance impact but then all continued >> access will be from a specific NUMA nodes. With slow memory demotion that is >> not going to be the case. Large in-memory database applications are observed to >> be sensitive to these access latencies. > > Yes, I do understand that from the correctness POV this is a problem. My > question is whether this is a _practical_ problem worth really being > fixed as it is not really a cheap fix. If there are strong node locality > assumptions by the userspace then I would expect demotion to be disabled > in the first place. Agreed. Right now these applications when they fail to allocate memory from the Node on which they are running, they fallback to nearby NUMA nodes rather than failing the database query. They would want to prevent fallback of some allocation to slow cxl memory to avoid hot database tables getting allocated from the cxl memory node. In that case one way they can consume slow cxl memory is to partition the data structure using membind and allow cold pages to demoted back to slow cxl memory making space for hot page allocation in the running NUMA node. Other option is to run the application using fsdax. I am still not clear which option will get used finally. -aneesh
On Wed, Oct 26, 2022 at 05:19:50PM +0800, Michal Hocko wrote: > On Wed 26-10-22 16:00:13, Feng Tang wrote: > > On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote: > > > On 10/26/22 1:13 PM, Feng Tang wrote: > > > > In page reclaim path, memory could be demoted from faster memory tier > > > > to slower memory tier. Currently, there is no check about cpuset's > > > > memory policy, that even if the target demotion node is not allowd > > > > by cpuset, the demotion will still happen, which breaks the cpuset > > > > semantics. > > > > > > > > So add cpuset policy check in the demotion path and skip demotion > > > > if the demotion targets are not allowed by cpuset. > > > > > > > > > > What about the vma policy or the task memory policy? Shouldn't we respect > > > those memory policy restrictions while demoting the page? > > > > Good question! We have some basic patches to consider memory policy > > in demotion path too, which are still under test, and will be posted > > soon. And the basic idea is similar to this patch. > > For that you need to consult each vma and it's owning task(s) and that > to me sounds like something to be done in folio_check_references. > Relying on memcg to get a cpuset cgroup is really ugly and not really > 100% correct. Memory controller might be disabled and then you do not > have your association anymore. You are right, for cpuset case, the solution depends on 'CONFIG_MEMCG=y', and the bright side is most of distribution have it on. > This all can get quite expensive so the primary question is, does the > existing behavior generates any real issues or is this more of an > correctness exercise? I mean it certainly is not great to demote to an > incompatible numa node but are there any reasonable configurations when > the demotion target node is explicitly excluded from memory > policy/cpuset? We haven't got customer report on this, but there are quite some customers use cpuset to bind some specific memory nodes to a docker (You've helped us solve a OOM issue in such cases), so I think it's practical to respect the cpuset semantics as much as we can. Your concern about the expensive cost makes sense! Some raw ideas are: * if the shrink_folio_list is called by kswapd, the folios come from the same per-memcg lruvec, so only one check is enough * if not from kswapd, like called form madvise or DAMON code, we can save a memcg cache, and if the next folio's memcg is same as the cache, we reuse its result. And due to the locality, the real check is rarely performed. Thanks, Feng > -- > Michal Hocko > SUSE Labs >
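As an illustration of the second idea only (caching the last memcg's cpuset mask across consecutive folios), a hypothetical fragment; the variable names and the hoisted allowed_targets mask are made up for the sketch and are not posted code:

        /* once, before the folio loop in shrink_folio_list() */
        struct mem_cgroup *last_memcg = NULL;
        nodemask_t allowed_targets, last_cpuset_nmask = NODE_MASK_ALL;

        node_get_allowed_targets(pgdat, &allowed_targets);

        /* per folio, instead of resolving the cpuset mask every time */
        memcg = folio_memcg(folio);
        if (memcg != last_memcg) {
                last_cpuset_nmask = NODE_MASK_ALL;
                if (memcg)
                        cpuset_get_allowed_mem_nodes(memcg->css.cgroup,
                                                     &last_cpuset_nmask);
                last_memcg = memcg;
        }
        if (!nodes_intersects(allowed_targets, last_cpuset_nmask))
                skip_this_demotion = true;

Because of LRU locality, consecutive folios usually belong to the same memcg, so the real lookup would rarely run.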
On Wed 26-10-22 17:38:06, Aneesh Kumar K V wrote: > On 10/26/22 4:32 PM, Michal Hocko wrote: > > On Wed 26-10-22 16:12:25, Aneesh Kumar K V wrote: > >> On 10/26/22 2:49 PM, Michal Hocko wrote: > >>> On Wed 26-10-22 16:00:13, Feng Tang wrote: > >>>> On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote: > >>>>> On 10/26/22 1:13 PM, Feng Tang wrote: > >>>>>> In page reclaim path, memory could be demoted from faster memory tier > >>>>>> to slower memory tier. Currently, there is no check about cpuset's > >>>>>> memory policy, that even if the target demotion node is not allowd > >>>>>> by cpuset, the demotion will still happen, which breaks the cpuset > >>>>>> semantics. > >>>>>> > >>>>>> So add cpuset policy check in the demotion path and skip demotion > >>>>>> if the demotion targets are not allowed by cpuset. > >>>>>> > >>>>> > >>>>> What about the vma policy or the task memory policy? Shouldn't we respect > >>>>> those memory policy restrictions while demoting the page? > >>>> > >>>> Good question! We have some basic patches to consider memory policy > >>>> in demotion path too, which are still under test, and will be posted > >>>> soon. And the basic idea is similar to this patch. > >>> > >>> For that you need to consult each vma and it's owning task(s) and that > >>> to me sounds like something to be done in folio_check_references. > >>> Relying on memcg to get a cpuset cgroup is really ugly and not really > >>> 100% correct. Memory controller might be disabled and then you do not > >>> have your association anymore. > >>> > >> > >> I was looking at this recently and I am wondering whether we should worry about VM_SHARE > >> vmas. > >> > >> ie, page_to_policy() can just reverse lookup just one VMA and fetch the policy right? > > > > How would that help for private mappings shared between parent/child? > > > this is MAP_PRIVATE | MAP_SHARED? This is not a valid combination IIRC. What I meant is a simple MAP_PRIVATE|MAP_ANON that is CoW shared between parent and child. [...]
On 10/26/22 5:51 PM, Michal Hocko wrote: > On Wed 26-10-22 17:38:06, Aneesh Kumar K V wrote: >> On 10/26/22 4:32 PM, Michal Hocko wrote: >>> On Wed 26-10-22 16:12:25, Aneesh Kumar K V wrote: >>>> On 10/26/22 2:49 PM, Michal Hocko wrote: >>>>> On Wed 26-10-22 16:00:13, Feng Tang wrote: >>>>>> On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote: >>>>>>> On 10/26/22 1:13 PM, Feng Tang wrote: >>>>>>>> In page reclaim path, memory could be demoted from faster memory tier >>>>>>>> to slower memory tier. Currently, there is no check about cpuset's >>>>>>>> memory policy, that even if the target demotion node is not allowd >>>>>>>> by cpuset, the demotion will still happen, which breaks the cpuset >>>>>>>> semantics. >>>>>>>> >>>>>>>> So add cpuset policy check in the demotion path and skip demotion >>>>>>>> if the demotion targets are not allowed by cpuset. >>>>>>>> >>>>>>> >>>>>>> What about the vma policy or the task memory policy? Shouldn't we respect >>>>>>> those memory policy restrictions while demoting the page? >>>>>> >>>>>> Good question! We have some basic patches to consider memory policy >>>>>> in demotion path too, which are still under test, and will be posted >>>>>> soon. And the basic idea is similar to this patch. >>>>> >>>>> For that you need to consult each vma and it's owning task(s) and that >>>>> to me sounds like something to be done in folio_check_references. >>>>> Relying on memcg to get a cpuset cgroup is really ugly and not really >>>>> 100% correct. Memory controller might be disabled and then you do not >>>>> have your association anymore. >>>>> >>>> >>>> I was looking at this recently and I am wondering whether we should worry about VM_SHARE >>>> vmas. >>>> >>>> ie, page_to_policy() can just reverse lookup just one VMA and fetch the policy right? >>> >>> How would that help for private mappings shared between parent/child? >> >> >> this is MAP_PRIVATE | MAP_SHARED? > Sorry, I meant MAP_ANONYMOUS | MAP_SHARED. > This is not a valid combination IIRC. What I meant is a simple > MAP_PRIVATE|MAP_ANON that is CoW shared between parent and child. > > [...] -aneesh
On 10/26/22 03:43, Feng Tang wrote: > In page reclaim path, memory could be demoted from faster memory tier > to slower memory tier. Currently, there is no check about cpuset's > memory policy, that even if the target demotion node is not allowd > by cpuset, the demotion will still happen, which breaks the cpuset > semantics. > > So add cpuset policy check in the demotion path and skip demotion > if the demotion targets are not allowed by cpuset. > > Signed-off-by: Feng Tang <feng.tang@intel.com> > --- > Hi reviewers, > > For easy bisectable, I combined the cpuset change and mm change > in one patch, if you prefer to separate them, I can turn it into > 2 patches. > > Thanks, > Feng > > include/linux/cpuset.h | 6 ++++++ > kernel/cgroup/cpuset.c | 29 +++++++++++++++++++++++++++++ > mm/vmscan.c | 35 ++++++++++++++++++++++++++++++++--- > 3 files changed, 67 insertions(+), 3 deletions(-) > > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h > index d58e0476ee8e..6fcce2bd2631 100644 > --- a/include/linux/cpuset.h > +++ b/include/linux/cpuset.h > @@ -178,6 +178,8 @@ static inline void set_mems_allowed(nodemask_t nodemask) > task_unlock(current); > } > > +extern void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, > + nodemask_t *nmask); > #else /* !CONFIG_CPUSETS */ > > static inline bool cpusets_enabled(void) { return false; } > @@ -299,6 +301,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq) > return false; > } > > +static inline void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, > + nodemask_t *nmask) > +{ > +} > #endif /* !CONFIG_CPUSETS */ > > #endif /* _LINUX_CPUSET_H */ > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c > index 3ea2e836e93e..cbb118c0502f 100644 > --- a/kernel/cgroup/cpuset.c > +++ b/kernel/cgroup/cpuset.c > @@ -3750,6 +3750,35 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk) > return mask; > } > > +/* > + * Retrieve the allowed memory nodemask for a cgroup. > + * > + * Set *nmask to cpuset's effective allowed nodemask for cgroup v2, > + * and NODE_MASK_ALL (means no constraint) for cgroup v1 where there > + * is no guaranteed association from a cgroup to a cpuset. > + */ > +void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, nodemask_t *nmask) > +{ > + struct cgroup_subsys_state *css; > + struct cpuset *cs; > + > + if (!is_in_v2_mode()) { > + *nmask = NODE_MASK_ALL; > + return; > + } You are allowing all nodes to be used for cgroup v1. Is there a reason why you ignore v1? > + > + rcu_read_lock(); > + css = cgroup_e_css(cgroup, &cpuset_cgrp_subsys); > + if (css) { > + css_get(css); > + cs = css_cs(css); > + *nmask = cs->effective_mems; > + css_put(css); > + } Since you are holding an RCU read lock and copying out the whole nodemask, you probably don't need to do a css_get/css_put pair. > + > + rcu_read_unlock(); > +} > + Cheers, Longman
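If Waiman's css_get()/css_put() remark is taken, the helper might shrink to the sketch below (a hypothetical revision; it still treats cgroup v1 as unconstrained, which is the open question above, and it still copies the whole nodemask out while under the RCU read lock, which is why the extra reference can go):

void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, nodemask_t *nmask)
{
        struct cgroup_subsys_state *css;

        if (!is_in_v2_mode()) {
                *nmask = NODE_MASK_ALL;         /* cgroup v1: no constraint */
                return;
        }

        rcu_read_lock();
        css = cgroup_e_css(cgroup, &cpuset_cgrp_subsys);
        if (css)
                /* the nodemask is copied while still under rcu_read_lock(),
                 * so a css_get()/css_put() pair is not needed */
                *nmask = css_cs(css)->effective_mems;
        rcu_read_unlock();
}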
On Wed 26-10-22 20:20:01, Feng Tang wrote: > On Wed, Oct 26, 2022 at 05:19:50PM +0800, Michal Hocko wrote: > > On Wed 26-10-22 16:00:13, Feng Tang wrote: > > > On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote: > > > > On 10/26/22 1:13 PM, Feng Tang wrote: > > > > > In page reclaim path, memory could be demoted from faster memory tier > > > > > to slower memory tier. Currently, there is no check about cpuset's > > > > > memory policy, that even if the target demotion node is not allowd > > > > > by cpuset, the demotion will still happen, which breaks the cpuset > > > > > semantics. > > > > > > > > > > So add cpuset policy check in the demotion path and skip demotion > > > > > if the demotion targets are not allowed by cpuset. > > > > > > > > > > > > > What about the vma policy or the task memory policy? Shouldn't we respect > > > > those memory policy restrictions while demoting the page? > > > > > > Good question! We have some basic patches to consider memory policy > > > in demotion path too, which are still under test, and will be posted > > > soon. And the basic idea is similar to this patch. > > > > For that you need to consult each vma and it's owning task(s) and that > > to me sounds like something to be done in folio_check_references. > > Relying on memcg to get a cpuset cgroup is really ugly and not really > > 100% correct. Memory controller might be disabled and then you do not > > have your association anymore. > > You are right, for cpuset case, the solution depends on 'CONFIG_MEMCG=y', > and the bright side is most of distribution have it on. CONFIG_MEMCG=y is not sufficient. You would need to enable memcg controller during the runtime as well. > > This all can get quite expensive so the primary question is, does the > > existing behavior generates any real issues or is this more of an > > correctness exercise? I mean it certainly is not great to demote to an > > incompatible numa node but are there any reasonable configurations when > > the demotion target node is explicitly excluded from memory > > policy/cpuset? > > We haven't got customer report on this, but there are quite some customers > use cpuset to bind some specific memory nodes to a docker (You've helped > us solve a OOM issue in such cases), so I think it's practical to respect > the cpuset semantics as much as we can. Yes, it is definitely better to respect cpusets and all local memory policies. There is no dispute there. The thing is whether this is really worth it. How often would cpusets (or policies in general) go actively against demotion nodes (i.e. exclude those nodes from their allowes node mask)? I can imagine workloads which wouldn't like to get their memory demoted for some reason but wouldn't it be more practical to tell that explicitly (e.g. via prctl) rather than configuring cpusets/memory policies explicitly? > Your concern about the expensive cost makes sense! Some raw ideas are: > * if the shrink_folio_list is called by kswapd, the folios come from > the same per-memcg lruvec, so only one check is enough > * if not from kswapd, like called form madvise or DAMON code, we can > save a memcg cache, and if the next folio's memcg is same as the > cache, we reuse its result. And due to the locality, the real > check is rarely performed. memcg is not the expensive part of the thing. You need to get from page -> all vmas::vm_policy -> mm -> task::mempolicy
On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <mhocko@suse.com> wrote: > > On Wed 26-10-22 20:20:01, Feng Tang wrote: > > On Wed, Oct 26, 2022 at 05:19:50PM +0800, Michal Hocko wrote: > > > On Wed 26-10-22 16:00:13, Feng Tang wrote: > > > > On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote: > > > > > On 10/26/22 1:13 PM, Feng Tang wrote: > > > > > > In page reclaim path, memory could be demoted from faster memory tier > > > > > > to slower memory tier. Currently, there is no check about cpuset's > > > > > > memory policy, that even if the target demotion node is not allowd > > > > > > by cpuset, the demotion will still happen, which breaks the cpuset > > > > > > semantics. > > > > > > > > > > > > So add cpuset policy check in the demotion path and skip demotion > > > > > > if the demotion targets are not allowed by cpuset. > > > > > > > > > > > > > > > > What about the vma policy or the task memory policy? Shouldn't we respect > > > > > those memory policy restrictions while demoting the page? > > > > > > > > Good question! We have some basic patches to consider memory policy > > > > in demotion path too, which are still under test, and will be posted > > > > soon. And the basic idea is similar to this patch. > > > > > > For that you need to consult each vma and it's owning task(s) and that > > > to me sounds like something to be done in folio_check_references. > > > Relying on memcg to get a cpuset cgroup is really ugly and not really > > > 100% correct. Memory controller might be disabled and then you do not > > > have your association anymore. > > > > You are right, for cpuset case, the solution depends on 'CONFIG_MEMCG=y', > > and the bright side is most of distribution have it on. > > CONFIG_MEMCG=y is not sufficient. You would need to enable memcg > controller during the runtime as well. > > > > This all can get quite expensive so the primary question is, does the > > > existing behavior generates any real issues or is this more of an > > > correctness exercise? I mean it certainly is not great to demote to an > > > incompatible numa node but are there any reasonable configurations when > > > the demotion target node is explicitly excluded from memory > > > policy/cpuset? > > > > We haven't got customer report on this, but there are quite some customers > > use cpuset to bind some specific memory nodes to a docker (You've helped > > us solve a OOM issue in such cases), so I think it's practical to respect > > the cpuset semantics as much as we can. > > Yes, it is definitely better to respect cpusets and all local memory > policies. There is no dispute there. The thing is whether this is really > worth it. How often would cpusets (or policies in general) go actively > against demotion nodes (i.e. exclude those nodes from their allowes node > mask)? > > I can imagine workloads which wouldn't like to get their memory demoted > for some reason but wouldn't it be more practical to tell that > explicitly (e.g. via prctl) rather than configuring cpusets/memory > policies explicitly? > > > Your concern about the expensive cost makes sense! Some raw ideas are: > > * if the shrink_folio_list is called by kswapd, the folios come from > > the same per-memcg lruvec, so only one check is enough > > * if not from kswapd, like called form madvise or DAMON code, we can > > save a memcg cache, and if the next folio's memcg is same as the > > cache, we reuse its result. And due to the locality, the real > > check is rarely performed. > > memcg is not the expensive part of the thing. 
You need to get from page > -> all vmas::vm_policy -> mm -> task::mempolicy Yeah, on the same page with Michal. Figuring out mempolicy from page seems quite expensive and the correctness can't be guranteed since the mempolicy could be set per-thread and the mm->task depends on CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG. > > -- > Michal Hocko > SUSE Labs >
Feng Tang <feng.tang@intel.com> writes: > In page reclaim path, memory could be demoted from faster memory tier > to slower memory tier. Currently, there is no check about cpuset's > memory policy, that even if the target demotion node is not allowd > by cpuset, the demotion will still happen, which breaks the cpuset > semantics. > > So add cpuset policy check in the demotion path and skip demotion > if the demotion targets are not allowed by cpuset. > > Signed-off-by: Feng Tang <feng.tang@intel.com> > --- > Hi reviewers, > > For easy bisectable, I combined the cpuset change and mm change > in one patch, if you prefer to separate them, I can turn it into > 2 patches. > > Thanks, > Feng > > include/linux/cpuset.h | 6 ++++++ > kernel/cgroup/cpuset.c | 29 +++++++++++++++++++++++++++++ > mm/vmscan.c | 35 ++++++++++++++++++++++++++++++++--- > 3 files changed, 67 insertions(+), 3 deletions(-) > > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h > index d58e0476ee8e..6fcce2bd2631 100644 > --- a/include/linux/cpuset.h > +++ b/include/linux/cpuset.h > @@ -178,6 +178,8 @@ static inline void set_mems_allowed(nodemask_t nodemask) > task_unlock(current); > } > > +extern void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, > + nodemask_t *nmask); > #else /* !CONFIG_CPUSETS */ > > static inline bool cpusets_enabled(void) { return false; } > @@ -299,6 +301,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq) > return false; > } > > +static inline void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, > + nodemask_t *nmask) > +{ > +} > #endif /* !CONFIG_CPUSETS */ > > #endif /* _LINUX_CPUSET_H */ > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c > index 3ea2e836e93e..cbb118c0502f 100644 > --- a/kernel/cgroup/cpuset.c > +++ b/kernel/cgroup/cpuset.c > @@ -3750,6 +3750,35 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk) > return mask; > } > > +/* > + * Retrieve the allowed memory nodemask for a cgroup. > + * > + * Set *nmask to cpuset's effective allowed nodemask for cgroup v2, > + * and NODE_MASK_ALL (means no constraint) for cgroup v1 where there > + * is no guaranteed association from a cgroup to a cpuset. > + */ > +void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, nodemask_t *nmask) > +{ > + struct cgroup_subsys_state *css; > + struct cpuset *cs; > + > + if (!is_in_v2_mode()) { > + *nmask = NODE_MASK_ALL; > + return; > + } > + > + rcu_read_lock(); > + css = cgroup_e_css(cgroup, &cpuset_cgrp_subsys); > + if (css) { > + css_get(css); > + cs = css_cs(css); > + *nmask = cs->effective_mems; > + css_put(css); > + } > + > + rcu_read_unlock(); > +} > + > /** > * cpuset_nodemask_valid_mems_allowed - check nodemask vs. current mems_allowed > * @nodemask: the nodemask to be checked > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 18f6497994ec..c205d98283bc 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1537,9 +1537,21 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private) > { > struct page *target_page; > nodemask_t *allowed_mask; > - struct migration_target_control *mtc; > + struct migration_target_control *mtc = (void *)private; > > - mtc = (struct migration_target_control *)private; I think we should avoid (void *) conversion here. 
> +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS) > + struct mem_cgroup *memcg; > + nodemask_t cpuset_nmask; > + > + memcg = page_memcg(page); > + cpuset_get_allowed_mem_nodes(memcg->css.cgroup, &cpuset_nmask); > + > + if (!node_isset(mtc->nid, cpuset_nmask)) { > + if (mtc->nmask) > + nodes_and(*mtc->nmask, *mtc->nmask, cpuset_nmask); > + return alloc_migration_target(page, (unsigned long)mtc); > + } If node_isset(mtc->nid, cpuset_nmask) == true, we should keep the original 2 steps allocation and apply nodes_and() on node mask. > +#endif > > allowed_mask = mtc->nmask; > /* > @@ -1649,6 +1661,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > enum folio_references references = FOLIOREF_RECLAIM; > bool dirty, writeback; > unsigned int nr_pages; > + bool skip_this_demotion = false; > > cond_resched(); > > @@ -1658,6 +1671,22 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > if (!folio_trylock(folio)) > goto keep; > > +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS) > + if (do_demote_pass) { > + struct mem_cgroup *memcg; > + nodemask_t nmask, nmask1; > + > + node_get_allowed_targets(pgdat, &nmask); pgdat will not change in the loop, so we can move this out of the loop? > + memcg = folio_memcg(folio); > + if (memcg) > + cpuset_get_allowed_mem_nodes(memcg->css.cgroup, > + &nmask1); > + > + if (!nodes_intersects(nmask, nmask1)) > + skip_this_demotion = true; > + } If nodes_intersects() == true, we will call cpuset_get_allowed_mem_nodes() twice. Better to pass the intersecting mask to demote_folio_list()? > +#endif > + > VM_BUG_ON_FOLIO(folio_test_active(folio), folio); > > nr_pages = folio_nr_pages(folio); > @@ -1799,7 +1828,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > * Before reclaiming the folio, try to relocate > * its contents to another node. > */ > - if (do_demote_pass && > + if (do_demote_pass && !skip_this_demotion && > (thp_migration_supported() || !folio_test_large(folio))) { > list_add(&folio->lru, &demote_folios); > folio_unlock(folio); Best Regards, Huang, Ying
On Thu, Oct 27, 2022 at 01:13:30PM +0800, Huang, Ying wrote: > Feng Tang <feng.tang@intel.com> writes: > > > In page reclaim path, memory could be demoted from faster memory tier > > to slower memory tier. Currently, there is no check about cpuset's > > memory policy, that even if the target demotion node is not allowd > > by cpuset, the demotion will still happen, which breaks the cpuset > > semantics. > > > > So add cpuset policy check in the demotion path and skip demotion > > if the demotion targets are not allowed by cpuset. > > > > Signed-off-by: Feng Tang <feng.tang@intel.com> > > --- [...] > > index 18f6497994ec..c205d98283bc 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -1537,9 +1537,21 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private) > > { > > struct page *target_page; > > nodemask_t *allowed_mask; > > - struct migration_target_control *mtc; > > + struct migration_target_control *mtc = (void *)private; > > > > - mtc = (struct migration_target_control *)private; > > I think we should avoid (void *) conversion here. OK, will change back. > > +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS) > > + struct mem_cgroup *memcg; > > + nodemask_t cpuset_nmask; > > + > > + memcg = page_memcg(page); > > + cpuset_get_allowed_mem_nodes(memcg->css.cgroup, &cpuset_nmask); > > + > > + if (!node_isset(mtc->nid, cpuset_nmask)) { > > + if (mtc->nmask) > > + nodes_and(*mtc->nmask, *mtc->nmask, cpuset_nmask); > > + return alloc_migration_target(page, (unsigned long)mtc); > > + } > > If node_isset(mtc->nid, cpuset_nmask) == true, we should keep the > original 2 steps allocation and apply nodes_and() on node mask. Good catch! Yes, the nodes_and() call should be taken out from this check and done before calling node_isset(). > > +#endif > > > > allowed_mask = mtc->nmask; > > /* > > @@ -1649,6 +1661,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > enum folio_references references = FOLIOREF_RECLAIM; > > bool dirty, writeback; > > unsigned int nr_pages; > > + bool skip_this_demotion = false; > > > > cond_resched(); > > > > @@ -1658,6 +1671,22 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > if (!folio_trylock(folio)) > > goto keep; > > > > +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS) > > + if (do_demote_pass) { > > + struct mem_cgroup *memcg; > > + nodemask_t nmask, nmask1; > > + > > + node_get_allowed_targets(pgdat, &nmask); > > pgdat will not change in the loop, so we can move this out of the loop? Yes > > + memcg = folio_memcg(folio); > > + if (memcg) > > + cpuset_get_allowed_mem_nodes(memcg->css.cgroup, > > + &nmask1); > > + > > + if (!nodes_intersects(nmask, nmask1)) > > + skip_this_demotion = true; > > + } > > If nodes_intersects() == true, we will call > cpuset_get_allowed_mem_nodes() twice. Better to pass the intersecting > mask to demote_folio_list()? The pages in the loop may come from different mem control group, and the cpuset's nodemask could be different, I don't know how to save this per-page info to be used later in demote_folio_list. Thanks, Feng > > +#endif > > + > > VM_BUG_ON_FOLIO(folio_test_active(folio), folio); > > > > nr_pages = folio_nr_pages(folio); > > @@ -1799,7 +1828,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > * Before reclaiming the folio, try to relocate > > * its contents to another node. 
> >  		 */
> > -		if (do_demote_pass &&
> > +		if (do_demote_pass && !skip_this_demotion &&
> >  		    (thp_migration_supported() || !folio_test_large(folio))) {
> >  			list_add(&folio->lru, &demote_folios);
> >  			folio_unlock(folio);
>
> Best Regards,
> Huang, Ying
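Putting the two agreed changes together, one possible (untested) shape for the shrink_folio_list() side could hoist node_get_allowed_targets() out of the loop and consult the cpuset mask only when a memcg is actually present:

#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
	nodemask_t allowed_targets;

	/* pgdat is fixed for the whole folio_list, so compute this only once */
	node_get_allowed_targets(pgdat, &allowed_targets);
#endif

	while (!list_empty(folio_list)) {
		...
#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
		if (do_demote_pass) {
			struct mem_cgroup *memcg = folio_memcg(folio);
			nodemask_t cpuset_nmask;

			/* only consult the cpuset when the folio has a memcg */
			if (memcg) {
				cpuset_get_allowed_mem_nodes(memcg->css.cgroup,
							     &cpuset_nmask);
				if (!nodes_intersects(allowed_targets, cpuset_nmask))
					skip_this_demotion = true;
			}
		}
#endif
		...
	}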
On Wed, Oct 26, 2022 at 10:36:32PM +0800, Waiman Long wrote: > On 10/26/22 03:43, Feng Tang wrote: > > In page reclaim path, memory could be demoted from faster memory tier > > to slower memory tier. Currently, there is no check about cpuset's > > memory policy, that even if the target demotion node is not allowd > > by cpuset, the demotion will still happen, which breaks the cpuset > > semantics. > > > > So add cpuset policy check in the demotion path and skip demotion > > if the demotion targets are not allowed by cpuset. > > > > Signed-off-by: Feng Tang <feng.tang@intel.com> > > --- > > Hi reviewers, > > > > For easy bisectable, I combined the cpuset change and mm change > > in one patch, if you prefer to separate them, I can turn it into > > 2 patches. > > > > Thanks, > > Feng > > > > include/linux/cpuset.h | 6 ++++++ > > kernel/cgroup/cpuset.c | 29 +++++++++++++++++++++++++++++ > > mm/vmscan.c | 35 ++++++++++++++++++++++++++++++++--- > > 3 files changed, 67 insertions(+), 3 deletions(-) > > > > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h > > index d58e0476ee8e..6fcce2bd2631 100644 > > --- a/include/linux/cpuset.h > > +++ b/include/linux/cpuset.h > > @@ -178,6 +178,8 @@ static inline void set_mems_allowed(nodemask_t nodemask) > > task_unlock(current); > > } > > > > +extern void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, > > + nodemask_t *nmask); > > #else /* !CONFIG_CPUSETS */ > > > > static inline bool cpusets_enabled(void) { return false; } > > @@ -299,6 +301,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq) > > return false; > > } > > > > +static inline void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, > > + nodemask_t *nmask) > > +{ > > +} > > #endif /* !CONFIG_CPUSETS */ > > > > #endif /* _LINUX_CPUSET_H */ > > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c > > index 3ea2e836e93e..cbb118c0502f 100644 > > --- a/kernel/cgroup/cpuset.c > > +++ b/kernel/cgroup/cpuset.c > > @@ -3750,6 +3750,35 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk) > > return mask; > > } > > > > +/* > > + * Retrieve the allowed memory nodemask for a cgroup. > > + * > > + * Set *nmask to cpuset's effective allowed nodemask for cgroup v2, > > + * and NODE_MASK_ALL (means no constraint) for cgroup v1 where there > > + * is no guaranteed association from a cgroup to a cpuset. > > + */ > > +void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, nodemask_t *nmask) > > +{ > > + struct cgroup_subsys_state *css; > > + struct cpuset *cs; > > + > > + if (!is_in_v2_mode()) { > > + *nmask = NODE_MASK_ALL; > > + return; > > + } > > You are allowing all nodes to be used for cgroup v1. Is there a reason > why you ignore v1? The use case for the API is, for a memory control group, user want to get its associated cpuset controller's memory policy, so it tries the memcg --> cgroup --> cpuset chain. IIUC, there is no a reliable chain for cgroup v1, plus cgroup v2 is the default option for many distros, the cgroup v1 is bypassed here. > > + > > + rcu_read_lock(); > > + css = cgroup_e_css(cgroup, &cpuset_cgrp_subsys); > > + if (css) { > > + css_get(css); > > + cs = css_cs(css); > > + *nmask = cs->effective_mems; > > + css_put(css); > > + } > Since you are holding an RCU read lock and copying out the whole > nodemask, you probably don't need to do a css_get/css_put pair. Thanks for the note! Thanks, Feng > > + > > + rcu_read_unlock(); > > +} > > + > Cheers, > > Longman >
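For reference, a sketch of the helper with the css_get()/css_put() pair dropped as suggested above; the whole nodemask copy still happens under rcu_read_lock():

void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, nodemask_t *nmask)
{
	struct cgroup_subsys_state *css;

	if (!is_in_v2_mode()) {
		*nmask = NODE_MASK_ALL;
		return;
	}

	rcu_read_lock();
	css = cgroup_e_css(cgroup, &cpuset_cgrp_subsys);
	if (css)
		/* the mask is copied entirely under RCU, no css reference needed */
		*nmask = css_cs(css)->effective_mems;
	rcu_read_unlock();
}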
Feng Tang <feng.tang@intel.com> writes: > On Thu, Oct 27, 2022 at 01:13:30PM +0800, Huang, Ying wrote: >> Feng Tang <feng.tang@intel.com> writes: >> >> > In page reclaim path, memory could be demoted from faster memory tier >> > to slower memory tier. Currently, there is no check about cpuset's >> > memory policy, that even if the target demotion node is not allowd >> > by cpuset, the demotion will still happen, which breaks the cpuset >> > semantics. >> > >> > So add cpuset policy check in the demotion path and skip demotion >> > if the demotion targets are not allowed by cpuset. >> > >> > Signed-off-by: Feng Tang <feng.tang@intel.com> >> > --- > [...] >> > index 18f6497994ec..c205d98283bc 100644 >> > --- a/mm/vmscan.c >> > +++ b/mm/vmscan.c >> > @@ -1537,9 +1537,21 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private) >> > { >> > struct page *target_page; >> > nodemask_t *allowed_mask; >> > - struct migration_target_control *mtc; >> > + struct migration_target_control *mtc = (void *)private; >> > >> > - mtc = (struct migration_target_control *)private; >> >> I think we should avoid (void *) conversion here. > > OK, will change back. > >> > +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS) >> > + struct mem_cgroup *memcg; >> > + nodemask_t cpuset_nmask; >> > + >> > + memcg = page_memcg(page); >> > + cpuset_get_allowed_mem_nodes(memcg->css.cgroup, &cpuset_nmask); >> > + >> > + if (!node_isset(mtc->nid, cpuset_nmask)) { >> > + if (mtc->nmask) >> > + nodes_and(*mtc->nmask, *mtc->nmask, cpuset_nmask); >> > + return alloc_migration_target(page, (unsigned long)mtc); >> > + } >> >> If node_isset(mtc->nid, cpuset_nmask) == true, we should keep the >> original 2 steps allocation and apply nodes_and() on node mask. > > Good catch! Yes, the nodes_and() call should be taken out from this > check and done before calling node_isset(). > >> > +#endif >> > >> > allowed_mask = mtc->nmask; >> > /* >> > @@ -1649,6 +1661,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, >> > enum folio_references references = FOLIOREF_RECLAIM; >> > bool dirty, writeback; >> > unsigned int nr_pages; >> > + bool skip_this_demotion = false; >> > >> > cond_resched(); >> > >> > @@ -1658,6 +1671,22 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, >> > if (!folio_trylock(folio)) >> > goto keep; >> > >> > +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS) >> > + if (do_demote_pass) { >> > + struct mem_cgroup *memcg; >> > + nodemask_t nmask, nmask1; >> > + >> > + node_get_allowed_targets(pgdat, &nmask); >> >> pgdat will not change in the loop, so we can move this out of the loop? > > Yes > >> > + memcg = folio_memcg(folio); >> > + if (memcg) >> > + cpuset_get_allowed_mem_nodes(memcg->css.cgroup, >> > + &nmask1); >> > + >> > + if (!nodes_intersects(nmask, nmask1)) >> > + skip_this_demotion = true; >> > + } >> >> If nodes_intersects() == true, we will call >> cpuset_get_allowed_mem_nodes() twice. Better to pass the intersecting >> mask to demote_folio_list()? > > The pages in the loop may come from different mem control group, and > the cpuset's nodemask could be different, I don't know how to save > this per-page info to be used later in demote_folio_list. Yes. You are right. We cannot do that. 
Best Regards,
Huang, Ying

>
>> > +#endif
>> > +
>> >  	VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
>> >
>> >  	nr_pages = folio_nr_pages(folio);
>> > @@ -1799,7 +1828,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>> >  		 * Before reclaiming the folio, try to relocate
>> >  		 * its contents to another node.
>> >  		 */
>> > -		if (do_demote_pass &&
>> > +		if (do_demote_pass && !skip_this_demotion &&
>> >  		    (thp_migration_supported() || !folio_test_large(folio))) {
>> >  			list_add(&folio->lru, &demote_folios);
>> >  			folio_unlock(folio);
>>
>> Best Regards,
>> Huang, Ying
Michal Hocko <mhocko@suse.com> writes: > On Wed 26-10-22 20:20:01, Feng Tang wrote: >> On Wed, Oct 26, 2022 at 05:19:50PM +0800, Michal Hocko wrote: >> > On Wed 26-10-22 16:00:13, Feng Tang wrote: >> > > On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote: >> > > > On 10/26/22 1:13 PM, Feng Tang wrote: >> > > > > In page reclaim path, memory could be demoted from faster memory tier >> > > > > to slower memory tier. Currently, there is no check about cpuset's >> > > > > memory policy, that even if the target demotion node is not allowd >> > > > > by cpuset, the demotion will still happen, which breaks the cpuset >> > > > > semantics. >> > > > > >> > > > > So add cpuset policy check in the demotion path and skip demotion >> > > > > if the demotion targets are not allowed by cpuset. >> > > > > >> > > > >> > > > What about the vma policy or the task memory policy? Shouldn't we respect >> > > > those memory policy restrictions while demoting the page? >> > > >> > > Good question! We have some basic patches to consider memory policy >> > > in demotion path too, which are still under test, and will be posted >> > > soon. And the basic idea is similar to this patch. >> > >> > For that you need to consult each vma and it's owning task(s) and that >> > to me sounds like something to be done in folio_check_references. >> > Relying on memcg to get a cpuset cgroup is really ugly and not really >> > 100% correct. Memory controller might be disabled and then you do not >> > have your association anymore. >> >> You are right, for cpuset case, the solution depends on 'CONFIG_MEMCG=y', >> and the bright side is most of distribution have it on. > > CONFIG_MEMCG=y is not sufficient. You would need to enable memcg > controller during the runtime as well. > >> > This all can get quite expensive so the primary question is, does the >> > existing behavior generates any real issues or is this more of an >> > correctness exercise? I mean it certainly is not great to demote to an >> > incompatible numa node but are there any reasonable configurations when >> > the demotion target node is explicitly excluded from memory >> > policy/cpuset? >> >> We haven't got customer report on this, but there are quite some customers >> use cpuset to bind some specific memory nodes to a docker (You've helped >> us solve a OOM issue in such cases), so I think it's practical to respect >> the cpuset semantics as much as we can. > > Yes, it is definitely better to respect cpusets and all local memory > policies. There is no dispute there. The thing is whether this is really > worth it. How often would cpusets (or policies in general) go actively > against demotion nodes (i.e. exclude those nodes from their allowes node > mask)? > > I can imagine workloads which wouldn't like to get their memory demoted > for some reason but wouldn't it be more practical to tell that > explicitly (e.g. via prctl) rather than configuring cpusets/memory > policies explicitly? If my understanding were correct, prctl() configures the process or thread. How can we get process/thread configuration at demotion time? Best Regards, Huang, Ying
On Thu 27-10-22 14:47:22, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
[...]
> > I can imagine workloads which wouldn't like to get their memory demoted
> > for some reason but wouldn't it be more practical to tell that
> > explicitly (e.g. via prctl) rather than configuring cpusets/memory
> > policies explicitly?
>
> If my understanding were correct, prctl() configures the process or
> thread.

Not necessarily. There are properties which are per address space, like
PR_[GS]ET_THP_DISABLE. This could be very similar.

> How can we get process/thread configuration at demotion time?

As already pointed out in previous emails, you could hook into the
folio_check_references path, more specifically folio_referenced_one,
where you have all that you need already - all vmas mapping the page, and
then it is trivial to get the corresponding vm_mm. If at least one of
them has the flag set then the demotion is not allowed (essentially the
same model as VM_LOCKED).
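To illustrate the shape of such a hook, here is a rough sketch built around a hypothetical per-mm flag (MMF_NO_DEMOTION does not exist; it stands in for whatever bit a future prctl() would set):

static bool folio_may_demote_one(struct folio *folio, struct vm_area_struct *vma,
				 unsigned long addr, void *arg)
{
	bool *may_demote = arg;

	/* MMF_NO_DEMOTION is hypothetical; the model mirrors VM_LOCKED handling */
	if (vma->vm_mm && test_bit(MMF_NO_DEMOTION, &vma->vm_mm->flags)) {
		*may_demote = false;
		return false;	/* one opted-out mapping is enough, stop the walk */
	}
	return true;
}

static bool folio_may_demote(struct folio *folio)
{
	bool may_demote = true;
	struct rmap_walk_control rwc = {
		.rmap_one = folio_may_demote_one,
		.arg = &may_demote,
	};

	rmap_walk(folio, &rwc);
	return may_demote;
}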
On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote: > On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <mhocko@suse.com> wrote: [...] > > > > This all can get quite expensive so the primary question is, does the > > > > existing behavior generates any real issues or is this more of an > > > > correctness exercise? I mean it certainly is not great to demote to an > > > > incompatible numa node but are there any reasonable configurations when > > > > the demotion target node is explicitly excluded from memory > > > > policy/cpuset? > > > > > > We haven't got customer report on this, but there are quite some customers > > > use cpuset to bind some specific memory nodes to a docker (You've helped > > > us solve a OOM issue in such cases), so I think it's practical to respect > > > the cpuset semantics as much as we can. > > > > Yes, it is definitely better to respect cpusets and all local memory > > policies. There is no dispute there. The thing is whether this is really > > worth it. How often would cpusets (or policies in general) go actively > > against demotion nodes (i.e. exclude those nodes from their allowes node > > mask)? > > > > I can imagine workloads which wouldn't like to get their memory demoted > > for some reason but wouldn't it be more practical to tell that > > explicitly (e.g. via prctl) rather than configuring cpusets/memory > > policies explicitly? > > > > > Your concern about the expensive cost makes sense! Some raw ideas are: > > > * if the shrink_folio_list is called by kswapd, the folios come from > > > the same per-memcg lruvec, so only one check is enough > > > * if not from kswapd, like called form madvise or DAMON code, we can > > > save a memcg cache, and if the next folio's memcg is same as the > > > cache, we reuse its result. And due to the locality, the real > > > check is rarely performed. > > > > memcg is not the expensive part of the thing. You need to get from page > > -> all vmas::vm_policy -> mm -> task::mempolicy > > Yeah, on the same page with Michal. Figuring out mempolicy from page > seems quite expensive and the correctness can't be guranteed since the > mempolicy could be set per-thread and the mm->task depends on > CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG. Yes, you are right. Our "working" psudo code for mem policy looks like what Michal mentioned, and it can't work for all cases, but try to enforce it whenever possible: static bool __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma, unsigned long addr, void *arg) { bool *skip_demotion = arg; struct mempolicy *mpol; int nid, dnid; bool ret = true; mpol = __get_vma_policy(vma, addr); if (!mpol) { struct task_struct *task; if (vma->vm_mm) task = vma->vm_mm->owner; if (task) { mpol = get_task_policy(task); if (mpol) mpol_get(mpol); } } if (!mpol) return ret; if (mpol->mode != MPOL_BIND) goto put_exit; nid = folio_nid(folio); dnid = next_demotion_node(nid); if (!node_isset(dnid, mpol->nodes)) { *skip_demotion = true; ret = false; } put_exit: mpol_put(mpol); return ret; } static unsigned int shrink_page_list(struct list_head *page_list,..) { ... bool skip_demotion = false; struct rmap_walk_control rwc = { .arg = &skip_demotion, .rmap_one = __check_mpol_demotion, }; /* memory policy check */ rmap_walk(folio, &rwc); if (skip_demotion) goto keep_locked; } And there seems to be no simple solution for getting the memory policy from a page. Thanks, Feng > > > > -- > > Michal Hocko > > SUSE Labs > > >
Michal Hocko <mhocko@suse.com> writes:

> On Thu 27-10-22 14:47:22, Huang, Ying wrote:
>> Michal Hocko <mhocko@suse.com> writes:
> [...]
>> > I can imagine workloads which wouldn't like to get their memory demoted
>> > for some reason but wouldn't it be more practical to tell that
>> > explicitly (e.g. via prctl) rather than configuring cpusets/memory
>> > policies explicitly?
>>
>> If my understanding were correct, prctl() configures the process or
>> thread.
>
> Not necessarily. There are properties which are per adddress space like
> PR_[GS]ET_THP_DISABLE. This could be very similar.
>
>> How can we get process/thread configuration at demotion time?
>
> As already pointed out in previous emails. You could hook into
> folio_check_references path, more specifically folio_referenced_one
> where you have all that you need already - all vmas mapping the page and
> then it is trivial to get the corresponding vm_mm. If at least one of
> them has the flag set then the demotion is not allowed (essentially the
> same model as VM_LOCKED).

Got it! Thanks for the detailed explanation.

One bit may not be sufficient. For example, if we want to avoid or
control cross-socket demotion and still allow demoting to slow memory
nodes in the local socket, we need to specify a node mask to exclude some
NUMA nodes from the demotion targets.

From an overhead point of view, this appears similar to that of VMA/task
memory policy? We can make mm->owner available for memory tiers
(CONFIG_NUMA && CONFIG_MIGRATION). The advantage is that we don't need
to introduce a new ABI. I guess users may prefer to use `numactl` rather
than a new ABI?

Best Regards,
Huang, Ying
Feng Tang <feng.tang@intel.com> writes: > On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote: >> On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <mhocko@suse.com> wrote: > [...] >> > > > This all can get quite expensive so the primary question is, does the >> > > > existing behavior generates any real issues or is this more of an >> > > > correctness exercise? I mean it certainly is not great to demote to an >> > > > incompatible numa node but are there any reasonable configurations when >> > > > the demotion target node is explicitly excluded from memory >> > > > policy/cpuset? >> > > >> > > We haven't got customer report on this, but there are quite some customers >> > > use cpuset to bind some specific memory nodes to a docker (You've helped >> > > us solve a OOM issue in such cases), so I think it's practical to respect >> > > the cpuset semantics as much as we can. >> > >> > Yes, it is definitely better to respect cpusets and all local memory >> > policies. There is no dispute there. The thing is whether this is really >> > worth it. How often would cpusets (or policies in general) go actively >> > against demotion nodes (i.e. exclude those nodes from their allowes node >> > mask)? >> > >> > I can imagine workloads which wouldn't like to get their memory demoted >> > for some reason but wouldn't it be more practical to tell that >> > explicitly (e.g. via prctl) rather than configuring cpusets/memory >> > policies explicitly? >> > >> > > Your concern about the expensive cost makes sense! Some raw ideas are: >> > > * if the shrink_folio_list is called by kswapd, the folios come from >> > > the same per-memcg lruvec, so only one check is enough >> > > * if not from kswapd, like called form madvise or DAMON code, we can >> > > save a memcg cache, and if the next folio's memcg is same as the >> > > cache, we reuse its result. And due to the locality, the real >> > > check is rarely performed. >> > >> > memcg is not the expensive part of the thing. You need to get from page >> > -> all vmas::vm_policy -> mm -> task::mempolicy >> >> Yeah, on the same page with Michal. Figuring out mempolicy from page >> seems quite expensive and the correctness can't be guranteed since the >> mempolicy could be set per-thread and the mm->task depends on >> CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG. > > Yes, you are right. Our "working" psudo code for mem policy looks like > what Michal mentioned, and it can't work for all cases, but try to > enforce it whenever possible: > > static bool __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma, > unsigned long addr, void *arg) > { > bool *skip_demotion = arg; > struct mempolicy *mpol; > int nid, dnid; > bool ret = true; > > mpol = __get_vma_policy(vma, addr); > if (!mpol) { > struct task_struct *task; task = NULL; > if (vma->vm_mm) > task = vma->vm_mm->owner; > > if (task) { > mpol = get_task_policy(task); > if (mpol) > mpol_get(mpol); > } > } > > if (!mpol) > return ret; > > if (mpol->mode != MPOL_BIND) > goto put_exit; > > nid = folio_nid(folio); > dnid = next_demotion_node(nid); > if (!node_isset(dnid, mpol->nodes)) { > *skip_demotion = true; > ret = false; > } I think that you need to get a node mask instead. Even if !node_isset(dnid, mpol->nodes), you may demote to other node in the node mask. Best Regards, Huang, Ying > > put_exit: > mpol_put(mpol); > return ret; > } > > static unsigned int shrink_page_list(struct list_head *page_list,..) > { > ... 
> > bool skip_demotion = false; > struct rmap_walk_control rwc = { > .arg = &skip_demotion, > .rmap_one = __check_mpol_demotion, > }; > > /* memory policy check */ > rmap_walk(folio, &rwc); > if (skip_demotion) > goto keep_locked; > } > > And there seems to be no simple solution for getting the memory > policy from a page. > > Thanks, > Feng > >> > >> > -- >> > Michal Hocko >> > SUSE Labs >> > >>
On Thu, Oct 27, 2022 at 03:45:12PM +0800, Huang, Ying wrote: > Feng Tang <feng.tang@intel.com> writes: > > > On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote: > >> On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <mhocko@suse.com> wrote: > > [...] > >> > > > This all can get quite expensive so the primary question is, does the > >> > > > existing behavior generates any real issues or is this more of an > >> > > > correctness exercise? I mean it certainly is not great to demote to an > >> > > > incompatible numa node but are there any reasonable configurations when > >> > > > the demotion target node is explicitly excluded from memory > >> > > > policy/cpuset? > >> > > > >> > > We haven't got customer report on this, but there are quite some customers > >> > > use cpuset to bind some specific memory nodes to a docker (You've helped > >> > > us solve a OOM issue in such cases), so I think it's practical to respect > >> > > the cpuset semantics as much as we can. > >> > > >> > Yes, it is definitely better to respect cpusets and all local memory > >> > policies. There is no dispute there. The thing is whether this is really > >> > worth it. How often would cpusets (or policies in general) go actively > >> > against demotion nodes (i.e. exclude those nodes from their allowes node > >> > mask)? > >> > > >> > I can imagine workloads which wouldn't like to get their memory demoted > >> > for some reason but wouldn't it be more practical to tell that > >> > explicitly (e.g. via prctl) rather than configuring cpusets/memory > >> > policies explicitly? > >> > > >> > > Your concern about the expensive cost makes sense! Some raw ideas are: > >> > > * if the shrink_folio_list is called by kswapd, the folios come from > >> > > the same per-memcg lruvec, so only one check is enough > >> > > * if not from kswapd, like called form madvise or DAMON code, we can > >> > > save a memcg cache, and if the next folio's memcg is same as the > >> > > cache, we reuse its result. And due to the locality, the real > >> > > check is rarely performed. > >> > > >> > memcg is not the expensive part of the thing. You need to get from page > >> > -> all vmas::vm_policy -> mm -> task::mempolicy > >> > >> Yeah, on the same page with Michal. Figuring out mempolicy from page > >> seems quite expensive and the correctness can't be guranteed since the > >> mempolicy could be set per-thread and the mm->task depends on > >> CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG. > > > > Yes, you are right. Our "working" psudo code for mem policy looks like > > what Michal mentioned, and it can't work for all cases, but try to > > enforce it whenever possible: > > > > static bool __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma, > > unsigned long addr, void *arg) > > { > > bool *skip_demotion = arg; > > struct mempolicy *mpol; > > int nid, dnid; > > bool ret = true; > > > > mpol = __get_vma_policy(vma, addr); > > if (!mpol) { > > struct task_struct *task; > > task = NULL; > > > if (vma->vm_mm) > > task = vma->vm_mm->owner; > > > > if (task) { > > mpol = get_task_policy(task); > > if (mpol) > > mpol_get(mpol); > > } > > } > > > > if (!mpol) > > return ret; > > > > if (mpol->mode != MPOL_BIND) > > goto put_exit; > > > > nid = folio_nid(folio); > > dnid = next_demotion_node(nid); > > if (!node_isset(dnid, mpol->nodes)) { > > *skip_demotion = true; > > ret = false; > > } > > I think that you need to get a node mask instead. Even if > !node_isset(dnid, mpol->nodes), you may demote to other node in the node > mask. 
Yes, you are right. This code was written/tested about 2 months ago,
before Aneesh's memory tiering interface patchset. It was listed to
demonstrate the idea of the solution.

Thanks,
Feng

> Best Regards,
> Huang, Ying
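Following the node-mask point above, a hedged variant of the MPOL_BIND test that consults the full set of allowed demotion targets (via node_get_allowed_targets() from the memory tiering code) instead of only next_demotion_node() might look like the fragment below; it is meant as a drop-in for the corresponding part of the earlier pseudo code, reusing its mpol/skip_demotion/ret variables:

	if (mpol->mode == MPOL_BIND) {
		nodemask_t targets;

		/* all allowed demotion targets of the folio's current node */
		node_get_allowed_targets(NODE_DATA(folio_nid(folio)), &targets);
		if (!nodes_intersects(targets, mpol->nodes)) {
			*skip_demotion = true;
			ret = false;
		}
	}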
On Thu 27-10-22 15:39:00, Huang, Ying wrote: > Michal Hocko <mhocko@suse.com> writes: > > > On Thu 27-10-22 14:47:22, Huang, Ying wrote: > >> Michal Hocko <mhocko@suse.com> writes: > > [...] > >> > I can imagine workloads which wouldn't like to get their memory demoted > >> > for some reason but wouldn't it be more practical to tell that > >> > explicitly (e.g. via prctl) rather than configuring cpusets/memory > >> > policies explicitly? > >> > >> If my understanding were correct, prctl() configures the process or > >> thread. > > > > Not necessarily. There are properties which are per adddress space like > > PR_[GS]ET_THP_DISABLE. This could be very similar. > > > >> How can we get process/thread configuration at demotion time? > > > > As already pointed out in previous emails. You could hook into > > folio_check_references path, more specifically folio_referenced_one > > where you have all that you need already - all vmas mapping the page and > > then it is trivial to get the corresponding vm_mm. If at least one of > > them has the flag set then the demotion is not allowed (essentially the > > same model as VM_LOCKED). > > Got it! Thanks for detailed explanation. > > One bit may be not sufficient. For example, if we want to avoid or > control cross-socket demotion and still allow demoting to slow memory > nodes in local socket, we need to specify a node mask to exclude some > NUMA nodes from demotion targets. Isn't this something to be configured on the demotion topology side? Or do you expect there will be per process/address space usecases? I mean different processes running on the same topology, one requesting local demotion while other ok with the whole demotion topology? > >From overhead point of view, this appears similar as that of VMA/task > memory policy? We can make mm->owner available for memory tiers > (CONFIG_NUMA && CONFIG_MIGRATION). The advantage is that we don't need > to introduce new ABI. I guess users may prefer to use `numactl` than a > new ABI? mm->owner is a wrong direction. It doesn't have a strong meaning because there is no one task explicitly responsible for the mm so there is no real owner (our clone() semantic is just to permissive for that). The memcg::owner is a crude and ugly hack and it should go away over time rather than build new uses. Besides that, and as I have already tried to explain, per task demotion policy is what makes this whole thing expensive. So this better be a per mm or per vma property. Whether it is a on/off knob like PR_[GS]ET_THP_DISABLE or there are explicit requirements for fine grain control on the vma level I dunno. I haven't seen those usecases yet and it is really easy to overengineer this. To be completely honest I would much rather wait for those usecases before adding a more complex APIs. PR_[GS]_DEMOTION_DISABLED sounds like a reasonable first step. Should we have more fine grained requirements wrt address space I would follow the MADV_{NO}HUGEPAGE lead. If we really need/want to give a fine grained control over demotion nodemask then we would have to go with vma->mempolicy interface. In any case a per process on/off knob sounds like a reasonable first step before we learn more about real usecases.
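For reference, the PR_[GS]ET_THP_DISABLE knob used as the model above looks like this from userspace; a hypothetical PR_SET_DEMOTION_DISABLED would presumably have the same shape:

#include <stdio.h>
#include <sys/prctl.h>

int main(void)
{
	/* existing per-mm knob; a demotion opt-out could mirror it */
	if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
		perror("PR_SET_THP_DISABLE");

	printf("THP disabled for this mm: %ld\n",
	       (long)prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0));
	return 0;
}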
On Wed 26-10-22 18:05:46, Aneesh Kumar K V wrote: > On 10/26/22 5:51 PM, Michal Hocko wrote: > > On Wed 26-10-22 17:38:06, Aneesh Kumar K V wrote: > >> On 10/26/22 4:32 PM, Michal Hocko wrote: > >>> On Wed 26-10-22 16:12:25, Aneesh Kumar K V wrote: > >>>> On 10/26/22 2:49 PM, Michal Hocko wrote: > >>>>> On Wed 26-10-22 16:00:13, Feng Tang wrote: > >>>>>> On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote: > >>>>>>> On 10/26/22 1:13 PM, Feng Tang wrote: > >>>>>>>> In page reclaim path, memory could be demoted from faster memory tier > >>>>>>>> to slower memory tier. Currently, there is no check about cpuset's > >>>>>>>> memory policy, that even if the target demotion node is not allowd > >>>>>>>> by cpuset, the demotion will still happen, which breaks the cpuset > >>>>>>>> semantics. > >>>>>>>> > >>>>>>>> So add cpuset policy check in the demotion path and skip demotion > >>>>>>>> if the demotion targets are not allowed by cpuset. > >>>>>>>> > >>>>>>> > >>>>>>> What about the vma policy or the task memory policy? Shouldn't we respect > >>>>>>> those memory policy restrictions while demoting the page? > >>>>>> > >>>>>> Good question! We have some basic patches to consider memory policy > >>>>>> in demotion path too, which are still under test, and will be posted > >>>>>> soon. And the basic idea is similar to this patch. > >>>>> > >>>>> For that you need to consult each vma and it's owning task(s) and that > >>>>> to me sounds like something to be done in folio_check_references. > >>>>> Relying on memcg to get a cpuset cgroup is really ugly and not really > >>>>> 100% correct. Memory controller might be disabled and then you do not > >>>>> have your association anymore. > >>>>> > >>>> > >>>> I was looking at this recently and I am wondering whether we should worry about VM_SHARE > >>>> vmas. > >>>> > >>>> ie, page_to_policy() can just reverse lookup just one VMA and fetch the policy right? > >>> > >>> How would that help for private mappings shared between parent/child? > >> > >> > >> this is MAP_PRIVATE | MAP_SHARED? > > > > Sorry, I meant MAP_ANONYMOUS | MAP_SHARED. I am still not sure where you are targeting to be honest. MAP_SHARED or MAP_PRIVATE both can have page shared between several vmas.
Michal Hocko <mhocko@suse.com> writes: > On Thu 27-10-22 15:39:00, Huang, Ying wrote: >> Michal Hocko <mhocko@suse.com> writes: >> >> > On Thu 27-10-22 14:47:22, Huang, Ying wrote: >> >> Michal Hocko <mhocko@suse.com> writes: >> > [...] >> >> > I can imagine workloads which wouldn't like to get their memory demoted >> >> > for some reason but wouldn't it be more practical to tell that >> >> > explicitly (e.g. via prctl) rather than configuring cpusets/memory >> >> > policies explicitly? >> >> >> >> If my understanding were correct, prctl() configures the process or >> >> thread. >> > >> > Not necessarily. There are properties which are per adddress space like >> > PR_[GS]ET_THP_DISABLE. This could be very similar. >> > >> >> How can we get process/thread configuration at demotion time? >> > >> > As already pointed out in previous emails. You could hook into >> > folio_check_references path, more specifically folio_referenced_one >> > where you have all that you need already - all vmas mapping the page and >> > then it is trivial to get the corresponding vm_mm. If at least one of >> > them has the flag set then the demotion is not allowed (essentially the >> > same model as VM_LOCKED). >> >> Got it! Thanks for detailed explanation. >> >> One bit may be not sufficient. For example, if we want to avoid or >> control cross-socket demotion and still allow demoting to slow memory >> nodes in local socket, we need to specify a node mask to exclude some >> NUMA nodes from demotion targets. > > Isn't this something to be configured on the demotion topology side? Or > do you expect there will be per process/address space usecases? I mean > different processes running on the same topology, one requesting local > demotion while other ok with the whole demotion topology? I think that it's possible for different processes have different requirements. - Some processes don't care about where the memory is placed, prefer local, then fall back to remote if no free space. - Some processes want to avoid cross-socket traffic, bind to nodes of local socket. - Some processes want to avoid to use slow memory, bind to fast memory node only. >> >From overhead point of view, this appears similar as that of VMA/task >> memory policy? We can make mm->owner available for memory tiers >> (CONFIG_NUMA && CONFIG_MIGRATION). The advantage is that we don't need >> to introduce new ABI. I guess users may prefer to use `numactl` than a >> new ABI? > > mm->owner is a wrong direction. It doesn't have a strong meaning because > there is no one task explicitly responsible for the mm so there is no > real owner (our clone() semantic is just to permissive for that). The > memcg::owner is a crude and ugly hack and it should go away over time > rather than build new uses. > > Besides that, and as I have already tried to explain, per task demotion > policy is what makes this whole thing expensive. So this better be a per > mm or per vma property. Whether it is a on/off knob like PR_[GS]ET_THP_DISABLE > or there are explicit requirements for fine grain control on the vma > level I dunno. I haven't seen those usecases yet and it is really easy > to overengineer this. > > To be completely honest I would much rather wait for those usecases > before adding a more complex APIs. PR_[GS]_DEMOTION_DISABLED sounds > like a reasonable first step. Should we have more fine grained > requirements wrt address space I would follow the MADV_{NO}HUGEPAGE > lead. 
>
> If we really need/want to give a fine grained control over demotion
> nodemask then we would have to go with vma->mempolicy interface. In
> any case a per process on/off knob sounds like a reasonable first step
> before we learn more about real usecases.

Yes. A per-mm or per-vma property is much better than a per-task property.
Another possibility: how about adding a new flag to the set_mempolicy()
system call to set a per-mm mempolicy? `numactl` could use that by
default.

Best Regards,
Huang, Ying
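For context, today's set_mempolicy() is thread-scoped; the flag proposed above, which would make the same call apply to the whole mm, does not exist. The existing call looks roughly like this (link with -lnuma; node numbers are just an example):

#include <stdio.h>
#include <numaif.h>

int main(void)
{
	/* binds only the *calling thread* to nodes 0 and 1 today */
	unsigned long nodemask = (1UL << 0) | (1UL << 1);

	if (set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8))
		perror("set_mempolicy");
	return 0;
}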
On 10/27/22 2:32 PM, Michal Hocko wrote:
>>
>> Sorry, I meant MAP_ANONYMOUS | MAP_SHARED.
>
> I am still not sure where you are targeting to be honest. MAP_SHARED or
> MAP_PRIVATE both can have page shared between several vmas.

What I was checking was, w.r.t. demotion of shared pages, whether we need
to cross-check all the related memory policies. On the page fault side we
don't do that, i.e. if the vma policy or the faulting task's policy
allowed pages to be allocated from the demotion node, then we allocate
the page there even if there is a conflicting policy from another thread.
For example, in the case of MAP_ANON | MAP_PRIVATE CoW-shared pages, if
the parent did allow allocation from the demotion node, we can have pages
in the demotion node even though the child's memory policy doesn't have
that node in its node mask.

-aneesh
On Thu 27-10-22 17:31:35, Huang, Ying wrote: > Michal Hocko <mhocko@suse.com> writes: > > > On Thu 27-10-22 15:39:00, Huang, Ying wrote: > >> Michal Hocko <mhocko@suse.com> writes: > >> > >> > On Thu 27-10-22 14:47:22, Huang, Ying wrote: > >> >> Michal Hocko <mhocko@suse.com> writes: > >> > [...] > >> >> > I can imagine workloads which wouldn't like to get their memory demoted > >> >> > for some reason but wouldn't it be more practical to tell that > >> >> > explicitly (e.g. via prctl) rather than configuring cpusets/memory > >> >> > policies explicitly? > >> >> > >> >> If my understanding were correct, prctl() configures the process or > >> >> thread. > >> > > >> > Not necessarily. There are properties which are per adddress space like > >> > PR_[GS]ET_THP_DISABLE. This could be very similar. > >> > > >> >> How can we get process/thread configuration at demotion time? > >> > > >> > As already pointed out in previous emails. You could hook into > >> > folio_check_references path, more specifically folio_referenced_one > >> > where you have all that you need already - all vmas mapping the page and > >> > then it is trivial to get the corresponding vm_mm. If at least one of > >> > them has the flag set then the demotion is not allowed (essentially the > >> > same model as VM_LOCKED). > >> > >> Got it! Thanks for detailed explanation. > >> > >> One bit may be not sufficient. For example, if we want to avoid or > >> control cross-socket demotion and still allow demoting to slow memory > >> nodes in local socket, we need to specify a node mask to exclude some > >> NUMA nodes from demotion targets. > > > > Isn't this something to be configured on the demotion topology side? Or > > do you expect there will be per process/address space usecases? I mean > > different processes running on the same topology, one requesting local > > demotion while other ok with the whole demotion topology? > > I think that it's possible for different processes have different > requirements. > > - Some processes don't care about where the memory is placed, prefer > local, then fall back to remote if no free space. > > - Some processes want to avoid cross-socket traffic, bind to nodes of > local socket. > > - Some processes want to avoid to use slow memory, bind to fast memory > node only. Yes, I do understand that. Do you have any specific examples in mind? [...] > > If we really need/want to give a fine grained control over demotion > > nodemask then we would have to go with vma->mempolicy interface. In > > any case a per process on/off knob sounds like a reasonable first step > > before we learn more about real usecases. > > Yes. Per-mm or per-vma property is much better than per-task property. > Another possibility, how about add a new flag to set_mempolicy() system > call to set the per-mm mempolicy? `numactl` can use that by default. Do you mean a flag to control whether the given policy is applied to a task or mm?
On Thu 27-10-22 15:46:07, Aneesh Kumar K V wrote:
> On 10/27/22 2:32 PM, Michal Hocko wrote:
>
> >>
> >> Sorry, I meant MAP_ANONYMOUS | MAP_SHARED.
> >
> > I am still not sure where you are targeting to be honest. MAP_SHARED or
> > MAP_PRIVATE both can have page shared between several vmas.
>
> What I was checking was w.r.t demotion and shared pages do we need to
> cross-check all the related memory policies? On the page fault side, we don't do that.

Yes, because on the page fault we do have an originator and so we can
apply some reasonable semantic. For the memory reclaim there is no such
originator for a specific page. A completely unrelated context might be
reclaiming a page with some mempolicy constraint and you do not have any
record of who has faulted it in. The fact that we have a memory policy
also at the task level makes a completely consistent semantic rather
hard, if possible at all (e.g. what if different threads are simply bound
to different nodes because shared memory is prefaulted and local thread
mappings will always be local). I do not think shared mappings are very
much special in that regard. It is our mempolicy API that allows
specifying a policy for vmas as well as tasks, and the latter makes the
semantic for reclaim really awkward.
On Thu, Oct 27, 2022 at 12:12 AM Feng Tang <feng.tang@intel.com> wrote: > > On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote: > > On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <mhocko@suse.com> wrote: > [...] > > > > > This all can get quite expensive so the primary question is, does the > > > > > existing behavior generates any real issues or is this more of an > > > > > correctness exercise? I mean it certainly is not great to demote to an > > > > > incompatible numa node but are there any reasonable configurations when > > > > > the demotion target node is explicitly excluded from memory > > > > > policy/cpuset? > > > > > > > > We haven't got customer report on this, but there are quite some customers > > > > use cpuset to bind some specific memory nodes to a docker (You've helped > > > > us solve a OOM issue in such cases), so I think it's practical to respect > > > > the cpuset semantics as much as we can. > > > > > > Yes, it is definitely better to respect cpusets and all local memory > > > policies. There is no dispute there. The thing is whether this is really > > > worth it. How often would cpusets (or policies in general) go actively > > > against demotion nodes (i.e. exclude those nodes from their allowes node > > > mask)? > > > > > > I can imagine workloads which wouldn't like to get their memory demoted > > > for some reason but wouldn't it be more practical to tell that > > > explicitly (e.g. via prctl) rather than configuring cpusets/memory > > > policies explicitly? > > > > > > > Your concern about the expensive cost makes sense! Some raw ideas are: > > > > * if the shrink_folio_list is called by kswapd, the folios come from > > > > the same per-memcg lruvec, so only one check is enough > > > > * if not from kswapd, like called form madvise or DAMON code, we can > > > > save a memcg cache, and if the next folio's memcg is same as the > > > > cache, we reuse its result. And due to the locality, the real > > > > check is rarely performed. > > > > > > memcg is not the expensive part of the thing. You need to get from page > > > -> all vmas::vm_policy -> mm -> task::mempolicy > > > > Yeah, on the same page with Michal. Figuring out mempolicy from page > > seems quite expensive and the correctness can't be guranteed since the > > mempolicy could be set per-thread and the mm->task depends on > > CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG. > > Yes, you are right. Our "working" psudo code for mem policy looks like > what Michal mentioned, and it can't work for all cases, but try to > enforce it whenever possible: > > static bool __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma, > unsigned long addr, void *arg) > { > bool *skip_demotion = arg; > struct mempolicy *mpol; > int nid, dnid; > bool ret = true; > > mpol = __get_vma_policy(vma, addr); > if (!mpol) { > struct task_struct *task; > if (vma->vm_mm) > task = vma->vm_mm->owner; But this task may not be the task you want IIUC. For example, the process has two threads, A and B. They have different mempolicy. The vmscan is trying to demote a page belonging to thread A, but the task may point to thread B, so you actually get the wrong mempolicy IIUC. 
> > if (task) { > mpol = get_task_policy(task); > if (mpol) > mpol_get(mpol); > } > } > > if (!mpol) > return ret; > > if (mpol->mode != MPOL_BIND) > goto put_exit; > > nid = folio_nid(folio); > dnid = next_demotion_node(nid); > if (!node_isset(dnid, mpol->nodes)) { > *skip_demotion = true; > ret = false; > } > > put_exit: > mpol_put(mpol); > return ret; > } > > static unsigned int shrink_page_list(struct list_head *page_list,..) > { > ... > > bool skip_demotion = false; > struct rmap_walk_control rwc = { > .arg = &skip_demotion, > .rmap_one = __check_mpol_demotion, > }; > > /* memory policy check */ > rmap_walk(folio, &rwc); > if (skip_demotion) > goto keep_locked; > } > > And there seems to be no simple solution for getting the memory > policy from a page. > > Thanks, > Feng > > > > > > > -- > > > Michal Hocko > > > SUSE Labs > > > > >
Michal Hocko <mhocko@suse.com> writes: > On Thu 27-10-22 17:31:35, Huang, Ying wrote: >> Michal Hocko <mhocko@suse.com> writes: >> >> > On Thu 27-10-22 15:39:00, Huang, Ying wrote: >> >> Michal Hocko <mhocko@suse.com> writes: >> >> >> >> > On Thu 27-10-22 14:47:22, Huang, Ying wrote: >> >> >> Michal Hocko <mhocko@suse.com> writes: >> >> > [...] >> >> >> > I can imagine workloads which wouldn't like to get their memory demoted >> >> >> > for some reason but wouldn't it be more practical to tell that >> >> >> > explicitly (e.g. via prctl) rather than configuring cpusets/memory >> >> >> > policies explicitly? >> >> >> >> >> >> If my understanding were correct, prctl() configures the process or >> >> >> thread. >> >> > >> >> > Not necessarily. There are properties which are per adddress space like >> >> > PR_[GS]ET_THP_DISABLE. This could be very similar. >> >> > >> >> >> How can we get process/thread configuration at demotion time? >> >> > >> >> > As already pointed out in previous emails. You could hook into >> >> > folio_check_references path, more specifically folio_referenced_one >> >> > where you have all that you need already - all vmas mapping the page and >> >> > then it is trivial to get the corresponding vm_mm. If at least one of >> >> > them has the flag set then the demotion is not allowed (essentially the >> >> > same model as VM_LOCKED). >> >> >> >> Got it! Thanks for detailed explanation. >> >> >> >> One bit may be not sufficient. For example, if we want to avoid or >> >> control cross-socket demotion and still allow demoting to slow memory >> >> nodes in local socket, we need to specify a node mask to exclude some >> >> NUMA nodes from demotion targets. >> > >> > Isn't this something to be configured on the demotion topology side? Or >> > do you expect there will be per process/address space usecases? I mean >> > different processes running on the same topology, one requesting local >> > demotion while other ok with the whole demotion topology? >> >> I think that it's possible for different processes have different >> requirements. >> >> - Some processes don't care about where the memory is placed, prefer >> local, then fall back to remote if no free space. >> >> - Some processes want to avoid cross-socket traffic, bind to nodes of >> local socket. >> >> - Some processes want to avoid to use slow memory, bind to fast memory >> node only. > > Yes, I do understand that. Do you have any specific examples in mind? > [...] Sorry, I don't have specific examples. >> > If we really need/want to give a fine grained control over demotion >> > nodemask then we would have to go with vma->mempolicy interface. In >> > any case a per process on/off knob sounds like a reasonable first step >> > before we learn more about real usecases. >> >> Yes. Per-mm or per-vma property is much better than per-task property. >> Another possibility, how about add a new flag to set_mempolicy() system >> call to set the per-mm mempolicy? `numactl` can use that by default. > > Do you mean a flag to control whether the given policy is applied to a > task or mm? Yes. That is the idea. Best Regards, Huang, Ying
On Thu, Oct 27, 2022 at 10:55:58AM -0700, Yang Shi wrote: > On Thu, Oct 27, 2022 at 12:12 AM Feng Tang <feng.tang@intel.com> wrote: > > > > On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote: > > > On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <mhocko@suse.com> wrote: > > [...] > > > > > > This all can get quite expensive so the primary question is, does the > > > > > > existing behavior generates any real issues or is this more of an > > > > > > correctness exercise? I mean it certainly is not great to demote to an > > > > > > incompatible numa node but are there any reasonable configurations when > > > > > > the demotion target node is explicitly excluded from memory > > > > > > policy/cpuset? > > > > > > > > > > We haven't got customer report on this, but there are quite some customers > > > > > use cpuset to bind some specific memory nodes to a docker (You've helped > > > > > us solve a OOM issue in such cases), so I think it's practical to respect > > > > > the cpuset semantics as much as we can. > > > > > > > > Yes, it is definitely better to respect cpusets and all local memory > > > > policies. There is no dispute there. The thing is whether this is really > > > > worth it. How often would cpusets (or policies in general) go actively > > > > against demotion nodes (i.e. exclude those nodes from their allowes node > > > > mask)? > > > > > > > > I can imagine workloads which wouldn't like to get their memory demoted > > > > for some reason but wouldn't it be more practical to tell that > > > > explicitly (e.g. via prctl) rather than configuring cpusets/memory > > > > policies explicitly? > > > > > > > > > Your concern about the expensive cost makes sense! Some raw ideas are: > > > > > * if the shrink_folio_list is called by kswapd, the folios come from > > > > > the same per-memcg lruvec, so only one check is enough > > > > > * if not from kswapd, like called form madvise or DAMON code, we can > > > > > save a memcg cache, and if the next folio's memcg is same as the > > > > > cache, we reuse its result. And due to the locality, the real > > > > > check is rarely performed. > > > > > > > > memcg is not the expensive part of the thing. You need to get from page > > > > -> all vmas::vm_policy -> mm -> task::mempolicy > > > > > > Yeah, on the same page with Michal. Figuring out mempolicy from page > > > seems quite expensive and the correctness can't be guranteed since the > > > mempolicy could be set per-thread and the mm->task depends on > > > CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG. > > > > Yes, you are right. Our "working" psudo code for mem policy looks like > > what Michal mentioned, and it can't work for all cases, but try to > > enforce it whenever possible: > > > > static bool __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma, > > unsigned long addr, void *arg) > > { > > bool *skip_demotion = arg; > > struct mempolicy *mpol; > > int nid, dnid; > > bool ret = true; > > > > mpol = __get_vma_policy(vma, addr); > > if (!mpol) { > > struct task_struct *task; > > if (vma->vm_mm) > > task = vma->vm_mm->owner; > > But this task may not be the task you want IIUC. For example, the > process has two threads, A and B. They have different mempolicy. The > vmscan is trying to demote a page belonging to thread A, but the task > may point to thread B, so you actually get the wrong mempolicy IIUC. Yes, this is a valid concern! We don't have good solution for this. 
We don't have a good solution for this.

For memory policy, we may only handle the per-vma policy for now, whose
cost is relatively low, as a best-effort try.

Thanks,
Feng
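A trimmed, per-VMA-only sketch of the earlier callback (no vm_mm->owner fallback), which side-steps the thread-ambiguity problem at the cost of ignoring task-level policies; mpol_cond_put() is used since __get_vma_policy() only takes a reference for shared policies:

static bool __check_vma_mpol_demotion(struct folio *folio, struct vm_area_struct *vma,
				      unsigned long addr, void *arg)
{
	bool *skip_demotion = arg;
	struct mempolicy *mpol = __get_vma_policy(vma, addr);
	bool ret = true;

	if (!mpol)
		return true;	/* no VMA policy, nothing to enforce here */

	if (mpol->mode == MPOL_BIND) {
		nodemask_t targets;

		node_get_allowed_targets(NODE_DATA(folio_nid(folio)), &targets);
		if (!nodes_intersects(targets, mpol->nodes)) {
			*skip_demotion = true;
			ret = false;	/* stop the rmap walk early */
		}
	}

	mpol_cond_put(mpol);
	return ret;
}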
On 10/27/22 11:25 PM, Yang Shi wrote: > On Thu, Oct 27, 2022 at 12:12 AM Feng Tang <feng.tang@intel.com> wrote: >> >> On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote: >>> On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <mhocko@suse.com> wrote: >> [...] >>>>>> This all can get quite expensive so the primary question is, does the >>>>>> existing behavior generates any real issues or is this more of an >>>>>> correctness exercise? I mean it certainly is not great to demote to an >>>>>> incompatible numa node but are there any reasonable configurations when >>>>>> the demotion target node is explicitly excluded from memory >>>>>> policy/cpuset? >>>>> >>>>> We haven't got customer report on this, but there are quite some customers >>>>> use cpuset to bind some specific memory nodes to a docker (You've helped >>>>> us solve a OOM issue in such cases), so I think it's practical to respect >>>>> the cpuset semantics as much as we can. >>>> >>>> Yes, it is definitely better to respect cpusets and all local memory >>>> policies. There is no dispute there. The thing is whether this is really >>>> worth it. How often would cpusets (or policies in general) go actively >>>> against demotion nodes (i.e. exclude those nodes from their allowes node >>>> mask)? >>>> >>>> I can imagine workloads which wouldn't like to get their memory demoted >>>> for some reason but wouldn't it be more practical to tell that >>>> explicitly (e.g. via prctl) rather than configuring cpusets/memory >>>> policies explicitly? >>>> >>>>> Your concern about the expensive cost makes sense! Some raw ideas are: >>>>> * if the shrink_folio_list is called by kswapd, the folios come from >>>>> the same per-memcg lruvec, so only one check is enough >>>>> * if not from kswapd, like called form madvise or DAMON code, we can >>>>> save a memcg cache, and if the next folio's memcg is same as the >>>>> cache, we reuse its result. And due to the locality, the real >>>>> check is rarely performed. >>>> >>>> memcg is not the expensive part of the thing. You need to get from page >>>> -> all vmas::vm_policy -> mm -> task::mempolicy >>> >>> Yeah, on the same page with Michal. Figuring out mempolicy from page >>> seems quite expensive and the correctness can't be guranteed since the >>> mempolicy could be set per-thread and the mm->task depends on >>> CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG. >> >> Yes, you are right. Our "working" psudo code for mem policy looks like >> what Michal mentioned, and it can't work for all cases, but try to >> enforce it whenever possible: >> >> static bool __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma, >> unsigned long addr, void *arg) >> { >> bool *skip_demotion = arg; >> struct mempolicy *mpol; >> int nid, dnid; >> bool ret = true; >> >> mpol = __get_vma_policy(vma, addr); >> if (!mpol) { >> struct task_struct *task; >> if (vma->vm_mm) >> task = vma->vm_mm->owner; > > But this task may not be the task you want IIUC. For example, the > process has two threads, A and B. They have different mempolicy. The > vmscan is trying to demote a page belonging to thread A, but the task > may point to thread B, so you actually get the wrong mempolicy IIUC. > But if we swap out this page and fault back in via thread B the page would get allocated as per thread B mempolicy. So if we demote based on thread B policy are we breaking anything? -aneesh
Feng Tang <feng.tang@intel.com> writes: > On Thu, Oct 27, 2022 at 10:55:58AM -0700, Yang Shi wrote: >> On Thu, Oct 27, 2022 at 12:12 AM Feng Tang <feng.tang@intel.com> wrote: >> > >> > On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote: >> > > On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <mhocko@suse.com> wrote: >> > [...] >> > > > > > This all can get quite expensive so the primary question is, does the >> > > > > > existing behavior generates any real issues or is this more of an >> > > > > > correctness exercise? I mean it certainly is not great to demote to an >> > > > > > incompatible numa node but are there any reasonable configurations when >> > > > > > the demotion target node is explicitly excluded from memory >> > > > > > policy/cpuset? >> > > > > >> > > > > We haven't got customer report on this, but there are quite some customers >> > > > > use cpuset to bind some specific memory nodes to a docker (You've helped >> > > > > us solve a OOM issue in such cases), so I think it's practical to respect >> > > > > the cpuset semantics as much as we can. >> > > > >> > > > Yes, it is definitely better to respect cpusets and all local memory >> > > > policies. There is no dispute there. The thing is whether this is really >> > > > worth it. How often would cpusets (or policies in general) go actively >> > > > against demotion nodes (i.e. exclude those nodes from their allowes node >> > > > mask)? >> > > > >> > > > I can imagine workloads which wouldn't like to get their memory demoted >> > > > for some reason but wouldn't it be more practical to tell that >> > > > explicitly (e.g. via prctl) rather than configuring cpusets/memory >> > > > policies explicitly? >> > > > >> > > > > Your concern about the expensive cost makes sense! Some raw ideas are: >> > > > > * if the shrink_folio_list is called by kswapd, the folios come from >> > > > > the same per-memcg lruvec, so only one check is enough >> > > > > * if not from kswapd, like called form madvise or DAMON code, we can >> > > > > save a memcg cache, and if the next folio's memcg is same as the >> > > > > cache, we reuse its result. And due to the locality, the real >> > > > > check is rarely performed. >> > > > >> > > > memcg is not the expensive part of the thing. You need to get from page >> > > > -> all vmas::vm_policy -> mm -> task::mempolicy >> > > >> > > Yeah, on the same page with Michal. Figuring out mempolicy from page >> > > seems quite expensive and the correctness can't be guranteed since the >> > > mempolicy could be set per-thread and the mm->task depends on >> > > CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG. >> > >> > Yes, you are right. Our "working" psudo code for mem policy looks like >> > what Michal mentioned, and it can't work for all cases, but try to >> > enforce it whenever possible: >> > >> > static bool __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma, >> > unsigned long addr, void *arg) >> > { >> > bool *skip_demotion = arg; >> > struct mempolicy *mpol; >> > int nid, dnid; >> > bool ret = true; >> > >> > mpol = __get_vma_policy(vma, addr); >> > if (!mpol) { >> > struct task_struct *task; >> > if (vma->vm_mm) >> > task = vma->vm_mm->owner; >> >> But this task may not be the task you want IIUC. For example, the >> process has two threads, A and B. They have different mempolicy. The >> vmscan is trying to demote a page belonging to thread A, but the task >> may point to thread B, so you actually get the wrong mempolicy IIUC. > > Yes, this is a valid concern! 
We don't have good solution for this.
> For memory policy, we may only handle the per-vma policy for now whose
> cost is relatively low, as a best-effort try.

Yes. The solution isn't perfect, especially for multi-threaded processes
with thread-specific memory policies. But the proposed code above can
support the most common cases at least, that is, running the workload
with `numactl`.

Best Regards,
Huang, Ying
On Thu, Oct 27, 2022 at 10:09 PM Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> wrote: > > On 10/27/22 11:25 PM, Yang Shi wrote: > > On Thu, Oct 27, 2022 at 12:12 AM Feng Tang <feng.tang@intel.com> wrote: > >> > >> On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote: > >>> On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <mhocko@suse.com> wrote: > >> [...] > >>>>>> This all can get quite expensive so the primary question is, does the > >>>>>> existing behavior generates any real issues or is this more of an > >>>>>> correctness exercise? I mean it certainly is not great to demote to an > >>>>>> incompatible numa node but are there any reasonable configurations when > >>>>>> the demotion target node is explicitly excluded from memory > >>>>>> policy/cpuset? > >>>>> > >>>>> We haven't got customer report on this, but there are quite some customers > >>>>> use cpuset to bind some specific memory nodes to a docker (You've helped > >>>>> us solve a OOM issue in such cases), so I think it's practical to respect > >>>>> the cpuset semantics as much as we can. > >>>> > >>>> Yes, it is definitely better to respect cpusets and all local memory > >>>> policies. There is no dispute there. The thing is whether this is really > >>>> worth it. How often would cpusets (or policies in general) go actively > >>>> against demotion nodes (i.e. exclude those nodes from their allowes node > >>>> mask)? > >>>> > >>>> I can imagine workloads which wouldn't like to get their memory demoted > >>>> for some reason but wouldn't it be more practical to tell that > >>>> explicitly (e.g. via prctl) rather than configuring cpusets/memory > >>>> policies explicitly? > >>>> > >>>>> Your concern about the expensive cost makes sense! Some raw ideas are: > >>>>> * if the shrink_folio_list is called by kswapd, the folios come from > >>>>> the same per-memcg lruvec, so only one check is enough > >>>>> * if not from kswapd, like called form madvise or DAMON code, we can > >>>>> save a memcg cache, and if the next folio's memcg is same as the > >>>>> cache, we reuse its result. And due to the locality, the real > >>>>> check is rarely performed. > >>>> > >>>> memcg is not the expensive part of the thing. You need to get from page > >>>> -> all vmas::vm_policy -> mm -> task::mempolicy > >>> > >>> Yeah, on the same page with Michal. Figuring out mempolicy from page > >>> seems quite expensive and the correctness can't be guranteed since the > >>> mempolicy could be set per-thread and the mm->task depends on > >>> CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG. > >> > >> Yes, you are right. Our "working" psudo code for mem policy looks like > >> what Michal mentioned, and it can't work for all cases, but try to > >> enforce it whenever possible: > >> > >> static bool __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma, > >> unsigned long addr, void *arg) > >> { > >> bool *skip_demotion = arg; > >> struct mempolicy *mpol; > >> int nid, dnid; > >> bool ret = true; > >> > >> mpol = __get_vma_policy(vma, addr); > >> if (!mpol) { > >> struct task_struct *task; > >> if (vma->vm_mm) > >> task = vma->vm_mm->owner; > > > > But this task may not be the task you want IIUC. For example, the > > process has two threads, A and B. They have different mempolicy. The > > vmscan is trying to demote a page belonging to thread A, but the task > > may point to thread B, so you actually get the wrong mempolicy IIUC. 
> > > > But if we swap out this page and fault back in via thread B the page would > get allocated as per thread B mempolicy. So if we demote based on thread B > policy are we breaking anything? If the page is demoted by following thread B's mempolicy, hasn't it already broken thread A's mempolicy in the first place, if you care about it? If thread A and thread B have the same mempolicy, then it is not a problem. Actually there is another problem for shared pages. If a page is shared by two processes, P1 and P2, when you do the rmap walk to find the task, you may find two contradicting mempolicies; which one would you like to obey? Do you have to save all the intermediate mempolicy results somewhere, or do you just bail out once the first mempolicy is found? > > -aneesh > >
On Thu, Oct 27, 2022 at 10:55 PM Huang, Ying <ying.huang@intel.com> wrote: > > Feng Tang <feng.tang@intel.com> writes: > > > On Thu, Oct 27, 2022 at 10:55:58AM -0700, Yang Shi wrote: > >> On Thu, Oct 27, 2022 at 12:12 AM Feng Tang <feng.tang@intel.com> wrote: > >> > > >> > On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote: > >> > > On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <mhocko@suse.com> wrote: > >> > [...] > >> > > > > > This all can get quite expensive so the primary question is, does the > >> > > > > > existing behavior generates any real issues or is this more of an > >> > > > > > correctness exercise? I mean it certainly is not great to demote to an > >> > > > > > incompatible numa node but are there any reasonable configurations when > >> > > > > > the demotion target node is explicitly excluded from memory > >> > > > > > policy/cpuset? > >> > > > > > >> > > > > We haven't got customer report on this, but there are quite some customers > >> > > > > use cpuset to bind some specific memory nodes to a docker (You've helped > >> > > > > us solve a OOM issue in such cases), so I think it's practical to respect > >> > > > > the cpuset semantics as much as we can. > >> > > > > >> > > > Yes, it is definitely better to respect cpusets and all local memory > >> > > > policies. There is no dispute there. The thing is whether this is really > >> > > > worth it. How often would cpusets (or policies in general) go actively > >> > > > against demotion nodes (i.e. exclude those nodes from their allowes node > >> > > > mask)? > >> > > > > >> > > > I can imagine workloads which wouldn't like to get their memory demoted > >> > > > for some reason but wouldn't it be more practical to tell that > >> > > > explicitly (e.g. via prctl) rather than configuring cpusets/memory > >> > > > policies explicitly? > >> > > > > >> > > > > Your concern about the expensive cost makes sense! Some raw ideas are: > >> > > > > * if the shrink_folio_list is called by kswapd, the folios come from > >> > > > > the same per-memcg lruvec, so only one check is enough > >> > > > > * if not from kswapd, like called form madvise or DAMON code, we can > >> > > > > save a memcg cache, and if the next folio's memcg is same as the > >> > > > > cache, we reuse its result. And due to the locality, the real > >> > > > > check is rarely performed. > >> > > > > >> > > > memcg is not the expensive part of the thing. You need to get from page > >> > > > -> all vmas::vm_policy -> mm -> task::mempolicy > >> > > > >> > > Yeah, on the same page with Michal. Figuring out mempolicy from page > >> > > seems quite expensive and the correctness can't be guranteed since the > >> > > mempolicy could be set per-thread and the mm->task depends on > >> > > CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG. > >> > > >> > Yes, you are right. Our "working" psudo code for mem policy looks like > >> > what Michal mentioned, and it can't work for all cases, but try to > >> > enforce it whenever possible: > >> > > >> > static bool __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma, > >> > unsigned long addr, void *arg) > >> > { > >> > bool *skip_demotion = arg; > >> > struct mempolicy *mpol; > >> > int nid, dnid; > >> > bool ret = true; > >> > > >> > mpol = __get_vma_policy(vma, addr); > >> > if (!mpol) { > >> > struct task_struct *task; > >> > if (vma->vm_mm) > >> > task = vma->vm_mm->owner; > >> > >> But this task may not be the task you want IIUC. For example, the > >> process has two threads, A and B. 
They have different mempolicy. The > >> vmscan is trying to demote a page belonging to thread A, but the task > >> may point to thread B, so you actually get the wrong mempolicy IIUC. > > > > Yes, this is a valid concern! We don't have good solution for this. > > For memory policy, we may only handle the per-vma policy for now whose > > cost is relatively low, as a best-effort try. > > Yes. The solution isn't perfect, especially for multiple-thread > processes with thread specific memory policy. But the proposed code > above can support the most common cases at least, that is, run workload > with `numactl`. Not only multi-threaded processes; it may also be broken for shared pages. When you do the rmap walk, you may get multiple contradicting mempolicies; which one would you like to obey? TBH I'm not sure whether such a half-baked solution is worth it or not, at least at this moment. The cost is not cheap, but the gain may not be worth it IMHO. > > Best Regards, > Huang, Ying
Yang Shi <shy828301@gmail.com> writes: > On Thu, Oct 27, 2022 at 10:09 PM Aneesh Kumar K V > <aneesh.kumar@linux.ibm.com> wrote: >> >> On 10/27/22 11:25 PM, Yang Shi wrote: >> > On Thu, Oct 27, 2022 at 12:12 AM Feng Tang <feng.tang@intel.com> wrote: >> >> >> >> On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote: >> >>> On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <mhocko@suse.com> wrote: >> >> [...] >> >>>>>> This all can get quite expensive so the primary question is, does the >> >>>>>> existing behavior generates any real issues or is this more of an >> >>>>>> correctness exercise? I mean it certainly is not great to demote to an >> >>>>>> incompatible numa node but are there any reasonable configurations when >> >>>>>> the demotion target node is explicitly excluded from memory >> >>>>>> policy/cpuset? >> >>>>> >> >>>>> We haven't got customer report on this, but there are quite some customers >> >>>>> use cpuset to bind some specific memory nodes to a docker (You've helped >> >>>>> us solve a OOM issue in such cases), so I think it's practical to respect >> >>>>> the cpuset semantics as much as we can. >> >>>> >> >>>> Yes, it is definitely better to respect cpusets and all local memory >> >>>> policies. There is no dispute there. The thing is whether this is really >> >>>> worth it. How often would cpusets (or policies in general) go actively >> >>>> against demotion nodes (i.e. exclude those nodes from their allowes node >> >>>> mask)? >> >>>> >> >>>> I can imagine workloads which wouldn't like to get their memory demoted >> >>>> for some reason but wouldn't it be more practical to tell that >> >>>> explicitly (e.g. via prctl) rather than configuring cpusets/memory >> >>>> policies explicitly? >> >>>> >> >>>>> Your concern about the expensive cost makes sense! Some raw ideas are: >> >>>>> * if the shrink_folio_list is called by kswapd, the folios come from >> >>>>> the same per-memcg lruvec, so only one check is enough >> >>>>> * if not from kswapd, like called form madvise or DAMON code, we can >> >>>>> save a memcg cache, and if the next folio's memcg is same as the >> >>>>> cache, we reuse its result. And due to the locality, the real >> >>>>> check is rarely performed. >> >>>> >> >>>> memcg is not the expensive part of the thing. You need to get from page >> >>>> -> all vmas::vm_policy -> mm -> task::mempolicy >> >>> >> >>> Yeah, on the same page with Michal. Figuring out mempolicy from page >> >>> seems quite expensive and the correctness can't be guranteed since the >> >>> mempolicy could be set per-thread and the mm->task depends on >> >>> CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG. >> >> >> >> Yes, you are right. Our "working" psudo code for mem policy looks like >> >> what Michal mentioned, and it can't work for all cases, but try to >> >> enforce it whenever possible: >> >> >> >> static bool __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma, >> >> unsigned long addr, void *arg) >> >> { >> >> bool *skip_demotion = arg; >> >> struct mempolicy *mpol; >> >> int nid, dnid; >> >> bool ret = true; >> >> >> >> mpol = __get_vma_policy(vma, addr); >> >> if (!mpol) { >> >> struct task_struct *task; >> >> if (vma->vm_mm) >> >> task = vma->vm_mm->owner; >> > >> > But this task may not be the task you want IIUC. For example, the >> > process has two threads, A and B. They have different mempolicy. The >> > vmscan is trying to demote a page belonging to thread A, but the task >> > may point to thread B, so you actually get the wrong mempolicy IIUC. 
>> > >> >> But if we swap out this page and fault back in via thread B the page would >> get allocated as per thread B mempolicy. So if we demote based on thread B >> policy are we breaking anything? > > If the page is demoted by following thread B's mempolicy, didn't it > already break thread A's mempolicy in the first place if you care > about it? If thread A and thread B have the same mempolicy, then it is > not a problem. > > Actually there is another problem for shared page. If a page is shared > by two processes, P1 and P2, when you do rmap walk to find the task, > you may find two contradict mempolicy, what mempolicy would you like > to obey? Do you have to save all the intermediate mempolicy results > somewhere or you just bail out once the first mempolicy is found? Yes. There's no perfect solution for this. I suggest avoiding demotion if any VMA (or task) prevents it, because allowing demotion is the default policy. And by default we will not promote the page back if it becomes hot later, because promotion only works with the default memory policy. Best Regards, Huang, Ying
Yang Shi <shy828301@gmail.com> writes: > On Thu, Oct 27, 2022 at 10:55 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Feng Tang <feng.tang@intel.com> writes: >> >> > On Thu, Oct 27, 2022 at 10:55:58AM -0700, Yang Shi wrote: >> >> On Thu, Oct 27, 2022 at 12:12 AM Feng Tang <feng.tang@intel.com> wrote: >> >> > >> >> > On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote: >> >> > > On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <mhocko@suse.com> wrote: >> >> > [...] >> >> > > > > > This all can get quite expensive so the primary question is, does the >> >> > > > > > existing behavior generates any real issues or is this more of an >> >> > > > > > correctness exercise? I mean it certainly is not great to demote to an >> >> > > > > > incompatible numa node but are there any reasonable configurations when >> >> > > > > > the demotion target node is explicitly excluded from memory >> >> > > > > > policy/cpuset? >> >> > > > > >> >> > > > > We haven't got customer report on this, but there are quite some customers >> >> > > > > use cpuset to bind some specific memory nodes to a docker (You've helped >> >> > > > > us solve a OOM issue in such cases), so I think it's practical to respect >> >> > > > > the cpuset semantics as much as we can. >> >> > > > >> >> > > > Yes, it is definitely better to respect cpusets and all local memory >> >> > > > policies. There is no dispute there. The thing is whether this is really >> >> > > > worth it. How often would cpusets (or policies in general) go actively >> >> > > > against demotion nodes (i.e. exclude those nodes from their allowes node >> >> > > > mask)? >> >> > > > >> >> > > > I can imagine workloads which wouldn't like to get their memory demoted >> >> > > > for some reason but wouldn't it be more practical to tell that >> >> > > > explicitly (e.g. via prctl) rather than configuring cpusets/memory >> >> > > > policies explicitly? >> >> > > > >> >> > > > > Your concern about the expensive cost makes sense! Some raw ideas are: >> >> > > > > * if the shrink_folio_list is called by kswapd, the folios come from >> >> > > > > the same per-memcg lruvec, so only one check is enough >> >> > > > > * if not from kswapd, like called form madvise or DAMON code, we can >> >> > > > > save a memcg cache, and if the next folio's memcg is same as the >> >> > > > > cache, we reuse its result. And due to the locality, the real >> >> > > > > check is rarely performed. >> >> > > > >> >> > > > memcg is not the expensive part of the thing. You need to get from page >> >> > > > -> all vmas::vm_policy -> mm -> task::mempolicy >> >> > > >> >> > > Yeah, on the same page with Michal. Figuring out mempolicy from page >> >> > > seems quite expensive and the correctness can't be guranteed since the >> >> > > mempolicy could be set per-thread and the mm->task depends on >> >> > > CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG. >> >> > >> >> > Yes, you are right. 
Our "working" psudo code for mem policy looks like >> >> > what Michal mentioned, and it can't work for all cases, but try to >> >> > enforce it whenever possible: >> >> > >> >> > static bool __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma, >> >> > unsigned long addr, void *arg) >> >> > { >> >> > bool *skip_demotion = arg; >> >> > struct mempolicy *mpol; >> >> > int nid, dnid; >> >> > bool ret = true; >> >> > >> >> > mpol = __get_vma_policy(vma, addr); >> >> > if (!mpol) { >> >> > struct task_struct *task; >> >> > if (vma->vm_mm) >> >> > task = vma->vm_mm->owner; >> >> >> >> But this task may not be the task you want IIUC. For example, the >> >> process has two threads, A and B. They have different mempolicy. The >> >> vmscan is trying to demote a page belonging to thread A, but the task >> >> may point to thread B, so you actually get the wrong mempolicy IIUC. >> > >> > Yes, this is a valid concern! We don't have good solution for this. >> > For memory policy, we may only handle the per-vma policy for now whose >> > cost is relatively low, as a best-effort try. >> >> Yes. The solution isn't perfect, especially for multiple-thread >> processes with thread specific memory policy. But the proposed code >> above can support the most common cases at least, that is, run workload >> with `numactl`. > > Not only multi threads, but also may be broken for shared pages. When > you do rmap walk, you may get multiple contradict mempolicy, which one > would you like to obey? > > TBH I'm not sure whether such half-baked solution is worth it or not, > at least at this moment. The cost is not cheap, but the gain may not > be worth it IMHO. Per my understanding, this can cover most cases. For example, run workload with `numactl`, or control the page placement of some memory areas via mbind(). Although there are some issue in the corner cases. Best Regards, Huang, Ying
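To make the `numactl`/mbind() case above concrete, here is a small userspace sketch (assuming libnuma's <numaif.h> wrapper, linked with -lnuma, and a system where node 0 is the fast DRAM node) that installs a per-VMA MPOL_BIND policy, which is exactly the kind of policy the best-effort per-vma check discussed here would honor:

#include <numaif.h>     /* mbind(), MPOL_BIND; link with -lnuma */
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
	size_t len = 64UL << 20;        /* 64 MB anonymous region */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Bind the region to node 0 only (assumed to be fast DRAM). */
	unsigned long nodemask = 1UL << 0;
	if (mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
		  MPOL_MF_STRICT)) {
		perror("mbind");
		return 1;
	}

	/* Touch the region so pages are actually allocated on node 0. */
	((volatile char *)buf)[0] = 1;
	return 0;
}

With such a binding in place the policy is attached to the VMA, so it is visible to __get_vma_policy() during an rmap walk, even though the per-task policies of individual threads remain out of reach.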
On Sat, Oct 29, 2022 at 01:23:53AM +0800, Yang Shi wrote: > On Thu, Oct 27, 2022 at 10:55 PM Huang, Ying <ying.huang@intel.com> wrote: > > > > Feng Tang <feng.tang@intel.com> writes: > > > > > On Thu, Oct 27, 2022 at 10:55:58AM -0700, Yang Shi wrote: > > >> On Thu, Oct 27, 2022 at 12:12 AM Feng Tang <feng.tang@intel.com> wrote: > > >> > > > >> > On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote: > > >> > > On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <mhocko@suse.com> wrote: > > >> > [...] > > >> > > > > > This all can get quite expensive so the primary question is, does the > > >> > > > > > existing behavior generates any real issues or is this more of an > > >> > > > > > correctness exercise? I mean it certainly is not great to demote to an > > >> > > > > > incompatible numa node but are there any reasonable configurations when > > >> > > > > > the demotion target node is explicitly excluded from memory > > >> > > > > > policy/cpuset? > > >> > > > > > > >> > > > > We haven't got customer report on this, but there are quite some customers > > >> > > > > use cpuset to bind some specific memory nodes to a docker (You've helped > > >> > > > > us solve a OOM issue in such cases), so I think it's practical to respect > > >> > > > > the cpuset semantics as much as we can. > > >> > > > > > >> > > > Yes, it is definitely better to respect cpusets and all local memory > > >> > > > policies. There is no dispute there. The thing is whether this is really > > >> > > > worth it. How often would cpusets (or policies in general) go actively > > >> > > > against demotion nodes (i.e. exclude those nodes from their allowes node > > >> > > > mask)? > > >> > > > > > >> > > > I can imagine workloads which wouldn't like to get their memory demoted > > >> > > > for some reason but wouldn't it be more practical to tell that > > >> > > > explicitly (e.g. via prctl) rather than configuring cpusets/memory > > >> > > > policies explicitly? > > >> > > > > > >> > > > > Your concern about the expensive cost makes sense! Some raw ideas are: > > >> > > > > * if the shrink_folio_list is called by kswapd, the folios come from > > >> > > > > the same per-memcg lruvec, so only one check is enough > > >> > > > > * if not from kswapd, like called form madvise or DAMON code, we can > > >> > > > > save a memcg cache, and if the next folio's memcg is same as the > > >> > > > > cache, we reuse its result. And due to the locality, the real > > >> > > > > check is rarely performed. > > >> > > > > > >> > > > memcg is not the expensive part of the thing. You need to get from page > > >> > > > -> all vmas::vm_policy -> mm -> task::mempolicy > > >> > > > > >> > > Yeah, on the same page with Michal. Figuring out mempolicy from page > > >> > > seems quite expensive and the correctness can't be guranteed since the > > >> > > mempolicy could be set per-thread and the mm->task depends on > > >> > > CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG. > > >> > > > >> > Yes, you are right. 
Our "working" psudo code for mem policy looks like > > >> > what Michal mentioned, and it can't work for all cases, but try to > > >> > enforce it whenever possible: > > >> > > > >> > static bool __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma, > > >> > unsigned long addr, void *arg) > > >> > { > > >> > bool *skip_demotion = arg; > > >> > struct mempolicy *mpol; > > >> > int nid, dnid; > > >> > bool ret = true; > > >> > > > >> > mpol = __get_vma_policy(vma, addr); > > >> > if (!mpol) { > > >> > struct task_struct *task; > > >> > if (vma->vm_mm) > > >> > task = vma->vm_mm->owner; > > >> > > >> But this task may not be the task you want IIUC. For example, the > > >> process has two threads, A and B. They have different mempolicy. The > > >> vmscan is trying to demote a page belonging to thread A, but the task > > >> may point to thread B, so you actually get the wrong mempolicy IIUC. > > > > > > Yes, this is a valid concern! We don't have good solution for this. > > > For memory policy, we may only handle the per-vma policy for now whose > > > cost is relatively low, as a best-effort try. > > > > Yes. The solution isn't perfect, especially for multiple-thread > > processes with thread specific memory policy. But the proposed code > > above can support the most common cases at least, that is, run workload > > with `numactl`. > > Not only multi threads, but also may be broken for shared pages. When > you do rmap walk, you may get multiple contradict mempolicy, which one > would you like to obey? In our test code, it follows the stricter policy, that if the rmap walk meets a mempolicy disallowing the demotion, it will stop the walk and return with 'skip_demotion' flag set. Thanks, Feng
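A sketch (illustrative only, not the test code referred to above) of how that stop-at-first-veto walk can be wired up with the existing rmap machinery, reusing a per-VMA callback in the style of __check_mpol_demotion():

static bool folio_demotion_allowed(struct folio *folio)
{
	bool skip_demotion = false;
	struct rmap_walk_control rwc = {
		/* Returning false from rmap_one terminates the walk early. */
		.rmap_one	= __check_mpol_demotion,
		.arg		= &skip_demotion,
		.anon_lock	= folio_lock_anon_vma_read,
	};

	/* Unmapped folios have no VMA policy to consult. */
	if (!folio_mapped(folio))
		return true;

	rmap_walk(folio, &rwc);

	return !skip_demotion;
}

shrink_folio_list() could then consult this before queueing a folio for demotion, at the price of one rmap walk per candidate folio, which is exactly the cost concern raised earlier in the thread.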
On Fri 28-10-22 07:22:27, Huang, Ying wrote: > Michal Hocko <mhocko@suse.com> writes: > > > On Thu 27-10-22 17:31:35, Huang, Ying wrote: [...] > >> I think that it's possible for different processes have different > >> requirements. > >> > >> - Some processes don't care about where the memory is placed, prefer > >> local, then fall back to remote if no free space. > >> > >> - Some processes want to avoid cross-socket traffic, bind to nodes of > >> local socket. > >> > >> - Some processes want to avoid to use slow memory, bind to fast memory > >> node only. > > > > Yes, I do understand that. Do you have any specific examples in mind? > > [...] > > Sorry, I don't have specific examples. OK, then let's stop any complicated solution right here. Let's start simple with a per-mm flag to disable demotion of an address space. Should there ever be real demand for a more fine-grained solution, let's go further, but I do not think we want a half-baked solution without real use cases.
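For reference, a per-mm opt-out could look roughly like the hypothetical sketch below, mirroring how PR_SET_THP_DISABLE manages the MMF_DISABLE_THP bit; the flag name, bit number and prctl plumbing are all made up here and do not exist upstream:

/* Hypothetical bit in mm->flags; the number is made up for this sketch. */
#define MMF_NO_DEMOTE		31

static int prctl_set_demotion_disable(struct mm_struct *mm, unsigned long disable)
{
	if (mmap_write_lock_killable(mm))
		return -EINTR;
	if (disable)
		set_bit(MMF_NO_DEMOTE, &mm->flags);
	else
		clear_bit(MMF_NO_DEMOTE, &mm->flags);
	mmap_write_unlock(mm);
	return 0;
}

/* Cheap test the demotion path could apply to each mapping of a folio. */
static inline bool mm_demotion_disabled(struct mm_struct *mm)
{
	return mm && test_bit(MMF_NO_DEMOTE, &mm->flags);
}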
Michal Hocko <mhocko@suse.com> writes: > On Fri 28-10-22 07:22:27, Huang, Ying wrote: >> Michal Hocko <mhocko@suse.com> writes: >> >> > On Thu 27-10-22 17:31:35, Huang, Ying wrote: > [...] >> >> I think that it's possible for different processes have different >> >> requirements. >> >> >> >> - Some processes don't care about where the memory is placed, prefer >> >> local, then fall back to remote if no free space. >> >> >> >> - Some processes want to avoid cross-socket traffic, bind to nodes of >> >> local socket. >> >> >> >> - Some processes want to avoid to use slow memory, bind to fast memory >> >> node only. >> > >> > Yes, I do understand that. Do you have any specific examples in mind? >> > [...] >> >> Sorry, I don't have specific examples. > > OK, then let's stop any complicated solution right here then. Let's > start simple with a per-mm flag to disable demotion of an address > space. I'm not a big fan of a per-mm flag, because we don't have users for that either, and it would need to add new ABI too. > Should there ever be a real demand for a more fine grained solution > let's go further but I do not think we want a half baked solution > without real usecases. I'm OK with ignoring per-task (and the missing per-process) memory policy support for now. Best Regards, Huang, Ying
On Mon 31-10-22 16:51:11, Huang, Ying wrote: > Michal Hocko <mhocko@suse.com> writes: > > > On Fri 28-10-22 07:22:27, Huang, Ying wrote: > >> Michal Hocko <mhocko@suse.com> writes: > >> > >> > On Thu 27-10-22 17:31:35, Huang, Ying wrote: > > [...] > >> >> I think that it's possible for different processes have different > >> >> requirements. > >> >> > >> >> - Some processes don't care about where the memory is placed, prefer > >> >> local, then fall back to remote if no free space. > >> >> > >> >> - Some processes want to avoid cross-socket traffic, bind to nodes of > >> >> local socket. > >> >> > >> >> - Some processes want to avoid to use slow memory, bind to fast memory > >> >> node only. > >> > > >> > Yes, I do understand that. Do you have any specific examples in mind? > >> > [...] > >> > >> Sorry, I don't have specific examples. > > > > OK, then let's stop any complicated solution right here then. Let's > > start simple with a per-mm flag to disable demotion of an address > > space. > > I'm not a big fan of per-mm flag. Because we don't have users for that > too and it needs to add ABI too. OK, if there are no users for opt-out then let's just document the current limitations and be done with it. > > Should there ever be a real demand for a more fine grained solution > > let's go further but I do not think we want a half baked solution > > without real usecases. > > I'm OK to ignore per-task (and missing per-process) memory policy > support for now. I am against such a half-baked solution.
On Mon, Oct 31, 2022 at 04:40:15PM +0800, Michal Hocko wrote: > On Fri 28-10-22 07:22:27, Huang, Ying wrote: > > Michal Hocko <mhocko@suse.com> writes: > > > > > On Thu 27-10-22 17:31:35, Huang, Ying wrote: > [...] > > >> I think that it's possible for different processes have different > > >> requirements. > > >> > > >> - Some processes don't care about where the memory is placed, prefer > > >> local, then fall back to remote if no free space. > > >> > > >> - Some processes want to avoid cross-socket traffic, bind to nodes of > > >> local socket. > > >> > > >> - Some processes want to avoid to use slow memory, bind to fast memory > > >> node only. > > > > > > Yes, I do understand that. Do you have any specific examples in mind? > > > [...] > > > > Sorry, I don't have specific examples. > > OK, then let's stop any complicated solution right here then. Let's > start simple with a per-mm flag to disable demotion of an address space. > Should there ever be a real demand for a more fine grained solution > let's go further but I do not think we want a half baked solution > without real usecases. Yes, the concern about the high cost for mempolicy from you and Yang is valid. How about the cpuset part? We've got bug reports from different channels about using cpuset+docker to control memory placement on memory tiering systems, leading to 2 commits solving them: 2685027fca38 ("cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in cpuset_init_smp()") https://lore.kernel.org/all/20220419020958.40419-1-feng.tang@intel.com/ 8ca1b5a49885 ("mm/page_alloc: detect allocation forbidden by cpuset and bail out early") https://lore.kernel.org/all/1632481657-68112-1-git-send-email-feng.tang@intel.com/ From these bug reports, I think it's reasonable to say there are quite some real-world users combining cpuset, docker and memory tiering systems. So I plan to refine the original cpuset patch with some of the optimizations discussed (like checking once per kswapd-based shrink_folio_list() call). Thanks, Feng > -- > Michal Hocko > SUSE Labs >
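A rough sketch of the "check once for kswapd" optimization mentioned above. The names and placement are illustrative; it leans on the assumption stated earlier in the thread that kswapd hands shrink_folio_list() folios from a single per-memcg lruvec, and on the cpuset_get_allowed_mem_nodes()/node_get_allowed_targets() helpers from the patch:

static bool demotion_allowed_for_batch(struct lruvec *lruvec, pg_data_t *pgdat)
{
	nodemask_t allowed, cpuset_mems;
	struct mem_cgroup *memcg = lruvec_memcg(lruvec);

	/* No memcg (e.g. !CONFIG_MEMCG): nothing to restrict against. */
	if (!memcg)
		return true;

	node_get_allowed_targets(pgdat, &allowed);
	cpuset_get_allowed_mem_nodes(memcg->css.cgroup, &cpuset_mems);

	/* Demotion stays enabled only if some demotion target is in the cpuset. */
	return nodes_intersects(allowed, cpuset_mems);
}

For the kswapd case the result would be computed once before the folio loop and reused for the whole batch, instead of doing the per-folio check the posted patch performs in shrink_folio_list().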
On Mon 31-10-22 22:09:15, Feng Tang wrote: > On Mon, Oct 31, 2022 at 04:40:15PM +0800, Michal Hocko wrote: > > On Fri 28-10-22 07:22:27, Huang, Ying wrote: > > > Michal Hocko <mhocko@suse.com> writes: > > > > > > > On Thu 27-10-22 17:31:35, Huang, Ying wrote: > > [...] > > > >> I think that it's possible for different processes have different > > > >> requirements. > > > >> > > > >> - Some processes don't care about where the memory is placed, prefer > > > >> local, then fall back to remote if no free space. > > > >> > > > >> - Some processes want to avoid cross-socket traffic, bind to nodes of > > > >> local socket. > > > >> > > > >> - Some processes want to avoid to use slow memory, bind to fast memory > > > >> node only. > > > > > > > > Yes, I do understand that. Do you have any specific examples in mind? > > > > [...] > > > > > > Sorry, I don't have specific examples. > > > > OK, then let's stop any complicated solution right here then. Let's > > start simple with a per-mm flag to disable demotion of an address space. > > Should there ever be a real demand for a more fine grained solution > > let's go further but I do not think we want a half baked solution > > without real usecases. > > Yes, the concern about the high cost for mempolicy from you and Yang is > valid. > > How about the cpuset part? Cpusets fall into the same bucket as per-task mempolicies wrt costs. Getting a cpuset requires knowing all tasks associated with a page. Or am I just missing some magic? And no, a memcg->cpuset association is not a proper solution at all. > We've got bug reports from different channels > about using cpuset+docker to control meomry placement in memory tiering > system, leading to 2 commits solving them: > > 2685027fca38 ("cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in > cpuset_init_smp()") > https://lore.kernel.org/all/20220419020958.40419-1-feng.tang@intel.com/ > > 8ca1b5a49885 ("mm/page_alloc: detect allocation forbidden by cpuset and > bail out early") > https://lore.kernel.org/all/1632481657-68112-1-git-send-email-feng.tang@intel.com/ > > From these bug reports, I think it's reasonable to say there are quite > some real world users using cpuset+docker+memory-tiering-system. I don't think anybody is questioning the existence of those use cases. The primary question is whether any of them really require any non-trivial (read: nodemask-aware) demotion policies. In other words, do we know of cpuset policy setups where demotion fallbacks are (partially) excluded?
"Huang, Ying" <ying.huang@intel.com> writes: > Michal Hocko <mhocko@suse.com> writes: > >> On Thu 27-10-22 15:39:00, Huang, Ying wrote: >>> Michal Hocko <mhocko@suse.com> writes: >>> >>> > On Thu 27-10-22 14:47:22, Huang, Ying wrote: >>> >> Michal Hocko <mhocko@suse.com> writes: >>> > [...] >>> >> > I can imagine workloads which wouldn't like to get their memory demoted >>> >> > for some reason but wouldn't it be more practical to tell that >>> >> > explicitly (e.g. via prctl) rather than configuring cpusets/memory >>> >> > policies explicitly? >>> >> >>> >> If my understanding were correct, prctl() configures the process or >>> >> thread. >>> > >>> > Not necessarily. There are properties which are per adddress space like >>> > PR_[GS]ET_THP_DISABLE. This could be very similar. >>> > >>> >> How can we get process/thread configuration at demotion time? >>> > >>> > As already pointed out in previous emails. You could hook into >>> > folio_check_references path, more specifically folio_referenced_one >>> > where you have all that you need already - all vmas mapping the page and >>> > then it is trivial to get the corresponding vm_mm. If at least one of >>> > them has the flag set then the demotion is not allowed (essentially the >>> > same model as VM_LOCKED). >>> >>> Got it! Thanks for detailed explanation. >>> >>> One bit may be not sufficient. For example, if we want to avoid or >>> control cross-socket demotion and still allow demoting to slow memory >>> nodes in local socket, we need to specify a node mask to exclude some >>> NUMA nodes from demotion targets. >> >> Isn't this something to be configured on the demotion topology side? Or >> do you expect there will be per process/address space usecases? I mean >> different processes running on the same topology, one requesting local >> demotion while other ok with the whole demotion topology? > > I think that it's possible for different processes have different > requirements. > > - Some processes don't care about where the memory is placed, prefer > local, then fall back to remote if no free space. > > - Some processes want to avoid cross-socket traffic, bind to nodes of > local socket. > > - Some processes want to avoid to use slow memory, bind to fast memory > node only. Hi, Johannes, Wei, Jonathan, Yang, Aneesh, We need your help. Do you or your organization have requirements to restrict the page demotion target nodes? If so, can you share some details of the requirements? For example, to avoid cross-socket traffic, or to avoid using slow memory. And do you want to restrict that with cpusets, memory policy, or some other interfaces. Best Regards, Huang, Ying
On Mon, Oct 31, 2022 at 03:32:34PM +0100, Michal Hocko wrote: > > > OK, then let's stop any complicated solution right here then. Let's > > > start simple with a per-mm flag to disable demotion of an address space. > > > Should there ever be a real demand for a more fine grained solution > > > let's go further but I do not think we want a half baked solution > > > without real usecases. > > > > Yes, the concern about the high cost for mempolicy from you and Yang is > > valid. > > > > How about the cpuset part? > > Cpusets fall into the same bucket as per task mempolicies wrt costs. Geting a > cpuset requires knowing all tasks associated with a page. Or am I just > missing any magic? And no memcg->cpuset association is not a proper > solution at all. No, you are not missing anything. It's really difficult to find a solution that covers all holes. And the patch is actually a best-effort approach, trying to cover the cgroup v2 + memory controller enabled case, which we think is a common use case for newer platforms with tiering memory. > > We've got bug reports from different channels > > about using cpuset+docker to control meomry placement in memory tiering > > system, leading to 2 commits solving them: > > > > 2685027fca38 ("cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in > > cpuset_init_smp()") > > https://lore.kernel.org/all/20220419020958.40419-1-feng.tang@intel.com/ > > > > 8ca1b5a49885 ("mm/page_alloc: detect allocation forbidden by cpuset and > > bail out early") > > https://lore.kernel.org/all/1632481657-68112-1-git-send-email-feng.tang@intel.com/ > > > > From these bug reports, I think it's reasonable to say there are quite > > some real world users using cpuset+docker+memory-tiering-system. > > I don't think anybody is questioning existence of those usecases. The > primary question is whether any of them really require any non-trivial > (read nodemask aware) demotion policies. In other words do we know of > cpuset policy setups where demotion fallbacks are (partially) excluded? For cpuset NUMA memory binding, there are possible use cases: * The user wants cpuset to bind some important containers to faster memory tiers for better latency/performance (where simply disabling demotion should work, like your per-mm flag solution). * The user wants to bind to a set of physically closer nodes (like a faster CPU+DRAM node and a slower PMEM node). With the initial demotion code, our HW will have a 1:1 demotion/promotion pair for one DRAM node and its closer PMEM node, and the user's binding can work fine. And there are many other types of memory tiering systems from other vendors, like many CPU-less DRAM nodes in a system, and Aneesh's patchset[1] created a more general tiering interface, where IIUC each tier has a nodemask and an upper tier can demote to the whole lower tier, so the demotion path is an N:N mapping. For this, fine-tuning cpuset node binding needs this handling. [1]. https://lore.kernel.org/lkml/20220818131042.113280-1-aneesh.kumar@linux.ibm.com/ Thanks, Feng > -- > Michal Hocko > SUSE Labs
On Mon 07-11-22 16:05:37, Feng Tang wrote: > On Mon, Oct 31, 2022 at 03:32:34PM +0100, Michal Hocko wrote: > > > > OK, then let's stop any complicated solution right here then. Let's > > > > start simple with a per-mm flag to disable demotion of an address space. > > > > Should there ever be a real demand for a more fine grained solution > > > > let's go further but I do not think we want a half baked solution > > > > without real usecases. > > > > > > Yes, the concern about the high cost for mempolicy from you and Yang is > > > valid. > > > > > > How about the cpuset part? > > > > Cpusets fall into the same bucket as per task mempolicies wrt costs. Geting a > > cpuset requires knowing all tasks associated with a page. Or am I just > > missing any magic? And no memcg->cpuset association is not a proper > > solution at all. > > No, you are not missing anything. It's really difficult to find a > solution for all holes. And the patch is actually a best-efforts > approach, trying to cover cgroup v2 + memory controller enabled case, > which we think is a common user case for newer platforms with tiering > memory. Best effort is OK, but it shouldn't create unexpected behavior, and this approach does that. I thought I had already explained that, but let me be more explicit this time. Please have a look at how controllers can be enabled/disabled at different levels of the hierarchy. Unless memcg grows a hard dependency on another controller (as it does with the blk io controller), this approach can point to the wrong cpuset. See my point? Really, a solution for this is not going to be cheap, and I am not sure all the hassle is really worth it until there is a clear use case in sight.
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index d58e0476ee8e..6fcce2bd2631 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -178,6 +178,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 	task_unlock(current);
 }
 
+extern void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup,
+					 nodemask_t *nmask);
 #else /* !CONFIG_CPUSETS */
 
 static inline bool cpusets_enabled(void) { return false; }
@@ -299,6 +301,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
 	return false;
 }
 
+static inline void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup,
+						nodemask_t *nmask)
+{
+}
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 3ea2e836e93e..cbb118c0502f 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3750,6 +3750,35 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
 	return mask;
 }
 
+/*
+ * Retrieve the allowed memory nodemask for a cgroup.
+ *
+ * Set *nmask to cpuset's effective allowed nodemask for cgroup v2,
+ * and NODE_MASK_ALL (means no constraint) for cgroup v1 where there
+ * is no guaranteed association from a cgroup to a cpuset.
+ */
+void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, nodemask_t *nmask)
+{
+	struct cgroup_subsys_state *css;
+	struct cpuset *cs;
+
+	if (!is_in_v2_mode()) {
+		*nmask = NODE_MASK_ALL;
+		return;
+	}
+
+	rcu_read_lock();
+	css = cgroup_e_css(cgroup, &cpuset_cgrp_subsys);
+	if (css) {
+		css_get(css);
+		cs = css_cs(css);
+		*nmask = cs->effective_mems;
+		css_put(css);
+	}
+
+	rcu_read_unlock();
+}
+
 /**
  * cpuset_nodemask_valid_mems_allowed - check nodemask vs. current mems_allowed
  * @nodemask: the nodemask to be checked
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 18f6497994ec..c205d98283bc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1537,9 +1537,21 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private)
 {
 	struct page *target_page;
 	nodemask_t *allowed_mask;
-	struct migration_target_control *mtc;
+	struct migration_target_control *mtc = (void *)private;
 
-	mtc = (struct migration_target_control *)private;
+#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
+	struct mem_cgroup *memcg;
+	nodemask_t cpuset_nmask;
+
+	memcg = page_memcg(page);
+	cpuset_get_allowed_mem_nodes(memcg->css.cgroup, &cpuset_nmask);
+
+	if (!node_isset(mtc->nid, cpuset_nmask)) {
+		if (mtc->nmask)
+			nodes_and(*mtc->nmask, *mtc->nmask, cpuset_nmask);
+		return alloc_migration_target(page, (unsigned long)mtc);
+	}
+#endif
 	allowed_mask = mtc->nmask;
 	/*
@@ -1649,6 +1661,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		enum folio_references references = FOLIOREF_RECLAIM;
 		bool dirty, writeback;
 		unsigned int nr_pages;
+		bool skip_this_demotion = false;
 
 		cond_resched();
 
@@ -1658,6 +1671,22 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		if (!folio_trylock(folio))
 			goto keep;
 
+#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
+		if (do_demote_pass) {
+			struct mem_cgroup *memcg;
+			nodemask_t nmask, nmask1;
+
+			node_get_allowed_targets(pgdat, &nmask);
+			memcg = folio_memcg(folio);
+			if (memcg)
+				cpuset_get_allowed_mem_nodes(memcg->css.cgroup,
+							     &nmask1);
+
+			if (!nodes_intersects(nmask, nmask1))
+				skip_this_demotion = true;
+		}
+#endif
+
 		VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
 
 		nr_pages = folio_nr_pages(folio);
@@ -1799,7 +1828,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		 * Before reclaiming the folio, try to relocate
 		 * its contents to another node.
 		 */
-		if (do_demote_pass &&
+		if (do_demote_pass && !skip_this_demotion &&
 		    (thp_migration_supported() || !folio_test_large(folio))) {
 			list_add(&folio->lru, &demote_folios);
 			folio_unlock(folio);