diff mbox series

mm/vmscan: respect cpuset policy during page demotion

Message ID	20221026074343.6517-1-feng.tang@intel.com
State	New
Headers	Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; From: Feng Tang <feng.tang@intel.com> To: Andrew Morton <akpm@linux-foundation.org>, Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@suse.com>, Tejun Heo <tj@kernel.org>, Zefan Li <lizefan.x@bytedance.com>, Waiman Long <longman@redhat.com>, ying.huang@intel.com, aneesh.kumar@linux.ibm.com, linux-mm@kvack.org, cgroups@vger.kernel.org Cc: linux-kernel@vger.kernel.org, dave.hansen@intel.com, tim.c.chen@intel.com, fengwei.yin@intel.com, Feng Tang <feng.tang@intel.com> Subject: [PATCH] mm/vmscan: respect cpuset policy during page demotion Date: Wed, 26 Oct 2022 15:43:43 +0800 Message-Id: <20221026074343.6517-1-feng.tang@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	mm/vmscan: respect cpuset policy during page demotion \| mm/vmscan: respect cpuset policy during page demotion

Commit Message

Feng Tang Oct. 26, 2022, 7:43 a.m. UTC

  In page reclaim path, memory could be demoted from faster memory tier
to slower memory tier. Currently, there is no check about cpuset's
memory policy, that even if the target demotion node is not allowd
by cpuset, the demotion will still happen, which breaks the cpuset
semantics.

So add cpuset policy check in the demotion path and skip demotion
if the demotion targets are not allowed by cpuset.

Signed-off-by: Feng Tang <feng.tang@intel.com>
---
Hi reviewers,

For easy bisectable, I combined the cpuset change and mm change
in one patch, if you prefer to separate them, I can turn it into
2 patches.

Thanks,
Feng

 include/linux/cpuset.h |  6 ++++++
 kernel/cgroup/cpuset.c | 29 +++++++++++++++++++++++++++++
 mm/vmscan.c            | 35 ++++++++++++++++++++++++++++++++---
 3 files changed, 67 insertions(+), 3 deletions(-)

Comments

Aneesh Kumar K.V Oct. 26, 2022, 7:49 a.m. UTC | #1

On 10/26/22 1:13 PM, Feng Tang wrote:
> In page reclaim path, memory could be demoted from faster memory tier
> to slower memory tier. Currently, there is no check about cpuset's
> memory policy, that even if the target demotion node is not allowd
> by cpuset, the demotion will still happen, which breaks the cpuset
> semantics.
> 
> So add cpuset policy check in the demotion path and skip demotion
> if the demotion targets are not allowed by cpuset.
> 

What about the vma policy or the task memory policy? Shouldn't we respect
those memory policy restrictions while demoting the page? 

-aneesh

Feng Tang Oct. 26, 2022, 8 a.m. UTC | #2

On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote:
> On 10/26/22 1:13 PM, Feng Tang wrote:
> > In page reclaim path, memory could be demoted from faster memory tier
> > to slower memory tier. Currently, there is no check about cpuset's
> > memory policy, that even if the target demotion node is not allowd
> > by cpuset, the demotion will still happen, which breaks the cpuset
> > semantics.
> > 
> > So add cpuset policy check in the demotion path and skip demotion
> > if the demotion targets are not allowed by cpuset.
> > 
> 
> What about the vma policy or the task memory policy? Shouldn't we respect
> those memory policy restrictions while demoting the page? 
 
Good question! We have some basic patches to consider memory policy
in demotion path too, which are still under test, and will be posted
soon. And the basic idea is similar to this patch.

Thanks,
Feng

> -aneesh

Yin Fengwei Oct. 26, 2022, 8:26 a.m. UTC | #3

Hi Feng,

On 10/26/2022 3:43 PM, Feng Tang wrote:
> In page reclaim path, memory could be demoted from faster memory tier
> to slower memory tier. Currently, there is no check about cpuset's
> memory policy, that even if the target demotion node is not allowd
> by cpuset, the demotion will still happen, which breaks the cpuset
> semantics.
> 
> So add cpuset policy check in the demotion path and skip demotion
> if the demotion targets are not allowed by cpuset.
> 
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> ---
> Hi reviewers,
> 
> For easy bisectable, I combined the cpuset change and mm change
> in one patch, if you prefer to separate them, I can turn it into
> 2 patches.
> 
> Thanks,
> Feng
> 
>  include/linux/cpuset.h |  6 ++++++
>  kernel/cgroup/cpuset.c | 29 +++++++++++++++++++++++++++++
>  mm/vmscan.c            | 35 ++++++++++++++++++++++++++++++++---
>  3 files changed, 67 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index d58e0476ee8e..6fcce2bd2631 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -178,6 +178,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
>  	task_unlock(current);
>  }
>  
> +extern void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup,
> +						nodemask_t *nmask);
>  #else /* !CONFIG_CPUSETS */
>  
>  static inline bool cpusets_enabled(void) { return false; }
> @@ -299,6 +301,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
>  	return false;
>  }
>  
> +static inline void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup,
> +						nodemask_t *nmask)
> +{
> +}
>  #endif /* !CONFIG_CPUSETS */
>  
>  #endif /* _LINUX_CPUSET_H */
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 3ea2e836e93e..cbb118c0502f 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -3750,6 +3750,35 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
>  	return mask;
>  }
>  
> +/*
> + * Retrieve the allowed memory nodemask for a cgroup.
> + *
> + * Set *nmask to cpuset's effective allowed nodemask for cgroup v2,
> + * and NODE_MASK_ALL (means no constraint) for cgroup v1 where there
> + * is no guaranteed association from a cgroup to a cpuset.
> + */
> +void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, nodemask_t *nmask)
> +{
> +	struct cgroup_subsys_state *css;
> +	struct cpuset *cs;
> +
> +	if (!is_in_v2_mode()) {
> +		*nmask = NODE_MASK_ALL;
> +		return;
> +	}
> +
> +	rcu_read_lock();
> +	css = cgroup_e_css(cgroup, &cpuset_cgrp_subsys);
> +	if (css) {
> +		css_get(css);
> +		cs = css_cs(css);
> +		*nmask = cs->effective_mems;
> +		css_put(css);
> +	}
> +
> +	rcu_read_unlock();
> +}
> +
>  /**
>   * cpuset_nodemask_valid_mems_allowed - check nodemask vs. current mems_allowed
>   * @nodemask: the nodemask to be checked
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 18f6497994ec..c205d98283bc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1537,9 +1537,21 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private)
>  {
>  	struct page *target_page;
>  	nodemask_t *allowed_mask;
> -	struct migration_target_control *mtc;
> +	struct migration_target_control *mtc = (void *)private;
>  
> -	mtc = (struct migration_target_control *)private;
> +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
> +	struct mem_cgroup *memcg;
> +	nodemask_t cpuset_nmask;
> +
> +	memcg = page_memcg(page);
> +	cpuset_get_allowed_mem_nodes(memcg->css.cgroup, &cpuset_nmask);
> +
> +	if (!node_isset(mtc->nid, cpuset_nmask)) {
> +		if (mtc->nmask)
> +			nodes_and(*mtc->nmask, *mtc->nmask, cpuset_nmask);
> +		return alloc_migration_target(page, (unsigned long)mtc);
> +	}
> +#endif
>  
>  	allowed_mask = mtc->nmask;
>  	/*
> @@ -1649,6 +1661,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  		enum folio_references references = FOLIOREF_RECLAIM;
>  		bool dirty, writeback;
>  		unsigned int nr_pages;
> +		bool skip_this_demotion = false;
>  
>  		cond_resched();
>  
> @@ -1658,6 +1671,22 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  		if (!folio_trylock(folio))
>  			goto keep;
>  
> +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
> +		if (do_demote_pass) {
> +			struct mem_cgroup *memcg;
> +			nodemask_t nmask, nmask1;
> +
> +			node_get_allowed_targets(pgdat, &nmask);
> +			memcg = folio_memcg(folio);
> +			if (memcg)
Doesn't check memcg in the change to alloc_demote_page(). What's the difference here?

> +				cpuset_get_allowed_mem_nodes(memcg->css.cgroup,
> +								&nmask1);
> +
> +			if (!nodes_intersects(nmask, nmask1))
If memcg is NULL, nmask1 will have an uninitialized value. Thanks.

Regards
Yin, Fengwei

> +				skip_this_demotion = true;
> +		}
> +#endif
> +
>  		VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
>  
>  		nr_pages = folio_nr_pages(folio);
> @@ -1799,7 +1828,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  		 * Before reclaiming the folio, try to relocate
>  		 * its contents to another node.
>  		 */
> -		if (do_demote_pass &&
> +		if (do_demote_pass && !skip_this_demotion &&
>  		    (thp_migration_supported() || !folio_test_large(folio))) {
>  			list_add(&folio->lru, &demote_folios);
>  			folio_unlock(folio);

Feng Tang Oct. 26, 2022, 8:37 a.m. UTC | #4

On Wed, Oct 26, 2022 at 04:26:28PM +0800, Yin, Fengwei wrote:
> Hi Feng,
> 
> On 10/26/2022 3:43 PM, Feng Tang wrote:
> > In page reclaim path, memory could be demoted from faster memory tier
> > to slower memory tier. Currently, there is no check about cpuset's
> > memory policy, that even if the target demotion node is not allowd
> > by cpuset, the demotion will still happen, which breaks the cpuset
> > semantics.
> > 
> > So add cpuset policy check in the demotion path and skip demotion
> > if the demotion targets are not allowed by cpuset.
> > 
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > ---
> > Hi reviewers,
> > 
> > For easy bisectable, I combined the cpuset change and mm change
> > in one patch, if you prefer to separate them, I can turn it into
> > 2 patches.
> > 
> > Thanks,
> > Feng
> > 
> >  include/linux/cpuset.h |  6 ++++++
> >  kernel/cgroup/cpuset.c | 29 +++++++++++++++++++++++++++++
> >  mm/vmscan.c            | 35 ++++++++++++++++++++++++++++++++---
> >  3 files changed, 67 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> > index d58e0476ee8e..6fcce2bd2631 100644
> > --- a/include/linux/cpuset.h
> > +++ b/include/linux/cpuset.h
> > @@ -178,6 +178,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
> >  	task_unlock(current);
> >  }
> >  
> > +extern void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup,
> > +						nodemask_t *nmask);
> >  #else /* !CONFIG_CPUSETS */
> >  
> >  static inline bool cpusets_enabled(void) { return false; }
> > @@ -299,6 +301,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
> >  	return false;
> >  }
> >  
> > +static inline void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup,
> > +						nodemask_t *nmask)
> > +{
> > +}
> >  #endif /* !CONFIG_CPUSETS */
> >  
> >  #endif /* _LINUX_CPUSET_H */
> > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > index 3ea2e836e93e..cbb118c0502f 100644
> > --- a/kernel/cgroup/cpuset.c
> > +++ b/kernel/cgroup/cpuset.c
> > @@ -3750,6 +3750,35 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
> >  	return mask;
> >  }
> >  
> > +/*
> > + * Retrieve the allowed memory nodemask for a cgroup.
> > + *
> > + * Set *nmask to cpuset's effective allowed nodemask for cgroup v2,
> > + * and NODE_MASK_ALL (means no constraint) for cgroup v1 where there
> > + * is no guaranteed association from a cgroup to a cpuset.
> > + */
> > +void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, nodemask_t *nmask)
> > +{
> > +	struct cgroup_subsys_state *css;
> > +	struct cpuset *cs;
> > +
> > +	if (!is_in_v2_mode()) {
> > +		*nmask = NODE_MASK_ALL;
> > +		return;
> > +	}
> > +
> > +	rcu_read_lock();
> > +	css = cgroup_e_css(cgroup, &cpuset_cgrp_subsys);
> > +	if (css) {
> > +		css_get(css);
> > +		cs = css_cs(css);
> > +		*nmask = cs->effective_mems;
> > +		css_put(css);
> > +	}
> > +
> > +	rcu_read_unlock();
> > +}
> > +
> >  /**
> >   * cpuset_nodemask_valid_mems_allowed - check nodemask vs. current mems_allowed
> >   * @nodemask: the nodemask to be checked
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 18f6497994ec..c205d98283bc 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1537,9 +1537,21 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private)
> >  {
> >  	struct page *target_page;
> >  	nodemask_t *allowed_mask;
> > -	struct migration_target_control *mtc;
> > +	struct migration_target_control *mtc = (void *)private;
> >  
> > -	mtc = (struct migration_target_control *)private;
> > +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
> > +	struct mem_cgroup *memcg;
> > +	nodemask_t cpuset_nmask;
> > +
> > +	memcg = page_memcg(page);
> > +	cpuset_get_allowed_mem_nodes(memcg->css.cgroup, &cpuset_nmask);
> > +
> > +	if (!node_isset(mtc->nid, cpuset_nmask)) {
> > +		if (mtc->nmask)
> > +			nodes_and(*mtc->nmask, *mtc->nmask, cpuset_nmask);
> > +		return alloc_migration_target(page, (unsigned long)mtc);
> > +	}
> > +#endif
> >  
> >  	allowed_mask = mtc->nmask;
> >  	/*
> > @@ -1649,6 +1661,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >  		enum folio_references references = FOLIOREF_RECLAIM;
> >  		bool dirty, writeback;
> >  		unsigned int nr_pages;
> > +		bool skip_this_demotion = false;
> >  
> >  		cond_resched();
> >  
> > @@ -1658,6 +1671,22 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >  		if (!folio_trylock(folio))
> >  			goto keep;
> >  
> > +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
> > +		if (do_demote_pass) {
> > +			struct mem_cgroup *memcg;
> > +			nodemask_t nmask, nmask1;
> > +
> > +			node_get_allowed_targets(pgdat, &nmask);
> > +			memcg = folio_memcg(folio);
> > +			if (memcg)
> Doesn't check memcg in the change to alloc_demote_page(). What's the difference here?
> 
> > +				cpuset_get_allowed_mem_nodes(memcg->css.cgroup,
> > +								&nmask1);
> > +
> > +			if (!nodes_intersects(nmask, nmask1))
> If memcg is NULL, nmask1 will have an uninitialized value. Thanks.
 
Good catch! Yes, it should be initialized to NODE_MASK_ALL.

Actually I was not sure if I need to check 'memcg == NULL' case, while
I was under the impression that for page on LRU list, it's memcg is
either a specific memcg or the 'root_mem_cgroup'. I will double
check that.

Thanks,
Feng

> Regards
> Yin, Fengwei
> 
> > +				skip_this_demotion = true;
> > +		}
> > +#endif
> > +
> >  		VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
> >  
> >  		nr_pages = folio_nr_pages(folio);
> > @@ -1799,7 +1828,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >  		 * Before reclaiming the folio, try to relocate
> >  		 * its contents to another node.
> >  		 */
> > -		if (do_demote_pass &&
> > +		if (do_demote_pass && !skip_this_demotion &&
> >  		    (thp_migration_supported() || !folio_test_large(folio))) {
> >  			list_add(&folio->lru, &demote_folios);
> >  			folio_unlock(folio);

Michal Hocko Oct. 26, 2022, 9:19 a.m. UTC | #5

On Wed 26-10-22 16:00:13, Feng Tang wrote:
> On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote:
> > On 10/26/22 1:13 PM, Feng Tang wrote:
> > > In page reclaim path, memory could be demoted from faster memory tier
> > > to slower memory tier. Currently, there is no check about cpuset's
> > > memory policy, that even if the target demotion node is not allowd
> > > by cpuset, the demotion will still happen, which breaks the cpuset
> > > semantics.
> > > 
> > > So add cpuset policy check in the demotion path and skip demotion
> > > if the demotion targets are not allowed by cpuset.
> > > 
> > 
> > What about the vma policy or the task memory policy? Shouldn't we respect
> > those memory policy restrictions while demoting the page? 
>  
> Good question! We have some basic patches to consider memory policy
> in demotion path too, which are still under test, and will be posted
> soon. And the basic idea is similar to this patch.

For that you need to consult each vma and it's owning task(s) and that
to me sounds like something to be done in folio_check_references.
Relying on memcg to get a cpuset cgroup is really ugly and not really
100% correct. Memory controller might be disabled and then you do not
have your association anymore.

This all can get quite expensive so the primary question is, does the
existing behavior generates any real issues or is this more of an
correctness exercise? I mean it certainly is not great to demote to an
incompatible numa node but are there any reasonable configurations when
the demotion target node is explicitly excluded from memory
policy/cpuset?

Aneesh Kumar K.V Oct. 26, 2022, 10:42 a.m. UTC | #6

On 10/26/22 2:49 PM, Michal Hocko wrote:
> On Wed 26-10-22 16:00:13, Feng Tang wrote:
>> On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote:
>>> On 10/26/22 1:13 PM, Feng Tang wrote:
>>>> In page reclaim path, memory could be demoted from faster memory tier
>>>> to slower memory tier. Currently, there is no check about cpuset's
>>>> memory policy, that even if the target demotion node is not allowd
>>>> by cpuset, the demotion will still happen, which breaks the cpuset
>>>> semantics.
>>>>
>>>> So add cpuset policy check in the demotion path and skip demotion
>>>> if the demotion targets are not allowed by cpuset.
>>>>
>>>
>>> What about the vma policy or the task memory policy? Shouldn't we respect
>>> those memory policy restrictions while demoting the page? 
>>  
>> Good question! We have some basic patches to consider memory policy
>> in demotion path too, which are still under test, and will be posted
>> soon. And the basic idea is similar to this patch.
> 
> For that you need to consult each vma and it's owning task(s) and that
> to me sounds like something to be done in folio_check_references.
> Relying on memcg to get a cpuset cgroup is really ugly and not really
> 100% correct. Memory controller might be disabled and then you do not
> have your association anymore.
> 

I was looking at this recently and I am wondering whether we should worry about VM_SHARE
vmas. 

ie, page_to_policy() can just reverse lookup just one VMA and fetch the policy right?
if it VM_SHARE it will be a shared policy we can find using vma->vm_file? 

For non anonymous and anon vma not having any policy set  it is owning task vma->vm_mm->owner task policy ? 
We don't worry about multiple tasks that can be possibly sharing that page right? 

> This all can get quite expensive so the primary question is, does the
> existing behavior generates any real issues or is this more of an
> correctness exercise? I mean it certainly is not great to demote to an
> incompatible numa node but are there any reasonable configurations when
> the demotion target node is explicitly excluded from memory
> policy/cpuset?

I guess vma policy is important. Applications want to make sure that they don't
have variable performance and they go to lengths to avoid that by using MEM_BIND.
So if they access the memory they always know access is satisfied from a specific
set of NUMA nodes. Swapin can cause performance impact but then all continued
access will be from a specific NUMA nodes. With slow memory demotion that is
not going to be the case. Large in-memory database applications are observed to
be sensitive to these access latencies. 

-aneesh

Michal Hocko Oct. 26, 2022, 11:02 a.m. UTC | #7

On Wed 26-10-22 16:12:25, Aneesh Kumar K V wrote:
> On 10/26/22 2:49 PM, Michal Hocko wrote:
> > On Wed 26-10-22 16:00:13, Feng Tang wrote:
> >> On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote:
> >>> On 10/26/22 1:13 PM, Feng Tang wrote:
> >>>> In page reclaim path, memory could be demoted from faster memory tier
> >>>> to slower memory tier. Currently, there is no check about cpuset's
> >>>> memory policy, that even if the target demotion node is not allowd
> >>>> by cpuset, the demotion will still happen, which breaks the cpuset
> >>>> semantics.
> >>>>
> >>>> So add cpuset policy check in the demotion path and skip demotion
> >>>> if the demotion targets are not allowed by cpuset.
> >>>>
> >>>
> >>> What about the vma policy or the task memory policy? Shouldn't we respect
> >>> those memory policy restrictions while demoting the page? 
> >>  
> >> Good question! We have some basic patches to consider memory policy
> >> in demotion path too, which are still under test, and will be posted
> >> soon. And the basic idea is similar to this patch.
> > 
> > For that you need to consult each vma and it's owning task(s) and that
> > to me sounds like something to be done in folio_check_references.
> > Relying on memcg to get a cpuset cgroup is really ugly and not really
> > 100% correct. Memory controller might be disabled and then you do not
> > have your association anymore.
> > 
> 
> I was looking at this recently and I am wondering whether we should worry about VM_SHARE
> vmas. 
> 
> ie, page_to_policy() can just reverse lookup just one VMA and fetch the policy right?

How would that help for private mappings shared between parent/child?
Also reducing this to a single VMA is not really necessary as
folio_check_references already does most of that work. What is really
missing is to check for other memory policies (i.e. cpusets and per-task
mempolicy). The later is what can get quite expensive.

> if it VM_SHARE it will be a shared policy we can find using vma->vm_file? 
> 
> For non anonymous and anon vma not having any policy set  it is owning task vma->vm_mm->owner task policy ? 

Please note that mm can be shared even outside of the traditional thread
group so you would need to go into something like mm_update_next_owner

> We don't worry about multiple tasks that can be possibly sharing that page right? 

Why not?

> > This all can get quite expensive so the primary question is, does the
> > existing behavior generates any real issues or is this more of an
> > correctness exercise? I mean it certainly is not great to demote to an
> > incompatible numa node but are there any reasonable configurations when
> > the demotion target node is explicitly excluded from memory
> > policy/cpuset?
> 
> I guess vma policy is important. Applications want to make sure that they don't
> have variable performance and they go to lengths to avoid that by using MEM_BIND.
> So if they access the memory they always know access is satisfied from a specific
> set of NUMA nodes. Swapin can cause performance impact but then all continued
> access will be from a specific NUMA nodes. With slow memory demotion that is
> not going to be the case. Large in-memory database applications are observed to
> be sensitive to these access latencies. 

Yes, I do understand that from the correctness POV this is a problem. My
question is whether this is a _practical_ problem worth really being
fixed as it is not really a cheap fix. If there are strong node locality
assumptions by the userspace then I would expect demotion to be disabled
in the first place.

Aneesh Kumar K.V Oct. 26, 2022, 12:08 p.m. UTC | #8

On 10/26/22 4:32 PM, Michal Hocko wrote:
> On Wed 26-10-22 16:12:25, Aneesh Kumar K V wrote:
>> On 10/26/22 2:49 PM, Michal Hocko wrote:
>>> On Wed 26-10-22 16:00:13, Feng Tang wrote:
>>>> On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote:
>>>>> On 10/26/22 1:13 PM, Feng Tang wrote:
>>>>>> In page reclaim path, memory could be demoted from faster memory tier
>>>>>> to slower memory tier. Currently, there is no check about cpuset's
>>>>>> memory policy, that even if the target demotion node is not allowd
>>>>>> by cpuset, the demotion will still happen, which breaks the cpuset
>>>>>> semantics.
>>>>>>
>>>>>> So add cpuset policy check in the demotion path and skip demotion
>>>>>> if the demotion targets are not allowed by cpuset.
>>>>>>
>>>>>
>>>>> What about the vma policy or the task memory policy? Shouldn't we respect
>>>>> those memory policy restrictions while demoting the page? 
>>>>  
>>>> Good question! We have some basic patches to consider memory policy
>>>> in demotion path too, which are still under test, and will be posted
>>>> soon. And the basic idea is similar to this patch.
>>>
>>> For that you need to consult each vma and it's owning task(s) and that
>>> to me sounds like something to be done in folio_check_references.
>>> Relying on memcg to get a cpuset cgroup is really ugly and not really
>>> 100% correct. Memory controller might be disabled and then you do not
>>> have your association anymore.
>>>
>>
>> I was looking at this recently and I am wondering whether we should worry about VM_SHARE
>> vmas. 
>>
>> ie, page_to_policy() can just reverse lookup just one VMA and fetch the policy right?
> 
> How would that help for private mappings shared between parent/child?


this is MAP_PRIVATE | MAP_SHARED? and won't they get converted to shared policy for the backing shmfs? via

	} else if (vm_flags & VM_SHARED) {
		error = shmem_zero_setup(vma);
		if (error)
			goto free_vma;
	} else {
		vma_set_anonymous(vma);
	}



> Also reducing this to a single VMA is not really necessary as
> folio_check_references already does most of that work. What is really
> missing is to check for other memory policies (i.e. cpusets and per-task
> mempolicy). The later is what can get quite expensive.
> 


I agree that walking all the related vma is already done in folio_check_references. I was
checking do we really need to check all the vma in case of memory policy.


>> if it VM_SHARE it will be a shared policy we can find using vma->vm_file? 
>>
>> For non anonymous and anon vma not having any policy set  it is owning task vma->vm_mm->owner task policy ? 
> 
> Please note that mm can be shared even outside of the traditional thread
> group so you would need to go into something like mm_update_next_owner
> 
>> We don't worry about multiple tasks that can be possibly sharing that page right? 
> 
> Why not?
> 

On the page fault side for non anonymous vma we only respect the memory policy of the
task faulting the page in. With that restrictions we could always say if demotion
node is allowed by the policy of any task sharing this vma, we can demote the
page to that specific node? 

>>> This all can get quite expensive so the primary question is, does the
>>> existing behavior generates any real issues or is this more of an
>>> correctness exercise? I mean it certainly is not great to demote to an
>>> incompatible numa node but are there any reasonable configurations when
>>> the demotion target node is explicitly excluded from memory
>>> policy/cpuset?
>>
>> I guess vma policy is important. Applications want to make sure that they don't
>> have variable performance and they go to lengths to avoid that by using MEM_BIND.
>> So if they access the memory they always know access is satisfied from a specific
>> set of NUMA nodes. Swapin can cause performance impact but then all continued
>> access will be from a specific NUMA nodes. With slow memory demotion that is
>> not going to be the case. Large in-memory database applications are observed to
>> be sensitive to these access latencies. 
> 
> Yes, I do understand that from the correctness POV this is a problem. My
> question is whether this is a _practical_ problem worth really being
> fixed as it is not really a cheap fix. If there are strong node locality
> assumptions by the userspace then I would expect demotion to be disabled
> in the first place.

Agreed. Right now these applications when they fail to allocate memory from
the Node on which they are running, they fallback to nearby NUMA nodes rather than
failing the database query. They would want to prevent fallback of some allocation
to slow cxl memory to avoid hot database tables getting allocated from the cxl
memory node. In that case one way they can consume slow cxl memory is to partition
the data structure using membind and allow cold pages to demoted back to
slow cxl memory making space for hot page allocation in the running NUMA node. 

Other option is to run the application using fsdax.

I am still not clear which option will get used finally. 


-aneesh

Feng Tang Oct. 26, 2022, 12:20 p.m. UTC | #9

On Wed, Oct 26, 2022 at 05:19:50PM +0800, Michal Hocko wrote:
> On Wed 26-10-22 16:00:13, Feng Tang wrote:
> > On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote:
> > > On 10/26/22 1:13 PM, Feng Tang wrote:
> > > > In page reclaim path, memory could be demoted from faster memory tier
> > > > to slower memory tier. Currently, there is no check about cpuset's
> > > > memory policy, that even if the target demotion node is not allowd
> > > > by cpuset, the demotion will still happen, which breaks the cpuset
> > > > semantics.
> > > > 
> > > > So add cpuset policy check in the demotion path and skip demotion
> > > > if the demotion targets are not allowed by cpuset.
> > > > 
> > > 
> > > What about the vma policy or the task memory policy? Shouldn't we respect
> > > those memory policy restrictions while demoting the page? 
> >  
> > Good question! We have some basic patches to consider memory policy
> > in demotion path too, which are still under test, and will be posted
> > soon. And the basic idea is similar to this patch.
> 
> For that you need to consult each vma and it's owning task(s) and that
> to me sounds like something to be done in folio_check_references.
> Relying on memcg to get a cpuset cgroup is really ugly and not really
> 100% correct. Memory controller might be disabled and then you do not
> have your association anymore.

You are right, for cpuset case, the solution depends on 'CONFIG_MEMCG=y',
and the bright side is most of distribution have it on.

> This all can get quite expensive so the primary question is, does the
> existing behavior generates any real issues or is this more of an
> correctness exercise? I mean it certainly is not great to demote to an
> incompatible numa node but are there any reasonable configurations when
> the demotion target node is explicitly excluded from memory
> policy/cpuset?

We haven't got customer report on this, but there are quite some customers
use cpuset to bind some specific memory nodes to a docker (You've helped
us solve a OOM issue in such cases), so I think it's practical to respect
the cpuset semantics as much as we can.

Your concern about the expensive cost makes sense! Some raw ideas are:
* if the shrink_folio_list is called by kswapd, the folios come from
  the same per-memcg lruvec, so only one check is enough
* if not from kswapd, like called form madvise or DAMON code, we can
  save a memcg cache, and if the next folio's memcg is same as the
  cache, we reuse its result. And due to the locality, the real
  check is rarely performed.

Thanks,
Feng

> -- 
> Michal Hocko
> SUSE Labs
>

Michal Hocko Oct. 26, 2022, 12:21 p.m. UTC | #10

On Wed 26-10-22 17:38:06, Aneesh Kumar K V wrote:
> On 10/26/22 4:32 PM, Michal Hocko wrote:
> > On Wed 26-10-22 16:12:25, Aneesh Kumar K V wrote:
> >> On 10/26/22 2:49 PM, Michal Hocko wrote:
> >>> On Wed 26-10-22 16:00:13, Feng Tang wrote:
> >>>> On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote:
> >>>>> On 10/26/22 1:13 PM, Feng Tang wrote:
> >>>>>> In page reclaim path, memory could be demoted from faster memory tier
> >>>>>> to slower memory tier. Currently, there is no check about cpuset's
> >>>>>> memory policy, that even if the target demotion node is not allowd
> >>>>>> by cpuset, the demotion will still happen, which breaks the cpuset
> >>>>>> semantics.
> >>>>>>
> >>>>>> So add cpuset policy check in the demotion path and skip demotion
> >>>>>> if the demotion targets are not allowed by cpuset.
> >>>>>>
> >>>>>
> >>>>> What about the vma policy or the task memory policy? Shouldn't we respect
> >>>>> those memory policy restrictions while demoting the page? 
> >>>>  
> >>>> Good question! We have some basic patches to consider memory policy
> >>>> in demotion path too, which are still under test, and will be posted
> >>>> soon. And the basic idea is similar to this patch.
> >>>
> >>> For that you need to consult each vma and it's owning task(s) and that
> >>> to me sounds like something to be done in folio_check_references.
> >>> Relying on memcg to get a cpuset cgroup is really ugly and not really
> >>> 100% correct. Memory controller might be disabled and then you do not
> >>> have your association anymore.
> >>>
> >>
> >> I was looking at this recently and I am wondering whether we should worry about VM_SHARE
> >> vmas. 
> >>
> >> ie, page_to_policy() can just reverse lookup just one VMA and fetch the policy right?
> > 
> > How would that help for private mappings shared between parent/child?
> 
> 
> this is MAP_PRIVATE | MAP_SHARED?

This is not a valid combination IIRC. What I meant is a simple
MAP_PRIVATE|MAP_ANON that is CoW shared between parent and child.

[...]

Aneesh Kumar K.V Oct. 26, 2022, 12:35 p.m. UTC | #11

On 10/26/22 5:51 PM, Michal Hocko wrote:
> On Wed 26-10-22 17:38:06, Aneesh Kumar K V wrote:
>> On 10/26/22 4:32 PM, Michal Hocko wrote:
>>> On Wed 26-10-22 16:12:25, Aneesh Kumar K V wrote:
>>>> On 10/26/22 2:49 PM, Michal Hocko wrote:
>>>>> On Wed 26-10-22 16:00:13, Feng Tang wrote:
>>>>>> On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote:
>>>>>>> On 10/26/22 1:13 PM, Feng Tang wrote:
>>>>>>>> In page reclaim path, memory could be demoted from faster memory tier
>>>>>>>> to slower memory tier. Currently, there is no check about cpuset's
>>>>>>>> memory policy, that even if the target demotion node is not allowd
>>>>>>>> by cpuset, the demotion will still happen, which breaks the cpuset
>>>>>>>> semantics.
>>>>>>>>
>>>>>>>> So add cpuset policy check in the demotion path and skip demotion
>>>>>>>> if the demotion targets are not allowed by cpuset.
>>>>>>>>
>>>>>>>
>>>>>>> What about the vma policy or the task memory policy? Shouldn't we respect
>>>>>>> those memory policy restrictions while demoting the page? 
>>>>>>  
>>>>>> Good question! We have some basic patches to consider memory policy
>>>>>> in demotion path too, which are still under test, and will be posted
>>>>>> soon. And the basic idea is similar to this patch.
>>>>>
>>>>> For that you need to consult each vma and it's owning task(s) and that
>>>>> to me sounds like something to be done in folio_check_references.
>>>>> Relying on memcg to get a cpuset cgroup is really ugly and not really
>>>>> 100% correct. Memory controller might be disabled and then you do not
>>>>> have your association anymore.
>>>>>
>>>>
>>>> I was looking at this recently and I am wondering whether we should worry about VM_SHARE
>>>> vmas. 
>>>>
>>>> ie, page_to_policy() can just reverse lookup just one VMA and fetch the policy right?
>>>
>>> How would that help for private mappings shared between parent/child?
>>
>>
>> this is MAP_PRIVATE | MAP_SHARED?
> 

Sorry, I meant MAP_ANONYMOUS | MAP_SHARED. 

> This is not a valid combination IIRC. What I meant is a simple
> MAP_PRIVATE|MAP_ANON that is CoW shared between parent and child.
> 
> [...]


-aneesh

Waiman Long Oct. 26, 2022, 2:36 p.m. UTC | #12

On 10/26/22 03:43, Feng Tang wrote:
> In page reclaim path, memory could be demoted from faster memory tier
> to slower memory tier. Currently, there is no check about cpuset's
> memory policy, that even if the target demotion node is not allowd
> by cpuset, the demotion will still happen, which breaks the cpuset
> semantics.
>
> So add cpuset policy check in the demotion path and skip demotion
> if the demotion targets are not allowed by cpuset.
>
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> ---
> Hi reviewers,
>
> For easy bisectable, I combined the cpuset change and mm change
> in one patch, if you prefer to separate them, I can turn it into
> 2 patches.
>
> Thanks,
> Feng
>
>   include/linux/cpuset.h |  6 ++++++
>   kernel/cgroup/cpuset.c | 29 +++++++++++++++++++++++++++++
>   mm/vmscan.c            | 35 ++++++++++++++++++++++++++++++++---
>   3 files changed, 67 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index d58e0476ee8e..6fcce2bd2631 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -178,6 +178,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
>   	task_unlock(current);
>   }
>   
> +extern void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup,
> +						nodemask_t *nmask);
>   #else /* !CONFIG_CPUSETS */
>   
>   static inline bool cpusets_enabled(void) { return false; }
> @@ -299,6 +301,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
>   	return false;
>   }
>   
> +static inline void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup,
> +						nodemask_t *nmask)
> +{
> +}
>   #endif /* !CONFIG_CPUSETS */
>   
>   #endif /* _LINUX_CPUSET_H */
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 3ea2e836e93e..cbb118c0502f 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -3750,6 +3750,35 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
>   	return mask;
>   }
>   
> +/*
> + * Retrieve the allowed memory nodemask for a cgroup.
> + *
> + * Set *nmask to cpuset's effective allowed nodemask for cgroup v2,
> + * and NODE_MASK_ALL (means no constraint) for cgroup v1 where there
> + * is no guaranteed association from a cgroup to a cpuset.
> + */
> +void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, nodemask_t *nmask)
> +{
> +	struct cgroup_subsys_state *css;
> +	struct cpuset *cs;
> +
> +	if (!is_in_v2_mode()) {
> +		*nmask = NODE_MASK_ALL;
> +		return;
> +	}

You are allowing all nodes to be used for cgroup v1. Is there a reason 
why you ignore v1?

> +
> +	rcu_read_lock();
> +	css = cgroup_e_css(cgroup, &cpuset_cgrp_subsys);
> +	if (css) {
> +		css_get(css);
> +		cs = css_cs(css);
> +		*nmask = cs->effective_mems;
> +		css_put(css);
> +	}
Since you are holding an RCU read lock and copying out the whole 
nodemask, you probably don't need to do a css_get/css_put pair.
> +
> +	rcu_read_unlock();
> +}
> +
Cheers,

Longman

Michal Hocko Oct. 26, 2022, 3:59 p.m. UTC | #13

On Wed 26-10-22 20:20:01, Feng Tang wrote:
> On Wed, Oct 26, 2022 at 05:19:50PM +0800, Michal Hocko wrote:
> > On Wed 26-10-22 16:00:13, Feng Tang wrote:
> > > On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote:
> > > > On 10/26/22 1:13 PM, Feng Tang wrote:
> > > > > In page reclaim path, memory could be demoted from faster memory tier
> > > > > to slower memory tier. Currently, there is no check about cpuset's
> > > > > memory policy, that even if the target demotion node is not allowd
> > > > > by cpuset, the demotion will still happen, which breaks the cpuset
> > > > > semantics.
> > > > > 
> > > > > So add cpuset policy check in the demotion path and skip demotion
> > > > > if the demotion targets are not allowed by cpuset.
> > > > > 
> > > > 
> > > > What about the vma policy or the task memory policy? Shouldn't we respect
> > > > those memory policy restrictions while demoting the page? 
> > >  
> > > Good question! We have some basic patches to consider memory policy
> > > in demotion path too, which are still under test, and will be posted
> > > soon. And the basic idea is similar to this patch.
> > 
> > For that you need to consult each vma and it's owning task(s) and that
> > to me sounds like something to be done in folio_check_references.
> > Relying on memcg to get a cpuset cgroup is really ugly and not really
> > 100% correct. Memory controller might be disabled and then you do not
> > have your association anymore.
>  
> You are right, for cpuset case, the solution depends on 'CONFIG_MEMCG=y',
> and the bright side is most of distribution have it on.

CONFIG_MEMCG=y is not sufficient. You would need to enable memcg
controller during the runtime as well.
 
> > This all can get quite expensive so the primary question is, does the
> > existing behavior generates any real issues or is this more of an
> > correctness exercise? I mean it certainly is not great to demote to an
> > incompatible numa node but are there any reasonable configurations when
> > the demotion target node is explicitly excluded from memory
> > policy/cpuset?
> 
> We haven't got customer report on this, but there are quite some customers
> use cpuset to bind some specific memory nodes to a docker (You've helped
> us solve a OOM issue in such cases), so I think it's practical to respect
> the cpuset semantics as much as we can.

Yes, it is definitely better to respect cpusets and all local memory
policies. There is no dispute there. The thing is whether this is really
worth it. How often would cpusets (or policies in general) go actively
against demotion nodes (i.e. exclude those nodes from their allowes node
mask)?

I can imagine workloads which wouldn't like to get their memory demoted
for some reason but wouldn't it be more practical to tell that
explicitly (e.g. via prctl) rather than configuring cpusets/memory
policies explicitly?
 
> Your concern about the expensive cost makes sense! Some raw ideas are:
> * if the shrink_folio_list is called by kswapd, the folios come from
>   the same per-memcg lruvec, so only one check is enough
> * if not from kswapd, like called form madvise or DAMON code, we can
>   save a memcg cache, and if the next folio's memcg is same as the
>   cache, we reuse its result. And due to the locality, the real
>   check is rarely performed.

memcg is not the expensive part of the thing. You need to get from page
-> all vmas::vm_policy -> mm -> task::mempolicy

Yang Shi Oct. 26, 2022, 5:57 p.m. UTC | #14

On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 26-10-22 20:20:01, Feng Tang wrote:
> > On Wed, Oct 26, 2022 at 05:19:50PM +0800, Michal Hocko wrote:
> > > On Wed 26-10-22 16:00:13, Feng Tang wrote:
> > > > On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote:
> > > > > On 10/26/22 1:13 PM, Feng Tang wrote:
> > > > > > In page reclaim path, memory could be demoted from faster memory tier
> > > > > > to slower memory tier. Currently, there is no check about cpuset's
> > > > > > memory policy, that even if the target demotion node is not allowd
> > > > > > by cpuset, the demotion will still happen, which breaks the cpuset
> > > > > > semantics.
> > > > > >
> > > > > > So add cpuset policy check in the demotion path and skip demotion
> > > > > > if the demotion targets are not allowed by cpuset.
> > > > > >
> > > > >
> > > > > What about the vma policy or the task memory policy? Shouldn't we respect
> > > > > those memory policy restrictions while demoting the page?
> > > >
> > > > Good question! We have some basic patches to consider memory policy
> > > > in demotion path too, which are still under test, and will be posted
> > > > soon. And the basic idea is similar to this patch.
> > >
> > > For that you need to consult each vma and it's owning task(s) and that
> > > to me sounds like something to be done in folio_check_references.
> > > Relying on memcg to get a cpuset cgroup is really ugly and not really
> > > 100% correct. Memory controller might be disabled and then you do not
> > > have your association anymore.
> >
> > You are right, for cpuset case, the solution depends on 'CONFIG_MEMCG=y',
> > and the bright side is most of distribution have it on.
>
> CONFIG_MEMCG=y is not sufficient. You would need to enable memcg
> controller during the runtime as well.
>
> > > This all can get quite expensive so the primary question is, does the
> > > existing behavior generates any real issues or is this more of an
> > > correctness exercise? I mean it certainly is not great to demote to an
> > > incompatible numa node but are there any reasonable configurations when
> > > the demotion target node is explicitly excluded from memory
> > > policy/cpuset?
> >
> > We haven't got customer report on this, but there are quite some customers
> > use cpuset to bind some specific memory nodes to a docker (You've helped
> > us solve a OOM issue in such cases), so I think it's practical to respect
> > the cpuset semantics as much as we can.
>
> Yes, it is definitely better to respect cpusets and all local memory
> policies. There is no dispute there. The thing is whether this is really
> worth it. How often would cpusets (or policies in general) go actively
> against demotion nodes (i.e. exclude those nodes from their allowes node
> mask)?
>
> I can imagine workloads which wouldn't like to get their memory demoted
> for some reason but wouldn't it be more practical to tell that
> explicitly (e.g. via prctl) rather than configuring cpusets/memory
> policies explicitly?
>
> > Your concern about the expensive cost makes sense! Some raw ideas are:
> > * if the shrink_folio_list is called by kswapd, the folios come from
> >   the same per-memcg lruvec, so only one check is enough
> > * if not from kswapd, like called form madvise or DAMON code, we can
> >   save a memcg cache, and if the next folio's memcg is same as the
> >   cache, we reuse its result. And due to the locality, the real
> >   check is rarely performed.
>
> memcg is not the expensive part of the thing. You need to get from page
> -> all vmas::vm_policy -> mm -> task::mempolicy

Yeah, on the same page with Michal. Figuring out mempolicy from page
seems quite expensive and the correctness can't be guranteed since the
mempolicy could be set per-thread and the mm->task depends on
CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG.

>
> --
> Michal Hocko
> SUSE Labs
>

Huang, Ying Oct. 27, 2022, 5:13 a.m. UTC | #15

Feng Tang <feng.tang@intel.com> writes:

> In page reclaim path, memory could be demoted from faster memory tier
> to slower memory tier. Currently, there is no check about cpuset's
> memory policy, that even if the target demotion node is not allowd
> by cpuset, the demotion will still happen, which breaks the cpuset
> semantics.
>
> So add cpuset policy check in the demotion path and skip demotion
> if the demotion targets are not allowed by cpuset.
>
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> ---
> Hi reviewers,
>
> For easy bisectable, I combined the cpuset change and mm change
> in one patch, if you prefer to separate them, I can turn it into
> 2 patches.
>
> Thanks,
> Feng
>
>  include/linux/cpuset.h |  6 ++++++
>  kernel/cgroup/cpuset.c | 29 +++++++++++++++++++++++++++++
>  mm/vmscan.c            | 35 ++++++++++++++++++++++++++++++++---
>  3 files changed, 67 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index d58e0476ee8e..6fcce2bd2631 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -178,6 +178,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
>  	task_unlock(current);
>  }
>  
> +extern void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup,
> +						nodemask_t *nmask);
>  #else /* !CONFIG_CPUSETS */
>  
>  static inline bool cpusets_enabled(void) { return false; }
> @@ -299,6 +301,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
>  	return false;
>  }
>  
> +static inline void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup,
> +						nodemask_t *nmask)
> +{
> +}
>  #endif /* !CONFIG_CPUSETS */
>  
>  #endif /* _LINUX_CPUSET_H */
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 3ea2e836e93e..cbb118c0502f 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -3750,6 +3750,35 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
>  	return mask;
>  }
>  
> +/*
> + * Retrieve the allowed memory nodemask for a cgroup.
> + *
> + * Set *nmask to cpuset's effective allowed nodemask for cgroup v2,
> + * and NODE_MASK_ALL (means no constraint) for cgroup v1 where there
> + * is no guaranteed association from a cgroup to a cpuset.
> + */
> +void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, nodemask_t *nmask)
> +{
> +	struct cgroup_subsys_state *css;
> +	struct cpuset *cs;
> +
> +	if (!is_in_v2_mode()) {
> +		*nmask = NODE_MASK_ALL;
> +		return;
> +	}
> +
> +	rcu_read_lock();
> +	css = cgroup_e_css(cgroup, &cpuset_cgrp_subsys);
> +	if (css) {
> +		css_get(css);
> +		cs = css_cs(css);
> +		*nmask = cs->effective_mems;
> +		css_put(css);
> +	}
> +
> +	rcu_read_unlock();
> +}
> +
>  /**
>   * cpuset_nodemask_valid_mems_allowed - check nodemask vs. current mems_allowed
>   * @nodemask: the nodemask to be checked
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 18f6497994ec..c205d98283bc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1537,9 +1537,21 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private)
>  {
>  	struct page *target_page;
>  	nodemask_t *allowed_mask;
> -	struct migration_target_control *mtc;
> +	struct migration_target_control *mtc = (void *)private;
>  
> -	mtc = (struct migration_target_control *)private;

I think we should avoid (void *) conversion here.

> +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
> +	struct mem_cgroup *memcg;
> +	nodemask_t cpuset_nmask;
> +
> +	memcg = page_memcg(page);
> +	cpuset_get_allowed_mem_nodes(memcg->css.cgroup, &cpuset_nmask);
> +
> +	if (!node_isset(mtc->nid, cpuset_nmask)) {
> +		if (mtc->nmask)
> +			nodes_and(*mtc->nmask, *mtc->nmask, cpuset_nmask);
> +		return alloc_migration_target(page, (unsigned long)mtc);
> +	}

If node_isset(mtc->nid, cpuset_nmask) == true, we should keep the
original 2 steps allocation and apply nodes_and() on node mask.

> +#endif
>  
>  	allowed_mask = mtc->nmask;
>  	/*
> @@ -1649,6 +1661,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  		enum folio_references references = FOLIOREF_RECLAIM;
>  		bool dirty, writeback;
>  		unsigned int nr_pages;
> +		bool skip_this_demotion = false;
>  
>  		cond_resched();
>  
> @@ -1658,6 +1671,22 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  		if (!folio_trylock(folio))
>  			goto keep;
>  
> +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
> +		if (do_demote_pass) {
> +			struct mem_cgroup *memcg;
> +			nodemask_t nmask, nmask1;
> +
> +			node_get_allowed_targets(pgdat, &nmask);

pgdat will not change in the loop, so we can move this out of the loop?

> +			memcg = folio_memcg(folio);
> +			if (memcg)
> +				cpuset_get_allowed_mem_nodes(memcg->css.cgroup,
> +								&nmask1);
> +
> +			if (!nodes_intersects(nmask, nmask1))
> +				skip_this_demotion = true;
> +		}

If nodes_intersects() == true, we will call
cpuset_get_allowed_mem_nodes() twice.  Better to pass the intersecting
mask to demote_folio_list()?

> +#endif
> +
>  		VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
>  
>  		nr_pages = folio_nr_pages(folio);
> @@ -1799,7 +1828,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  		 * Before reclaiming the folio, try to relocate
>  		 * its contents to another node.
>  		 */
> -		if (do_demote_pass &&
> +		if (do_demote_pass && !skip_this_demotion &&
>  		    (thp_migration_supported() || !folio_test_large(folio))) {
>  			list_add(&folio->lru, &demote_folios);
>  			folio_unlock(folio);

Best Regards,
Huang, Ying

Feng Tang Oct. 27, 2022, 5:49 a.m. UTC | #16

On Thu, Oct 27, 2022 at 01:13:30PM +0800, Huang, Ying wrote:
> Feng Tang <feng.tang@intel.com> writes:
> 
> > In page reclaim path, memory could be demoted from faster memory tier
> > to slower memory tier. Currently, there is no check about cpuset's
> > memory policy, that even if the target demotion node is not allowd
> > by cpuset, the demotion will still happen, which breaks the cpuset
> > semantics.
> >
> > So add cpuset policy check in the demotion path and skip demotion
> > if the demotion targets are not allowed by cpuset.
> >
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > ---
[...]
> > index 18f6497994ec..c205d98283bc 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1537,9 +1537,21 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private)
> >  {
> >  	struct page *target_page;
> >  	nodemask_t *allowed_mask;
> > -	struct migration_target_control *mtc;
> > +	struct migration_target_control *mtc = (void *)private;
> >  
> > -	mtc = (struct migration_target_control *)private;
> 
> I think we should avoid (void *) conversion here.

OK, will change back.

> > +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
> > +	struct mem_cgroup *memcg;
> > +	nodemask_t cpuset_nmask;
> > +
> > +	memcg = page_memcg(page);
> > +	cpuset_get_allowed_mem_nodes(memcg->css.cgroup, &cpuset_nmask);
> > +
> > +	if (!node_isset(mtc->nid, cpuset_nmask)) {
> > +		if (mtc->nmask)
> > +			nodes_and(*mtc->nmask, *mtc->nmask, cpuset_nmask);
> > +		return alloc_migration_target(page, (unsigned long)mtc);
> > +	}
> 
> If node_isset(mtc->nid, cpuset_nmask) == true, we should keep the
> original 2 steps allocation and apply nodes_and() on node mask.

Good catch! Yes, the nodes_and() call should be taken out from this
check and done before calling node_isset().

> > +#endif
> >  
> >  	allowed_mask = mtc->nmask;
> >  	/*
> > @@ -1649,6 +1661,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >  		enum folio_references references = FOLIOREF_RECLAIM;
> >  		bool dirty, writeback;
> >  		unsigned int nr_pages;
> > +		bool skip_this_demotion = false;
> >  
> >  		cond_resched();
> >  
> > @@ -1658,6 +1671,22 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >  		if (!folio_trylock(folio))
> >  			goto keep;
> >  
> > +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
> > +		if (do_demote_pass) {
> > +			struct mem_cgroup *memcg;
> > +			nodemask_t nmask, nmask1;
> > +
> > +			node_get_allowed_targets(pgdat, &nmask);
> 
> pgdat will not change in the loop, so we can move this out of the loop?
 
Yes

> > +			memcg = folio_memcg(folio);
> > +			if (memcg)
> > +				cpuset_get_allowed_mem_nodes(memcg->css.cgroup,
> > +								&nmask1);
> > +
> > +			if (!nodes_intersects(nmask, nmask1))
> > +				skip_this_demotion = true;
> > +		}
> 
> If nodes_intersects() == true, we will call
> cpuset_get_allowed_mem_nodes() twice.  Better to pass the intersecting
> mask to demote_folio_list()?
 
The pages in the loop may come from different mem control group, and
the cpuset's nodemask could be different, I don't know how to save
this per-page info to be used later in demote_folio_list.

Thanks,
Feng

> > +#endif
> > +
> >  		VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
> >  
> >  		nr_pages = folio_nr_pages(folio);
> > @@ -1799,7 +1828,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >  		 * Before reclaiming the folio, try to relocate
> >  		 * its contents to another node.
> >  		 */
> > -		if (do_demote_pass &&
> > +		if (do_demote_pass && !skip_this_demotion &&
> >  		    (thp_migration_supported() || !folio_test_large(folio))) {
> >  			list_add(&folio->lru, &demote_folios);
> >  			folio_unlock(folio);
> 
> Best Regards,
> Huang, Ying

Feng Tang Oct. 27, 2022, 5:57 a.m. UTC | #17

On Wed, Oct 26, 2022 at 10:36:32PM +0800, Waiman Long wrote:
> On 10/26/22 03:43, Feng Tang wrote:
> > In page reclaim path, memory could be demoted from faster memory tier
> > to slower memory tier. Currently, there is no check about cpuset's
> > memory policy, that even if the target demotion node is not allowd
> > by cpuset, the demotion will still happen, which breaks the cpuset
> > semantics.
> >
> > So add cpuset policy check in the demotion path and skip demotion
> > if the demotion targets are not allowed by cpuset.
> >
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > ---
> > Hi reviewers,
> >
> > For easy bisectable, I combined the cpuset change and mm change
> > in one patch, if you prefer to separate them, I can turn it into
> > 2 patches.
> >
> > Thanks,
> > Feng
> >
> >   include/linux/cpuset.h |  6 ++++++
> >   kernel/cgroup/cpuset.c | 29 +++++++++++++++++++++++++++++
> >   mm/vmscan.c            | 35 ++++++++++++++++++++++++++++++++---
> >   3 files changed, 67 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> > index d58e0476ee8e..6fcce2bd2631 100644
> > --- a/include/linux/cpuset.h
> > +++ b/include/linux/cpuset.h
> > @@ -178,6 +178,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
> >   	task_unlock(current);
> >   }
> >   
> > +extern void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup,
> > +						nodemask_t *nmask);
> >   #else /* !CONFIG_CPUSETS */
> >   
> >   static inline bool cpusets_enabled(void) { return false; }
> > @@ -299,6 +301,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
> >   	return false;
> >   }
> >   
> > +static inline void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup,
> > +						nodemask_t *nmask)
> > +{
> > +}
> >   #endif /* !CONFIG_CPUSETS */
> >   
> >   #endif /* _LINUX_CPUSET_H */
> > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > index 3ea2e836e93e..cbb118c0502f 100644
> > --- a/kernel/cgroup/cpuset.c
> > +++ b/kernel/cgroup/cpuset.c
> > @@ -3750,6 +3750,35 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
> >   	return mask;
> >   }
> >   
> > +/*
> > + * Retrieve the allowed memory nodemask for a cgroup.
> > + *
> > + * Set *nmask to cpuset's effective allowed nodemask for cgroup v2,
> > + * and NODE_MASK_ALL (means no constraint) for cgroup v1 where there
> > + * is no guaranteed association from a cgroup to a cpuset.
> > + */
> > +void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, nodemask_t *nmask)
> > +{
> > +	struct cgroup_subsys_state *css;
> > +	struct cpuset *cs;
> > +
> > +	if (!is_in_v2_mode()) {
> > +		*nmask = NODE_MASK_ALL;
> > +		return;
> > +	}
> 
> You are allowing all nodes to be used for cgroup v1. Is there a reason 
> why you ignore v1?
 
The use case for the API is, for a memory control group, user want to
get its associated cpuset controller's memory policy, so it tries
the memcg --> cgroup --> cpuset chain. IIUC, there is no a reliable
chain for cgroup v1, plus cgroup v2 is the default option for many
distros, the cgroup v1 is bypassed here.

> > +
> > +	rcu_read_lock();
> > +	css = cgroup_e_css(cgroup, &cpuset_cgrp_subsys);
> > +	if (css) {
> > +		css_get(css);
> > +		cs = css_cs(css);
> > +		*nmask = cs->effective_mems;
> > +		css_put(css);
> > +	}
> Since you are holding an RCU read lock and copying out the whole 
> nodemask, you probably don't need to do a css_get/css_put pair.

Thanks for the note!

Thanks,
Feng

> > +
> > +	rcu_read_unlock();
> > +}
> > +
> Cheers,
> 
> Longman
>

Huang, Ying Oct. 27, 2022, 6:05 a.m. UTC | #18

Feng Tang <feng.tang@intel.com> writes:

> On Thu, Oct 27, 2022 at 01:13:30PM +0800, Huang, Ying wrote:
>> Feng Tang <feng.tang@intel.com> writes:
>> 
>> > In page reclaim path, memory could be demoted from faster memory tier
>> > to slower memory tier. Currently, there is no check about cpuset's
>> > memory policy, that even if the target demotion node is not allowd
>> > by cpuset, the demotion will still happen, which breaks the cpuset
>> > semantics.
>> >
>> > So add cpuset policy check in the demotion path and skip demotion
>> > if the demotion targets are not allowed by cpuset.
>> >
>> > Signed-off-by: Feng Tang <feng.tang@intel.com>
>> > ---
> [...]
>> > index 18f6497994ec..c205d98283bc 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -1537,9 +1537,21 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private)
>> >  {
>> >  	struct page *target_page;
>> >  	nodemask_t *allowed_mask;
>> > -	struct migration_target_control *mtc;
>> > +	struct migration_target_control *mtc = (void *)private;
>> >  
>> > -	mtc = (struct migration_target_control *)private;
>> 
>> I think we should avoid (void *) conversion here.
>
> OK, will change back.
>
>> > +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
>> > +	struct mem_cgroup *memcg;
>> > +	nodemask_t cpuset_nmask;
>> > +
>> > +	memcg = page_memcg(page);
>> > +	cpuset_get_allowed_mem_nodes(memcg->css.cgroup, &cpuset_nmask);
>> > +
>> > +	if (!node_isset(mtc->nid, cpuset_nmask)) {
>> > +		if (mtc->nmask)
>> > +			nodes_and(*mtc->nmask, *mtc->nmask, cpuset_nmask);
>> > +		return alloc_migration_target(page, (unsigned long)mtc);
>> > +	}
>> 
>> If node_isset(mtc->nid, cpuset_nmask) == true, we should keep the
>> original 2 steps allocation and apply nodes_and() on node mask.
>
> Good catch! Yes, the nodes_and() call should be taken out from this
> check and done before calling node_isset().
>
>> > +#endif
>> >  
>> >  	allowed_mask = mtc->nmask;
>> >  	/*
>> > @@ -1649,6 +1661,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>> >  		enum folio_references references = FOLIOREF_RECLAIM;
>> >  		bool dirty, writeback;
>> >  		unsigned int nr_pages;
>> > +		bool skip_this_demotion = false;
>> >  
>> >  		cond_resched();
>> >  
>> > @@ -1658,6 +1671,22 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>> >  		if (!folio_trylock(folio))
>> >  			goto keep;
>> >  
>> > +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
>> > +		if (do_demote_pass) {
>> > +			struct mem_cgroup *memcg;
>> > +			nodemask_t nmask, nmask1;
>> > +
>> > +			node_get_allowed_targets(pgdat, &nmask);
>> 
>> pgdat will not change in the loop, so we can move this out of the loop?
>  
> Yes
>
>> > +			memcg = folio_memcg(folio);
>> > +			if (memcg)
>> > +				cpuset_get_allowed_mem_nodes(memcg->css.cgroup,
>> > +								&nmask1);
>> > +
>> > +			if (!nodes_intersects(nmask, nmask1))
>> > +				skip_this_demotion = true;
>> > +		}
>> 
>> If nodes_intersects() == true, we will call
>> cpuset_get_allowed_mem_nodes() twice.  Better to pass the intersecting
>> mask to demote_folio_list()?
>  
> The pages in the loop may come from different mem control group, and
> the cpuset's nodemask could be different, I don't know how to save
> this per-page info to be used later in demote_folio_list.

Yes.  You are right.  We cannot do that.

Best Regards,
Huang, Ying

>
>> > +#endif
>> > +
>> >  		VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
>> >  
>> >  		nr_pages = folio_nr_pages(folio);
>> > @@ -1799,7 +1828,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>> >  		 * Before reclaiming the folio, try to relocate
>> >  		 * its contents to another node.
>> >  		 */
>> > -		if (do_demote_pass &&
>> > +		if (do_demote_pass && !skip_this_demotion &&
>> >  		    (thp_migration_supported() || !folio_test_large(folio))) {
>> >  			list_add(&folio->lru, &demote_folios);
>> >  			folio_unlock(folio);
>> 
>> Best Regards,
>> Huang, Ying

Huang, Ying Oct. 27, 2022, 6:47 a.m. UTC | #19

Michal Hocko <mhocko@suse.com> writes:

> On Wed 26-10-22 20:20:01, Feng Tang wrote:
>> On Wed, Oct 26, 2022 at 05:19:50PM +0800, Michal Hocko wrote:
>> > On Wed 26-10-22 16:00:13, Feng Tang wrote:
>> > > On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote:
>> > > > On 10/26/22 1:13 PM, Feng Tang wrote:
>> > > > > In page reclaim path, memory could be demoted from faster memory tier
>> > > > > to slower memory tier. Currently, there is no check about cpuset's
>> > > > > memory policy, that even if the target demotion node is not allowd
>> > > > > by cpuset, the demotion will still happen, which breaks the cpuset
>> > > > > semantics.
>> > > > > 
>> > > > > So add cpuset policy check in the demotion path and skip demotion
>> > > > > if the demotion targets are not allowed by cpuset.
>> > > > > 
>> > > > 
>> > > > What about the vma policy or the task memory policy? Shouldn't we respect
>> > > > those memory policy restrictions while demoting the page? 
>> > >  
>> > > Good question! We have some basic patches to consider memory policy
>> > > in demotion path too, which are still under test, and will be posted
>> > > soon. And the basic idea is similar to this patch.
>> > 
>> > For that you need to consult each vma and it's owning task(s) and that
>> > to me sounds like something to be done in folio_check_references.
>> > Relying on memcg to get a cpuset cgroup is really ugly and not really
>> > 100% correct. Memory controller might be disabled and then you do not
>> > have your association anymore.
>>  
>> You are right, for cpuset case, the solution depends on 'CONFIG_MEMCG=y',
>> and the bright side is most of distribution have it on.
>
> CONFIG_MEMCG=y is not sufficient. You would need to enable memcg
> controller during the runtime as well.
>  
>> > This all can get quite expensive so the primary question is, does the
>> > existing behavior generates any real issues or is this more of an
>> > correctness exercise? I mean it certainly is not great to demote to an
>> > incompatible numa node but are there any reasonable configurations when
>> > the demotion target node is explicitly excluded from memory
>> > policy/cpuset?
>> 
>> We haven't got customer report on this, but there are quite some customers
>> use cpuset to bind some specific memory nodes to a docker (You've helped
>> us solve a OOM issue in such cases), so I think it's practical to respect
>> the cpuset semantics as much as we can.
>
> Yes, it is definitely better to respect cpusets and all local memory
> policies. There is no dispute there. The thing is whether this is really
> worth it. How often would cpusets (or policies in general) go actively
> against demotion nodes (i.e. exclude those nodes from their allowes node
> mask)?
>
> I can imagine workloads which wouldn't like to get their memory demoted
> for some reason but wouldn't it be more practical to tell that
> explicitly (e.g. via prctl) rather than configuring cpusets/memory
> policies explicitly?

If my understanding were correct, prctl() configures the process or
thread.  How can we get process/thread configuration at demotion time?

Best Regards,
Huang, Ying

Michal Hocko Oct. 27, 2022, 7:10 a.m. UTC | #20

On Thu 27-10-22 14:47:22, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
[...]
> > I can imagine workloads which wouldn't like to get their memory demoted
> > for some reason but wouldn't it be more practical to tell that
> > explicitly (e.g. via prctl) rather than configuring cpusets/memory
> > policies explicitly?
> 
> If my understanding were correct, prctl() configures the process or
> thread.

Not necessarily. There are properties which are per adddress space like
PR_[GS]ET_THP_DISABLE. This could be very similar.

> How can we get process/thread configuration at demotion time?

As already pointed out in previous emails. You could hook into
folio_check_references path, more specifically folio_referenced_one
where you have all that you need already - all vmas mapping the page and
then it is trivial to get the corresponding vm_mm. If at least one of
them has the flag set then the demotion is not allowed (essentially the
same model as VM_LOCKED).

Feng Tang Oct. 27, 2022, 7:11 a.m. UTC | #21

On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote:
> On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > > > This all can get quite expensive so the primary question is, does the
> > > > existing behavior generates any real issues or is this more of an
> > > > correctness exercise? I mean it certainly is not great to demote to an
> > > > incompatible numa node but are there any reasonable configurations when
> > > > the demotion target node is explicitly excluded from memory
> > > > policy/cpuset?
> > >
> > > We haven't got customer report on this, but there are quite some customers
> > > use cpuset to bind some specific memory nodes to a docker (You've helped
> > > us solve a OOM issue in such cases), so I think it's practical to respect
> > > the cpuset semantics as much as we can.
> >
> > Yes, it is definitely better to respect cpusets and all local memory
> > policies. There is no dispute there. The thing is whether this is really
> > worth it. How often would cpusets (or policies in general) go actively
> > against demotion nodes (i.e. exclude those nodes from their allowes node
> > mask)?
> >
> > I can imagine workloads which wouldn't like to get their memory demoted
> > for some reason but wouldn't it be more practical to tell that
> > explicitly (e.g. via prctl) rather than configuring cpusets/memory
> > policies explicitly?
> >
> > > Your concern about the expensive cost makes sense! Some raw ideas are:
> > > * if the shrink_folio_list is called by kswapd, the folios come from
> > >   the same per-memcg lruvec, so only one check is enough
> > > * if not from kswapd, like called form madvise or DAMON code, we can
> > >   save a memcg cache, and if the next folio's memcg is same as the
> > >   cache, we reuse its result. And due to the locality, the real
> > >   check is rarely performed.
> >
> > memcg is not the expensive part of the thing. You need to get from page
> > -> all vmas::vm_policy -> mm -> task::mempolicy
> 
> Yeah, on the same page with Michal. Figuring out mempolicy from page
> seems quite expensive and the correctness can't be guranteed since the
> mempolicy could be set per-thread and the mm->task depends on
> CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG.

Yes, you are right. Our "working" psudo code for mem policy looks like
what Michal mentioned, and it can't work for all cases, but try to
enforce it whenever possible:

static bool  __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma,
		unsigned long addr, void *arg)
{
	bool *skip_demotion = arg;
	struct mempolicy *mpol;
	int nid, dnid;
	bool ret = true;

	mpol = __get_vma_policy(vma, addr);
	if (!mpol) {
		struct task_struct *task;
		if (vma->vm_mm)
			task = vma->vm_mm->owner;

		if (task) {
			mpol = get_task_policy(task);
			if (mpol)
				mpol_get(mpol);
		}
	}

	if (!mpol)
		return ret;

	if (mpol->mode != MPOL_BIND)
		goto put_exit;

	nid = folio_nid(folio);
	dnid = next_demotion_node(nid);
	if (!node_isset(dnid, mpol->nodes)) {
		*skip_demotion = true;
		ret = false;
	}

put_exit:
	mpol_put(mpol);
	return ret;
}
	
static unsigned int shrink_page_list(struct list_head *page_list,..)
{
	...

	bool skip_demotion = false;
	struct rmap_walk_control rwc = {
		.arg = &skip_demotion,
		.rmap_one = __check_mpol_demotion,
	};

	/* memory policy check */
	rmap_walk(folio, &rwc);
	if (skip_demotion)
		goto keep_locked;
}

And there seems to be no simple solution for getting the memory
policy from a page.

Thanks,
Feng

> >
> > --
> > Michal Hocko
> > SUSE Labs
> >
>

Huang, Ying Oct. 27, 2022, 7:39 a.m. UTC | #22

Michal Hocko <mhocko@suse.com> writes:

> On Thu 27-10-22 14:47:22, Huang, Ying wrote:
>> Michal Hocko <mhocko@suse.com> writes:
> [...]
>> > I can imagine workloads which wouldn't like to get their memory demoted
>> > for some reason but wouldn't it be more practical to tell that
>> > explicitly (e.g. via prctl) rather than configuring cpusets/memory
>> > policies explicitly?
>> 
>> If my understanding were correct, prctl() configures the process or
>> thread.
>
> Not necessarily. There are properties which are per adddress space like
> PR_[GS]ET_THP_DISABLE. This could be very similar.
>
>> How can we get process/thread configuration at demotion time?
>
> As already pointed out in previous emails. You could hook into
> folio_check_references path, more specifically folio_referenced_one
> where you have all that you need already - all vmas mapping the page and
> then it is trivial to get the corresponding vm_mm. If at least one of
> them has the flag set then the demotion is not allowed (essentially the
> same model as VM_LOCKED).

Got it!  Thanks for detailed explanation.

One bit may be not sufficient.  For example, if we want to avoid or
control cross-socket demotion and still allow demoting to slow memory
nodes in local socket, we need to specify a node mask to exclude some
NUMA nodes from demotion targets.

From overhead point of view, this appears similar as that of VMA/task
memory policy?  We can make mm->owner available for memory tiers
(CONFIG_NUMA && CONFIG_MIGRATION).  The advantage is that we don't need
to introduce new ABI.  I guess users may prefer to use `numactl` than a
new ABI?

Best Regards,
Huang, Ying

Huang, Ying Oct. 27, 2022, 7:45 a.m. UTC | #23

Feng Tang <feng.tang@intel.com> writes:

> On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote:
>> On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <mhocko@suse.com> wrote:
> [...]
>> > > > This all can get quite expensive so the primary question is, does the
>> > > > existing behavior generates any real issues or is this more of an
>> > > > correctness exercise? I mean it certainly is not great to demote to an
>> > > > incompatible numa node but are there any reasonable configurations when
>> > > > the demotion target node is explicitly excluded from memory
>> > > > policy/cpuset?
>> > >
>> > > We haven't got customer report on this, but there are quite some customers
>> > > use cpuset to bind some specific memory nodes to a docker (You've helped
>> > > us solve a OOM issue in such cases), so I think it's practical to respect
>> > > the cpuset semantics as much as we can.
>> >
>> > Yes, it is definitely better to respect cpusets and all local memory
>> > policies. There is no dispute there. The thing is whether this is really
>> > worth it. How often would cpusets (or policies in general) go actively
>> > against demotion nodes (i.e. exclude those nodes from their allowes node
>> > mask)?
>> >
>> > I can imagine workloads which wouldn't like to get their memory demoted
>> > for some reason but wouldn't it be more practical to tell that
>> > explicitly (e.g. via prctl) rather than configuring cpusets/memory
>> > policies explicitly?
>> >
>> > > Your concern about the expensive cost makes sense! Some raw ideas are:
>> > > * if the shrink_folio_list is called by kswapd, the folios come from
>> > >   the same per-memcg lruvec, so only one check is enough
>> > > * if not from kswapd, like called form madvise or DAMON code, we can
>> > >   save a memcg cache, and if the next folio's memcg is same as the
>> > >   cache, we reuse its result. And due to the locality, the real
>> > >   check is rarely performed.
>> >
>> > memcg is not the expensive part of the thing. You need to get from page
>> > -> all vmas::vm_policy -> mm -> task::mempolicy
>> 
>> Yeah, on the same page with Michal. Figuring out mempolicy from page
>> seems quite expensive and the correctness can't be guranteed since the
>> mempolicy could be set per-thread and the mm->task depends on
>> CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG.
>
> Yes, you are right. Our "working" psudo code for mem policy looks like
> what Michal mentioned, and it can't work for all cases, but try to
> enforce it whenever possible:
>
> static bool  __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma,
> 		unsigned long addr, void *arg)
> {
> 	bool *skip_demotion = arg;
> 	struct mempolicy *mpol;
> 	int nid, dnid;
> 	bool ret = true;
>
> 	mpol = __get_vma_policy(vma, addr);
> 	if (!mpol) {
> 		struct task_struct *task;

                task = NULL;

> 		if (vma->vm_mm)
> 			task = vma->vm_mm->owner;
>
> 		if (task) {
> 			mpol = get_task_policy(task);
> 			if (mpol)
> 				mpol_get(mpol);
> 		}
> 	}
>
> 	if (!mpol)
> 		return ret;
>
> 	if (mpol->mode != MPOL_BIND)
> 		goto put_exit;
>
> 	nid = folio_nid(folio);
> 	dnid = next_demotion_node(nid);
> 	if (!node_isset(dnid, mpol->nodes)) {
> 		*skip_demotion = true;
> 		ret = false;
> 	}

I think that you need to get a node mask instead.  Even if
!node_isset(dnid, mpol->nodes), you may demote to other node in the node
mask.

Best Regards,
Huang, Ying

>
> put_exit:
> 	mpol_put(mpol);
> 	return ret;
> }
> 	
> static unsigned int shrink_page_list(struct list_head *page_list,..)
> {
> 	...
>
> 	bool skip_demotion = false;
> 	struct rmap_walk_control rwc = {
> 		.arg = &skip_demotion,
> 		.rmap_one = __check_mpol_demotion,
> 	};
>
> 	/* memory policy check */
> 	rmap_walk(folio, &rwc);
> 	if (skip_demotion)
> 		goto keep_locked;
> }
>
> And there seems to be no simple solution for getting the memory
> policy from a page.
>
> Thanks,
> Feng
>
>> >
>> > --
>> > Michal Hocko
>> > SUSE Labs
>> >
>>

Feng Tang Oct. 27, 2022, 7:51 a.m. UTC | #24

On Thu, Oct 27, 2022 at 03:45:12PM +0800, Huang, Ying wrote:
> Feng Tang <feng.tang@intel.com> writes:
> 
> > On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote:
> >> On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> >> > > > This all can get quite expensive so the primary question is, does the
> >> > > > existing behavior generates any real issues or is this more of an
> >> > > > correctness exercise? I mean it certainly is not great to demote to an
> >> > > > incompatible numa node but are there any reasonable configurations when
> >> > > > the demotion target node is explicitly excluded from memory
> >> > > > policy/cpuset?
> >> > >
> >> > > We haven't got customer report on this, but there are quite some customers
> >> > > use cpuset to bind some specific memory nodes to a docker (You've helped
> >> > > us solve a OOM issue in such cases), so I think it's practical to respect
> >> > > the cpuset semantics as much as we can.
> >> >
> >> > Yes, it is definitely better to respect cpusets and all local memory
> >> > policies. There is no dispute there. The thing is whether this is really
> >> > worth it. How often would cpusets (or policies in general) go actively
> >> > against demotion nodes (i.e. exclude those nodes from their allowes node
> >> > mask)?
> >> >
> >> > I can imagine workloads which wouldn't like to get their memory demoted
> >> > for some reason but wouldn't it be more practical to tell that
> >> > explicitly (e.g. via prctl) rather than configuring cpusets/memory
> >> > policies explicitly?
> >> >
> >> > > Your concern about the expensive cost makes sense! Some raw ideas are:
> >> > > * if the shrink_folio_list is called by kswapd, the folios come from
> >> > >   the same per-memcg lruvec, so only one check is enough
> >> > > * if not from kswapd, like called form madvise or DAMON code, we can
> >> > >   save a memcg cache, and if the next folio's memcg is same as the
> >> > >   cache, we reuse its result. And due to the locality, the real
> >> > >   check is rarely performed.
> >> >
> >> > memcg is not the expensive part of the thing. You need to get from page
> >> > -> all vmas::vm_policy -> mm -> task::mempolicy
> >> 
> >> Yeah, on the same page with Michal. Figuring out mempolicy from page
> >> seems quite expensive and the correctness can't be guranteed since the
> >> mempolicy could be set per-thread and the mm->task depends on
> >> CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG.
> >
> > Yes, you are right. Our "working" psudo code for mem policy looks like
> > what Michal mentioned, and it can't work for all cases, but try to
> > enforce it whenever possible:
> >
> > static bool  __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma,
> > 		unsigned long addr, void *arg)
> > {
> > 	bool *skip_demotion = arg;
> > 	struct mempolicy *mpol;
> > 	int nid, dnid;
> > 	bool ret = true;
> >
> > 	mpol = __get_vma_policy(vma, addr);
> > 	if (!mpol) {
> > 		struct task_struct *task;
> 
>                 task = NULL;
> 
> > 		if (vma->vm_mm)
> > 			task = vma->vm_mm->owner;
> >
> > 		if (task) {
> > 			mpol = get_task_policy(task);
> > 			if (mpol)
> > 				mpol_get(mpol);
> > 		}
> > 	}
> >
> > 	if (!mpol)
> > 		return ret;
> >
> > 	if (mpol->mode != MPOL_BIND)
> > 		goto put_exit;
> >
> > 	nid = folio_nid(folio);
> > 	dnid = next_demotion_node(nid);
> > 	if (!node_isset(dnid, mpol->nodes)) {
> > 		*skip_demotion = true;
> > 		ret = false;
> > 	}
> 
> I think that you need to get a node mask instead.  Even if
> !node_isset(dnid, mpol->nodes), you may demote to other node in the node
> mask.

Yes, you are right. This code was written/tested about 2 months ago,
before Aneesh's memory tiering interface patchset. It was listed
to demonstrate idea of solution. 

Thanks,
Feng

> Best Regards,
> Huang, Ying

Michal Hocko Oct. 27, 2022, 8:01 a.m. UTC | #25

On Thu 27-10-22 15:39:00, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Thu 27-10-22 14:47:22, Huang, Ying wrote:
> >> Michal Hocko <mhocko@suse.com> writes:
> > [...]
> >> > I can imagine workloads which wouldn't like to get their memory demoted
> >> > for some reason but wouldn't it be more practical to tell that
> >> > explicitly (e.g. via prctl) rather than configuring cpusets/memory
> >> > policies explicitly?
> >> 
> >> If my understanding were correct, prctl() configures the process or
> >> thread.
> >
> > Not necessarily. There are properties which are per adddress space like
> > PR_[GS]ET_THP_DISABLE. This could be very similar.
> >
> >> How can we get process/thread configuration at demotion time?
> >
> > As already pointed out in previous emails. You could hook into
> > folio_check_references path, more specifically folio_referenced_one
> > where you have all that you need already - all vmas mapping the page and
> > then it is trivial to get the corresponding vm_mm. If at least one of
> > them has the flag set then the demotion is not allowed (essentially the
> > same model as VM_LOCKED).
> 
> Got it!  Thanks for detailed explanation.
> 
> One bit may be not sufficient.  For example, if we want to avoid or
> control cross-socket demotion and still allow demoting to slow memory
> nodes in local socket, we need to specify a node mask to exclude some
> NUMA nodes from demotion targets.

Isn't this something to be configured on the demotion topology side? Or
do you expect there will be per process/address space usecases? I mean
different processes running on the same topology, one requesting local
demotion while other ok with the whole demotion topology?

> >From overhead point of view, this appears similar as that of VMA/task
> memory policy?  We can make mm->owner available for memory tiers
> (CONFIG_NUMA && CONFIG_MIGRATION).  The advantage is that we don't need
> to introduce new ABI.  I guess users may prefer to use `numactl` than a
> new ABI?

mm->owner is a wrong direction. It doesn't have a strong meaning because
there is no one task explicitly responsible for the mm so there is no
real owner (our clone() semantic is just to permissive for that). The
memcg::owner is a crude and ugly hack and it should go away over time
rather than build new uses.

Besides that, and as I have already tried to explain, per task demotion
policy is what makes this whole thing expensive. So this better be a per
mm or per vma property. Whether it is a on/off knob like PR_[GS]ET_THP_DISABLE
or there are explicit requirements for fine grain control on the vma
level I dunno. I haven't seen those usecases yet and it is really easy
to overengineer this.

To be completely honest I would much rather wait for those usecases
before adding a more complex APIs.  PR_[GS]_DEMOTION_DISABLED sounds
like a reasonable first step. Should we have more fine grained
requirements wrt address space I would follow the MADV_{NO}HUGEPAGE
lead.

If we really need/want to give a fine grained control over demotion
nodemask then we would have to go with vma->mempolicy interface. In
any case a per process on/off knob sounds like a reasonable first step
before we learn more about real usecases.

Michal Hocko Oct. 27, 2022, 9:02 a.m. UTC | #26

On Wed 26-10-22 18:05:46, Aneesh Kumar K V wrote:
> On 10/26/22 5:51 PM, Michal Hocko wrote:
> > On Wed 26-10-22 17:38:06, Aneesh Kumar K V wrote:
> >> On 10/26/22 4:32 PM, Michal Hocko wrote:
> >>> On Wed 26-10-22 16:12:25, Aneesh Kumar K V wrote:
> >>>> On 10/26/22 2:49 PM, Michal Hocko wrote:
> >>>>> On Wed 26-10-22 16:00:13, Feng Tang wrote:
> >>>>>> On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote:
> >>>>>>> On 10/26/22 1:13 PM, Feng Tang wrote:
> >>>>>>>> In page reclaim path, memory could be demoted from faster memory tier
> >>>>>>>> to slower memory tier. Currently, there is no check about cpuset's
> >>>>>>>> memory policy, that even if the target demotion node is not allowd
> >>>>>>>> by cpuset, the demotion will still happen, which breaks the cpuset
> >>>>>>>> semantics.
> >>>>>>>>
> >>>>>>>> So add cpuset policy check in the demotion path and skip demotion
> >>>>>>>> if the demotion targets are not allowed by cpuset.
> >>>>>>>>
> >>>>>>>
> >>>>>>> What about the vma policy or the task memory policy? Shouldn't we respect
> >>>>>>> those memory policy restrictions while demoting the page? 
> >>>>>>  
> >>>>>> Good question! We have some basic patches to consider memory policy
> >>>>>> in demotion path too, which are still under test, and will be posted
> >>>>>> soon. And the basic idea is similar to this patch.
> >>>>>
> >>>>> For that you need to consult each vma and it's owning task(s) and that
> >>>>> to me sounds like something to be done in folio_check_references.
> >>>>> Relying on memcg to get a cpuset cgroup is really ugly and not really
> >>>>> 100% correct. Memory controller might be disabled and then you do not
> >>>>> have your association anymore.
> >>>>>
> >>>>
> >>>> I was looking at this recently and I am wondering whether we should worry about VM_SHARE
> >>>> vmas. 
> >>>>
> >>>> ie, page_to_policy() can just reverse lookup just one VMA and fetch the policy right?
> >>>
> >>> How would that help for private mappings shared between parent/child?
> >>
> >>
> >> this is MAP_PRIVATE | MAP_SHARED?
> > 
> 
> Sorry, I meant MAP_ANONYMOUS | MAP_SHARED. 

I am still not sure where you are targeting to be honest. MAP_SHARED or
MAP_PRIVATE both can have page shared between several vmas.

Huang, Ying Oct. 27, 2022, 9:31 a.m. UTC | #27

Michal Hocko <mhocko@suse.com> writes:

> On Thu 27-10-22 15:39:00, Huang, Ying wrote:
>> Michal Hocko <mhocko@suse.com> writes:
>> 
>> > On Thu 27-10-22 14:47:22, Huang, Ying wrote:
>> >> Michal Hocko <mhocko@suse.com> writes:
>> > [...]
>> >> > I can imagine workloads which wouldn't like to get their memory demoted
>> >> > for some reason but wouldn't it be more practical to tell that
>> >> > explicitly (e.g. via prctl) rather than configuring cpusets/memory
>> >> > policies explicitly?
>> >> 
>> >> If my understanding were correct, prctl() configures the process or
>> >> thread.
>> >
>> > Not necessarily. There are properties which are per adddress space like
>> > PR_[GS]ET_THP_DISABLE. This could be very similar.
>> >
>> >> How can we get process/thread configuration at demotion time?
>> >
>> > As already pointed out in previous emails. You could hook into
>> > folio_check_references path, more specifically folio_referenced_one
>> > where you have all that you need already - all vmas mapping the page and
>> > then it is trivial to get the corresponding vm_mm. If at least one of
>> > them has the flag set then the demotion is not allowed (essentially the
>> > same model as VM_LOCKED).
>> 
>> Got it!  Thanks for detailed explanation.
>> 
>> One bit may be not sufficient.  For example, if we want to avoid or
>> control cross-socket demotion and still allow demoting to slow memory
>> nodes in local socket, we need to specify a node mask to exclude some
>> NUMA nodes from demotion targets.
>
> Isn't this something to be configured on the demotion topology side? Or
> do you expect there will be per process/address space usecases? I mean
> different processes running on the same topology, one requesting local
> demotion while other ok with the whole demotion topology?

I think that it's possible for different processes have different
requirements.

- Some processes don't care about where the memory is placed, prefer
  local, then fall back to remote if no free space.

- Some processes want to avoid cross-socket traffic, bind to nodes of
  local socket.

- Some processes want to avoid to use slow memory, bind to fast memory
  node only.

>> >From overhead point of view, this appears similar as that of VMA/task
>> memory policy?  We can make mm->owner available for memory tiers
>> (CONFIG_NUMA && CONFIG_MIGRATION).  The advantage is that we don't need
>> to introduce new ABI.  I guess users may prefer to use `numactl` than a
>> new ABI?
>
> mm->owner is a wrong direction. It doesn't have a strong meaning because
> there is no one task explicitly responsible for the mm so there is no
> real owner (our clone() semantic is just to permissive for that). The
> memcg::owner is a crude and ugly hack and it should go away over time
> rather than build new uses.
>
> Besides that, and as I have already tried to explain, per task demotion
> policy is what makes this whole thing expensive. So this better be a per
> mm or per vma property. Whether it is a on/off knob like PR_[GS]ET_THP_DISABLE
> or there are explicit requirements for fine grain control on the vma
> level I dunno. I haven't seen those usecases yet and it is really easy
> to overengineer this.
>
> To be completely honest I would much rather wait for those usecases
> before adding a more complex APIs.  PR_[GS]_DEMOTION_DISABLED sounds
> like a reasonable first step. Should we have more fine grained
> requirements wrt address space I would follow the MADV_{NO}HUGEPAGE
> lead.
>
> If we really need/want to give a fine grained control over demotion
> nodemask then we would have to go with vma->mempolicy interface. In
> any case a per process on/off knob sounds like a reasonable first step
> before we learn more about real usecases.

Yes.  Per-mm or per-vma property is much better than per-task property.
Another possibility, how about add a new flag to set_mempolicy() system
call to set the per-mm mempolicy?  `numactl` can use that by default.

Best Regards,
Huang, Ying

Aneesh Kumar K.V Oct. 27, 2022, 10:16 a.m. UTC | #28

On 10/27/22 2:32 PM, Michal Hocko wrote:

>>
>> Sorry, I meant MAP_ANONYMOUS | MAP_SHARED. 
> 
> I am still not sure where you are targeting to be honest. MAP_SHARED or
> MAP_PRIVATE both can have page shared between several vmas.

What I was checking was w.r.t demotion and shared pages do we need to
cross-check all the related memory policies? On the page fault side, we don't do that.
ie, if the vma policy or the faulting task policy allowed pages to be allocated
from the demotion node, then we allocate the page even if there is a conflicting
policy from another thread. 

For ex: in the case of MAP_ANON | MAP_PRIVATE cow shared pages if the parent 
did allow allocation from the demotion node we can have pages in the demotion node
even though the child memory policy doesn't have the node in the node mask.

-aneesh

Michal Hocko Oct. 27, 2022, 12:29 p.m. UTC | #29

On Thu 27-10-22 17:31:35, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Thu 27-10-22 15:39:00, Huang, Ying wrote:
> >> Michal Hocko <mhocko@suse.com> writes:
> >> 
> >> > On Thu 27-10-22 14:47:22, Huang, Ying wrote:
> >> >> Michal Hocko <mhocko@suse.com> writes:
> >> > [...]
> >> >> > I can imagine workloads which wouldn't like to get their memory demoted
> >> >> > for some reason but wouldn't it be more practical to tell that
> >> >> > explicitly (e.g. via prctl) rather than configuring cpusets/memory
> >> >> > policies explicitly?
> >> >> 
> >> >> If my understanding were correct, prctl() configures the process or
> >> >> thread.
> >> >
> >> > Not necessarily. There are properties which are per adddress space like
> >> > PR_[GS]ET_THP_DISABLE. This could be very similar.
> >> >
> >> >> How can we get process/thread configuration at demotion time?
> >> >
> >> > As already pointed out in previous emails. You could hook into
> >> > folio_check_references path, more specifically folio_referenced_one
> >> > where you have all that you need already - all vmas mapping the page and
> >> > then it is trivial to get the corresponding vm_mm. If at least one of
> >> > them has the flag set then the demotion is not allowed (essentially the
> >> > same model as VM_LOCKED).
> >> 
> >> Got it!  Thanks for detailed explanation.
> >> 
> >> One bit may be not sufficient.  For example, if we want to avoid or
> >> control cross-socket demotion and still allow demoting to slow memory
> >> nodes in local socket, we need to specify a node mask to exclude some
> >> NUMA nodes from demotion targets.
> >
> > Isn't this something to be configured on the demotion topology side? Or
> > do you expect there will be per process/address space usecases? I mean
> > different processes running on the same topology, one requesting local
> > demotion while other ok with the whole demotion topology?
> 
> I think that it's possible for different processes have different
> requirements.
> 
> - Some processes don't care about where the memory is placed, prefer
>   local, then fall back to remote if no free space.
> 
> - Some processes want to avoid cross-socket traffic, bind to nodes of
>   local socket.
> 
> - Some processes want to avoid to use slow memory, bind to fast memory
>   node only.

Yes, I do understand that. Do you have any specific examples in mind?
[...]
> > If we really need/want to give a fine grained control over demotion
> > nodemask then we would have to go with vma->mempolicy interface. In
> > any case a per process on/off knob sounds like a reasonable first step
> > before we learn more about real usecases.
> 
> Yes.  Per-mm or per-vma property is much better than per-task property.
> Another possibility, how about add a new flag to set_mempolicy() system
> call to set the per-mm mempolicy?  `numactl` can use that by default.

Do you mean a flag to control whether the given policy is applied to a
task or mm?

Michal Hocko Oct. 27, 2022, 1:05 p.m. UTC | #30

On Thu 27-10-22 15:46:07, Aneesh Kumar K V wrote:
> On 10/27/22 2:32 PM, Michal Hocko wrote:
> 
> >>
> >> Sorry, I meant MAP_ANONYMOUS | MAP_SHARED. 
> > 
> > I am still not sure where you are targeting to be honest. MAP_SHARED or
> > MAP_PRIVATE both can have page shared between several vmas.
> 
> 
> What I was checking was w.r.t demotion and shared pages do we need to
> cross-check all the related memory policies? On the page fault side, we don't do that.

Yes, because on the page fault we do have an originator and so we can
apply some reasonable semantic. For the memory reclaim there is no such
originator for a specific page. A completely unrelated context might be
reclaiming a page with some mempolicy constrain and you do not have any
records who has faulted it in. The fact that we have a memory policy
also at the task level makes a completely consistent semantic rather
hard if possible at all (e.g. what if different thread are simply bound
to a different node because shared memory is prefaulted and local thread
mappings will be always local).

I do not think shared mappings are very much special in that regards. It
is our mempolicy API that allows to specify a policy for vmas as well as
tasks and the later makes the semantic for reclaim really awkward.

Yang Shi Oct. 27, 2022, 5:55 p.m. UTC | #31

On Thu, Oct 27, 2022 at 12:12 AM Feng Tang <feng.tang@intel.com> wrote:
>
> On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote:
> > On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > > > This all can get quite expensive so the primary question is, does the
> > > > > existing behavior generates any real issues or is this more of an
> > > > > correctness exercise? I mean it certainly is not great to demote to an
> > > > > incompatible numa node but are there any reasonable configurations when
> > > > > the demotion target node is explicitly excluded from memory
> > > > > policy/cpuset?
> > > >
> > > > We haven't got customer report on this, but there are quite some customers
> > > > use cpuset to bind some specific memory nodes to a docker (You've helped
> > > > us solve a OOM issue in such cases), so I think it's practical to respect
> > > > the cpuset semantics as much as we can.
> > >
> > > Yes, it is definitely better to respect cpusets and all local memory
> > > policies. There is no dispute there. The thing is whether this is really
> > > worth it. How often would cpusets (or policies in general) go actively
> > > against demotion nodes (i.e. exclude those nodes from their allowes node
> > > mask)?
> > >
> > > I can imagine workloads which wouldn't like to get their memory demoted
> > > for some reason but wouldn't it be more practical to tell that
> > > explicitly (e.g. via prctl) rather than configuring cpusets/memory
> > > policies explicitly?
> > >
> > > > Your concern about the expensive cost makes sense! Some raw ideas are:
> > > > * if the shrink_folio_list is called by kswapd, the folios come from
> > > >   the same per-memcg lruvec, so only one check is enough
> > > > * if not from kswapd, like called form madvise or DAMON code, we can
> > > >   save a memcg cache, and if the next folio's memcg is same as the
> > > >   cache, we reuse its result. And due to the locality, the real
> > > >   check is rarely performed.
> > >
> > > memcg is not the expensive part of the thing. You need to get from page
> > > -> all vmas::vm_policy -> mm -> task::mempolicy
> >
> > Yeah, on the same page with Michal. Figuring out mempolicy from page
> > seems quite expensive and the correctness can't be guranteed since the
> > mempolicy could be set per-thread and the mm->task depends on
> > CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG.
>
> Yes, you are right. Our "working" psudo code for mem policy looks like
> what Michal mentioned, and it can't work for all cases, but try to
> enforce it whenever possible:
>
> static bool  __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma,
>                 unsigned long addr, void *arg)
> {
>         bool *skip_demotion = arg;
>         struct mempolicy *mpol;
>         int nid, dnid;
>         bool ret = true;
>
>         mpol = __get_vma_policy(vma, addr);
>         if (!mpol) {
>                 struct task_struct *task;
>                 if (vma->vm_mm)
>                         task = vma->vm_mm->owner;

But this task may not be the task you want IIUC. For example, the
process has two threads, A and B. They have different mempolicy. The
vmscan is trying to demote a page belonging to thread A, but the task
may point to thread B, so you actually get the wrong mempolicy IIUC.

>
>                 if (task) {
>                         mpol = get_task_policy(task);
>                         if (mpol)
>                                 mpol_get(mpol);
>                 }
>         }
>
>         if (!mpol)
>                 return ret;
>
>         if (mpol->mode != MPOL_BIND)
>                 goto put_exit;
>
>         nid = folio_nid(folio);
>         dnid = next_demotion_node(nid);
>         if (!node_isset(dnid, mpol->nodes)) {
>                 *skip_demotion = true;
>                 ret = false;
>         }
>
> put_exit:
>         mpol_put(mpol);
>         return ret;
> }
>
> static unsigned int shrink_page_list(struct list_head *page_list,..)
> {
>         ...
>
>         bool skip_demotion = false;
>         struct rmap_walk_control rwc = {
>                 .arg = &skip_demotion,
>                 .rmap_one = __check_mpol_demotion,
>         };
>
>         /* memory policy check */
>         rmap_walk(folio, &rwc);
>         if (skip_demotion)
>                 goto keep_locked;
> }
>
> And there seems to be no simple solution for getting the memory
> policy from a page.
>
> Thanks,
> Feng
>
> > >
> > > --
> > > Michal Hocko
> > > SUSE Labs
> > >
> >

Huang, Ying Oct. 27, 2022, 11:22 p.m. UTC | #32

Michal Hocko <mhocko@suse.com> writes:

> On Thu 27-10-22 17:31:35, Huang, Ying wrote:
>> Michal Hocko <mhocko@suse.com> writes:
>> 
>> > On Thu 27-10-22 15:39:00, Huang, Ying wrote:
>> >> Michal Hocko <mhocko@suse.com> writes:
>> >> 
>> >> > On Thu 27-10-22 14:47:22, Huang, Ying wrote:
>> >> >> Michal Hocko <mhocko@suse.com> writes:
>> >> > [...]
>> >> >> > I can imagine workloads which wouldn't like to get their memory demoted
>> >> >> > for some reason but wouldn't it be more practical to tell that
>> >> >> > explicitly (e.g. via prctl) rather than configuring cpusets/memory
>> >> >> > policies explicitly?
>> >> >> 
>> >> >> If my understanding were correct, prctl() configures the process or
>> >> >> thread.
>> >> >
>> >> > Not necessarily. There are properties which are per adddress space like
>> >> > PR_[GS]ET_THP_DISABLE. This could be very similar.
>> >> >
>> >> >> How can we get process/thread configuration at demotion time?
>> >> >
>> >> > As already pointed out in previous emails. You could hook into
>> >> > folio_check_references path, more specifically folio_referenced_one
>> >> > where you have all that you need already - all vmas mapping the page and
>> >> > then it is trivial to get the corresponding vm_mm. If at least one of
>> >> > them has the flag set then the demotion is not allowed (essentially the
>> >> > same model as VM_LOCKED).
>> >> 
>> >> Got it!  Thanks for detailed explanation.
>> >> 
>> >> One bit may be not sufficient.  For example, if we want to avoid or
>> >> control cross-socket demotion and still allow demoting to slow memory
>> >> nodes in local socket, we need to specify a node mask to exclude some
>> >> NUMA nodes from demotion targets.
>> >
>> > Isn't this something to be configured on the demotion topology side? Or
>> > do you expect there will be per process/address space usecases? I mean
>> > different processes running on the same topology, one requesting local
>> > demotion while other ok with the whole demotion topology?
>> 
>> I think that it's possible for different processes have different
>> requirements.
>> 
>> - Some processes don't care about where the memory is placed, prefer
>>   local, then fall back to remote if no free space.
>> 
>> - Some processes want to avoid cross-socket traffic, bind to nodes of
>>   local socket.
>> 
>> - Some processes want to avoid to use slow memory, bind to fast memory
>>   node only.
>
> Yes, I do understand that. Do you have any specific examples in mind?
> [...]

Sorry, I don't have specific examples.

>> > If we really need/want to give a fine grained control over demotion
>> > nodemask then we would have to go with vma->mempolicy interface. In
>> > any case a per process on/off knob sounds like a reasonable first step
>> > before we learn more about real usecases.
>> 
>> Yes.  Per-mm or per-vma property is much better than per-task property.
>> Another possibility, how about add a new flag to set_mempolicy() system
>> call to set the per-mm mempolicy?  `numactl` can use that by default.
>
> Do you mean a flag to control whether the given policy is applied to a
> task or mm?

Yes.  That is the idea.

Best Regards,
Huang, Ying

Feng Tang Oct. 28, 2022, 3:37 a.m. UTC | #33

On Thu, Oct 27, 2022 at 10:55:58AM -0700, Yang Shi wrote:
> On Thu, Oct 27, 2022 at 12:12 AM Feng Tang <feng.tang@intel.com> wrote:
> >
> > On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote:
> > > On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> > > > > > This all can get quite expensive so the primary question is, does the
> > > > > > existing behavior generates any real issues or is this more of an
> > > > > > correctness exercise? I mean it certainly is not great to demote to an
> > > > > > incompatible numa node but are there any reasonable configurations when
> > > > > > the demotion target node is explicitly excluded from memory
> > > > > > policy/cpuset?
> > > > >
> > > > > We haven't got customer report on this, but there are quite some customers
> > > > > use cpuset to bind some specific memory nodes to a docker (You've helped
> > > > > us solve a OOM issue in such cases), so I think it's practical to respect
> > > > > the cpuset semantics as much as we can.
> > > >
> > > > Yes, it is definitely better to respect cpusets and all local memory
> > > > policies. There is no dispute there. The thing is whether this is really
> > > > worth it. How often would cpusets (or policies in general) go actively
> > > > against demotion nodes (i.e. exclude those nodes from their allowes node
> > > > mask)?
> > > >
> > > > I can imagine workloads which wouldn't like to get their memory demoted
> > > > for some reason but wouldn't it be more practical to tell that
> > > > explicitly (e.g. via prctl) rather than configuring cpusets/memory
> > > > policies explicitly?
> > > >
> > > > > Your concern about the expensive cost makes sense! Some raw ideas are:
> > > > > * if the shrink_folio_list is called by kswapd, the folios come from
> > > > >   the same per-memcg lruvec, so only one check is enough
> > > > > * if not from kswapd, like called form madvise or DAMON code, we can
> > > > >   save a memcg cache, and if the next folio's memcg is same as the
> > > > >   cache, we reuse its result. And due to the locality, the real
> > > > >   check is rarely performed.
> > > >
> > > > memcg is not the expensive part of the thing. You need to get from page
> > > > -> all vmas::vm_policy -> mm -> task::mempolicy
> > >
> > > Yeah, on the same page with Michal. Figuring out mempolicy from page
> > > seems quite expensive and the correctness can't be guranteed since the
> > > mempolicy could be set per-thread and the mm->task depends on
> > > CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG.
> >
> > Yes, you are right. Our "working" psudo code for mem policy looks like
> > what Michal mentioned, and it can't work for all cases, but try to
> > enforce it whenever possible:
> >
> > static bool  __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma,
> >                 unsigned long addr, void *arg)
> > {
> >         bool *skip_demotion = arg;
> >         struct mempolicy *mpol;
> >         int nid, dnid;
> >         bool ret = true;
> >
> >         mpol = __get_vma_policy(vma, addr);
> >         if (!mpol) {
> >                 struct task_struct *task;
> >                 if (vma->vm_mm)
> >                         task = vma->vm_mm->owner;
> 
> But this task may not be the task you want IIUC. For example, the
> process has two threads, A and B. They have different mempolicy. The
> vmscan is trying to demote a page belonging to thread A, but the task
> may point to thread B, so you actually get the wrong mempolicy IIUC.

Yes, this is a valid concern! We don't have good solution for this.
For memory policy, we may only handle the per-vma policy for now whose
cost is relatively low, as a best-effort try.

Thanks,
Feng

Aneesh Kumar K.V Oct. 28, 2022, 5:09 a.m. UTC | #34

On 10/27/22 11:25 PM, Yang Shi wrote:
> On Thu, Oct 27, 2022 at 12:12 AM Feng Tang <feng.tang@intel.com> wrote:
>>
>> On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote:
>>> On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <mhocko@suse.com> wrote:
>> [...]
>>>>>> This all can get quite expensive so the primary question is, does the
>>>>>> existing behavior generates any real issues or is this more of an
>>>>>> correctness exercise? I mean it certainly is not great to demote to an
>>>>>> incompatible numa node but are there any reasonable configurations when
>>>>>> the demotion target node is explicitly excluded from memory
>>>>>> policy/cpuset?
>>>>>
>>>>> We haven't got customer report on this, but there are quite some customers
>>>>> use cpuset to bind some specific memory nodes to a docker (You've helped
>>>>> us solve a OOM issue in such cases), so I think it's practical to respect
>>>>> the cpuset semantics as much as we can.
>>>>
>>>> Yes, it is definitely better to respect cpusets and all local memory
>>>> policies. There is no dispute there. The thing is whether this is really
>>>> worth it. How often would cpusets (or policies in general) go actively
>>>> against demotion nodes (i.e. exclude those nodes from their allowes node
>>>> mask)?
>>>>
>>>> I can imagine workloads which wouldn't like to get their memory demoted
>>>> for some reason but wouldn't it be more practical to tell that
>>>> explicitly (e.g. via prctl) rather than configuring cpusets/memory
>>>> policies explicitly?
>>>>
>>>>> Your concern about the expensive cost makes sense! Some raw ideas are:
>>>>> * if the shrink_folio_list is called by kswapd, the folios come from
>>>>>   the same per-memcg lruvec, so only one check is enough
>>>>> * if not from kswapd, like called form madvise or DAMON code, we can
>>>>>   save a memcg cache, and if the next folio's memcg is same as the
>>>>>   cache, we reuse its result. And due to the locality, the real
>>>>>   check is rarely performed.
>>>>
>>>> memcg is not the expensive part of the thing. You need to get from page
>>>> -> all vmas::vm_policy -> mm -> task::mempolicy
>>>
>>> Yeah, on the same page with Michal. Figuring out mempolicy from page
>>> seems quite expensive and the correctness can't be guranteed since the
>>> mempolicy could be set per-thread and the mm->task depends on
>>> CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG.
>>
>> Yes, you are right. Our "working" psudo code for mem policy looks like
>> what Michal mentioned, and it can't work for all cases, but try to
>> enforce it whenever possible:
>>
>> static bool  __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma,
>>                 unsigned long addr, void *arg)
>> {
>>         bool *skip_demotion = arg;
>>         struct mempolicy *mpol;
>>         int nid, dnid;
>>         bool ret = true;
>>
>>         mpol = __get_vma_policy(vma, addr);
>>         if (!mpol) {
>>                 struct task_struct *task;
>>                 if (vma->vm_mm)
>>                         task = vma->vm_mm->owner;
> 
> But this task may not be the task you want IIUC. For example, the
> process has two threads, A and B. They have different mempolicy. The
> vmscan is trying to demote a page belonging to thread A, but the task
> may point to thread B, so you actually get the wrong mempolicy IIUC.
> 

But if we swap out this page and fault back in via thread B the page would
get allocated as per thread B mempolicy. So if we demote based on thread B
policy are we breaking anything? 

-aneesh

Huang, Ying Oct. 28, 2022, 5:54 a.m. UTC | #35

Feng Tang <feng.tang@intel.com> writes:

> On Thu, Oct 27, 2022 at 10:55:58AM -0700, Yang Shi wrote:
>> On Thu, Oct 27, 2022 at 12:12 AM Feng Tang <feng.tang@intel.com> wrote:
>> >
>> > On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote:
>> > > On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <mhocko@suse.com> wrote:
>> > [...]
>> > > > > > This all can get quite expensive so the primary question is, does the
>> > > > > > existing behavior generates any real issues or is this more of an
>> > > > > > correctness exercise? I mean it certainly is not great to demote to an
>> > > > > > incompatible numa node but are there any reasonable configurations when
>> > > > > > the demotion target node is explicitly excluded from memory
>> > > > > > policy/cpuset?
>> > > > >
>> > > > > We haven't got customer report on this, but there are quite some customers
>> > > > > use cpuset to bind some specific memory nodes to a docker (You've helped
>> > > > > us solve a OOM issue in such cases), so I think it's practical to respect
>> > > > > the cpuset semantics as much as we can.
>> > > >
>> > > > Yes, it is definitely better to respect cpusets and all local memory
>> > > > policies. There is no dispute there. The thing is whether this is really
>> > > > worth it. How often would cpusets (or policies in general) go actively
>> > > > against demotion nodes (i.e. exclude those nodes from their allowes node
>> > > > mask)?
>> > > >
>> > > > I can imagine workloads which wouldn't like to get their memory demoted
>> > > > for some reason but wouldn't it be more practical to tell that
>> > > > explicitly (e.g. via prctl) rather than configuring cpusets/memory
>> > > > policies explicitly?
>> > > >
>> > > > > Your concern about the expensive cost makes sense! Some raw ideas are:
>> > > > > * if the shrink_folio_list is called by kswapd, the folios come from
>> > > > >   the same per-memcg lruvec, so only one check is enough
>> > > > > * if not from kswapd, like called form madvise or DAMON code, we can
>> > > > >   save a memcg cache, and if the next folio's memcg is same as the
>> > > > >   cache, we reuse its result. And due to the locality, the real
>> > > > >   check is rarely performed.
>> > > >
>> > > > memcg is not the expensive part of the thing. You need to get from page
>> > > > -> all vmas::vm_policy -> mm -> task::mempolicy
>> > >
>> > > Yeah, on the same page with Michal. Figuring out mempolicy from page
>> > > seems quite expensive and the correctness can't be guranteed since the
>> > > mempolicy could be set per-thread and the mm->task depends on
>> > > CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG.
>> >
>> > Yes, you are right. Our "working" psudo code for mem policy looks like
>> > what Michal mentioned, and it can't work for all cases, but try to
>> > enforce it whenever possible:
>> >
>> > static bool  __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma,
>> >                 unsigned long addr, void *arg)
>> > {
>> >         bool *skip_demotion = arg;
>> >         struct mempolicy *mpol;
>> >         int nid, dnid;
>> >         bool ret = true;
>> >
>> >         mpol = __get_vma_policy(vma, addr);
>> >         if (!mpol) {
>> >                 struct task_struct *task;
>> >                 if (vma->vm_mm)
>> >                         task = vma->vm_mm->owner;
>> 
>> But this task may not be the task you want IIUC. For example, the
>> process has two threads, A and B. They have different mempolicy. The
>> vmscan is trying to demote a page belonging to thread A, but the task
>> may point to thread B, so you actually get the wrong mempolicy IIUC.
>
> Yes, this is a valid concern! We don't have good solution for this.
> For memory policy, we may only handle the per-vma policy for now whose
> cost is relatively low, as a best-effort try.

Yes.  The solution isn't perfect, especially for multiple-thread
processes with thread specific memory policy.  But the proposed code
above can support the most common cases at least, that is, run workload
with `numactl`.

Best Regards,
Huang, Ying

Yang Shi Oct. 28, 2022, 5:16 p.m. UTC | #36

On Thu, Oct 27, 2022 at 10:09 PM Aneesh Kumar K V
<aneesh.kumar@linux.ibm.com> wrote:
>
> On 10/27/22 11:25 PM, Yang Shi wrote:
> > On Thu, Oct 27, 2022 at 12:12 AM Feng Tang <feng.tang@intel.com> wrote:
> >>
> >> On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote:
> >>> On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <mhocko@suse.com> wrote:
> >> [...]
> >>>>>> This all can get quite expensive so the primary question is, does the
> >>>>>> existing behavior generates any real issues or is this more of an
> >>>>>> correctness exercise? I mean it certainly is not great to demote to an
> >>>>>> incompatible numa node but are there any reasonable configurations when
> >>>>>> the demotion target node is explicitly excluded from memory
> >>>>>> policy/cpuset?
> >>>>>
> >>>>> We haven't got customer report on this, but there are quite some customers
> >>>>> use cpuset to bind some specific memory nodes to a docker (You've helped
> >>>>> us solve a OOM issue in such cases), so I think it's practical to respect
> >>>>> the cpuset semantics as much as we can.
> >>>>
> >>>> Yes, it is definitely better to respect cpusets and all local memory
> >>>> policies. There is no dispute there. The thing is whether this is really
> >>>> worth it. How often would cpusets (or policies in general) go actively
> >>>> against demotion nodes (i.e. exclude those nodes from their allowes node
> >>>> mask)?
> >>>>
> >>>> I can imagine workloads which wouldn't like to get their memory demoted
> >>>> for some reason but wouldn't it be more practical to tell that
> >>>> explicitly (e.g. via prctl) rather than configuring cpusets/memory
> >>>> policies explicitly?
> >>>>
> >>>>> Your concern about the expensive cost makes sense! Some raw ideas are:
> >>>>> * if the shrink_folio_list is called by kswapd, the folios come from
> >>>>>   the same per-memcg lruvec, so only one check is enough
> >>>>> * if not from kswapd, like called form madvise or DAMON code, we can
> >>>>>   save a memcg cache, and if the next folio's memcg is same as the
> >>>>>   cache, we reuse its result. And due to the locality, the real
> >>>>>   check is rarely performed.
> >>>>
> >>>> memcg is not the expensive part of the thing. You need to get from page
> >>>> -> all vmas::vm_policy -> mm -> task::mempolicy
> >>>
> >>> Yeah, on the same page with Michal. Figuring out mempolicy from page
> >>> seems quite expensive and the correctness can't be guranteed since the
> >>> mempolicy could be set per-thread and the mm->task depends on
> >>> CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG.
> >>
> >> Yes, you are right. Our "working" psudo code for mem policy looks like
> >> what Michal mentioned, and it can't work for all cases, but try to
> >> enforce it whenever possible:
> >>
> >> static bool  __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma,
> >>                 unsigned long addr, void *arg)
> >> {
> >>         bool *skip_demotion = arg;
> >>         struct mempolicy *mpol;
> >>         int nid, dnid;
> >>         bool ret = true;
> >>
> >>         mpol = __get_vma_policy(vma, addr);
> >>         if (!mpol) {
> >>                 struct task_struct *task;
> >>                 if (vma->vm_mm)
> >>                         task = vma->vm_mm->owner;
> >
> > But this task may not be the task you want IIUC. For example, the
> > process has two threads, A and B. They have different mempolicy. The
> > vmscan is trying to demote a page belonging to thread A, but the task
> > may point to thread B, so you actually get the wrong mempolicy IIUC.
> >
>
> But if we swap out this page and fault back in via thread B the page would
> get allocated as per thread B mempolicy. So if we demote based on thread B
> policy are we breaking anything?

If the page is demoted by following thread B's mempolicy, didn't it
already break thread A's mempolicy in the first place if you care
about it? If thread A and thread B have the same mempolicy, then it is
not a problem.

Actually there is another problem for shared page. If a page is shared
by two processes, P1 and P2, when you do rmap walk to find the task,
you may find two contradict mempolicy, what mempolicy would you like
to obey? Do you have to save all the intermediate mempolicy results
somewhere or you just bail out once the first mempolicy is found?

>
> -aneesh
>
>

Yang Shi Oct. 28, 2022, 5:23 p.m. UTC | #37

On Thu, Oct 27, 2022 at 10:55 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Feng Tang <feng.tang@intel.com> writes:
>
> > On Thu, Oct 27, 2022 at 10:55:58AM -0700, Yang Shi wrote:
> >> On Thu, Oct 27, 2022 at 12:12 AM Feng Tang <feng.tang@intel.com> wrote:
> >> >
> >> > On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote:
> >> > > On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <mhocko@suse.com> wrote:
> >> > [...]
> >> > > > > > This all can get quite expensive so the primary question is, does the
> >> > > > > > existing behavior generates any real issues or is this more of an
> >> > > > > > correctness exercise? I mean it certainly is not great to demote to an
> >> > > > > > incompatible numa node but are there any reasonable configurations when
> >> > > > > > the demotion target node is explicitly excluded from memory
> >> > > > > > policy/cpuset?
> >> > > > >
> >> > > > > We haven't got customer report on this, but there are quite some customers
> >> > > > > use cpuset to bind some specific memory nodes to a docker (You've helped
> >> > > > > us solve a OOM issue in such cases), so I think it's practical to respect
> >> > > > > the cpuset semantics as much as we can.
> >> > > >
> >> > > > Yes, it is definitely better to respect cpusets and all local memory
> >> > > > policies. There is no dispute there. The thing is whether this is really
> >> > > > worth it. How often would cpusets (or policies in general) go actively
> >> > > > against demotion nodes (i.e. exclude those nodes from their allowes node
> >> > > > mask)?
> >> > > >
> >> > > > I can imagine workloads which wouldn't like to get their memory demoted
> >> > > > for some reason but wouldn't it be more practical to tell that
> >> > > > explicitly (e.g. via prctl) rather than configuring cpusets/memory
> >> > > > policies explicitly?
> >> > > >
> >> > > > > Your concern about the expensive cost makes sense! Some raw ideas are:
> >> > > > > * if the shrink_folio_list is called by kswapd, the folios come from
> >> > > > >   the same per-memcg lruvec, so only one check is enough
> >> > > > > * if not from kswapd, like called form madvise or DAMON code, we can
> >> > > > >   save a memcg cache, and if the next folio's memcg is same as the
> >> > > > >   cache, we reuse its result. And due to the locality, the real
> >> > > > >   check is rarely performed.
> >> > > >
> >> > > > memcg is not the expensive part of the thing. You need to get from page
> >> > > > -> all vmas::vm_policy -> mm -> task::mempolicy
> >> > >
> >> > > Yeah, on the same page with Michal. Figuring out mempolicy from page
> >> > > seems quite expensive and the correctness can't be guranteed since the
> >> > > mempolicy could be set per-thread and the mm->task depends on
> >> > > CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG.
> >> >
> >> > Yes, you are right. Our "working" psudo code for mem policy looks like
> >> > what Michal mentioned, and it can't work for all cases, but try to
> >> > enforce it whenever possible:
> >> >
> >> > static bool  __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma,
> >> >                 unsigned long addr, void *arg)
> >> > {
> >> >         bool *skip_demotion = arg;
> >> >         struct mempolicy *mpol;
> >> >         int nid, dnid;
> >> >         bool ret = true;
> >> >
> >> >         mpol = __get_vma_policy(vma, addr);
> >> >         if (!mpol) {
> >> >                 struct task_struct *task;
> >> >                 if (vma->vm_mm)
> >> >                         task = vma->vm_mm->owner;
> >>
> >> But this task may not be the task you want IIUC. For example, the
> >> process has two threads, A and B. They have different mempolicy. The
> >> vmscan is trying to demote a page belonging to thread A, but the task
> >> may point to thread B, so you actually get the wrong mempolicy IIUC.
> >
> > Yes, this is a valid concern! We don't have good solution for this.
> > For memory policy, we may only handle the per-vma policy for now whose
> > cost is relatively low, as a best-effort try.
>
> Yes.  The solution isn't perfect, especially for multiple-thread
> processes with thread specific memory policy.  But the proposed code
> above can support the most common cases at least, that is, run workload
> with `numactl`.

Not only multi threads, but also may be broken for shared pages. When
you do rmap walk, you may get multiple contradict mempolicy, which one
would you like to obey?

TBH I'm not sure whether such half-baked solution is worth it or not,
at least at this moment. The cost is not cheap, but the gain may not
be worth it IMHO.

>
> Best Regards,
> Huang, Ying

Huang, Ying Oct. 31, 2022, 1:53 a.m. UTC | #38

Yang Shi <shy828301@gmail.com> writes:

> On Thu, Oct 27, 2022 at 10:09 PM Aneesh Kumar K V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> On 10/27/22 11:25 PM, Yang Shi wrote:
>> > On Thu, Oct 27, 2022 at 12:12 AM Feng Tang <feng.tang@intel.com> wrote:
>> >>
>> >> On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote:
>> >>> On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <mhocko@suse.com> wrote:
>> >> [...]
>> >>>>>> This all can get quite expensive so the primary question is, does the
>> >>>>>> existing behavior generates any real issues or is this more of an
>> >>>>>> correctness exercise? I mean it certainly is not great to demote to an
>> >>>>>> incompatible numa node but are there any reasonable configurations when
>> >>>>>> the demotion target node is explicitly excluded from memory
>> >>>>>> policy/cpuset?
>> >>>>>
>> >>>>> We haven't got customer report on this, but there are quite some customers
>> >>>>> use cpuset to bind some specific memory nodes to a docker (You've helped
>> >>>>> us solve a OOM issue in such cases), so I think it's practical to respect
>> >>>>> the cpuset semantics as much as we can.
>> >>>>
>> >>>> Yes, it is definitely better to respect cpusets and all local memory
>> >>>> policies. There is no dispute there. The thing is whether this is really
>> >>>> worth it. How often would cpusets (or policies in general) go actively
>> >>>> against demotion nodes (i.e. exclude those nodes from their allowes node
>> >>>> mask)?
>> >>>>
>> >>>> I can imagine workloads which wouldn't like to get their memory demoted
>> >>>> for some reason but wouldn't it be more practical to tell that
>> >>>> explicitly (e.g. via prctl) rather than configuring cpusets/memory
>> >>>> policies explicitly?
>> >>>>
>> >>>>> Your concern about the expensive cost makes sense! Some raw ideas are:
>> >>>>> * if the shrink_folio_list is called by kswapd, the folios come from
>> >>>>>   the same per-memcg lruvec, so only one check is enough
>> >>>>> * if not from kswapd, like called form madvise or DAMON code, we can
>> >>>>>   save a memcg cache, and if the next folio's memcg is same as the
>> >>>>>   cache, we reuse its result. And due to the locality, the real
>> >>>>>   check is rarely performed.
>> >>>>
>> >>>> memcg is not the expensive part of the thing. You need to get from page
>> >>>> -> all vmas::vm_policy -> mm -> task::mempolicy
>> >>>
>> >>> Yeah, on the same page with Michal. Figuring out mempolicy from page
>> >>> seems quite expensive and the correctness can't be guranteed since the
>> >>> mempolicy could be set per-thread and the mm->task depends on
>> >>> CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG.
>> >>
>> >> Yes, you are right. Our "working" psudo code for mem policy looks like
>> >> what Michal mentioned, and it can't work for all cases, but try to
>> >> enforce it whenever possible:
>> >>
>> >> static bool  __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma,
>> >>                 unsigned long addr, void *arg)
>> >> {
>> >>         bool *skip_demotion = arg;
>> >>         struct mempolicy *mpol;
>> >>         int nid, dnid;
>> >>         bool ret = true;
>> >>
>> >>         mpol = __get_vma_policy(vma, addr);
>> >>         if (!mpol) {
>> >>                 struct task_struct *task;
>> >>                 if (vma->vm_mm)
>> >>                         task = vma->vm_mm->owner;
>> >
>> > But this task may not be the task you want IIUC. For example, the
>> > process has two threads, A and B. They have different mempolicy. The
>> > vmscan is trying to demote a page belonging to thread A, but the task
>> > may point to thread B, so you actually get the wrong mempolicy IIUC.
>> >
>>
>> But if we swap out this page and fault back in via thread B the page would
>> get allocated as per thread B mempolicy. So if we demote based on thread B
>> policy are we breaking anything?
>
> If the page is demoted by following thread B's mempolicy, didn't it
> already break thread A's mempolicy in the first place if you care
> about it? If thread A and thread B have the same mempolicy, then it is
> not a problem.
>
> Actually there is another problem for shared page. If a page is shared
> by two processes, P1 and P2, when you do rmap walk to find the task,
> you may find two contradict mempolicy, what mempolicy would you like
> to obey? Do you have to save all the intermediate mempolicy results
> somewhere or you just bail out once the first mempolicy is found?

Yes.  There's no perfect solution for this.  I suggest to avoid demoting
if any VMA (or task) prevent it.  Because allowing demoting is the
default policy.  And we will not promote the page back if it becomes hot
later by default because promotion only works for default memory policy
by default.

Best Regards,
Huang, Ying

Huang, Ying Oct. 31, 2022, 1:56 a.m. UTC | #39

Yang Shi <shy828301@gmail.com> writes:

> On Thu, Oct 27, 2022 at 10:55 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Feng Tang <feng.tang@intel.com> writes:
>>
>> > On Thu, Oct 27, 2022 at 10:55:58AM -0700, Yang Shi wrote:
>> >> On Thu, Oct 27, 2022 at 12:12 AM Feng Tang <feng.tang@intel.com> wrote:
>> >> >
>> >> > On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote:
>> >> > > On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <mhocko@suse.com> wrote:
>> >> > [...]
>> >> > > > > > This all can get quite expensive so the primary question is, does the
>> >> > > > > > existing behavior generates any real issues or is this more of an
>> >> > > > > > correctness exercise? I mean it certainly is not great to demote to an
>> >> > > > > > incompatible numa node but are there any reasonable configurations when
>> >> > > > > > the demotion target node is explicitly excluded from memory
>> >> > > > > > policy/cpuset?
>> >> > > > >
>> >> > > > > We haven't got customer report on this, but there are quite some customers
>> >> > > > > use cpuset to bind some specific memory nodes to a docker (You've helped
>> >> > > > > us solve a OOM issue in such cases), so I think it's practical to respect
>> >> > > > > the cpuset semantics as much as we can.
>> >> > > >
>> >> > > > Yes, it is definitely better to respect cpusets and all local memory
>> >> > > > policies. There is no dispute there. The thing is whether this is really
>> >> > > > worth it. How often would cpusets (or policies in general) go actively
>> >> > > > against demotion nodes (i.e. exclude those nodes from their allowes node
>> >> > > > mask)?
>> >> > > >
>> >> > > > I can imagine workloads which wouldn't like to get their memory demoted
>> >> > > > for some reason but wouldn't it be more practical to tell that
>> >> > > > explicitly (e.g. via prctl) rather than configuring cpusets/memory
>> >> > > > policies explicitly?
>> >> > > >
>> >> > > > > Your concern about the expensive cost makes sense! Some raw ideas are:
>> >> > > > > * if the shrink_folio_list is called by kswapd, the folios come from
>> >> > > > >   the same per-memcg lruvec, so only one check is enough
>> >> > > > > * if not from kswapd, like called form madvise or DAMON code, we can
>> >> > > > >   save a memcg cache, and if the next folio's memcg is same as the
>> >> > > > >   cache, we reuse its result. And due to the locality, the real
>> >> > > > >   check is rarely performed.
>> >> > > >
>> >> > > > memcg is not the expensive part of the thing. You need to get from page
>> >> > > > -> all vmas::vm_policy -> mm -> task::mempolicy
>> >> > >
>> >> > > Yeah, on the same page with Michal. Figuring out mempolicy from page
>> >> > > seems quite expensive and the correctness can't be guranteed since the
>> >> > > mempolicy could be set per-thread and the mm->task depends on
>> >> > > CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG.
>> >> >
>> >> > Yes, you are right. Our "working" psudo code for mem policy looks like
>> >> > what Michal mentioned, and it can't work for all cases, but try to
>> >> > enforce it whenever possible:
>> >> >
>> >> > static bool  __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma,
>> >> >                 unsigned long addr, void *arg)
>> >> > {
>> >> >         bool *skip_demotion = arg;
>> >> >         struct mempolicy *mpol;
>> >> >         int nid, dnid;
>> >> >         bool ret = true;
>> >> >
>> >> >         mpol = __get_vma_policy(vma, addr);
>> >> >         if (!mpol) {
>> >> >                 struct task_struct *task;
>> >> >                 if (vma->vm_mm)
>> >> >                         task = vma->vm_mm->owner;
>> >>
>> >> But this task may not be the task you want IIUC. For example, the
>> >> process has two threads, A and B. They have different mempolicy. The
>> >> vmscan is trying to demote a page belonging to thread A, but the task
>> >> may point to thread B, so you actually get the wrong mempolicy IIUC.
>> >
>> > Yes, this is a valid concern! We don't have good solution for this.
>> > For memory policy, we may only handle the per-vma policy for now whose
>> > cost is relatively low, as a best-effort try.
>>
>> Yes.  The solution isn't perfect, especially for multiple-thread
>> processes with thread specific memory policy.  But the proposed code
>> above can support the most common cases at least, that is, run workload
>> with `numactl`.
>
> Not only multi threads, but also may be broken for shared pages. When
> you do rmap walk, you may get multiple contradict mempolicy, which one
> would you like to obey?
>
> TBH I'm not sure whether such half-baked solution is worth it or not,
> at least at this moment. The cost is not cheap, but the gain may not
> be worth it IMHO.

Per my understanding, this can cover most cases.  For example, run
workload with `numactl`, or control the page placement of some memory
areas via mbind().  Although there are some issue in the corner cases.

Best Regards,
Huang, Ying

Feng Tang Oct. 31, 2022, 2:19 a.m. UTC | #40

On Sat, Oct 29, 2022 at 01:23:53AM +0800, Yang Shi wrote:
> On Thu, Oct 27, 2022 at 10:55 PM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Feng Tang <feng.tang@intel.com> writes:
> >
> > > On Thu, Oct 27, 2022 at 10:55:58AM -0700, Yang Shi wrote:
> > >> On Thu, Oct 27, 2022 at 12:12 AM Feng Tang <feng.tang@intel.com> wrote:
> > >> >
> > >> > On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote:
> > >> > > On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > >> > [...]
> > >> > > > > > This all can get quite expensive so the primary question is, does the
> > >> > > > > > existing behavior generates any real issues or is this more of an
> > >> > > > > > correctness exercise? I mean it certainly is not great to demote to an
> > >> > > > > > incompatible numa node but are there any reasonable configurations when
> > >> > > > > > the demotion target node is explicitly excluded from memory
> > >> > > > > > policy/cpuset?
> > >> > > > >
> > >> > > > > We haven't got customer report on this, but there are quite some customers
> > >> > > > > use cpuset to bind some specific memory nodes to a docker (You've helped
> > >> > > > > us solve a OOM issue in such cases), so I think it's practical to respect
> > >> > > > > the cpuset semantics as much as we can.
> > >> > > >
> > >> > > > Yes, it is definitely better to respect cpusets and all local memory
> > >> > > > policies. There is no dispute there. The thing is whether this is really
> > >> > > > worth it. How often would cpusets (or policies in general) go actively
> > >> > > > against demotion nodes (i.e. exclude those nodes from their allowes node
> > >> > > > mask)?
> > >> > > >
> > >> > > > I can imagine workloads which wouldn't like to get their memory demoted
> > >> > > > for some reason but wouldn't it be more practical to tell that
> > >> > > > explicitly (e.g. via prctl) rather than configuring cpusets/memory
> > >> > > > policies explicitly?
> > >> > > >
> > >> > > > > Your concern about the expensive cost makes sense! Some raw ideas are:
> > >> > > > > * if the shrink_folio_list is called by kswapd, the folios come from
> > >> > > > >   the same per-memcg lruvec, so only one check is enough
> > >> > > > > * if not from kswapd, like called form madvise or DAMON code, we can
> > >> > > > >   save a memcg cache, and if the next folio's memcg is same as the
> > >> > > > >   cache, we reuse its result. And due to the locality, the real
> > >> > > > >   check is rarely performed.
> > >> > > >
> > >> > > > memcg is not the expensive part of the thing. You need to get from page
> > >> > > > -> all vmas::vm_policy -> mm -> task::mempolicy
> > >> > >
> > >> > > Yeah, on the same page with Michal. Figuring out mempolicy from page
> > >> > > seems quite expensive and the correctness can't be guranteed since the
> > >> > > mempolicy could be set per-thread and the mm->task depends on
> > >> > > CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG.
> > >> >
> > >> > Yes, you are right. Our "working" psudo code for mem policy looks like
> > >> > what Michal mentioned, and it can't work for all cases, but try to
> > >> > enforce it whenever possible:
> > >> >
> > >> > static bool  __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma,
> > >> >                 unsigned long addr, void *arg)
> > >> > {
> > >> >         bool *skip_demotion = arg;
> > >> >         struct mempolicy *mpol;
> > >> >         int nid, dnid;
> > >> >         bool ret = true;
> > >> >
> > >> >         mpol = __get_vma_policy(vma, addr);
> > >> >         if (!mpol) {
> > >> >                 struct task_struct *task;
> > >> >                 if (vma->vm_mm)
> > >> >                         task = vma->vm_mm->owner;
> > >>
> > >> But this task may not be the task you want IIUC. For example, the
> > >> process has two threads, A and B. They have different mempolicy. The
> > >> vmscan is trying to demote a page belonging to thread A, but the task
> > >> may point to thread B, so you actually get the wrong mempolicy IIUC.
> > >
> > > Yes, this is a valid concern! We don't have good solution for this.
> > > For memory policy, we may only handle the per-vma policy for now whose
> > > cost is relatively low, as a best-effort try.
> >
> > Yes.  The solution isn't perfect, especially for multiple-thread
> > processes with thread specific memory policy.  But the proposed code
> > above can support the most common cases at least, that is, run workload
> > with `numactl`.
> 
> Not only multi threads, but also may be broken for shared pages. When
> you do rmap walk, you may get multiple contradict mempolicy, which one
> would you like to obey?

In our test code, it follows the stricter policy, that if the rmap
walk meets a mempolicy disallowing the demotion, it will stop the walk
and return with 'skip_demotion' flag set.

Thanks,
Feng

Michal Hocko Oct. 31, 2022, 8:40 a.m. UTC | #41

On Fri 28-10-22 07:22:27, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Thu 27-10-22 17:31:35, Huang, Ying wrote:
[...]
> >> I think that it's possible for different processes have different
> >> requirements.
> >> 
> >> - Some processes don't care about where the memory is placed, prefer
> >>   local, then fall back to remote if no free space.
> >> 
> >> - Some processes want to avoid cross-socket traffic, bind to nodes of
> >>   local socket.
> >> 
> >> - Some processes want to avoid to use slow memory, bind to fast memory
> >>   node only.
> >
> > Yes, I do understand that. Do you have any specific examples in mind?
> > [...]
> 
> Sorry, I don't have specific examples.

OK, then let's stop any complicated solution right here then. Let's
start simple with a per-mm flag to disable demotion of an address space.
Should there ever be a real demand for a more fine grained solution
let's go further but I do not think we want a half baked solution
without real usecases.

Huang, Ying Oct. 31, 2022, 8:51 a.m. UTC | #42

Michal Hocko <mhocko@suse.com> writes:

> On Fri 28-10-22 07:22:27, Huang, Ying wrote:
>> Michal Hocko <mhocko@suse.com> writes:
>> 
>> > On Thu 27-10-22 17:31:35, Huang, Ying wrote:
> [...]
>> >> I think that it's possible for different processes have different
>> >> requirements.
>> >> 
>> >> - Some processes don't care about where the memory is placed, prefer
>> >>   local, then fall back to remote if no free space.
>> >> 
>> >> - Some processes want to avoid cross-socket traffic, bind to nodes of
>> >>   local socket.
>> >> 
>> >> - Some processes want to avoid to use slow memory, bind to fast memory
>> >>   node only.
>> >
>> > Yes, I do understand that. Do you have any specific examples in mind?
>> > [...]
>> 
>> Sorry, I don't have specific examples.
>
> OK, then let's stop any complicated solution right here then. Let's
> start simple with a per-mm flag to disable demotion of an address
> space.

I'm not a big fan of per-mm flag.  Because we don't have users for that
too and it needs to add ABI too.

> Should there ever be a real demand for a more fine grained solution
> let's go further but I do not think we want a half baked solution
> without real usecases.

I'm OK to ignore per-task (and missing per-process) memory policy
support for now.

Best Regards,
Huang, Ying

Michal Hocko Oct. 31, 2022, 9:18 a.m. UTC | #43

On Mon 31-10-22 16:51:11, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Fri 28-10-22 07:22:27, Huang, Ying wrote:
> >> Michal Hocko <mhocko@suse.com> writes:
> >> 
> >> > On Thu 27-10-22 17:31:35, Huang, Ying wrote:
> > [...]
> >> >> I think that it's possible for different processes have different
> >> >> requirements.
> >> >> 
> >> >> - Some processes don't care about where the memory is placed, prefer
> >> >>   local, then fall back to remote if no free space.
> >> >> 
> >> >> - Some processes want to avoid cross-socket traffic, bind to nodes of
> >> >>   local socket.
> >> >> 
> >> >> - Some processes want to avoid to use slow memory, bind to fast memory
> >> >>   node only.
> >> >
> >> > Yes, I do understand that. Do you have any specific examples in mind?
> >> > [...]
> >> 
> >> Sorry, I don't have specific examples.
> >
> > OK, then let's stop any complicated solution right here then. Let's
> > start simple with a per-mm flag to disable demotion of an address
> > space.
> 
> I'm not a big fan of per-mm flag.  Because we don't have users for that
> too and it needs to add ABI too.

OK, if there are no users for opt-out then let's jus document the
current limitations and be done with it.

> > Should there ever be a real demand for a more fine grained solution
> > let's go further but I do not think we want a half baked solution
> > without real usecases.
> 
> I'm OK to ignore per-task (and missing per-process) memory policy
> support for now.

I am against such a half baked solution.

Feng Tang Oct. 31, 2022, 2:09 p.m. UTC | #44

On Mon, Oct 31, 2022 at 04:40:15PM +0800, Michal Hocko wrote:
> On Fri 28-10-22 07:22:27, Huang, Ying wrote:
> > Michal Hocko <mhocko@suse.com> writes:
> > 
> > > On Thu 27-10-22 17:31:35, Huang, Ying wrote:
> [...]
> > >> I think that it's possible for different processes have different
> > >> requirements.
> > >> 
> > >> - Some processes don't care about where the memory is placed, prefer
> > >>   local, then fall back to remote if no free space.
> > >> 
> > >> - Some processes want to avoid cross-socket traffic, bind to nodes of
> > >>   local socket.
> > >> 
> > >> - Some processes want to avoid to use slow memory, bind to fast memory
> > >>   node only.
> > >
> > > Yes, I do understand that. Do you have any specific examples in mind?
> > > [...]
> > 
> > Sorry, I don't have specific examples.
> 
> OK, then let's stop any complicated solution right here then. Let's
> start simple with a per-mm flag to disable demotion of an address space.
> Should there ever be a real demand for a more fine grained solution
> let's go further but I do not think we want a half baked solution
> without real usecases.

Yes, the concern about the high cost for mempolicy from you and Yang is
valid. 

How about the cpuset part? We've got bug reports from different channels
about using cpuset+docker to control meomry placement in memory tiering
system, leading to 2 commits solving them:

2685027fca38 ("cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in
cpuset_init_smp()") 
https://lore.kernel.org/all/20220419020958.40419-1-feng.tang@intel.com/

8ca1b5a49885 ("mm/page_alloc: detect allocation forbidden by cpuset and
bail out early")
https://lore.kernel.org/all/1632481657-68112-1-git-send-email-feng.tang@intel.com/

From these bug reports, I think it's reasonable to say there are quite
some real world users using cpuset+docker+memory-tiering-system. So
I plan to refine the original cpuset patch with some optimizations
discussed (like checking once for kswapd based shrink_folio_list()).

Thanks,
Feng

> -- 
> Michal Hocko
> SUSE Labs
>

Michal Hocko Oct. 31, 2022, 2:32 p.m. UTC | #45

On Mon 31-10-22 22:09:15, Feng Tang wrote:
> On Mon, Oct 31, 2022 at 04:40:15PM +0800, Michal Hocko wrote:
> > On Fri 28-10-22 07:22:27, Huang, Ying wrote:
> > > Michal Hocko <mhocko@suse.com> writes:
> > > 
> > > > On Thu 27-10-22 17:31:35, Huang, Ying wrote:
> > [...]
> > > >> I think that it's possible for different processes have different
> > > >> requirements.
> > > >> 
> > > >> - Some processes don't care about where the memory is placed, prefer
> > > >>   local, then fall back to remote if no free space.
> > > >> 
> > > >> - Some processes want to avoid cross-socket traffic, bind to nodes of
> > > >>   local socket.
> > > >> 
> > > >> - Some processes want to avoid to use slow memory, bind to fast memory
> > > >>   node only.
> > > >
> > > > Yes, I do understand that. Do you have any specific examples in mind?
> > > > [...]
> > > 
> > > Sorry, I don't have specific examples.
> > 
> > OK, then let's stop any complicated solution right here then. Let's
> > start simple with a per-mm flag to disable demotion of an address space.
> > Should there ever be a real demand for a more fine grained solution
> > let's go further but I do not think we want a half baked solution
> > without real usecases.
> 
> Yes, the concern about the high cost for mempolicy from you and Yang is
> valid. 
> 
> How about the cpuset part?

Cpusets fall into the same bucket as per task mempolicies wrt costs. Geting a
cpuset requires knowing all tasks associated with a page. Or am I just
missing any magic? And no memcg->cpuset association is not a proper
solution at all.

> We've got bug reports from different channels
> about using cpuset+docker to control meomry placement in memory tiering
> system, leading to 2 commits solving them:
> 
> 2685027fca38 ("cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in
> cpuset_init_smp()") 
> https://lore.kernel.org/all/20220419020958.40419-1-feng.tang@intel.com/
> 
> 8ca1b5a49885 ("mm/page_alloc: detect allocation forbidden by cpuset and
> bail out early")
> https://lore.kernel.org/all/1632481657-68112-1-git-send-email-feng.tang@intel.com/
> 
> >From these bug reports, I think it's reasonable to say there are quite
> some real world users using cpuset+docker+memory-tiering-system.

I don't think anybody is questioning existence of those usecases. The
primary question is whether any of them really require any non-trivial
(read nodemask aware) demotion policies. In other words do we know of
cpuset policy setups where demotion fallbacks are (partially) excluded?

Huang, Ying Nov. 1, 2022, 3:17 a.m. UTC | #46

"Huang, Ying" <ying.huang@intel.com> writes:

> Michal Hocko <mhocko@suse.com> writes:
>
>> On Thu 27-10-22 15:39:00, Huang, Ying wrote:
>>> Michal Hocko <mhocko@suse.com> writes:
>>> 
>>> > On Thu 27-10-22 14:47:22, Huang, Ying wrote:
>>> >> Michal Hocko <mhocko@suse.com> writes:
>>> > [...]
>>> >> > I can imagine workloads which wouldn't like to get their memory demoted
>>> >> > for some reason but wouldn't it be more practical to tell that
>>> >> > explicitly (e.g. via prctl) rather than configuring cpusets/memory
>>> >> > policies explicitly?
>>> >> 
>>> >> If my understanding were correct, prctl() configures the process or
>>> >> thread.
>>> >
>>> > Not necessarily. There are properties which are per adddress space like
>>> > PR_[GS]ET_THP_DISABLE. This could be very similar.
>>> >
>>> >> How can we get process/thread configuration at demotion time?
>>> >
>>> > As already pointed out in previous emails. You could hook into
>>> > folio_check_references path, more specifically folio_referenced_one
>>> > where you have all that you need already - all vmas mapping the page and
>>> > then it is trivial to get the corresponding vm_mm. If at least one of
>>> > them has the flag set then the demotion is not allowed (essentially the
>>> > same model as VM_LOCKED).
>>> 
>>> Got it!  Thanks for detailed explanation.
>>> 
>>> One bit may be not sufficient.  For example, if we want to avoid or
>>> control cross-socket demotion and still allow demoting to slow memory
>>> nodes in local socket, we need to specify a node mask to exclude some
>>> NUMA nodes from demotion targets.
>>
>> Isn't this something to be configured on the demotion topology side? Or
>> do you expect there will be per process/address space usecases? I mean
>> different processes running on the same topology, one requesting local
>> demotion while other ok with the whole demotion topology?
>
> I think that it's possible for different processes have different
> requirements.
>
> - Some processes don't care about where the memory is placed, prefer
>   local, then fall back to remote if no free space.
>
> - Some processes want to avoid cross-socket traffic, bind to nodes of
>   local socket.
>
> - Some processes want to avoid to use slow memory, bind to fast memory
>   node only.

Hi, Johannes, Wei, Jonathan, Yang, Aneesh,

We need your help.  Do you or your organization have requirements to
restrict the page demotion target nodes?  If so, can you share some
details of the requirements?  For example, to avoid cross-socket
traffic, or to avoid using slow memory.  And do you want to restrict
that with cpusets, memory policy, or some other interfaces.

Best Regards,
Huang, Ying

Feng Tang Nov. 7, 2022, 8:05 a.m. UTC | #47

On Mon, Oct 31, 2022 at 03:32:34PM +0100, Michal Hocko wrote:
> > > OK, then let's stop any complicated solution right here then. Let's
> > > start simple with a per-mm flag to disable demotion of an address space.
> > > Should there ever be a real demand for a more fine grained solution
> > > let's go further but I do not think we want a half baked solution
> > > without real usecases.
> > 
> > Yes, the concern about the high cost for mempolicy from you and Yang is
> > valid. 
> > 
> > How about the cpuset part?
> 
> Cpusets fall into the same bucket as per task mempolicies wrt costs. Geting a
> cpuset requires knowing all tasks associated with a page. Or am I just
> missing any magic? And no memcg->cpuset association is not a proper
> solution at all.

No, you are not missing anything. It's really difficult to find a
solution for all holes. And the patch is actually a best-efforts
approach, trying to cover cgroup v2 + memory controller enabled case,
which we think is a common user case for newer platforms with tiering
memory.

> > We've got bug reports from different channels
> > about using cpuset+docker to control meomry placement in memory tiering
> > system, leading to 2 commits solving them:
> > 
> > 2685027fca38 ("cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in
> > cpuset_init_smp()") 
> > https://lore.kernel.org/all/20220419020958.40419-1-feng.tang@intel.com/
> > 
> > 8ca1b5a49885 ("mm/page_alloc: detect allocation forbidden by cpuset and
> > bail out early")
> > https://lore.kernel.org/all/1632481657-68112-1-git-send-email-feng.tang@intel.com/
> > 
> > >From these bug reports, I think it's reasonable to say there are quite
> > some real world users using cpuset+docker+memory-tiering-system.
> 
> I don't think anybody is questioning existence of those usecases. The
> primary question is whether any of them really require any non-trivial
> (read nodemask aware) demotion policies. In other words do we know of
> cpuset policy setups where demotion fallbacks are (partially) excluded?

For cpuset numa memory binding, there are possible usercases:

* User wants cpuset to bind some important containers to faster
  memory tiers for better latency/performance (where simply disabling
  demotion should work, like your per-mm flag solution)

* User wants to bind to a set of physically closer nodes (like faster
  CPU+DRAM node and slower PMEM node). With initial demotion code,
  our HW will have 1:1 demotion/promotion pair for one DRAM node and
  its closer PMEM node, and user's binding can work fine. And there
  are many other types of memory tiering system from other vendors,
  like many CPU-less DRAM nodes in system, and Aneesh's patchset[1]
  created a more general tiering interface, where IIUC each tier has
  a nodemask, and an upper tier can demote to the whole lower tier,
  where the demotion path is N:N mapping. And for this, fine-tuning
  cpuset nodes binding needs this handling.

[1]. https://lore.kernel.org/lkml/20220818131042.113280-1-aneesh.kumar@linux.ibm.com/

Thanks,
Feng

> -- 
> Michal Hocko
> SUSE Labs

Michal Hocko Nov. 7, 2022, 8:17 a.m. UTC | #48

On Mon 07-11-22 16:05:37, Feng Tang wrote:
> On Mon, Oct 31, 2022 at 03:32:34PM +0100, Michal Hocko wrote:
> > > > OK, then let's stop any complicated solution right here then. Let's
> > > > start simple with a per-mm flag to disable demotion of an address space.
> > > > Should there ever be a real demand for a more fine grained solution
> > > > let's go further but I do not think we want a half baked solution
> > > > without real usecases.
> > > 
> > > Yes, the concern about the high cost for mempolicy from you and Yang is
> > > valid. 
> > > 
> > > How about the cpuset part?
> > 
> > Cpusets fall into the same bucket as per task mempolicies wrt costs. Geting a
> > cpuset requires knowing all tasks associated with a page. Or am I just
> > missing any magic? And no memcg->cpuset association is not a proper
> > solution at all.
> 
> No, you are not missing anything. It's really difficult to find a
> solution for all holes. And the patch is actually a best-efforts
> approach, trying to cover cgroup v2 + memory controller enabled case,
> which we think is a common user case for newer platforms with tiering
> memory.

Best effort is OK but it shouldn't create an unexpected behavior and
this approach does that.

I thought I have already explained that. But let me be more
explicit this time.  Please have a look at how controllers can be
enabled/disabled at different levels of the hierarchy. Unless memcg
grows a hard dependency on another controller (as it does with the blk
io controller) then this approach can point to a wrong cpuset. See my
point?

Really, solution for this is not going to be cheap and also I am not
sure all the hessles is really worth it until there is a clear usecase
in sight.

diff mbox series

Patch

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index d58e0476ee8e..6fcce2bd2631 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -178,6 +178,8 @@  static inline void set_mems_allowed(nodemask_t nodemask)
 	task_unlock(current);
 }
 
+extern void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup,
+						nodemask_t *nmask);
 #else /* !CONFIG_CPUSETS */
 
 static inline bool cpusets_enabled(void) { return false; }
@@ -299,6 +301,10 @@  static inline bool read_mems_allowed_retry(unsigned int seq)
 	return false;
 }
 
+static inline void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup,
+						nodemask_t *nmask)
+{
+}
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 3ea2e836e93e..cbb118c0502f 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3750,6 +3750,35 @@  nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
 	return mask;
 }
 
+/*
+ * Retrieve the allowed memory nodemask for a cgroup.
+ *
+ * Set *nmask to cpuset's effective allowed nodemask for cgroup v2,
+ * and NODE_MASK_ALL (means no constraint) for cgroup v1 where there
+ * is no guaranteed association from a cgroup to a cpuset.
+ */
+void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, nodemask_t *nmask)
+{
+	struct cgroup_subsys_state *css;
+	struct cpuset *cs;
+
+	if (!is_in_v2_mode()) {
+		*nmask = NODE_MASK_ALL;
+		return;
+	}
+
+	rcu_read_lock();
+	css = cgroup_e_css(cgroup, &cpuset_cgrp_subsys);
+	if (css) {
+		css_get(css);
+		cs = css_cs(css);
+		*nmask = cs->effective_mems;
+		css_put(css);
+	}
+
+	rcu_read_unlock();
+}
+
 /**
  * cpuset_nodemask_valid_mems_allowed - check nodemask vs. current mems_allowed
  * @nodemask: the nodemask to be checked
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 18f6497994ec..c205d98283bc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1537,9 +1537,21 @@  static struct page *alloc_demote_page(struct page *page, unsigned long private)
 {
 	struct page *target_page;
 	nodemask_t *allowed_mask;
-	struct migration_target_control *mtc;
+	struct migration_target_control *mtc = (void *)private;
 
-	mtc = (struct migration_target_control *)private;
+#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
+	struct mem_cgroup *memcg;
+	nodemask_t cpuset_nmask;
+
+	memcg = page_memcg(page);
+	cpuset_get_allowed_mem_nodes(memcg->css.cgroup, &cpuset_nmask);
+
+	if (!node_isset(mtc->nid, cpuset_nmask)) {
+		if (mtc->nmask)
+			nodes_and(*mtc->nmask, *mtc->nmask, cpuset_nmask);
+		return alloc_migration_target(page, (unsigned long)mtc);
+	}
+#endif
 
 	allowed_mask = mtc->nmask;
 	/*
@@ -1649,6 +1661,7 @@  static unsigned int shrink_folio_list(struct list_head *folio_list,
 		enum folio_references references = FOLIOREF_RECLAIM;
 		bool dirty, writeback;
 		unsigned int nr_pages;
+		bool skip_this_demotion = false;
 
 		cond_resched();
 
@@ -1658,6 +1671,22 @@  static unsigned int shrink_folio_list(struct list_head *folio_list,
 		if (!folio_trylock(folio))
 			goto keep;
 
+#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
+		if (do_demote_pass) {
+			struct mem_cgroup *memcg;
+			nodemask_t nmask, nmask1;
+
+			node_get_allowed_targets(pgdat, &nmask);
+			memcg = folio_memcg(folio);
+			if (memcg)
+				cpuset_get_allowed_mem_nodes(memcg->css.cgroup,
+								&nmask1);
+
+			if (!nodes_intersects(nmask, nmask1))
+				skip_this_demotion = true;
+		}
+#endif
+
 		VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
 
 		nr_pages = folio_nr_pages(folio);
@@ -1799,7 +1828,7 @@  static unsigned int shrink_folio_list(struct list_head *folio_list,
 		 * Before reclaiming the folio, try to relocate
 		 * its contents to another node.
 		 */
-		if (do_demote_pass &&
+		if (do_demote_pass && !skip_this_demotion &&
 		    (thp_migration_supported() || !folio_test_large(folio))) {
 			list_add(&folio->lru, &demote_folios);
 			folio_unlock(folio);