[v3,mm-unstable] mm: Fix memcg reclaim on memory tiered systems

Message ID 20221206023406.3182800-1-almasrymina@google.com
State New
Series [v3,mm-unstable] mm: Fix memcg reclaim on memory tiered systems

Commit Message

Mina Almasry Dec. 6, 2022, 2:34 a.m. UTC
commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
reclaim"") enabled demotion in memcg reclaim, which is the right thing
to do. However, it introduced a regression in the behavior of
try_to_free_mem_cgroup_pages().

The callers of try_to_free_mem_cgroup_pages() expect it to attempt to
reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage
of the cgroup should reduce by nr_pages. The callers expect
try_to_free_mem_cgroup_pages() to also return the number of pages
reclaimed, not demoted.

However, what try_to_free_mem_cgroup_pages() actually does is
unconditionally count demoted pages as reclaimed pages. So in practice
when it is called it will often demote nr_pages and return the number of
demoted pages to the caller. Demoted pages don't lower the memcg usage,
and so try_to_free_mem_cgroup_pages() is not actually doing what the
callers want it to do.

Various things work suboptimally on memory tiered systems or don't work
at all due to this:

- memory.high enforcement likely doesn't work (it just demotes nr_pages
  instead of lowering the memcg usage by nr_pages).
- try_charge_memcg() will keep retrying the charge while
  try_to_free_mem_cgroup_pages() is just demoting pages and not actually
  making any room for the charge.
- memory.reclaim has a wonky interface. It advertises to the user that it
  reclaims the requested amount, but it will often demote that amount
  instead.

There may be more effects to this issue.

To fix these issues I propose that shrink_folio_list() only count pages
demoted from inside sc->nodemask to outside sc->nodemask as
'reclaimed'.

For callers such as reclaim_high() or try_charge_memcg() that set
sc->nodemask to NULL, try_to_free_mem_cgroup_pages() will try to
actually reclaim nr_pages and return the number of pages reclaimed. No
demoted pages would count towards the nr_pages requirement.

For callers such as memory_reclaim() that set sc->nodemask,
try_to_free_mem_cgroup_pages() will free nr_pages from that nodemask
with either reclaim or demotion.
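
Condensed, the new accounting in shrink_folio_list() boils down to the
following (sketch only; see the diff below for the exact change):

    /* Migrate folios selected for demotion */
    nr_demoted_outside_nodemask =
            demote_folio_list(&demote_folios, pgdat, sc->nodemask);

    /* Demotions count as reclaim only when they leave sc->nodemask */
    if (sc->nodemask)
            nr_reclaimed += nr_demoted_outside_nodemask;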

I tested this change using the memory.reclaim interface. I set up a test
case where I allocate 500m in a cgroup, and then do:

    echo "50m" > memory.reclaim

Without this fix, my kernel demotes 70MB and reclaims 4MB
(memory.current is reduced by about 4MB).

With this fix, my kernel demotes all memory possible and reclaims 60MB
(memory.current is reduced by about 60MB).

Signed-off-by: Mina Almasry <almasrymina@google.com>

---

Changes in v3:
- Reverted back to v1 implementation: "try to demote but don't count
  demoted pages unless they are demoted to outside the nodemask" as Ying
  suggested.
- Made sure demotions that fall back to non next_demotion_target() are
  not counted as Wei suggested.
- Updated comment in shrink_folio_list() as Ying suggested.
- Added before/after for the test case in commit message since Ying
  asked.
- Fixed call sites that don't provide sc->nodemask but expect demotion
  from a specific node as Ying pointed out.

Cc: weixugc@google.com
Cc: ying.huang@intel.com

This is developed on top of mm-unstable largely because I want the
solution to be compatible with the recently added nodes= arg on
mm-unstable.
---
 mm/vmscan.c | 91 +++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 78 insertions(+), 13 deletions(-)

--
2.39.0.rc0.267.gcb52ba06e7-goog
  

Comments

Huang, Ying Dec. 6, 2022, 3:13 a.m. UTC | #1
Mina Almasry <almasrymina@google.com> writes:

> commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> reclaim"") enabled demotion in memcg reclaim, which is the right thing
> to do, however, it introduced a regression in the behavior of
> try_to_free_mem_cgroup_pages().
>
> The callers of try_to_free_mem_cgroup_pages() expect it to attempt to
> reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage
> of the cgroup should reduce by nr_pages. The callers expect
> try_to_free_mem_cgroup_pages() to also return the number of pages
> reclaimed, not demoted.
>
> However, what try_to_free_mem_cgroup_pages() actually does is it
> unconditionally counts demoted pages as reclaimed pages. So in practice
> when it is called it will often demote nr_pages and return the number of
> demoted pages to the caller. Demoted pages don't lower the memcg usage,
> and so try_to_free_mem_cgroup_pages() is not actually doing what the
> callers want it to do.
>
> Various things work suboptimally on memory tiered systems or don't work
> at all due to this:
>
> - memory.high enforcement likely doesn't work (it just demotes nr_pages
>   instead of lowering the memcg usage by nr_pages).
> - try_charge_memcg() will keep retrying the charge while
>   try_to_free_mem_cgroup_pages() is just demoting pages and not actually
>   making any room for the charge.
> - memory.reclaim has a wonky interface. It advertises to the user it
>   reclaims the provided amount but it will actually often demote that
>   amount.
>
> There may be more effects to this issue.
>
> To fix these issues I propose shrink_folio_list() to only count pages
> demoted from inside of sc->nodemask to outside of sc->nodemask as
> 'reclaimed'.
>
> For callers such as reclaim_high() or try_charge_memcg() that set
> sc->nodemask to NULL, try_to_free_mem_cgroup_pages() will try to
> actually reclaim nr_pages and return the number of pages reclaimed. No
> demoted pages would count towards the nr_pages requirement.
>
> For callers such as memory_reclaim() that set sc->nodemask,
> try_to_free_mem_cgroup_pages() will free nr_pages from that nodemask
> with either reclaim or demotion.
>
> Tested this change using memory.reclaim interface. I set up a test case where
> I allocate 500m in a cgroup, and then do:
>
>     echo "50m" > memory.reclaim
>
> Without this fix, my kernel demotes 70mb and reclaims 4 mb
> (memory.current is reduced by about 4mb).
>
> With this fix, my kernel demotes all memory possible and reclaims 60mb
> (memory.current is reduced by about 60mb).
>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
>
> ---
>
> Changes in v3:
> - Reverted back to v1 implementation: "try to demote but don't count
>   demoted pages unless they are demoted to outside the nodemask" as Ying
>   suggested.
> - Made sure demotions that fall back to non next_demotion_target() are
>   not counted as Wei suggested.
> - Updated comment in shrink_folio_list() as Ying suggested.
> - Added before/after for the test case in commit message since Ying
>   asked.
> - Fixed call sites that don't provide sc->nodemask but expect demotion
>   from a specific node as Ying pointed out.
>
> Cc: weixugc@google.com
> Cc: ying.huang@intel.com
>
> This is developed on top of mm-unstable largely because I want the
> solution to be compatible with the recently added nodes= arg on
> mm-unstable.
> ---
>  mm/vmscan.c | 91 +++++++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 78 insertions(+), 13 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2b42ac9ad755..f324e80395c3 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1555,13 +1555,18 @@ static void folio_check_dirty_writeback(struct folio *folio,
>  		mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
>  }
>
> +struct demotion_control {
> +	struct migration_target_control *mtc;
> +	nodemask_t *demote_from_nodemask;
> +	unsigned long nr_demoted_outside_nodemask;
> +};
> +
>  static struct page *alloc_demote_page(struct page *page, unsigned long private)
>  {
>  	struct page *target_page;
>  	nodemask_t *allowed_mask;
> -	struct migration_target_control *mtc;
> -
> -	mtc = (struct migration_target_control *)private;
> +	struct demotion_control *dc = (struct demotion_control *)private;
> +	struct migration_target_control *mtc = dc->mtc;
>
>  	allowed_mask = mtc->nmask;
>  	/*
> @@ -1576,13 +1581,31 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private)
>  	mtc->nmask = NULL;
>  	mtc->gfp_mask |= __GFP_THISNODE;
>  	target_page = alloc_migration_target(page, (unsigned long)mtc);
> -	if (target_page)
> +	if (!target_page) {
> +		mtc->gfp_mask &= ~__GFP_THISNODE;
> +		mtc->nmask = allowed_mask;
> +		target_page = alloc_migration_target(page, (unsigned long)mtc);
> +	}
> +
> +	if (!target_page)
>  		return target_page;
>
> -	mtc->gfp_mask &= ~__GFP_THISNODE;
> -	mtc->nmask = allowed_mask;
> +	if (dc->demote_from_nodemask &&
> +	    !node_isset(page_to_nid(target_page), *dc->demote_from_nodemask))

Use mtc->nid directly?

> +		dc->nr_demoted_outside_nodemask++;
>
> -	return alloc_migration_target(page, (unsigned long)mtc);
> +	return target_page;
> +}
> +
> +void free_demote_page(struct page *page, unsigned long private)
> +{
> +	struct demotion_control *dc = (struct demotion_control *)private;
> +
> +	if (dc->demote_from_nodemask &&
> +	    !node_isset(page_to_nid(page), *dc->demote_from_nodemask))

ditto

> +		dc->nr_demoted_outside_nodemask--;
> +
> +	folio_put(page_folio(page));
>  }
>
>  /*
> @@ -1590,7 +1613,8 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private)
>   * Folios which are not demoted are left on @demote_folios.
>   */
>  static unsigned int demote_folio_list(struct list_head *demote_folios,
> -				     struct pglist_data *pgdat)
> +				      struct pglist_data *pgdat,
> +				      nodemask_t *nodemask)
>  {
>  	int target_nid = next_demotion_node(pgdat->node_id);
>  	unsigned int nr_succeeded;
> @@ -1608,6 +1632,12 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
>  		.nmask = &allowed_mask
>  	};
>
> +	struct demotion_control dc = {
> +		.mtc = &mtc,
> +		.demote_from_nodemask = nodemask,
> +		.nr_demoted_outside_nodemask = 0,
> +	};
> +
>  	if (list_empty(demote_folios))
>  		return 0;
>
> @@ -1617,13 +1647,13 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
>  	node_get_allowed_targets(pgdat, &allowed_mask);
>
>  	/* Demotion ignores all cpuset and mempolicy settings */
> -	migrate_pages(demote_folios, alloc_demote_page, NULL,
> -		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
> +	migrate_pages(demote_folios, alloc_demote_page, free_demote_page,
> +		      (unsigned long)&dc, MIGRATE_ASYNC, MR_DEMOTION,
>  		      &nr_succeeded);
>
>  	__count_vm_events(PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);
>
> -	return nr_succeeded;
> +	return dc.nr_demoted_outside_nodemask;
>  }
>
>  static bool may_enter_fs(struct folio *folio, gfp_t gfp_mask)
> @@ -1643,7 +1673,12 @@ static bool may_enter_fs(struct folio *folio, gfp_t gfp_mask)
>  }
>
>  /*
> - * shrink_folio_list() returns the number of reclaimed pages
> + * shrink_folio_list() returns the number of reclaimed pages.
> + *
> + * Demoted pages are counted as reclaimed iff:
> + *   (a) sc->nodemask arg is provided.
> + *   (b) page has been demoted from a node inside sc->nodemask to a node
> + *   outside sc->nodemask.
>   */
>  static unsigned int shrink_folio_list(struct list_head *folio_list,
>  		struct pglist_data *pgdat, struct scan_control *sc,
> @@ -1653,6 +1688,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  	LIST_HEAD(free_folios);
>  	LIST_HEAD(demote_folios);
>  	unsigned int nr_reclaimed = 0;
> +	unsigned int nr_demoted_outside_nodemask = 0;
>  	unsigned int pgactivate = 0;
>  	bool do_demote_pass;
>  	struct swap_iocb *plug = NULL;
> @@ -2085,7 +2121,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  	/* 'folio_list' is always empty here */
>
>  	/* Migrate folios selected for demotion */
> -	nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
> +	nr_demoted_outside_nodemask =
> +		demote_folio_list(&demote_folios, pgdat, sc->nodemask);
> +
> +	if (sc->nodemask)
> +		nr_reclaimed += nr_demoted_outside_nodemask;
> +
>  	/* Folios that could not be demoted are still in @demote_folios */
>  	if (!list_empty(&demote_folios)) {
>  		/* Folios which weren't demoted go back on @folio_list */
> @@ -2130,9 +2171,11 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  unsigned int reclaim_clean_pages_from_list(struct zone *zone,
>  					   struct list_head *folio_list)
>  {
> +	nodemask_t nodemask = NODE_MASK_NONE;

Is it necessary for us to use NODEMASK_ALLOC/NODEMASK_FREE to save stack space?

Best Regards,
Huang, Ying

>  	struct scan_control sc = {
>  		.gfp_mask = GFP_KERNEL,
>  		.may_unmap = 1,
> +		.nodemask = &nodemask
>  	};
>  	struct reclaim_stat stat;
>  	unsigned int nr_reclaimed;
> @@ -2140,6 +2183,12 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
>  	LIST_HEAD(clean_folios);
>  	unsigned int noreclaim_flag;
>
> +	/*
> +	 * Set the nodemask in sc to indicate to shrink_folio_list() that we're
> +	 * looking for reclaim from this node.
> +	 */
> +	node_set(zone->zone_pgdat->node_id, nodemask);
> +
>  	list_for_each_entry_safe(folio, next, folio_list, lru) {
>  		if (!folio_test_hugetlb(folio) && folio_is_file_lru(folio) &&
>  		    !folio_test_dirty(folio) && !__folio_test_movable(folio) &&
> @@ -7031,12 +7080,20 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
>  	unsigned long zone_boosts[MAX_NR_ZONES] = { 0, };
>  	bool boosted;
>  	struct zone *zone;
> +	nodemask_t nodemask = NODE_MASK_NONE;
>  	struct scan_control sc = {
>  		.gfp_mask = GFP_KERNEL,
>  		.order = order,
>  		.may_unmap = 1,
> +		.nodemask = &nodemask,
>  	};
>
> +	/*
> +	 * Set the nodemask in sc to indicate to kswapd_shrink_node() that we're
> +	 * looking for reclaim from this node.
> +	 */
> +	node_set(pgdat->node_id, nodemask);
> +
>  	set_task_reclaim_state(current, &sc.reclaim_state);
>  	psi_memstall_enter(&pflags);
>  	__fs_reclaim_acquire(_THIS_IP_);
> @@ -7642,6 +7699,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
>  	const unsigned long nr_pages = 1 << order;
>  	struct task_struct *p = current;
>  	unsigned int noreclaim_flag;
> +	nodemask_t nodemask = NODE_MASK_NONE;
>  	struct scan_control sc = {
>  		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
>  		.gfp_mask = current_gfp_context(gfp_mask),
> @@ -7651,9 +7709,16 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
>  		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
>  		.may_swap = 1,
>  		.reclaim_idx = gfp_zone(gfp_mask),
> +		.nodemask = &nodemask,
>  	};
>  	unsigned long pflags;
>
> +	/*
> +	 * Set the nodemask in sc to indicate to shrink_node() that we're
> +	 * looking for reclaim from this node.
> +	 */
> +	node_set(pgdat->node_id, nodemask);
> +
>  	trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, order,
>  					   sc.gfp_mask);
>
> --
> 2.39.0.rc0.267.gcb52ba06e7-goog
  
Mina Almasry Dec. 6, 2022, 4:15 a.m. UTC | #2
On Mon, Dec 5, 2022 at 7:14 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Mina Almasry <almasrymina@google.com> writes:
>
> > commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> > reclaim"") enabled demotion in memcg reclaim, which is the right thing
> > to do, however, it introduced a regression in the behavior of
> > try_to_free_mem_cgroup_pages().
> >
> > The callers of try_to_free_mem_cgroup_pages() expect it to attempt to
> > reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage
> > of the cgroup should reduce by nr_pages. The callers expect
> > try_to_free_mem_cgroup_pages() to also return the number of pages
> > reclaimed, not demoted.
> >
> > However, what try_to_free_mem_cgroup_pages() actually does is it
> > unconditionally counts demoted pages as reclaimed pages. So in practice
> > when it is called it will often demote nr_pages and return the number of
> > demoted pages to the caller. Demoted pages don't lower the memcg usage,
> > and so try_to_free_mem_cgroup_pages() is not actually doing what the
> > callers want it to do.
> >
> > Various things work suboptimally on memory tiered systems or don't work
> > at all due to this:
> >
> > - memory.high enforcement likely doesn't work (it just demotes nr_pages
> >   instead of lowering the memcg usage by nr_pages).
> > - try_charge_memcg() will keep retrying the charge while
> >   try_to_free_mem_cgroup_pages() is just demoting pages and not actually
> >   making any room for the charge.
> > - memory.reclaim has a wonky interface. It advertises to the user it
> >   reclaims the provided amount but it will actually often demote that
> >   amount.
> >
> > There may be more effects to this issue.
> >
> > To fix these issues I propose shrink_folio_list() to only count pages
> > demoted from inside of sc->nodemask to outside of sc->nodemask as
> > 'reclaimed'.
> >
> > For callers such as reclaim_high() or try_charge_memcg() that set
> > sc->nodemask to NULL, try_to_free_mem_cgroup_pages() will try to
> > actually reclaim nr_pages and return the number of pages reclaimed. No
> > demoted pages would count towards the nr_pages requirement.
> >
> > For callers such as memory_reclaim() that set sc->nodemask,
> > try_to_free_mem_cgroup_pages() will free nr_pages from that nodemask
> > with either reclaim or demotion.
> >
> > Tested this change using memory.reclaim interface. I set up a test case where
> > I allocate 500m in a cgroup, and then do:
> >
> >     echo "50m" > memory.reclaim
> >
> > Without this fix, my kernel demotes 70mb and reclaims 4 mb
> > (memory.current is reduced by about 4mb).
> >
> > With this fix, my kernel demotes all memory possible and reclaims 60mb
> > (memory.current is reduced by about 60mb).
> >
> > Signed-off-by: Mina Almasry <almasrymina@google.com>
> >
> > ---
> >
> > Changes in v3:
> > - Reverted back to v1 implementation: "try to demote but don't count
> >   demoted pages unless they are demoted to outside the nodemask" as Ying
> >   suggested.
> > - Made sure demotions that fall back to non next_demotion_target() are
> >   not counted as Wei suggested.
> > - Updated comment in shrink_folio_list() as Ying suggested.
> > - Added before/after for the test case in commit message since Ying
> >   asked.
> > - Fixed call sites that don't provide sc->nodemask but expect demotion
> >   from a specific node as Ying pointed out.
> >
> > Cc: weixugc@google.com
> > Cc: ying.huang@intel.com
> >
> > This is developed on top of mm-unstable largely because I want the
> > solution to be compatible with the recently added nodes= arg on
> > mm-unstable.
> > ---
> >  mm/vmscan.c | 91 +++++++++++++++++++++++++++++++++++++++++++++--------
> >  1 file changed, 78 insertions(+), 13 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 2b42ac9ad755..f324e80395c3 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1555,13 +1555,18 @@ static void folio_check_dirty_writeback(struct folio *folio,
> >               mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
> >  }
> >
> > +struct demotion_control {
> > +     struct migration_target_control *mtc;
> > +     nodemask_t *demote_from_nodemask;
> > +     unsigned long nr_demoted_outside_nodemask;
> > +};
> > +
> >  static struct page *alloc_demote_page(struct page *page, unsigned long private)
> >  {
> >       struct page *target_page;
> >       nodemask_t *allowed_mask;
> > -     struct migration_target_control *mtc;
> > -
> > -     mtc = (struct migration_target_control *)private;
> > +     struct demotion_control *dc = (struct demotion_control *)private;
> > +     struct migration_target_control *mtc = dc->mtc;
> >
> >       allowed_mask = mtc->nmask;
> >       /*
> > @@ -1576,13 +1581,31 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private)
> >       mtc->nmask = NULL;
> >       mtc->gfp_mask |= __GFP_THISNODE;
> >       target_page = alloc_migration_target(page, (unsigned long)mtc);
> > -     if (target_page)
> > +     if (!target_page) {
> > +             mtc->gfp_mask &= ~__GFP_THISNODE;
> > +             mtc->nmask = allowed_mask;
> > +             target_page = alloc_migration_target(page, (unsigned long)mtc);
> > +     }
> > +
> > +     if (!target_page)
> >               return target_page;
> >
> > -     mtc->gfp_mask &= ~__GFP_THISNODE;
> > -     mtc->nmask = allowed_mask;
> > +     if (dc->demote_from_nodemask &&
> > +         !node_isset(page_to_nid(target_page), *dc->demote_from_nodemask))
>
> Use mtc->nid directly?
>

mtc->nid is the next_demotion_node(). Wei's earlier comment is that
the page may be allocated anywhere on the get_allowed_targets(), not
necessarily the next_demotion_node(), so I don't think I can use
mtc->nid. I think I have to check on which node the page was allocated
as I'm doing here. Let me know if I missed something.
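
In other words, the check has to look at the node the target page
actually landed on, which is what the patch does:

    if (dc->demote_from_nodemask &&
        !node_isset(page_to_nid(target_page), *dc->demote_from_nodemask))
            dc->nr_demoted_outside_nodemask++;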

> > +             dc->nr_demoted_outside_nodemask++;
> >
> > -     return alloc_migration_target(page, (unsigned long)mtc);
> > +     return target_page;
> > +}
> > +
> > +void free_demote_page(struct page *page, unsigned long private)
> > +{
> > +     struct demotion_control *dc = (struct demotion_control *)private;
> > +
> > +     if (dc->demote_from_nodemask &&
> > +         !node_isset(page_to_nid(page), *dc->demote_from_nodemask))
>
> ditto
>
> > +             dc->nr_demoted_outside_nodemask--;
> > +
> > +     folio_put(page_folio(page));
> >  }
> >
> >  /*
> > @@ -1590,7 +1613,8 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private)
> >   * Folios which are not demoted are left on @demote_folios.
> >   */
> >  static unsigned int demote_folio_list(struct list_head *demote_folios,
> > -                                  struct pglist_data *pgdat)
> > +                                   struct pglist_data *pgdat,
> > +                                   nodemask_t *nodemask)
> >  {
> >       int target_nid = next_demotion_node(pgdat->node_id);
> >       unsigned int nr_succeeded;
> > @@ -1608,6 +1632,12 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
> >               .nmask = &allowed_mask
> >       };
> >
> > +     struct demotion_control dc = {
> > +             .mtc = &mtc,
> > +             .demote_from_nodemask = nodemask,
> > +             .nr_demoted_outside_nodemask = 0,
> > +     };
> > +
> >       if (list_empty(demote_folios))
> >               return 0;
> >
> > @@ -1617,13 +1647,13 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
> >       node_get_allowed_targets(pgdat, &allowed_mask);
> >
> >       /* Demotion ignores all cpuset and mempolicy settings */
> > -     migrate_pages(demote_folios, alloc_demote_page, NULL,
> > -                   (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
> > +     migrate_pages(demote_folios, alloc_demote_page, free_demote_page,
> > +                   (unsigned long)&dc, MIGRATE_ASYNC, MR_DEMOTION,
> >                     &nr_succeeded);
> >
> >       __count_vm_events(PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);
> >
> > -     return nr_succeeded;
> > +     return dc.nr_demoted_outside_nodemask;
> >  }
> >
> >  static bool may_enter_fs(struct folio *folio, gfp_t gfp_mask)
> > @@ -1643,7 +1673,12 @@ static bool may_enter_fs(struct folio *folio, gfp_t gfp_mask)
> >  }
> >
> >  /*
> > - * shrink_folio_list() returns the number of reclaimed pages
> > + * shrink_folio_list() returns the number of reclaimed pages.
> > + *
> > + * Demoted pages are counted as reclaimed iff:
> > + *   (a) sc->nodemask arg is provided.
> > + *   (b) page has been demoted from a node inside sc->nodemask to a node
> > + *   outside sc->nodemask.
> >   */
> >  static unsigned int shrink_folio_list(struct list_head *folio_list,
> >               struct pglist_data *pgdat, struct scan_control *sc,
> > @@ -1653,6 +1688,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >       LIST_HEAD(free_folios);
> >       LIST_HEAD(demote_folios);
> >       unsigned int nr_reclaimed = 0;
> > +     unsigned int nr_demoted_outside_nodemask = 0;
> >       unsigned int pgactivate = 0;
> >       bool do_demote_pass;
> >       struct swap_iocb *plug = NULL;
> > @@ -2085,7 +2121,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >       /* 'folio_list' is always empty here */
> >
> >       /* Migrate folios selected for demotion */
> > -     nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
> > +     nr_demoted_outside_nodemask =
> > +             demote_folio_list(&demote_folios, pgdat, sc->nodemask);
> > +
> > +     if (sc->nodemask)
> > +             nr_reclaimed += nr_demoted_outside_nodemask;
> > +
> >       /* Folios that could not be demoted are still in @demote_folios */
> >       if (!list_empty(&demote_folios)) {
> >               /* Folios which weren't demoted go back on @folio_list */
> > @@ -2130,9 +2171,11 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >  unsigned int reclaim_clean_pages_from_list(struct zone *zone,
> >                                          struct list_head *folio_list)
> >  {
> > +     nodemask_t nodemask = NODE_MASK_NONE;
>
> Is it necessary for us to use NODEMASK_ALLOC/NODEMASK_FREE to save stack space?
>

I can certainly use NODEMASK_ALLOC/NODEMASK_FREE if you'd like. I
think there are a few places that stack allocate nodemask_t already,
including one place I recently added in memory_reclaim(), so it
doesn't seem _necessary_ per se.

If you're asking my opinion, AFAICT it's not an issue. I think you
need > 32 numa nodes before nodemask_t becomes an array of size 2
longs on a 32-bit machine, and even then I think it's not a huge deal.
Up to you; I have no issue with converting to
NODEMASK_ALLOC/NODEMASK_FREE in v4.
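
If we do go that way, a minimal sketch of the conversion in
reclaim_clean_pages_from_list() might look like this (untested; assumes
the NODEMASK_ALLOC()/NODEMASK_FREE() helpers from <linux/nodemask.h>):

    NODEMASK_ALLOC(nodemask_t, nodemask, GFP_KERNEL);
    struct scan_control sc = {
            .gfp_mask = GFP_KERNEL,
            .may_unmap = 1,
            .nodemask = nodemask,
    };

    if (!nodemask)
            return 0;       /* kmalloc failure on large NODES_SHIFT configs */

    nodes_clear(*nodemask);
    node_set(zone->zone_pgdat->node_id, *nodemask);

    /* ... existing clean-folio reclaim ... */

    NODEMASK_FREE(nodemask);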

> Best Regards,
> Huang, Ying
>
> >       struct scan_control sc = {
> >               .gfp_mask = GFP_KERNEL,
> >               .may_unmap = 1,
> > +             .nodemask = &nodemask
> >       };
> >       struct reclaim_stat stat;
> >       unsigned int nr_reclaimed;
> > @@ -2140,6 +2183,12 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
> >       LIST_HEAD(clean_folios);
> >       unsigned int noreclaim_flag;
> >
> > +     /*
> > +      * Set the nodemask in sc to indicate to shrink_folio_list() that we're
> > +      * looking for reclaim from this node.
> > +      */
> > +     node_set(zone->zone_pgdat->node_id, nodemask);
> > +
> >       list_for_each_entry_safe(folio, next, folio_list, lru) {
> >               if (!folio_test_hugetlb(folio) && folio_is_file_lru(folio) &&
> >                   !folio_test_dirty(folio) && !__folio_test_movable(folio) &&
> > @@ -7031,12 +7080,20 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
> >       unsigned long zone_boosts[MAX_NR_ZONES] = { 0, };
> >       bool boosted;
> >       struct zone *zone;
> > +     nodemask_t nodemask = NODE_MASK_NONE;
> >       struct scan_control sc = {
> >               .gfp_mask = GFP_KERNEL,
> >               .order = order,
> >               .may_unmap = 1,
> > +             .nodemask = &nodemask,
> >       };
> >
> > +     /*
> > +      * Set the nodemask in sc to indicate to kswapd_shrink_node() that we're
> > +      * looking for reclaim from this node.
> > +      */
> > +     node_set(pgdat->node_id, nodemask);
> > +
> >       set_task_reclaim_state(current, &sc.reclaim_state);
> >       psi_memstall_enter(&pflags);
> >       __fs_reclaim_acquire(_THIS_IP_);
> > @@ -7642,6 +7699,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
> >       const unsigned long nr_pages = 1 << order;
> >       struct task_struct *p = current;
> >       unsigned int noreclaim_flag;
> > +     nodemask_t nodemask = NODE_MASK_NONE;
> >       struct scan_control sc = {
> >               .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> >               .gfp_mask = current_gfp_context(gfp_mask),
> > @@ -7651,9 +7709,16 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
> >               .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
> >               .may_swap = 1,
> >               .reclaim_idx = gfp_zone(gfp_mask),
> > +             .nodemask = &nodemask,
> >       };
> >       unsigned long pflags;
> >
> > +     /*
> > +      * Set the nodemask in sc to indicate to shrink_node() that we're
> > +      * looking for reclaim from this node.
> > +      */
> > +     node_set(pgdat->node_id, nodemask);
> > +
> >       trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, order,
> >                                          sc.gfp_mask);
> >
> > --
> > 2.39.0.rc0.267.gcb52ba06e7-goog
>
  
Huang, Ying Dec. 6, 2022, 5:22 a.m. UTC | #3
Mina Almasry <almasrymina@google.com> writes:

> On Mon, Dec 5, 2022 at 7:14 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Mina Almasry <almasrymina@google.com> writes:
>>
>> > commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
>> > reclaim"") enabled demotion in memcg reclaim, which is the right thing
>> > to do, however, it introduced a regression in the behavior of
>> > try_to_free_mem_cgroup_pages().
>> >
>> > The callers of try_to_free_mem_cgroup_pages() expect it to attempt to
>> > reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage
>> > of the cgroup should reduce by nr_pages. The callers expect
>> > try_to_free_mem_cgroup_pages() to also return the number of pages
>> > reclaimed, not demoted.
>> >
>> > However, what try_to_free_mem_cgroup_pages() actually does is it
>> > unconditionally counts demoted pages as reclaimed pages. So in practice
>> > when it is called it will often demote nr_pages and return the number of
>> > demoted pages to the caller. Demoted pages don't lower the memcg usage,
>> > and so try_to_free_mem_cgroup_pages() is not actually doing what the
>> > callers want it to do.
>> >
>> > Various things work suboptimally on memory tiered systems or don't work
>> > at all due to this:
>> >
>> > - memory.high enforcement likely doesn't work (it just demotes nr_pages
>> >   instead of lowering the memcg usage by nr_pages).
>> > - try_charge_memcg() will keep retrying the charge while
>> >   try_to_free_mem_cgroup_pages() is just demoting pages and not actually
>> >   making any room for the charge.
>> > - memory.reclaim has a wonky interface. It advertises to the user it
>> >   reclaims the provided amount but it will actually often demote that
>> >   amount.
>> >
>> > There may be more effects to this issue.
>> >
>> > To fix these issues I propose shrink_folio_list() to only count pages
>> > demoted from inside of sc->nodemask to outside of sc->nodemask as
>> > 'reclaimed'.
>> >
>> > For callers such as reclaim_high() or try_charge_memcg() that set
>> > sc->nodemask to NULL, try_to_free_mem_cgroup_pages() will try to
>> > actually reclaim nr_pages and return the number of pages reclaimed. No
>> > demoted pages would count towards the nr_pages requirement.
>> >
>> > For callers such as memory_reclaim() that set sc->nodemask,
>> > try_to_free_mem_cgroup_pages() will free nr_pages from that nodemask
>> > with either reclaim or demotion.
>> >
>> > Tested this change using memory.reclaim interface. I set up a test case where
>> > I allocate 500m in a cgroup, and then do:
>> >
>> >     echo "50m" > memory.reclaim
>> >
>> > Without this fix, my kernel demotes 70mb and reclaims 4 mb
>> > (memory.current is reduced by about 4mb).
>> >
>> > With this fix, my kernel demotes all memory possible and reclaims 60mb
>> > (memory.current is reduced by about 60mb).
>> >
>> > Signed-off-by: Mina Almasry <almasrymina@google.com>
>> >
>> > ---
>> >
>> > Changes in v3:
>> > - Reverted back to v1 implementation: "try to demote but don't count
>> >   demoted pages unless they are demoted to outside the nodemask" as Ying
>> >   suggested.
>> > - Made sure demotions that fall back to non next_demotion_target() are
>> >   not counted as Wei suggested.
>> > - Updated comment in shrink_folio_list() as Ying suggested.
>> > - Added before/after for the test case in commit message since Ying
>> >   asked.
>> > - Fixed call sites that don't provide sc->nodemask but expect demotion
>> >   from a specific node as Ying pointed out.
>> >
>> > Cc: weixugc@google.com
>> > Cc: ying.huang@intel.com
>> >
>> > This is developed on top of mm-unstable largely because I want the
>> > solution to be compatible with the recently added nodes= arg on
>> > mm-unstable.
>> > ---
>> >  mm/vmscan.c | 91 +++++++++++++++++++++++++++++++++++++++++++++--------
>> >  1 file changed, 78 insertions(+), 13 deletions(-)
>> >
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index 2b42ac9ad755..f324e80395c3 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -1555,13 +1555,18 @@ static void folio_check_dirty_writeback(struct folio *folio,
>> >               mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
>> >  }
>> >
>> > +struct demotion_control {
>> > +     struct migration_target_control *mtc;
>> > +     nodemask_t *demote_from_nodemask;
>> > +     unsigned long nr_demoted_outside_nodemask;
>> > +};
>> > +
>> >  static struct page *alloc_demote_page(struct page *page, unsigned long private)
>> >  {
>> >       struct page *target_page;
>> >       nodemask_t *allowed_mask;
>> > -     struct migration_target_control *mtc;
>> > -
>> > -     mtc = (struct migration_target_control *)private;
>> > +     struct demotion_control *dc = (struct demotion_control *)private;
>> > +     struct migration_target_control *mtc = dc->mtc;
>> >
>> >       allowed_mask = mtc->nmask;
>> >       /*
>> > @@ -1576,13 +1581,31 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private)
>> >       mtc->nmask = NULL;
>> >       mtc->gfp_mask |= __GFP_THISNODE;
>> >       target_page = alloc_migration_target(page, (unsigned long)mtc);
>> > -     if (target_page)
>> > +     if (!target_page) {
>> > +             mtc->gfp_mask &= ~__GFP_THISNODE;
>> > +             mtc->nmask = allowed_mask;
>> > +             target_page = alloc_migration_target(page, (unsigned long)mtc);
>> > +     }
>> > +
>> > +     if (!target_page)
>> >               return target_page;
>> >
>> > -     mtc->gfp_mask &= ~__GFP_THISNODE;
>> > -     mtc->nmask = allowed_mask;
>> > +     if (dc->demote_from_nodemask &&
>> > +         !node_isset(page_to_nid(target_page), *dc->demote_from_nodemask))
>>
>> Use mtc->nid directly?
>>
>
> mtc->nid is the next_demotion_node(). Wei's earlier comment is that
> the page may be allocated anywhere on the get_allowed_targets(), not
> necessarily the next_demotion_node(), so I don't think I can use
> mtc->nid. I think I have to check on which node the page was allocated
> as I'm doing here. Let me know if I missed something.

Yes.  You are right.

>> > +             dc->nr_demoted_outside_nodemask++;
>> >
>> > -     return alloc_migration_target(page, (unsigned long)mtc);
>> > +     return target_page;
>> > +}
>> > +
>> > +void free_demote_page(struct page *page, unsigned long private)
>> > +{
>> > +     struct demotion_control *dc = (struct demotion_control *)private;
>> > +
>> > +     if (dc->demote_from_nodemask &&
>> > +         !node_isset(page_to_nid(page), *dc->demote_from_nodemask))
>>
>> ditto
>>
>> > +             dc->nr_demoted_outside_nodemask--;
>> > +
>> > +     folio_put(page_folio(page));
>> >  }
>> >
>> >  /*
>> > @@ -1590,7 +1613,8 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private)
>> >   * Folios which are not demoted are left on @demote_folios.
>> >   */
>> >  static unsigned int demote_folio_list(struct list_head *demote_folios,
>> > -                                  struct pglist_data *pgdat)
>> > +                                   struct pglist_data *pgdat,
>> > +                                   nodemask_t *nodemask)
>> >  {
>> >       int target_nid = next_demotion_node(pgdat->node_id);
>> >       unsigned int nr_succeeded;
>> > @@ -1608,6 +1632,12 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
>> >               .nmask = &allowed_mask
>> >       };
>> >
>> > +     struct demotion_control dc = {
>> > +             .mtc = &mtc,
>> > +             .demote_from_nodemask = nodemask,
>> > +             .nr_demoted_outside_nodemask = 0,
>> > +     };
>> > +
>> >       if (list_empty(demote_folios))
>> >               return 0;
>> >
>> > @@ -1617,13 +1647,13 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
>> >       node_get_allowed_targets(pgdat, &allowed_mask);
>> >
>> >       /* Demotion ignores all cpuset and mempolicy settings */
>> > -     migrate_pages(demote_folios, alloc_demote_page, NULL,
>> > -                   (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
>> > +     migrate_pages(demote_folios, alloc_demote_page, free_demote_page,
>> > +                   (unsigned long)&dc, MIGRATE_ASYNC, MR_DEMOTION,
>> >                     &nr_succeeded);
>> >
>> >       __count_vm_events(PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);
>> >
>> > -     return nr_succeeded;
>> > +     return dc.nr_demoted_outside_nodemask;
>> >  }
>> >
>> >  static bool may_enter_fs(struct folio *folio, gfp_t gfp_mask)
>> > @@ -1643,7 +1673,12 @@ static bool may_enter_fs(struct folio *folio, gfp_t gfp_mask)
>> >  }
>> >
>> >  /*
>> > - * shrink_folio_list() returns the number of reclaimed pages
>> > + * shrink_folio_list() returns the number of reclaimed pages.
>> > + *
>> > + * Demoted pages are counted as reclaimed iff:
>> > + *   (a) sc->nodemask arg is provided.
>> > + *   (b) page has been demoted from a node inside sc->nodemask to a node
>> > + *   outside sc->nodemask.
>> >   */
>> >  static unsigned int shrink_folio_list(struct list_head *folio_list,
>> >               struct pglist_data *pgdat, struct scan_control *sc,
>> > @@ -1653,6 +1688,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>> >       LIST_HEAD(free_folios);
>> >       LIST_HEAD(demote_folios);
>> >       unsigned int nr_reclaimed = 0;
>> > +     unsigned int nr_demoted_outside_nodemask = 0;
>> >       unsigned int pgactivate = 0;
>> >       bool do_demote_pass;
>> >       struct swap_iocb *plug = NULL;
>> > @@ -2085,7 +2121,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>> >       /* 'folio_list' is always empty here */
>> >
>> >       /* Migrate folios selected for demotion */
>> > -     nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
>> > +     nr_demoted_outside_nodemask =
>> > +             demote_folio_list(&demote_folios, pgdat, sc->nodemask);
>> > +
>> > +     if (sc->nodemask)
>> > +             nr_reclaimed += nr_demoted_outside_nodemask;
>> > +
>> >       /* Folios that could not be demoted are still in @demote_folios */
>> >       if (!list_empty(&demote_folios)) {
>> >               /* Folios which weren't demoted go back on @folio_list */
>> > @@ -2130,9 +2171,11 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>> >  unsigned int reclaim_clean_pages_from_list(struct zone *zone,
>> >                                          struct list_head *folio_list)
>> >  {
>> > +     nodemask_t nodemask = NODE_MASK_NONE;
>>
>> Is it necessary for us to use NODEMASK_ALLOC/NODEMASK_FREE to save stack space?
>>
>
> I can certainly use NODEMASK_ALLOC/NODEMASK_FREE if you'd like. I
> think there are a few places that stack allocate nodemask_t already,
> including one place I recently added in memory_reclaim(), so it
> doesn't seem _necessary_ per se.
>
> If you're asking my opinion, AFAICT it's not an issue. I think you
> need > 32 numa nodes before nodemask_t becomes an array of size 2
> longs on a 32-bit machine, and even then I think it's not a huge deal.
> Up to you; I have no issue with converting to
> NODEMASK_ALLOC/NODEMASK_FREE in v4.

Both are OK for me.  Want to know David Rientjes's idea here.

Best Regards,
Huang, Ying

>>
>> >       struct scan_control sc = {
>> >               .gfp_mask = GFP_KERNEL,
>> >               .may_unmap = 1,
>> > +             .nodemask = &nodemask
>> >       };
>> >       struct reclaim_stat stat;
>> >       unsigned int nr_reclaimed;
>> > @@ -2140,6 +2183,12 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
>> >       LIST_HEAD(clean_folios);
>> >       unsigned int noreclaim_flag;
>> >
>> > +     /*
>> > +      * Set the nodemask in sc to indicate to shrink_folio_list() that we're
>> > +      * looking for reclaim from this node.
>> > +      */
>> > +     node_set(zone->zone_pgdat->node_id, nodemask);
>> > +
>> >       list_for_each_entry_safe(folio, next, folio_list, lru) {
>> >               if (!folio_test_hugetlb(folio) && folio_is_file_lru(folio) &&
>> >                   !folio_test_dirty(folio) && !__folio_test_movable(folio) &&
>> > @@ -7031,12 +7080,20 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
>> >       unsigned long zone_boosts[MAX_NR_ZONES] = { 0, };
>> >       bool boosted;
>> >       struct zone *zone;
>> > +     nodemask_t nodemask = NODE_MASK_NONE;
>> >       struct scan_control sc = {
>> >               .gfp_mask = GFP_KERNEL,
>> >               .order = order,
>> >               .may_unmap = 1,
>> > +             .nodemask = &nodemask,
>> >       };
>> >
>> > +     /*
>> > +      * Set the nodemask in sc to indicate to kswapd_shrink_node() that we're
>> > +      * looking for reclaim from this node.
>> > +      */
>> > +     node_set(pgdat->node_id, nodemask);
>> > +
>> >       set_task_reclaim_state(current, &sc.reclaim_state);
>> >       psi_memstall_enter(&pflags);
>> >       __fs_reclaim_acquire(_THIS_IP_);
>> > @@ -7642,6 +7699,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
>> >       const unsigned long nr_pages = 1 << order;
>> >       struct task_struct *p = current;
>> >       unsigned int noreclaim_flag;
>> > +     nodemask_t nodemask = NODE_MASK_NONE;
>> >       struct scan_control sc = {
>> >               .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
>> >               .gfp_mask = current_gfp_context(gfp_mask),
>> > @@ -7651,9 +7709,16 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
>> >               .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
>> >               .may_swap = 1,
>> >               .reclaim_idx = gfp_zone(gfp_mask),
>> > +             .nodemask = &nodemask,
>> >       };
>> >       unsigned long pflags;
>> >
>> > +     /*
>> > +      * Set the nodemask in sc to indicate to shrink_node() that we're
>> > +      * looking for reclaim from this node.
>> > +      */
>> > +     node_set(pgdat->node_id, nodemask);
>> > +
>> >       trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, order,
>> >                                          sc.gfp_mask);
>> >
>> > --
>> > 2.39.0.rc0.267.gcb52ba06e7-goog
>>
  
Michal Hocko Dec. 6, 2022, 12:20 p.m. UTC | #4
On Mon 05-12-22 18:34:05, Mina Almasry wrote:
> commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> reclaim"") enabled demotion in memcg reclaim, which is the right thing
> to do, however, it introduced a regression in the behavior of
> try_to_free_mem_cgroup_pages().
> 
> The callers of try_to_free_mem_cgroup_pages() expect it to attempt to
> reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage
> of the cgroup should reduce by nr_pages. The callers expect
> try_to_free_mem_cgroup_pages() to also return the number of pages
> reclaimed, not demoted.
> 
> However, what try_to_free_mem_cgroup_pages() actually does is it
> unconditionally counts demoted pages as reclaimed pages. So in practice
> when it is called it will often demote nr_pages and return the number of
> demoted pages to the caller. Demoted pages don't lower the memcg usage,
> and so try_to_free_mem_cgroup_pages() is not actually doing what the
> callers want it to do.
> 
> Various things work suboptimally on memory tiered systems or don't work
> at all due to this:
> 
> - memory.high enforcement likely doesn't work (it just demotes nr_pages
>   instead of lowering the memcg usage by nr_pages).
> - try_charge_memcg() will keep retrying the charge while
>   try_to_free_mem_cgroup_pages() is just demoting pages and not actually
>   making any room for the charge.

This has been brought up during the review https://lore.kernel.org/all/YoYTEDD+c4GT0xYY@dhcp22.suse.cz/

> - memory.reclaim has a wonky interface. It advertises to the user it
>   reclaims the provided amount but it will actually often demote that
>   amount.
> 
> There may be more effects to this issue.
> 
> To fix these issues I propose shrink_folio_list() to only count pages
> demoted from inside of sc->nodemask to outside of sc->nodemask as
> 'reclaimed'.

Could you expand on why the node mask matters? From the charge point of
view it should be completely uninteresting as the charge remains.

I suspect we really need to change to reclaim metrics for memcg reclaim.
In the memory balancing reclaim we can indeed consider demotions as a
reclaim because the memory is freed in the end but for the memcg reclaim
we really should be counting discharges instead. No demotion/migration will
free up charges.
  
kernel test robot Dec. 6, 2022, 3:15 p.m. UTC | #5
Hi Mina,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Mina-Almasry/mm-Fix-memcg-reclaim-on-memory-tiered-systems/20221206-103512
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20221206023406.3182800-1-almasrymina%40google.com
patch subject: [PATCH v3] [mm-unstable] mm: Fix memcg reclaim on memory tiered systems
config: openrisc-randconfig-r012-20221206
compiler: or1k-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/d326a3b857c743d1e64b73a752505a65732b4e51
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Mina-Almasry/mm-Fix-memcg-reclaim-on-memory-tiered-systems/20221206-103512
        git checkout d326a3b857c743d1e64b73a752505a65732b4e51
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=openrisc SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> mm/vmscan.c:1599:6: warning: no previous prototype for 'free_demote_page' [-Wmissing-prototypes]
    1599 | void free_demote_page(struct page *page, unsigned long private)
         |      ^~~~~~~~~~~~~~~~


vim +/free_demote_page +1599 mm/vmscan.c

  1598	
> 1599	void free_demote_page(struct page *page, unsigned long private)
  1600	{
  1601		struct demotion_control *dc = (struct demotion_control *)private;
  1602	
  1603		if (dc->demote_from_nodemask &&
  1604		    !node_isset(page_to_nid(page), *dc->demote_from_nodemask))
  1605			dc->nr_demoted_outside_nodemask--;
  1606	
  1607		folio_put(page_folio(page));
  1608	}
  1609
  
Mina Almasry Dec. 6, 2022, 4:06 p.m. UTC | #6
On Tue, Dec 6, 2022 at 4:20 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 05-12-22 18:34:05, Mina Almasry wrote:
> > commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> > reclaim"") enabled demotion in memcg reclaim, which is the right thing
> > to do, however, it introduced a regression in the behavior of
> > try_to_free_mem_cgroup_pages().
> >
> > The callers of try_to_free_mem_cgroup_pages() expect it to attempt to
> > reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage
> > of the cgroup should reduce by nr_pages. The callers expect
> > try_to_free_mem_cgroup_pages() to also return the number of pages
> > reclaimed, not demoted.
> >
> > However, what try_to_free_mem_cgroup_pages() actually does is it
> > unconditionally counts demoted pages as reclaimed pages. So in practice
> > when it is called it will often demote nr_pages and return the number of
> > demoted pages to the caller. Demoted pages don't lower the memcg usage,
> > and so try_to_free_mem_cgroup_pages() is not actually doing what the
> > callers want it to do.
> >
> > Various things work suboptimally on memory tiered systems or don't work
> > at all due to this:
> >
> > - memory.high enforcement likely doesn't work (it just demotes nr_pages
> >   instead of lowering the memcg usage by nr_pages).
> > - try_charge_memcg() will keep retrying the charge while
> >   try_to_free_mem_cgroup_pages() is just demoting pages and not actually
> >   making any room for the charge.
>
> This has been brought up during the review https://lore.kernel.org/all/YoYTEDD+c4GT0xYY@dhcp22.suse.cz/
>

Ah, I did indeed miss this. Thanks for the pointer. However I don't
understand this bit from your email (sorry I'm probably missing
something):

"I suspect this is rather unlikely situation, though. The last tear
(without any fallback) should have some memory to reclaim most of
the time."

Reading the code in try_charge_memcg(), I don't see the last retry for
try_to_free_mem_cgroup_pages() do anything special. My concern here is
that try_charge_memcg() calls try_to_free_mem_cgroup_pages()
MAX_RECLAIM_RETRIES times. Each time that call may demote pages and
report back that it was able to 'reclaim' memory, but the charge keeps
failing because the memcg reclaim didn't actually make room for the
charge. What happens in this case? My understanding is that the memcg
oom-killer gets wrongly invoked.

> > - memory.reclaim has a wonky interface. It advertises to the user it
> >   reclaims the provided amount but it will actually often demote that
> >   amount.
> >
> > There may be more effects to this issue.
> >
> > To fix these issues I propose shrink_folio_list() to only count pages
> > demoted from inside of sc->nodemask to outside of sc->nodemask as
> > 'reclaimed'.
>
> Could you expand on why the node mask matters? From the charge point of
> view it should be completely uninteresting as the charge remains.
>
> I suspect we really need to change to reclaim metrics for memcg reclaim.
> In the memory balancing reclaim we can indeed consider demotions as a
> reclaim because the memory is freed in the end but for the memcg reclaim
> we really should be counting discharges instead. No demotion/migration will
> free up charges.

I think what you're describing is exactly what this patch aims to do.
I'm proposing an interface change to shrink_folio_list() such that it
only counts demoted pages as reclaimed iff sc->nodemask is provided by
the caller and the demotion removed pages from inside sc->nodemask to
outside sc->nodemask. In this case:

1. memory balancing reclaim would pass sc->nodemask=nid to
shrink_folio_list() indicating that it should count pages demoted from
sc->nodemask as reclaimed.

2. memcg reclaim would pass sc->nodemask=NULL to shrink_folio_list()
indicating that it is looking for reclaim across all nodes and no
demoted pages should count as reclaimed.

Sorry if the commit message was not clear. I can try making it clearer
in the next version but it's already very long.

> --
> Michal Hocko
> SUSE Labs
  
kernel test robot Dec. 6, 2022, 6:17 p.m. UTC | #7
Hi Mina,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Mina-Almasry/mm-Fix-memcg-reclaim-on-memory-tiered-systems/20221206-103512
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20221206023406.3182800-1-almasrymina%40google.com
patch subject: [PATCH v3] [mm-unstable] mm: Fix memcg reclaim on memory tiered systems
config: arm-randconfig-r036-20221206
compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project 6e4cea55f0d1104408b26ac574566a0e4de48036)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm cross compiling tool for clang build
        # apt-get install binutils-arm-linux-gnueabi
        # https://github.com/intel-lab-lkp/linux/commit/d326a3b857c743d1e64b73a752505a65732b4e51
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Mina-Almasry/mm-Fix-memcg-reclaim-on-memory-tiered-systems/20221206-103512
        git checkout d326a3b857c743d1e64b73a752505a65732b4e51
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> mm/vmscan.c:1599:6: warning: no previous prototype for function 'free_demote_page' [-Wmissing-prototypes]
   void free_demote_page(struct page *page, unsigned long private)
        ^
   mm/vmscan.c:1599:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   void free_demote_page(struct page *page, unsigned long private)
   ^
   static 
   1 warning generated.


vim +/free_demote_page +1599 mm/vmscan.c

  1598	
> 1599	void free_demote_page(struct page *page, unsigned long private)
  1600	{
  1601		struct demotion_control *dc = (struct demotion_control *)private;
  1602	
  1603		if (dc->demote_from_nodemask &&
  1604		    !node_isset(page_to_nid(page), *dc->demote_from_nodemask))
  1605			dc->nr_demoted_outside_nodemask--;
  1606	
  1607		folio_put(page_folio(page));
  1608	}
  1609
  
Michal Hocko Dec. 6, 2022, 7:55 p.m. UTC | #8
On Tue 06-12-22 08:06:51, Mina Almasry wrote:
> On Tue, Dec 6, 2022 at 4:20 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 05-12-22 18:34:05, Mina Almasry wrote:
> > > commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> > > reclaim"") enabled demotion in memcg reclaim, which is the right thing
> > > to do, however, it introduced a regression in the behavior of
> > > try_to_free_mem_cgroup_pages().
> > >
> > > The callers of try_to_free_mem_cgroup_pages() expect it to attempt to
> > > reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage
> > > of the cgroup should reduce by nr_pages. The callers expect
> > > try_to_free_mem_cgroup_pages() to also return the number of pages
> > > reclaimed, not demoted.
> > >
> > > However, what try_to_free_mem_cgroup_pages() actually does is it
> > > unconditionally counts demoted pages as reclaimed pages. So in practice
> > > when it is called it will often demote nr_pages and return the number of
> > > demoted pages to the caller. Demoted pages don't lower the memcg usage,
> > > and so try_to_free_mem_cgroup_pages() is not actually doing what the
> > > callers want it to do.
> > >
> > > Various things work suboptimally on memory tiered systems or don't work
> > > at all due to this:
> > >
> > > - memory.high enforcement likely doesn't work (it just demotes nr_pages
> > >   instead of lowering the memcg usage by nr_pages).
> > > - try_charge_memcg() will keep retrying the charge while
> > >   try_to_free_mem_cgroup_pages() is just demoting pages and not actually
> > >   making any room for the charge.
> >
> > This has been brought up during the review https://lore.kernel.org/all/YoYTEDD+c4GT0xYY@dhcp22.suse.cz/
> >
> 
> Ah, I did indeed miss this. Thanks for the pointer. However I don't
> understand this bit from your email (sorry I'm probably missing
> something):
> 
> "I suspect this is rather unlikely situation, though. The last tear
> (without any fallback) should have some memory to reclaim most of
> the time."
> 
> Reading the code in try_charge_memcg(), I don't see the last retry for
> try_to_free_mem_cgroup_pages() do anything special. My concern here is
> that try_charge_memcg() calls try_to_free_mem_cgroup_pages()
> MAX_RECLAIM_RETRIES times. Each time that call may demote pages and
> report back that it was able to 'reclaim' memory, but the charge keeps
> failing because the memcg reclaim didn't actually make room for the
> charge. What happens in this case? My understanding is that the memcg
> oom-killer gets wrongly invoked.

The memcg reclaim shrinks from all zones in the allowed zonelist. In
general from all nodes. So unless the lower tier is outside of this
zonelist then there is a zone to reclaim from which cannot demote.
Correct?

> > > - memory.reclaim has a wonky interface. It advertises to the user it
> > >   reclaims the provided amount but it will actually often demote that
> > >   amount.
> > >
> > > There may be more effects to this issue.
> > >
> > > To fix these issues I propose shrink_folio_list() to only count pages
> > > demoted from inside of sc->nodemask to outside of sc->nodemask as
> > > 'reclaimed'.
> >
> > Could you expand on why the node mask matters? From the charge point of
> > view it should be completely uninteresting as the charge remains.
> >
> > I suspect we really need to change to reclaim metrics for memcg reclaim.
> > In the memory balancing reclaim we can indeed consider demotions as a
> > reclaim because the memory is freed in the end but for the memcg reclaim
> > we really should be counting discharges instead. No demotion/migration will
> > free up charges.
> 
> I think what you're describing is exactly what this patch aims to do.
> I'm proposing an interface change to shrink_folio_list() such that it
> only counts demoted pages as reclaimed iff sc->nodemask is provided by
> the caller and the demotion removed pages from inside sc->nodemask to
> outside sc->nodemask. In this case:
> 
> 1. memory balancing reclaim would pass sc->nodemask=nid to
> shrink_folio_list() indicating that it should count pages demoted from
> sc->nodemask as reclaimed.
> 
> 2. memcg reclaim would pass sc->nodemask=NULL to shrink_folio_list()
> indicating that it is looking for reclaim across all nodes and no
> demoted pages should count as reclaimed.
> 
> Sorry if the commit message was not clear. I can try making it clearer
> in the next version but it's already very long.

Either I am missing something or I simply do not understand why you are
hooked into nodemask so much. Why cannot we have a simple rule that
only global reclaim considers demotions as nr_reclaimed?
  
Huang, Ying Dec. 7, 2022, 1:22 a.m. UTC | #9
Michal Hocko <mhocko@suse.com> writes:

> On Tue 06-12-22 08:06:51, Mina Almasry wrote:
>> On Tue, Dec 6, 2022 at 4:20 AM Michal Hocko <mhocko@suse.com> wrote:
>> >
>> > On Mon 05-12-22 18:34:05, Mina Almasry wrote:
>> > > commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
>> > > reclaim"") enabled demotion in memcg reclaim, which is the right thing
>> > > to do, however, it introduced a regression in the behavior of
>> > > try_to_free_mem_cgroup_pages().
>> > >
>> > > The callers of try_to_free_mem_cgroup_pages() expect it to attempt to
>> > > reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage
>> > > of the cgroup should reduce by nr_pages. The callers expect
>> > > try_to_free_mem_cgroup_pages() to also return the number of pages
>> > > reclaimed, not demoted.
>> > >
>> > > However, what try_to_free_mem_cgroup_pages() actually does is it
>> > > unconditionally counts demoted pages as reclaimed pages. So in practice
>> > > when it is called it will often demote nr_pages and return the number of
>> > > demoted pages to the caller. Demoted pages don't lower the memcg usage,
>> > > and so try_to_free_mem_cgroup_pages() is not actually doing what the
>> > > callers want it to do.
>> > >
>> > > Various things work suboptimally on memory tiered systems or don't work
>> > > at all due to this:
>> > >
>> > > - memory.high enforcement likely doesn't work (it just demotes nr_pages
>> > >   instead of lowering the memcg usage by nr_pages).
>> > > - try_charge_memcg() will keep retrying the charge while
>> > >   try_to_free_mem_cgroup_pages() is just demoting pages and not actually
>> > >   making any room for the charge.
>> >
>> > This has been brought up during the review https://lore.kernel.org/all/YoYTEDD+c4GT0xYY@dhcp22.suse.cz/
>> >
>> 
>> Ah, I did indeed miss this. Thanks for the pointer. However I don't
>> understand this bit from your email (sorry I'm probably missing
>> something):
>> 
>> "I suspect this is rather unlikely situation, though. The last tear
>> (without any fallback) should have some memory to reclaim most of
>> the time."
>> 
>> Reading the code in try_charge_memcg(), I don't see the last retry for
>> try_to_free_mem_cgroup_pages() do anything special. My concern here is
>> that try_charge_memcg() calls try_to_free_mem_cgroup_pages()
>> MAX_RECLAIM_RETRIES times. Each time that call may demote pages and
>> report back that it was able to 'reclaim' memory, but the charge keeps
>> failing because the memcg reclaim didn't actually make room for the
>> charge. What happens in this case? My understanding is that the memcg
>> oom-killer gets wrongly invoked.
>
> The memcg reclaim shrinks from all zones in the allowed zonelist. In
> general from all nodes. So unless the lower tier is outside of this
> zonelist then there is a zone to reclaim from which cannot demote.
> Correct?
>
>> > > - memory.reclaim has a wonky interface. It advertises to the user it
>> > >   reclaims the provided amount but it will actually often demote that
>> > >   amount.
>> > >
>> > > There may be more effects to this issue.
>> > >
>> > > To fix these issues I propose shrink_folio_list() to only count pages
>> > > demoted from inside of sc->nodemask to outside of sc->nodemask as
>> > > 'reclaimed'.
>> >
>> > Could you expand on why the node mask matters? From the charge point of
>> > view it should be completely uninteresting as the charge remains.
>> >
>> > I suspect we really need to change to reclaim metrics for memcg reclaim.
>> > In the memory balancing reclaim we can indeed consider demotions as a
>> > reclaim because the memory is freed in the end but for the memcg reclaim
>> > we really should be counting discharges instead. No demotion/migration will
>> > free up charges.
>> 
>> I think what you're describing is exactly what this patch aims to do.
>> I'm proposing an interface change to shrink_folio_list() such that it
>> only counts demoted pages as reclaimed iff sc->nodemask is provided by
>> the caller and the demotion removed pages from inside sc->nodemask to
>> outside sc->nodemask. In this case:
>> 
>> 1. memory balancing reclaim would pass sc->nodemask=nid to
>> shrink_folio_list() indicating that it should count pages demoted from
>> sc->nodemask as reclaimed.
>> 
>> 2. memcg reclaim would pass sc->nodemask=NULL to shrink_folio_list()
>> indicating that it is looking for reclaim across all nodes and no
>> demoted pages should count as reclaimed.
>> 
>> Sorry if the commit message was not clear. I can try making it clearer
>> in the next version but it's already very long.
>
> Either I am missing something or I simply do not understand why you are
> hooked into nodemask so much. Why cannot we have a simple rule that
> only global reclaim considers demotions as nr_reclaimed?

Yes.  This sounds reasonable to me and it simplifies the logic greatly!

Best Regards,
Huang, Ying
  
Mina Almasry Dec. 7, 2022, 1:55 a.m. UTC | #10
On Tue, Dec 6, 2022 at 11:55 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 06-12-22 08:06:51, Mina Almasry wrote:
> > On Tue, Dec 6, 2022 at 4:20 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 05-12-22 18:34:05, Mina Almasry wrote:
> > > > commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> > > > reclaim"") enabled demotion in memcg reclaim, which is the right thing
> > > > to do, however, it introduced a regression in the behavior of
> > > > try_to_free_mem_cgroup_pages().
> > > >
> > > > The callers of try_to_free_mem_cgroup_pages() expect it to attempt to
> > > > reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage
> > > > of the cgroup should reduce by nr_pages. The callers expect
> > > > try_to_free_mem_cgroup_pages() to also return the number of pages
> > > > reclaimed, not demoted.
> > > >
> > > > However, what try_to_free_mem_cgroup_pages() actually does is it
> > > > unconditionally counts demoted pages as reclaimed pages. So in practice
> > > > when it is called it will often demote nr_pages and return the number of
> > > > demoted pages to the caller. Demoted pages don't lower the memcg usage,
> > > > and so try_to_free_mem_cgroup_pages() is not actually doing what the
> > > > callers want it to do.
> > > >
> > > > Various things work suboptimally on memory tiered systems or don't work
> > > > at all due to this:
> > > >
> > > > - memory.high enforcement likely doesn't work (it just demotes nr_pages
> > > >   instead of lowering the memcg usage by nr_pages).
> > > > - try_charge_memcg() will keep retrying the charge while
> > > >   try_to_free_mem_cgroup_pages() is just demoting pages and not actually
> > > >   making any room for the charge.
> > >
> > > This has been brought up during the review https://lore.kernel.org/all/YoYTEDD+c4GT0xYY@dhcp22.suse.cz/
> > >
> >
> > Ah, I did indeed miss this. Thanks for the pointer. However I don't
> > understand this bit from your email (sorry I'm probably missing
> > something):
> >
> > "I suspect this is rather unlikely situation, though. The last tear
> > (without any fallback) should have some memory to reclaim most of
> > the time."
> >
> > Reading the code in try_charge_memcg(), I don't see the last retry for
> > try_to_free_mem_cgroup_pages() do anything special. My concern here is
> > that try_charge_memcg() calls try_to_free_mem_cgroup_pages()
> > MAX_RECLAIM_RETRIES times. Each time that call may demote pages and
> > report back that it was able to 'reclaim' memory, but the charge keeps
> > failing because the memcg reclaim didn't actually make room for the
> > charge. What happens in this case? My understanding is that the memcg
> > oom-killer gets wrongly invoked.
>
> The memcg reclaim shrinks from all zones in the allowed zonelist. In
> general from all nodes. So unless the lower tier is outside of this
> zonelist then there is a zone to reclaim from which cannot demote.
> Correct?
>

Ah, thanks for pointing this out. I did indeed miss that the memcg
reclaim tries to apply pressure equally to all the nodes. With some
additional testing I'm able to see what you said: there is no premature
oom kill invocation, because the memcg reclaim will generally find some
pages to reclaim from the lower tier nodes.

I do find that the first call to try_to_free_mem_cgroup_pages()
sometimes mostly demotes and does not do much reclaim. I haven't been
able to fully track down the cause of that, but I suspect that the
first call in my test will find most of the cgroup's memory on top
tier nodes. However we do retry a bunch of times before we invoke oom,
and in my testing subsequent calls will find plenty of memory in the
lower tier nodes that it can reclaim. I'll update the commit message
in the next version.

> > > > - memory.reclaim has a wonky interface. It advertises to the user it
> > > >   reclaims the provided amount but it will actually often demote that
> > > >   amount.
> > > >
> > > > There may be more effects to this issue.
> > > >
> > > > To fix these issues I propose shrink_folio_list() to only count pages
> > > > demoted from inside of sc->nodemask to outside of sc->nodemask as
> > > > 'reclaimed'.
> > >
> > > Could you expand on why the node mask matters? From the charge point of
> > > view it should be completely uninteresting as the charge remains.
> > >
> > > I suspect we really need to change to reclaim metrics for memcg reclaim.
> > > In the memory balancing reclaim we can indeed consider demotions as a
> > > reclaim because the memory is freed in the end but for the memcg reclaim
> > > we really should be counting discharges instead. No demotion/migration will
> > > free up charges.
> >
> > I think what you're describing is exactly what this patch aims to do.
> > I'm proposing an interface change to shrink_folio_list() such that it
> > only counts demoted pages as reclaimed iff sc->nodemask is provided by
> > the caller and the demotion removed pages from inside sc->nodemask to
> > outside sc->nodemask. In this case:
> >
> > 1. memory balancing reclaim would pass sc->nodemask=nid to
> > shrink_folio_list() indicating that it should count pages demoted from
> > sc->nodemask as reclaimed.
> >
> > 2. memcg reclaim would pass sc->nodemask=NULL to shrink_folio_list()
> > indicating that it is looking for reclaim across all nodes and no
> > demoted pages should count as reclaimed.
> >
> > Sorry if the commit message was not clear. I can try making it clearer
> > in the next version but it's already very long.
>
> Either I am missing something or I simply do not understand why you are
> hooked into nodemask so much. Why cannot we have a simple rule that
> only global reclaim considers demotions as nr_reclaimed?
>

Thanks. I think this approach would work for most callers. My issue
here is properly supporting the recently added nodes= arg[1] to
memory.reclaim. If the user specifies all nodes or provides no arg,
I'd like to treat it as memcg reclaim which doesn't count demotions.
If the user provides the top tier nodes, I would like to count
demotions as this interface is the way to trigger proactive demotion
from top tier nodes.

I guess I can check which args the user is passing and decide whether
or not to count demotions. But the user right now can specify any
combination of nodes, some of them top tier, some lower tier, some in
the middle. I can return -EINVAL for that, but that seems like a
shame. I thought a generic way to address this was what I'm doing
here, i.e. counting pages demoted from the nodemask as reclaimed. Is
that not acceptable? Is -EINVAL preferred here?

[1] https://lore.kernel.org/linux-mm/87tu2a1v3y.fsf@yhuang6-desk2.ccr.corp.intel.com/

> --
> Michal Hocko
> SUSE Labs
  
Michal Hocko Dec. 7, 2022, 11:12 a.m. UTC | #11
On Tue 06-12-22 17:55:58, Mina Almasry wrote:
> On Tue, Dec 6, 2022 at 11:55 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 06-12-22 08:06:51, Mina Almasry wrote:
> > > On Tue, Dec 6, 2022 at 4:20 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Mon 05-12-22 18:34:05, Mina Almasry wrote:
> > > > > commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> > > > > reclaim"") enabled demotion in memcg reclaim, which is the right thing
> > > > > to do, however, it introduced a regression in the behavior of
> > > > > try_to_free_mem_cgroup_pages().
> > > > >
> > > > > The callers of try_to_free_mem_cgroup_pages() expect it to attempt to
> > > > > reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage
> > > > > of the cgroup should reduce by nr_pages. The callers expect
> > > > > try_to_free_mem_cgroup_pages() to also return the number of pages
> > > > > reclaimed, not demoted.
> > > > >
> > > > > However, what try_to_free_mem_cgroup_pages() actually does is it
> > > > > unconditionally counts demoted pages as reclaimed pages. So in practice
> > > > > when it is called it will often demote nr_pages and return the number of
> > > > > demoted pages to the caller. Demoted pages don't lower the memcg usage,
> > > > > and so try_to_free_mem_cgroup_pages() is not actually doing what the
> > > > > callers want it to do.
> > > > >
> > > > > Various things work suboptimally on memory tiered systems or don't work
> > > > > at all due to this:
> > > > >
> > > > > - memory.high enforcement likely doesn't work (it just demotes nr_pages
> > > > >   instead of lowering the memcg usage by nr_pages).
> > > > > - try_charge_memcg() will keep retrying the charge while
> > > > >   try_to_free_mem_cgroup_pages() is just demoting pages and not actually
> > > > >   making any room for the charge.
> > > >
> > > > This has been brought up during the review https://lore.kernel.org/all/YoYTEDD+c4GT0xYY@dhcp22.suse.cz/
> > > >
> > >
> > > Ah, I did indeed miss this. Thanks for the pointer. However I don't
> > > understand this bit from your email (sorry I'm probably missing
> > > something):
> > >
> > > "I suspect this is rather unlikely situation, though. The last tear
> > > (without any fallback) should have some memory to reclaim most of
> > > the time."
> > >
> > > Reading the code in try_charge_memcg(), I don't see the last retry for
> > > try_to_free_mem_cgroup_pages() do anything special. My concern here is
> > > that try_charge_memcg() calls try_to_free_mem_cgroup_pages()
> > > MAX_RECLAIM_RETRIES times. Each time that call may demote pages and
> > > report back that it was able to 'reclaim' memory, but the charge keeps
> > > failing because the memcg reclaim didn't actually make room for the
> > > charge. What happens in this case? My understanding is that the memcg
> > > oom-killer gets wrongly invoked.
> >
> > The memcg reclaim shrinks from all zones in the allowed zonelist. In
> > general from all nodes. So unless the lower tier is outside of this
> > zonelist then there is a zone to reclaim from which cannot demote.
> > Correct?
> >
> 
> Ah, thanks for pointing this out. I did indeed miss that the memcg
> reclaim tries to apply pressure equally to all the nodes. With some
> additional testing I'm able to see what you said: there could be no
> premature oom kill invocation because generally the memcg reclaim will
> find some pages to reclaim from lower tier nodes.
> 
> I do find that the first call to try_to_free_mem_cgroup_pages()
> sometimes will mostly demote and not do much reclaim. I haven't been
> able to fully track the cause of that down but I suspect that the
> first call in my test will find most of the cgroup's memory on top
> tier nodes. However we do retry a bunch of times before we invoke oom,
> and in my testing subsequent calls will find plenty of memory in the
> lower tier nodes that it can reclaim. I'll update the commit message
> in the next version.

In the past we used to break out early from the memcg reclaim if the
reclaim target had been completed - see 1ba6fc9af35b ("mm: vmscan: do
not share cgroup iteration between reclaimers"). But I do not see an
early break from the reclaim anywhere anymore, so the precise
nr_reclaimed tracking shouldn't make a lot of difference. There are
cases where we really have a hard time finding proper candidates and
need to dip into higher reclaim priorities though.

Anyway, proper nr_reclaimed tracking should be rather straightforward,
but I do not expect it to make a big difference in practice:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 026199c047e0..1b7f2d8cb128 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1633,7 +1633,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	LIST_HEAD(ret_folios);
 	LIST_HEAD(free_folios);
 	LIST_HEAD(demote_folios);
-	unsigned int nr_reclaimed = 0;
+	unsigned int nr_reclaimed = 0, nr_demoted = 0;
 	unsigned int pgactivate = 0;
 	bool do_demote_pass;
 	struct swap_iocb *plug = NULL;
@@ -2065,8 +2065,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	}
 	/* 'folio_list' is always empty here */
 
-	/* Migrate folios selected for demotion */
-	nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
+	/*
+	 * Migrate folios selected for demotion.
+	 * Do not consider demoted pages to be reclaimed for the memcg reclaim
+	 * because no charges are really freed during the migration. Global
+	 * reclaim aims at releasing memory from nodes/zones so consider
+	 * demotion to reclaim memory.
+	 */
+	nr_demoted += demote_folio_list(&demote_folios, pgdat);
+	if (!cgroup_reclaim(sc))
+		nr_reclaimed += nr_demoted;
+
 	/* Folios that could not be demoted are still in @demote_folios */
 	if (!list_empty(&demote_folios)) {
 		/* Folios which weren't demoted go back on @folio_list for retry: */
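
For reference, cgroup_reclaim(sc) in mm/vmscan.c is essentially a check
for whether the scan targets a memcg (roughly the following, modulo
details in the current tree), so with the hunk above demotions keep
counting as reclaim only for global reclaim:

	static bool cgroup_reclaim(struct scan_control *sc)
	{
		return sc->target_mem_cgroup;
	}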

[...]
> > Either I am missing something or I simply do not understand why you are
> > hooked into nodemask so much. Why cannot we have a simple rule that
> > only global reclaim considers demotions as nr_reclaimed?
> >
> 
> Thanks. I think this approach would work for most callers. My issue
> here is properly supporting the recently added nodes= arg[1] to
> memory.reclaim. If the user specifies all nodes or provides no arg,
> I'd like to treat it as memcg reclaim which doesn't count demotions.
> If the user provides the top tier nodes, I would like to count
> demotions as this interface is the way to trigger proactive demotion
> from top tier nodes.

Sorry to repeat myself but nodemask doesn't really make any difference.
The reclaim should only count pages which really free memory in the
domain. Even if you demote to a node outside of the given nodemask then
the charge stays with the existing memcg so it is rather dubious to
count it as a reclaimed page. Or do I still miss what you are trying to
achieve?
  
Mina Almasry Dec. 7, 2022, 9:43 p.m. UTC | #12
On Wed, Dec 7, 2022 at 3:12 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 06-12-22 17:55:58, Mina Almasry wrote:
> > On Tue, Dec 6, 2022 at 11:55 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 06-12-22 08:06:51, Mina Almasry wrote:
> > > > On Tue, Dec 6, 2022 at 4:20 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Mon 05-12-22 18:34:05, Mina Almasry wrote:
> > > > > > commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> > > > > > reclaim"") enabled demotion in memcg reclaim, which is the right thing
> > > > > > to do, however, it introduced a regression in the behavior of
> > > > > > try_to_free_mem_cgroup_pages().
> > > > > >
> > > > > > The callers of try_to_free_mem_cgroup_pages() expect it to attempt to
> > > > > > reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage
> > > > > > of the cgroup should reduce by nr_pages. The callers expect
> > > > > > try_to_free_mem_cgroup_pages() to also return the number of pages
> > > > > > reclaimed, not demoted.
> > > > > >
> > > > > > However, what try_to_free_mem_cgroup_pages() actually does is it
> > > > > > unconditionally counts demoted pages as reclaimed pages. So in practice
> > > > > > when it is called it will often demote nr_pages and return the number of
> > > > > > demoted pages to the caller. Demoted pages don't lower the memcg usage,
> > > > > > and so try_to_free_mem_cgroup_pages() is not actually doing what the
> > > > > > callers want it to do.
> > > > > >
> > > > > > Various things work suboptimally on memory tiered systems or don't work
> > > > > > at all due to this:
> > > > > >
> > > > > > - memory.high enforcement likely doesn't work (it just demotes nr_pages
> > > > > >   instead of lowering the memcg usage by nr_pages).
> > > > > > - try_charge_memcg() will keep retrying the charge while
> > > > > >   try_to_free_mem_cgroup_pages() is just demoting pages and not actually
> > > > > >   making any room for the charge.
> > > > >
> > > > > This has been brought up during the review https://lore.kernel.org/all/YoYTEDD+c4GT0xYY@dhcp22.suse.cz/
> > > > >
> > > >
> > > > Ah, I did indeed miss this. Thanks for the pointer. However I don't
> > > > understand this bit from your email (sorry I'm probably missing
> > > > something):
> > > >
> > > > "I suspect this is rather unlikely situation, though. The last tear
> > > > (without any fallback) should have some memory to reclaim most of
> > > > the time."
> > > >
> > > > Reading the code in try_charge_memcg(), I don't see the last retry for
> > > > try_to_free_mem_cgroup_pages() do anything special. My concern here is
> > > > that try_charge_memcg() calls try_to_free_mem_cgroup_pages()
> > > > MAX_RECLAIM_RETRIES times. Each time that call may demote pages and
> > > > report back that it was able to 'reclaim' memory, but the charge keeps
> > > > failing because the memcg reclaim didn't actually make room for the
> > > > charge. What happens in this case? My understanding is that the memcg
> > > > oom-killer gets wrongly invoked.
> > >
> > > The memcg reclaim shrinks from all zones in the allowed zonelist. In
> > > general from all nodes. So unless the lower tier is outside of this
> > > zonelist then there is a zone to reclaim from which cannot demote.
> > > Correct?
> > >
> >
> > Ah, thanks for pointing this out. I did indeed miss that the memcg
> > reclaim tries to apply pressure equally to all the nodes. With some
> > additional testing I'm able to see what you said: there could be no
> > premature oom kill invocation because generally the memcg reclaim will
> > find some pages to reclaim from lower tier nodes.
> >
> > I do find that the first call to try_to_free_mem_cgroup_pages()
> > sometimes will mostly demote and not do much reclaim. I haven't been
> > able to fully track the cause of that down but I suspect that the
> > first call in my test will find most of the cgroup's memory on top
> > tier nodes. However we do retry a bunch of times before we invoke oom,
> > and in my testing subsequent calls will find plenty of memory in the
> > lower tier nodes that it can reclaim. I'll update the commit message
> > in the next version.
>
> In the past we used to break out early from the memcg reclaim if the
> reclaim target has been completed - see 1ba6fc9af35b ("mm: vmscan: do
> not share cgroup iteration between reclaimers"). But I do not see early
> break from the reclaim anywhere anymore so the precise nr_reclaimed
> tracking shouldn't make a lot of difference. There are cases where we
> really have hard time to find proper candidates and need to dip into
> hogher reclaim priorities though.
>

Yes, I agree. Thank you for your patient explanation. I now see why
precise accounting of nr_reclaimed is not an issue (or at least not as
big an issue as I once thought).

> Anyway a proper nr_reclaimed tracking should be rather straightforward
> but I do not expect to make a big difference in practice
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 026199c047e0..1b7f2d8cb128 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1633,7 +1633,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>         LIST_HEAD(ret_folios);
>         LIST_HEAD(free_folios);
>         LIST_HEAD(demote_folios);
> -       unsigned int nr_reclaimed = 0;
> +       unsigned int nr_reclaimed = 0, nr_demoted = 0;
>         unsigned int pgactivate = 0;
>         bool do_demote_pass;
>         struct swap_iocb *plug = NULL;
> @@ -2065,8 +2065,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>         }
>         /* 'folio_list' is always empty here */
>
> -       /* Migrate folios selected for demotion */
> -       nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
> +       /*
> +        * Migrate folios selected for demotion.
> +        * Do not consider demoted pages to be reclaimed for the memcg reclaim
> +        * because no charges are really freed during the migration. Global
> +        * reclaim aims at releasing memory from nodes/zones so consider
> +        * demotion to reclaim memory.
> +        */
> +       nr_demoted += demote_folio_list(&demote_folios, pgdat);
> +       if (!cgroup_reclaim(sc))
> +               nr_reclaimed += nr_demoted;
> +
>         /* Folios that could not be demoted are still in @demote_folios */
>         if (!list_empty(&demote_folios)) {
>                 /* Folios which weren't demoted go back on @folio_list for retry: */
>
> [...]

Thank you again, but this patch breaks the memory.reclaim nodes arg
for me. This is my test case. I run it on a machine with 2 memory
tiers.

Memory tier 1= nodes 0-2
Memory tier 2= node 3

    mkdir -p /sys/fs/cgroup/unified/test
    cd /sys/fs/cgroup/unified/test
    echo $$ > cgroup.procs
    head -c 500m /dev/random > /tmp/testfile
    echo $$ > /sys/fs/cgroup/unified/cgroup.procs
    echo "1m nodes=0-2" > memory.reclaim

In my opinion the expected behavior is for the kernel to demote 1mb of
memory from nodes 0-2 to node 3.

Actual behavior on the tip of mm-unstable is as expected.

Actual behavior with your patch cherry-picked to mm-unstable is that
the kernel demotes all 500mb of memory from nodes 0-2 to node 3, and
returns -EAGAIN to the user. This may be the correct behavior you're
intending, but it completely breaks the use case I implemented the
nodes= arg for and listed on the commit message of that change.

> > > Either I am missing something or I simply do not understand why you are
> > > hooked into nodemask so much. Why cannot we have a simple rule that
> > > only global reclaim considers demotions as nr_reclaimed?
> > >
> >
> > Thanks. I think this approach would work for most callers. My issue
> > here is properly supporting the recently added nodes= arg[1] to
> > memory.reclaim. If the user specifies all nodes or provides no arg,
> > I'd like to treat it as memcg reclaim which doesn't count demotions.
> > If the user provides the top tier nodes, I would like to count
> > demotions as this interface is the way to trigger proactive demotion
> > from top tier nodes.
>
> Sorry to repeat myself but nodemask doesn't really make any difference.
> The reclaim should only count pages which really free memory in the
> domain. Even if you demote to a node outside of the given nodemask then
> the charge stays with the existing memcg so it is rather dubious to
> count it as a reclaimed page. Or do I still miss what you are trying to
> achieve?
>

I think you've convinced me that the issue is not as serious as I
first thought, thanks. We can maybe abandon this fix? If we do want to
fix this now or in the future, I'd ask that the fix stay compatible
with memory.reclaim and its nodes= arg. I'm happy to provide patches
for whatever fix you find acceptable.

> --
> Michal Hocko
> SUSE Labs
  
Michal Hocko Dec. 8, 2022, 8:09 a.m. UTC | #13
On Wed 07-12-22 13:43:55, Mina Almasry wrote:
> On Wed, Dec 7, 2022 at 3:12 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > Anyway a proper nr_reclaimed tracking should be rather straightforward
> > but I do not expect to make a big difference in practice
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 026199c047e0..1b7f2d8cb128 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1633,7 +1633,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >         LIST_HEAD(ret_folios);
> >         LIST_HEAD(free_folios);
> >         LIST_HEAD(demote_folios);
> > -       unsigned int nr_reclaimed = 0;
> > +       unsigned int nr_reclaimed = 0, nr_demoted = 0;
> >         unsigned int pgactivate = 0;
> >         bool do_demote_pass;
> >         struct swap_iocb *plug = NULL;
> > @@ -2065,8 +2065,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >         }
> >         /* 'folio_list' is always empty here */
> >
> > -       /* Migrate folios selected for demotion */
> > -       nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
> > +       /*
> > +        * Migrate folios selected for demotion.
> > +        * Do not consider demoted pages to be reclaimed for the memcg reclaim
> > +        * because no charges are really freed during the migration. Global
> > +        * reclaim aims at releasing memory from nodes/zones so consider
> > +        * demotion to reclaim memory.
> > +        */
> > +       nr_demoted += demote_folio_list(&demote_folios, pgdat);
> > +       if (!cgroup_reclaim(sc))
> > +               nr_reclaimed += nr_demoted;
> > +
> >         /* Folios that could not be demoted are still in @demote_folios */
> >         if (!list_empty(&demote_folios)) {
> >                 /* Folios which weren't demoted go back on @folio_list for retry: */
> >
> > [...]
> 
> Thank you again, but this patch breaks the memory.reclaim nodes arg
> for me. This is my test case. I run it on a machine with 2 memory
> tiers.
> 
> Memory tier 1= nodes 0-2
> Memory tier 2= node 3
> 
>     mkdir -p /sys/fs/cgroup/unified/test
>     cd /sys/fs/cgroup/unified/test
>     echo $$ > cgroup.procs
>     head -c 500m /dev/random > /tmp/testfile
>     echo $$ > /sys/fs/cgroup/unified/cgroup.procs
>     echo "1m nodes=0-2" > memory.reclaim
> 
> In my opinion the expected behavior is for the kernel to demote 1mb of
> memory from nodes 0-2 to node 3.
> 
> Actual behavior on the tip of mm-unstable is as expected.
> 
> Actual behavior with your patch cherry-picked to mm-unstable is that
> the kernel demotes all 500mb of memory from nodes 0-2 to node 3, and
> returns -EAGAIN to the user. This may be the correct behavior you're
> intending, but it completely breaks the use case I implemented the
> nodes= arg for and listed on the commit message of that change.

Yes, strictly speaking the behavior is correct albeit unexpected. You
have told the kernel to _reclaim_ that much memory, but demotion is
simply aging handling rather than reclaim if the demotion target has a
lot of memory free. This would be the case without any nodemask as
well, btw.

I am worried this will keep popping up again and again. I thought your
nodes subset approach could deal with this, but I had overlooked one
important thing in your patch: the user provided nodemask controls
where to reclaim from, but it doesn't constrain demotion targets. Is
this intentional? Would it actually make more sense to control demotion
by adding demotion nodes into the nodemask?
  
Mina Almasry Dec. 8, 2022, 9 a.m. UTC | #14
On Thu, Dec 8, 2022 at 12:09 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 07-12-22 13:43:55, Mina Almasry wrote:
> > On Wed, Dec 7, 2022 at 3:12 AM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > Anyway a proper nr_reclaimed tracking should be rather straightforward
> > > but I do not expect to make a big difference in practice
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 026199c047e0..1b7f2d8cb128 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -1633,7 +1633,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > >         LIST_HEAD(ret_folios);
> > >         LIST_HEAD(free_folios);
> > >         LIST_HEAD(demote_folios);
> > > -       unsigned int nr_reclaimed = 0;
> > > +       unsigned int nr_reclaimed = 0, nr_demoted = 0;
> > >         unsigned int pgactivate = 0;
> > >         bool do_demote_pass;
> > >         struct swap_iocb *plug = NULL;
> > > @@ -2065,8 +2065,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > >         }
> > >         /* 'folio_list' is always empty here */
> > >
> > > -       /* Migrate folios selected for demotion */
> > > -       nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
> > > +       /*
> > > +        * Migrate folios selected for demotion.
> > > +        * Do not consider demoted pages to be reclaimed for the memcg reclaim
> > > +        * because no charges are really freed during the migration. Global
> > > +        * reclaim aims at releasing memory from nodes/zones so consider
> > > +        * demotion to reclaim memory.
> > > +        */
> > > +       nr_demoted += demote_folio_list(&demote_folios, pgdat);
> > > +       if (!cgroup_reclaim(sc))
> > > +               nr_reclaimed += nr_demoted;
> > > +
> > >         /* Folios that could not be demoted are still in @demote_folios */
> > >         if (!list_empty(&demote_folios)) {
> > >                 /* Folios which weren't demoted go back on @folio_list for retry: */
> > >
> > > [...]
> >
> > Thank you again, but this patch breaks the memory.reclaim nodes arg
> > for me. This is my test case. I run it on a machine with 2 memory
> > tiers.
> >
> > Memory tier 1= nodes 0-2
> > Memory tier 2= node 3
> >
> >     mkdir -p /sys/fs/cgroup/unified/test
> >     cd /sys/fs/cgroup/unified/test
> >     echo $$ > cgroup.procs
> >     head -c 500m /dev/random > /tmp/testfile
> >     echo $$ > /sys/fs/cgroup/unified/cgroup.procs
> >     echo "1m nodes=0-2" > memory.reclaim
> >
> > In my opinion the expected behavior is for the kernel to demote 1mb of
> > memory from nodes 0-2 to node 3.
> >
> > Actual behavior on the tip of mm-unstable is as expected.
> >
> > Actual behavior with your patch cherry-picked to mm-unstable is that
> > the kernel demotes all 500mb of memory from nodes 0-2 to node 3, and
> > returns -EAGAIN to the user. This may be the correct behavior you're
> > intending, but it completely breaks the use case I implemented the
> > nodes= arg for and listed on the commit message of that change.
>
> Yes, strictly speaking the behavior is correct albeit unexpected. You
> have told the kernel to _reclaim_ that much memory but demotion are
> simply aging handling rather than a reclaim if the demotion target has a
> lot of memory free.

Yes, by the strict definition of reclaim, you're completely correct.
But in reality I earlier proposed a patch to the kernel that disables
demotion in proactive reclaim. That would have been a correct change
by the strict definition of reclaim, but Johannes informed me that
Meta already has a dependency on proactive reclaim triggering demotion
and directed me to add a nodes= arg to memory.reclaim to trigger
demotion instead, to satisfy both use cases. It seems both we and Meta
are using this interface to trigger both reclaim and demotion, despite
the strict definition of the word?

> This would be the case without any nodemask as well
> btw.
>

The difference here is that without any nodemask, the kernel will also
reclaim from bottom tier nodes, which will count towards the user's
request, so the behavior makes sense in my opinion. If nodemask=NULL or
nodemask=ALL_NODES, then the caller is asking for reclaim across all
nodes, which should not count demoted pages, IMO. The kernel can still
look for demotion opportunities, but the demoted pages would not count
towards the user's request.
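
A rough sketch of the accounting rule I have in mind (illustrative
only, not the literal patch; nr_demoted_outside_nodemask stands for the
pages that actually left sc->nodemask, as tracked in the v3 excerpt
earlier in the thread):

	nr_demoted = demote_folio_list(&demote_folios, pgdat);
	/*
	 * Only demotions that move pages from inside sc->nodemask to a
	 * node outside of it shrink the requested domain; with a NULL
	 * or all-nodes mask no demotion qualifies, so demoted pages are
	 * not counted as reclaim.
	 */
	if (sc->nodemask)
		nr_reclaimed += nr_demoted_outside_nodemask;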

> I am worried this will popping up again and again. I thought your nodes
> subset approach could deal with this but I have overlooked one important
> thing in your patch. The user provided nodemask controls where to
> reclaim from but it doesn't constrain demotion targets. Is this
> intentional? Would it actually make more sense to control demotion by
> addint demotion nodes into the nodemask?
>

IMO, yes it is intentional, and no I would not recommend adding
demotion nodes (I assume you mean adding both demote_from_nodes and
demote_to_nodes as arg). My opinion is based on 2 reasons:

1. We control proactive reclaim by looking for nodes/tiers approaching
pressure and triggering reclaim/demotion from said nodes/tiers. So we
know the node/tier we would like to reclaim from, but we don't
necessarily have a requirement on where the memory should go. I think
it should be up to the kernel.
2. Currently I think most tiered machines will have 2 memory tiers,
but I think the code is designed to support N memory tiers. What
happens if N=3 and the user asks you to demote from the top tier nodes
to the bottom tier nodes (skipping the middle tier)? The user in this
case is explicitly asking to break the aging pipeline. From my short
while on the mailing list I see great interest in respecting the aging
pipeline, so it seems to me the demotion target should be decided by
the kernel based on the aging pipeline, and the user should not be
able to override it? I don't know. Maybe there is a valid use case for
that somewhere.

> --
> Michal Hocko
> SUSE Labs
  
Michal Hocko Dec. 8, 2022, 11:54 a.m. UTC | #15
On Thu 08-12-22 01:00:40, Mina Almasry wrote:
> On Thu, Dec 8, 2022 at 12:09 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 07-12-22 13:43:55, Mina Almasry wrote:
> > > On Wed, Dec 7, 2022 at 3:12 AM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> > > > Anyway a proper nr_reclaimed tracking should be rather straightforward
> > > > but I do not expect to make a big difference in practice
> > > >
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 026199c047e0..1b7f2d8cb128 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -1633,7 +1633,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > > >         LIST_HEAD(ret_folios);
> > > >         LIST_HEAD(free_folios);
> > > >         LIST_HEAD(demote_folios);
> > > > -       unsigned int nr_reclaimed = 0;
> > > > +       unsigned int nr_reclaimed = 0, nr_demoted = 0;
> > > >         unsigned int pgactivate = 0;
> > > >         bool do_demote_pass;
> > > >         struct swap_iocb *plug = NULL;
> > > > @@ -2065,8 +2065,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > > >         }
> > > >         /* 'folio_list' is always empty here */
> > > >
> > > > -       /* Migrate folios selected for demotion */
> > > > -       nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
> > > > +       /*
> > > > +        * Migrate folios selected for demotion.
> > > > +        * Do not consider demoted pages to be reclaimed for the memcg reclaim
> > > > +        * because no charges are really freed during the migration. Global
> > > > +        * reclaim aims at releasing memory from nodes/zones so consider
> > > > +        * demotion to reclaim memory.
> > > > +        */
> > > > +       nr_demoted += demote_folio_list(&demote_folios, pgdat);
> > > > +       if (!cgroup_reclaim(sc))
> > > > +               nr_reclaimed += nr_demoted;
> > > > +
> > > >         /* Folios that could not be demoted are still in @demote_folios */
> > > >         if (!list_empty(&demote_folios)) {
> > > >                 /* Folios which weren't demoted go back on @folio_list for retry: */
> > > >
> > > > [...]
> > >
> > > Thank you again, but this patch breaks the memory.reclaim nodes arg
> > > for me. This is my test case. I run it on a machine with 2 memory
> > > tiers.
> > >
> > > Memory tier 1= nodes 0-2
> > > Memory tier 2= node 3
> > >
> > >     mkdir -p /sys/fs/cgroup/unified/test
> > >     cd /sys/fs/cgroup/unified/test
> > >     echo $$ > cgroup.procs
> > >     head -c 500m /dev/random > /tmp/testfile
> > >     echo $$ > /sys/fs/cgroup/unified/cgroup.procs
> > >     echo "1m nodes=0-2" > memory.reclaim
> > >
> > > In my opinion the expected behavior is for the kernel to demote 1mb of
> > > memory from nodes 0-2 to node 3.
> > >
> > > Actual behavior on the tip of mm-unstable is as expected.
> > >
> > > Actual behavior with your patch cherry-picked to mm-unstable is that
> > > the kernel demotes all 500mb of memory from nodes 0-2 to node 3, and
> > > returns -EAGAIN to the user. This may be the correct behavior you're
> > > intending, but it completely breaks the use case I implemented the
> > > nodes= arg for and listed on the commit message of that change.
> >
> > Yes, strictly speaking the behavior is correct albeit unexpected. You
> > have told the kernel to _reclaim_ that much memory but demotion are
> > simply aging handling rather than a reclaim if the demotion target has a
> > lot of memory free.
> 
> Yes, by the strict definition of reclaim, you're completely correct.
> But in reality earlier I proposed a patch to the kernel that disables
> demotion in proactive reclaim. That would have been a correct change
> by the strict definition of reclaim, but Johannes informed me that
> meta already has a dependency on proactive reclaim triggering demotion
> and directed me to add a nodes= arg to memory.reclaim to trigger
> demotion instead, to satisfy both use cases. Seems both us and meta
> are using this interface to trigger both reclaim and demotion, despite
> the strict definition of the word?

Well, demotion is a part of aging and aging is a part of the reclaim,
so I believe we want both; demotion is mostly an implementation detail.
If you want very precise control then the nodemask should drive you
there.

[...]
> > I am worried this will popping up again and again. I thought your nodes
> > subset approach could deal with this but I have overlooked one important
> > thing in your patch. The user provided nodemask controls where to
> > reclaim from but it doesn't constrain demotion targets. Is this
> > intentional? Would it actually make more sense to control demotion by
> > addint demotion nodes into the nodemask?
> >
> 
> IMO, yes it is intentional, and no I would not recommend adding
> demotion nodes (I assume you mean adding both demote_from_nodes and
> demote_to_nodes as arg).

What I really mean is to add demotion nodes to the nodemask along with
the set of nodes you want to reclaim from. To me that sounds like a
more natural interface allowing for all sorts of usecases:
- free up demotion targets (only specify demotion nodes in the mask)
- control where to demote (e.g. select specific demotion target(s))
- do not demote at all (skip demotion nodes from the node mask)
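
With the 2-tier layout from your test above (nodes 0-2 are the top
tier, node 3 the demotion target), hypothetical invocations under that
semantic could look like:

    echo "1G nodes=3"   > memory.reclaim  # free up the demotion target only
    echo "1G nodes=0-3" > memory.reclaim  # reclaim from 0-2, demotion to 3 allowed
    echo "1G nodes=0-2" > memory.reclaim  # no demotion target in the mask: no demotion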

> My opinion is based on 2 reasons:
> 
> 1. We control proactive reclaim by looking for nodes/tiers approaching
> pressure and triggering reclaim/demotion from said nodes/tiers. So we
> know the node/tier we would like to reclaim from, but not necessarily
> have a requirement on where the memory should go. I think it should be
> up to the kernel.
> 2. Currently I think most tiered machines will have 2 memory tiers,
> but I think the code is designed to support N memory tiers. What
> happens if N=3 and the user asks you to demote from the top tier nodes
> to the bottom tier nodes (skipping the middle tier)? The user in this
> case is explicitly asking to break the aging pipeline. From my short
> while on the mailing list I see great interest in respecting the aging
> pipeline, so it seems to me the demotion target should be decided by
> the kernel based on the aging pipeline, and the user should not be
> able to override it? I don't know. Maybe there is a valid use case for
> that somewhere.

I agree that the aging should be preserved as much as possible unless
there is an explicit requirement to do otherwise, which might be
something application specific.

It is really hard to anticipate all the usecases at this stage, but we
should keep in mind that the semantic of the interface will get cast
into stone once it is released. As of now I do see a great confusion
point in the nodemask semantic, which pretends to allow some fine
control while being largely hard to predict, because it makes some
assumptions about the reclaim while it has very limited control of the
demotion.
  
Wei Xu Dec. 9, 2022, 12:59 a.m. UTC | #16
On Thu, Dec 8, 2022 at 3:54 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 08-12-22 01:00:40, Mina Almasry wrote:
> > On Thu, Dec 8, 2022 at 12:09 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Wed 07-12-22 13:43:55, Mina Almasry wrote:
> > > > On Wed, Dec 7, 2022 at 3:12 AM Michal Hocko <mhocko@suse.com> wrote:
> > > [...]
> > > > > Anyway a proper nr_reclaimed tracking should be rather straightforward
> > > > > but I do not expect to make a big difference in practice
> > > > >
> > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > index 026199c047e0..1b7f2d8cb128 100644
> > > > > --- a/mm/vmscan.c
> > > > > +++ b/mm/vmscan.c
> > > > > @@ -1633,7 +1633,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > > > >         LIST_HEAD(ret_folios);
> > > > >         LIST_HEAD(free_folios);
> > > > >         LIST_HEAD(demote_folios);
> > > > > -       unsigned int nr_reclaimed = 0;
> > > > > +       unsigned int nr_reclaimed = 0, nr_demoted = 0;
> > > > >         unsigned int pgactivate = 0;
> > > > >         bool do_demote_pass;
> > > > >         struct swap_iocb *plug = NULL;
> > > > > @@ -2065,8 +2065,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > > > >         }
> > > > >         /* 'folio_list' is always empty here */
> > > > >
> > > > > -       /* Migrate folios selected for demotion */
> > > > > -       nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
> > > > > +       /*
> > > > > +        * Migrate folios selected for demotion.
> > > > > +        * Do not consider demoted pages to be reclaimed for the memcg reclaim
> > > > > +        * because no charges are really freed during the migration. Global
> > > > > +        * reclaim aims at releasing memory from nodes/zones so consider
> > > > > +        * demotion to reclaim memory.
> > > > > +        */
> > > > > +       nr_demoted += demote_folio_list(&demote_folios, pgdat);
> > > > > +       if (!cgroup_reclaim(sc))
> > > > > +               nr_reclaimed += nr_demoted;
> > > > > +
> > > > >         /* Folios that could not be demoted are still in @demote_folios */
> > > > >         if (!list_empty(&demote_folios)) {
> > > > >                 /* Folios which weren't demoted go back on @folio_list for retry: */
> > > > >
> > > > > [...]
> > > >
> > > > Thank you again, but this patch breaks the memory.reclaim nodes arg
> > > > for me. This is my test case. I run it on a machine with 2 memory
> > > > tiers.
> > > >
> > > > Memory tier 1= nodes 0-2
> > > > Memory tier 2= node 3
> > > >
> > > >     mkdir -p /sys/fs/cgroup/unified/test
> > > >     cd /sys/fs/cgroup/unified/test
> > > >     echo $$ > cgroup.procs
> > > >     head -c 500m /dev/random > /tmp/testfile
> > > >     echo $$ > /sys/fs/cgroup/unified/cgroup.procs
> > > >     echo "1m nodes=0-2" > memory.reclaim
> > > >
> > > > In my opinion the expected behavior is for the kernel to demote 1mb of
> > > > memory from nodes 0-2 to node 3.
> > > >
> > > > Actual behavior on the tip of mm-unstable is as expected.
> > > >
> > > > Actual behavior with your patch cherry-picked to mm-unstable is that
> > > > the kernel demotes all 500mb of memory from nodes 0-2 to node 3, and
> > > > returns -EAGAIN to the user. This may be the correct behavior you're
> > > > intending, but it completely breaks the use case I implemented the
> > > > nodes= arg for and listed on the commit message of that change.
> > >
> > > Yes, strictly speaking the behavior is correct albeit unexpected. You
> > > have told the kernel to _reclaim_ that much memory but demotion are
> > > simply aging handling rather than a reclaim if the demotion target has a
> > > lot of memory free.
> >
> > Yes, by the strict definition of reclaim, you're completely correct.
> > But in reality earlier I proposed a patch to the kernel that disables
> > demotion in proactive reclaim. That would have been a correct change
> > by the strict definition of reclaim, but Johannes informed me that
> > meta already has a dependency on proactive reclaim triggering demotion
> > and directed me to add a nodes= arg to memory.reclaim to trigger
> > demotion instead, to satisfy both use cases. Seems both us and meta
> > are using this interface to trigger both reclaim and demotion, despite
> > the strict definition of the word?
>
> Well, demotion is a part of aging and that is a part of the reclaim so I
> believe we want both and demotion mostly an implementation detail. If
> you want to have a very precise control then the nodemask should drive
> you there.
>
> [...]
> > > I am worried this will popping up again and again. I thought your nodes
> > > subset approach could deal with this but I have overlooked one important
> > > thing in your patch. The user provided nodemask controls where to
> > > reclaim from but it doesn't constrain demotion targets. Is this
> > > intentional? Would it actually make more sense to control demotion by
> > > addint demotion nodes into the nodemask?
> > >
> >
> > IMO, yes it is intentional, and no I would not recommend adding
> > demotion nodes (I assume you mean adding both demote_from_nodes and
> > demote_to_nodes as arg).
>
> What I really mean is to add demotion nodes to the nodemask along with
> the set of nodes you want to reclaim from. To me that sounds like a
> more natural interface allowing for all sorts of usecases:
> - free up demotion targets (only specify demotion nodes in the mask)
> - control where to demote (e.g. select specific demotion target(s))
> - do not demote at all (skip demotion nodes from the node mask)

For clarification, do you mean to add another argument (e.g.
demotion_nodes) in addition to the "nodes" argument? Also, what
meaning would these arguments have?

- Option 1: The "nodes" argument specifies the source nodes where
pages should be reclaimed (but not demoted) from and the
"demotion_nodes" argument specifies the source nodes where pages
should be demoted (but not reclaimed) from.

- Option 2: The "nodes" argument specifies the source nodes where
pages should be reclaimed or demoted from and the "demotion_nodes"
argument specifies the allowed demotion target nodes.

Option 1 is more flexible in allowing various use cases. Option 2 can
also support the reclaim-only (no demotion) use case and demotion with
fallback to reclaim, but cannot support demotion-only without reclaim.
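
A hypothetical sketch of what the two options might look like (the
demotion_nodes name and the values are made up for illustration):

    # Option 1: reclaim (only) from node 3, demote (only) from nodes 0-2
    echo "1G nodes=3 demotion_nodes=0-2" > memory.reclaim

    # Option 2: reclaim or demote from nodes 0-2, with node 3 as the
    # only allowed demotion target
    echo "1G nodes=0-2 demotion_nodes=3" > memory.reclaim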

> > My opinion is based on 2 reasons:
> >
> > 1. We control proactive reclaim by looking for nodes/tiers approaching
> > pressure and triggering reclaim/demotion from said nodes/tiers. So we
> > know the node/tier we would like to reclaim from, but not necessarily
> > have a requirement on where the memory should go. I think it should be
> > up to the kernel.
> > 2. Currently I think most tiered machines will have 2 memory tiers,
> > but I think the code is designed to support N memory tiers. What
> > happens if N=3 and the user asks you to demote from the top tier nodes
> > to the bottom tier nodes (skipping the middle tier)? The user in this
> > case is explicitly asking to break the aging pipeline. From my short
> > while on the mailing list I see great interest in respecting the aging
> > pipeline, so it seems to me the demotion target should be decided by
> > the kernel based on the aging pipeline, and the user should not be
> > able to override it? I don't know. Maybe there is a valid use case for
> > that somewhere.
>
> I agree that the agining should be preserved as much as possible unless
> there is an explicit requirement to do otherwise which might be
> something application specific.
>
> It is really hard to assume all the usecases at this stage but we should
> keep in mind that the semantic of the interface will get cast into stone
> once it is released. As of now I do see a great confusion point in the
> nodemask semantic which pretends to allow some fine control while it is
> largerly hard to predict because it makes some assumptions about the
> reclaim while it has a very limited control of the demotion.
>
> --
> Michal Hocko
> SUSE Labs
  
Michal Hocko Dec. 9, 2022, 8:08 a.m. UTC | #17
On Thu 08-12-22 16:59:36, Wei Xu wrote:
[...]
> > What I really mean is to add demotion nodes to the nodemask along with
> > the set of nodes you want to reclaim from. To me that sounds like a
> > more natural interface allowing for all sorts of usecases:
> > - free up demotion targets (only specify demotion nodes in the mask)
> > - control where to demote (e.g. select specific demotion target(s))
> > - do not demote at all (skip demotion nodes from the node mask)
> 
> For clarification, do you mean to add another argument (e.g.
> demotion_nodes) in addition to the "nodes" argument?

No, the nodes=mask argument should control the domain where the memory
reclaim should happen. That includes both aging and the reclaim. If the
mask doesn't contain any lower tier node then no demotion will happen.
If only a subset of the lower tiers is specified then only those could
be used for the demotion process. Or, put otherwise, the nodemask is
not only used to filter out zonelists during reclaim, it also restricts
migration targets.

Is this more clear now?
  
Wei Xu Dec. 9, 2022, 4:41 p.m. UTC | #18
On Fri, Dec 9, 2022 at 12:08 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 08-12-22 16:59:36, Wei Xu wrote:
> [...]
> > > What I really mean is to add demotion nodes to the nodemask along with
> > > the set of nodes you want to reclaim from. To me that sounds like a
> > > more natural interface allowing for all sorts of usecases:
> > > - free up demotion targets (only specify demotion nodes in the mask)
> > > - control where to demote (e.g. select specific demotion target(s))
> > > - do not demote at all (skip demotion nodes from the node mask)
> >
> > For clarification, do you mean to add another argument (e.g.
> > demotion_nodes) in addition to the "nodes" argument?
>
> No, nodes=mask argument should control the domain where the memory
> reclaim should happen. That includes both aging and the reclaim. If the
> mask doesn't contain any lower tier node then no demotion will happen.
> If only a subset of lower tiers are specified then only those could be
> used for the demotion process. Or put it otherwise, the nodemask is not
> only used to filter out zonelists during reclaim it also restricts
> migration targets.
>
> Is this more clear now?

In that case, how can we request demotion only from toptier nodes
(without counting any reclaimed bytes from other nodes),  which is our
memory tiering use case?

Besides, when both toptier and demotion nodes are specified, the
demoted pages should only be counted as aging and not be counted
towards the requested bytes of try_to_free_mem_cgroup_pages(), which
is what this patch tries to address.

> --
> Michal Hocko
> SUSE Labs
  
Michal Hocko Dec. 9, 2022, 9:16 p.m. UTC | #19
On Fri 09-12-22 08:41:47, Wei Xu wrote:
> On Fri, Dec 9, 2022 at 12:08 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 08-12-22 16:59:36, Wei Xu wrote:
> > [...]
> > > > What I really mean is to add demotion nodes to the nodemask along with
> > > > the set of nodes you want to reclaim from. To me that sounds like a
> > > > more natural interface allowing for all sorts of usecases:
> > > > - free up demotion targets (only specify demotion nodes in the mask)
> > > > - control where to demote (e.g. select specific demotion target(s))
> > > > - do not demote at all (skip demotion nodes from the node mask)
> > >
> > > For clarification, do you mean to add another argument (e.g.
> > > demotion_nodes) in addition to the "nodes" argument?
> >
> > No, nodes=mask argument should control the domain where the memory
> > reclaim should happen. That includes both aging and the reclaim. If the
> > mask doesn't contain any lower tier node then no demotion will happen.
> > If only a subset of lower tiers are specified then only those could be
> > used for the demotion process. Or put it otherwise, the nodemask is not
> > only used to filter out zonelists during reclaim it also restricts
> > migration targets.
> >
> > Is this more clear now?
> 
> In that case, how can we request demotion only from toptier nodes
> (without counting any reclaimed bytes from other nodes),  which is our
> memory tiering use case?

I am not sure I follow. Could you be more specific please?

> Besides, when both toptier and demotion nodes are specified, the
> demoted pages should only be counted as aging and not be counted
> towards the requested bytes of try_to_free_mem_cgroup_pages(), which
> is what this patch tries to address.

This should be addressed by
http://lkml.kernel.org/r/Y5B1K5zAE0PkjFZx@dhcp22.suse.cz, no?
  
Mina Almasry Dec. 9, 2022, 9:39 p.m. UTC | #20
On Fri, Dec 9, 2022 at 1:16 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 09-12-22 08:41:47, Wei Xu wrote:
> > On Fri, Dec 9, 2022 at 12:08 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 08-12-22 16:59:36, Wei Xu wrote:
> > > [...]
> > > > > What I really mean is to add demotion nodes to the nodemask along with
> > > > > the set of nodes you want to reclaim from. To me that sounds like a
> > > > > more natural interface allowing for all sorts of usecases:
> > > > > - free up demotion targets (only specify demotion nodes in the mask)
> > > > > - control where to demote (e.g. select specific demotion target(s))
> > > > > - do not demote at all (skip demotion nodes from the node mask)
> > > >
> > > > For clarification, do you mean to add another argument (e.g.
> > > > demotion_nodes) in addition to the "nodes" argument?
> > >
> > > No, nodes=mask argument should control the domain where the memory
> > > reclaim should happen. That includes both aging and the reclaim. If the
> > > mask doesn't contain any lower tier node then no demotion will happen.
> > > If only a subset of lower tiers are specified then only those could be
> > > used for the demotion process. Or put it otherwise, the nodemask is not
> > > only used to filter out zonelists during reclaim it also restricts
> > > migration targets.
> > >
> > > Is this more clear now?
> >

I think putting the demotion sources and demotion targets in the same
nodemask is a bit confusing, and prone to error. IIUC the user puts
both the demotion source and the demotion target in the nodemask, and
the kernel infers which is which depending on whether the node is a
top-tier node, or a bottom tier node. I think in the future this will
become ambiguous. What happens in the future when the machine has N
memory tiers and the user specifies a node in a middle
tier in the nodemask? Does that mean the user wants demotion from or
to this node? Middle memory tiers can act as both...

I think if your goal is to constrain demotion targets then a much more
clear and future-proof way is to simply add a second arg to
memory.reclaim "allowed_demotion_targets=".

> > In that case, how can we request demotion only from toptier nodes
> > (without counting any reclaimed bytes from other nodes),  which is our
> > memory tiering use case?
>
> I am not sure I follow. Could you be more specific please?
>
> > Besides, when both toptier and demotion nodes are specified, the
> > demoted pages should only be counted as aging and not be counted
> > towards the requested bytes of try_to_free_mem_cgroup_pages(), which
> > is what this patch tries to address.
>
> This should be addressed by
> http://lkml.kernel.org/r/Y5B1K5zAE0PkjFZx@dhcp22.suse.cz, no?

I think I provided a test case in [1] showing very clearly that this
breaks one of our use cases, i.e. the use case where the user is
asking to demote X bytes from the top tier nodes to the lower tier
nodes. I would not like to proceed with a fix that breaks one of our
use cases. I believe I provided in this patch a fix that caters to all
existing users, and we should take the fix in this patch over a fix
that breaks use cases.

[1] https://lore.kernel.org/all/CAHS8izMKK107wVFSJvg36nQ=WzXd8_cjYBtR0p47L+XLYUSsqA@mail.gmail.com/

> --
> Michal Hocko
> SUSE Labs
  
Wei Xu Dec. 10, 2022, 8:01 a.m. UTC | #21
On Fri, Dec 9, 2022 at 1:16 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 09-12-22 08:41:47, Wei Xu wrote:
> > On Fri, Dec 9, 2022 at 12:08 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 08-12-22 16:59:36, Wei Xu wrote:
> > > [...]
> > > > > What I really mean is to add demotion nodes to the nodemask along with
> > > > > the set of nodes you want to reclaim from. To me that sounds like a
> > > > > more natural interface allowing for all sorts of usecases:
> > > > > - free up demotion targets (only specify demotion nodes in the mask)
> > > > > - control where to demote (e.g. select specific demotion target(s))
> > > > > - do not demote at all (skip demotion nodes from the node mask)
> > > >
> > > > For clarification, do you mean to add another argument (e.g.
> > > > demotion_nodes) in addition to the "nodes" argument?
> > >
> > > No, nodes=mask argument should control the domain where the memory
> > > reclaim should happen. That includes both aging and the reclaim. If the
> > > mask doesn't contain any lower tier node then no demotion will happen.
> > > If only a subset of lower tiers are specified then only those could be
> > > used for the demotion process. Or put it otherwise, the nodemask is not
> > > only used to filter out zonelists during reclaim it also restricts
> > > migration targets.
> > >
> > > Is this more clear now?
> >
> > In that case, how can we request demotion only from toptier nodes
> > (without counting any reclaimed bytes from other nodes),  which is our
> > memory tiering use case?
>
> I am not sure I follow. Could you be more specific please?

In our memory tiering use case, we would like to proactively free up
memory on top-tier nodes by demoting cold pages to lower-tier nodes.
This is to create enough free top-tier memory for new allocations and
promotions.  How many pages and how often to demote from top-tier
nodes can depend on a number of factors (e.g. the amount of free
top-tier memory, the amount of cold pages, the bandwidth pressure on
lower-tier nodes, the tasks' tolerance for slower memory performance) and
are controlled by the userspace policies.

Because the purpose of such proactive demotions is to free up top-tier
memory, not to lower the amount of memory charged to the memcg, we'd
like that memory.reclaim can demote the specified amount of bytes from
the given top-tier nodes.  If we have to also provide the lower-tier
nodes to memory.reclaim to allow demotions, the kernel can reclaim
from the lower-tier nodes in the same memory.reclaim request. We then
won't be able to control the amount of bytes to be demoted from
top-tier nodes.
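
As a concrete sketch of what such a userspace policy boils down to under
the semantics of this patch (the cgroup path, node id and amount below are
examples only; node 0 is assumed to be a top-tier node):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/fs/cgroup/job/memory.reclaim", "w");

	if (!f)
		return 1;

	/*
	 * Demote ~100M off top-tier node 0.  With this patch, pages demoted
	 * out of the nodes= mask count towards the requested amount, and no
	 * lower-tier node has to be listed, so nothing is reclaimed from the
	 * lower tiers.
	 */
	fprintf(f, "100m nodes=0\n");

	return fclose(f) ? 1 : 0;
}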

> > Besides, when both toptier and demotion nodes are specified, the
> > demoted pages should only be counted as aging and not be counted
> > towards the requested bytes of try_to_free_mem_cgroup_pages(), which
> > is what this patch tries to address.
>
> This should be addressed by
> http://lkml.kernel.org/r/Y5B1K5zAE0PkjFZx@dhcp22.suse.cz, no?
> --
> Michal Hocko
> SUSE Labs
  
Michal Hocko Dec. 12, 2022, 8:33 a.m. UTC | #22
On Fri 09-12-22 13:39:44, Mina Almasry wrote:
> On Fri, Dec 9, 2022 at 1:16 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 09-12-22 08:41:47, Wei Xu wrote:
> > > On Fri, Dec 9, 2022 at 12:08 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Thu 08-12-22 16:59:36, Wei Xu wrote:
> > > > [...]
> > > > > > What I really mean is to add demotion nodes to the nodemask along with
> > > > > > the set of nodes you want to reclaim from. To me that sounds like a
> > > > > > more natural interface allowing for all sorts of usecases:
> > > > > > - free up demotion targets (only specify demotion nodes in the mask)
> > > > > > - control where to demote (e.g. select specific demotion target(s))
> > > > > > - do not demote at all (skip demotion nodes from the node mask)
> > > > >
> > > > > For clarification, do you mean to add another argument (e.g.
> > > > > demotion_nodes) in addition to the "nodes" argument?
> > > >
> > > > No, nodes=mask argument should control the domain where the memory
> > > > reclaim should happen. That includes both aging and the reclaim. If the
> > > > mask doesn't contain any lower tier node then no demotion will happen.
> > > > If only a subset of lower tiers are specified then only those could be
> > > > used for the demotion process. Or put it otherwise, the nodemask is not
> > > > only used to filter out zonelists during reclaim it also restricts
> > > > migration targets.
> > > >
> > > > Is this more clear now?
> > >
> 
> I think putting the demotion sources and demotion targets in the same
> nodemask is a bit confusing, and prone to error. IIUC the user puts
> both the demotion source and the demotion target in the nodemask, and
> the kernel infers which is which depending on whether the node is a
> top-tier node, or a bottom tier node.

The level of the tier is not really all that important. All you request
from the userspace is to free up some charges based on aging (it is a
reclaim interface). The specified nodemask doesn't and cannot really
tell anything about how that is achieved because that is an internal
implementation detail. The nodemask extension defines a domain where
that aging and reclaim should happen. Treating demotion as special and
assuming a certain behavior is a free ticket to future headaches.

> I think in the future this will
> become ambiguous. What happens in the future when the machine has N
> memory tiers and the user specifies a node in a middle
> tier in the nodemask? Does that mean the user wants demotion from or
> to this node? Middle memory tiers can act as both...

It's really very simple. The memory reclaim (both aging and actual
eviction) is bound to the given nodemask. Full stop. See how that
doesn't say anything about demotion? It simply is not important.

Say you have a multi-layer system with nodes 0-3 with RAM, 4, 5 with
tier 1 and nodes 6, 7 as a second tier with the following topology.
RAM nodes can demote to any lower tiers. Node 4 can demote to 6 and node
5 to node 7. Now say you want to release some memory from nodes 2 and 3,
and also free some memory globally, yet node 5 should be
avoided for whatever reason. Reclaiming with nodemask=2,3,4,6 will
achieve exactly that without any ambiguity. Each node in the node mask
will get aged and demotion targets will be filtered based on the
nodemask as well so you cannot really end up aging (and demoting from)
nodes 5, 7 even if they are part of the demotion path. This is simply
impossible if the demotion target is implicit and it can live outside of
the provided nodemask. Are you on the same page that this would be
problematic?
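
To make the filtering concrete, a small sketch of the example above
(illustrative only; it simply reuses the existing node_get_allowed_targets()
and nodes_and() helpers and is not part of the posted patch):

nodemask_t request = NODE_MASK_NONE;
nodemask_t targets;

/* The request from the example: nodes 2, 3, 4 and 6. */
node_set(2, request);
node_set(3, request);
node_set(4, request);
node_set(6, request);

/* RAM node 2 can demote to any lower tier in this example: {4, 5, 6, 7}. */
node_get_allowed_targets(NODE_DATA(2), &targets);

/* {4,5,6,7} & {2,3,4,6} = {4,6}: nodes 5 and 7 can never be used. */
nodes_and(targets, targets, request);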

> I think if your goal is to constrain demotion targets then a much more
> clear and future-proof way is to simply add a second arg to
> memory.reclaim "allowed_demotion_targets=".

That would be an option as well but what is the exact semantic? Do you
control the demotion path as well?

> > > In that case, how can we request demotion only from toptier nodes
> > > (without counting any reclaimed bytes from other nodes),  which is our
> > > memory tiering use case?
> >
> > I am not sure I follow. Could you be more specific please?
> >
> > > Besides, when both toptier and demotion nodes are specified, the
> > > demoted pages should only be counted as aging and not be counted
> > > towards the requested bytes of try_to_free_mem_cgroup_pages(), which
> > > is what this patch tries to address.
> >
> > This should be addressed by
> > http://lkml.kernel.org/r/Y5B1K5zAE0PkjFZx@dhcp22.suse.cz, no?
> 
> I think I provided a test case in [1] showing very clearly that this
> breaks one of our use cases, i.e. the use case where the user is
> asking to demote X bytes from the top tier nodes to the lower tier
> nodes.

Are you referring to the EAGAIN situation as a result of the nr_reclaimed
patch? Please note that the patch was untested and likely incomplete.
It is also somewhat independent of the actual nodemask reclaim
extension. Btw. based on these discussions I have to retract my ack and
reopen the discussion (will follow up on the original email thread). The
current semantic is really concerning.
  
Michal Hocko Dec. 12, 2022, 8:36 a.m. UTC | #23
On Sat 10-12-22 00:01:28, Wei Xu wrote:
> On Fri, Dec 9, 2022 at 1:16 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 09-12-22 08:41:47, Wei Xu wrote:
> > > On Fri, Dec 9, 2022 at 12:08 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Thu 08-12-22 16:59:36, Wei Xu wrote:
> > > > [...]
> > > > > > What I really mean is to add demotion nodes to the nodemask along with
> > > > > > the set of nodes you want to reclaim from. To me that sounds like a
> > > > > > more natural interface allowing for all sorts of usecases:
> > > > > > - free up demotion targets (only specify demotion nodes in the mask)
> > > > > > - control where to demote (e.g. select specific demotion target(s))
> > > > > > - do not demote at all (skip demotion nodes from the node mask)
> > > > >
> > > > > For clarification, do you mean to add another argument (e.g.
> > > > > demotion_nodes) in addition to the "nodes" argument?
> > > >
> > > > No, nodes=mask argument should control the domain where the memory
> > > > reclaim should happen. That includes both aging and the reclaim. If the
> > > > mask doesn't contain any lower tier node then no demotion will happen.
> > > > If only a subset of lower tiers are specified then only those could be
> > > > used for the demotion process. Or put it otherwise, the nodemask is not
> > > > only used to filter out zonelists during reclaim it also restricts
> > > > migration targets.
> > > >
> > > > Is this more clear now?
> > >
> > > In that case, how can we request demotion only from toptier nodes
> > > (without counting any reclaimed bytes from other nodes),  which is our
> > > memory tiering use case?
> >
> > I am not sure I follow. Could you be more specific please?
> 
> In our memory tiering use case, we would like to proactively free up
> memory on top-tier nodes by demoting cold pages to lower-tier nodes.
> This is to create enough free top-tier memory for new allocations and
> promotions.  How many pages and how often to demote from top-tier
> nodes can depend on a number of factors (e.g. the amount of free
> top-tier memory, the amount of cold pages, the bandwidth pressure on
> lower-tier nodes, the tasks' tolerance for slower memory performance) and
> are controlled by the userspace policies.
> 
> Because the purpose of such proactive demotions is to free up top-tier
> memory, not to lower the amount of memory charged to the memcg, we'd
> like that memory.reclaim can demote the specified amount of bytes from
> the given top-tier nodes.  If we have to also provide the lower-tier
> nodes to memory.reclaim to allow demotions, the kernel can reclaim
> from the lower-tier nodes in the same memory.reclaim request. We then
> won't be able to control the amount of bytes to be demoted from
> top-tier nodes.

I am not sure this is something to be handled by the reclaim interface
because now you are creating an ambiguity about what the interface should
do and starting to depend on it. Consider that if we change the reclaim
algorithm in the future, the node you request to demote from will simply
reclaim rather than demote. This will break your usecase, right?
  

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2b42ac9ad755..f324e80395c3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1555,13 +1555,18 @@  static void folio_check_dirty_writeback(struct folio *folio,
 		mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
 }

+struct demotion_control {
+	struct migration_target_control *mtc;
+	nodemask_t *demote_from_nodemask;
+	unsigned long nr_demoted_outside_nodemask;
+};
+
 static struct page *alloc_demote_page(struct page *page, unsigned long private)
 {
 	struct page *target_page;
 	nodemask_t *allowed_mask;
-	struct migration_target_control *mtc;
-
-	mtc = (struct migration_target_control *)private;
+	struct demotion_control *dc = (struct demotion_control *)private;
+	struct migration_target_control *mtc = dc->mtc;

 	allowed_mask = mtc->nmask;
 	/*
@@ -1576,13 +1581,31 @@  static struct page *alloc_demote_page(struct page *page, unsigned long private)
 	mtc->nmask = NULL;
 	mtc->gfp_mask |= __GFP_THISNODE;
 	target_page = alloc_migration_target(page, (unsigned long)mtc);
-	if (target_page)
+	if (!target_page) {
+		mtc->gfp_mask &= ~__GFP_THISNODE;
+		mtc->nmask = allowed_mask;
+		target_page = alloc_migration_target(page, (unsigned long)mtc);
+	}
+
+	if (!target_page)
 		return target_page;

-	mtc->gfp_mask &= ~__GFP_THISNODE;
-	mtc->nmask = allowed_mask;
+	if (dc->demote_from_nodemask &&
+	    !node_isset(page_to_nid(target_page), *dc->demote_from_nodemask))
+		dc->nr_demoted_outside_nodemask++;

-	return alloc_migration_target(page, (unsigned long)mtc);
+	return target_page;
+}
+
+void free_demote_page(struct page *page, unsigned long private)
+{
+	struct demotion_control *dc = (struct demotion_control *)private;
+
+	if (dc->demote_from_nodemask &&
+	    !node_isset(page_to_nid(page), *dc->demote_from_nodemask))
+		dc->nr_demoted_outside_nodemask--;
+
+	folio_put(page_folio(page));
 }

 /*
@@ -1590,7 +1613,8 @@  static struct page *alloc_demote_page(struct page *page, unsigned long private)
  * Folios which are not demoted are left on @demote_folios.
  */
 static unsigned int demote_folio_list(struct list_head *demote_folios,
-				     struct pglist_data *pgdat)
+				      struct pglist_data *pgdat,
+				      nodemask_t *nodemask)
 {
 	int target_nid = next_demotion_node(pgdat->node_id);
 	unsigned int nr_succeeded;
@@ -1608,6 +1632,12 @@  static unsigned int demote_folio_list(struct list_head *demote_folios,
 		.nmask = &allowed_mask
 	};

+	struct demotion_control dc = {
+		.mtc = &mtc,
+		.demote_from_nodemask = nodemask,
+		.nr_demoted_outside_nodemask = 0,
+	};
+
 	if (list_empty(demote_folios))
 		return 0;

@@ -1617,13 +1647,13 @@  static unsigned int demote_folio_list(struct list_head *demote_folios,
 	node_get_allowed_targets(pgdat, &allowed_mask);

 	/* Demotion ignores all cpuset and mempolicy settings */
-	migrate_pages(demote_folios, alloc_demote_page, NULL,
-		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
+	migrate_pages(demote_folios, alloc_demote_page, free_demote_page,
+		      (unsigned long)&dc, MIGRATE_ASYNC, MR_DEMOTION,
 		      &nr_succeeded);

 	__count_vm_events(PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);

-	return nr_succeeded;
+	return dc.nr_demoted_outside_nodemask;
 }

 static bool may_enter_fs(struct folio *folio, gfp_t gfp_mask)
@@ -1643,7 +1673,12 @@  static bool may_enter_fs(struct folio *folio, gfp_t gfp_mask)
 }

 /*
- * shrink_folio_list() returns the number of reclaimed pages
+ * shrink_folio_list() returns the number of reclaimed pages.
+ *
+ * Demoted pages are counted as reclaimed iff:
+ *   (a) sc->nodemask arg is provided.
+ *   (b) page has been demoted from a node inside sc->nodemask to a node
+ *   outside sc->nodemask.
  */
 static unsigned int shrink_folio_list(struct list_head *folio_list,
 		struct pglist_data *pgdat, struct scan_control *sc,
@@ -1653,6 +1688,7 @@  static unsigned int shrink_folio_list(struct list_head *folio_list,
 	LIST_HEAD(free_folios);
 	LIST_HEAD(demote_folios);
 	unsigned int nr_reclaimed = 0;
+	unsigned int nr_demoted_outside_nodemask = 0;
 	unsigned int pgactivate = 0;
 	bool do_demote_pass;
 	struct swap_iocb *plug = NULL;
@@ -2085,7 +2121,12 @@  static unsigned int shrink_folio_list(struct list_head *folio_list,
 	/* 'folio_list' is always empty here */

 	/* Migrate folios selected for demotion */
-	nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
+	nr_demoted_outside_nodemask =
+		demote_folio_list(&demote_folios, pgdat, sc->nodemask);
+
+	if (sc->nodemask)
+		nr_reclaimed += nr_demoted_outside_nodemask;
+
 	/* Folios that could not be demoted are still in @demote_folios */
 	if (!list_empty(&demote_folios)) {
 		/* Folios which weren't demoted go back on @folio_list */
@@ -2130,9 +2171,11 @@  static unsigned int shrink_folio_list(struct list_head *folio_list,
 unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 					   struct list_head *folio_list)
 {
+	nodemask_t nodemask = NODE_MASK_NONE;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
 		.may_unmap = 1,
+		.nodemask = &nodemask
 	};
 	struct reclaim_stat stat;
 	unsigned int nr_reclaimed;
@@ -2140,6 +2183,12 @@  unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 	LIST_HEAD(clean_folios);
 	unsigned int noreclaim_flag;

+	/*
+	 * Set the nodemask in sc to indicate to shrink_folio_list() that we're
+	 * looking for reclaim from this node.
+	 */
+	node_set(zone->zone_pgdat->node_id, nodemask);
+
 	list_for_each_entry_safe(folio, next, folio_list, lru) {
 		if (!folio_test_hugetlb(folio) && folio_is_file_lru(folio) &&
 		    !folio_test_dirty(folio) && !__folio_test_movable(folio) &&
@@ -7031,12 +7080,20 @@  static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
 	unsigned long zone_boosts[MAX_NR_ZONES] = { 0, };
 	bool boosted;
 	struct zone *zone;
+	nodemask_t nodemask = NODE_MASK_NONE;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
 		.order = order,
 		.may_unmap = 1,
+		.nodemask = &nodemask,
 	};

+	/*
+	 * Set the nodemask in sc to indicate to kswapd_shrink_node() that we're
+	 * looking for reclaim from this node.
+	 */
+	node_set(pgdat->node_id, nodemask);
+
 	set_task_reclaim_state(current, &sc.reclaim_state);
 	psi_memstall_enter(&pflags);
 	__fs_reclaim_acquire(_THIS_IP_);
@@ -7642,6 +7699,7 @@  static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	const unsigned long nr_pages = 1 << order;
 	struct task_struct *p = current;
 	unsigned int noreclaim_flag;
+	nodemask_t nodemask = NODE_MASK_NONE;
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
 		.gfp_mask = current_gfp_context(gfp_mask),
@@ -7651,9 +7709,16 @@  static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
 		.may_swap = 1,
 		.reclaim_idx = gfp_zone(gfp_mask),
+		.nodemask = &nodemask,
 	};
 	unsigned long pflags;

+	/*
+	 * Set the nodemask in sc to indicate to shrink_node() that we're
+	 * looking for reclaim from this node.
+	 */
+	node_set(pgdat->node_id, nodemask);
+
 	trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, order,
 					   sc.gfp_mask);