[v1,mm-unstable] mm: Fix memcg reclaim on memory tiered systems

Message ID 20221203011120.2361610-1-almasrymina@google.com
State New
Headers
Series [v1,mm-unstable] mm: Fix memcg reclaim on memory tiered systems |

Commit Message

Mina Almasry Dec. 3, 2022, 1:11 a.m. UTC
  commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
reclaim"") enabled demotion in memcg reclaim, which is the right thing
to do, however, I suspect it introduced a regression in the behavior of
try_to_free_mem_cgroup_pages().

The callers of try_to_free_mem_cgroup_pages() expect it to attempt to
reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage
of the cgroup should reduce by nr_pages. The callers expect
try_to_free_mem_cgroup_pages() to also return the number of pages
reclaimed, not demoted.

However, what try_to_free_mem_cgroup_pages() actually does is it
unconditionally counts demoted pages as reclaimed pages. So in practice
when it is called it will often demote nr_pages and return the number of
demoted pages to the caller. Demoted pages don't lower the memcg usage,
and so I think try_to_free_mem_cgroup_pages() is not actually doing what
the callers want it to do.

I suspect various things work suboptimally on memory systems or don't
work at all due to this:

- memory.high enforcement likely doesn't work (it just demotes nr_pages
  instead of lowering the memcg usage by nr_pages).
- try_charge_memcg() will keep retrying the charge while
  try_to_free_mem_cgroup_pages() is just demoting pages and not actually
  making any room for the charge.
- memory.reclaim has a wonky interface. It advertises to the user it
  reclaims the provided amount but it will actually demote that amount.

There may be more effects to this issue.

To fix these issues I propose shrink_folio_list() to only count pages
demoted from inside of sc->nodemask to outside of sc->nodemask as
'reclaimed'.

For callers such as reclaim_high() or try_charge_memcg() that set
sc->nodemask to NULL, try_to_free_mem_cgroup_pages() will try to
actually reclaim nr_pages and return the number of pages reclaimed. No
demoted pages would count towards the nr_pages requirement.

For callers such as memory_reclaim() that set sc->nodemask,
try_to_free_mem_cgroup_pages() will free nr_pages from that nodemask
with either reclaim or demotion.

Tested this change using memory.reclaim interface. With this change,

	echo "1m" > memory.reclaim

Will cause freeing of 1m of memory from the cgroup regardless of the
demotions happening inside.

	echo "1m nodes=0" > memory.reclaim

Will cause freeing of 1m of node 0 by demotion if a demotion target is
available, and by reclaim if no demotion target is available.

Signed-off-by: Mina Almasry <almasrymina@google.com>

---

This is developed on top of mm-unstable largely because I need the
memory.reclaim nodes= arg to test it properly.
---
 mm/vmscan.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

--
2.39.0.rc0.267.gcb52ba06e7-goog
  

Comments

Wei Xu Dec. 3, 2022, 4:14 a.m. UTC | #1
On Fri, Dec 2, 2022 at 5:11 PM Mina Almasry <almasrymina@google.com> wrote:
>
> commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> reclaim"") enabled demotion in memcg reclaim, which is the right thing
> to do, however, I suspect it introduced a regression in the behavior of
> try_to_free_mem_cgroup_pages().
>
> The callers of try_to_free_mem_cgroup_pages() expect it to attempt to
> reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage
> of the cgroup should reduce by nr_pages. The callers expect
> try_to_free_mem_cgroup_pages() to also return the number of pages
> reclaimed, not demoted.
>
> However, what try_to_free_mem_cgroup_pages() actually does is it
> unconditionally counts demoted pages as reclaimed pages. So in practice
> when it is called it will often demote nr_pages and return the number of
> demoted pages to the caller. Demoted pages don't lower the memcg usage,
> and so I think try_to_free_mem_cgroup_pages() is not actually doing what
> the callers want it to do.
>
> I suspect various things work suboptimally on memory systems or don't
> work at all due to this:
>
> - memory.high enforcement likely doesn't work (it just demotes nr_pages
>   instead of lowering the memcg usage by nr_pages).
> - try_charge_memcg() will keep retrying the charge while
>   try_to_free_mem_cgroup_pages() is just demoting pages and not actually
>   making any room for the charge.
> - memory.reclaim has a wonky interface. It advertises to the user it
>   reclaims the provided amount but it will actually demote that amount.
>
> There may be more effects to this issue.
>
> To fix these issues I propose shrink_folio_list() to only count pages
> demoted from inside of sc->nodemask to outside of sc->nodemask as
> 'reclaimed'.
>
> For callers such as reclaim_high() or try_charge_memcg() that set
> sc->nodemask to NULL, try_to_free_mem_cgroup_pages() will try to
> actually reclaim nr_pages and return the number of pages reclaimed. No
> demoted pages would count towards the nr_pages requirement.
>
> For callers such as memory_reclaim() that set sc->nodemask,
> try_to_free_mem_cgroup_pages() will free nr_pages from that nodemask
> with either reclaim or demotion.
>
> Tested this change using memory.reclaim interface. With this change,
>
>         echo "1m" > memory.reclaim
>
> Will cause freeing of 1m of memory from the cgroup regardless of the
> demotions happening inside.
>
>         echo "1m nodes=0" > memory.reclaim
>
> Will cause freeing of 1m of node 0 by demotion if a demotion target is
> available, and by reclaim if no demotion target is available.
>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
>
> ---
>
> This is developed on top of mm-unstable largely because I need the
> memory.reclaim nodes= arg to test it properly.
> ---
>  mm/vmscan.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2b42ac9ad755..8f6e993b870d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1653,6 +1653,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>         LIST_HEAD(free_folios);
>         LIST_HEAD(demote_folios);
>         unsigned int nr_reclaimed = 0;
> +       unsigned int nr_demoted = 0;
>         unsigned int pgactivate = 0;
>         bool do_demote_pass;
>         struct swap_iocb *plug = NULL;
> @@ -2085,7 +2086,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>         /* 'folio_list' is always empty here */
>
>         /* Migrate folios selected for demotion */
> -       nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
> +       nr_demoted = demote_folio_list(&demote_folios, pgdat);
> +
> +       /*
> +        * Only count demoted folios as reclaimed if we demoted them from
> +        * inside of the nodemask to outside of the nodemask, hence reclaiming
> +        * pages in the nodemask.
> +        */
> +       if (sc->nodemask && node_isset(pgdat->node_id, *sc->nodemask) &&
> +           !node_isset(next_demotion_node(pgdat->node_id), *sc->nodemask))

next_demotion_node() is just the first demotion target node. Demotion
can fall back to other allowed target nodes returned by
node_get_allowed_targets().  When the page is demoted to a fallback
node and this fallback node is in sc->nodemask, nr_demoted should not
be added into nr_reclaimed, either.

One way to address this issue is to pass sc->nodemask into
demote_folio_list() and exclude sc->nodemask from the allowed target
demotion nodes.

> +               nr_reclaimed += nr_demoted;
> +
>         /* Folios that could not be demoted are still in @demote_folios */
>         if (!list_empty(&demote_folios)) {
>                 /* Folios which weren't demoted go back on @folio_list */
> --
> 2.39.0.rc0.267.gcb52ba06e7-goog
  
Mina Almasry Dec. 4, 2022, 9:34 a.m. UTC | #2
On Fri, Dec 2, 2022 at 8:14 PM Wei Xu <weixugc@google.com> wrote:
>
> On Fri, Dec 2, 2022 at 5:11 PM Mina Almasry <almasrymina@google.com> wrote:
> >
> > commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> > reclaim"") enabled demotion in memcg reclaim, which is the right thing
> > to do, however, I suspect it introduced a regression in the behavior of
> > try_to_free_mem_cgroup_pages().
> >
> > The callers of try_to_free_mem_cgroup_pages() expect it to attempt to
> > reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage
> > of the cgroup should reduce by nr_pages. The callers expect
> > try_to_free_mem_cgroup_pages() to also return the number of pages
> > reclaimed, not demoted.
> >
> > However, what try_to_free_mem_cgroup_pages() actually does is it
> > unconditionally counts demoted pages as reclaimed pages. So in practice
> > when it is called it will often demote nr_pages and return the number of
> > demoted pages to the caller. Demoted pages don't lower the memcg usage,
> > and so I think try_to_free_mem_cgroup_pages() is not actually doing what
> > the callers want it to do.
> >
> > I suspect various things work suboptimally on memory systems or don't
> > work at all due to this:
> >
> > - memory.high enforcement likely doesn't work (it just demotes nr_pages
> >   instead of lowering the memcg usage by nr_pages).
> > - try_charge_memcg() will keep retrying the charge while
> >   try_to_free_mem_cgroup_pages() is just demoting pages and not actually
> >   making any room for the charge.
> > - memory.reclaim has a wonky interface. It advertises to the user it
> >   reclaims the provided amount but it will actually demote that amount.
> >
> > There may be more effects to this issue.
> >
> > To fix these issues I propose shrink_folio_list() to only count pages
> > demoted from inside of sc->nodemask to outside of sc->nodemask as
> > 'reclaimed'.
> >
> > For callers such as reclaim_high() or try_charge_memcg() that set
> > sc->nodemask to NULL, try_to_free_mem_cgroup_pages() will try to
> > actually reclaim nr_pages and return the number of pages reclaimed. No
> > demoted pages would count towards the nr_pages requirement.
> >
> > For callers such as memory_reclaim() that set sc->nodemask,
> > try_to_free_mem_cgroup_pages() will free nr_pages from that nodemask
> > with either reclaim or demotion.
> >
> > Tested this change using memory.reclaim interface. With this change,
> >
> >         echo "1m" > memory.reclaim
> >
> > Will cause freeing of 1m of memory from the cgroup regardless of the
> > demotions happening inside.
> >
> >         echo "1m nodes=0" > memory.reclaim
> >
> > Will cause freeing of 1m of node 0 by demotion if a demotion target is
> > available, and by reclaim if no demotion target is available.
> >
> > Signed-off-by: Mina Almasry <almasrymina@google.com>
> >
> > ---
> >
> > This is developed on top of mm-unstable largely because I need the
> > memory.reclaim nodes= arg to test it properly.
> > ---
> >  mm/vmscan.c | 13 ++++++++++++-
> >  1 file changed, 12 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 2b42ac9ad755..8f6e993b870d 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1653,6 +1653,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >         LIST_HEAD(free_folios);
> >         LIST_HEAD(demote_folios);
> >         unsigned int nr_reclaimed = 0;
> > +       unsigned int nr_demoted = 0;
> >         unsigned int pgactivate = 0;
> >         bool do_demote_pass;
> >         struct swap_iocb *plug = NULL;
> > @@ -2085,7 +2086,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >         /* 'folio_list' is always empty here */
> >
> >         /* Migrate folios selected for demotion */
> > -       nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
> > +       nr_demoted = demote_folio_list(&demote_folios, pgdat);
> > +
> > +       /*
> > +        * Only count demoted folios as reclaimed if we demoted them from
> > +        * inside of the nodemask to outside of the nodemask, hence reclaiming
> > +        * pages in the nodemask.
> > +        */
> > +       if (sc->nodemask && node_isset(pgdat->node_id, *sc->nodemask) &&
> > +           !node_isset(next_demotion_node(pgdat->node_id), *sc->nodemask))
>
> next_demotion_node() is just the first demotion target node. Demotion
> can fall back to other allowed target nodes returned by
> node_get_allowed_targets().  When the page is demoted to a fallback
> node and this fallback node is in sc->nodemask, nr_demoted should not
> be added into nr_reclaimed, either.
>

Thanks for reviewing Wei, I did indeed miss this.

> One way to address this issue is to pass sc->nodemask into
> demote_folio_list() and exclude sc->nodemask from the allowed target
> demotion nodes.
>

This makes sense to me. Applied this change and uploaded v2:
https://lore.kernel.org/linux-mm/20221204093008.2620459-1-almasrymina@google.com/T/#u

> > +               nr_reclaimed += nr_demoted;
> > +
> >         /* Folios that could not be demoted are still in @demote_folios */
> >         if (!list_empty(&demote_folios)) {
> >                 /* Folios which weren't demoted go back on @folio_list */
> > --
> > 2.39.0.rc0.267.gcb52ba06e7-goog
  
Huang, Ying Dec. 5, 2022, 2:38 a.m. UTC | #3
Mina Almasry <almasrymina@google.com> writes:

> commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> reclaim"") enabled demotion in memcg reclaim, which is the right thing
> to do, however, I suspect it introduced a regression in the behavior of
> try_to_free_mem_cgroup_pages().
>
> The callers of try_to_free_mem_cgroup_pages() expect it to attempt to
> reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage
> of the cgroup should reduce by nr_pages. The callers expect
> try_to_free_mem_cgroup_pages() to also return the number of pages
> reclaimed, not demoted.
>
> However, what try_to_free_mem_cgroup_pages() actually does is it
> unconditionally counts demoted pages as reclaimed pages. So in practice
> when it is called it will often demote nr_pages and return the number of
> demoted pages to the caller. Demoted pages don't lower the memcg usage,
> and so I think try_to_free_mem_cgroup_pages() is not actually doing what
> the callers want it to do.
>
> I suspect various things work suboptimally on memory systems or don't
> work at all due to this:
>
> - memory.high enforcement likely doesn't work (it just demotes nr_pages
>   instead of lowering the memcg usage by nr_pages).
> - try_charge_memcg() will keep retrying the charge while
>   try_to_free_mem_cgroup_pages() is just demoting pages and not actually
>   making any room for the charge.
> - memory.reclaim has a wonky interface. It advertises to the user it
>   reclaims the provided amount but it will actually demote that amount.
>
> There may be more effects to this issue.
>
> To fix these issues I propose shrink_folio_list() to only count pages
> demoted from inside of sc->nodemask to outside of sc->nodemask as
> 'reclaimed'.
>
> For callers such as reclaim_high() or try_charge_memcg() that set
> sc->nodemask to NULL, try_to_free_mem_cgroup_pages() will try to
> actually reclaim nr_pages and return the number of pages reclaimed. No
> demoted pages would count towards the nr_pages requirement.
>
> For callers such as memory_reclaim() that set sc->nodemask,
> try_to_free_mem_cgroup_pages() will free nr_pages from that nodemask
> with either reclaim or demotion.

Have you checked all callers?  For example, IIUC, in
reclaim_clean_pages_from_list(), although sc.nodemask == NULL, the
demoted pages should be counted as reclaimed.  How about count both
"demoted" and "reclaimed" in struct scan_control, and let callers to
determine how to use the number?

> Tested this change using memory.reclaim interface. With this change,
>
> 	echo "1m" > memory.reclaim
>
> Will cause freeing of 1m of memory from the cgroup regardless of the
> demotions happening inside.
>
> 	echo "1m nodes=0" > memory.reclaim

Have you tested these tests in the original kernel?  If so, whether does
the issue you suspected above occurs during testing?

Best Regards,
Huang, Ying

> Will cause freeing of 1m of node 0 by demotion if a demotion target is
> available, and by reclaim if no demotion target is available.
>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
>
> ---
>
> This is developed on top of mm-unstable largely because I need the
> memory.reclaim nodes= arg to test it properly.
> ---
>  mm/vmscan.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2b42ac9ad755..8f6e993b870d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1653,6 +1653,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  	LIST_HEAD(free_folios);
>  	LIST_HEAD(demote_folios);
>  	unsigned int nr_reclaimed = 0;
> +	unsigned int nr_demoted = 0;
>  	unsigned int pgactivate = 0;
>  	bool do_demote_pass;
>  	struct swap_iocb *plug = NULL;
> @@ -2085,7 +2086,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  	/* 'folio_list' is always empty here */
>
>  	/* Migrate folios selected for demotion */
> -	nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
> +	nr_demoted = demote_folio_list(&demote_folios, pgdat);
> +
> +	/*
> +	 * Only count demoted folios as reclaimed if we demoted them from
> +	 * inside of the nodemask to outside of the nodemask, hence reclaiming
> +	 * pages in the nodemask.
> +	 */
> +	if (sc->nodemask && node_isset(pgdat->node_id, *sc->nodemask) &&
> +	    !node_isset(next_demotion_node(pgdat->node_id), *sc->nodemask))
> +		nr_reclaimed += nr_demoted;
> +
>  	/* Folios that could not be demoted are still in @demote_folios */
>  	if (!list_empty(&demote_folios)) {
>  		/* Folios which weren't demoted go back on @folio_list */
> --
> 2.39.0.rc0.267.gcb52ba06e7-goog
  
Huang, Ying Dec. 5, 2022, 2:57 a.m. UTC | #4
Wei Xu <weixugc@google.com> writes:

> On Fri, Dec 2, 2022 at 5:11 PM Mina Almasry <almasrymina@google.com> wrote:
>>
>> commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
>> reclaim"") enabled demotion in memcg reclaim, which is the right thing
>> to do, however, I suspect it introduced a regression in the behavior of
>> try_to_free_mem_cgroup_pages().
>>
>> The callers of try_to_free_mem_cgroup_pages() expect it to attempt to
>> reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage
>> of the cgroup should reduce by nr_pages. The callers expect
>> try_to_free_mem_cgroup_pages() to also return the number of pages
>> reclaimed, not demoted.
>>
>> However, what try_to_free_mem_cgroup_pages() actually does is it
>> unconditionally counts demoted pages as reclaimed pages. So in practice
>> when it is called it will often demote nr_pages and return the number of
>> demoted pages to the caller. Demoted pages don't lower the memcg usage,
>> and so I think try_to_free_mem_cgroup_pages() is not actually doing what
>> the callers want it to do.
>>
>> I suspect various things work suboptimally on memory systems or don't
>> work at all due to this:
>>
>> - memory.high enforcement likely doesn't work (it just demotes nr_pages
>>   instead of lowering the memcg usage by nr_pages).
>> - try_charge_memcg() will keep retrying the charge while
>>   try_to_free_mem_cgroup_pages() is just demoting pages and not actually
>>   making any room for the charge.
>> - memory.reclaim has a wonky interface. It advertises to the user it
>>   reclaims the provided amount but it will actually demote that amount.
>>
>> There may be more effects to this issue.
>>
>> To fix these issues I propose shrink_folio_list() to only count pages
>> demoted from inside of sc->nodemask to outside of sc->nodemask as
>> 'reclaimed'.
>>
>> For callers such as reclaim_high() or try_charge_memcg() that set
>> sc->nodemask to NULL, try_to_free_mem_cgroup_pages() will try to
>> actually reclaim nr_pages and return the number of pages reclaimed. No
>> demoted pages would count towards the nr_pages requirement.
>>
>> For callers such as memory_reclaim() that set sc->nodemask,
>> try_to_free_mem_cgroup_pages() will free nr_pages from that nodemask
>> with either reclaim or demotion.
>>
>> Tested this change using memory.reclaim interface. With this change,
>>
>>         echo "1m" > memory.reclaim
>>
>> Will cause freeing of 1m of memory from the cgroup regardless of the
>> demotions happening inside.
>>
>>         echo "1m nodes=0" > memory.reclaim
>>
>> Will cause freeing of 1m of node 0 by demotion if a demotion target is
>> available, and by reclaim if no demotion target is available.
>>
>> Signed-off-by: Mina Almasry <almasrymina@google.com>
>>
>> ---
>>
>> This is developed on top of mm-unstable largely because I need the
>> memory.reclaim nodes= arg to test it properly.
>> ---
>>  mm/vmscan.c | 13 ++++++++++++-
>>  1 file changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 2b42ac9ad755..8f6e993b870d 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1653,6 +1653,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>         LIST_HEAD(free_folios);
>>         LIST_HEAD(demote_folios);
>>         unsigned int nr_reclaimed = 0;
>> +       unsigned int nr_demoted = 0;
>>         unsigned int pgactivate = 0;
>>         bool do_demote_pass;
>>         struct swap_iocb *plug = NULL;
>> @@ -2085,7 +2086,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>         /* 'folio_list' is always empty here */
>>
>>         /* Migrate folios selected for demotion */
>> -       nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
>> +       nr_demoted = demote_folio_list(&demote_folios, pgdat);
>> +
>> +       /*
>> +        * Only count demoted folios as reclaimed if we demoted them from
>> +        * inside of the nodemask to outside of the nodemask, hence reclaiming
>> +        * pages in the nodemask.
>> +        */
>> +       if (sc->nodemask && node_isset(pgdat->node_id, *sc->nodemask) &&
>> +           !node_isset(next_demotion_node(pgdat->node_id), *sc->nodemask))
>
> next_demotion_node() is just the first demotion target node. Demotion
> can fall back to other allowed target nodes returned by
> node_get_allowed_targets().  When the page is demoted to a fallback
> node and this fallback node is in sc->nodemask, nr_demoted should not
> be added into nr_reclaimed, either.
>
> One way to address this issue is to pass sc->nodemask into
> demote_folio_list() and exclude sc->nodemask from the allowed target
> demotion nodes.

I don't think this is a good idea.  Because this may break the fast ->
slow -> storage aging order.  A warm page in fast memory node may be
reclaimed to storage directly, instead of being demoted to the slow
memory node.

If necessary, we can account "nr_demoted" in alloc_demote_page() and
to-be-added free_demote_page().

Best Regards,
Huang, Ying

>> +               nr_reclaimed += nr_demoted;
>> +
>>         /* Folios that could not be demoted are still in @demote_folios */
>>         if (!list_empty(&demote_folios)) {
>>                 /* Folios which weren't demoted go back on @folio_list */
>> --
>> 2.39.0.rc0.267.gcb52ba06e7-goog
  
Mina Almasry Dec. 6, 2022, 12:04 a.m. UTC | #5
On Sun, Dec 4, 2022 at 6:39 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Mina Almasry <almasrymina@google.com> writes:
>
> > commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> > reclaim"") enabled demotion in memcg reclaim, which is the right thing
> > to do, however, I suspect it introduced a regression in the behavior of
> > try_to_free_mem_cgroup_pages().
> >
> > The callers of try_to_free_mem_cgroup_pages() expect it to attempt to
> > reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage
> > of the cgroup should reduce by nr_pages. The callers expect
> > try_to_free_mem_cgroup_pages() to also return the number of pages
> > reclaimed, not demoted.
> >
> > However, what try_to_free_mem_cgroup_pages() actually does is it
> > unconditionally counts demoted pages as reclaimed pages. So in practice
> > when it is called it will often demote nr_pages and return the number of
> > demoted pages to the caller. Demoted pages don't lower the memcg usage,
> > and so I think try_to_free_mem_cgroup_pages() is not actually doing what
> > the callers want it to do.
> >
> > I suspect various things work suboptimally on memory systems or don't
> > work at all due to this:
> >
> > - memory.high enforcement likely doesn't work (it just demotes nr_pages
> >   instead of lowering the memcg usage by nr_pages).
> > - try_charge_memcg() will keep retrying the charge while
> >   try_to_free_mem_cgroup_pages() is just demoting pages and not actually
> >   making any room for the charge.
> > - memory.reclaim has a wonky interface. It advertises to the user it
> >   reclaims the provided amount but it will actually demote that amount.
> >
> > There may be more effects to this issue.
> >
> > To fix these issues I propose shrink_folio_list() to only count pages
> > demoted from inside of sc->nodemask to outside of sc->nodemask as
> > 'reclaimed'.
> >
> > For callers such as reclaim_high() or try_charge_memcg() that set
> > sc->nodemask to NULL, try_to_free_mem_cgroup_pages() will try to
> > actually reclaim nr_pages and return the number of pages reclaimed. No
> > demoted pages would count towards the nr_pages requirement.
> >
> > For callers such as memory_reclaim() that set sc->nodemask,
> > try_to_free_mem_cgroup_pages() will free nr_pages from that nodemask
> > with either reclaim or demotion.
>
> Have you checked all callers?  For example, IIUC, in
> reclaim_clean_pages_from_list(), although sc.nodemask == NULL, the
> demoted pages should be counted as reclaimed.

I checked all call stacks leading to shrink_folio_list() now (at least
I hope). Here is what I think they do and how I propose to handle
them:

- reclaim_clean_pages_from_list() & __node_reclaim() & balance_pgdat()
These try to free memory from a specific node, and both demotion and
reclaim from that node should be counted. I propose these calls set
sc>nodemask = pgdat.node_id to signal to shrink_folio_list() that both
demotion and reclaim from this node should be counted.

- try_to_free_pages()
Tries to free pages from a specific nodemask. It sets sc->nodemask to
ac->nodemask. In this case pages demoted within the nodemask should
not count. Pages demoted outside of the nodemask should count, which
this patch already tries to do.

- mem_cgroup_shrink_node()
This is memcg soft limit reclaim. AFAIU only reclaim should be
counted. It already sets sc->nodemask=NULL to indicate that it
requires reclaim from all nodes and that only reclaimed memory should
be counted, which this patch already tries to do.

- try_to_free_mem_cgroup_pages()
This is covered in the commit message. Many callers set nodemask=NULL
indicating they want reclaim and demotion should not count.
memory.reclaim sets nodemask depending on the 'nodes=' arg and wants
demotion and reclaim from that nodemask.

- reclaim_folio_list()
Sets no_demotion = 1. No ambiguity here, only reclaims and counts
reclaimed pages.

If agreeable I can fix reclaim_clean_pages_from_list() &
__node_reclaim() & balance_pgdat() call sites in v3.

> How about count both
> "demoted" and "reclaimed" in struct scan_control, and let callers to
> determine how to use the number?
>

I don't think this is by itself enough. Pages demoted between 2 nodes
that are both in sc->nodemask should not count, I think. So 'demoted'
needs to be specifically pages demoted outside of the nodemask. We can
do 2 things:

1. Only allow the kernel to demote outside the nodemask (which you
don't prefer).
2. Allow the kernel to demote inside the nodemask but not count them.

I will see if I can implement #2.

> > Tested this change using memory.reclaim interface. With this change,
> >
> >       echo "1m" > memory.reclaim
> >
> > Will cause freeing of 1m of memory from the cgroup regardless of the
> > demotions happening inside.
> >
> >       echo "1m nodes=0" > memory.reclaim
>
> Have you tested these tests in the original kernel?  If so, whether does
> the issue you suspected above occurs during testing?
>

Yes. I set up a test case where I allocate 500m in a cgroup, and then do:

    echo "50m" > memory.reclaim

Without my fix, my kernel demotes 70mb and reclaims 4 mb.

With my v1 fix, my kernel demotes all memory possible and reclaims 60mb.

I will add this to the commit message in the next version.


> Best Regards,
> Huang, Ying
>
> > Will cause freeing of 1m of node 0 by demotion if a demotion target is
> > available, and by reclaim if no demotion target is available.
> >
> > Signed-off-by: Mina Almasry <almasrymina@google.com>
> >
> > ---
> >
> > This is developed on top of mm-unstable largely because I need the
> > memory.reclaim nodes= arg to test it properly.
> > ---
> >  mm/vmscan.c | 13 ++++++++++++-
> >  1 file changed, 12 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 2b42ac9ad755..8f6e993b870d 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1653,6 +1653,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >       LIST_HEAD(free_folios);
> >       LIST_HEAD(demote_folios);
> >       unsigned int nr_reclaimed = 0;
> > +     unsigned int nr_demoted = 0;
> >       unsigned int pgactivate = 0;
> >       bool do_demote_pass;
> >       struct swap_iocb *plug = NULL;
> > @@ -2085,7 +2086,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >       /* 'folio_list' is always empty here */
> >
> >       /* Migrate folios selected for demotion */
> > -     nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
> > +     nr_demoted = demote_folio_list(&demote_folios, pgdat);
> > +
> > +     /*
> > +      * Only count demoted folios as reclaimed if we demoted them from
> > +      * inside of the nodemask to outside of the nodemask, hence reclaiming
> > +      * pages in the nodemask.
> > +      */
> > +     if (sc->nodemask && node_isset(pgdat->node_id, *sc->nodemask) &&
> > +         !node_isset(next_demotion_node(pgdat->node_id), *sc->nodemask))
> > +             nr_reclaimed += nr_demoted;
> > +
> >       /* Folios that could not be demoted are still in @demote_folios */
> >       if (!list_empty(&demote_folios)) {
> >               /* Folios which weren't demoted go back on @folio_list */
> > --
> > 2.39.0.rc0.267.gcb52ba06e7-goog
>
  
Huang, Ying Dec. 6, 2022, 1:25 a.m. UTC | #6
Mina Almasry <almasrymina@google.com> writes:

> On Sun, Dec 4, 2022 at 6:39 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Mina Almasry <almasrymina@google.com> writes:
>>
>> > commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
>> > reclaim"") enabled demotion in memcg reclaim, which is the right thing
>> > to do, however, I suspect it introduced a regression in the behavior of
>> > try_to_free_mem_cgroup_pages().
>> >
>> > The callers of try_to_free_mem_cgroup_pages() expect it to attempt to
>> > reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage
>> > of the cgroup should reduce by nr_pages. The callers expect
>> > try_to_free_mem_cgroup_pages() to also return the number of pages
>> > reclaimed, not demoted.
>> >
>> > However, what try_to_free_mem_cgroup_pages() actually does is it
>> > unconditionally counts demoted pages as reclaimed pages. So in practice
>> > when it is called it will often demote nr_pages and return the number of
>> > demoted pages to the caller. Demoted pages don't lower the memcg usage,
>> > and so I think try_to_free_mem_cgroup_pages() is not actually doing what
>> > the callers want it to do.
>> >
>> > I suspect various things work suboptimally on memory systems or don't
>> > work at all due to this:
>> >
>> > - memory.high enforcement likely doesn't work (it just demotes nr_pages
>> >   instead of lowering the memcg usage by nr_pages).
>> > - try_charge_memcg() will keep retrying the charge while
>> >   try_to_free_mem_cgroup_pages() is just demoting pages and not actually
>> >   making any room for the charge.
>> > - memory.reclaim has a wonky interface. It advertises to the user it
>> >   reclaims the provided amount but it will actually demote that amount.
>> >
>> > There may be more effects to this issue.
>> >
>> > To fix these issues I propose shrink_folio_list() to only count pages
>> > demoted from inside of sc->nodemask to outside of sc->nodemask as
>> > 'reclaimed'.
>> >
>> > For callers such as reclaim_high() or try_charge_memcg() that set
>> > sc->nodemask to NULL, try_to_free_mem_cgroup_pages() will try to
>> > actually reclaim nr_pages and return the number of pages reclaimed. No
>> > demoted pages would count towards the nr_pages requirement.
>> >
>> > For callers such as memory_reclaim() that set sc->nodemask,
>> > try_to_free_mem_cgroup_pages() will free nr_pages from that nodemask
>> > with either reclaim or demotion.
>>
>> Have you checked all callers?  For example, IIUC, in
>> reclaim_clean_pages_from_list(), although sc.nodemask == NULL, the
>> demoted pages should be counted as reclaimed.
>
> I checked all call stacks leading to shrink_folio_list() now (at least
> I hope). Here is what I think they do and how I propose to handle
> them:
>
> - reclaim_clean_pages_from_list() & __node_reclaim() & balance_pgdat()
> These try to free memory from a specific node, and both demotion and
> reclaim from that node should be counted. I propose these calls set
> sc>nodemask = pgdat.node_id to signal to shrink_folio_list() that both
> demotion and reclaim from this node should be counted.
>
> - try_to_free_pages()
> Tries to free pages from a specific nodemask. It sets sc->nodemask to
> ac->nodemask. In this case pages demoted within the nodemask should
> not count. Pages demoted outside of the nodemask should count, which
> this patch already tries to do.
>
> - mem_cgroup_shrink_node()
> This is memcg soft limit reclaim. AFAIU only reclaim should be
> counted. It already sets sc->nodemask=NULL to indicate that it
> requires reclaim from all nodes and that only reclaimed memory should
> be counted, which this patch already tries to do.
>
> - try_to_free_mem_cgroup_pages()
> This is covered in the commit message. Many callers set nodemask=NULL
> indicating they want reclaim and demotion should not count.
> memory.reclaim sets nodemask depending on the 'nodes=' arg and wants
> demotion and reclaim from that nodemask.
>
> - reclaim_folio_list()
> Sets no_demotion = 1. No ambiguity here, only reclaims and counts
> reclaimed pages.
>
> If agreeable I can fix reclaim_clean_pages_from_list() &
> __node_reclaim() & balance_pgdat() call sites in v3.

Looks good to me, Thanks!

>> How about count both
>> "demoted" and "reclaimed" in struct scan_control, and let callers to
>> determine how to use the number?
>>
>
> I don't think this is by itself enough. Pages demoted between 2 nodes
> that are both in sc->nodemask should not count, I think. So 'demoted'
> needs to be specifically pages demoted outside of the nodemask.

Yes.  Maybe we can do that when we need it.  I suggest to change the
return value description in the comments of shrink_folio_list().

> We can do 2 things:
>
> 1. Only allow the kernel to demote outside the nodemask (which you
> don't prefer).
> 2. Allow the kernel to demote inside the nodemask but not count them.
>
> I will see if I can implement #2.

Thanks!

>> > Tested this change using memory.reclaim interface. With this change,
>> >
>> >       echo "1m" > memory.reclaim
>> >
>> > Will cause freeing of 1m of memory from the cgroup regardless of the
>> > demotions happening inside.
>> >
>> >       echo "1m nodes=0" > memory.reclaim
>>
>> Have you tested these tests in the original kernel?  If so, whether does
>> the issue you suspected above occurs during testing?
>>
>
> Yes. I set up a test case where I allocate 500m in a cgroup, and then do:
>
>     echo "50m" > memory.reclaim
>
> Without my fix, my kernel demotes 70mb and reclaims 4 mb.
>
> With my v1 fix, my kernel demotes all memory possible and reclaims 60mb.
>
> I will add this to the commit message in the next version.

Good!  Thanks!

Best Regards,
Huang, Ying

>>
>> > Will cause freeing of 1m of node 0 by demotion if a demotion target is
>> > available, and by reclaim if no demotion target is available.
>> >
>> > Signed-off-by: Mina Almasry <almasrymina@google.com>
>> >
>> > ---
>> >
>> > This is developed on top of mm-unstable largely because I need the
>> > memory.reclaim nodes= arg to test it properly.
>> > ---
>> >  mm/vmscan.c | 13 ++++++++++++-
>> >  1 file changed, 12 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index 2b42ac9ad755..8f6e993b870d 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -1653,6 +1653,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>> >       LIST_HEAD(free_folios);
>> >       LIST_HEAD(demote_folios);
>> >       unsigned int nr_reclaimed = 0;
>> > +     unsigned int nr_demoted = 0;
>> >       unsigned int pgactivate = 0;
>> >       bool do_demote_pass;
>> >       struct swap_iocb *plug = NULL;
>> > @@ -2085,7 +2086,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>> >       /* 'folio_list' is always empty here */
>> >
>> >       /* Migrate folios selected for demotion */
>> > -     nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
>> > +     nr_demoted = demote_folio_list(&demote_folios, pgdat);
>> > +
>> > +     /*
>> > +      * Only count demoted folios as reclaimed if we demoted them from
>> > +      * inside of the nodemask to outside of the nodemask, hence reclaiming
>> > +      * pages in the nodemask.
>> > +      */
>> > +     if (sc->nodemask && node_isset(pgdat->node_id, *sc->nodemask) &&
>> > +         !node_isset(next_demotion_node(pgdat->node_id), *sc->nodemask))
>> > +             nr_reclaimed += nr_demoted;
>> > +
>> >       /* Folios that could not be demoted are still in @demote_folios */
>> >       if (!list_empty(&demote_folios)) {
>> >               /* Folios which weren't demoted go back on @folio_list */
>> > --
>> > 2.39.0.rc0.267.gcb52ba06e7-goog
>>
  
Mina Almasry Dec. 6, 2022, 2:36 a.m. UTC | #7
On Mon, Dec 5, 2022 at 5:26 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Mina Almasry <almasrymina@google.com> writes:
>
> > On Sun, Dec 4, 2022 at 6:39 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Mina Almasry <almasrymina@google.com> writes:
> >>
> >> > commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> >> > reclaim"") enabled demotion in memcg reclaim, which is the right thing
> >> > to do, however, I suspect it introduced a regression in the behavior of
> >> > try_to_free_mem_cgroup_pages().
> >> >
> >> > The callers of try_to_free_mem_cgroup_pages() expect it to attempt to
> >> > reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage
> >> > of the cgroup should reduce by nr_pages. The callers expect
> >> > try_to_free_mem_cgroup_pages() to also return the number of pages
> >> > reclaimed, not demoted.
> >> >
> >> > However, what try_to_free_mem_cgroup_pages() actually does is it
> >> > unconditionally counts demoted pages as reclaimed pages. So in practice
> >> > when it is called it will often demote nr_pages and return the number of
> >> > demoted pages to the caller. Demoted pages don't lower the memcg usage,
> >> > and so I think try_to_free_mem_cgroup_pages() is not actually doing what
> >> > the callers want it to do.
> >> >
> >> > I suspect various things work suboptimally on memory systems or don't
> >> > work at all due to this:
> >> >
> >> > - memory.high enforcement likely doesn't work (it just demotes nr_pages
> >> >   instead of lowering the memcg usage by nr_pages).
> >> > - try_charge_memcg() will keep retrying the charge while
> >> >   try_to_free_mem_cgroup_pages() is just demoting pages and not actually
> >> >   making any room for the charge.
> >> > - memory.reclaim has a wonky interface. It advertises to the user it
> >> >   reclaims the provided amount but it will actually demote that amount.
> >> >
> >> > There may be more effects to this issue.
> >> >
> >> > To fix these issues I propose shrink_folio_list() to only count pages
> >> > demoted from inside of sc->nodemask to outside of sc->nodemask as
> >> > 'reclaimed'.
> >> >
> >> > For callers such as reclaim_high() or try_charge_memcg() that set
> >> > sc->nodemask to NULL, try_to_free_mem_cgroup_pages() will try to
> >> > actually reclaim nr_pages and return the number of pages reclaimed. No
> >> > demoted pages would count towards the nr_pages requirement.
> >> >
> >> > For callers such as memory_reclaim() that set sc->nodemask,
> >> > try_to_free_mem_cgroup_pages() will free nr_pages from that nodemask
> >> > with either reclaim or demotion.
> >>
> >> Have you checked all callers?  For example, IIUC, in
> >> reclaim_clean_pages_from_list(), although sc.nodemask == NULL, the
> >> demoted pages should be counted as reclaimed.
> >
> > I checked all call stacks leading to shrink_folio_list() now (at least
> > I hope). Here is what I think they do and how I propose to handle
> > them:
> >
> > - reclaim_clean_pages_from_list() & __node_reclaim() & balance_pgdat()
> > These try to free memory from a specific node, and both demotion and
> > reclaim from that node should be counted. I propose these calls set
> > sc>nodemask = pgdat.node_id to signal to shrink_folio_list() that both
> > demotion and reclaim from this node should be counted.
> >
> > - try_to_free_pages()
> > Tries to free pages from a specific nodemask. It sets sc->nodemask to
> > ac->nodemask. In this case pages demoted within the nodemask should
> > not count. Pages demoted outside of the nodemask should count, which
> > this patch already tries to do.
> >
> > - mem_cgroup_shrink_node()
> > This is memcg soft limit reclaim. AFAIU only reclaim should be
> > counted. It already sets sc->nodemask=NULL to indicate that it
> > requires reclaim from all nodes and that only reclaimed memory should
> > be counted, which this patch already tries to do.
> >
> > - try_to_free_mem_cgroup_pages()
> > This is covered in the commit message. Many callers set nodemask=NULL
> > indicating they want reclaim and demotion should not count.
> > memory.reclaim sets nodemask depending on the 'nodes=' arg and wants
> > demotion and reclaim from that nodemask.
> >
> > - reclaim_folio_list()
> > Sets no_demotion = 1. No ambiguity here, only reclaims and counts
> > reclaimed pages.
> >
> > If agreeable I can fix reclaim_clean_pages_from_list() &
> > __node_reclaim() & balance_pgdat() call sites in v3.
>
> Looks good to me, Thanks!
>

Thanks. Sent out v3 with the comments addressed. PTAL whenever convenient:
https://lore.kernel.org/linux-mm/20221206023406.3182800-1-almasrymina@google.com/T/#u

> >> How about count both
> >> "demoted" and "reclaimed" in struct scan_control, and let callers to
> >> determine how to use the number?
> >>
> >
> > I don't think this is by itself enough. Pages demoted between 2 nodes
> > that are both in sc->nodemask should not count, I think. So 'demoted'
> > needs to be specifically pages demoted outside of the nodemask.
>
> Yes.  Maybe we can do that when we need it.  I suggest to change the
> return value description in the comments of shrink_folio_list().
>
> > We can do 2 things:
> >
> > 1. Only allow the kernel to demote outside the nodemask (which you
> > don't prefer).
> > 2. Allow the kernel to demote inside the nodemask but not count them.
> >
> > I will see if I can implement #2.
>
> Thanks!
>
> >> > Tested this change using memory.reclaim interface. With this change,
> >> >
> >> >       echo "1m" > memory.reclaim
> >> >
> >> > Will cause freeing of 1m of memory from the cgroup regardless of the
> >> > demotions happening inside.
> >> >
> >> >       echo "1m nodes=0" > memory.reclaim
> >>
> >> Have you tested these tests in the original kernel?  If so, whether does
> >> the issue you suspected above occurs during testing?
> >>
> >
> > Yes. I set up a test case where I allocate 500m in a cgroup, and then do:
> >
> >     echo "50m" > memory.reclaim
> >
> > Without my fix, my kernel demotes 70mb and reclaims 4 mb.
> >
> > With my v1 fix, my kernel demotes all memory possible and reclaims 60mb.
> >
> > I will add this to the commit message in the next version.
>
> Good!  Thanks!
>
> Best Regards,
> Huang, Ying
>
> >>
> >> > Will cause freeing of 1m of node 0 by demotion if a demotion target is
> >> > available, and by reclaim if no demotion target is available.
> >> >
> >> > Signed-off-by: Mina Almasry <almasrymina@google.com>
> >> >
> >> > ---
> >> >
> >> > This is developed on top of mm-unstable largely because I need the
> >> > memory.reclaim nodes= arg to test it properly.
> >> > ---
> >> >  mm/vmscan.c | 13 ++++++++++++-
> >> >  1 file changed, 12 insertions(+), 1 deletion(-)
> >> >
> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> > index 2b42ac9ad755..8f6e993b870d 100644
> >> > --- a/mm/vmscan.c
> >> > +++ b/mm/vmscan.c
> >> > @@ -1653,6 +1653,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >> >       LIST_HEAD(free_folios);
> >> >       LIST_HEAD(demote_folios);
> >> >       unsigned int nr_reclaimed = 0;
> >> > +     unsigned int nr_demoted = 0;
> >> >       unsigned int pgactivate = 0;
> >> >       bool do_demote_pass;
> >> >       struct swap_iocb *plug = NULL;
> >> > @@ -2085,7 +2086,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >> >       /* 'folio_list' is always empty here */
> >> >
> >> >       /* Migrate folios selected for demotion */
> >> > -     nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
> >> > +     nr_demoted = demote_folio_list(&demote_folios, pgdat);
> >> > +
> >> > +     /*
> >> > +      * Only count demoted folios as reclaimed if we demoted them from
> >> > +      * inside of the nodemask to outside of the nodemask, hence reclaiming
> >> > +      * pages in the nodemask.
> >> > +      */
> >> > +     if (sc->nodemask && node_isset(pgdat->node_id, *sc->nodemask) &&
> >> > +         !node_isset(next_demotion_node(pgdat->node_id), *sc->nodemask))
> >> > +             nr_reclaimed += nr_demoted;
> >> > +
> >> >       /* Folios that could not be demoted are still in @demote_folios */
> >> >       if (!list_empty(&demote_folios)) {
> >> >               /* Folios which weren't demoted go back on @folio_list */
> >> > --
> >> > 2.39.0.rc0.267.gcb52ba06e7-goog
> >>
>
  

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2b42ac9ad755..8f6e993b870d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1653,6 +1653,7 @@  static unsigned int shrink_folio_list(struct list_head *folio_list,
 	LIST_HEAD(free_folios);
 	LIST_HEAD(demote_folios);
 	unsigned int nr_reclaimed = 0;
+	unsigned int nr_demoted = 0;
 	unsigned int pgactivate = 0;
 	bool do_demote_pass;
 	struct swap_iocb *plug = NULL;
@@ -2085,7 +2086,17 @@  static unsigned int shrink_folio_list(struct list_head *folio_list,
 	/* 'folio_list' is always empty here */

 	/* Migrate folios selected for demotion */
-	nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
+	nr_demoted = demote_folio_list(&demote_folios, pgdat);
+
+	/*
+	 * Only count demoted folios as reclaimed if we demoted them from
+	 * inside of the nodemask to outside of the nodemask, hence reclaiming
+	 * pages in the nodemask.
+	 */
+	if (sc->nodemask && node_isset(pgdat->node_id, *sc->nodemask) &&
+	    !node_isset(next_demotion_node(pgdat->node_id), *sc->nodemask))
+		nr_reclaimed += nr_demoted;
+
 	/* Folios that could not be demoted are still in @demote_folios */
 	if (!list_empty(&demote_folios)) {
 		/* Folios which weren't demoted go back on @folio_list */