[V2] mm: madvise: fix uneven accounting of psi

Message ID 1687861992-8722-1-git-send-email-quic_charante@quicinc.com
State New

Commit Message

Charan Teja Kalla June 27, 2023, 10:33 a.m. UTC
A folio gets the Workingset flag set when:
1) shrink_active_list() moves the folio from the active to the inactive list.
2) A workingset transition happens during the folio's refault.

When Workingset is set on a folio, PSI for memory can be accounted
a) while that folio is being reclaimed and b) on refault of that folio.

This accounting of PSI for memory is not consistent in the cases where
clients use madvise(COLD/PAGEOUT) to deactivate or proactively reclaim a
folio:
a) A folio started on the inactive list and moved to the active list as
a result of accesses. Workingset is absent on the folio, thus
madvise(MADV_PAGEOUT) doesn't account such folios for PSI.

b) The same folio transitions from inactive->active and then back to
inactive through shrink_active_list(). Workingset is set on the folio,
thus madvise(MADV_PAGEOUT) accounts such folios for PSI.

c) The same folio is placed on the active list directly as a result of
a folio refault, and it was a workingset folio prior to eviction.
Workingset is set on the folio, thus both MADV_PAGEOUT and reclaim of
the MADV_COLD operated folio account for PSI.

d) madvise(MADV_COLD) transfers the folio from the active list to the
inactive list. Such folios may not have Workingset set, thus the reclaim
operation on such a folio doesn't account for PSI.

As said above, MADV_PAGEOUT on a folio accounts for memory PSI in
b) and c) but not in a). Reclaim of a folio on which MADV_COLD was
performed accounts for memory PSI in c) but not in d), which is
inconsistent behaviour. Make this PSI accounting consistent by
turning a folio into a workingset one whenever it leaves the active
list. Also, accounting PSI on a folio whenever it leaves the
active list as part of a MADV_COLD/PAGEOUT operation helps users
tell whether they are operating on the proper folios[1].

[1] https://lore.kernel.org/all/20230605180013.GD221380@cmpxchg.org/

Suggested-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: Sai Manobhiram Manapragada <quic_smanapra@quicinc.com>
Reported-by: Pavan Kondeti <quic_pkondeti@quicinc.com>
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
---
V2: Made changes as per the comments from Johannes/Suren.

V1: https://lore.kernel.org/all/1685531374-6091-1-git-send-email-quic_charante@quicinc.com/

 mm/madvise.c | 2 ++
 1 file changed, 2 insertions(+)
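
For readers less familiar with the userspace side of this changelog, the
following is a minimal sketch of how a client might issue the two hints it
discusses. The helper name hint_region() is hypothetical, the fallback values
for MADV_COLD and MADV_PAGEOUT are taken from the Linux UAPI headers, and
kernels older than 5.4 are expected to reject both hints with EINVAL.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE	/* MAP_ANONYMOUS and madvise() under strict -std= modes */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Fallbacks for older libc headers; values match the Linux UAPI headers. */
#ifndef MADV_COLD
#define MADV_COLD 20
#endif
#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif

/*
 * Hypothetical helper: map `len` bytes of anonymous memory, touch it so the
 * pages are populated, then apply `advice` (MADV_COLD or MADV_PAGEOUT).
 * Returns 0 on success or a negative errno value.
 */
static int hint_region(size_t len, int advice)
{
	int ret = 0;
	void *buf;

	assert(len > 0);
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -errno;

	memset(buf, 0xa5, len);		/* fault the pages in */
	if (madvise(buf, len, advice) != 0)
		ret = -errno;

	munmap(buf, len);
	return ret;
}
```

A caller would use hint_region(1 << 16, MADV_COLD) to ask for deactivation and
hint_region(1 << 16, MADV_PAGEOUT) to ask for proactive reclaim; a negative
return value is the errno reported by mmap() or madvise().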
  

Comments

Pavan Kondeti June 27, 2023, 1:56 p.m. UTC | #1
On Tue, Jun 27, 2023 at 04:03:12PM +0530, Charan Teja Kalla wrote:
> A folio turns into a Workingset during:
> 1) shrink_active_list() placing the folio from active to inactive list.
> 2) When a workingset transition is happening during the folio refault.
> 
> And when Workingset is set on a folio, PSI for memory can be accounted
> during a) That folio is being reclaimed and b) Refault of that folio.
> 

Please help me understand why PSI for memory (I understood it as the 
time spent in psi_memstall_enter() to psi_memstall_leave()) would be
accounted in (a) i.e during reclaim. I understand that when a working

The (b) part is very clear.

> This accounting of PSI for memory is not consistent in the cases where
> clients use madvise(COLD/PAGEOUT) to deactivate or proactively reclaim a
> folio:
> a) A folio started at inactive and moved to active as part of accesses.
> Workingset is absent on the folio thus madvise(MADV_PAGEOUT) don't
> account such folios for PSI.
> 
> b) When the same folio transition from inactive->active and then to
> inactive through shrink_active_list(). Workingset is set on the folio
> thus madvise(MADV_PAGEOUT) account such folios for PSI.
> 
> c) When the same folio is part of active list directly as a result of
> folio refault and this was a workingset folio prior to eviction.
> Workingset is set on the folio thus both the operations of MADV_PAGEOUT
> and reclaim of the MADV_COLD operated folio account for PSI.
> 
> d) madvise(MADV_COLD) transfers the folio from active list to inactive
> list. Such folios may not have the Workingset thus reclaim operation
> on such folio doesn't account for PSI.
> 
> As said above, the MADV_PAGEOUT on a folio is accounts for memory PSI in
> b) and c) but not in a). Reclaim of a folio on which MADV_COLD is
> performed accounts memory PSI in c) but not in d) which is an
> inconsistent behaviour. Make this PSI accounting always consistent by
> turning a folio into a workingset one whenever it is leaving the active
> list. Also, accounting of PSI on a folio whenever it leaves the
> active list as part of the MADV_COLD/PAGEOUT operation helps the users
> whether they are operating on proper folios[1].

I understood the problem from the V1 discussions. But the references to
"madvise account such folios for PSI" are confusing. Why would madvise(PAGEOUT)
be accounting anything related to PSI? I get that madvise() is messing
up PSI accuracy indirectly.

> 
> [1] https://lore.kernel.org/all/20230605180013.GD221380@cmpxchg.org/
> 
> Suggested-by: Suren Baghdasaryan <surenb@google.com>
> Reported-by: Sai Manobhiram Manapragada <quic_smanapra@quicinc.com>
> Reported-by: Pavan Kondeti <quic_pkondeti@quicinc.com>
> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
> ---
> V2: Made changes as per the comments from Johannes/Suren.
> 
> V1: https://lore.kernel.org/all/1685531374-6091-1-git-send-email-quic_charante@quicinc.com/
> 
>  mm/madvise.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index d9e7b42..76fb31f 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -413,6 +413,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  
>  		folio_clear_referenced(folio);
>  		folio_test_clear_young(folio);
> +		folio_set_workingset(folio);
>  		if (pageout) {
>  			if (folio_isolate_lru(folio)) {
>  				if (folio_test_unevictable(folio))
> @@ -512,6 +513,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  		 */
>  		folio_clear_referenced(folio);
>  		folio_test_clear_young(folio);
> +		folio_set_workingset(folio);
>  		if (pageout) {
>  			if (folio_isolate_lru(folio)) {
>  				if (folio_test_unevictable(folio))
> -- 
> 2.7.4
> 

This is not limited to madvise(PAGEOUT), right? Anywhere an active page
is reclaimed, we have the same problem. For example: damon_pa_pageout() and
__alloc_contig_migrate_range()->reclaim_clean_pages_from_list().

If that is the case, can we mark a folio as a workingset when it is
activated? That way, we don't have to make madvise() a special case.

Thanks,
Pavan
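
The indirect effect discussed in this message can be illustrated with a toy
model. This is only a sketch under the changelog's own description: madvise()
does not account PSI itself, but the Workingset flag carried by a folio at
eviction decides whether its later refault is counted as a memory stall. All
toy_* names are hypothetical and none of this is kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_folio {
	bool active;		/* folio sits on the active list */
	bool workingset;	/* models the Workingset flag */
	bool shadow_workingset;	/* flag value remembered at eviction */
};

/* Usual aging: leaving the active list sets Workingset. */
static void toy_shrink_active_list(struct toy_folio *f)
{
	assert(f != NULL);
	if (f->active) {
		f->active = false;
		f->workingset = true;
	}
}

/* madvise(MADV_COLD) as described before this patch: deactivate only. */
static void toy_madvise_cold_unpatched(struct toy_folio *f)
{
	assert(f != NULL);
	f->active = false;
}

/* Eviction remembers the flag, as a shadow entry would. */
static void toy_evict(struct toy_folio *f)
{
	assert(f != NULL);
	f->shadow_workingset = f->workingset;
	f->workingset = false;
	f->active = false;
}

/* Refault restores the flag; returns true if the model counts a stall. */
static bool toy_refault_counts_stall(struct toy_folio *f)
{
	assert(f != NULL);
	f->workingset = f->shadow_workingset;
	f->active = f->workingset;
	return f->workingset;
}
```

In this model, an active folio aged by toy_shrink_active_list() and then
evicted reports a stall on refault, while the same folio deactivated through
toy_madvise_cold_unpatched() does not, which is the inconsistency the patch
targets.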
  
Johannes Weiner June 27, 2023, 2:46 p.m. UTC | #2
Hi Charan,

thanks for fixing this. One comment:

On Tue, Jun 27, 2023 at 04:03:12PM +0530, Charan Teja Kalla wrote:
> @@ -413,6 +413,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  
>  		folio_clear_referenced(folio);
>  		folio_test_clear_young(folio);
> +		folio_set_workingset(folio);

Unless I'm missing something, this also includes inactive pages, which
is undesirable. Shouldn't this be:

		if (folio_test_active(folio))
			folio_set_workingset(folio);

> @@ -512,6 +513,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  		 */
>  		folio_clear_referenced(folio);
>  		folio_test_clear_young(folio);
> +		folio_set_workingset(folio);

Here as well.
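
The difference between the posted hunk and the suggested check can be sketched
in a small userspace model. The struct and both toy_* functions are
hypothetical stand-ins for the folio flags and for folio_set_workingset(); the
sketch only shows which folios end up marked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_flags {
	bool active;		/* models folio_test_active() */
	bool workingset;	/* models the Workingset flag */
};

/* V2 as posted: every folio the page-table walk visits is marked. */
static void toy_mark_v2(struct toy_flags *f)
{
	assert(f != NULL);
	f->workingset = true;
}

/* With the suggested check: only folios leaving the active list. */
static void toy_mark_guarded(struct toy_flags *f)
{
	assert(f != NULL);
	if (f->active)
		f->workingset = true;
}
```

An inactive folio passed to toy_mark_v2() ends up with workingset set, whereas
toy_mark_guarded() leaves it untouched and only marks folios whose active bit
is set.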
  
Charan Teja Kalla June 28, 2023, 10:49 a.m. UTC | #3
Hi Pavan,

On 6/27/2023 7:26 PM, Pavan Kondeti wrote:
>> A folio turns into a Workingset during:
>> 1) shrink_active_list() placing the folio from active to inactive list.
>> 2) When a workingset transition is happening during the folio refault.
>>
>> And when Workingset is set on a folio, PSI for memory can be accounted
>> during a) That folio is being reclaimed and b) Refault of that folio.
>>
> Please help me understand why PSI for memory (I understood it as the 
> time spent in psi_memstall_enter() to psi_memstall_leave()) would be
> accounted in (a) i.e during reclaim. I understand that when a working
> 
> The (b) part is very clear.
> 
I meant to say that, for usual reclaim, PSI is accounted on a folio
during both reclaim and the refault operation when Workingset is set
on the folio, i.e., both the a) and b) cases above.

>> This accounting of PSI for memory is not consistent in the cases where
>> clients use madvise(COLD/PAGEOUT) to deactivate or proactively reclaim a
>> folio:

Seems I need to be explicit here. How about the below?

This accounting of PSI for memory is not consistent for reclaim +
refault operations between usual reclaim and madvise(COLD/PAGEOUT), which
deactivates or proactively reclaims a folio:

Let me know if there is a better rephrasing.
>> a) A folio started at inactive and moved to active as part of accesses.
>> Workingset is absent on the folio thus madvise(MADV_PAGEOUT) don't
>> account such folios for PSI.
>>
>> b) When the same folio transition from inactive->active and then to
>> inactive through shrink_active_list(). Workingset is set on the folio
>> thus madvise(MADV_PAGEOUT) account such folios for PSI.
>>
>> c) When the same folio is part of active list directly as a result of
>> folio refault and this was a workingset folio prior to eviction.
>> Workingset is set on the folio thus both the operations of MADV_PAGEOUT
>> and reclaim of the MADV_COLD operated folio account for PSI.
>>
>> d) madvise(MADV_COLD) transfers the folio from active list to inactive
>> list. Such folios may not have the Workingset thus reclaim operation
>> on such folio doesn't account for PSI.
> This is not limited to madvise(PAGEOUT) right, anywhere an active page
> is reclaimed we have the same problem. For ex: damon_pa_pageout() and
> __alloc_contig_migrate_range()->reclaim_clean_pages_from_list().
> If that is the case, can we set mark a folio as a workingset when it is
> activated? That way, we don't have make madvise() as a special case?
I think marking the folio as workingset while it sits on the active list
is not correct. For the same example you mentioned, a simple CMA
allocation will drop the clean pages instead of migrating them. PSI
accounting on refault of those pages doesn't reveal anything to the user.

Whereas in the madvise() cases, this PSI tells the user about the type
of pages they are working on.[1]

BTW, damon_pa_pageout() seems a valid case above. Let me fix it in the
next patch.

[1]https://lore.kernel.org/all/20230605180013.GD221380@cmpxchg.org/
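
For anyone wanting to watch the signal referred to above from userspace,
memory PSI is exposed in /proc/pressure/memory as "some" and "full" lines of
the form "some avg10=0.00 avg60=0.00 avg300=0.00 total=0", with total in
microseconds. The following is a minimal sketch of a reader; the struct and
function names are hypothetical, and the file is only present on kernels built
with PSI support.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct psi_line {
	char kind[8];			/* "some" or "full" */
	double avg10, avg60, avg300;	/* stall percentages */
	unsigned long long total_us;	/* cumulative stall time */
};

/* Parse one line of a pressure file; returns 0 on success, -1 otherwise. */
static int parse_psi_line(const char *line, struct psi_line *out)
{
	int n;

	assert(line != NULL && out != NULL);
	n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		   out->kind, &out->avg10, &out->avg60, &out->avg300,
		   &out->total_us);
	return n == 5 ? 0 : -1;
}

/* Read the "some" line of /proc/pressure/memory; returns 0 or -1. */
static int read_memory_psi_some(struct psi_line *out)
{
	char buf[256];
	int ret = -1;
	FILE *fp = fopen("/proc/pressure/memory", "r");

	if (!fp)
		return -1;
	while (fgets(buf, sizeof(buf), fp)) {
		if (parse_psi_line(buf, out) == 0 &&
		    strcmp(out->kind, "some") == 0) {
			ret = 0;
			break;
		}
	}
	fclose(fp);
	return ret;
}
```

Sampling read_memory_psi_some() before and after a madvise(MADV_PAGEOUT) plus
re-access cycle and comparing total_us is one way a user could check whether
the refaults were counted.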
  
Charan Teja Kalla June 28, 2023, 10:50 a.m. UTC | #4
Thanks Johannes!!

On 6/27/2023 8:16 PM, Johannes Weiner wrote:
> Unless I'm missing something, this also includes inactive pages, which
> is undesirable. Shouldn't this be:
> 
> 		if (folio_test_active(folio))

My bad. Let me fix it.

> 			folio_set_workingset(folio);
  
Pavan Kondeti June 29, 2023, 5:07 a.m. UTC | #5
On Wed, Jun 28, 2023 at 04:19:01PM +0530, Charan Teja Kalla wrote:
> Hi Pavan,
> 
> On 6/27/2023 7:26 PM, Pavan Kondeti wrote:
> >> A folio turns into a Workingset during:
> >> 1) shrink_active_list() placing the folio from active to inactive list.
> >> 2) When a workingset transition is happening during the folio refault.
> >>
> >> And when Workingset is set on a folio, PSI for memory can be accounted
> >> during a) That folio is being reclaimed and b) Refault of that folio.
> >>
> > Please help me understand why PSI for memory (I understood it as the 
> > time spent in psi_memstall_enter() to psi_memstall_leave()) would be
> > accounted in (a) i.e during reclaim. I understand that when a working
> > 
> > The (b) part is very clear.
> > 
> I meant to say, for usual reclaim, PSI is accounted on a folio for both
> reclaim and as well during the refault operation when Workingset is set
> on a folio i.e., both a) and b) cases above.
> 

Got it.

> >> This accounting of PSI for memory is not consistent in the cases where
> >> clients use madvise(COLD/PAGEOUT) to deactivate or proactively reclaim a
> >> folio:
> 
> Seems I need to be explicit here. How about the below?
> 
> This accounting of PSI for memory is not consistent for reclaim +
> refault operation between usual reclaim and madvise(COLD/PAGEOUT) which
> deactivate or proactively reclaim a folio:
> 

Looks good.

> lmk for any better rephrasing?
> >> a) A folio started at inactive and moved to active as part of accesses.
> >> Workingset is absent on the folio thus madvise(MADV_PAGEOUT) don't
> >> account such folios for PSI.
> >>
> >> b) When the same folio transition from inactive->active and then to
> >> inactive through shrink_active_list(). Workingset is set on the folio
> >> thus madvise(MADV_PAGEOUT) account such folios for PSI.
> >>
> >> c) When the same folio is part of active list directly as a result of
> >> folio refault and this was a workingset folio prior to eviction.
> >> Workingset is set on the folio thus both the operations of MADV_PAGEOUT
> >> and reclaim of the MADV_COLD operated folio account for PSI.
> >>
> >> d) madvise(MADV_COLD) transfers the folio from active list to inactive
> >> list. Such folios may not have the Workingset thus reclaim operation
> >> on such folio doesn't account for PSI.
> > This is not limited to madvise(PAGEOUT) right, anywhere an active page
> > is reclaimed we have the same problem. For ex: damon_pa_pageout() and
> > __alloc_contig_migrate_range()->reclaim_clean_pages_from_list().
> >> If that is the case, can we set mark a folio as a workingset when it is
> > activated? That way, we don't have make madvise() as a special case?
> I think marking the folio as a workingset when it sits on the active is
> not a correct thing. For the same example you mentioned, a simple CMA
> allocation will be dropping the clean pages instead of migration. PSI
> accounting on refault of those pages don't reveal anything to the user.
> 

Agreed. Thanks for the clarification.

> Where as in the madvise() cases, this PSI tells the user about the type
> of pages that he is working on.[1]
> 
> BTW, damon_pa_pageout() seems a valid case above. let me fix it in the
> next patch.

Looks good.

Thanks,
Pavan
  
Charan Teja Kalla June 30, 2023, 1:16 p.m. UTC | #6
Hi Pavan,

On 6/28/2023 4:19 PM, Charan Teja Kalla wrote:
> I think marking the folio as a workingset when it sits on the active is
> not a correct thing. For the same example you mentioned, a simple CMA
> allocation will be dropping the clean pages instead of migration. PSI
> accounting on refault of those pages don't reveal anything to the user.
> 
> Where as in the madvise() cases, this PSI tells the user about the type
> of pages that he is working on.[1]
> 
> BTW, damon_pa_pageout() seems a valid case above. let me fix it in the
> next patch.
I did look a little more at the DAMON code and, IIUC, DAMON monitors
the ranges it is asked to operate on as regions and operates (reclaims)
on the regions that have fewer accesses. IOW, DAMON won't do the
pageout operation on a folio if it is really in use. CMIW.

This is unlike the madvise() case, where Workingset helps in accounting
PSI, which tells the user about the type of folios he is operating on.

Assume that DAMON is operating on the wrong set of regions and Workingset
helps in raising PSI. This is of no help to the user and only exposes
the internals of DAMON. No?

Having said that, theoretically it seems correct to me to set Workingset
on folios as they leave the active list, but I don't have any strong
reason to say what happens if we don't.

Moreover, this patch is mostly about madvise()-operated folios not being
in line with usual reclaim. Maybe a separate change can be raised for
DAMON-operated folios once we agree upon the importance of Workingset
to these folios. WDYT?

Thanks,
  

Patch

diff --git a/mm/madvise.c b/mm/madvise.c
index d9e7b42..76fb31f 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -413,6 +413,7 @@  static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 
 		folio_clear_referenced(folio);
 		folio_test_clear_young(folio);
+		folio_set_workingset(folio);
 		if (pageout) {
 			if (folio_isolate_lru(folio)) {
 				if (folio_test_unevictable(folio))
@@ -512,6 +513,7 @@  static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		 */
 		folio_clear_referenced(folio);
 		folio_test_clear_young(folio);
+		folio_set_workingset(folio);
 		if (pageout) {
 			if (folio_isolate_lru(folio)) {
 				if (folio_test_unevictable(folio))