mm: madvise: fix uneven accounting of psi

Message ID 1685531374-6091-1-git-send-email-quic_charante@quicinc.com
State New
Headers
Series mm: madvise: fix uneven accounting of psi |

Commit Message

Charan Teja Kalla May 31, 2023, 11:09 a.m. UTC
  A folio turns into a Workingset during:
1) shrink_active_list() placing the folio from active to inactive list.
2) When a workingset transition is happening during the folio refault.

And when Workingset is set on a folio, PSI for memory can be accounted
during a) That folio is being reclaimed and b) Refault of that folio.

There exists clients who can do the proactive reclaim using the system
calls like madvise(), whose folios can be safely treated as inactive
folios assuming the client knows that these folios are not needed in the
near future thus wanted to reclaim them. For such folios psi is not
accounted uniformly:
a) A folio started at inactive and moved to active as part of accesses.
Workingset is absent on the folio thus madvise(MADV_PAGEOUT) don't
account such folios for PSI.

b) When the same folio transition from inactive->active and then to
inactive through shrink_active_list(). Workingset is set on the folio
thus madvise(MADV_PAGEOUT) account such folios for PSI.

c) When the same folio is part of active list directly as a result of
folio refault and this was a workingset folio prior to eviction.
Workingset is set on the folio thus madvise(MADV_PAGEOUT) account such
folios for PSI.

As said about the MADV_PAGEOUT on a folio is accounted in b) and c) but
not in a) which is inconsistent. Remove this inconsistency by always not
considering the PSI for folios that are getting reclaimed through
madvise(MADV_PAGEOUT) by clearing the Workingset on a folio. This
consistency of clearing the workingset was chosen under the assumption
that client knows these folios are not in active use thus reclaiming
them hence not eligible as workingset folios. Probably it is the same
reason why workingset is not set on a folio through MADV_COLD but during
the shrink_active_list() though both the actions make the folio put onto
the inactive list.

This patch is tested on Android, Snapdragon SoC with 8Gb RAM, 4GB swap
mounted on zram which has 2GB of backingdev. The test case involved
launching some memory hungry apps in an order and do the proactive
reclaim for the app that went to background using madvise(MADV_PAGEOUT).
We are seeing ~40% less total values of psi mem some and full when this
patch is combined with [1].

[1]https://lore.kernel.org/all/20220214214921.419687-1-hannes@cmpxchg.org/T/#u

Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
---
 mm/madvise.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)
  

Comments

Andrew Morton May 31, 2023, 10:04 p.m. UTC | #1
On Wed, 31 May 2023 16:39:34 +0530 Charan Teja Kalla <quic_charante@quicinc.com> wrote:

> A folio turns into a Workingset during:
> 1) shrink_active_list() placing the folio from active to inactive list.
> 2) When a workingset transition is happening during the folio refault.
> 
> And when Workingset is set on a folio, PSI for memory can be accounted
> during a) That folio is being reclaimed and b) Refault of that folio.
> 
> There exists clients who can do the proactive reclaim using the system
> calls like madvise(), whose folios can be safely treated as inactive
> folios assuming the client knows that these folios are not needed in the
> near future thus wanted to reclaim them. For such folios psi is not
> accounted uniformly:
> a) A folio started at inactive and moved to active as part of accesses.
> Workingset is absent on the folio thus madvise(MADV_PAGEOUT) don't
> account such folios for PSI.
> 
> b) When the same folio transition from inactive->active and then to
> inactive through shrink_active_list(). Workingset is set on the folio
> thus madvise(MADV_PAGEOUT) account such folios for PSI.
> 
> c) When the same folio is part of active list directly as a result of
> folio refault and this was a workingset folio prior to eviction.
> Workingset is set on the folio thus madvise(MADV_PAGEOUT) account such
> folios for PSI.
> 
> As said about the MADV_PAGEOUT on a folio is accounted in b) and c) but
> not in a) which is inconsistent. Remove this inconsistency by always not
> considering the PSI for folios that are getting reclaimed through
> madvise(MADV_PAGEOUT) by clearing the Workingset on a folio. This
> consistency of clearing the workingset was chosen under the assumption
> that client knows these folios are not in active use thus reclaiming
> them hence not eligible as workingset folios. Probably it is the same
> reason why workingset is not set on a folio through MADV_COLD but during
> the shrink_active_list() though both the actions make the folio put onto
> the inactive list.
> 
> This patch is tested on Android, Snapdragon SoC with 8Gb RAM, 4GB swap
> mounted on zram which has 2GB of backingdev. The test case involved
> launching some memory hungry apps in an order and do the proactive
> reclaim for the app that went to background using madvise(MADV_PAGEOUT).
> We are seeing ~40% less total values of psi mem some and full when this
> patch is combined with [1].

Does this accounting inaccuracy have any perceptible runtime effects?
  
Johannes Weiner May 31, 2023, 10:19 p.m. UTC | #2
On Wed, May 31, 2023 at 04:39:34PM +0530, Charan Teja Kalla wrote:
> This patch is tested on Android, Snapdragon SoC with 8Gb RAM, 4GB swap
> mounted on zram which has 2GB of backingdev. The test case involved
> launching some memory hungry apps in an order and do the proactive
> reclaim for the app that went to background using madvise(MADV_PAGEOUT).
> We are seeing ~40% less total values of psi mem some and full when this
> patch is combined with [1].

Does that mean those pages are thrashing, but because you clear their
workingset it isn't detected and reported via psi?

I don't rally get why silencing the thrashing is an improvement.

> [1]https://lore.kernel.org/all/20220214214921.419687-1-hannes@cmpxchg.org/T/#u
> 
> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
> ---
>  mm/madvise.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 340125d..3410c39 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -409,8 +409,10 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  			if (folio_isolate_lru(folio)) {
>  				if (folio_test_unevictable(folio))
>  					folio_putback_lru(folio);
> -				else
> +				else {
> +					folio_clear_workingset(folio);
>  					list_add(&folio->lru, &folio_list);
> +				}
>  			}
>  		} else
>  			folio_deactivate(folio);
> @@ -503,8 +505,10 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  			if (folio_isolate_lru(folio)) {
>  				if (folio_test_unevictable(folio))
>  					folio_putback_lru(folio);
> -				else
> +				else {
> +					folio_clear_workingset(folio);
>  					list_add(&folio->lru, &folio_list);
> +				}
>  			}
>  		} else
>  			folio_deactivate(folio);
> -- 
> 2.7.4
>
  
Charan Teja Kalla June 1, 2023, 1:07 p.m. UTC | #3
Thanks Johannes for taking a look at it..

On 6/1/2023 3:49 AM, Johannes Weiner wrote:
> On Wed, May 31, 2023 at 04:39:34PM +0530, Charan Teja Kalla wrote:
>> This patch is tested on Android, Snapdragon SoC with 8Gb RAM, 4GB swap
>> mounted on zram which has 2GB of backingdev. The test case involved
>> launching some memory hungry apps in an order and do the proactive
>> reclaim for the app that went to background using madvise(MADV_PAGEOUT).
>> We are seeing ~40% less total values of psi mem some and full when this
>> patch is combined with [1].
> Does that mean those pages are thrashing, but because you clear their
> workingset it isn't detected and reported via psi?
> 

Seems I didn't mention the usecase clearly and let me correct it. Say we
have the Android apps A, B, C, ... H and launching of these apps goes
like below.

1) Launch app A.
2) Launch app B.
3) Launch app C. At this time, we consider the memory used by app A is
no more in active use thus proactively reclaim them where we do issue
MADV_PAGEOUT on anon regions only thus these pages goes to swap mounted
on zram and subsequently into the backing dev attached to the zram.
4) Launch app D.. Proactively reclaim the anon regions of App B into
swap and through to backing dev.
5) Now make the app A to foreground. This can read the pages from the
swap + backing dev (because of the step 3)) that belongs to app A and
also proactively reclaim anon regions of app C.
6) Launch E --> proactive reclaim of app D to zram swap + backing dev.
7) Make App B to foreground --> Read memory of app B from swap +
backingdev and as well reclaim the anon regions of app A.
8) Like wise launches of apps F, C, G, D, H, E .....

If we look at steps 5, 7,..., we are just making the apps foreground
which folios (if were marked as workingset) can contribute to PSI events
through swap_readpage(). But in reality, these are not the real
workingset folios (I think it is safe to say this), because it is the
user who decided the reclaim of these pages by issuing the MADV_PAGEOUT
which he knows that these are not going to be needed in the near future
thus reclaim them.

I think the other way to look at the problem is the user can write a
simple application where he can do  MADV_PAGEOUT and read them back in a
loop. If at any point, folios user working on turns out to be a
workingset( he can just be probabilistic here), the PSI events will be
triggered though there may not be real memory pressure in the system.


> I don't rally get why silencing the thrashing is an improvement.
> 
Agree that we shouldn't be really silence the thrashing. My point is we
shouldn't be  considering the folios as thrashing If those were getting
reclaim by the user him self through MADV_PAGEOUT under the assumption
that __user knows they are not real working set__.  Please let me know
if I am not making sense here.

>> [1]https://lore.kernel.org/all/20220214214921.419687-1-hannes@cmpxchg.org/T/#u
>>
>> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
  
Charan Teja Kalla June 1, 2023, 1:18 p.m. UTC | #4
Thanks Andrew!!

On 6/1/2023 3:34 AM, Andrew Morton wrote:
>> This patch is tested on Android, Snapdragon SoC with 8Gb RAM, 4GB swap
>> mounted on zram which has 2GB of backingdev. The test case involved
>> launching some memory hungry apps in an order and do the proactive
>> reclaim for the app that went to background using madvise(MADV_PAGEOUT).
>> We are seeing ~40% less total values of psi mem some and full when this
>> patch is combined with [1].
> Does this accounting inaccuracy have any perceptible runtime effects?
> 
With the testcase mentioned, we are able to achieve better app
concurrency(It is a measure on Android of how many apps that can survive
by not getting killed by lmkd(which relies on PSI events to take the
decissions) by the end of the testcase).

The total memsome and memfull psi events(measured in usec) are as below:
mem PSI	  	 w/o patch	with patch
some		61255313	21867736
full		31353138	15391454

Thanks,
Charan
>
  
Johannes Weiner June 5, 2023, 6 p.m. UTC | #5
Hi Charan,

On Thu, Jun 01, 2023 at 06:37:50PM +0530, Charan Teja Kalla wrote:
> Thanks Johannes for taking a look at it..
> 
> On 6/1/2023 3:49 AM, Johannes Weiner wrote:
> > On Wed, May 31, 2023 at 04:39:34PM +0530, Charan Teja Kalla wrote:
> >> This patch is tested on Android, Snapdragon SoC with 8Gb RAM, 4GB swap
> >> mounted on zram which has 2GB of backingdev. The test case involved
> >> launching some memory hungry apps in an order and do the proactive
> >> reclaim for the app that went to background using madvise(MADV_PAGEOUT).
> >> We are seeing ~40% less total values of psi mem some and full when this
> >> patch is combined with [1].
> > Does that mean those pages are thrashing, but because you clear their
> > workingset it isn't detected and reported via psi?
> > 
> 
> Seems I didn't mention the usecase clearly and let me correct it. Say we
> have the Android apps A, B, C, ... H and launching of these apps goes
> like below.
> 
> 1) Launch app A.
> 2) Launch app B.
> 3) Launch app C. At this time, we consider the memory used by app A is
> no more in active use thus proactively reclaim them where we do issue
> MADV_PAGEOUT on anon regions only thus these pages goes to swap mounted
> on zram and subsequently into the backing dev attached to the zram.
> 4) Launch app D.. Proactively reclaim the anon regions of App B into
> swap and through to backing dev.
> 5) Now make the app A to foreground. This can read the pages from the
> swap + backing dev (because of the step 3)) that belongs to app A and
> also proactively reclaim anon regions of app C.
> 6) Launch E --> proactive reclaim of app D to zram swap + backing dev.
> 7) Make App B to foreground --> Read memory of app B from swap +
> backingdev and as well reclaim the anon regions of app A.
> 8) Like wise launches of apps F, C, G, D, H, E .....
> 
> If we look at steps 5, 7,..., we are just making the apps foreground
> which folios (if were marked as workingset) can contribute to PSI events
> through swap_readpage(). But in reality, these are not the real
> workingset folios (I think it is safe to say this), because it is the
> user who decided the reclaim of these pages by issuing the MADV_PAGEOUT
> which he knows that these are not going to be needed in the near future
> thus reclaim them.
> 
> I think the other way to look at the problem is the user can write a
> simple application where he can do  MADV_PAGEOUT and read them back in a
> loop. If at any point, folios user working on turns out to be a
> workingset( he can just be probabilistic here), the PSI events will be
> triggered though there may not be real memory pressure in the system.
>
> > I don't rally get why silencing the thrashing is an improvement.
> > 
> Agree that we shouldn't be really silence the thrashing. My point is we
> shouldn't be  considering the folios as thrashing If those were getting
> reclaim by the user him self through MADV_PAGEOUT under the assumption
> that __user knows they are not real working set__.  Please let me know
> if I am not making sense here.

I'm not sure I agree with this. I think it misses the point of what
the madvise is actually for.

The workingset is defined based on access frequency and available
memory. Thrashing is defined as having to read pages back shortly
after their eviction.

MADV_PAGEOUT is for the application to inform the kernel that it's
done accessing the pages, so that the kernel can accelerate their
eviction over other pages that may still be in use. This is ultimately
meant to REDUCE reclaim and paging.

However, in this case, the MADVISE_PAGEOUT evicts pages that are
reused after and then refault. It INCREASED reclaim and paging.

Surely that's a problem? And the system would have behaved better
without the madvise() in the first place?

In fact, I would argue that the pressure spike is a great signal for
detecting overzealous madvising. If you're redefining the workingset
from access frequency to "whatever the user is saying", that will take
away an important mechanism to detect advise bugs and unnecessary IO.
  
Charan Teja Kalla June 6, 2023, 2:53 p.m. UTC | #6
Thanks Johannes for the detailed review comments...

On 6/5/2023 11:30 PM, Johannes Weiner wrote:
>> Agree that we shouldn't be really silence the thrashing. My point is we
>> shouldn't be  considering the folios as thrashing If those were getting
>> reclaim by the user him self through MADV_PAGEOUT under the assumption
>> that __user knows they are not real working set__.  Please let me know
>> if I am not making sense here.
> I'm not sure I agree with this. I think it misses the point of what
> the madvise is actually for.
> 
> The workingset is defined based on access frequency and available
> memory. Thrashing is defined as having to read pages back shortly
> after their eviction.
> 
> MADV_PAGEOUT is for the application to inform the kernel that it's
> done accessing the pages, so that the kernel can accelerate their
> eviction over other pages that may still be in use. This is ultimately
> meant to REDUCE reclaim and paging.
> 
> However, in this case, the MADVISE_PAGEOUT evicts pages that are
> reused after and then refault. It INCREASED reclaim and paging.
> 
I agree here...
> Surely that's a problem? And the system would have behaved better
> without the madvise() in the first place?
> 
Yes, the system behavior could be much better without this PAGEOUT
operation...
> In fact, I would argue that the pressure spike is a great signal for
> detecting overzealous madvising. If you're redefining the workingset
> from access frequency to "whatever the user is saying", that will take
> away an important mechanism to detect advise bugs and unnecessary IO.
currently wanted to do the PAGEOUT operation but what information lacks
is if I am really operating on the workingset pages. Had the client
knows that he is operating on the workingset pages, he could have backed
off from madvising.

I now note that I shouldn't be defining the workingset from "whatever
user is saying". But then, IMO, there should be a way from the kernel to
the user that his madvise operation is being performed on the workingset
pages.

One way the user can do is monitoring the PSI events while PAGEOUT is
being performed and he may exclude those VMA's from next time.

Alternatively kernel itself can support it may be through like
MADV_PAGEOUT_INACTIVE which doesn't pageout the Workingset pages.

Please let me know your opinion about this interface.

This has the usecase on android where it just assumes that 2nd
background app will most likely to be not used in the future thus
reclaim those app pages. It works well for most of the times but such
assumption will go wrong with the usecase I had mentioned.

--Thanks.
  
Suren Baghdasaryan June 6, 2023, 7:48 p.m. UTC | #7
Hi Charan,

On Tue, Jun 6, 2023 at 7:54 AM Charan Teja Kalla
<quic_charante@quicinc.com> wrote:
>
> Thanks Johannes for the detailed review comments...
>
> On 6/5/2023 11:30 PM, Johannes Weiner wrote:
> >> Agree that we shouldn't be really silence the thrashing. My point is we
> >> shouldn't be  considering the folios as thrashing If those were getting
> >> reclaim by the user him self through MADV_PAGEOUT under the assumption
> >> that __user knows they are not real working set__.  Please let me know
> >> if I am not making sense here.
> > I'm not sure I agree with this. I think it misses the point of what
> > the madvise is actually for.
> >
> > The workingset is defined based on access frequency and available
> > memory. Thrashing is defined as having to read pages back shortly
> > after their eviction.
> >
> > MADV_PAGEOUT is for the application to inform the kernel that it's
> > done accessing the pages, so that the kernel can accelerate their
> > eviction over other pages that may still be in use. This is ultimately
> > meant to REDUCE reclaim and paging.
> >
> > However, in this case, the MADVISE_PAGEOUT evicts pages that are
> > reused after and then refault. It INCREASED reclaim and paging.
> >
> I agree here...
> > Surely that's a problem? And the system would have behaved better
> > without the madvise() in the first place?
> >
> Yes, the system behavior could be much better without this PAGEOUT
> operation...
> > In fact, I would argue that the pressure spike is a great signal for
> > detecting overzealous madvising. If you're redefining the workingset
> > from access frequency to "whatever the user is saying", that will take
> > away an important mechanism to detect advise bugs and unnecessary IO.
> currently wanted to do the PAGEOUT operation but what information lacks
> is if I am really operating on the workingset pages. Had the client
> knows that he is operating on the workingset pages, he could have backed
> off from madvising.
>
> I now note that I shouldn't be defining the workingset from "whatever
> user is saying". But then, IMO, there should be a way from the kernel to
> the user that his madvise operation is being performed on the workingset
> pages.
>
> One way the user can do is monitoring the PSI events while PAGEOUT is
> being performed and he may exclude those VMA's from next time.
>
> Alternatively kernel itself can support it may be through like
> MADV_PAGEOUT_INACTIVE which doesn't pageout the Workingset pages.
>
> Please let me know your opinion about this interface.
>
> This has the usecase on android where it just assumes that 2nd
> background app will most likely to be not used in the future thus
> reclaim those app pages. It works well for most of the times but such
> assumption will go wrong with the usecase I had mentioned.

Hi Folks. Sorry for being late to the party.
Yeah, userspace does not have a crystal ball to predict future user
behavior, so there will always be pathological cases when usual
assumptions and resulting madvise() would make things worse.

I think this discussion can be split into several questions/issues:
1. Inconsistency in how madvise(MADV_PAGEOUT) would affect PSI
calculation when the page is refaulted, based on the path it took
before being evicted by madvise(). In your initial description case
(a) is inconsistent with (b) and (c) and it's probably worth fixing.
IMHO (a) should be made consistent with others, not the other way
around. My reasoning is that page was expelled from the active list,
so it was part of the active workingset.

2. Whether refaults caused by incorrect madvise(MADV_PAGEOUT) should
be counted as workingset refault and affect PSI.
This one I think is trickier. IMHO it should be counted as workingset
refault simply because it was refaulted and it was part of the
workingset. Whether it should affect PSI, which is supposed to be an
indicator of "pressure" is, I think, debatable. With madvise() in the
mix, refault might happen without any real memory pressure... So, the
answer is not obvious to me.

3. Should refaults caused by incorrect madvise(MADV_PAGEOUT) be
distinguished from the ones which were evicted by kernel reclaim
mechanisms.
I can see use for that from userspace to detect incorrect madvise()
and adjust its aggressiveness. I think the API might get a bit complex
because of the need to associate refaults with specific madvise()/VMAs
to understand which hint was incorrect and adjust the behavior.

Hope my feedback is useful and if we can improve Android's userspace
behavior, I'm happy to help make that happen.
Thanks,
Suren.

>
> --Thanks.
  
Charan Teja Kalla June 9, 2023, 12:42 p.m. UTC | #8
Thanks Suren & Johannes,

On 6/7/2023 1:18 AM, Suren Baghdasaryan wrote:
> Hi Folks. Sorry for being late to the party.
> Yeah, userspace does not have a crystal ball to predict future user
> behavior, so there will always be pathological cases when usual
> assumptions and resulting madvise() would make things worse.
> 
> I think this discussion can be split into several questions/issues:
> 1. Inconsistency in how madvise(MADV_PAGEOUT) would affect PSI
> calculation when the page is refaulted, based on the path it took
> before being evicted by madvise(). In your initial description case
> (a) is inconsistent with (b) and (c) and it's probably worth fixing.
> IMHO (a) should be made consistent with others, not the other way
> around. My reasoning is that page was expelled from the active list,
> so it was part of the active workingset.
> 
That means we should be setting Workingset on the page while it is on
the active list and when it is being pageout through madvising. Right? I
see, this makes it consistent.

On the same note, discussing with Suren offline, Should the refaulted
madvise pages start always at the inactive list? If they are really
active, they get promoted anyway..

> 2. Whether refaults caused by incorrect madvise(MADV_PAGEOUT) should
> be counted as workingset refault and affect PSI.
> This one I think is trickier. IMHO it should be counted as workingset
> refault simply because it was refaulted and it was part of the
> workingset. Whether it should affect PSI, which is supposed to be an
> indicator of "pressure" is, I think, debatable. With madvise() in the
> mix, refault might happen without any real memory pressure... So, the
> answer is not obvious to me.
> 
> 3. Should refaults caused by incorrect madvise(MADV_PAGEOUT) be
> distinguished from the ones which were evicted by kernel reclaim
> mechanisms.
> I can see use for that from userspace to detect incorrect madvise()
> and adjust its aggressiveness. I think the API might get a bit complex
> because of the need to associate refaults with specific madvise()/VMAs
> to understand which hint was incorrect and adjust the behavior.
> Instead what is the opinion about giving an MADV_PAGEOUT_INACTIVE
interface which does operate on a page only If it is on the inactive
list and !PageWorkingset ?

> Hope my feedback is useful and if we can improve Android's userspace
> behavior, I'm happy to help make that happen.
Thanks...
  
Suren Baghdasaryan June 9, 2023, 11:13 p.m. UTC | #9
On Fri, Jun 9, 2023 at 5:42 AM Charan Teja Kalla
<quic_charante@quicinc.com> wrote:
>
> Thanks Suren & Johannes,
>
> On 6/7/2023 1:18 AM, Suren Baghdasaryan wrote:
> > Hi Folks. Sorry for being late to the party.
> > Yeah, userspace does not have a crystal ball to predict future user
> > behavior, so there will always be pathological cases when usual
> > assumptions and resulting madvise() would make things worse.
> >
> > I think this discussion can be split into several questions/issues:
> > 1. Inconsistency in how madvise(MADV_PAGEOUT) would affect PSI
> > calculation when the page is refaulted, based on the path it took
> > before being evicted by madvise(). In your initial description case
> > (a) is inconsistent with (b) and (c) and it's probably worth fixing.
> > IMHO (a) should be made consistent with others, not the other way
> > around. My reasoning is that page was expelled from the active list,
> > so it was part of the active workingset.
> >
> That means we should be setting Workingset on the page while it is on
> the active list and when it is being pageout through madvising. Right? I
> see, this makes it consistent.

This was my opinion but others might think otherwise, like I found out
in some recent conversations. So, it would be great to get some more
feedback before making the change.

>
> On the same note, discussing with Suren offline, Should the refaulted
> madvise pages start always at the inactive list? If they are really
> active, they get promoted anyway..
>
> > 2. Whether refaults caused by incorrect madvise(MADV_PAGEOUT) should
> > be counted as workingset refault and affect PSI.
> > This one I think is trickier. IMHO it should be counted as workingset
> > refault simply because it was refaulted and it was part of the
> > workingset. Whether it should affect PSI, which is supposed to be an
> > indicator of "pressure" is, I think, debatable. With madvise() in the
> > mix, refault might happen without any real memory pressure... So, the
> > answer is not obvious to me.
> >
> > 3. Should refaults caused by incorrect madvise(MADV_PAGEOUT) be
> > distinguished from the ones which were evicted by kernel reclaim
> > mechanisms.
> > I can see use for that from userspace to detect incorrect madvise()
> > and adjust its aggressiveness. I think the API might get a bit complex
> > because of the need to associate refaults with specific madvise()/VMAs
> > to understand which hint was incorrect and adjust the behavior.
> > Instead what is the opinion about giving an MADV_PAGEOUT_INACTIVE
> interface which does operate on a page only If it is on the inactive
> list and !PageWorkingset ?

IOW you want a less aggressive mechanism which can be used by the
userspace to tell the kernel "I think these pages won't be used but
I'm not 100% sure, so drop them only if they are inactive"?
 I don't know how much that will help when the madvise() ends up being
wrong but maybe you can quickly experiment and tell us if the
difference is substantial?

>
> > Hope my feedback is useful and if we can improve Android's userspace
> > behavior, I'm happy to help make that happen.
> Thanks...
  
Johannes Weiner June 12, 2023, 1:40 p.m. UTC | #10
On Fri, Jun 09, 2023 at 04:13:14PM -0700, Suren Baghdasaryan wrote:
> On Fri, Jun 9, 2023 at 5:42 AM Charan Teja Kalla
> <quic_charante@quicinc.com> wrote:
> >
> > Thanks Suren & Johannes,
> >
> > On 6/7/2023 1:18 AM, Suren Baghdasaryan wrote:
> > > Hi Folks. Sorry for being late to the party.
> > > Yeah, userspace does not have a crystal ball to predict future user
> > > behavior, so there will always be pathological cases when usual
> > > assumptions and resulting madvise() would make things worse.
> > >
> > > I think this discussion can be split into several questions/issues:
> > > 1. Inconsistency in how madvise(MADV_PAGEOUT) would affect PSI
> > > calculation when the page is refaulted, based on the path it took
> > > before being evicted by madvise(). In your initial description case
> > > (a) is inconsistent with (b) and (c) and it's probably worth fixing.
> > > IMHO (a) should be made consistent with others, not the other way
> > > around. My reasoning is that page was expelled from the active list,
> > > so it was part of the active workingset.
> > >
> > That means we should be setting Workingset on the page while it is on
> > the active list and when it is being pageout through madvising. Right? I
> > see, this makes it consistent.
> 
> This was my opinion but others might think otherwise, like I found out
> in some recent conversations. So, it would be great to get some more
> feedback before making the change.

I also agree with the consistency fix: it should set Workingset when
madvise zaps pages from the active list.
  
Pavan Kondeti June 13, 2023, 3:13 a.m. UTC | #11
On Fri, Jun 09, 2023 at 06:12:28PM +0530, Charan Teja Kalla wrote:
> Thanks Suren & Johannes,
> 
> On 6/7/2023 1:18 AM, Suren Baghdasaryan wrote:
> > Hi Folks. Sorry for being late to the party.
> > Yeah, userspace does not have a crystal ball to predict future user
> > behavior, so there will always be pathological cases when usual
> > assumptions and resulting madvise() would make things worse.
> > 
> > I think this discussion can be split into several questions/issues:
> > 1. Inconsistency in how madvise(MADV_PAGEOUT) would affect PSI
> > calculation when the page is refaulted, based on the path it took
> > before being evicted by madvise(). In your initial description case
> > (a) is inconsistent with (b) and (c) and it's probably worth fixing.
> > IMHO (a) should be made consistent with others, not the other way
> > around. My reasoning is that page was expelled from the active list,
> > so it was part of the active workingset.
> > 
> That means we should be setting Workingset on the page while it is on
> the active list and when it is being pageout through madvising. Right? I
> see, this makes it consistent.
> 
> On the same note, discussing with Suren offline, Should the refaulted
> madvise pages start always at the inactive list? If they are really
> active, they get promoted anyway..
> 
Can you elaborate on the rationale why refaulted madvise pages needs to
be on inactive list? If it had not been paged out via madvise, it would
have been activated no?

Thanks,
Pavan
  
Charan Teja Kalla June 26, 2023, 2:31 p.m. UTC | #12
Hi Suren,

On 6/10/2023 4:43 AM, Suren Baghdasaryan wrote:
>>> I can see use for that from userspace to detect incorrect madvise()
>>> and adjust its aggressiveness. I think the API might get a bit complex
>>> because of the need to associate refaults with specific madvise()/VMAs
>>> to understand which hint was incorrect and adjust the behavior.
>>> Instead what is the opinion about giving an MADV_PAGEOUT_INACTIVE
>> interface which does operate on a page only If it is on the inactive
>> list and !PageWorkingset ?
> IOW you want a less aggressive mechanism which can be used by the
> userspace to tell the kernel "I think these pages won't be used but
> I'm not 100% sure, so drop them only if they are inactive"?
>  I don't know how much that will help when the madvise() ends up being
> wrong but maybe you can quickly experiment and tell us if the
> difference is substantial?

We did some extensive testing on Android and this ask is not helping us
much. I am really not sure if there is some other usecase that can
benefit from this. So, for now I just stick to your suggestion of making
the pages on the Active list as the Workingset at the time of pageout.
>

Thanks.
  
Suren Baghdasaryan June 27, 2023, 10:50 p.m. UTC | #13
On Mon, Jun 26, 2023 at 7:31 AM Charan Teja Kalla
<quic_charante@quicinc.com> wrote:
>
> Hi Suren,
>
> On 6/10/2023 4:43 AM, Suren Baghdasaryan wrote:
> >>> I can see use for that from userspace to detect incorrect madvise()
> >>> and adjust its aggressiveness. I think the API might get a bit complex
> >>> because of the need to associate refaults with specific madvise()/VMAs
> >>> to understand which hint was incorrect and adjust the behavior.
> >>> Instead what is the opinion about giving an MADV_PAGEOUT_INACTIVE
> >> interface which does operate on a page only If it is on the inactive
> >> list and !PageWorkingset ?
> > IOW you want a less aggressive mechanism which can be used by the
> > userspace to tell the kernel "I think these pages won't be used but
> > I'm not 100% sure, so drop them only if they are inactive"?
> >  I don't know how much that will help when the madvise() ends up being
> > wrong but maybe you can quickly experiment and tell us if the
> > difference is substantial?
>
> We did some extensive testing on Android and this ask is not helping us
> much. I am really not sure if there is some other usecase that can
> benefit from this. So, for now I just stick to your suggestion of making
> the pages on the Active list as the Workingset at the time of pageout.
> >

Thanks for checking that. Your plan SGTM.

>
> Thanks.
>
  

Patch

diff --git a/mm/madvise.c b/mm/madvise.c
index 340125d..3410c39 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -409,8 +409,10 @@  static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			if (folio_isolate_lru(folio)) {
 				if (folio_test_unevictable(folio))
 					folio_putback_lru(folio);
-				else
+				else {
+					folio_clear_workingset(folio);
 					list_add(&folio->lru, &folio_list);
+				}
 			}
 		} else
 			folio_deactivate(folio);
@@ -503,8 +505,10 @@  static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			if (folio_isolate_lru(folio)) {
 				if (folio_test_unevictable(folio))
 					folio_putback_lru(folio);
-				else
+				else {
+					folio_clear_workingset(folio);
 					list_add(&folio->lru, &folio_list);
+				}
 			}
 		} else
 			folio_deactivate(folio);