Message ID | 1685531374-6091-1-git-send-email-quic_charante@quicinc.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:994d:0:b0:3d9:f83d:47d9 with SMTP id k13csp2802071vqr; Wed, 31 May 2023 04:23:20 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ5kkkSe6kF6TNFQWW7FkVvJJdhNAH/Q/YClcvBBRHzETUiAdWvgyhX90zRxKbDmPpkgG9+e X-Received: by 2002:a05:6358:9490:b0:125:7fa2:eb81 with SMTP id i16-20020a056358949000b001257fa2eb81mr1571951rwb.32.1685532200331; Wed, 31 May 2023 04:23:20 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1685532200; cv=none; d=google.com; s=arc-20160816; b=kWPXalooaeChKKqr69mDanXnpmC1YJRDQzQjvPwuWUXbt3LsNzj63Wl0WjicMZ6tT7 B2Ext81DddUMzEL9Im1ZEQon4SNH/JaPVMP7jOHGjoM1AW74M29dl0g7Zy4IpKNkTnzX Px68dvIEB+AcukOWv8FY5Z1Vy0ZDCpDclyVOtkN/ucLw6WV4G3VKw7ejeIYQkzAJO8JZ pwEibe0TaJB+Dix7GHUR/xojLn92isKkcYV5VvTE6sOyk7sUAleqxSFLxjKztQgahsUN ceg9nxhRXFiiQDXTGTvaZvkmFyLHv2yvAIVH1H18/8D5EI7kgirFtWHJ0nChgBOxP0vl CYzw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:mime-version:message-id:date:subject:cc:to:from :dkim-signature; bh=w0KeEXQFRBlqfa9GOpSBvyxn8umTaMqKZQuHqMXdjpA=; b=wDNtkci5MbZ+intQQ9q5v5oeGwU6q8o8Dv/CvInOif/dCeD5/87q0u84HbWAB5ISNV V2MuCJETBYqett1wY/LMFqgIhp0kf/PLi6oopPGeUoxr03744CYqIJys9g4fgdSz8uBP iGKCx4s4g9m5R53r1NKxXldFxkykJsEh5iSdVEDU++jU+S/1mSxD+M04iA173ow6bz1Z GVUIpw5e5PHwjTmIMbhUHcXdiC91PYxpZVUIx5rTkNl19NB9y4sxKx9cxRwPfpTtpMMJ bbbvM2dvFcZyO4JSaVFa0itzG31Qi9PRulwow1W04TNusBIjw24WAWA+gGI4cq/OmgHp BElw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@quicinc.com header.s=qcppdkim1 header.b=BYrFiJLH; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=quicinc.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id l6-20020a170903120600b001ae0691dfdbsi756180plh.158.2023.05.31.04.23.08; Wed, 31 May 2023 04:23:20 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@quicinc.com header.s=qcppdkim1 header.b=BYrFiJLH; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=quicinc.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235418AbjEaLKl (ORCPT <rfc822;andrewvogler123@gmail.com> + 99 others); Wed, 31 May 2023 07:10:41 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58588 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235407AbjEaLKg (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Wed, 31 May 2023 07:10:36 -0400 Received: from mx0b-0031df01.pphosted.com (mx0b-0031df01.pphosted.com [205.220.180.131]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id F24D012E for <linux-kernel@vger.kernel.org>; Wed, 31 May 2023 04:10:34 -0700 (PDT) Received: from pps.filterd (m0279871.ppops.net [127.0.0.1]) by mx0a-0031df01.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id 34V7el0v001988; Wed, 31 May 2023 11:10:21 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=quicinc.com; h=from : to : cc : subject : date : message-id : mime-version : content-type; s=qcppdkim1; bh=w0KeEXQFRBlqfa9GOpSBvyxn8umTaMqKZQuHqMXdjpA=; b=BYrFiJLHtjk0VQJVQznbTf3qdu1iPJckp2ZIih/aQTDZhMVPYozPTSYAHxhOlecPmkTR 9ZH52rjZPZCxfqFQ9K3S9vrtVb/CuyBIhXTlhU7LpScdMlFj5cmhFT9+2HtwxJ5xpyt4 9D2RvYeJjQcmetD8QkqqOmqFeBiYrS7SclkWfxtQwa9Q3XSBHcMNGgAM/9rFyTUCfKgq iD8GNsp4h4GhYWsZYxINCaXF40ziPeAdkpYhqAnJwUeTYR7OahDyiaOKSvf0j2/DXNaG cPXx83bjZKrFe5txEnDeL10GGwfCOao+zoO0sfIpSMHjB2OMrJ3lYcB+05+rkP6L/wrL MQ== Received: from nalasppmta03.qualcomm.com (Global_NAT1.qualcomm.com [129.46.96.20]) by mx0a-0031df01.pphosted.com (PPS) with ESMTPS id 3qx0sbrkeg-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 31 May 2023 11:10:21 +0000 Received: from nalasex01a.na.qualcomm.com (nalasex01a.na.qualcomm.com [10.47.209.196]) by NALASPPMTA03.qualcomm.com (8.17.1.5/8.17.1.5) with ESMTPS id 34VBAHxM016538 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 31 May 2023 11:10:18 GMT Received: from hu-charante-hyd.qualcomm.com (10.80.80.8) by nalasex01a.na.qualcomm.com (10.47.209.196) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.986.42; Wed, 31 May 2023 04:10:15 -0700 From: Charan Teja Kalla <quic_charante@quicinc.com> To: <akpm@linux-foundation.org>, <hannes@cmpxchg.org>, <minchan@kernel.org> CC: <quic_pkondeti@quicinc.com>, <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>, Charan Teja Kalla <quic_charante@quicinc.com> Subject: [PATCH] mm: madvise: fix uneven accounting of psi Date: Wed, 31 May 2023 16:39:34 +0530 Message-ID: <1685531374-6091-1-git-send-email-quic_charante@quicinc.com> X-Mailer: git-send-email 2.7.4 MIME-Version: 1.0 Content-Type: text/plain X-Originating-IP: [10.80.80.8] X-ClientProxiedBy: nasanex01b.na.qualcomm.com (10.46.141.250) To nalasex01a.na.qualcomm.com (10.47.209.196) X-QCInternal: smtphost X-Proofpoint-Virus-Version: vendor=nai engine=6200 definitions=5800 signatures=585085 X-Proofpoint-GUID: vIMLouYFIPeLAVxXQXv94qn2SwORAR8K X-Proofpoint-ORIG-GUID: vIMLouYFIPeLAVxXQXv94qn2SwORAR8K X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.254,Aquarius:18.0.957,Hydra:6.0.573,FMLib:17.11.176.26 definitions=2023-05-31_06,2023-05-31_01,2023-05-22_02 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 adultscore=0 spamscore=0 mlxlogscore=999 mlxscore=0 lowpriorityscore=0 phishscore=0 impostorscore=0 priorityscore=1501 clxscore=1011 bulkscore=0 suspectscore=0 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2304280000 definitions=main-2305310096 X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,SPF_HELO_NONE,SPF_PASS, T_SCC_BODY_TEXT_LINE,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1767408612577674690?= X-GMAIL-MSGID: =?utf-8?q?1767408612577674690?= |
Series |
mm: madvise: fix uneven accounting of psi
|
|
Commit Message
Charan Teja Kalla
May 31, 2023, 11:09 a.m. UTC
A folio turns into a Workingset during:
1) shrink_active_list() placing the folio from active to inactive list.
2) When a workingset transition is happening during the folio refault.
And when Workingset is set on a folio, PSI for memory can be accounted
during a) That folio is being reclaimed and b) Refault of that folio.
There exists clients who can do the proactive reclaim using the system
calls like madvise(), whose folios can be safely treated as inactive
folios assuming the client knows that these folios are not needed in the
near future thus wanted to reclaim them. For such folios psi is not
accounted uniformly:
a) A folio started at inactive and moved to active as part of accesses.
Workingset is absent on the folio thus madvise(MADV_PAGEOUT) don't
account such folios for PSI.
b) When the same folio transition from inactive->active and then to
inactive through shrink_active_list(). Workingset is set on the folio
thus madvise(MADV_PAGEOUT) account such folios for PSI.
c) When the same folio is part of active list directly as a result of
folio refault and this was a workingset folio prior to eviction.
Workingset is set on the folio thus madvise(MADV_PAGEOUT) account such
folios for PSI.
As said about the MADV_PAGEOUT on a folio is accounted in b) and c) but
not in a) which is inconsistent. Remove this inconsistency by always not
considering the PSI for folios that are getting reclaimed through
madvise(MADV_PAGEOUT) by clearing the Workingset on a folio. This
consistency of clearing the workingset was chosen under the assumption
that client knows these folios are not in active use thus reclaiming
them hence not eligible as workingset folios. Probably it is the same
reason why workingset is not set on a folio through MADV_COLD but during
the shrink_active_list() though both the actions make the folio put onto
the inactive list.
This patch is tested on Android, Snapdragon SoC with 8Gb RAM, 4GB swap
mounted on zram which has 2GB of backingdev. The test case involved
launching some memory hungry apps in an order and do the proactive
reclaim for the app that went to background using madvise(MADV_PAGEOUT).
We are seeing ~40% less total values of psi mem some and full when this
patch is combined with [1].
[1]https://lore.kernel.org/all/20220214214921.419687-1-hannes@cmpxchg.org/T/#u
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
---
mm/madvise.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
Comments
On Wed, 31 May 2023 16:39:34 +0530 Charan Teja Kalla <quic_charante@quicinc.com> wrote: > A folio turns into a Workingset during: > 1) shrink_active_list() placing the folio from active to inactive list. > 2) When a workingset transition is happening during the folio refault. > > And when Workingset is set on a folio, PSI for memory can be accounted > during a) That folio is being reclaimed and b) Refault of that folio. > > There exists clients who can do the proactive reclaim using the system > calls like madvise(), whose folios can be safely treated as inactive > folios assuming the client knows that these folios are not needed in the > near future thus wanted to reclaim them. For such folios psi is not > accounted uniformly: > a) A folio started at inactive and moved to active as part of accesses. > Workingset is absent on the folio thus madvise(MADV_PAGEOUT) don't > account such folios for PSI. > > b) When the same folio transition from inactive->active and then to > inactive through shrink_active_list(). Workingset is set on the folio > thus madvise(MADV_PAGEOUT) account such folios for PSI. > > c) When the same folio is part of active list directly as a result of > folio refault and this was a workingset folio prior to eviction. > Workingset is set on the folio thus madvise(MADV_PAGEOUT) account such > folios for PSI. > > As said about the MADV_PAGEOUT on a folio is accounted in b) and c) but > not in a) which is inconsistent. Remove this inconsistency by always not > considering the PSI for folios that are getting reclaimed through > madvise(MADV_PAGEOUT) by clearing the Workingset on a folio. This > consistency of clearing the workingset was chosen under the assumption > that client knows these folios are not in active use thus reclaiming > them hence not eligible as workingset folios. Probably it is the same > reason why workingset is not set on a folio through MADV_COLD but during > the shrink_active_list() though both the actions make the folio put onto > the inactive list. > > This patch is tested on Android, Snapdragon SoC with 8Gb RAM, 4GB swap > mounted on zram which has 2GB of backingdev. The test case involved > launching some memory hungry apps in an order and do the proactive > reclaim for the app that went to background using madvise(MADV_PAGEOUT). > We are seeing ~40% less total values of psi mem some and full when this > patch is combined with [1]. Does this accounting inaccuracy have any perceptible runtime effects?
On Wed, May 31, 2023 at 04:39:34PM +0530, Charan Teja Kalla wrote: > This patch is tested on Android, Snapdragon SoC with 8Gb RAM, 4GB swap > mounted on zram which has 2GB of backingdev. The test case involved > launching some memory hungry apps in an order and do the proactive > reclaim for the app that went to background using madvise(MADV_PAGEOUT). > We are seeing ~40% less total values of psi mem some and full when this > patch is combined with [1]. Does that mean those pages are thrashing, but because you clear their workingset it isn't detected and reported via psi? I don't rally get why silencing the thrashing is an improvement. > [1]https://lore.kernel.org/all/20220214214921.419687-1-hannes@cmpxchg.org/T/#u > > Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com> > --- > mm/madvise.c | 8 ++++++-- > 1 file changed, 6 insertions(+), 2 deletions(-) > > diff --git a/mm/madvise.c b/mm/madvise.c > index 340125d..3410c39 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -409,8 +409,10 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, > if (folio_isolate_lru(folio)) { > if (folio_test_unevictable(folio)) > folio_putback_lru(folio); > - else > + else { > + folio_clear_workingset(folio); > list_add(&folio->lru, &folio_list); > + } > } > } else > folio_deactivate(folio); > @@ -503,8 +505,10 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, > if (folio_isolate_lru(folio)) { > if (folio_test_unevictable(folio)) > folio_putback_lru(folio); > - else > + else { > + folio_clear_workingset(folio); > list_add(&folio->lru, &folio_list); > + } > } > } else > folio_deactivate(folio); > -- > 2.7.4 >
Thanks Johannes for taking a look at it.. On 6/1/2023 3:49 AM, Johannes Weiner wrote: > On Wed, May 31, 2023 at 04:39:34PM +0530, Charan Teja Kalla wrote: >> This patch is tested on Android, Snapdragon SoC with 8Gb RAM, 4GB swap >> mounted on zram which has 2GB of backingdev. The test case involved >> launching some memory hungry apps in an order and do the proactive >> reclaim for the app that went to background using madvise(MADV_PAGEOUT). >> We are seeing ~40% less total values of psi mem some and full when this >> patch is combined with [1]. > Does that mean those pages are thrashing, but because you clear their > workingset it isn't detected and reported via psi? > Seems I didn't mention the usecase clearly and let me correct it. Say we have the Android apps A, B, C, ... H and launching of these apps goes like below. 1) Launch app A. 2) Launch app B. 3) Launch app C. At this time, we consider the memory used by app A is no more in active use thus proactively reclaim them where we do issue MADV_PAGEOUT on anon regions only thus these pages goes to swap mounted on zram and subsequently into the backing dev attached to the zram. 4) Launch app D.. Proactively reclaim the anon regions of App B into swap and through to backing dev. 5) Now make the app A to foreground. This can read the pages from the swap + backing dev (because of the step 3)) that belongs to app A and also proactively reclaim anon regions of app C. 6) Launch E --> proactive reclaim of app D to zram swap + backing dev. 7) Make App B to foreground --> Read memory of app B from swap + backingdev and as well reclaim the anon regions of app A. 8) Like wise launches of apps F, C, G, D, H, E ..... If we look at steps 5, 7,..., we are just making the apps foreground which folios (if were marked as workingset) can contribute to PSI events through swap_readpage(). But in reality, these are not the real workingset folios (I think it is safe to say this), because it is the user who decided the reclaim of these pages by issuing the MADV_PAGEOUT which he knows that these are not going to be needed in the near future thus reclaim them. I think the other way to look at the problem is the user can write a simple application where he can do MADV_PAGEOUT and read them back in a loop. If at any point, folios user working on turns out to be a workingset( he can just be probabilistic here), the PSI events will be triggered though there may not be real memory pressure in the system. > I don't rally get why silencing the thrashing is an improvement. > Agree that we shouldn't be really silence the thrashing. My point is we shouldn't be considering the folios as thrashing If those were getting reclaim by the user him self through MADV_PAGEOUT under the assumption that __user knows they are not real working set__. Please let me know if I am not making sense here. >> [1]https://lore.kernel.org/all/20220214214921.419687-1-hannes@cmpxchg.org/T/#u >> >> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Thanks Andrew!! On 6/1/2023 3:34 AM, Andrew Morton wrote: >> This patch is tested on Android, Snapdragon SoC with 8Gb RAM, 4GB swap >> mounted on zram which has 2GB of backingdev. The test case involved >> launching some memory hungry apps in an order and do the proactive >> reclaim for the app that went to background using madvise(MADV_PAGEOUT). >> We are seeing ~40% less total values of psi mem some and full when this >> patch is combined with [1]. > Does this accounting inaccuracy have any perceptible runtime effects? > With the testcase mentioned, we are able to achieve better app concurrency(It is a measure on Android of how many apps that can survive by not getting killed by lmkd(which relies on PSI events to take the decissions) by the end of the testcase). The total memsome and memfull psi events(measured in usec) are as below: mem PSI w/o patch with patch some 61255313 21867736 full 31353138 15391454 Thanks, Charan >
Hi Charan, On Thu, Jun 01, 2023 at 06:37:50PM +0530, Charan Teja Kalla wrote: > Thanks Johannes for taking a look at it.. > > On 6/1/2023 3:49 AM, Johannes Weiner wrote: > > On Wed, May 31, 2023 at 04:39:34PM +0530, Charan Teja Kalla wrote: > >> This patch is tested on Android, Snapdragon SoC with 8Gb RAM, 4GB swap > >> mounted on zram which has 2GB of backingdev. The test case involved > >> launching some memory hungry apps in an order and do the proactive > >> reclaim for the app that went to background using madvise(MADV_PAGEOUT). > >> We are seeing ~40% less total values of psi mem some and full when this > >> patch is combined with [1]. > > Does that mean those pages are thrashing, but because you clear their > > workingset it isn't detected and reported via psi? > > > > Seems I didn't mention the usecase clearly and let me correct it. Say we > have the Android apps A, B, C, ... H and launching of these apps goes > like below. > > 1) Launch app A. > 2) Launch app B. > 3) Launch app C. At this time, we consider the memory used by app A is > no more in active use thus proactively reclaim them where we do issue > MADV_PAGEOUT on anon regions only thus these pages goes to swap mounted > on zram and subsequently into the backing dev attached to the zram. > 4) Launch app D.. Proactively reclaim the anon regions of App B into > swap and through to backing dev. > 5) Now make the app A to foreground. This can read the pages from the > swap + backing dev (because of the step 3)) that belongs to app A and > also proactively reclaim anon regions of app C. > 6) Launch E --> proactive reclaim of app D to zram swap + backing dev. > 7) Make App B to foreground --> Read memory of app B from swap + > backingdev and as well reclaim the anon regions of app A. > 8) Like wise launches of apps F, C, G, D, H, E ..... > > If we look at steps 5, 7,..., we are just making the apps foreground > which folios (if were marked as workingset) can contribute to PSI events > through swap_readpage(). But in reality, these are not the real > workingset folios (I think it is safe to say this), because it is the > user who decided the reclaim of these pages by issuing the MADV_PAGEOUT > which he knows that these are not going to be needed in the near future > thus reclaim them. > > I think the other way to look at the problem is the user can write a > simple application where he can do MADV_PAGEOUT and read them back in a > loop. If at any point, folios user working on turns out to be a > workingset( he can just be probabilistic here), the PSI events will be > triggered though there may not be real memory pressure in the system. > > > I don't rally get why silencing the thrashing is an improvement. > > > Agree that we shouldn't be really silence the thrashing. My point is we > shouldn't be considering the folios as thrashing If those were getting > reclaim by the user him self through MADV_PAGEOUT under the assumption > that __user knows they are not real working set__. Please let me know > if I am not making sense here. I'm not sure I agree with this. I think it misses the point of what the madvise is actually for. The workingset is defined based on access frequency and available memory. Thrashing is defined as having to read pages back shortly after their eviction. MADV_PAGEOUT is for the application to inform the kernel that it's done accessing the pages, so that the kernel can accelerate their eviction over other pages that may still be in use. This is ultimately meant to REDUCE reclaim and paging. However, in this case, the MADVISE_PAGEOUT evicts pages that are reused after and then refault. It INCREASED reclaim and paging. Surely that's a problem? And the system would have behaved better without the madvise() in the first place? In fact, I would argue that the pressure spike is a great signal for detecting overzealous madvising. If you're redefining the workingset from access frequency to "whatever the user is saying", that will take away an important mechanism to detect advise bugs and unnecessary IO.
Thanks Johannes for the detailed review comments... On 6/5/2023 11:30 PM, Johannes Weiner wrote: >> Agree that we shouldn't be really silence the thrashing. My point is we >> shouldn't be considering the folios as thrashing If those were getting >> reclaim by the user him self through MADV_PAGEOUT under the assumption >> that __user knows they are not real working set__. Please let me know >> if I am not making sense here. > I'm not sure I agree with this. I think it misses the point of what > the madvise is actually for. > > The workingset is defined based on access frequency and available > memory. Thrashing is defined as having to read pages back shortly > after their eviction. > > MADV_PAGEOUT is for the application to inform the kernel that it's > done accessing the pages, so that the kernel can accelerate their > eviction over other pages that may still be in use. This is ultimately > meant to REDUCE reclaim and paging. > > However, in this case, the MADVISE_PAGEOUT evicts pages that are > reused after and then refault. It INCREASED reclaim and paging. > I agree here... > Surely that's a problem? And the system would have behaved better > without the madvise() in the first place? > Yes, the system behavior could be much better without this PAGEOUT operation... > In fact, I would argue that the pressure spike is a great signal for > detecting overzealous madvising. If you're redefining the workingset > from access frequency to "whatever the user is saying", that will take > away an important mechanism to detect advise bugs and unnecessary IO. currently wanted to do the PAGEOUT operation but what information lacks is if I am really operating on the workingset pages. Had the client knows that he is operating on the workingset pages, he could have backed off from madvising. I now note that I shouldn't be defining the workingset from "whatever user is saying". But then, IMO, there should be a way from the kernel to the user that his madvise operation is being performed on the workingset pages. One way the user can do is monitoring the PSI events while PAGEOUT is being performed and he may exclude those VMA's from next time. Alternatively kernel itself can support it may be through like MADV_PAGEOUT_INACTIVE which doesn't pageout the Workingset pages. Please let me know your opinion about this interface. This has the usecase on android where it just assumes that 2nd background app will most likely to be not used in the future thus reclaim those app pages. It works well for most of the times but such assumption will go wrong with the usecase I had mentioned. --Thanks.
Hi Charan, On Tue, Jun 6, 2023 at 7:54 AM Charan Teja Kalla <quic_charante@quicinc.com> wrote: > > Thanks Johannes for the detailed review comments... > > On 6/5/2023 11:30 PM, Johannes Weiner wrote: > >> Agree that we shouldn't be really silence the thrashing. My point is we > >> shouldn't be considering the folios as thrashing If those were getting > >> reclaim by the user him self through MADV_PAGEOUT under the assumption > >> that __user knows they are not real working set__. Please let me know > >> if I am not making sense here. > > I'm not sure I agree with this. I think it misses the point of what > > the madvise is actually for. > > > > The workingset is defined based on access frequency and available > > memory. Thrashing is defined as having to read pages back shortly > > after their eviction. > > > > MADV_PAGEOUT is for the application to inform the kernel that it's > > done accessing the pages, so that the kernel can accelerate their > > eviction over other pages that may still be in use. This is ultimately > > meant to REDUCE reclaim and paging. > > > > However, in this case, the MADVISE_PAGEOUT evicts pages that are > > reused after and then refault. It INCREASED reclaim and paging. > > > I agree here... > > Surely that's a problem? And the system would have behaved better > > without the madvise() in the first place? > > > Yes, the system behavior could be much better without this PAGEOUT > operation... > > In fact, I would argue that the pressure spike is a great signal for > > detecting overzealous madvising. If you're redefining the workingset > > from access frequency to "whatever the user is saying", that will take > > away an important mechanism to detect advise bugs and unnecessary IO. > currently wanted to do the PAGEOUT operation but what information lacks > is if I am really operating on the workingset pages. Had the client > knows that he is operating on the workingset pages, he could have backed > off from madvising. > > I now note that I shouldn't be defining the workingset from "whatever > user is saying". But then, IMO, there should be a way from the kernel to > the user that his madvise operation is being performed on the workingset > pages. > > One way the user can do is monitoring the PSI events while PAGEOUT is > being performed and he may exclude those VMA's from next time. > > Alternatively kernel itself can support it may be through like > MADV_PAGEOUT_INACTIVE which doesn't pageout the Workingset pages. > > Please let me know your opinion about this interface. > > This has the usecase on android where it just assumes that 2nd > background app will most likely to be not used in the future thus > reclaim those app pages. It works well for most of the times but such > assumption will go wrong with the usecase I had mentioned. Hi Folks. Sorry for being late to the party. Yeah, userspace does not have a crystal ball to predict future user behavior, so there will always be pathological cases when usual assumptions and resulting madvise() would make things worse. I think this discussion can be split into several questions/issues: 1. Inconsistency in how madvise(MADV_PAGEOUT) would affect PSI calculation when the page is refaulted, based on the path it took before being evicted by madvise(). In your initial description case (a) is inconsistent with (b) and (c) and it's probably worth fixing. IMHO (a) should be made consistent with others, not the other way around. My reasoning is that page was expelled from the active list, so it was part of the active workingset. 2. Whether refaults caused by incorrect madvise(MADV_PAGEOUT) should be counted as workingset refault and affect PSI. This one I think is trickier. IMHO it should be counted as workingset refault simply because it was refaulted and it was part of the workingset. Whether it should affect PSI, which is supposed to be an indicator of "pressure" is, I think, debatable. With madvise() in the mix, refault might happen without any real memory pressure... So, the answer is not obvious to me. 3. Should refaults caused by incorrect madvise(MADV_PAGEOUT) be distinguished from the ones which were evicted by kernel reclaim mechanisms. I can see use for that from userspace to detect incorrect madvise() and adjust its aggressiveness. I think the API might get a bit complex because of the need to associate refaults with specific madvise()/VMAs to understand which hint was incorrect and adjust the behavior. Hope my feedback is useful and if we can improve Android's userspace behavior, I'm happy to help make that happen. Thanks, Suren. > > --Thanks.
Thanks Suren & Johannes, On 6/7/2023 1:18 AM, Suren Baghdasaryan wrote: > Hi Folks. Sorry for being late to the party. > Yeah, userspace does not have a crystal ball to predict future user > behavior, so there will always be pathological cases when usual > assumptions and resulting madvise() would make things worse. > > I think this discussion can be split into several questions/issues: > 1. Inconsistency in how madvise(MADV_PAGEOUT) would affect PSI > calculation when the page is refaulted, based on the path it took > before being evicted by madvise(). In your initial description case > (a) is inconsistent with (b) and (c) and it's probably worth fixing. > IMHO (a) should be made consistent with others, not the other way > around. My reasoning is that page was expelled from the active list, > so it was part of the active workingset. > That means we should be setting Workingset on the page while it is on the active list and when it is being pageout through madvising. Right? I see, this makes it consistent. On the same note, discussing with Suren offline, Should the refaulted madvise pages start always at the inactive list? If they are really active, they get promoted anyway.. > 2. Whether refaults caused by incorrect madvise(MADV_PAGEOUT) should > be counted as workingset refault and affect PSI. > This one I think is trickier. IMHO it should be counted as workingset > refault simply because it was refaulted and it was part of the > workingset. Whether it should affect PSI, which is supposed to be an > indicator of "pressure" is, I think, debatable. With madvise() in the > mix, refault might happen without any real memory pressure... So, the > answer is not obvious to me. > > 3. Should refaults caused by incorrect madvise(MADV_PAGEOUT) be > distinguished from the ones which were evicted by kernel reclaim > mechanisms. > I can see use for that from userspace to detect incorrect madvise() > and adjust its aggressiveness. I think the API might get a bit complex > because of the need to associate refaults with specific madvise()/VMAs > to understand which hint was incorrect and adjust the behavior. > Instead what is the opinion about giving an MADV_PAGEOUT_INACTIVE interface which does operate on a page only If it is on the inactive list and !PageWorkingset ? > Hope my feedback is useful and if we can improve Android's userspace > behavior, I'm happy to help make that happen. Thanks...
On Fri, Jun 9, 2023 at 5:42 AM Charan Teja Kalla <quic_charante@quicinc.com> wrote: > > Thanks Suren & Johannes, > > On 6/7/2023 1:18 AM, Suren Baghdasaryan wrote: > > Hi Folks. Sorry for being late to the party. > > Yeah, userspace does not have a crystal ball to predict future user > > behavior, so there will always be pathological cases when usual > > assumptions and resulting madvise() would make things worse. > > > > I think this discussion can be split into several questions/issues: > > 1. Inconsistency in how madvise(MADV_PAGEOUT) would affect PSI > > calculation when the page is refaulted, based on the path it took > > before being evicted by madvise(). In your initial description case > > (a) is inconsistent with (b) and (c) and it's probably worth fixing. > > IMHO (a) should be made consistent with others, not the other way > > around. My reasoning is that page was expelled from the active list, > > so it was part of the active workingset. > > > That means we should be setting Workingset on the page while it is on > the active list and when it is being pageout through madvising. Right? I > see, this makes it consistent. This was my opinion but others might think otherwise, like I found out in some recent conversations. So, it would be great to get some more feedback before making the change. > > On the same note, discussing with Suren offline, Should the refaulted > madvise pages start always at the inactive list? If they are really > active, they get promoted anyway.. > > > 2. Whether refaults caused by incorrect madvise(MADV_PAGEOUT) should > > be counted as workingset refault and affect PSI. > > This one I think is trickier. IMHO it should be counted as workingset > > refault simply because it was refaulted and it was part of the > > workingset. Whether it should affect PSI, which is supposed to be an > > indicator of "pressure" is, I think, debatable. With madvise() in the > > mix, refault might happen without any real memory pressure... So, the > > answer is not obvious to me. > > > > 3. Should refaults caused by incorrect madvise(MADV_PAGEOUT) be > > distinguished from the ones which were evicted by kernel reclaim > > mechanisms. > > I can see use for that from userspace to detect incorrect madvise() > > and adjust its aggressiveness. I think the API might get a bit complex > > because of the need to associate refaults with specific madvise()/VMAs > > to understand which hint was incorrect and adjust the behavior. > > Instead what is the opinion about giving an MADV_PAGEOUT_INACTIVE > interface which does operate on a page only If it is on the inactive > list and !PageWorkingset ? IOW you want a less aggressive mechanism which can be used by the userspace to tell the kernel "I think these pages won't be used but I'm not 100% sure, so drop them only if they are inactive"? I don't know how much that will help when the madvise() ends up being wrong but maybe you can quickly experiment and tell us if the difference is substantial? > > > Hope my feedback is useful and if we can improve Android's userspace > > behavior, I'm happy to help make that happen. > Thanks...
On Fri, Jun 09, 2023 at 04:13:14PM -0700, Suren Baghdasaryan wrote: > On Fri, Jun 9, 2023 at 5:42 AM Charan Teja Kalla > <quic_charante@quicinc.com> wrote: > > > > Thanks Suren & Johannes, > > > > On 6/7/2023 1:18 AM, Suren Baghdasaryan wrote: > > > Hi Folks. Sorry for being late to the party. > > > Yeah, userspace does not have a crystal ball to predict future user > > > behavior, so there will always be pathological cases when usual > > > assumptions and resulting madvise() would make things worse. > > > > > > I think this discussion can be split into several questions/issues: > > > 1. Inconsistency in how madvise(MADV_PAGEOUT) would affect PSI > > > calculation when the page is refaulted, based on the path it took > > > before being evicted by madvise(). In your initial description case > > > (a) is inconsistent with (b) and (c) and it's probably worth fixing. > > > IMHO (a) should be made consistent with others, not the other way > > > around. My reasoning is that page was expelled from the active list, > > > so it was part of the active workingset. > > > > > That means we should be setting Workingset on the page while it is on > > the active list and when it is being pageout through madvising. Right? I > > see, this makes it consistent. > > This was my opinion but others might think otherwise, like I found out > in some recent conversations. So, it would be great to get some more > feedback before making the change. I also agree with the consistency fix: it should set Workingset when madvise zaps pages from the active list.
On Fri, Jun 09, 2023 at 06:12:28PM +0530, Charan Teja Kalla wrote: > Thanks Suren & Johannes, > > On 6/7/2023 1:18 AM, Suren Baghdasaryan wrote: > > Hi Folks. Sorry for being late to the party. > > Yeah, userspace does not have a crystal ball to predict future user > > behavior, so there will always be pathological cases when usual > > assumptions and resulting madvise() would make things worse. > > > > I think this discussion can be split into several questions/issues: > > 1. Inconsistency in how madvise(MADV_PAGEOUT) would affect PSI > > calculation when the page is refaulted, based on the path it took > > before being evicted by madvise(). In your initial description case > > (a) is inconsistent with (b) and (c) and it's probably worth fixing. > > IMHO (a) should be made consistent with others, not the other way > > around. My reasoning is that page was expelled from the active list, > > so it was part of the active workingset. > > > That means we should be setting Workingset on the page while it is on > the active list and when it is being pageout through madvising. Right? I > see, this makes it consistent. > > On the same note, discussing with Suren offline, Should the refaulted > madvise pages start always at the inactive list? If they are really > active, they get promoted anyway.. > Can you elaborate on the rationale why refaulted madvise pages needs to be on inactive list? If it had not been paged out via madvise, it would have been activated no? Thanks, Pavan
Hi Suren, On 6/10/2023 4:43 AM, Suren Baghdasaryan wrote: >>> I can see use for that from userspace to detect incorrect madvise() >>> and adjust its aggressiveness. I think the API might get a bit complex >>> because of the need to associate refaults with specific madvise()/VMAs >>> to understand which hint was incorrect and adjust the behavior. >>> Instead what is the opinion about giving an MADV_PAGEOUT_INACTIVE >> interface which does operate on a page only If it is on the inactive >> list and !PageWorkingset ? > IOW you want a less aggressive mechanism which can be used by the > userspace to tell the kernel "I think these pages won't be used but > I'm not 100% sure, so drop them only if they are inactive"? > I don't know how much that will help when the madvise() ends up being > wrong but maybe you can quickly experiment and tell us if the > difference is substantial? We did some extensive testing on Android and this ask is not helping us much. I am really not sure if there is some other usecase that can benefit from this. So, for now I just stick to your suggestion of making the pages on the Active list as the Workingset at the time of pageout. > Thanks.
On Mon, Jun 26, 2023 at 7:31 AM Charan Teja Kalla <quic_charante@quicinc.com> wrote: > > Hi Suren, > > On 6/10/2023 4:43 AM, Suren Baghdasaryan wrote: > >>> I can see use for that from userspace to detect incorrect madvise() > >>> and adjust its aggressiveness. I think the API might get a bit complex > >>> because of the need to associate refaults with specific madvise()/VMAs > >>> to understand which hint was incorrect and adjust the behavior. > >>> Instead what is the opinion about giving an MADV_PAGEOUT_INACTIVE > >> interface which does operate on a page only If it is on the inactive > >> list and !PageWorkingset ? > > IOW you want a less aggressive mechanism which can be used by the > > userspace to tell the kernel "I think these pages won't be used but > > I'm not 100% sure, so drop them only if they are inactive"? > > I don't know how much that will help when the madvise() ends up being > > wrong but maybe you can quickly experiment and tell us if the > > difference is substantial? > > We did some extensive testing on Android and this ask is not helping us > much. I am really not sure if there is some other usecase that can > benefit from this. So, for now I just stick to your suggestion of making > the pages on the Active list as the Workingset at the time of pageout. > > Thanks for checking that. Your plan SGTM. > > Thanks. >
diff --git a/mm/madvise.c b/mm/madvise.c index 340125d..3410c39 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -409,8 +409,10 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, if (folio_isolate_lru(folio)) { if (folio_test_unevictable(folio)) folio_putback_lru(folio); - else + else { + folio_clear_workingset(folio); list_add(&folio->lru, &folio_list); + } } } else folio_deactivate(folio); @@ -503,8 +505,10 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, if (folio_isolate_lru(folio)) { if (folio_test_unevictable(folio)) folio_putback_lru(folio); - else + else { + folio_clear_workingset(folio); list_add(&folio->lru, &folio_list); + } } } else folio_deactivate(folio);