Message ID | 1684919574-28368-1-git-send-email-zhaoyang.huang@unisoc.com |
---|---|
State | New |
Headers | From: "zhaoyang.huang" <zhaoyang.huang@unisoc.com>; To: Andrew Morton <akpm@linux-foundation.org>, Johannes Weiner <hannes@cmpxchg.org>, Suren Baghdasaryan <surenb@google.com>, <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>, Zhaoyang Huang <huangzhaoyang@gmail.com>, <ke.wang@unisoc.com>; Subject: [PATCH] mm: deduct the number of pages reclaimed by madvise from workingset; Date: Wed, 24 May 2023 17:12:54 +0800; Message-ID: <1684919574-28368-1-git-send-email-zhaoyang.huang@unisoc.com> |
Series | mm: deduct the number of pages reclaimed by madvise from workingset |
Commit Message
zhaoyang.huang
May 24, 2023, 9:12 a.m. UTC
From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>

Pages reclaimed by madvise_pageout are deactivated and dropped from the
LRU forcefully, which leaves subsequently refaulting pages with a larger
refault distance than they should have. This can hurt the accuracy of
thrashing detection when madvise_pageout is used as a common means of
memory reclaim, as Android does now.

Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
---
 include/linux/swap.h | 2 +-
 mm/madvise.c         | 4 ++--
 mm/vmscan.c          | 8 +++++++-
 3 files changed, 10 insertions(+), 4 deletions(-)
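For background on why the refault distance matters here: a shadow entry
recorded at eviction stores the lruvec's nonresident age, and the distance
between that and the age at refault decides whether a refault counts as
thrashing. Below is an illustrative sketch in the spirit of the logic in
mm/workingset.c, not the kernel source itself; the constant and the
function name are simplified for the example.

/*
 * Illustrative sketch of the refault-distance test (simplified from
 * mm/workingset.c; the real code also packs memcg/node ids into the
 * shadow entry and handles anon/file workingset sizes separately).
 */
#include <stdbool.h>

#define EVICTION_MASK	(~0UL >> 8)	/* assumed width of the packed age */

static bool refault_is_thrashing(unsigned long age_at_refault,
				 unsigned long age_at_eviction,
				 unsigned long workingset_size)
{
	/*
	 * The nonresident age ticks once per eviction, so this difference
	 * approximates how many pages were reclaimed while this page was
	 * out of memory.
	 */
	unsigned long refault_distance =
		(age_at_refault - age_at_eviction) & EVICTION_MASK;

	/*
	 * If the page could have stayed resident in a workingset only
	 * slightly larger than what fits in memory, the refault counts
	 * as thrashing and the page is re-activated.
	 */
	return refault_distance <= workingset_size;
}

The patch below shifts this calculation by winding the nonresident age
back after madvise-driven reclaim, so those evictions stop inflating the
refault distance of other pages.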
Comments
On Wed, May 24, 2023 at 2:13 AM zhaoyang.huang <zhaoyang.huang@unisoc.com> wrote:
>
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
>
> Pages reclaimed by madvise_pageout are deactivated and dropped from the
> LRU forcefully, which leaves subsequently refaulting pages with a larger
> refault distance than they should have. This can hurt the accuracy of
> thrashing detection when madvise_pageout is used as a common means of
> memory reclaim, as Android does now.

Doesn't workingset_eviction() in the following call chain already handle
nonresident page aging?:

reclaim_pages
  reclaim_folio_list
    shrink_folio_list
      __remove_mapping
        workingset_eviction
          workingset_age_nonresident

>
> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> ---
>  include/linux/swap.h | 2 +-
>  mm/madvise.c | 4 ++--
>  mm/vmscan.c | 8 +++++++-
>  3 files changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2787b84..0312142 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -428,7 +428,7 @@ extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
>  extern int vm_swappiness;
>  long remove_mapping(struct address_space *mapping, struct folio *folio);
>
> -extern unsigned long reclaim_pages(struct list_head *page_list);
> +extern unsigned long reclaim_pages(struct mm_struct *mm, struct list_head *page_list);
>  #ifdef CONFIG_NUMA
>  extern int node_reclaim_mode;
>  extern int sysctl_min_unmapped_ratio;
> diff --git a/mm/madvise.c b/mm/madvise.c
> index b6ea204..61c8d7b 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -420,7 +420,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  huge_unlock:
>  	spin_unlock(ptl);
>  	if (pageout)
> -		reclaim_pages(&page_list);
> +		reclaim_pages(mm, &page_list);
>  	return 0;
>  }
>
> @@ -516,7 +516,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  	arch_leave_lazy_mmu_mode();
>  	pte_unmap_unlock(orig_pte, ptl);
>  	if (pageout)
> -		reclaim_pages(&page_list);
> +		reclaim_pages(mm, &page_list);
>  	cond_resched();
>
>  	return 0;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 20facec..048c10b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2741,12 +2741,14 @@ static unsigned int reclaim_folio_list(struct list_head *folio_list,
>  	return nr_reclaimed;
>  }
>
> -unsigned long reclaim_pages(struct list_head *folio_list)
> +unsigned long reclaim_pages(struct mm_struct *mm, struct list_head *folio_list)

You would also need to change the DAMON usage of reclaim_pages() here:
https://elixir.bootlin.com/linux/v6.4-rc1/source/mm/damon/paddr.c#L253

>  {
>  	int nid;
>  	unsigned int nr_reclaimed = 0;
>  	LIST_HEAD(node_folio_list);
>  	unsigned int noreclaim_flag;
> +	struct lruvec *lruvec;
> +	struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
>
>  	if (list_empty(folio_list))
>  		return nr_reclaimed;
> @@ -2764,10 +2766,14 @@ unsigned long reclaim_pages(struct list_head *folio_list)
>  		}
>
>  		nr_reclaimed += reclaim_folio_list(&node_folio_list, NODE_DATA(nid));
> +		lruvec = &memcg->nodeinfo[nid]->lruvec;
> +		workingset_age_nonresident(lruvec, -nr_reclaimed);
>  		nid = folio_nid(lru_to_folio(folio_list));
>  	} while (!list_empty(folio_list));
>
>  	nr_reclaimed += reclaim_folio_list(&node_folio_list, NODE_DATA(nid));
> +	lruvec = &memcg->nodeinfo[nid]->lruvec;
> +	workingset_age_nonresident(lruvec, -nr_reclaimed);
>
>  	memalloc_noreclaim_restore(noreclaim_flag);
>
> --
> 1.9.1
>
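For reference, the aging step at the end of that call chain is small; the
following is an abridged copy of workingset_age_nonresident() as found in
mm/workingset.c around v6.4 (the surrounding shadow-entry packing is
elided):

/*
 * Abridged from mm/workingset.c: every eviction advances the nonresident
 * age of the lruvec and all its ancestors. This counter is exactly the
 * "clock" that the patch under discussion winds back.
 */
void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages)
{
	/*
	 * Reclaiming a cgroup means reclaiming all its children, so the
	 * age is propagated up the memcg hierarchy.
	 */
	do {
		atomic_long_add(nr_pages, &lruvec->nonresident_age);
	} while ((lruvec = parent_lruvec(lruvec)));
}

Note that nr_pages is unsigned, so the patch's -nr_reclaimed argument
relies on two's-complement wraparound to decrement the counter.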
On Thu, May 25, 2023 at 4:41 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, May 24, 2023 at 2:13 AM zhaoyang.huang
> <zhaoyang.huang@unisoc.com> wrote:
> >
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >
> > Pages reclaimed by madvise_pageout are deactivated and dropped from the
> > LRU forcefully, which leaves subsequently refaulting pages with a larger
> > refault distance than they should have. This can hurt the accuracy of
> > thrashing detection when madvise_pageout is used as a common means of
> > memory reclaim, as Android does now.
>
> Doesn't workingset_eviction() in the following call chain already
> handle nonresident page aging?:
>
> reclaim_pages
>   reclaim_folio_list
>     shrink_folio_list
>       __remove_mapping
>         workingset_eviction
>           workingset_age_nonresident

Yes. What I suggest is to subtract these pages from the nonresident age,
since they are dropped forcefully.

> >
> > Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > ---
> >  include/linux/swap.h | 2 +-
> >  mm/madvise.c | 4 ++--
> >  mm/vmscan.c | 8 +++++++-
> >  3 files changed, 10 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 2787b84..0312142 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -428,7 +428,7 @@ extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
> >  extern int vm_swappiness;
> >  long remove_mapping(struct address_space *mapping, struct folio *folio);
> >
> > -extern unsigned long reclaim_pages(struct list_head *page_list);
> > +extern unsigned long reclaim_pages(struct mm_struct *mm, struct list_head *page_list);
> >  #ifdef CONFIG_NUMA
> >  extern int node_reclaim_mode;
> >  extern int sysctl_min_unmapped_ratio;
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index b6ea204..61c8d7b 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -420,7 +420,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >  huge_unlock:
> >  	spin_unlock(ptl);
> >  	if (pageout)
> > -		reclaim_pages(&page_list);
> > +		reclaim_pages(mm, &page_list);
> >  	return 0;
> >  }
> >
> > @@ -516,7 +516,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >  	arch_leave_lazy_mmu_mode();
> >  	pte_unmap_unlock(orig_pte, ptl);
> >  	if (pageout)
> > -		reclaim_pages(&page_list);
> > +		reclaim_pages(mm, &page_list);
> >  	cond_resched();
> >
> >  	return 0;
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 20facec..048c10b 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2741,12 +2741,14 @@ static unsigned int reclaim_folio_list(struct list_head *folio_list,
> >  	return nr_reclaimed;
> >  }
> >
> > -unsigned long reclaim_pages(struct list_head *folio_list)
> > +unsigned long reclaim_pages(struct mm_struct *mm, struct list_head *folio_list)
>
> You would also need to change the DAMON usage of reclaim_pages() here:
> https://elixir.bootlin.com/linux/v6.4-rc1/source/mm/damon/paddr.c#L253

OK, thanks for reminding me.

> >  {
> >  	int nid;
> >  	unsigned int nr_reclaimed = 0;
> >  	LIST_HEAD(node_folio_list);
> >  	unsigned int noreclaim_flag;
> > +	struct lruvec *lruvec;
> > +	struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
> >
> >  	if (list_empty(folio_list))
> >  		return nr_reclaimed;
> > @@ -2764,10 +2766,14 @@ unsigned long reclaim_pages(struct list_head *folio_list)
> >  		}
> >
> >  		nr_reclaimed += reclaim_folio_list(&node_folio_list, NODE_DATA(nid));
> > +		lruvec = &memcg->nodeinfo[nid]->lruvec;
> > +		workingset_age_nonresident(lruvec, -nr_reclaimed);
> >  		nid = folio_nid(lru_to_folio(folio_list));
> >  	} while (!list_empty(folio_list));
> >
> >  	nr_reclaimed += reclaim_folio_list(&node_folio_list, NODE_DATA(nid));
> > +	lruvec = &memcg->nodeinfo[nid]->lruvec;
> > +	workingset_age_nonresident(lruvec, -nr_reclaimed);
> >
> >  	memalloc_noreclaim_restore(noreclaim_flag);
> >
> > --
> > 1.9.1
> >
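For context on that DAMON call site: damon_pa_pageout() also feeds a
folio list into reclaim_pages(), but it operates on the physical address
space and has no mm_struct at hand. The sketch below is a hypothetical
adaptation to the proposed signature (function name and parameters are
simplified, not the actual mm/damon/paddr.c code); it illustrates the
attribution problem the signature change creates, since there is no
obvious mm, and hence no obvious memcg, to pass.

/*
 * Hypothetical adaptation of DAMON's pageout path to the proposed
 * reclaim_pages(mm, list) signature. DAMON monitors physical address
 * regions, so the folios on the list may belong to many processes and
 * cgroups; passing NULL (or any single mm) would misattribute the
 * nonresident aging.
 */
static unsigned long damon_pa_pageout_sketch(struct list_head *folio_list)
{
	return reclaim_pages(/* mm = */ NULL, folio_list);
}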
On Wed, May 24, 2023 at 05:12:54PM +0800, zhaoyang.huang wrote:
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
>
> Pages reclaimed by madvise_pageout are deactivated and dropped from the
> LRU forcefully, which leaves subsequently refaulting pages with a larger
> refault distance than they should have. This can hurt the accuracy of
> thrashing detection when madvise_pageout is used as a common means of
> memory reclaim, as Android does now.

This alludes to, but doesn't explain, a real world usecase.

Yes, madvise_pageout() will record non-resident entries today. This
means refault and thrash detection is on for user-driven reclaim.

So why is that undesirable?

Today we measure and report the cost of reclaim and memory pressure
for physical memory shortages, cgroup limits, and user-driven cgroup
reclaim. Why should we not do the same for madv_pageout()? If the
userspace code that drives pageout has a bug and the result is extreme
thrashing, wouldn't you want to know that?

Please explain the idea here better.

> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> ---
>  include/linux/swap.h | 2 +-
>  mm/madvise.c | 4 ++--
>  mm/vmscan.c | 8 +++++++-
>  3 files changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2787b84..0312142 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -428,7 +428,7 @@ extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
>  extern int vm_swappiness;
>  long remove_mapping(struct address_space *mapping, struct folio *folio);
>
> -extern unsigned long reclaim_pages(struct list_head *page_list);
> +extern unsigned long reclaim_pages(struct mm_struct *mm, struct list_head *page_list);
>  #ifdef CONFIG_NUMA
>  extern int node_reclaim_mode;
>  extern int sysctl_min_unmapped_ratio;
> diff --git a/mm/madvise.c b/mm/madvise.c
> index b6ea204..61c8d7b 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -420,7 +420,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  huge_unlock:
>  	spin_unlock(ptl);
>  	if (pageout)
> -		reclaim_pages(&page_list);
> +		reclaim_pages(mm, &page_list);
>  	return 0;
>  }
>
> @@ -516,7 +516,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  	arch_leave_lazy_mmu_mode();
>  	pte_unmap_unlock(orig_pte, ptl);
>  	if (pageout)
> -		reclaim_pages(&page_list);
> +		reclaim_pages(mm, &page_list);
>  	cond_resched();
>
>  	return 0;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 20facec..048c10b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2741,12 +2741,14 @@ static unsigned int reclaim_folio_list(struct list_head *folio_list,
>  	return nr_reclaimed;
>  }
>
> -unsigned long reclaim_pages(struct list_head *folio_list)
> +unsigned long reclaim_pages(struct mm_struct *mm, struct list_head *folio_list)
>  {
>  	int nid;
>  	unsigned int nr_reclaimed = 0;
>  	LIST_HEAD(node_folio_list);
>  	unsigned int noreclaim_flag;
> +	struct lruvec *lruvec;
> +	struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
>
>  	if (list_empty(folio_list))
>  		return nr_reclaimed;
> @@ -2764,10 +2766,14 @@ unsigned long reclaim_pages(struct list_head *folio_list)
>  		}
>
>  		nr_reclaimed += reclaim_folio_list(&node_folio_list, NODE_DATA(nid));
> +		lruvec = &memcg->nodeinfo[nid]->lruvec;
> +		workingset_age_nonresident(lruvec, -nr_reclaimed);
>  		nid = folio_nid(lru_to_folio(folio_list));
>  	} while (!list_empty(folio_list));
>
>  	nr_reclaimed += reclaim_folio_list(&node_folio_list, NODE_DATA(nid));
> +	lruvec = &memcg->nodeinfo[nid]->lruvec;
> +	workingset_age_nonresident(lruvec, -nr_reclaimed);

The task might have moved cgroups in between, who knows what kind of
artifacts it will introduce if you wind back the wrong clock.

If there are reclaim passes that shouldn't participate in non-resident
tracking, that should be plumbed through the stack to __remove_mapping
(which already has that bool reclaimed param to not record entries).
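The plumbing Johannes points to already exists in one place: a shadow
entry is only recorded when __remove_mapping() is called from real
reclaim. Abridged from mm/vmscan.c around v6.4 (refcount freezing, the
swapcache branch, and error paths elided), so treat the exact condition
as approximate:

/*
 * Abridged from __remove_mapping() in mm/vmscan.c: the "reclaimed" flag
 * decides whether a shadow entry (and thus nonresident aging) is
 * recorded for the evicted file folio. A reclaim pass that should not
 * participate in workingset tracking could pass reclaimed=false.
 */
static int __remove_mapping_sketch(struct address_space *mapping,
				   struct folio *folio, bool reclaimed,
				   struct mem_cgroup *target_memcg)
{
	void *shadow = NULL;

	/* ... locking and refcount freezing elided ... */

	if (reclaimed && folio_is_file_lru(folio) &&
	    !mapping_exiting(mapping) && !dax_mapping(mapping))
		shadow = workingset_eviction(folio, target_memcg);
	__filemap_remove_folio(folio, shadow);

	/* ... unlock and ->free_folio() callback elided ... */
	return 1;
}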
On Thu, May 25, 2023 at 9:54 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, May 24, 2023 at 05:12:54PM +0800, zhaoyang.huang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >
> > Pages reclaimed by madvise_pageout are deactivated and dropped from the
> > LRU forcefully, which leaves subsequently refaulting pages with a larger
> > refault distance than they should have. This can hurt the accuracy of
> > thrashing detection when madvise_pageout is used as a common means of
> > memory reclaim, as Android does now.
>
> This alludes to, but doesn't explain, a real world usecase.

We observe more block I/O (wait_on_page_bit_common) during app start on
the latest Android version, where userspace memory reclaim has changed
from the in-kernel PPR to madvise_pageout. We believe it could be related
to inaccuracy in the workingset.

> Yes, madvise_pageout() will record non-resident entries today. This
> means refault and thrash detection is on for user-driven reclaim.
>
> So why is that undesirable?

Consider an extreme scenario: the tail page of the LRU could accumulate a
given refault distance without any in-kernel reclaim having happened, be
wrongly deemed inactive, and get less protection.

> Today we measure and report the cost of reclaim and memory pressure
> for physical memory shortages, cgroup limits, and user-driven cgroup
> reclaim. Why should we not do the same for madv_pageout()? If the
> userspace code that drives pageout has a bug and the result is extreme
> thrashing, wouldn't you want to know that?

Actually, the pages evicted from the active LRU by madv_cold/pageout are
not marked as WORKINGSET, which lets them escape the thrashing accounting
when they fault back in and get stuck on I/O. I think they should be
treated the same way in terms of SetPageWorkingset and the lruvec
nonresident age. Please refer to my previous patch "mm: mark folio as
workingset in lru_deactivate_fn".

> Please explain the idea here better.
>
> > Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > ---
> >  include/linux/swap.h | 2 +-
> >  mm/madvise.c | 4 ++--
> >  mm/vmscan.c | 8 +++++++-
> >  3 files changed, 10 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 2787b84..0312142 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -428,7 +428,7 @@ extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
> >  extern int vm_swappiness;
> >  long remove_mapping(struct address_space *mapping, struct folio *folio);
> >
> > -extern unsigned long reclaim_pages(struct list_head *page_list);
> > +extern unsigned long reclaim_pages(struct mm_struct *mm, struct list_head *page_list);
> >  #ifdef CONFIG_NUMA
> >  extern int node_reclaim_mode;
> >  extern int sysctl_min_unmapped_ratio;
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index b6ea204..61c8d7b 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -420,7 +420,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >  huge_unlock:
> >  	spin_unlock(ptl);
> >  	if (pageout)
> > -		reclaim_pages(&page_list);
> > +		reclaim_pages(mm, &page_list);
> >  	return 0;
> >  }
> >
> > @@ -516,7 +516,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >  	arch_leave_lazy_mmu_mode();
> >  	pte_unmap_unlock(orig_pte, ptl);
> >  	if (pageout)
> > -		reclaim_pages(&page_list);
> > +		reclaim_pages(mm, &page_list);
> >  	cond_resched();
> >
> >  	return 0;
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 20facec..048c10b 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2741,12 +2741,14 @@ static unsigned int reclaim_folio_list(struct list_head *folio_list,
> >  	return nr_reclaimed;
> >  }
> >
> > -unsigned long reclaim_pages(struct list_head *folio_list)
> > +unsigned long reclaim_pages(struct mm_struct *mm, struct list_head *folio_list)
> >  {
> >  	int nid;
> >  	unsigned int nr_reclaimed = 0;
> >  	LIST_HEAD(node_folio_list);
> >  	unsigned int noreclaim_flag;
> > +	struct lruvec *lruvec;
> > +	struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
> >
> >  	if (list_empty(folio_list))
> >  		return nr_reclaimed;
> > @@ -2764,10 +2766,14 @@ unsigned long reclaim_pages(struct list_head *folio_list)
> >  		}
> >
> >  		nr_reclaimed += reclaim_folio_list(&node_folio_list, NODE_DATA(nid));
> > +		lruvec = &memcg->nodeinfo[nid]->lruvec;
> > +		workingset_age_nonresident(lruvec, -nr_reclaimed);
> >  		nid = folio_nid(lru_to_folio(folio_list));
> >  	} while (!list_empty(folio_list));
> >
> >  	nr_reclaimed += reclaim_folio_list(&node_folio_list, NODE_DATA(nid));
> > +	lruvec = &memcg->nodeinfo[nid]->lruvec;
> > +	workingset_age_nonresident(lruvec, -nr_reclaimed);
>
> The task might have moved cgroups in between, who knows what kind of
> artifacts it will introduce if you wind back the wrong clock.
>
> If there are reclaim passes that shouldn't participate in non-resident
> tracking, that should be plumbed through the stack to __remove_mapping
> (which already has that bool reclaimed param to not record entries).
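The earlier patch referenced above is not shown in this thread, so the
following is a hypothetical reconstruction of the idea, shaped after
lru_deactivate_fn() in mm/swap.c; the flagged line is the proposed
addition, not current kernel behavior:

/*
 * Sketch: when madv_cold/pageout forcibly deactivates an active folio,
 * remember that it was part of the workingset so that a later refault
 * that stalls on I/O is accounted as thrashing.
 */
static void lru_deactivate_sketch(struct lruvec *lruvec, struct folio *folio)
{
	if (folio_test_active(folio) && !folio_test_unevictable(folio)) {
		lruvec_del_folio(lruvec, folio);
		folio_clear_active(folio);
		folio_clear_referenced(folio);
		folio_set_workingset(folio);	/* proposed addition */
		lruvec_add_folio(lruvec, folio);
	}
}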
On Thu, May 25, 2023 at 11:39 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
>
> On Thu, May 25, 2023 at 9:54 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, May 24, 2023 at 05:12:54PM +0800, zhaoyang.huang wrote:
> > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > >
> > > Pages reclaimed by madvise_pageout are deactivated and dropped from the
> > > LRU forcefully, which leaves subsequently refaulting pages with a larger
> > > refault distance than they should have. This can hurt the accuracy of
> > > thrashing detection when madvise_pageout is used as a common means of
> > > memory reclaim, as Android does now.
> >
> > This alludes to, but doesn't explain, a real world usecase.
>
> We observe more block I/O (wait_on_page_bit_common) during app start on
> the latest Android version, where userspace memory reclaim has changed
> from the in-kernel PPR to madvise_pageout. We believe it could be
> related to inaccuracy in the workingset.

Do you mean the userspace incorrectly treats the active workingset as
inactive and purges it? If so, it sounds like the fix should be in the
userspace, not in the kernel.

> > Yes, madvise_pageout() will record non-resident entries today. This
> > means refault and thrash detection is on for user-driven reclaim.
> >
> > So why is that undesirable?
>
> Consider an extreme scenario: the tail page of the LRU could accumulate
> a given refault distance without any in-kernel reclaim having happened,
> be wrongly deemed inactive, and get less protection.

madvise_pageout is a hint to the kernel that this page should be treated
as inactive, which it does. Why is that wrong?

> > Today we measure and report the cost of reclaim and memory pressure
> > for physical memory shortages, cgroup limits, and user-driven cgroup
> > reclaim. Why should we not do the same for madv_pageout()? If the
> > userspace code that drives pageout has a bug and the result is extreme
> > thrashing, wouldn't you want to know that?
>
> Actually, the pages evicted from the active LRU by madv_cold/pageout are
> not marked as WORKINGSET, which lets them escape the thrashing
> accounting when they fault back in and get stuck on I/O. I think they
> should be treated the same way in terms of SetPageWorkingset and the
> lruvec nonresident age. Please refer to my previous patch "mm: mark
> folio as workingset in lru_deactivate_fn".

I see your point but it's debatable. If madvise_pageout is a hint to
treat the page as inactive, then why should the kernel treat it as part
of the active workingset when evicting? I guess it boils down to whether
the kernel should try fixing a wrong hint from the userspace. I think
not but I would be interested in the opinions of others.
Thanks,
Suren.

> > Please explain the idea here better.
> >
> > > Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > ---
> > >  include/linux/swap.h | 2 +-
> > >  mm/madvise.c | 4 ++--
> > >  mm/vmscan.c | 8 +++++++-
> > >  3 files changed, 10 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > index 2787b84..0312142 100644
> > > --- a/include/linux/swap.h
> > > +++ b/include/linux/swap.h
> > > @@ -428,7 +428,7 @@ extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
> > >  extern int vm_swappiness;
> > >  long remove_mapping(struct address_space *mapping, struct folio *folio);
> > >
> > > -extern unsigned long reclaim_pages(struct list_head *page_list);
> > > +extern unsigned long reclaim_pages(struct mm_struct *mm, struct list_head *page_list);
> > >  #ifdef CONFIG_NUMA
> > >  extern int node_reclaim_mode;
> > >  extern int sysctl_min_unmapped_ratio;
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index b6ea204..61c8d7b 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -420,7 +420,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> > >  huge_unlock:
> > >  	spin_unlock(ptl);
> > >  	if (pageout)
> > > -		reclaim_pages(&page_list);
> > > +		reclaim_pages(mm, &page_list);
> > >  	return 0;
> > >  }
> > >
> > > @@ -516,7 +516,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> > >  	arch_leave_lazy_mmu_mode();
> > >  	pte_unmap_unlock(orig_pte, ptl);
> > >  	if (pageout)
> > > -		reclaim_pages(&page_list);
> > > +		reclaim_pages(mm, &page_list);
> > >  	cond_resched();
> > >
> > >  	return 0;
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 20facec..048c10b 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2741,12 +2741,14 @@ static unsigned int reclaim_folio_list(struct list_head *folio_list,
> > >  	return nr_reclaimed;
> > >  }
> > >
> > > -unsigned long reclaim_pages(struct list_head *folio_list)
> > > +unsigned long reclaim_pages(struct mm_struct *mm, struct list_head *folio_list)
> > >  {
> > >  	int nid;
> > >  	unsigned int nr_reclaimed = 0;
> > >  	LIST_HEAD(node_folio_list);
> > >  	unsigned int noreclaim_flag;
> > > +	struct lruvec *lruvec;
> > > +	struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
> > >
> > >  	if (list_empty(folio_list))
> > >  		return nr_reclaimed;
> > > @@ -2764,10 +2766,14 @@ unsigned long reclaim_pages(struct list_head *folio_list)
> > >  		}
> > >
> > >  		nr_reclaimed += reclaim_folio_list(&node_folio_list, NODE_DATA(nid));
> > > +		lruvec = &memcg->nodeinfo[nid]->lruvec;
> > > +		workingset_age_nonresident(lruvec, -nr_reclaimed);
> > >  		nid = folio_nid(lru_to_folio(folio_list));
> > >  	} while (!list_empty(folio_list));
> > >
> > >  	nr_reclaimed += reclaim_folio_list(&node_folio_list, NODE_DATA(nid));
> > > +	lruvec = &memcg->nodeinfo[nid]->lruvec;
> > > +	workingset_age_nonresident(lruvec, -nr_reclaimed);
> >
> > The task might have moved cgroups in between, who knows what kind of
> > artifacts it will introduce if you wind back the wrong clock.
> >
> > If there are reclaim passes that shouldn't participate in non-resident
> > tracking, that should be plumbed through the stack to __remove_mapping
> > (which already has that bool reclaimed param to not record entries).
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2787b84..0312142 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -428,7 +428,7 @@ extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
 extern int vm_swappiness;
 long remove_mapping(struct address_space *mapping, struct folio *folio);
 
-extern unsigned long reclaim_pages(struct list_head *page_list);
+extern unsigned long reclaim_pages(struct mm_struct *mm, struct list_head *page_list);
 #ifdef CONFIG_NUMA
 extern int node_reclaim_mode;
 extern int sysctl_min_unmapped_ratio;
diff --git a/mm/madvise.c b/mm/madvise.c
index b6ea204..61c8d7b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -420,7 +420,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 huge_unlock:
 	spin_unlock(ptl);
 	if (pageout)
-		reclaim_pages(&page_list);
+		reclaim_pages(mm, &page_list);
 	return 0;
 }
 
@@ -516,7 +516,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(orig_pte, ptl);
 	if (pageout)
-		reclaim_pages(&page_list);
+		reclaim_pages(mm, &page_list);
 	cond_resched();
 
 	return 0;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 20facec..048c10b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2741,12 +2741,14 @@ static unsigned int reclaim_folio_list(struct list_head *folio_list,
 	return nr_reclaimed;
 }
 
-unsigned long reclaim_pages(struct list_head *folio_list)
+unsigned long reclaim_pages(struct mm_struct *mm, struct list_head *folio_list)
 {
 	int nid;
 	unsigned int nr_reclaimed = 0;
 	LIST_HEAD(node_folio_list);
 	unsigned int noreclaim_flag;
+	struct lruvec *lruvec;
+	struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
 
 	if (list_empty(folio_list))
 		return nr_reclaimed;
@@ -2764,10 +2766,14 @@ unsigned long reclaim_pages(struct list_head *folio_list)
 		}
 
 		nr_reclaimed += reclaim_folio_list(&node_folio_list, NODE_DATA(nid));
+		lruvec = &memcg->nodeinfo[nid]->lruvec;
+		workingset_age_nonresident(lruvec, -nr_reclaimed);
 		nid = folio_nid(lru_to_folio(folio_list));
 	} while (!list_empty(folio_list));
 
 	nr_reclaimed += reclaim_folio_list(&node_folio_list, NODE_DATA(nid));
+	lruvec = &memcg->nodeinfo[nid]->lruvec;
+	workingset_age_nonresident(lruvec, -nr_reclaimed);
 
 	memalloc_noreclaim_restore(noreclaim_flag);
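One usage note on the new get_mem_cgroup_from_mm() call above, based on
that function's general contract rather than anything stated in this
thread: it returns a memcg with a reference held, and callers are
expected to pair it with mem_cgroup_put(). A minimal sketch of the
pairing a final version of the patch would likely need:

	struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);

	/* ... look up memcg->nodeinfo[nid]->lruvec and age it ... */

	mem_cgroup_put(memcg);	/* drop the reference taken above */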