Message ID | 20230707103226.38496-1-yangyifei03@kuaishou.com |
---|---|
State | New |
Headers |
From: Efly Young <yangyifei03@kuaishou.com>
To: <akpm@linux-foundation.org>
CC: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>
Subject: [PATCH] mm:vmscan: fix inaccurate reclaim during proactive reclaim
Date: Fri, 7 Jul 2023 18:32:26 +0800
Message-ID: <20230707103226.38496-1-yangyifei03@kuaishou.com>
X-Mailer: git-send-email 2.35.1
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain |
Series |
mm:vmscan: fix inaccurate reclaim during proactive reclaim
|
|
Commit Message
Efly Young
July 7, 2023, 10:32 a.m. UTC
With commit f53af4285d77 ("mm: vmscan: fix extreme overreclaim
and swap floods"), proactive reclaim still seems inaccurate.

Our problematic scenario also consists almost entirely of anon
pages. Requesting 1G by writing to memory.reclaim reclaims 1.7G,
or some other amount greater than 1G, by swapping.

This patch tries to fix the inaccurate reclaim problem.

Signed-off-by: Efly Young <yangyifei03@kuaishou.com>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Comments
(cc hannes)

On Fri, 7 Jul 2023 18:32:26 +0800 Efly Young <yangyifei03@kuaishou.com> wrote:

> With commit f53af4285d77 ("mm: vmscan: fix extreme overreclaim
> and swap floods"), proactive reclaim still seems inaccurate.
>
> Our problematic scenario also consists almost entirely of anon
> pages. Requesting 1G by writing to memory.reclaim reclaims 1.7G,
> or some other amount greater than 1G, by swapping.
>
> This patch tries to fix the inaccurate reclaim problem.

It would be helpful to have some additional explanation of why you
believe the current code is incorrect.

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6208,7 +6208,7 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  	unsigned long nr_to_scan;
>  	enum lru_list lru;
>  	unsigned long nr_reclaimed = 0;
> -	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> +	unsigned long nr_to_reclaim = (sc->nr_to_reclaim - sc->nr_reclaimed);
>  	bool proportional_reclaim;
>  	struct blk_plug plug;
>

On Fri, Jul 07, 2023 at 06:32:26PM +0800, Efly Young wrote:
> With commit f53af4285d77 ("mm: vmscan: fix extreme overreclaim
> and swap floods"), proactive reclaim still seems inaccurate.
>
> Our problematic scenario also consists almost entirely of anon
> pages. Requesting 1G by writing to memory.reclaim reclaims 1.7G,
> or some other amount greater than 1G, by swapping.
>
> This patch tries to fix the inaccurate reclaim problem.

I can see how this happens. Direct and kswapd reclaim have much
smaller nr_to_reclaim targets, so it's less noticeable when we loop a
few times. Proactive reclaim can come in with a rather large value.

What does the reproducer setup look like? Are you calling reclaim on a
higher level cgroup with several children? Or is the looping coming
from having multiple zones alone?

> Signed-off-by: Efly Young <yangyifei03@kuaishou.com>
> ---
>  mm/vmscan.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9c1c5e8b..2aea8d9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6208,7 +6208,7 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  	unsigned long nr_to_scan;
>  	enum lru_list lru;
>  	unsigned long nr_reclaimed = 0;
> -	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> +	unsigned long nr_to_reclaim = (sc->nr_to_reclaim - sc->nr_reclaimed);

This can underflow. shrink_list() eats SWAP_CLUSTER_MAX batches out of
lru_pages >> priority, and only checks reclaimed > to_reclaim
after. This will then disable the bailout mechanism entirely.

In general, I'm not sure this is the best spot to fix the problem:

- During reclaim/compaction, should_continue_reclaim() may decide that
  more reclaim is required before compaction can proceed. But the
  second cycle might not do anything now, since you remember the work
  done by the previous one.

- shrink_node_memcgs() might do the full batch against the first
  cgroup and not touch the second one anymore. This will result in
  super lopsided behavior when you target a tree of multiple groups.

There might be other spots that break, I haven't checked.

You could go through them one by one, of course. But the truth is,
larger reclaim targets are the rare exception. Trying to support them
at the risk of breaking all other reclaim users seems ill-advised.

A better approach might be to just say: "don't call reclaim with large
numbers". Have proactive reclaim code handle the batching into smaller
chunks:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e8ca4bdcb03c..4b016806dcc7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6696,7 +6696,7 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 		lru_add_drain_all();
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
-						nr_to_reclaim - nr_reclaimed,
+						min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
 						GFP_KERNEL, reclaim_options);
 
 		if (!reclaimed && !nr_retries--)

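The effect of batching the request into smaller chunks can be illustrated with a small user-space model. This is only a sketch under loud assumptions: toy_reclaim() is a hypothetical stand-in for the real reclaimer that always frees 1.5x what it is asked for (loosely modelled on the report of 1.7G reclaimed for a 1G request), and BATCH_PAGES mirrors the kernel's SWAP_CLUSTER_MAX value of 32 pages. None of these helper names exist in the kernel.

```c
#include <assert.h>

/* Assumed chunk size; mirrors SWAP_CLUSTER_MAX (32 pages) in the kernel. */
#define BATCH_PAGES 32UL

struct reclaim_result {
	unsigned long reclaimed;	/* total pages reclaimed */
	unsigned long calls;		/* number of reclaim invocations */
};

/*
 * Hypothetical reclaimer for illustration only: it overshoots every
 * request by a fixed 50%. This is a toy model, not
 * try_to_free_mem_cgroup_pages().
 */
static unsigned long toy_reclaim(unsigned long request)
{
	return request + request / 2;
}

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Loop until the target is met, asking for at most max_chunk pages per call. */
static struct reclaim_result reclaim_in_chunks(unsigned long target,
					       unsigned long max_chunk)
{
	struct reclaim_result res = { 0, 0 };

	assert(max_chunk > 0);
	while (res.reclaimed < target) {
		unsigned long want = min_ul(target - res.reclaimed, max_chunk);

		res.reclaimed += toy_reclaim(want);
		res.calls++;
	}
	return res;
}
```

Under this toy model, a target of 262144 pages (1G with 4K pages) requested in one call ends at 393216 pages after a single invocation, while reclaim_in_chunks(262144UL, BATCH_PAGES) ends at 262152 pages after 5462 invocations. The total overshoot is bounded by the overshoot of the last chunk, at the cost of many more calls.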
>> With commit f53af4285d77 ("mm: vmscan: fix extreme overreclaim
>> and swap floods"), proactive reclaim still seems inaccurate.
>>
>> Our problematic scenario also consists almost entirely of anon
>> pages. Requesting 1G by writing to memory.reclaim reclaims 1.7G,
>> or some other amount greater than 1G, by swapping.
>>
>> This patch tries to fix the inaccurate reclaim problem.
>
> I can see how this happens. Direct and kswapd reclaim have much
> smaller nr_to_reclaim targets, so it's less noticeable when we loop a
> few times. Proactive reclaim can come in with a rather large value.
>
> What does the reproducer setup look like? Are you calling reclaim on a
> higher level cgroup with several children? Or is the looping coming
> from having multiple zones alone?

Thank you for your comment. The reproducer is a process in a leaf
cgroup without children that just mallocs 20G of anonymous memory and
sleeps; reclaim is then invoked on that leaf cgroup. Before commit
f53af4285d77 ("mm: vmscan: fix extreme overreclaim and swap floods"),
the reclaimer could reclaim many times the requested amount. Now it
should eventually reclaim an amount in [request, 2 * request).

>> Signed-off-by: Efly Young <yangyifei03@kuaishou.com>
>> ---
>>  mm/vmscan.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 9c1c5e8b..2aea8d9 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -6208,7 +6208,7 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>>  	unsigned long nr_to_scan;
>>  	enum lru_list lru;
>>  	unsigned long nr_reclaimed = 0;
>> -	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
>> +	unsigned long nr_to_reclaim = (sc->nr_to_reclaim - sc->nr_reclaimed);
>
> This can underflow. shrink_list() eats SWAP_CLUSTER_MAX batches out of
> lru_pages >> priority, and only checks reclaimed > to_reclaim
> after. This will then disable the bailout mechanism entirely.
>
> In general, I'm not sure this is the best spot to fix the problem:
>
> - During reclaim/compaction, should_continue_reclaim() may decide that
>   more reclaim is required before compaction can proceed. But the
>   second cycle might not do anything now, since you remember the work
>   done by the previous one.
>
> - shrink_node_memcgs() might do the full batch against the first
>   cgroup and not touch the second one anymore. This will result in
>   super lopsided behavior when you target a tree of multiple groups.
>
> There might be other spots that break, I haven't checked.
>
> You could go through them one by one, of course. But the truth is,
> larger reclaim targets are the rare exception. Trying to support them
> at the risk of breaking all other reclaim users seems ill-advised.

I agree with your view. Your explanation takes more cases into
account than mine did. Thank you again for helping me out.

> A better approach might be to just say: "don't call reclaim with large
> numbers". Have proactive reclaim code handle the batching into smaller
> chunks:
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e8ca4bdcb03c..4b016806dcc7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6696,7 +6696,7 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>  		lru_add_drain_all();
>
>  		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> -						nr_to_reclaim - nr_reclaimed,
> +						min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
>  						GFP_KERNEL, reclaim_options);
>
>  		if (!reclaimed && !nr_retries--)

Maybe this approach could solve the inaccurate proactive reclaim
problem without breaking the original balance. But might it be less
efficient than before?

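The reproducer described in this reply (a process in a leaf cgroup that allocates anonymous memory and sleeps, followed by a write to memory.reclaim) could be sketched roughly as below. This is not the reporter's actual test program: touch_anon() and request_reclaim() are hypothetical helper names, and the cgroup path has to be supplied by the caller.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Allocate `bytes` of anonymous memory and write to every byte so the
 * pages are actually faulted in rather than merely reserved.
 * Returns NULL if the allocation fails.
 */
static char *touch_anon(size_t bytes)
{
	char *p = malloc(bytes);

	if (p)
		memset(p, 0x5a, bytes);
	return p;
}

/*
 * Write a request such as "1G" to a memory.reclaim file.
 * Returns 0 on success and -1 on failure.
 */
static int request_reclaim(const char *path, const char *amount)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fputs(amount, f) >= 0;
	/* Buffered data reaches the kernel on close, so check fclose() too. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

One process would call touch_anon() with the desired size and then sleep, while a second one calls request_reclaim() on the memory.reclaim file of the leaf cgroup with "1G". Comparing the cgroup's memory.current before and after the write gives the amount actually reclaimed.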
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c1c5e8b..2aea8d9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6208,7 +6208,7 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	unsigned long nr_to_scan;
 	enum lru_list lru;
 	unsigned long nr_reclaimed = 0;
-	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
+	unsigned long nr_to_reclaim = (sc->nr_to_reclaim - sc->nr_reclaimed);
 	bool proportional_reclaim;
 	struct blk_plug plug;
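The subtraction in the patch above operates on two unsigned long values, which is where the underflow raised in review comes from. A minimal user-space sketch, assuming plain unsigned long arithmetic stands in for the scan_control fields; remaining_unsafe(), remaining_clamped() and should_bail_out() are hypothetical names, and the bailout test is a simplification of the check in shrink_lruvec():

```c
#include <assert.h>
#include <limits.h>

/*
 * Stand-in for the patched initialisation:
 *   nr_to_reclaim = sc->nr_to_reclaim - sc->nr_reclaimed;
 * With unsigned operands the result wraps around to a huge value
 * whenever more has already been reclaimed than was requested.
 */
static unsigned long remaining_unsafe(unsigned long target,
				      unsigned long reclaimed)
{
	return target - reclaimed;
}

/* Clamped variant: never goes below zero. */
static unsigned long remaining_clamped(unsigned long target,
				       unsigned long reclaimed)
{
	return reclaimed < target ? target - reclaimed : 0;
}

/* Simplified bailout test: stop once the local target has been met. */
static int should_bail_out(unsigned long nr_reclaimed,
			   unsigned long nr_to_reclaim)
{
	return nr_reclaimed >= nr_to_reclaim;
}
```

With a target of 32 pages and 40 pages already reclaimed, remaining_unsafe() returns ULONG_MAX - 7, so should_bail_out() stays false for any realistic amount of work, whereas remaining_clamped() returns 0 and the loop can stop immediately. Clamping only removes the wraparound; it does not address the other objections raised in the thread.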