Message ID | 20221024134146.3442393-1-chenwandun@huawei.com |
---|---|
State | New |
Headers |
From: Chen Wandun <chenwandun@huawei.com>
To: <akpm@linux-foundation.org>, <mgorman@techsingularity.net>, <vbabka@suse.cz>, <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>, <wangkefeng.wang@huawei.com>
Subject: [PATCH] mm: fix pcp count beyond pcp high in pcplist allocation
Date: Mon, 24 Oct 2022 21:41:46 +0800
Message-ID: <20221024134146.3442393-1-chenwandun@huawei.com>
X-Mailer: git-send-email 2.25.1 |
Series | mm: fix pcp count beyond pcp high in pcplist allocation |
Commit Message
Chen Wandun
Oct. 24, 2022, 1:41 p.m. UTC
Nowadays the pcplist holds pages of several orders. When the list for the
target order is empty while the other lists are not all empty,
__rmqueue_pcplist() allocates a full pcp batch from the buddy allocator
to refill the pcplist, which can leave pcp->count beyond pcp->high after
the allocation. This behaviour is easy to observe by adding debugging
information to __rmqueue_pcplist().

Fix this by recalculating the number of batch pages to be allocated.
Fixes: 44042b449872 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
---
mm/page_alloc.c | 2 ++
1 file changed, 2 insertions(+)
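
To make the failure mode concrete, the refill arithmetic can be sketched as a
small standalone program. The high/batch/count numbers below are invented for
illustration and the helper is not kernel code; only the batch scaling mirrors
the logic in __rmqueue_pcplist().

#include <stdio.h>

/* Mirror of the existing batch scaling: refill by at least 2 blocks. */
static int scaled_batch(int batch, unsigned int order)
{
	if (batch > 1) {
		int scaled = batch >> order;

		batch = scaled > 2 ? scaled : 2;
	}
	return batch;
}

int main(void)
{
	int high = 186, batch = 63;	/* hypothetical pcp->high / pcp->batch */
	int count = 180;		/* order-0 lists already nearly full... */
	unsigned int order = 3;		/* ...but the order-3 list is empty */
	int refill = scaled_batch(batch, order);	/* 63 >> 3 = 7 blocks */

	count += refill << order;	/* 7 blocks * 8 pages = 56 pages */
	printf("pcp->count %d vs pcp->high %d\n", count, high);	/* 236 vs 186 */
	return 0;
}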
Comments
On Mon, Oct 24, 2022 at 09:41:46PM +0800, Chen Wandun wrote:
> Nowadays the pcplist holds pages of several orders. When the list for the
> target order is empty while the other lists are not all empty,
> __rmqueue_pcplist() allocates a full pcp batch from the buddy allocator
> to refill the pcplist, which can leave pcp->count beyond pcp->high after
> the allocation. This behaviour is easy to observe by adding debugging
> information to __rmqueue_pcplist().
>
> Fix this by recalculating the number of batch pages to be allocated.

Are any problems observed other than the PCP lists temporarily exceeding
pcp->high?

As is, the patch could result in a batch request of 0 and fall through to
allocating from the zone list anyway, defeating the purpose of the PCP
allocator and probably regressing performance in some cases.

The intention was to allow high to be briefly exceeded on the refill side,
particularly for THP pages, and to always refill by at least two pages. In
the THP case, one would be allocated and maybe one in the near future
without acquiring the zone lock. If the limits are exceeded, it's only
exceeded until the next free.
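
To make the review point above concrete, here is a hedged sketch of the clamp
the patch introduces; the helper name and the numbers are invented and this is
not code from the patch or the kernel.

/* Sketch only: models the patch's extra clamp with example values. */
static int clamp_refill(int batch, int high, int count, unsigned int order)
{
	int room;

	if (batch > 1) {
		int scaled = batch >> order;

		batch = scaled > 2 ? scaled : 2;	/* existing scaling */
	}
	room = (high - count) >> order;			/* clamp added by the patch */
	return batch < room ? batch : room;
}

With these hypothetical values, clamp_refill(63, 186, 184, 3) returns 0 (and
the result can even go negative once count already exceeds high), so
rmqueue_bulk() would be asked for zero pages and the allocation falls back to
the zone free lists, which is exactly what the PCP fast path is meant to avoid.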
On 2022/10/24 22:55, Mel Gorman wrote:
> On Mon, Oct 24, 2022 at 09:41:46PM +0800, Chen Wandun wrote:
>> Nowadays the pcplist holds pages of several orders. When the list for the
>> target order is empty while the other lists are not all empty,
>> __rmqueue_pcplist() allocates a full pcp batch from the buddy allocator
>> to refill the pcplist, which can leave pcp->count beyond pcp->high after
>> the allocation. This behaviour is easy to observe by adding debugging
>> information to __rmqueue_pcplist().
>>
>> Fix this by recalculating the number of batch pages to be allocated.
> Are any problems observed other than the PCP lists temporarily exceeding
> pcp->high?

It results in frequently refilling pcp pages from the buddy allocator and
releasing pcp pages back to it.

> As is, the patch could result in a batch request of 0 and

I forgot about this; the patch needs some improvement, thanks.

> fall through to allocating from the zone list anyway, defeating the
> purpose of the PCP allocator and probably regressing performance in some
> cases.

That matches my understanding. How about setting high/batch for each order
in the pcplist, or just sharing the pcp batch value and only setting high
for each order? It looks strange for pcp count to exceed pcp high in the
common case.

If each order has its own pcp high value, that behaviour is the same as a
pcplist that only contains order 0.

Thanks,
Wandun

> The intention was to allow high to be briefly exceeded on the refill side,
> particularly for THP pages, and to always refill by at least two pages. In
> the THP case, one would be allocated and maybe one in the near future
> without acquiring the zone lock. If the limits are exceeded, it's only
> exceeded until the next free.
>
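
Purely to illustrate the per-order suggestion above, a hypothetical layout
might look like the sketch below. None of these names exist in the kernel;
the real struct per_cpu_pages keeps a single shared high and batch for all
orders.

#define NR_PCP_ORDERS	4	/* hypothetical: orders 0..3 tracked separately */

/* Invented for discussion only, not a proposed kernel change. */
struct per_cpu_pages_per_order {
	int count;			/* pages on all lists */
	int high[NR_PCP_ORDERS];	/* refill/free threshold per order */
	int batch[NR_PCP_ORDERS];	/* refill chunk size per order */
};

As the reply below points out, the cost is extra per-CPU storage and the
difficulty of picking sensible per-order values.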
On Tue, Oct 25, 2022 at 07:49:59PM +0800, Chen Wandun wrote:
>
> On 2022/10/24 22:55, Mel Gorman wrote:
> > On Mon, Oct 24, 2022 at 09:41:46PM +0800, Chen Wandun wrote:
> > > Nowadays the pcplist holds pages of several orders. When the list for the
> > > target order is empty while the other lists are not all empty,
> > > __rmqueue_pcplist() allocates a full pcp batch from the buddy allocator
> > > to refill the pcplist, which can leave pcp->count beyond pcp->high after
> > > the allocation. This behaviour is easy to observe by adding debugging
> > > information to __rmqueue_pcplist().
> > >
> > > Fix this by recalculating the number of batch pages to be allocated.
> > Are any problems observed other than the PCP lists temporarily exceeding
> > pcp->high?
>
> It results in frequently refilling pcp pages from the buddy allocator and
> releasing pcp pages back to it.

Under what circumstances does this cause a problem? I 100% accept that it
could happen but one downside of the patch is that it simply changes the
shape of the problem. If the batch refill is clamped then potentially the
PCP list is depleted quicker and needs to be refilled sooner, so zone lock
acquisitions are still required, potentially at a higher frequency, due to
the clamped refill sizes. All that changes is the timing.

> > As is, the patch could result in a batch request of 0 and
>
> I forgot about this; the patch needs some improvement, thanks.
>
> > fall through to allocating from the zone list anyway, defeating the
> > purpose of the PCP allocator and probably regressing performance in some
> > cases.
>
> That matches my understanding. How about setting high/batch for each order
> in the pcplist,

Using anything other than (X >> order) consumes storage. Even if storage
was to be used, selecting a value per-order would be impossible because
the correct value would depend on the frequency of requests for each order.
That can only be determined at runtime and the cost of determining the
value would likely exceed the benefit.

At most, you could state that the batch refill should be at least 1 but
otherwise not exceed high. The downside is that zone->lock contention will
increase for a stream of THP pages, which is a common allocation size.
The intent behind batch-2 was to reduce contention by 50% when multiple
processes are faulting large anonymous regions at the same time. THP
allocations are the ones most likely to exceed pcp->high by a noticeable
amount.

> or just sharing the pcp batch value and only setting high for each order?
> It looks strange for pcp count to exceed pcp high in the common case.
>
> If each order has its own pcp high value, that behaviour is the same as a
> pcplist that only contains order 0.
>

Specify in the changelog how a workload is improved. That may be in terms
of memory usage, performance, zone lock contention or cases where pcp->high
being exceeded causes a functional problem on a particular class of system.
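
A rough sketch of the "at least 1 but otherwise not exceed high" alternative
mentioned above; the helper is invented and was not posted in the thread.

/* Sketch only: never request zero pages, otherwise stay within pcp->high. */
static int clamp_refill_min1(int batch, int high, int count, unsigned int order)
{
	int room;

	if (batch > 1) {
		int scaled = batch >> order;

		batch = scaled > 2 ? scaled : 2;
	}
	room = (high - count) >> order;
	if (batch > room)
		batch = room;
	return batch > 1 ? batch : 1;	/* always refill by at least one block */
}

The trade-off described above applies directly: for THP-sized requests this
degenerates to single-block refills, giving up the zone-lock contention
reduction that the batch-of-2 behaviour was added for.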
On 2022/10/25 21:19, Mel Gorman wrote:
> On Tue, Oct 25, 2022 at 07:49:59PM +0800, Chen Wandun wrote:
>>
>> On 2022/10/24 22:55, Mel Gorman wrote:
>>> On Mon, Oct 24, 2022 at 09:41:46PM +0800, Chen Wandun wrote:
>>>> Nowadays the pcplist holds pages of several orders. When the list for the
>>>> target order is empty while the other lists are not all empty,
>>>> __rmqueue_pcplist() allocates a full pcp batch from the buddy allocator
>>>> to refill the pcplist, which can leave pcp->count beyond pcp->high after
>>>> the allocation. This behaviour is easy to observe by adding debugging
>>>> information to __rmqueue_pcplist().
>>>>
>>>> Fix this by recalculating the number of batch pages to be allocated.
>>> Are any problems observed other than the PCP lists temporarily exceeding
>>> pcp->high?
>> It results in frequently refilling pcp pages from the buddy allocator and
>> releasing pcp pages back to it.
> Under what circumstances does this cause a problem? I 100% accept that it

Sorry for taking so long to reply. It is hard to say whether this phenomenon
causes a functional problem; I just noticed it and wondered whether something
could be improved.

> could happen but one downside of the patch is that it simply changes the
> shape of the problem. If the batch refill is clamped then potentially the
> PCP list is depleted quicker and needs to be refilled sooner, so zone lock
> acquisitions are still required, potentially at a higher frequency, due to
> the clamped refill sizes. All that changes is the timing.

Agreed, the zone-lock contention needs more consideration.

>
>>> As is, the patch could result in a batch request of 0 and
>> I forgot about this; the patch needs some improvement, thanks.
>>
>>> fall through to allocating from the zone list anyway, defeating the
>>> purpose of the PCP allocator and probably regressing performance in some
>>> cases.
>> That matches my understanding. How about setting high/batch for each order
>> in the pcplist,
> Using anything other than (X >> order) consumes storage. Even if storage
> was to be used, selecting a value per-order would be impossible because
> the correct value would depend on the frequency of requests for each order.
> That can only be determined at runtime and the cost of determining the
> value would likely exceed the benefit.

Can we set an experience value for pcp batch for each order during the init
stage? If so we can control the pcp size accurately. Nowadays, the size of
each order in the pcp list is full of randomness. I don't know which scheme
is better for performance.

> At most, you could state that the batch refill should be at least 1 but
> otherwise not exceed high. The downside is that zone->lock contention will
> increase for a stream of THP pages, which is a common allocation size.
> The intent behind batch-2 was to reduce contention by 50% when multiple
> processes are faulting large anonymous regions at the same time. THP
> allocations are the ones most likely to exceed pcp->high by a noticeable
> amount.
>
>> or just sharing the pcp batch value and only setting high for each order?
>> It looks strange for pcp count to exceed pcp high in the common case.
>>
>> If each order has its own pcp high value, that behaviour is the same as a
>> pcplist that only contains order 0.
>>
> Specify in the changelog how a workload is improved. That may be in terms
> of memory usage, performance, zone lock contention or cases where pcp->high
> being exceeded causes a functional problem on a particular class of system.

Got it, thanks.

>
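
For illustration of the init-time idea above, a per-order batch table could be
as simple as the sketch below; the table and its values are invented, not
something proposed in the thread.

/* Hypothetical fixed per-order batch values chosen at pageset init time. */
static const int pcp_order_batch[] = {
	[0] = 63,	/* order-0: keep the existing batch */
	[1] = 31,
	[2] = 15,
	[3] = 7,	/* larger orders refill in smaller chunks */
};

As noted in the replies, the hard part is justifying any particular set of
numbers, since the right values depend on the runtime mix of allocation orders.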
On Mon, Oct 31, 2022 at 11:37:35AM +0800, Chen Wandun wrote:
> > > > As is, the patch could result in a batch request of 0 and
> > > I forgot about this; the patch needs some improvement, thanks.
> > >
> > > > fall through to allocating from the zone list anyway, defeating the
> > > > purpose of the PCP allocator and probably regressing performance in some
> > > > cases.
> > > That matches my understanding. How about setting high/batch for each order
> > > in the pcplist,
> > Using anything other than (X >> order) consumes storage. Even if storage
> > was to be used, selecting a value per-order would be impossible because
> > the correct value would depend on the frequency of requests for each order.
> > That can only be determined at runtime and the cost of determining the
> > value would likely exceed the benefit.
>
> Can we set an experience value for pcp batch for each order during the init
> stage?

I'm not sure what you mean by "experience value" but maybe you meant
experimental value?

> If so we can control the pcp size accurately. Nowadays, the size of each
> order in the pcp list is full of randomness. I don't know which scheme is
> better for performance.
>

It is something that could be experimented with but the main question is
-- what should those per-order values be? One option would be to enforce
pcp->high for all high-order values except THP if THP is enabled. That would
limit some of the issues with pcp->high being exceeded as even if two THPs
are refilled, one of them is allocated immediately. I wasn't convinced it was
necessary when implementing high-order PCP support but it could be evaluated.
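
To illustrate the option described above (honour pcp->high for high-order
refills except THP-sized ones), here is a hedged sketch. The helper and its
thp_order parameter are invented; in the kernel the THP order would correspond
to HPAGE_PMD_ORDER.

/* Sketch only: clamp high-order refills to pcp->high, but let THP-sized
 * refills exceed it briefly, as the current code allows. */
static int refill_batch(int batch, int high, int count,
			unsigned int order, unsigned int thp_order)
{
	if (batch > 1) {
		int scaled = batch >> order;

		batch = scaled > 2 ? scaled : 2;
	}
	if (order > 0 && order != thp_order) {
		int room = (high - count) >> order;

		if (batch > room)
			batch = room > 1 ? room : 1;	/* never ask for zero pages */
	}
	return batch;
}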
On 2022/11/1 18:40, Mel Gorman wrote:
> On Mon, Oct 31, 2022 at 11:37:35AM +0800, Chen Wandun wrote:
>>>>> As is, the patch could result in a batch request of 0 and
>>>> I forgot about this; the patch needs some improvement, thanks.
>>>>
>>>>> fall through to allocating from the zone list anyway, defeating the
>>>>> purpose of the PCP allocator and probably regressing performance in some
>>>>> cases.
>>>> That matches my understanding. How about setting high/batch for each order
>>>> in the pcplist,
>>> Using anything other than (X >> order) consumes storage. Even if storage
>>> was to be used, selecting a value per-order would be impossible because
>>> the correct value would depend on the frequency of requests for each order.
>>> That can only be determined at runtime and the cost of determining the
>>> value would likely exceed the benefit.
>> Can we set an experience value for pcp batch for each order during the init
>> stage?
> I'm not sure what you mean by "experience value" but maybe you meant
> experimental value?

Yes, experimental value, sorry for that.

>
>> If so we can control the pcp size accurately. Nowadays, the size of each
>> order in the pcp list is full of randomness. I don't know which scheme is
>> better for performance.
>>
> It is something that could be experimented with but the main question is
> -- what should those per-order values be? One option would be to enforce
> pcp->high for all high-order values except THP if THP is enabled. That would
> limit some of the issues with pcp->high being exceeded as even if two THPs
> are refilled, one of them is allocated immediately. I wasn't convinced it was
> necessary when implementing high-order PCP support but it could be evaluated.

Thank you for your suggestion, I will do some tests.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 39f846d098f5..93e18b6de2f3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3742,6 +3742,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 	do {
 		if (list_empty(list)) {
 			int batch = READ_ONCE(pcp->batch);
+			int high = READ_ONCE(pcp->high);
 			int alloced;
 
 			/*
@@ -3753,6 +3754,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 			 */
 			if (batch > 1)
 				batch = max(batch >> order, 2);
+			batch = min(batch, (high - pcp->count) >> order);
 			alloced = rmqueue_bulk(zone, order,
 					batch, list,
 					migratetype, alloc_flags);