Message ID | 20231016071245.2865233-1-zhaoyang.huang@unisoc.com |
---|---|
State | New |
Series | [PATCHv6,1/1] mm: optimization on page allocation when CMA enabled |
Commit Message
zhaoyang.huang
Oct. 16, 2023, 7:12 a.m. UTC
From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>

According to current CMA utilization policy, an alloc_pages(GFP_USER)
could 'steal' UNMOVABLE & RECLAIMABLE page blocks via the help of
CMA(pass zone_watermark_ok by counting CMA in but use U&R in rmqueue),
which could lead to following alloc_pages(GFP_KERNEL) fail.
Solving this by introducing second watermark checking for GFP_MOVABLE,
which could have the allocation use CMA when proper.

-- Free_pages(30MB)
|
|
-- WMARK_LOW(25MB)
|
-- Free_CMA(12MB)
|
|
--

Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
---
v6: update comments
---
 mm/page_alloc.c | 44 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 40 insertions(+), 4 deletions(-)
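To make the figure concrete, here is a minimal user-space sketch of the arithmetic it describes. The numbers are the example values from the diagram (in MB) and the variable names are illustrative, not the kernel's:

/*
 * Sketch of the scenario in the diagram above; values are the example
 * figures from the commit message, not measurements.
 */
#include <stdio.h>
#include <stdbool.h>

int main(void)
{
	long free_pages = 30;	/* total free, CMA included        */
	long wmark_low  = 25;	/* WMARK_LOW                       */
	long free_cma   = 12;	/* free pages in CMA pageblocks    */

	/* watermark check as seen by a movable (ALLOC_CMA) request */
	bool movable_ok = free_pages > wmark_low;

	/* what is actually left for UNMOVABLE/RECLAIMABLE requests */
	long non_cma = free_pages - free_cma;
	bool kernel_ok = non_cma > wmark_low;

	printf("movable passes watermark: %d\n", movable_ok);	/* 1: 30 > 25  */
	printf("non-CMA free: %ldMB, kernel passes watermark: %d\n",
	       non_cma, kernel_ok);				/* 18MB, 0     */

	/*
	 * The movable allocation passes the check only because CMA pages are
	 * counted, yet rmqueue may still serve it from the 18MB of non-CMA
	 * blocks, pushing them further below WMARK_LOW and letting a later
	 * GFP_KERNEL allocation fail -- the case the second watermark check
	 * (use_cma_first) is meant to avoid.
	 */
	return 0;
}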
Comments
On Mon, 16 Oct 2023 15:12:45 +0800 "zhaoyang.huang" <zhaoyang.huang@unisoc.com> wrote:

> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
>
> According to current CMA utilization policy, an alloc_pages(GFP_USER)
> could 'steal' UNMOVABLE & RECLAIMABLE page blocks via the help of
> CMA(pass zone_watermark_ok by counting CMA in but use U&R in rmqueue),
> which could lead to following alloc_pages(GFP_KERNEL) fail.
> Solving this by introducing second watermark checking for GFP_MOVABLE,
> which could have the allocation use CMA when proper.
>
> -- Free_pages(30MB)
> |
> |
> -- WMARK_LOW(25MB)
> |
> -- Free_CMA(12MB)
> |
> |
> --
>
> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> ---
> v6: update comments

The patch itself is identical to the v5 patch.  So either you meant
"update changelog" above or you sent the wrong diff?

Also, have we resolved any concerns regarding possible performance
impacts of this change?
On Tue, Oct 17, 2023 at 6:40 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 16 Oct 2023 15:12:45 +0800 "zhaoyang.huang" <zhaoyang.huang@unisoc.com> wrote:
>
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >
> > According to current CMA utilization policy, an alloc_pages(GFP_USER)
> > could 'steal' UNMOVABLE & RECLAIMABLE page blocks via the help of
> > CMA(pass zone_watermark_ok by counting CMA in but use U&R in rmqueue),
> > which could lead to following alloc_pages(GFP_KERNEL) fail.
> > Solving this by introducing second watermark checking for GFP_MOVABLE,
> > which could have the allocation use CMA when proper.
> >
> > -- Free_pages(30MB)
> > |
> > |
> > -- WMARK_LOW(25MB)
> > |
> > -- Free_CMA(12MB)
> > |
> > |
> > --
> >
> > Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > ---
> > v6: update comments
>
> The patch itself is identical to the v5 patch.  So either you meant
> "update changelog" above or you sent the wrong diff?
sorry, should be update changelog
>
> Also, have we resolved any concerns regarding possible performance
> impacts of this change?
I don't think this commit could introduce performance impact as it
actually adds one more path for using CMA page blocks in advance.

__rmqueue(struct zone *zone, unsigned int order, int migratetype,
	if (IS_ENABLED(CONFIG_CMA)) {
		if (alloc_flags & ALLOC_CMA &&
-			zone_page_state(zone, NR_FREE_CMA_PAGES) >
-			zone_page_state(zone, NR_FREE_PAGES) / 2) {
+			use_cma_first(zone, order, alloc_flags)) {
		//current '1/2' logic is kept while add a path for using CMA in advance than now.
			page = __rmqueue_cma_fallback(zone, order);
			if (page)
				return page;
		}
	}
retry:
	//normal rmqueue_smallest path is not affected which could deemed as a fallback path for __rmqueue_cma_fallback failure
	page = __rmqueue_smallest(zone, order, migratetype);
	if (unlikely(!page)) {
		if (alloc_flags & ALLOC_CMA)
			page = __rmqueue_cma_fallback(zone, order);

		if (!page && __rmqueue_fallback(zone, order, migratetype,
						alloc_flags))
			goto retry;
	}
	return page;
}
On Mon, Oct 16, 2023 at 03:12:45PM +0800, zhaoyang.huang wrote: > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com> > > According to current CMA utilization policy, an alloc_pages(GFP_USER) > could 'steal' UNMOVABLE & RECLAIMABLE page blocks via the help of > CMA(pass zone_watermark_ok by counting CMA in but use U&R in rmqueue), > which could lead to following alloc_pages(GFP_KERNEL) fail. > Solving this by introducing second watermark checking for GFP_MOVABLE, > which could have the allocation use CMA when proper. > > -- Free_pages(30MB) > | > | > -- WMARK_LOW(25MB) > | > -- Free_CMA(12MB) > | > | > -- We're running into the same issue in production and had an incident over the weekend because of it. The hosts have a raised vm.min_free_kbytes for network rx reliability, which makes the mismatch between free pages and what's actually allocatable by regular kernel requests quite pronounced. It wasn't OOMing this time, but we saw very high rates of thrashing while CMA had plenty of headroom. I had raised the broader issue around poor CMA utilization before: https://lore.kernel.org/lkml/20230726145304.1319046-1-hannes@cmpxchg.org/ For context, we're using hugetlb_cma at several gigabytes to allow sharing hosts between jobs that use hugetlb and jobs that don't. > @@ -2078,6 +2078,43 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, > > } > > +#ifdef CONFIG_CMA > +/* > + * GFP_MOVABLE allocation could drain UNMOVABLE & RECLAIMABLE page blocks via > + * the help of CMA which makes GFP_KERNEL failed. Checking if zone_watermark_ok > + * again without ALLOC_CMA to see if to use CMA first. > + */ > +static bool use_cma_first(struct zone *zone, unsigned int order, unsigned int alloc_flags) > +{ > + unsigned long watermark; > + bool cma_first = false; > + > + watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK); > + /* check if GFP_MOVABLE pass previous zone_watermark_ok via the help of CMA */ > + if (zone_watermark_ok(zone, order, watermark, 0, alloc_flags & (~ALLOC_CMA))) { > + /* > + * Balance movable allocations between regular and CMA areas by > + * allocating from CMA when over half of the zone's free memory > + * is in the CMA area. > + */ > + cma_first = (zone_page_state(zone, NR_FREE_CMA_PAGES) > > + zone_page_state(zone, NR_FREE_PAGES) / 2); > + } else { > + /* > + * watermark failed means UNMOVABLE & RECLAIMBLE is not enough > + * now, we should use cma first to keep them stay around the > + * corresponding watermark > + */ > + cma_first = true; > + } > + return cma_first; I think it's a step in the right direction. However, it doesn't take the lowmem reserves into account. With DMA32 that can be an additional multiple gigabytes of "free" memory not available to GFP_KERNEL. It also has a knee in the balancing curve because it doesn't take reserves into account *until* non-CMA is depleted - at which point it would already be below the use-CMA threshold by the full reserves and watermarks. A more complete solution would have to plumb the highest_zoneidx information through the rmqueue family of functions somehow, and always take unavailable free memory into account: --- Subject: [PATCH] mm: page_alloc: use CMA when kernel allocations are beginning to fail We can get into a situation where kernel allocations are starting to fail on watermarks, but movable allocations still don't use CMA because they make up more than half of the free memory. 
This can happen in particular with elevated vm.min_free_kbytes settings, where the remaining free pages aren't available to non-atomic requests. Example scenario: Free: 3.0G Watermarks: 2.0G CMA: 1.4G -> non-CMA: 1.6G CMA isn't used because CMA <= free/2. Kernel allocations fail due to non-CMA < watermarks. If memory is mostly unreclaimable (e.g. anon without swap), the kernel is more likely to OOM prematurely. Reduce the probability of that happening by taking reserves and watermarks into account when deciding whether to start using CMA. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- mm/page_alloc.c | 93 +++++++++++++++++++++++++++++++------------------ 1 file changed, 59 insertions(+), 34 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 733732e7e0ba..b9273d7f23b8 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2079,30 +2079,52 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, } +static bool should_try_cma(struct zone *zone, unsigned int order, + gfp_t gfp_flags, unsigned int alloc_flags) +{ + long free_pages; + + if (!IS_ENABLED(CONFIG_CMA) || !(alloc_flags & ALLOC_CMA)) + return false; + + /* + * CMA regions can be used by movable allocations while + * they're not otherwise in use. This is a delicate balance: + * Filling CMA too soon poses a latency risk for actual CMA + * allocations (think camera app startup). Filling CMA too + * late risks premature OOMs from non-movable allocations. + * + * Start using CMA once it dominates the remaining free + * memory. Be sure to take watermarks and reserves into + * account when considering what's truly "free". + * + * free_pages can go negative, but that's okay because + * NR_FREE_CMA_PAGES should not. + */ + + free_pages = zone_page_state(zone, NR_FREE_PAGES); + free_pages -= zone->lowmem_reserve[gfp_zone(gfp_flags)]; + free_pages -= wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK); + + return zone_page_state(zone, NR_FREE_CMA_PAGES) > free_pages / 2; +} + /* * Do the hard work of removing an element from the buddy allocator. * Call me with the zone->lock already held. */ static __always_inline struct page * -__rmqueue(struct zone *zone, unsigned int order, int migratetype, - unsigned int alloc_flags) +__rmqueue(struct zone *zone, unsigned int order, gfp_t gfp_flags, + int migratetype, unsigned int alloc_flags) { struct page *page; - if (IS_ENABLED(CONFIG_CMA)) { - /* - * Balance movable allocations between regular and CMA areas by - * allocating from CMA when over half of the zone's free memory - * is in the CMA area. - */ - if (alloc_flags & ALLOC_CMA && - zone_page_state(zone, NR_FREE_CMA_PAGES) > - zone_page_state(zone, NR_FREE_PAGES) / 2) { - page = __rmqueue_cma_fallback(zone, order); - if (page) - return page; - } + if (should_try_cma(zone, order, gfp_flags, alloc_flags)) { + page = __rmqueue_cma_fallback(zone, order); + if (page) + return page; } + retry: page = __rmqueue_smallest(zone, order, migratetype); if (unlikely(!page)) { @@ -2121,7 +2143,7 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype, * a single hold of the lock, for efficiency. Add them to the supplied list. * Returns the number of new pages which were placed at *list. 
*/ -static int rmqueue_bulk(struct zone *zone, unsigned int order, +static int rmqueue_bulk(struct zone *zone, unsigned int order, gfp_t gfp_flags, unsigned long count, struct list_head *list, int migratetype, unsigned int alloc_flags) { @@ -2130,8 +2152,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, spin_lock_irqsave(&zone->lock, flags); for (i = 0; i < count; ++i) { - struct page *page = __rmqueue(zone, order, migratetype, - alloc_flags); + struct page *page = __rmqueue(zone, order, gfp_flags, + migratetype, alloc_flags); if (unlikely(page == NULL)) break; @@ -2714,8 +2736,8 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z, static __always_inline struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone, - unsigned int order, unsigned int alloc_flags, - int migratetype) + unsigned int order, gfp_t gfp_flags, + unsigned int alloc_flags, int migratetype) { struct page *page; unsigned long flags; @@ -2726,7 +2748,8 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone, if (alloc_flags & ALLOC_HIGHATOMIC) page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC); if (!page) { - page = __rmqueue(zone, order, migratetype, alloc_flags); + page = __rmqueue(zone, order, migratetype, + gfp_flags, alloc_flags); /* * If the allocation fails, allow OOM handling access @@ -2806,10 +2829,10 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order) /* Remove page from the per-cpu list, caller must protect the list */ static inline struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order, - int migratetype, - unsigned int alloc_flags, - struct per_cpu_pages *pcp, - struct list_head *list) + gfp_t gfp_flags, int migratetype, + unsigned int alloc_flags, + struct per_cpu_pages *pcp, + struct list_head *list) { struct page *page; @@ -2818,7 +2841,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order, int batch = nr_pcp_alloc(pcp, zone, order); int alloced; - alloced = rmqueue_bulk(zone, order, + alloced = rmqueue_bulk(zone, order, gfp_flags, batch, list, migratetype, alloc_flags); @@ -2837,8 +2860,9 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order, /* Lock and remove page from the per-cpu list */ static struct page *rmqueue_pcplist(struct zone *preferred_zone, - struct zone *zone, unsigned int order, - int migratetype, unsigned int alloc_flags) + struct zone *zone, unsigned int order, + gfp_t gfp_flags, int migratetype, + unsigned int alloc_flags) { struct per_cpu_pages *pcp; struct list_head *list; @@ -2860,7 +2884,8 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone, */ pcp->free_count >>= 1; list = &pcp->lists[order_to_pindex(migratetype, order)]; - page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list); + page = __rmqueue_pcplist(zone, order, gfp_flags, migratetype, + alloc_flags, pcp, list); pcp_spin_unlock(pcp); pcp_trylock_finish(UP_flags); if (page) { @@ -2898,13 +2923,13 @@ struct page *rmqueue(struct zone *preferred_zone, if (likely(pcp_allowed_order(order))) { page = rmqueue_pcplist(preferred_zone, zone, order, - migratetype, alloc_flags); + gfp_flags, migratetype, alloc_flags); if (likely(page)) goto out; } - page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags, - migratetype); + page = rmqueue_buddy(preferred_zone, zone, order, gfp_flags, + alloc_flags, migratetype); out: /* Separate test+clear to avoid unnecessary atomics */ @@ -4480,8 +4505,8 @@ unsigned long 
__alloc_pages_bulk(gfp_t gfp, int preferred_nid, continue; } - page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags, - pcp, pcp_list); + page = __rmqueue_pcplist(zone, 0, gfp, ac.migratetype, + alloc_flags, pcp, pcp_list); if (unlikely(!page)) { /* Try and allocate at least one page */ if (!nr_account) {
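Working Johannes's example scenario through both rules makes the difference visible. A small sketch using the scenario's numbers (GB, illustrative; lowmem reserves are folded to zero for simplicity):

/*
 * Old balancing rule vs. the should_try_cma() form above, using the
 * example figures: Free 3.0G, Watermarks 2.0G, Free CMA 1.4G.
 */
#include <stdio.h>
#include <stdbool.h>

int main(void)
{
	double free_pages = 3.0, wmark = 2.0, free_cma = 1.4, reserve = 0.0;

	/* old rule: use CMA only once it holds over half of all free memory */
	bool old_rule = free_cma > free_pages / 2;	/* 1.4 > 1.5 -> false */

	/* should_try_cma(): discount reserves and watermarks first */
	double avail = free_pages - reserve - wmark;	/* 1.0               */
	bool new_rule = free_cma > avail / 2;		/* 1.4 > 0.5 -> true  */

	printf("old rule uses CMA: %d, new rule uses CMA: %d\n",
	       old_rule, new_rule);

	/*
	 * With the old rule, movable allocations keep draining the 1.6G of
	 * non-CMA memory even though it is already below the 2.0G watermark;
	 * discounting watermarks and reserves flips the decision to CMA
	 * before non-CMA memory is exhausted.
	 */
	return 0;
}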
On Wed, Nov 8, 2023 at 1:28 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Mon, Oct 16, 2023 at 03:12:45PM +0800, zhaoyang.huang wrote: > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com> > > > > According to current CMA utilization policy, an alloc_pages(GFP_USER) > > could 'steal' UNMOVABLE & RECLAIMABLE page blocks via the help of > > CMA(pass zone_watermark_ok by counting CMA in but use U&R in rmqueue), > > which could lead to following alloc_pages(GFP_KERNEL) fail. > > Solving this by introducing second watermark checking for GFP_MOVABLE, > > which could have the allocation use CMA when proper. > > > > -- Free_pages(30MB) > > | > > | > > -- WMARK_LOW(25MB) > > | > > -- Free_CMA(12MB) > > | > > | > > -- > > We're running into the same issue in production and had an incident > over the weekend because of it. The hosts have a raised > vm.min_free_kbytes for network rx reliability, which makes the > mismatch between free pages and what's actually allocatable by regular > kernel requests quite pronounced. It wasn't OOMing this time, but we > saw very high rates of thrashing while CMA had plenty of headroom. > > I had raised the broader issue around poor CMA utilization before: > https://lore.kernel.org/lkml/20230726145304.1319046-1-hannes@cmpxchg.org/ > > For context, we're using hugetlb_cma at several gigabytes to allow > sharing hosts between jobs that use hugetlb and jobs that don't. > > > @@ -2078,6 +2078,43 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, > > > > } > > > > +#ifdef CONFIG_CMA > > +/* > > + * GFP_MOVABLE allocation could drain UNMOVABLE & RECLAIMABLE page blocks via > > + * the help of CMA which makes GFP_KERNEL failed. Checking if zone_watermark_ok > > + * again without ALLOC_CMA to see if to use CMA first. > > + */ > > +static bool use_cma_first(struct zone *zone, unsigned int order, unsigned int alloc_flags) > > +{ > > + unsigned long watermark; > > + bool cma_first = false; > > + > > + watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK); > > + /* check if GFP_MOVABLE pass previous zone_watermark_ok via the help of CMA */ > > + if (zone_watermark_ok(zone, order, watermark, 0, alloc_flags & (~ALLOC_CMA))) { > > + /* > > + * Balance movable allocations between regular and CMA areas by > > + * allocating from CMA when over half of the zone's free memory > > + * is in the CMA area. > > + */ ok, thanks for point out. Could we simple it like this, which will mis-judge kmalloc within ioctl as GFP_USER. I think it is ok as it is rare if (current_is_kswapd() || !current->mm) gfp_flags = GFP_KERNEL; else gfp_flags = GFP_USER; free_pages = zone_page_state(zone, NR_FREE_PAGES); free_pages -= zone->lowmem_reserve[gfp_zone(gfp_flags)]; free_pages -= wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK); cma_first = free_pages > zone_page_state(zone, NR_FREE_PAGES) / 2); > > + cma_first = (zone_page_state(zone, NR_FREE_CMA_PAGES) > > > + zone_page_state(zone, NR_FREE_PAGES) / 2); > > + } else { > > + /* > > + * watermark failed means UNMOVABLE & RECLAIMBLE is not enough > > + * now, we should use cma first to keep them stay around the > > + * corresponding watermark > > + */ > > + cma_first = true; > > + } > > + return cma_first; > > I think it's a step in the right direction. However, it doesn't take > the lowmem reserves into account. With DMA32 that can be an additional > multiple gigabytes of "free" memory not available to GFP_KERNEL. 
It > also has a knee in the balancing curve because it doesn't take > reserves into account *until* non-CMA is depleted - at which point it > would already be below the use-CMA threshold by the full reserves and > watermarks. > > A more complete solution would have to plumb the highest_zoneidx > information through the rmqueue family of functions somehow, and > always take unavailable free memory into account: > > --- > Subject: [PATCH] mm: page_alloc: use CMA when kernel allocations are beginning > to fail > > We can get into a situation where kernel allocations are starting to > fail on watermarks, but movable allocations still don't use CMA > because they make up more than half of the free memory. This can > happen in particular with elevated vm.min_free_kbytes settings, where > the remaining free pages aren't available to non-atomic requests. > > Example scenario: > > Free: 3.0G > Watermarks: 2.0G > CMA: 1.4G > -> non-CMA: 1.6G > > CMA isn't used because CMA <= free/2. Kernel allocations fail due to > non-CMA < watermarks. If memory is mostly unreclaimable (e.g. anon > without swap), the kernel is more likely to OOM prematurely. > > Reduce the probability of that happening by taking reserves and > watermarks into account when deciding whether to start using CMA. > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > --- > mm/page_alloc.c | 93 +++++++++++++++++++++++++++++++------------------ > 1 file changed, 59 insertions(+), 34 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 733732e7e0ba..b9273d7f23b8 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2079,30 +2079,52 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, > > } > > +static bool should_try_cma(struct zone *zone, unsigned int order, > + gfp_t gfp_flags, unsigned int alloc_flags) > +{ > + long free_pages; > + > + if (!IS_ENABLED(CONFIG_CMA) || !(alloc_flags & ALLOC_CMA)) > + return false; > + > + /* > + * CMA regions can be used by movable allocations while > + * they're not otherwise in use. This is a delicate balance: > + * Filling CMA too soon poses a latency risk for actual CMA > + * allocations (think camera app startup). Filling CMA too > + * late risks premature OOMs from non-movable allocations. > + * > + * Start using CMA once it dominates the remaining free > + * memory. Be sure to take watermarks and reserves into > + * account when considering what's truly "free". > + * > + * free_pages can go negative, but that's okay because > + * NR_FREE_CMA_PAGES should not. > + */ > + > + free_pages = zone_page_state(zone, NR_FREE_PAGES); > + free_pages -= zone->lowmem_reserve[gfp_zone(gfp_flags)]; > + free_pages -= wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK); > + > + return zone_page_state(zone, NR_FREE_CMA_PAGES) > free_pages / 2; > +} > + > /* > * Do the hard work of removing an element from the buddy allocator. > * Call me with the zone->lock already held. > */ > static __always_inline struct page * > -__rmqueue(struct zone *zone, unsigned int order, int migratetype, > - unsigned int alloc_flags) > +__rmqueue(struct zone *zone, unsigned int order, gfp_t gfp_flags, > + int migratetype, unsigned int alloc_flags) > { > struct page *page; > > - if (IS_ENABLED(CONFIG_CMA)) { > - /* > - * Balance movable allocations between regular and CMA areas by > - * allocating from CMA when over half of the zone's free memory > - * is in the CMA area. 
> - */ > - if (alloc_flags & ALLOC_CMA && > - zone_page_state(zone, NR_FREE_CMA_PAGES) > > - zone_page_state(zone, NR_FREE_PAGES) / 2) { > - page = __rmqueue_cma_fallback(zone, order); > - if (page) > - return page; > - } > + if (should_try_cma(zone, order, gfp_flags, alloc_flags)) { > + page = __rmqueue_cma_fallback(zone, order); > + if (page) > + return page; > } > + > retry: > page = __rmqueue_smallest(zone, order, migratetype); > if (unlikely(!page)) { > @@ -2121,7 +2143,7 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype, > * a single hold of the lock, for efficiency. Add them to the supplied list. > * Returns the number of new pages which were placed at *list. > */ > -static int rmqueue_bulk(struct zone *zone, unsigned int order, > +static int rmqueue_bulk(struct zone *zone, unsigned int order, gfp_t gfp_flags, > unsigned long count, struct list_head *list, > int migratetype, unsigned int alloc_flags) > { > @@ -2130,8 +2152,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, > > spin_lock_irqsave(&zone->lock, flags); > for (i = 0; i < count; ++i) { > - struct page *page = __rmqueue(zone, order, migratetype, > - alloc_flags); > + struct page *page = __rmqueue(zone, order, gfp_flags, > + migratetype, alloc_flags); > if (unlikely(page == NULL)) > break; > > @@ -2714,8 +2736,8 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z, > > static __always_inline > struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone, > - unsigned int order, unsigned int alloc_flags, > - int migratetype) > + unsigned int order, gfp_t gfp_flags, > + unsigned int alloc_flags, int migratetype) > { > struct page *page; > unsigned long flags; > @@ -2726,7 +2748,8 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone, > if (alloc_flags & ALLOC_HIGHATOMIC) > page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC); > if (!page) { > - page = __rmqueue(zone, order, migratetype, alloc_flags); > + page = __rmqueue(zone, order, migratetype, > + gfp_flags, alloc_flags); > > /* > * If the allocation fails, allow OOM handling access > @@ -2806,10 +2829,10 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order) > /* Remove page from the per-cpu list, caller must protect the list */ > static inline > struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order, > - int migratetype, > - unsigned int alloc_flags, > - struct per_cpu_pages *pcp, > - struct list_head *list) > + gfp_t gfp_flags, int migratetype, > + unsigned int alloc_flags, > + struct per_cpu_pages *pcp, > + struct list_head *list) > { > struct page *page; > > @@ -2818,7 +2841,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order, > int batch = nr_pcp_alloc(pcp, zone, order); > int alloced; > > - alloced = rmqueue_bulk(zone, order, > + alloced = rmqueue_bulk(zone, order, gfp_flags, > batch, list, > migratetype, alloc_flags); > > @@ -2837,8 +2860,9 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order, > > /* Lock and remove page from the per-cpu list */ > static struct page *rmqueue_pcplist(struct zone *preferred_zone, > - struct zone *zone, unsigned int order, > - int migratetype, unsigned int alloc_flags) > + struct zone *zone, unsigned int order, > + gfp_t gfp_flags, int migratetype, > + unsigned int alloc_flags) > { > struct per_cpu_pages *pcp; > struct list_head *list; > @@ -2860,7 +2884,8 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone, > */ > pcp->free_count 
>>= 1; > list = &pcp->lists[order_to_pindex(migratetype, order)]; > - page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list); > + page = __rmqueue_pcplist(zone, order, gfp_flags, migratetype, > + alloc_flags, pcp, list); > pcp_spin_unlock(pcp); > pcp_trylock_finish(UP_flags); > if (page) { > @@ -2898,13 +2923,13 @@ struct page *rmqueue(struct zone *preferred_zone, > > if (likely(pcp_allowed_order(order))) { > page = rmqueue_pcplist(preferred_zone, zone, order, > - migratetype, alloc_flags); > + gfp_flags, migratetype, alloc_flags); > if (likely(page)) > goto out; > } > > - page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags, > - migratetype); > + page = rmqueue_buddy(preferred_zone, zone, order, gfp_flags, > + alloc_flags, migratetype); > > out: > /* Separate test+clear to avoid unnecessary atomics */ > @@ -4480,8 +4505,8 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid, > continue; > } > > - page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags, > - pcp, pcp_list); > + page = __rmqueue_pcplist(zone, 0, gfp, ac.migratetype, > + alloc_flags, pcp, pcp_list); > if (unlikely(!page)) { > /* Try and allocate at least one page */ > if (!nr_account) { > -- > 2.42.0 >
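For reference, a self-contained user-space model of the simplification Zhaoyang floats in the reply above. The struct fields are stand-ins for the zone counters the kernel would read, and which lowmem_reserve entry applies would be guessed from the calling context (current_is_kswapd() || !current->mm treated as GFP_KERNEL). The final comparison is written the way Johannes's should_try_cma() does it, i.e. free CMA against half of the discounted free pages; the quoted snippet appears to have the operands the other way around:

#include <stdio.h>
#include <stdbool.h>

struct zone_counters {
	long nr_free;		/* NR_FREE_PAGES              */
	long nr_free_cma;	/* NR_FREE_CMA_PAGES          */
	long lowmem_reserve;	/* lowmem_reserve[gfp_zone()] */
	long wmark;		/* wmark_pages(zone, ...)     */
};

static bool cma_first(const struct zone_counters *z)
{
	long free_pages = z->nr_free - z->lowmem_reserve - z->wmark;

	return z->nr_free_cma > free_pages / 2;
}

int main(void)
{
	/* illustrative numbers in pages, not from a real system */
	struct zone_counters z = { 800000, 350000, 60000, 500000 };

	printf("use CMA first: %d\n", cma_first(&z));	/* 350000 > 120000 -> 1 */
	return 0;
}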
On Wed, 8 Nov 2023 16:55:19 +0800 Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:

> > > +static bool use_cma_first(struct zone *zone, unsigned int order, unsigned int alloc_flags)
> > > +{
> > > +	unsigned long watermark;
> > > +	bool cma_first = false;
> > > +
> > > +	watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
> > > +	/* check if GFP_MOVABLE pass previous zone_watermark_ok via the help of CMA */
> > > +	if (zone_watermark_ok(zone, order, watermark, 0, alloc_flags & (~ALLOC_CMA))) {
> > > +		/*
> > > +		 * Balance movable allocations between regular and CMA areas by
> > > +		 * allocating from CMA when over half of the zone's free memory
> > > +		 * is in the CMA area.
> > > +		 */
> ok, thanks for point out.
> Could we simple it like this, which will mis-judge kmalloc within
> ioctl as GFP_USER. I think it is ok as it is rare
> if (current_is_kswapd() || !current->mm)
> 	gfp_flags = GFP_KERNEL;
> else
> 	gfp_flags = GFP_USER;
> free_pages = zone_page_state(zone, NR_FREE_PAGES);
> free_pages -= zone->lowmem_reserve[gfp_zone(gfp_flags)];
> free_pages -= wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
> cma_first = free_pages > zone_page_state(zone, NR_FREE_PAGES) / 2);
>

This went all quiet.  Do we feel that "mm: optimization on page
allocation when CMA enabled" should be merged as-is, or dropped in the
expectation that something based on Johannes's suggestion will be
developed?
On Sat, Dec 30, 2023 at 3:40 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 8 Nov 2023 16:55:19 +0800 Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
> >
> > > > +static bool use_cma_first(struct zone *zone, unsigned int order, unsigned int alloc_flags)
> > > > +{
> > > > +	unsigned long watermark;
> > > > +	bool cma_first = false;
> > > > +
> > > > +	watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
> > > > +	/* check if GFP_MOVABLE pass previous zone_watermark_ok via the help of CMA */
> > > > +	if (zone_watermark_ok(zone, order, watermark, 0, alloc_flags & (~ALLOC_CMA))) {
> > > > +		/*
> > > > +		 * Balance movable allocations between regular and CMA areas by
> > > > +		 * allocating from CMA when over half of the zone's free memory
> > > > +		 * is in the CMA area.
> > > > +		 */
> > ok, thanks for point out.
> > Could we simple it like this, which will mis-judge kmalloc within
> > ioctl as GFP_USER. I think it is ok as it is rare
> > if (current_is_kswapd() || !current->mm)
> > 	gfp_flags = GFP_KERNEL;
> > else
> > 	gfp_flags = GFP_USER;
> > free_pages = zone_page_state(zone, NR_FREE_PAGES);
> > free_pages -= zone->lowmem_reserve[gfp_zone(gfp_flags)];
> > free_pages -= wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
> > cma_first = free_pages > zone_page_state(zone, NR_FREE_PAGES) / 2);
> >
>
> This went all quiet.  Do we feel that "mm: optimization on page
> allocation when CMA enabled" should be merged as-is, or dropped in the
> expectation that something based on Johannes's suggestion will be
> developed?
I just establish a v6.6 environment and will provide comparison
results with and without the patch
>
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 452459836b71..5a146aa7c0aa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2078,6 +2078,43 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
 
 }
 
+#ifdef CONFIG_CMA
+/*
+ * GFP_MOVABLE allocation could drain UNMOVABLE & RECLAIMABLE page blocks via
+ * the help of CMA which makes GFP_KERNEL failed. Checking if zone_watermark_ok
+ * again without ALLOC_CMA to see if to use CMA first.
+ */
+static bool use_cma_first(struct zone *zone, unsigned int order, unsigned int alloc_flags)
+{
+	unsigned long watermark;
+	bool cma_first = false;
+
+	watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
+	/* check if GFP_MOVABLE pass previous zone_watermark_ok via the help of CMA */
+	if (zone_watermark_ok(zone, order, watermark, 0, alloc_flags & (~ALLOC_CMA))) {
+		/*
+		 * Balance movable allocations between regular and CMA areas by
+		 * allocating from CMA when over half of the zone's free memory
+		 * is in the CMA area.
+		 */
+		cma_first = (zone_page_state(zone, NR_FREE_CMA_PAGES) >
+				zone_page_state(zone, NR_FREE_PAGES) / 2);
+	} else {
+		/*
+		 * watermark failed means UNMOVABLE & RECLAIMBLE is not enough
+		 * now, we should use cma first to keep them stay around the
+		 * corresponding watermark
+		 */
+		cma_first = true;
+	}
+	return cma_first;
+}
+#else
+static bool use_cma_first(struct zone *zone, unsigned int order, unsigned int alloc_flags)
+{
+	return false;
+}
+#endif
 /*
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
@@ -2091,12 +2128,11 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
 	if (IS_ENABLED(CONFIG_CMA)) {
 		/*
 		 * Balance movable allocations between regular and CMA areas by
-		 * allocating from CMA when over half of the zone's free memory
-		 * is in the CMA area.
+		 * allocating from CMA base on judging zone_watermark_ok again
+		 * to see if the latest check got pass via the help of CMA
 		 */
 		if (alloc_flags & ALLOC_CMA &&
-			zone_page_state(zone, NR_FREE_CMA_PAGES) >
-			zone_page_state(zone, NR_FREE_PAGES) / 2) {
+			use_cma_first(zone, order, alloc_flags)) {
 			page = __rmqueue_cma_fallback(zone, order);
 			if (page)
 				return page;
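The comparison runs mentioned in the thread can be sanity-checked from user space by watching the two counters this heuristic keys off. A minimal reader, assuming a CONFIG_CMA kernel that exports nr_free_pages and nr_free_cma in /proc/vmstat (the heuristic itself is per zone and /proc/zoneinfo has the per-zone values; the global counters are enough to see the overall trend):

/* Sample /proc/vmstat and print the free/CMA-free balance. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/proc/vmstat", "r");
	char name[64];
	long long val, free_pages = -1, free_cma = -1;

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}
	while (fscanf(f, "%63s %lld", name, &val) == 2) {
		if (!strcmp(name, "nr_free_pages"))
			free_pages = val;
		else if (!strcmp(name, "nr_free_cma"))
			free_cma = val;
	}
	fclose(f);

	if (free_pages < 0 || free_cma < 0) {
		fprintf(stderr, "counters not found (CONFIG_CMA?)\n");
		return 1;
	}
	printf("nr_free_pages=%lld nr_free_cma=%lld cma share=%.1f%%\n",
	       free_pages, free_cma, 100.0 * free_cma / free_pages);
	return 0;
}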