[PATCHv6,1/1] mm: optimization on page allocation when CMA enabled

Message ID 20231016071245.2865233-1-zhaoyang.huang@unisoc.com
State New
Headers
Series [PATCHv6,1/1] mm: optimization on page allocation when CMA enabled |

Commit Message

zhaoyang.huang Oct. 16, 2023, 7:12 a.m. UTC
  From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>

According to current CMA utilization policy, an alloc_pages(GFP_USER)
could 'steal' UNMOVABLE & RECLAIMABLE page blocks via the help of
CMA(pass zone_watermark_ok by counting CMA in but use U&R in rmqueue),
which could lead to following alloc_pages(GFP_KERNEL) fail.
Solving this by introducing second watermark checking for GFP_MOVABLE,
which could have the allocation use CMA when proper.

-- Free_pages(30MB)
|
|
-- WMARK_LOW(25MB)
|
-- Free_CMA(12MB)
|
|
--

Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
---
v6: update comments
---
---
 mm/page_alloc.c | 44 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 40 insertions(+), 4 deletions(-)
  

Comments

Andrew Morton Oct. 16, 2023, 10:39 p.m. UTC | #1
On Mon, 16 Oct 2023 15:12:45 +0800 "zhaoyang.huang" <zhaoyang.huang@unisoc.com> wrote:

> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> 
> According to current CMA utilization policy, an alloc_pages(GFP_USER)
> could 'steal' UNMOVABLE & RECLAIMABLE page blocks via the help of
> CMA(pass zone_watermark_ok by counting CMA in but use U&R in rmqueue),
> which could lead to following alloc_pages(GFP_KERNEL) fail.
> Solving this by introducing second watermark checking for GFP_MOVABLE,
> which could have the allocation use CMA when proper.
> 
> -- Free_pages(30MB)
> |
> |
> -- WMARK_LOW(25MB)
> |
> -- Free_CMA(12MB)
> |
> |
> --
> 
> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> ---
> v6: update comments

The patch itself is identical to the v5 patch.  So either you meant
"update changelog" above or you sent the wrong diff?

Also, have we resolved any concerns regarding possible performance
impacts of this change?
  
Zhaoyang Huang Oct. 17, 2023, 2:32 a.m. UTC | #2
On Tue, Oct 17, 2023 at 6:40 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 16 Oct 2023 15:12:45 +0800 "zhaoyang.huang" <zhaoyang.huang@unisoc.com> wrote:
>
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >
> > According to current CMA utilization policy, an alloc_pages(GFP_USER)
> > could 'steal' UNMOVABLE & RECLAIMABLE page blocks via the help of
> > CMA(pass zone_watermark_ok by counting CMA in but use U&R in rmqueue),
> > which could lead to following alloc_pages(GFP_KERNEL) fail.
> > Solving this by introducing second watermark checking for GFP_MOVABLE,
> > which could have the allocation use CMA when proper.
> >
> > -- Free_pages(30MB)
> > |
> > |
> > -- WMARK_LOW(25MB)
> > |
> > -- Free_CMA(12MB)
> > |
> > |
> > --
> >
> > Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > ---
> > v6: update comments
>
> The patch itself is identical to the v5 patch.  So either you meant
> "update changelog" above or you sent the wrong diff?
sorry, should be update changelog
>
> Also, have we resolved any concerns regarding possible performance
> impacts of this change?
I don't think this commit could introduce performance impact as it
actually adds one more path for using CMA page blocks in advance.

 __rmqueue(struct zone *zone, unsigned int order, int migratetype,
        if (IS_ENABLED(CONFIG_CMA)) {
                if (alloc_flags & ALLOC_CMA &&
-                   zone_page_state(zone, NR_FREE_CMA_PAGES) >
-                   zone_page_state(zone, NR_FREE_PAGES) / 2) {
+                       use_cma_first(zone, order, alloc_flags)) {
          //current '1/2' logic is kept while add a path for using CMA
in advance than now.
                        page = __rmqueue_cma_fallback(zone, order);
                        if (page)
                                return page;
                }
        }
retry:
                               //normal rmqueue_smallest path is not
affected which could deemed as a fallback path for
__rmqueue_cma_fallback failure
        page = __rmqueue_smallest(zone, order, migratetype);
        if (unlikely(!page)) {
                if (alloc_flags & ALLOC_CMA)
                        page = __rmqueue_cma_fallback(zone, order);

                if (!page && __rmqueue_fallback(zone, order, migratetype,
                                                                alloc_flags))
                        goto retry;
        }
        return page;
}
  
Johannes Weiner Nov. 7, 2023, 5:28 p.m. UTC | #3
On Mon, Oct 16, 2023 at 03:12:45PM +0800, zhaoyang.huang wrote:
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> 
> According to current CMA utilization policy, an alloc_pages(GFP_USER)
> could 'steal' UNMOVABLE & RECLAIMABLE page blocks via the help of
> CMA(pass zone_watermark_ok by counting CMA in but use U&R in rmqueue),
> which could lead to following alloc_pages(GFP_KERNEL) fail.
> Solving this by introducing second watermark checking for GFP_MOVABLE,
> which could have the allocation use CMA when proper.
> 
> -- Free_pages(30MB)
> |
> |
> -- WMARK_LOW(25MB)
> |
> -- Free_CMA(12MB)
> |
> |
> --

We're running into the same issue in production and had an incident
over the weekend because of it. The hosts have a raised
vm.min_free_kbytes for network rx reliability, which makes the
mismatch between free pages and what's actually allocatable by regular
kernel requests quite pronounced. It wasn't OOMing this time, but we
saw very high rates of thrashing while CMA had plenty of headroom.

I had raised the broader issue around poor CMA utilization before:
https://lore.kernel.org/lkml/20230726145304.1319046-1-hannes@cmpxchg.org/

For context, we're using hugetlb_cma at several gigabytes to allow
sharing hosts between jobs that use hugetlb and jobs that don't.

> @@ -2078,6 +2078,43 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
>  
>  }
>  
> +#ifdef CONFIG_CMA
> +/*
> + * GFP_MOVABLE allocation could drain UNMOVABLE & RECLAIMABLE page blocks via
> + * the help of CMA which makes GFP_KERNEL failed. Checking if zone_watermark_ok
> + * again without ALLOC_CMA to see if to use CMA first.
> + */
> +static bool use_cma_first(struct zone *zone, unsigned int order, unsigned int alloc_flags)
> +{
> +	unsigned long watermark;
> +	bool cma_first = false;
> +
> +	watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
> +	/* check if GFP_MOVABLE pass previous zone_watermark_ok via the help of CMA */
> +	if (zone_watermark_ok(zone, order, watermark, 0, alloc_flags & (~ALLOC_CMA))) {
> +		/*
> +		 * Balance movable allocations between regular and CMA areas by
> +		 * allocating from CMA when over half of the zone's free memory
> +		 * is in the CMA area.
> +		 */
> +		cma_first = (zone_page_state(zone, NR_FREE_CMA_PAGES) >
> +				zone_page_state(zone, NR_FREE_PAGES) / 2);
> +	} else {
> +		/*
> +		 * watermark failed means UNMOVABLE & RECLAIMBLE is not enough
> +		 * now, we should use cma first to keep them stay around the
> +		 * corresponding watermark
> +		 */
> +		cma_first = true;
> +	}
> +	return cma_first;

I think it's a step in the right direction. However, it doesn't take
the lowmem reserves into account. With DMA32 that can be an additional
multiple gigabytes of "free" memory not available to GFP_KERNEL. It
also has a knee in the balancing curve because it doesn't take
reserves into account *until* non-CMA is depleted - at which point it
would already be below the use-CMA threshold by the full reserves and
watermarks.

A more complete solution would have to plumb the highest_zoneidx
information through the rmqueue family of functions somehow, and
always take unavailable free memory into account:

---
Subject: [PATCH] mm: page_alloc: use CMA when kernel allocations are beginning
 to fail

We can get into a situation where kernel allocations are starting to
fail on watermarks, but movable allocations still don't use CMA
because they make up more than half of the free memory. This can
happen in particular with elevated vm.min_free_kbytes settings, where
the remaining free pages aren't available to non-atomic requests.

Example scenario:

      Free: 3.0G
Watermarks: 2.0G
       CMA: 1.4G
-> non-CMA: 1.6G

CMA isn't used because CMA <= free/2. Kernel allocations fail due to
non-CMA < watermarks. If memory is mostly unreclaimable (e.g. anon
without swap), the kernel is more likely to OOM prematurely.

Reduce the probability of that happening by taking reserves and
watermarks into account when deciding whether to start using CMA.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 93 +++++++++++++++++++++++++++++++------------------
 1 file changed, 59 insertions(+), 34 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 733732e7e0ba..b9273d7f23b8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2079,30 +2079,52 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
 
 }
 
+static bool should_try_cma(struct zone *zone, unsigned int order,
+			   gfp_t gfp_flags, unsigned int alloc_flags)
+{
+	long free_pages;
+
+	if (!IS_ENABLED(CONFIG_CMA) || !(alloc_flags & ALLOC_CMA))
+		return false;
+
+	/*
+	 * CMA regions can be used by movable allocations while
+	 * they're not otherwise in use. This is a delicate balance:
+	 * Filling CMA too soon poses a latency risk for actual CMA
+	 * allocations (think camera app startup). Filling CMA too
+	 * late risks premature OOMs from non-movable allocations.
+	 *
+	 * Start using CMA once it dominates the remaining free
+	 * memory. Be sure to take watermarks and reserves into
+	 * account when considering what's truly "free".
+	 *
+	 * free_pages can go negative, but that's okay because
+	 * NR_FREE_CMA_PAGES should not.
+	 */
+
+	free_pages = zone_page_state(zone, NR_FREE_PAGES);
+	free_pages -= zone->lowmem_reserve[gfp_zone(gfp_flags)];
+	free_pages -= wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
+
+	return zone_page_state(zone, NR_FREE_CMA_PAGES) > free_pages / 2;
+}
+
 /*
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
  */
 static __always_inline struct page *
-__rmqueue(struct zone *zone, unsigned int order, int migratetype,
-						unsigned int alloc_flags)
+__rmqueue(struct zone *zone, unsigned int order, gfp_t gfp_flags,
+	  int migratetype, unsigned int alloc_flags)
 {
 	struct page *page;
 
-	if (IS_ENABLED(CONFIG_CMA)) {
-		/*
-		 * Balance movable allocations between regular and CMA areas by
-		 * allocating from CMA when over half of the zone's free memory
-		 * is in the CMA area.
-		 */
-		if (alloc_flags & ALLOC_CMA &&
-		    zone_page_state(zone, NR_FREE_CMA_PAGES) >
-		    zone_page_state(zone, NR_FREE_PAGES) / 2) {
-			page = __rmqueue_cma_fallback(zone, order);
-			if (page)
-				return page;
-		}
+	if (should_try_cma(zone, order, gfp_flags, alloc_flags)) {
+		page = __rmqueue_cma_fallback(zone, order);
+		if (page)
+			return page;
 	}
+
 retry:
 	page = __rmqueue_smallest(zone, order, migratetype);
 	if (unlikely(!page)) {
@@ -2121,7 +2143,7 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
  * a single hold of the lock, for efficiency.  Add them to the supplied list.
  * Returns the number of new pages which were placed at *list.
  */
-static int rmqueue_bulk(struct zone *zone, unsigned int order,
+static int rmqueue_bulk(struct zone *zone, unsigned int order, gfp_t gfp_flags,
 			unsigned long count, struct list_head *list,
 			int migratetype, unsigned int alloc_flags)
 {
@@ -2130,8 +2152,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 
 	spin_lock_irqsave(&zone->lock, flags);
 	for (i = 0; i < count; ++i) {
-		struct page *page = __rmqueue(zone, order, migratetype,
-								alloc_flags);
+		struct page *page = __rmqueue(zone, order, gfp_flags,
+					      migratetype, alloc_flags);
 		if (unlikely(page == NULL))
 			break;
 
@@ -2714,8 +2736,8 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
 
 static __always_inline
 struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
-			   unsigned int order, unsigned int alloc_flags,
-			   int migratetype)
+			   unsigned int order, gfp_t gfp_flags,
+			   unsigned int alloc_flags, int migratetype)
 {
 	struct page *page;
 	unsigned long flags;
@@ -2726,7 +2748,8 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 		if (alloc_flags & ALLOC_HIGHATOMIC)
 			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
 		if (!page) {
-			page = __rmqueue(zone, order, migratetype, alloc_flags);
+			page = __rmqueue(zone, order, migratetype,
+					 gfp_flags, alloc_flags);
 
 			/*
 			 * If the allocation fails, allow OOM handling access
@@ -2806,10 +2829,10 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
 /* Remove page from the per-cpu list, caller must protect the list */
 static inline
 struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
-			int migratetype,
-			unsigned int alloc_flags,
-			struct per_cpu_pages *pcp,
-			struct list_head *list)
+			       gfp_t gfp_flags, int migratetype,
+			       unsigned int alloc_flags,
+			       struct per_cpu_pages *pcp,
+			       struct list_head *list)
 {
 	struct page *page;
 
@@ -2818,7 +2841,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 			int batch = nr_pcp_alloc(pcp, zone, order);
 			int alloced;
 
-			alloced = rmqueue_bulk(zone, order,
+			alloced = rmqueue_bulk(zone, order, gfp_flags,
 					batch, list,
 					migratetype, alloc_flags);
 
@@ -2837,8 +2860,9 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 
 /* Lock and remove page from the per-cpu list */
 static struct page *rmqueue_pcplist(struct zone *preferred_zone,
-			struct zone *zone, unsigned int order,
-			int migratetype, unsigned int alloc_flags)
+				    struct zone *zone, unsigned int order,
+				    gfp_t gfp_flags, int migratetype,
+				    unsigned int alloc_flags)
 {
 	struct per_cpu_pages *pcp;
 	struct list_head *list;
@@ -2860,7 +2884,8 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	 */
 	pcp->free_count >>= 1;
 	list = &pcp->lists[order_to_pindex(migratetype, order)];
-	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
+	page = __rmqueue_pcplist(zone, order, gfp_flags, migratetype,
+				 alloc_flags, pcp, list);
 	pcp_spin_unlock(pcp);
 	pcp_trylock_finish(UP_flags);
 	if (page) {
@@ -2898,13 +2923,13 @@ struct page *rmqueue(struct zone *preferred_zone,
 
 	if (likely(pcp_allowed_order(order))) {
 		page = rmqueue_pcplist(preferred_zone, zone, order,
-				       migratetype, alloc_flags);
+				       gfp_flags, migratetype, alloc_flags);
 		if (likely(page))
 			goto out;
 	}
 
-	page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
-							migratetype);
+	page = rmqueue_buddy(preferred_zone, zone, order, gfp_flags,
+			     alloc_flags, migratetype);
 
 out:
 	/* Separate test+clear to avoid unnecessary atomics */
@@ -4480,8 +4505,8 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 			continue;
 		}
 
-		page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags,
-								pcp, pcp_list);
+		page = __rmqueue_pcplist(zone, 0, gfp, ac.migratetype,
+					 alloc_flags, pcp, pcp_list);
 		if (unlikely(!page)) {
 			/* Try and allocate at least one page */
 			if (!nr_account) {
  
Zhaoyang Huang Nov. 8, 2023, 8:55 a.m. UTC | #4
On Wed, Nov 8, 2023 at 1:28 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Oct 16, 2023 at 03:12:45PM +0800, zhaoyang.huang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >
> > According to current CMA utilization policy, an alloc_pages(GFP_USER)
> > could 'steal' UNMOVABLE & RECLAIMABLE page blocks via the help of
> > CMA(pass zone_watermark_ok by counting CMA in but use U&R in rmqueue),
> > which could lead to following alloc_pages(GFP_KERNEL) fail.
> > Solving this by introducing second watermark checking for GFP_MOVABLE,
> > which could have the allocation use CMA when proper.
> >
> > -- Free_pages(30MB)
> > |
> > |
> > -- WMARK_LOW(25MB)
> > |
> > -- Free_CMA(12MB)
> > |
> > |
> > --
>
> We're running into the same issue in production and had an incident
> over the weekend because of it. The hosts have a raised
> vm.min_free_kbytes for network rx reliability, which makes the
> mismatch between free pages and what's actually allocatable by regular
> kernel requests quite pronounced. It wasn't OOMing this time, but we
> saw very high rates of thrashing while CMA had plenty of headroom.
>
> I had raised the broader issue around poor CMA utilization before:
> https://lore.kernel.org/lkml/20230726145304.1319046-1-hannes@cmpxchg.org/
>
> For context, we're using hugetlb_cma at several gigabytes to allow
> sharing hosts between jobs that use hugetlb and jobs that don't.
>
> > @@ -2078,6 +2078,43 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
> >
> >  }
> >
> > +#ifdef CONFIG_CMA
> > +/*
> > + * GFP_MOVABLE allocation could drain UNMOVABLE & RECLAIMABLE page blocks via
> > + * the help of CMA which makes GFP_KERNEL failed. Checking if zone_watermark_ok
> > + * again without ALLOC_CMA to see if to use CMA first.
> > + */
> > +static bool use_cma_first(struct zone *zone, unsigned int order, unsigned int alloc_flags)
> > +{
> > +     unsigned long watermark;
> > +     bool cma_first = false;
> > +
> > +     watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
> > +     /* check if GFP_MOVABLE pass previous zone_watermark_ok via the help of CMA */
> > +     if (zone_watermark_ok(zone, order, watermark, 0, alloc_flags & (~ALLOC_CMA))) {
> > +             /*
> > +              * Balance movable allocations between regular and CMA areas by
> > +              * allocating from CMA when over half of the zone's free memory
> > +              * is in the CMA area.
> > +              */
ok, thanks for point out.
Could we simple it like this, which will mis-judge kmalloc within
ioctl as GFP_USER. I think it is ok as it is rare
             if (current_is_kswapd() || !current->mm)
                 gfp_flags = GFP_KERNEL;
             else
                 gfp_flags = GFP_USER;
            free_pages = zone_page_state(zone, NR_FREE_PAGES);
            free_pages -= zone->lowmem_reserve[gfp_zone(gfp_flags)];
            free_pages -= wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
            cma_first = free_pages > zone_page_state(zone, NR_FREE_PAGES) / 2);

> > +             cma_first = (zone_page_state(zone, NR_FREE_CMA_PAGES) >
> > +                             zone_page_state(zone, NR_FREE_PAGES) / 2);
> > +     } else {
> > +             /*
> > +              * watermark failed means UNMOVABLE & RECLAIMBLE is not enough
> > +              * now, we should use cma first to keep them stay around the
> > +              * corresponding watermark
> > +              */
> > +             cma_first = true;
> > +     }
> > +     return cma_first;
>
> I think it's a step in the right direction. However, it doesn't take
> the lowmem reserves into account. With DMA32 that can be an additional
> multiple gigabytes of "free" memory not available to GFP_KERNEL. It
> also has a knee in the balancing curve because it doesn't take
> reserves into account *until* non-CMA is depleted - at which point it
> would already be below the use-CMA threshold by the full reserves and
> watermarks.
>
> A more complete solution would have to plumb the highest_zoneidx
> information through the rmqueue family of functions somehow, and
> always take unavailable free memory into account:
>
> ---
> Subject: [PATCH] mm: page_alloc: use CMA when kernel allocations are beginning
>  to fail
>
> We can get into a situation where kernel allocations are starting to
> fail on watermarks, but movable allocations still don't use CMA
> because they make up more than half of the free memory. This can
> happen in particular with elevated vm.min_free_kbytes settings, where
> the remaining free pages aren't available to non-atomic requests.
>
> Example scenario:
>
>       Free: 3.0G
> Watermarks: 2.0G
>        CMA: 1.4G
> -> non-CMA: 1.6G
>
> CMA isn't used because CMA <= free/2. Kernel allocations fail due to
> non-CMA < watermarks. If memory is mostly unreclaimable (e.g. anon
> without swap), the kernel is more likely to OOM prematurely.
>
> Reduce the probability of that happening by taking reserves and
> watermarks into account when deciding whether to start using CMA.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/page_alloc.c | 93 +++++++++++++++++++++++++++++++------------------
>  1 file changed, 59 insertions(+), 34 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 733732e7e0ba..b9273d7f23b8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2079,30 +2079,52 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
>
>  }
>
> +static bool should_try_cma(struct zone *zone, unsigned int order,
> +                          gfp_t gfp_flags, unsigned int alloc_flags)
> +{
> +       long free_pages;
> +
> +       if (!IS_ENABLED(CONFIG_CMA) || !(alloc_flags & ALLOC_CMA))
> +               return false;
> +
> +       /*
> +        * CMA regions can be used by movable allocations while
> +        * they're not otherwise in use. This is a delicate balance:
> +        * Filling CMA too soon poses a latency risk for actual CMA
> +        * allocations (think camera app startup). Filling CMA too
> +        * late risks premature OOMs from non-movable allocations.
> +        *
> +        * Start using CMA once it dominates the remaining free
> +        * memory. Be sure to take watermarks and reserves into
> +        * account when considering what's truly "free".
> +        *
> +        * free_pages can go negative, but that's okay because
> +        * NR_FREE_CMA_PAGES should not.
> +        */
> +
> +       free_pages = zone_page_state(zone, NR_FREE_PAGES);
> +       free_pages -= zone->lowmem_reserve[gfp_zone(gfp_flags)];
> +       free_pages -= wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
> +
> +       return zone_page_state(zone, NR_FREE_CMA_PAGES) > free_pages / 2;
> +}
> +
>  /*
>   * Do the hard work of removing an element from the buddy allocator.
>   * Call me with the zone->lock already held.
>   */
>  static __always_inline struct page *
> -__rmqueue(struct zone *zone, unsigned int order, int migratetype,
> -                                               unsigned int alloc_flags)
> +__rmqueue(struct zone *zone, unsigned int order, gfp_t gfp_flags,
> +         int migratetype, unsigned int alloc_flags)
>  {
>         struct page *page;
>
> -       if (IS_ENABLED(CONFIG_CMA)) {
> -               /*
> -                * Balance movable allocations between regular and CMA areas by
> -                * allocating from CMA when over half of the zone's free memory
> -                * is in the CMA area.
> -                */
> -               if (alloc_flags & ALLOC_CMA &&
> -                   zone_page_state(zone, NR_FREE_CMA_PAGES) >
> -                   zone_page_state(zone, NR_FREE_PAGES) / 2) {
> -                       page = __rmqueue_cma_fallback(zone, order);
> -                       if (page)
> -                               return page;
> -               }
> +       if (should_try_cma(zone, order, gfp_flags, alloc_flags)) {
> +               page = __rmqueue_cma_fallback(zone, order);
> +               if (page)
> +                       return page;
>         }
> +
>  retry:
>         page = __rmqueue_smallest(zone, order, migratetype);
>         if (unlikely(!page)) {
> @@ -2121,7 +2143,7 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
>   * a single hold of the lock, for efficiency.  Add them to the supplied list.
>   * Returns the number of new pages which were placed at *list.
>   */
> -static int rmqueue_bulk(struct zone *zone, unsigned int order,
> +static int rmqueue_bulk(struct zone *zone, unsigned int order, gfp_t gfp_flags,
>                         unsigned long count, struct list_head *list,
>                         int migratetype, unsigned int alloc_flags)
>  {
> @@ -2130,8 +2152,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>
>         spin_lock_irqsave(&zone->lock, flags);
>         for (i = 0; i < count; ++i) {
> -               struct page *page = __rmqueue(zone, order, migratetype,
> -                                                               alloc_flags);
> +               struct page *page = __rmqueue(zone, order, gfp_flags,
> +                                             migratetype, alloc_flags);
>                 if (unlikely(page == NULL))
>                         break;
>
> @@ -2714,8 +2736,8 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
>
>  static __always_inline
>  struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
> -                          unsigned int order, unsigned int alloc_flags,
> -                          int migratetype)
> +                          unsigned int order, gfp_t gfp_flags,
> +                          unsigned int alloc_flags, int migratetype)
>  {
>         struct page *page;
>         unsigned long flags;
> @@ -2726,7 +2748,8 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
>                 if (alloc_flags & ALLOC_HIGHATOMIC)
>                         page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
>                 if (!page) {
> -                       page = __rmqueue(zone, order, migratetype, alloc_flags);
> +                       page = __rmqueue(zone, order, migratetype,
> +                                        gfp_flags, alloc_flags);
>
>                         /*
>                          * If the allocation fails, allow OOM handling access
> @@ -2806,10 +2829,10 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
>  /* Remove page from the per-cpu list, caller must protect the list */
>  static inline
>  struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
> -                       int migratetype,
> -                       unsigned int alloc_flags,
> -                       struct per_cpu_pages *pcp,
> -                       struct list_head *list)
> +                              gfp_t gfp_flags, int migratetype,
> +                              unsigned int alloc_flags,
> +                              struct per_cpu_pages *pcp,
> +                              struct list_head *list)
>  {
>         struct page *page;
>
> @@ -2818,7 +2841,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
>                         int batch = nr_pcp_alloc(pcp, zone, order);
>                         int alloced;
>
> -                       alloced = rmqueue_bulk(zone, order,
> +                       alloced = rmqueue_bulk(zone, order, gfp_flags,
>                                         batch, list,
>                                         migratetype, alloc_flags);
>
> @@ -2837,8 +2860,9 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
>
>  /* Lock and remove page from the per-cpu list */
>  static struct page *rmqueue_pcplist(struct zone *preferred_zone,
> -                       struct zone *zone, unsigned int order,
> -                       int migratetype, unsigned int alloc_flags)
> +                                   struct zone *zone, unsigned int order,
> +                                   gfp_t gfp_flags, int migratetype,
> +                                   unsigned int alloc_flags)
>  {
>         struct per_cpu_pages *pcp;
>         struct list_head *list;
> @@ -2860,7 +2884,8 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
>          */
>         pcp->free_count >>= 1;
>         list = &pcp->lists[order_to_pindex(migratetype, order)];
> -       page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
> +       page = __rmqueue_pcplist(zone, order, gfp_flags, migratetype,
> +                                alloc_flags, pcp, list);
>         pcp_spin_unlock(pcp);
>         pcp_trylock_finish(UP_flags);
>         if (page) {
> @@ -2898,13 +2923,13 @@ struct page *rmqueue(struct zone *preferred_zone,
>
>         if (likely(pcp_allowed_order(order))) {
>                 page = rmqueue_pcplist(preferred_zone, zone, order,
> -                                      migratetype, alloc_flags);
> +                                      gfp_flags, migratetype, alloc_flags);
>                 if (likely(page))
>                         goto out;
>         }
>
> -       page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
> -                                                       migratetype);
> +       page = rmqueue_buddy(preferred_zone, zone, order, gfp_flags,
> +                            alloc_flags, migratetype);
>
>  out:
>         /* Separate test+clear to avoid unnecessary atomics */
> @@ -4480,8 +4505,8 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>                         continue;
>                 }
>
> -               page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags,
> -                                                               pcp, pcp_list);
> +               page = __rmqueue_pcplist(zone, 0, gfp, ac.migratetype,
> +                                        alloc_flags, pcp, pcp_list);
>                 if (unlikely(!page)) {
>                         /* Try and allocate at least one page */
>                         if (!nr_account) {
> --
> 2.42.0

>
  
Andrew Morton Dec. 29, 2023, 7:40 p.m. UTC | #5
On Wed, 8 Nov 2023 16:55:19 +0800 Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:

> > > +static bool use_cma_first(struct zone *zone, unsigned int order, unsigned int alloc_flags)
> > > +{
> > > +     unsigned long watermark;
> > > +     bool cma_first = false;
> > > +
> > > +     watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
> > > +     /* check if GFP_MOVABLE pass previous zone_watermark_ok via the help of CMA */
> > > +     if (zone_watermark_ok(zone, order, watermark, 0, alloc_flags & (~ALLOC_CMA))) {
> > > +             /*
> > > +              * Balance movable allocations between regular and CMA areas by
> > > +              * allocating from CMA when over half of the zone's free memory
> > > +              * is in the CMA area.
> > > +              */
> ok, thanks for point out.
> Could we simple it like this, which will mis-judge kmalloc within
> ioctl as GFP_USER. I think it is ok as it is rare
>              if (current_is_kswapd() || !current->mm)
>                  gfp_flags = GFP_KERNEL;
>              else
>                  gfp_flags = GFP_USER;
>             free_pages = zone_page_state(zone, NR_FREE_PAGES);
>             free_pages -= zone->lowmem_reserve[gfp_zone(gfp_flags)];
>             free_pages -= wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
>             cma_first = free_pages > zone_page_state(zone, NR_FREE_PAGES) / 2);
> 

This went all quiet.  Do we feel that "mm: optimization on page
allocation when CMA enabled" should be merged as-is, or dropped in the
expectation that something based on Johannes's suggestion will be
developed?
  
Zhaoyang Huang Jan. 2, 2024, 5:50 a.m. UTC | #6
On Sat, Dec 30, 2023 at 3:40 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 8 Nov 2023 16:55:19 +0800 Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
>
> > > > +static bool use_cma_first(struct zone *zone, unsigned int order, unsigned int alloc_flags)
> > > > +{
> > > > +     unsigned long watermark;
> > > > +     bool cma_first = false;
> > > > +
> > > > +     watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
> > > > +     /* check if GFP_MOVABLE pass previous zone_watermark_ok via the help of CMA */
> > > > +     if (zone_watermark_ok(zone, order, watermark, 0, alloc_flags & (~ALLOC_CMA))) {
> > > > +             /*
> > > > +              * Balance movable allocations between regular and CMA areas by
> > > > +              * allocating from CMA when over half of the zone's free memory
> > > > +              * is in the CMA area.
> > > > +              */
> > ok, thanks for point out.
> > Could we simple it like this, which will mis-judge kmalloc within
> > ioctl as GFP_USER. I think it is ok as it is rare
> >              if (current_is_kswapd() || !current->mm)
> >                  gfp_flags = GFP_KERNEL;
> >              else
> >                  gfp_flags = GFP_USER;
> >             free_pages = zone_page_state(zone, NR_FREE_PAGES);
> >             free_pages -= zone->lowmem_reserve[gfp_zone(gfp_flags)];
> >             free_pages -= wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
> >             cma_first = free_pages > zone_page_state(zone, NR_FREE_PAGES) / 2);
> >
>
> This went all quiet.  Do we feel that "mm: optimization on page
> allocation when CMA enabled" should be merged as-is, or dropped in the
> expectation that something based on Johannes's suggestion will be
> developed?
I just establish a v6.6 environment and will provide comparison
results with and without the patch
>
  

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 452459836b71..5a146aa7c0aa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2078,6 +2078,43 @@  __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
 
 }
 
+#ifdef CONFIG_CMA
+/*
+ * GFP_MOVABLE allocation could drain UNMOVABLE & RECLAIMABLE page blocks via
+ * the help of CMA which makes GFP_KERNEL failed. Checking if zone_watermark_ok
+ * again without ALLOC_CMA to see if to use CMA first.
+ */
+static bool use_cma_first(struct zone *zone, unsigned int order, unsigned int alloc_flags)
+{
+	unsigned long watermark;
+	bool cma_first = false;
+
+	watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
+	/* check if GFP_MOVABLE pass previous zone_watermark_ok via the help of CMA */
+	if (zone_watermark_ok(zone, order, watermark, 0, alloc_flags & (~ALLOC_CMA))) {
+		/*
+		 * Balance movable allocations between regular and CMA areas by
+		 * allocating from CMA when over half of the zone's free memory
+		 * is in the CMA area.
+		 */
+		cma_first = (zone_page_state(zone, NR_FREE_CMA_PAGES) >
+				zone_page_state(zone, NR_FREE_PAGES) / 2);
+	} else {
+		/*
+		 * watermark failed means UNMOVABLE & RECLAIMBLE is not enough
+		 * now, we should use cma first to keep them stay around the
+		 * corresponding watermark
+		 */
+		cma_first = true;
+	}
+	return cma_first;
+}
+#else
+static bool use_cma_first(struct zone *zone, unsigned int order, unsigned int alloc_flags)
+{
+	return false;
+}
+#endif
 /*
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
@@ -2091,12 +2128,11 @@  __rmqueue(struct zone *zone, unsigned int order, int migratetype,
 	if (IS_ENABLED(CONFIG_CMA)) {
 		/*
 		 * Balance movable allocations between regular and CMA areas by
-		 * allocating from CMA when over half of the zone's free memory
-		 * is in the CMA area.
+		 * allocating from CMA base on judging zone_watermark_ok again
+		 * to see if the latest check got pass via the help of CMA
 		 */
 		if (alloc_flags & ALLOC_CMA &&
-		    zone_page_state(zone, NR_FREE_CMA_PAGES) >
-		    zone_page_state(zone, NR_FREE_PAGES) / 2) {
+			use_cma_first(zone, order, alloc_flags)) {
 			page = __rmqueue_cma_fallback(zone, order);
 			if (page)
 				return page;