[v3,2/3] hugetlb: memcg: account hugetlb-backed memory in memory controller

Message ID 20231003001828.2554080-3-nphamcs@gmail.com
State New
Headers
Series [v3,1/3] memcontrol: add helpers for hugetlb memcg accounting |

Commit Message

Nhat Pham Oct. 3, 2023, 12:18 a.m. UTC
  Currently, hugetlb memory usage is not acounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory. This has been observed in our production system.

For instance, here is one of our usecases: suppose there are two 32G
containers. The machine is booted with hugetlb_cma=6G, and each
container may or may not use up to 3 gigantic page, depending on the
workload within it. The rest is anon, cache, slab, etc. We can set the
hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness.
But it is very difficult to configure memory.max to keep overall
consumption, including anon, cache, slab etc. fair.

What we have had to resort to is to constantly poll hugetlb usage and
readjust memory.max. Similar procedure is done to other memory limits
(memory.low for e.g). However, this is rather cumbersome and buggy.
Furthermore, when there is a delay in memory limits correction, (for e.g
when hugetlb usage changes within consecutive runs of the userspace
agent), the system could be in an over/underprotected state.

This patch rectifies this issue by charging the memcg when the hugetlb
folio is utilized, and uncharging when the folio is freed (analogous to
the hugetlb controller). Note that we do not charge when the folio is
allocated to the hugetlb pool, because at this point it is not owned by
any memcg.

Some caveats to consider:
  * This feature is only available on cgroup v2.
  * There is no hugetlb pool management involved in the memory
    controller. As stated above, hugetlb folios are only charged towards
    the memory controller when it is used. Host overcommit management
    has to consider it when configuring hard limits.
  * Failure to charge towards the memcg results in SIGBUS. This could
    happen even if the hugetlb pool still has pages (but the cgroup
    limit is hit and reclaim attempt fails).
  * When this feature is enabled, hugetlb pages contribute to memory
    reclaim protection. low, min limits tuning must take into account
    hugetlb memory.
  * Hugetlb pages utilized while this option is not selected will not
    be tracked by the memory controller (even if cgroup v2 is remounted
    later on).

Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 29 ++++++++++++++++++++
 include/linux/cgroup-defs.h             |  5 ++++
 include/linux/memcontrol.h              |  9 +++++++
 kernel/cgroup/cgroup.c                  | 15 ++++++++++-
 mm/hugetlb.c                            | 35 ++++++++++++++++++++-----
 mm/memcontrol.c                         | 35 +++++++++++++++++++++++++
 6 files changed, 120 insertions(+), 8 deletions(-)
  

Comments

Nhat Pham Oct. 3, 2023, 12:26 a.m. UTC | #1
On Mon, Oct 2, 2023 at 5:18 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> Currently, hugetlb memory usage is not acounted for in the memory
> controller, which could lead to memory overprotection for cgroups with
> hugetlb-backed memory. This has been observed in our production system.
>
> For instance, here is one of our usecases: suppose there are two 32G
> containers. The machine is booted with hugetlb_cma=6G, and each
> container may or may not use up to 3 gigantic page, depending on the
> workload within it. The rest is anon, cache, slab, etc. We can set the
> hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness.
> But it is very difficult to configure memory.max to keep overall
> consumption, including anon, cache, slab etc. fair.
>
> What we have had to resort to is to constantly poll hugetlb usage and
> readjust memory.max. Similar procedure is done to other memory limits
> (memory.low for e.g). However, this is rather cumbersome and buggy.
> Furthermore, when there is a delay in memory limits correction, (for e.g
> when hugetlb usage changes within consecutive runs of the userspace
> agent), the system could be in an over/underprotected state.
>
> This patch rectifies this issue by charging the memcg when the hugetlb
> folio is utilized, and uncharging when the folio is freed (analogous to
> the hugetlb controller). Note that we do not charge when the folio is
> allocated to the hugetlb pool, because at this point it is not owned by
> any memcg.
>
> Some caveats to consider:
>   * This feature is only available on cgroup v2.
>   * There is no hugetlb pool management involved in the memory
>     controller. As stated above, hugetlb folios are only charged towards
>     the memory controller when it is used. Host overcommit management
>     has to consider it when configuring hard limits.
>   * Failure to charge towards the memcg results in SIGBUS. This could
>     happen even if the hugetlb pool still has pages (but the cgroup
>     limit is hit and reclaim attempt fails).
>   * When this feature is enabled, hugetlb pages contribute to memory
>     reclaim protection. low, min limits tuning must take into account
>     hugetlb memory.
>   * Hugetlb pages utilized while this option is not selected will not
>     be tracked by the memory controller (even if cgroup v2 is remounted
>     later on).
(special thanks to Michal Hocko, who pointed most of these out).
>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 29 ++++++++++++++++++++
>  include/linux/cgroup-defs.h             |  5 ++++
>  include/linux/memcontrol.h              |  9 +++++++
>  kernel/cgroup/cgroup.c                  | 15 ++++++++++-
>  mm/hugetlb.c                            | 35 ++++++++++++++++++++-----
>  mm/memcontrol.c                         | 35 +++++++++++++++++++++++++
>  6 files changed, 120 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 622a7f28db1f..606b2e0eac4b 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -210,6 +210,35 @@ cgroup v2 currently supports the following mount options.
>          relying on the original semantics (e.g. specifying bogusly
>          high 'bypass' protection values at higher tree levels).
>
> +  memory_hugetlb_accounting
> +        Count HugeTLB memory usage towards the cgroup's overall
> +        memory usage for the memory controller (for the purpose of
> +        statistics reporting and memory protetion). This is a new
> +        behavior that could regress existing setups, so it must be
> +        explicitly opted in with this mount option.
> +
> +        A few caveats to keep in mind:
> +
> +        * There is no HugeTLB pool management involved in the memory
> +          controller. The pre-allocated pool does not belong to anyone.
> +          Specifically, when a new HugeTLB folio is allocated to
> +          the pool, it is not accounted for from the perspective of the
> +          memory controller. It is only charged to a cgroup when it is
> +          actually used (for e.g at page fault time). Host memory
> +          overcommit management has to consider this when configuring
> +          hard limits. In general, HugeTLB pool management should be
> +          done via other mechanisms (such as the HugeTLB controller).
> +        * Failure to charge a HugeTLB folio to the memory controller
> +          results in SIGBUS. This could happen even if the HugeTLB pool
> +          still has pages available (but the cgroup limit is hit and
> +          reclaim attempt fails).
> +        * Charging HugeTLB memory towards the memory controller affects
> +          memory protection and reclaim dynamics. Any userspace tuning
> +          (of low, min limits for e.g) needs to take this into account.
> +        * HugeTLB pages utilized while this option is not selected
> +          will not be tracked by the memory controller (even if cgroup
> +          v2 is remounted later on).
> +
>
>  Organizing Processes and Threads
>  --------------------------------
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index f1b3151ac30b..8641f4320c98 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -115,6 +115,11 @@ enum {
>          * Enable recursive subtree protection
>          */
>         CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 18),
> +
> +       /*
> +        * Enable hugetlb accounting for the memory controller.
> +        */
> +        CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING = (1 << 19),
>  };
>
>  /* cftype->flags */
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 42bf7e9b1a2f..a827e2129790 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -679,6 +679,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
>         return __mem_cgroup_charge(folio, mm, gfp);
>  }
>
> +int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
> +               long nr_pages);
> +
>  int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
>                                   gfp_t gfp, swp_entry_t entry);
>  void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
> @@ -1262,6 +1265,12 @@ static inline int mem_cgroup_charge(struct folio *folio,
>         return 0;
>  }
>
> +static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
> +               gfp_t gfp, long nr_pages)
> +{
> +       return 0;
> +}
> +
>  static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
>                         struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
>  {
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 1fb7f562289d..f11488b18ceb 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1902,6 +1902,7 @@ enum cgroup2_param {
>         Opt_favordynmods,
>         Opt_memory_localevents,
>         Opt_memory_recursiveprot,
> +       Opt_memory_hugetlb_accounting,
>         nr__cgroup2_params
>  };
>
> @@ -1910,6 +1911,7 @@ static const struct fs_parameter_spec cgroup2_fs_parameters[] = {
>         fsparam_flag("favordynmods",            Opt_favordynmods),
>         fsparam_flag("memory_localevents",      Opt_memory_localevents),
>         fsparam_flag("memory_recursiveprot",    Opt_memory_recursiveprot),
> +       fsparam_flag("memory_hugetlb_accounting", Opt_memory_hugetlb_accounting),
>         {}
>  };
>
> @@ -1936,6 +1938,9 @@ static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param
>         case Opt_memory_recursiveprot:
>                 ctx->flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
>                 return 0;
> +       case Opt_memory_hugetlb_accounting:
> +               ctx->flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
> +               return 0;
>         }
>         return -EINVAL;
>  }
> @@ -1960,6 +1965,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
>                         cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
>                 else
>                         cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_RECURSIVE_PROT;
> +
> +               if (root_flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
> +                       cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
> +               else
> +                       cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
>         }
>  }
>
> @@ -1973,6 +1983,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
>                 seq_puts(seq, ",memory_localevents");
>         if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)
>                 seq_puts(seq, ",memory_recursiveprot");
> +       if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
> +               seq_puts(seq, ",memory_hugetlb_accounting");
>         return 0;
>  }
>
> @@ -7050,7 +7062,8 @@ static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr,
>                         "nsdelegate\n"
>                         "favordynmods\n"
>                         "memory_localevents\n"
> -                       "memory_recursiveprot\n");
> +                       "memory_recursiveprot\n"
> +                       "memory_hugetlb_accounting\n");
>  }
>  static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index de220e3ff8be..74472e911b0a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1902,6 +1902,7 @@ void free_huge_folio(struct folio *folio)
>                                      pages_per_huge_page(h), folio);
>         hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
>                                           pages_per_huge_page(h), folio);
> +       mem_cgroup_uncharge(folio);
>         if (restore_reserve)
>                 h->resv_huge_pages++;
>
> @@ -3009,11 +3010,20 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>         struct hugepage_subpool *spool = subpool_vma(vma);
>         struct hstate *h = hstate_vma(vma);
>         struct folio *folio;
> -       long map_chg, map_commit;
> +       long map_chg, map_commit, nr_pages = pages_per_huge_page(h);
>         long gbl_chg;
> -       int ret, idx;
> +       int memcg_charge_ret, ret, idx;
>         struct hugetlb_cgroup *h_cg = NULL;
> +       struct mem_cgroup *memcg;
>         bool deferred_reserve;
> +       gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
> +
> +       memcg = get_mem_cgroup_from_current();
> +       memcg_charge_ret = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages);
> +       if (memcg_charge_ret == -ENOMEM) {
> +               mem_cgroup_put(memcg);
> +               return ERR_PTR(-ENOMEM);
> +       }
>
>         idx = hstate_index(h);
>         /*
> @@ -3022,8 +3032,12 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>          * code of zero indicates a reservation exists (no change).
>          */
>         map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
> -       if (map_chg < 0)
> +       if (map_chg < 0) {
> +               if (!memcg_charge_ret)
> +                       mem_cgroup_cancel_charge(memcg, nr_pages);
> +               mem_cgroup_put(memcg);
>                 return ERR_PTR(-ENOMEM);
> +       }
>
>         /*
>          * Processes that did not create the mapping will have no
> @@ -3034,10 +3048,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>          */
>         if (map_chg || avoid_reserve) {
>                 gbl_chg = hugepage_subpool_get_pages(spool, 1);
> -               if (gbl_chg < 0) {
> -                       vma_end_reservation(h, vma, addr);
> -                       return ERR_PTR(-ENOSPC);
> -               }
> +               if (gbl_chg < 0)
> +                       goto out_end_reservation;
>
>                 /*
>                  * Even though there was no reservation in the region/reserve
> @@ -3119,6 +3131,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>                         hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
>                                         pages_per_huge_page(h), folio);
>         }
> +
> +       if (!memcg_charge_ret)
> +               mem_cgroup_commit_charge(folio, memcg);
> +       mem_cgroup_put(memcg);
> +
>         return folio;
>
>  out_uncharge_cgroup:
> @@ -3130,7 +3147,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  out_subpool_put:
>         if (map_chg || avoid_reserve)
>                 hugepage_subpool_put_pages(spool, 1);
> +out_end_reservation:
>         vma_end_reservation(h, vma, addr);
> +       if (!memcg_charge_ret)
> +               mem_cgroup_cancel_charge(memcg, nr_pages);
> +       mem_cgroup_put(memcg);
>         return ERR_PTR(-ENOSPC);
>  }
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 0219befeae38..6660684f6f97 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7085,6 +7085,41 @@ int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp)
>         return ret;
>  }
>
> +/**
> + * mem_cgroup_hugetlb_try_charge - try to charge the memcg for a hugetlb folio
> + * @memcg: memcg to charge.
> + * @gfp: reclaim mode.
> + * @nr_pages: number of pages to charge.
> + *
> + * This function is called when allocating a huge page folio to determine if
> + * the memcg has the capacity for it. It does not commit the charge yet,
> + * as the hugetlb folio itself has not been obtained from the hugetlb pool.
> + *
> + * Once we have obtained the hugetlb folio, we can call
> + * mem_cgroup_commit_charge() to commit the charge. If we fail to obtain the
> + * folio, we should instead call mem_cgroup_cancel_charge() to undo the effect
> + * of try_charge().
> + *
> + * Returns 0 on success. Otherwise, an error code is returned.
> + */
> +int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
> +                       long nr_pages)
> +{
> +       /*
> +        * If hugetlb memcg charging is not enabled, do not fail hugetlb allocation,
> +        * but do not attempt to commit charge later (or cancel on error) either.
> +        */
> +       if (mem_cgroup_disabled() || !memcg ||
> +               !cgroup_subsys_on_dfl(memory_cgrp_subsys) ||
> +               !(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING))
> +               return -EOPNOTSUPP;
> +
> +       if (try_charge(memcg, gfp, nr_pages))
> +               return -ENOMEM;
> +
> +       return 0;
> +}
> +
>  /**
>   * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin.
>   * @folio: folio to charge.
> --
> 2.34.1
  
Johannes Weiner Oct. 3, 2023, 12:54 p.m. UTC | #2
On Mon, Oct 02, 2023 at 05:18:27PM -0700, Nhat Pham wrote:
> Currently, hugetlb memory usage is not acounted for in the memory
> controller, which could lead to memory overprotection for cgroups with
> hugetlb-backed memory. This has been observed in our production system.
> 
> For instance, here is one of our usecases: suppose there are two 32G
> containers. The machine is booted with hugetlb_cma=6G, and each
> container may or may not use up to 3 gigantic page, depending on the
> workload within it. The rest is anon, cache, slab, etc. We can set the
> hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness.
> But it is very difficult to configure memory.max to keep overall
> consumption, including anon, cache, slab etc. fair.
> 
> What we have had to resort to is to constantly poll hugetlb usage and
> readjust memory.max. Similar procedure is done to other memory limits
> (memory.low for e.g). However, this is rather cumbersome and buggy.
> Furthermore, when there is a delay in memory limits correction, (for e.g
> when hugetlb usage changes within consecutive runs of the userspace
> agent), the system could be in an over/underprotected state.
> 
> This patch rectifies this issue by charging the memcg when the hugetlb
> folio is utilized, and uncharging when the folio is freed (analogous to
> the hugetlb controller). Note that we do not charge when the folio is
> allocated to the hugetlb pool, because at this point it is not owned by
> any memcg.
> 
> Some caveats to consider:
>   * This feature is only available on cgroup v2.
>   * There is no hugetlb pool management involved in the memory
>     controller. As stated above, hugetlb folios are only charged towards
>     the memory controller when it is used. Host overcommit management
>     has to consider it when configuring hard limits.
>   * Failure to charge towards the memcg results in SIGBUS. This could
>     happen even if the hugetlb pool still has pages (but the cgroup
>     limit is hit and reclaim attempt fails).
>   * When this feature is enabled, hugetlb pages contribute to memory
>     reclaim protection. low, min limits tuning must take into account
>     hugetlb memory.
>   * Hugetlb pages utilized while this option is not selected will not
>     be tracked by the memory controller (even if cgroup v2 is remounted
>     later on).
> 
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
  
Michal Hocko Oct. 3, 2023, 12:58 p.m. UTC | #3
On Mon 02-10-23 17:18:27, Nhat Pham wrote:
> Currently, hugetlb memory usage is not acounted for in the memory
> controller, which could lead to memory overprotection for cgroups with
> hugetlb-backed memory. This has been observed in our production system.
> 
> For instance, here is one of our usecases: suppose there are two 32G
> containers. The machine is booted with hugetlb_cma=6G, and each
> container may or may not use up to 3 gigantic page, depending on the
> workload within it. The rest is anon, cache, slab, etc. We can set the
> hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness.
> But it is very difficult to configure memory.max to keep overall
> consumption, including anon, cache, slab etc. fair.
> 
> What we have had to resort to is to constantly poll hugetlb usage and
> readjust memory.max. Similar procedure is done to other memory limits
> (memory.low for e.g). However, this is rather cumbersome and buggy.

Could you expand some more on how this _helps_ memory.low? The
hugetlb memory is not reclaimable so whatever portion of its memcg
consumption will be "protected from the reclaim". Consider this
	      parent
	/		\
       A		 B
       low=50%		 low=0
       current=40%	 current=60%

We have an external memory pressure and the reclaim should prefer B as A
is under its low limit, correct? But now consider that the predominant
consumption of B is hugetlb which would mean the memory reclaim cannot
do much for B and so the A's protection might be breached.

As an admin (or a tool) you need to know about hugetlb as a potential
contributor to this behavior (sure mlocked memory would behave the same
but mlock rarely consumes huge amount of memory in my experience).
Without the accounting there might not be any external pressure in the
first place. 

All that being said, I do not see how adding hugetlb into accounting
makes low, min limits management any easier.
  
Johannes Weiner Oct. 3, 2023, 3:59 p.m. UTC | #4
On Tue, Oct 03, 2023 at 02:58:58PM +0200, Michal Hocko wrote:
> On Mon 02-10-23 17:18:27, Nhat Pham wrote:
> > Currently, hugetlb memory usage is not acounted for in the memory
> > controller, which could lead to memory overprotection for cgroups with
> > hugetlb-backed memory. This has been observed in our production system.
> > 
> > For instance, here is one of our usecases: suppose there are two 32G
> > containers. The machine is booted with hugetlb_cma=6G, and each
> > container may or may not use up to 3 gigantic page, depending on the
> > workload within it. The rest is anon, cache, slab, etc. We can set the
> > hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness.
> > But it is very difficult to configure memory.max to keep overall
> > consumption, including anon, cache, slab etc. fair.
> > 
> > What we have had to resort to is to constantly poll hugetlb usage and
> > readjust memory.max. Similar procedure is done to other memory limits
> > (memory.low for e.g). However, this is rather cumbersome and buggy.
> 
> Could you expand some more on how this _helps_ memory.low? The
> hugetlb memory is not reclaimable so whatever portion of its memcg
> consumption will be "protected from the reclaim". Consider this
> 	      parent
> 	/		\
>        A		 B
>        low=50%		 low=0
>        current=40%	 current=60%
> 
> We have an external memory pressure and the reclaim should prefer B as A
> is under its low limit, correct? But now consider that the predominant
> consumption of B is hugetlb which would mean the memory reclaim cannot
> do much for B and so the A's protection might be breached.
> 
> As an admin (or a tool) you need to know about hugetlb as a potential
> contributor to this behavior (sure mlocked memory would behave the same
> but mlock rarely consumes huge amount of memory in my experience).
> Without the accounting there might not be any external pressure in the
> first place.
> 
> All that being said, I do not see how adding hugetlb into accounting
> makes low, min limits management any easier.

It's important to differentiate the cgroup usecases. One is of course
the cloud/virtual server scenario, where you set the hard limits to
whatever the customer paid for, and don't know and don't care about
the workload running inside. In that case, memory.low and overcommit
aren't really safe to begin with due to unknown unreclaimable mem.

The other common usecase is the datacenter where you run your own
applications. You understand their workingset and requirements, and
configure and overcommit the containers in a way where jobs always
meet their SLAs. E.g. if multiple containers spike, memory.low is set
such that interactive workloads are prioritized over batch jobs, and
both have priority over routine system management tasks.

This is arguably the only case where it's safe to use memory.low. You
have to know what's reclaimable and what isn't, otherwise you cannot
know that memory.low will even do anything, and isolation breaks down.
So we already have that knowledge: mlocked sections, how much anon is
without swap space, and how much memory must not be reclaimed (even if
it is reclaimable) for the workload to meet its SLAs. Hugetlb doesn't
really complicate this equation - we already have to consider it
unreclaimable workingset from an overcommit POV on those hosts.

The reason this patch helps in this scenario is that the service teams
are usually different from the containers/infra team. The service
understands their workload and declares its workingset. But it's the
infra team running the containers that currently has to go and find
out if they're using hugetlb and tweak the cgroups. Bugs and
untimeliness in the tweaking have caused multiple production incidents
already. And both teams are regularly confused when there are large
parts of the workload that don't show up in memory.current which both
sides monitor. Keep in mind that these systems are already pretty
complex, with multiple overcommitted containers and system-level
activity. The current hugetlb quirk can heavily distort what a given
container is doing on the host.

With this patch, the service can declare its workingset, the container
team can configure the container, and memory.current makes sense to
everybody. The workload parameters are pretty reliable, but if the
service team gets it wrong and we underprotect the workload, and/or
its unreclaimable memory exceeds what was declared, the infra team
gets alarms on elevated LOW breaching events and investigates if its
an infra problem or a service spec problem that needs escalation.

So the case you describe above only happens when mistakes are made,
and we detect and rectify them. In the common case, hugetlb is part of
the recognized workingset, and we configure memory.low to cut off only
known optional and reclaimable memory under pressure.
  
Mike Kravetz Oct. 3, 2023, 5:13 p.m. UTC | #5
On 10/02/23 17:18, Nhat Pham wrote:
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index de220e3ff8be..74472e911b0a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1902,6 +1902,7 @@ void free_huge_folio(struct folio *folio)
>  				     pages_per_huge_page(h), folio);
>  	hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
>  					  pages_per_huge_page(h), folio);
> +	mem_cgroup_uncharge(folio);
>  	if (restore_reserve)
>  		h->resv_huge_pages++;
>  
> @@ -3009,11 +3010,20 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  	struct hugepage_subpool *spool = subpool_vma(vma);
>  	struct hstate *h = hstate_vma(vma);
>  	struct folio *folio;
> -	long map_chg, map_commit;
> +	long map_chg, map_commit, nr_pages = pages_per_huge_page(h);
>  	long gbl_chg;
> -	int ret, idx;
> +	int memcg_charge_ret, ret, idx;
>  	struct hugetlb_cgroup *h_cg = NULL;
> +	struct mem_cgroup *memcg;
>  	bool deferred_reserve;
> +	gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
> +
> +	memcg = get_mem_cgroup_from_current();
> +	memcg_charge_ret = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages);
> +	if (memcg_charge_ret == -ENOMEM) {
> +		mem_cgroup_put(memcg);
> +		return ERR_PTR(-ENOMEM);
> +	}
>  
>  	idx = hstate_index(h);
>  	/*
> @@ -3022,8 +3032,12 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  	 * code of zero indicates a reservation exists (no change).
>  	 */
>  	map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
> -	if (map_chg < 0)
> +	if (map_chg < 0) {
> +		if (!memcg_charge_ret)
> +			mem_cgroup_cancel_charge(memcg, nr_pages);
> +		mem_cgroup_put(memcg);
>  		return ERR_PTR(-ENOMEM);
> +	}
>  
>  	/*
>  	 * Processes that did not create the mapping will have no
> @@ -3034,10 +3048,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  	 */
>  	if (map_chg || avoid_reserve) {
>  		gbl_chg = hugepage_subpool_get_pages(spool, 1);
> -		if (gbl_chg < 0) {
> -			vma_end_reservation(h, vma, addr);
> -			return ERR_PTR(-ENOSPC);
> -		}
> +		if (gbl_chg < 0)
> +			goto out_end_reservation;
>  
>  		/*
>  		 * Even though there was no reservation in the region/reserve
> @@ -3119,6 +3131,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  			hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
>  					pages_per_huge_page(h), folio);
>  	}
> +
> +	if (!memcg_charge_ret)
> +		mem_cgroup_commit_charge(folio, memcg);
> +	mem_cgroup_put(memcg);
> +
>  	return folio;
>  
>  out_uncharge_cgroup:
> @@ -3130,7 +3147,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  out_subpool_put:
>  	if (map_chg || avoid_reserve)
>  		hugepage_subpool_put_pages(spool, 1);
> +out_end_reservation:
>  	vma_end_reservation(h, vma, addr);
> +	if (!memcg_charge_ret)
> +		mem_cgroup_cancel_charge(memcg, nr_pages);
> +	mem_cgroup_put(memcg);
>  	return ERR_PTR(-ENOSPC);
>  }
>  

IIUC, huge page usage is charged in alloc_hugetlb_folio and uncharged in
free_huge_folio.  During migration, huge pages are allocated via
alloc_migrate_hugetlb_folio, not alloc_hugetlb_folio.  So, there is no
charging for the migration target page and we uncharge the source page.
It looks like there will be no charge for the huge page after migration?

If my analysis above is correct, then we may need to be careful about
this accounting.  We may not want both source and target pages to be
charged at the same time.
  
Nhat Pham Oct. 3, 2023, 6:01 p.m. UTC | #6
On Tue, Oct 3, 2023 at 10:13 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 10/02/23 17:18, Nhat Pham wrote:
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index de220e3ff8be..74472e911b0a 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1902,6 +1902,7 @@ void free_huge_folio(struct folio *folio)
> >                                    pages_per_huge_page(h), folio);
> >       hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> >                                         pages_per_huge_page(h), folio);
> > +     mem_cgroup_uncharge(folio);
> >       if (restore_reserve)
> >               h->resv_huge_pages++;
> >
> > @@ -3009,11 +3010,20 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> >       struct hugepage_subpool *spool = subpool_vma(vma);
> >       struct hstate *h = hstate_vma(vma);
> >       struct folio *folio;
> > -     long map_chg, map_commit;
> > +     long map_chg, map_commit, nr_pages = pages_per_huge_page(h);
> >       long gbl_chg;
> > -     int ret, idx;
> > +     int memcg_charge_ret, ret, idx;
> >       struct hugetlb_cgroup *h_cg = NULL;
> > +     struct mem_cgroup *memcg;
> >       bool deferred_reserve;
> > +     gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
> > +
> > +     memcg = get_mem_cgroup_from_current();
> > +     memcg_charge_ret = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages);
> > +     if (memcg_charge_ret == -ENOMEM) {
> > +             mem_cgroup_put(memcg);
> > +             return ERR_PTR(-ENOMEM);
> > +     }
> >
> >       idx = hstate_index(h);
> >       /*
> > @@ -3022,8 +3032,12 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> >        * code of zero indicates a reservation exists (no change).
> >        */
> >       map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
> > -     if (map_chg < 0)
> > +     if (map_chg < 0) {
> > +             if (!memcg_charge_ret)
> > +                     mem_cgroup_cancel_charge(memcg, nr_pages);
> > +             mem_cgroup_put(memcg);
> >               return ERR_PTR(-ENOMEM);
> > +     }
> >
> >       /*
> >        * Processes that did not create the mapping will have no
> > @@ -3034,10 +3048,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> >        */
> >       if (map_chg || avoid_reserve) {
> >               gbl_chg = hugepage_subpool_get_pages(spool, 1);
> > -             if (gbl_chg < 0) {
> > -                     vma_end_reservation(h, vma, addr);
> > -                     return ERR_PTR(-ENOSPC);
> > -             }
> > +             if (gbl_chg < 0)
> > +                     goto out_end_reservation;
> >
> >               /*
> >                * Even though there was no reservation in the region/reserve
> > @@ -3119,6 +3131,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> >                       hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> >                                       pages_per_huge_page(h), folio);
> >       }
> > +
> > +     if (!memcg_charge_ret)
> > +             mem_cgroup_commit_charge(folio, memcg);
> > +     mem_cgroup_put(memcg);
> > +
> >       return folio;
> >
> >  out_uncharge_cgroup:
> > @@ -3130,7 +3147,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> >  out_subpool_put:
> >       if (map_chg || avoid_reserve)
> >               hugepage_subpool_put_pages(spool, 1);
> > +out_end_reservation:
> >       vma_end_reservation(h, vma, addr);
> > +     if (!memcg_charge_ret)
> > +             mem_cgroup_cancel_charge(memcg, nr_pages);
> > +     mem_cgroup_put(memcg);
> >       return ERR_PTR(-ENOSPC);
> >  }
> >
>
> IIUC, huge page usage is charged in alloc_hugetlb_folio and uncharged in
> free_huge_folio.  During migration, huge pages are allocated via
> alloc_migrate_hugetlb_folio, not alloc_hugetlb_folio.  So, there is no
> charging for the migration target page and we uncharge the source page.
> It looks like there will be no charge for the huge page after migration?
>

Ah I see! This is a bit subtle indeed.

For the hugetlb controller, it looks like they update the cgroup info
inside move_hugetlb_state(), which calls hugetlb_cgroup_migrate()
to transfer the hugetlb cgroup info to the destination folio.

Perhaps we can do something analogous here.

> If my analysis above is correct, then we may need to be careful about
> this accounting.  We may not want both source and target pages to be
> charged at the same time.

We can create a variant of mem_cgroup_migrate that does not double
charge, but instead just copy the mem_cgroup information to the new
folio, and then clear that info from the old folio. That way the memory
usage counters are untouched.

Somebody with more expertise on migration should fact check me
of course :)

> --
> Mike Kravetz
  
Johannes Weiner Oct. 3, 2023, 6:39 p.m. UTC | #7
On Tue, Oct 03, 2023 at 11:01:24AM -0700, Nhat Pham wrote:
> On Tue, Oct 3, 2023 at 10:13 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >
> > On 10/02/23 17:18, Nhat Pham wrote:
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index de220e3ff8be..74472e911b0a 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -1902,6 +1902,7 @@ void free_huge_folio(struct folio *folio)
> > >                                    pages_per_huge_page(h), folio);
> > >       hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> > >                                         pages_per_huge_page(h), folio);
> > > +     mem_cgroup_uncharge(folio);
> > >       if (restore_reserve)
> > >               h->resv_huge_pages++;
> > >
> > > @@ -3009,11 +3010,20 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > >       struct hugepage_subpool *spool = subpool_vma(vma);
> > >       struct hstate *h = hstate_vma(vma);
> > >       struct folio *folio;
> > > -     long map_chg, map_commit;
> > > +     long map_chg, map_commit, nr_pages = pages_per_huge_page(h);
> > >       long gbl_chg;
> > > -     int ret, idx;
> > > +     int memcg_charge_ret, ret, idx;
> > >       struct hugetlb_cgroup *h_cg = NULL;
> > > +     struct mem_cgroup *memcg;
> > >       bool deferred_reserve;
> > > +     gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
> > > +
> > > +     memcg = get_mem_cgroup_from_current();
> > > +     memcg_charge_ret = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages);
> > > +     if (memcg_charge_ret == -ENOMEM) {
> > > +             mem_cgroup_put(memcg);
> > > +             return ERR_PTR(-ENOMEM);
> > > +     }
> > >
> > >       idx = hstate_index(h);
> > >       /*
> > > @@ -3022,8 +3032,12 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > >        * code of zero indicates a reservation exists (no change).
> > >        */
> > >       map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
> > > -     if (map_chg < 0)
> > > +     if (map_chg < 0) {
> > > +             if (!memcg_charge_ret)
> > > +                     mem_cgroup_cancel_charge(memcg, nr_pages);
> > > +             mem_cgroup_put(memcg);
> > >               return ERR_PTR(-ENOMEM);
> > > +     }
> > >
> > >       /*
> > >        * Processes that did not create the mapping will have no
> > > @@ -3034,10 +3048,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > >        */
> > >       if (map_chg || avoid_reserve) {
> > >               gbl_chg = hugepage_subpool_get_pages(spool, 1);
> > > -             if (gbl_chg < 0) {
> > > -                     vma_end_reservation(h, vma, addr);
> > > -                     return ERR_PTR(-ENOSPC);
> > > -             }
> > > +             if (gbl_chg < 0)
> > > +                     goto out_end_reservation;
> > >
> > >               /*
> > >                * Even though there was no reservation in the region/reserve
> > > @@ -3119,6 +3131,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > >                       hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> > >                                       pages_per_huge_page(h), folio);
> > >       }
> > > +
> > > +     if (!memcg_charge_ret)
> > > +             mem_cgroup_commit_charge(folio, memcg);
> > > +     mem_cgroup_put(memcg);
> > > +
> > >       return folio;
> > >
> > >  out_uncharge_cgroup:
> > > @@ -3130,7 +3147,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > >  out_subpool_put:
> > >       if (map_chg || avoid_reserve)
> > >               hugepage_subpool_put_pages(spool, 1);
> > > +out_end_reservation:
> > >       vma_end_reservation(h, vma, addr);
> > > +     if (!memcg_charge_ret)
> > > +             mem_cgroup_cancel_charge(memcg, nr_pages);
> > > +     mem_cgroup_put(memcg);
> > >       return ERR_PTR(-ENOSPC);
> > >  }
> > >
> >
> > IIUC, huge page usage is charged in alloc_hugetlb_folio and uncharged in
> > free_huge_folio.  During migration, huge pages are allocated via
> > alloc_migrate_hugetlb_folio, not alloc_hugetlb_folio.  So, there is no
> > charging for the migration target page and we uncharge the source page.
> > It looks like there will be no charge for the huge page after migration?
> >
> 
> Ah I see! This is a bit subtle indeed.
> 
> For the hugetlb controller, it looks like they update the cgroup info
> inside move_hugetlb_state(), which calls hugetlb_cgroup_migrate()
> to transfer the hugetlb cgroup info to the destination folio.
> 
> Perhaps we can do something analogous here.
> 
> > If my analysis above is correct, then we may need to be careful about
> > this accounting.  We may not want both source and target pages to be
> > charged at the same time.
> 
> We can create a variant of mem_cgroup_migrate that does not double
> charge, but instead just copy the mem_cgroup information to the new
> folio, and then clear that info from the old folio. That way the memory
> usage counters are untouched.
> 
> Somebody with more expertise on migration should fact check me
> of course :)

The only reason mem_cgroup_migrate() double charges right now is
because it's used by replace_page_cache_folio(). In that context, the
isolation of the old page isn't quite as thorough as with migration,
so it cannot transfer and uncharge directly. This goes back a long
time: 0a31bc97c80c3fa87b32c091d9a930ac19cd0c40

If you rename the current implementation to mem_cgroup_replace_page()
for that one caller, you can add a mem_cgroup_migrate() variant which
is charge neutral and clears old->memcg_data. This can be used for
regular and hugetlb page migration. Something like this (totally
untested):

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a4d3282493b6..17ec45bf3653 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7226,29 +7226,14 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
 	if (mem_cgroup_disabled())
 		return;
 
-	/* Page cache replacement: new folio already charged? */
-	if (folio_memcg(new))
-		return;
-
 	memcg = folio_memcg(old);
 	VM_WARN_ON_ONCE_FOLIO(!memcg, old);
 	if (!memcg)
 		return;
 
-	/* Force-charge the new page. The old one will be freed soon */
-	if (!mem_cgroup_is_root(memcg)) {
-		page_counter_charge(&memcg->memory, nr_pages);
-		if (do_memsw_account())
-			page_counter_charge(&memcg->memsw, nr_pages);
-	}
-
-	css_get(&memcg->css);
+	/* Transfer the charge and the css ref */
 	commit_charge(new, memcg);
-
-	local_irq_save(flags);
-	mem_cgroup_charge_statistics(memcg, nr_pages);
-	memcg_check_events(memcg, folio_nid(new));
-	local_irq_restore(flags);
+	old->memcg_data = 0;
 }
 
 DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
  
Nhat Pham Oct. 3, 2023, 10:09 p.m. UTC | #8
On Tue, Oct 3, 2023 at 11:39 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Oct 03, 2023 at 11:01:24AM -0700, Nhat Pham wrote:
> > On Tue, Oct 3, 2023 at 10:13 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > >
> > > On 10/02/23 17:18, Nhat Pham wrote:
> > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > > index de220e3ff8be..74472e911b0a 100644
> > > > --- a/mm/hugetlb.c
> > > > +++ b/mm/hugetlb.c
> > > > @@ -1902,6 +1902,7 @@ void free_huge_folio(struct folio *folio)
> > > >                                    pages_per_huge_page(h), folio);
> > > >       hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> > > >                                         pages_per_huge_page(h), folio);
> > > > +     mem_cgroup_uncharge(folio);
> > > >       if (restore_reserve)
> > > >               h->resv_huge_pages++;
> > > >
> > > > @@ -3009,11 +3010,20 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > > >       struct hugepage_subpool *spool = subpool_vma(vma);
> > > >       struct hstate *h = hstate_vma(vma);
> > > >       struct folio *folio;
> > > > -     long map_chg, map_commit;
> > > > +     long map_chg, map_commit, nr_pages = pages_per_huge_page(h);
> > > >       long gbl_chg;
> > > > -     int ret, idx;
> > > > +     int memcg_charge_ret, ret, idx;
> > > >       struct hugetlb_cgroup *h_cg = NULL;
> > > > +     struct mem_cgroup *memcg;
> > > >       bool deferred_reserve;
> > > > +     gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
> > > > +
> > > > +     memcg = get_mem_cgroup_from_current();
> > > > +     memcg_charge_ret = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages);
> > > > +     if (memcg_charge_ret == -ENOMEM) {
> > > > +             mem_cgroup_put(memcg);
> > > > +             return ERR_PTR(-ENOMEM);
> > > > +     }
> > > >
> > > >       idx = hstate_index(h);
> > > >       /*
> > > > @@ -3022,8 +3032,12 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > > >        * code of zero indicates a reservation exists (no change).
> > > >        */
> > > >       map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
> > > > -     if (map_chg < 0)
> > > > +     if (map_chg < 0) {
> > > > +             if (!memcg_charge_ret)
> > > > +                     mem_cgroup_cancel_charge(memcg, nr_pages);
> > > > +             mem_cgroup_put(memcg);
> > > >               return ERR_PTR(-ENOMEM);
> > > > +     }
> > > >
> > > >       /*
> > > >        * Processes that did not create the mapping will have no
> > > > @@ -3034,10 +3048,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > > >        */
> > > >       if (map_chg || avoid_reserve) {
> > > >               gbl_chg = hugepage_subpool_get_pages(spool, 1);
> > > > -             if (gbl_chg < 0) {
> > > > -                     vma_end_reservation(h, vma, addr);
> > > > -                     return ERR_PTR(-ENOSPC);
> > > > -             }
> > > > +             if (gbl_chg < 0)
> > > > +                     goto out_end_reservation;
> > > >
> > > >               /*
> > > >                * Even though there was no reservation in the region/reserve
> > > > @@ -3119,6 +3131,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > > >                       hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> > > >                                       pages_per_huge_page(h), folio);
> > > >       }
> > > > +
> > > > +     if (!memcg_charge_ret)
> > > > +             mem_cgroup_commit_charge(folio, memcg);
> > > > +     mem_cgroup_put(memcg);
> > > > +
> > > >       return folio;
> > > >
> > > >  out_uncharge_cgroup:
> > > > @@ -3130,7 +3147,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > > >  out_subpool_put:
> > > >       if (map_chg || avoid_reserve)
> > > >               hugepage_subpool_put_pages(spool, 1);
> > > > +out_end_reservation:
> > > >       vma_end_reservation(h, vma, addr);
> > > > +     if (!memcg_charge_ret)
> > > > +             mem_cgroup_cancel_charge(memcg, nr_pages);
> > > > +     mem_cgroup_put(memcg);
> > > >       return ERR_PTR(-ENOSPC);
> > > >  }
> > > >
> > >
> > > IIUC, huge page usage is charged in alloc_hugetlb_folio and uncharged in
> > > free_huge_folio.  During migration, huge pages are allocated via
> > > alloc_migrate_hugetlb_folio, not alloc_hugetlb_folio.  So, there is no
> > > charging for the migration target page and we uncharge the source page.
> > > It looks like there will be no charge for the huge page after migration?
> > >
> >
> > Ah I see! This is a bit subtle indeed.
> >
> > For the hugetlb controller, it looks like they update the cgroup info
> > inside move_hugetlb_state(), which calls hugetlb_cgroup_migrate()
> > to transfer the hugetlb cgroup info to the destination folio.
> >
> > Perhaps we can do something analogous here.
> >
> > > If my analysis above is correct, then we may need to be careful about
> > > this accounting.  We may not want both source and target pages to be
> > > charged at the same time.
> >
> > We can create a variant of mem_cgroup_migrate that does not double
> > charge, but instead just copy the mem_cgroup information to the new
> > folio, and then clear that info from the old folio. That way the memory
> > usage counters are untouched.
> >
> > Somebody with more expertise on migration should fact check me
> > of course :)
>
> The only reason mem_cgroup_migrate() double charges right now is
> because it's used by replace_page_cache_folio(). In that context, the
> isolation of the old page isn't quite as thorough as with migration,
> so it cannot transfer and uncharge directly. This goes back a long
> time: 0a31bc97c80c3fa87b32c091d9a930ac19cd0c40
>
> If you rename the current implementation to mem_cgroup_replace_page()
> for that one caller, you can add a mem_cgroup_migrate() variant which
> is charge neutral and clears old->memcg_data. This can be used for
> regular and hugetlb page migration. Something like this (totally
> untested):
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a4d3282493b6..17ec45bf3653 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7226,29 +7226,14 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
>         if (mem_cgroup_disabled())
>                 return;
>
> -       /* Page cache replacement: new folio already charged? */
> -       if (folio_memcg(new))
> -               return;
> -
>         memcg = folio_memcg(old);
>         VM_WARN_ON_ONCE_FOLIO(!memcg, old);
>         if (!memcg)
>                 return;
>
> -       /* Force-charge the new page. The old one will be freed soon */
> -       if (!mem_cgroup_is_root(memcg)) {
> -               page_counter_charge(&memcg->memory, nr_pages);
> -               if (do_memsw_account())
> -                       page_counter_charge(&memcg->memsw, nr_pages);
> -       }
> -
> -       css_get(&memcg->css);
> +       /* Transfer the charge and the css ref */
>         commit_charge(new, memcg);
> -
> -       local_irq_save(flags);
> -       mem_cgroup_charge_statistics(memcg, nr_pages);
> -       memcg_check_events(memcg, folio_nid(new));
> -       local_irq_restore(flags);
> +       old->memcg_data = 0;
>  }
>
>  DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
>

Ah, I like this. Will send a fixlet based on this :)
I was scratching my head trying to figure out why we were
doing the double charging in the first place. Thanks for the context,
Johannes!
  
Mike Kravetz Oct. 3, 2023, 10:42 p.m. UTC | #9
On 10/03/23 15:09, Nhat Pham wrote:
> On Tue, Oct 3, 2023 at 11:39 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Tue, Oct 03, 2023 at 11:01:24AM -0700, Nhat Pham wrote:
> > > On Tue, Oct 3, 2023 at 10:13 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > > > On 10/02/23 17:18, Nhat Pham wrote:
> > > >
> > > > IIUC, huge page usage is charged in alloc_hugetlb_folio and uncharged in
> > > > free_huge_folio.  During migration, huge pages are allocated via
> > > > alloc_migrate_hugetlb_folio, not alloc_hugetlb_folio.  So, there is no
> > > > charging for the migration target page and we uncharge the source page.
> > > > It looks like there will be no charge for the huge page after migration?
> > > >
> > >
> > > Ah I see! This is a bit subtle indeed.
> > >
> > > For the hugetlb controller, it looks like they update the cgroup info
> > > inside move_hugetlb_state(), which calls hugetlb_cgroup_migrate()
> > > to transfer the hugetlb cgroup info to the destination folio.
> > >
> > > Perhaps we can do something analogous here.
> > >
> > > > If my analysis above is correct, then we may need to be careful about
> > > > this accounting.  We may not want both source and target pages to be
> > > > charged at the same time.
> > >
> > > We can create a variant of mem_cgroup_migrate that does not double
> > > charge, but instead just copy the mem_cgroup information to the new
> > > folio, and then clear that info from the old folio. That way the memory
> > > usage counters are untouched.
> > >
> > > Somebody with more expertise on migration should fact check me
> > > of course :)
> >
> > The only reason mem_cgroup_migrate() double charges right now is
> > because it's used by replace_page_cache_folio(). In that context, the
> > isolation of the old page isn't quite as thorough as with migration,
> > so it cannot transfer and uncharge directly. This goes back a long
> > time: 0a31bc97c80c3fa87b32c091d9a930ac19cd0c40
> >
> > If you rename the current implementation to mem_cgroup_replace_page()
> > for that one caller, you can add a mem_cgroup_migrate() variant which
> > is charge neutral and clears old->memcg_data. This can be used for
> > regular and hugetlb page migration. Something like this (totally
> > untested):
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index a4d3282493b6..17ec45bf3653 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -7226,29 +7226,14 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
> >         if (mem_cgroup_disabled())
> >                 return;
> >
> > -       /* Page cache replacement: new folio already charged? */
> > -       if (folio_memcg(new))
> > -               return;
> > -
> >         memcg = folio_memcg(old);
> >         VM_WARN_ON_ONCE_FOLIO(!memcg, old);
> >         if (!memcg)
> >                 return;
> >
> > -       /* Force-charge the new page. The old one will be freed soon */
> > -       if (!mem_cgroup_is_root(memcg)) {
> > -               page_counter_charge(&memcg->memory, nr_pages);
> > -               if (do_memsw_account())
> > -                       page_counter_charge(&memcg->memsw, nr_pages);
> > -       }
> > -
> > -       css_get(&memcg->css);
> > +       /* Transfer the charge and the css ref */
> >         commit_charge(new, memcg);
> > -
> > -       local_irq_save(flags);
> > -       mem_cgroup_charge_statistics(memcg, nr_pages);
> > -       memcg_check_events(memcg, folio_nid(new));
> > -       local_irq_restore(flags);
> > +       old->memcg_data = 0;
> >  }
> >
> >  DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
> >
> 
> Ah, I like this. Will send a fixlet based on this :)
> I was scratching my head trying to figure out why we were
> doing the double charging in the first place. Thanks for the context,
> Johannes!

Be sure to check for code similar to this in folio_migrate_flags:

void folio_migrate_flags(struct folio *newfolio, struct folio *folio)
{
...
	if (!folio_test_hugetlb(folio))
		mem_cgroup_migrate(folio, newfolio);
}

There are many places where hugetlb is special cased.
  
Nhat Pham Oct. 3, 2023, 11:26 p.m. UTC | #10
On Tue, Oct 3, 2023 at 3:42 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 10/03/23 15:09, Nhat Pham wrote:
> > On Tue, Oct 3, 2023 at 11:39 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > On Tue, Oct 03, 2023 at 11:01:24AM -0700, Nhat Pham wrote:
> > > > On Tue, Oct 3, 2023 at 10:13 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > > > > On 10/02/23 17:18, Nhat Pham wrote:
> > > > >
> > > > > IIUC, huge page usage is charged in alloc_hugetlb_folio and uncharged in
> > > > > free_huge_folio.  During migration, huge pages are allocated via
> > > > > alloc_migrate_hugetlb_folio, not alloc_hugetlb_folio.  So, there is no
> > > > > charging for the migration target page and we uncharge the source page.
> > > > > It looks like there will be no charge for the huge page after migration?
> > > > >
> > > >
> > > > Ah I see! This is a bit subtle indeed.
> > > >
> > > > For the hugetlb controller, it looks like they update the cgroup info
> > > > inside move_hugetlb_state(), which calls hugetlb_cgroup_migrate()
> > > > to transfer the hugetlb cgroup info to the destination folio.
> > > >
> > > > Perhaps we can do something analogous here.
> > > >
> > > > > If my analysis above is correct, then we may need to be careful about
> > > > > this accounting.  We may not want both source and target pages to be
> > > > > charged at the same time.
> > > >
> > > > We can create a variant of mem_cgroup_migrate that does not double
> > > > charge, but instead just copy the mem_cgroup information to the new
> > > > folio, and then clear that info from the old folio. That way the memory
> > > > usage counters are untouched.
> > > >
> > > > Somebody with more expertise on migration should fact check me
> > > > of course :)
> > >
> > > The only reason mem_cgroup_migrate() double charges right now is
> > > because it's used by replace_page_cache_folio(). In that context, the
> > > isolation of the old page isn't quite as thorough as with migration,
> > > so it cannot transfer and uncharge directly. This goes back a long
> > > time: 0a31bc97c80c3fa87b32c091d9a930ac19cd0c40
> > >
> > > If you rename the current implementation to mem_cgroup_replace_page()
> > > for that one caller, you can add a mem_cgroup_migrate() variant which
> > > is charge neutral and clears old->memcg_data. This can be used for
> > > regular and hugetlb page migration. Something like this (totally
> > > untested):
> > >
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index a4d3282493b6..17ec45bf3653 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -7226,29 +7226,14 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
> > >         if (mem_cgroup_disabled())
> > >                 return;
> > >
> > > -       /* Page cache replacement: new folio already charged? */
> > > -       if (folio_memcg(new))
> > > -               return;
> > > -
> > >         memcg = folio_memcg(old);
> > >         VM_WARN_ON_ONCE_FOLIO(!memcg, old);
> > >         if (!memcg)
> > >                 return;
> > >
> > > -       /* Force-charge the new page. The old one will be freed soon */
> > > -       if (!mem_cgroup_is_root(memcg)) {
> > > -               page_counter_charge(&memcg->memory, nr_pages);
> > > -               if (do_memsw_account())
> > > -                       page_counter_charge(&memcg->memsw, nr_pages);
> > > -       }
> > > -
> > > -       css_get(&memcg->css);
> > > +       /* Transfer the charge and the css ref */
> > >         commit_charge(new, memcg);
> > > -
> > > -       local_irq_save(flags);
> > > -       mem_cgroup_charge_statistics(memcg, nr_pages);
> > > -       memcg_check_events(memcg, folio_nid(new));
> > > -       local_irq_restore(flags);
> > > +       old->memcg_data = 0;
> > >  }
> > >
> > >  DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
> > >
> >
> > Ah, I like this. Will send a fixlet based on this :)
> > I was scratching my head trying to figure out why we were
> > doing the double charging in the first place. Thanks for the context,
> > Johannes!
>
> Be sure to check for code similar to this in folio_migrate_flags:
>
> void folio_migrate_flags(struct folio *newfolio, struct folio *folio)
> {
> ...
>         if (!folio_test_hugetlb(folio))
>                 mem_cgroup_migrate(folio, newfolio);
> }
>
> There are many places where hugetlb is special cased.

Yeah makes sense. I'm actually gonna take advantage of this,
and remove the test hugetlb check here, so that it will also
migrate the memcg metadata in this case too.  See the new patch
I just sent out.

> --
> Mike Kravetz
  

Patch

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 622a7f28db1f..606b2e0eac4b 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -210,6 +210,35 @@  cgroup v2 currently supports the following mount options.
         relying on the original semantics (e.g. specifying bogusly
         high 'bypass' protection values at higher tree levels).
 
+  memory_hugetlb_accounting
+        Count HugeTLB memory usage towards the cgroup's overall
+        memory usage for the memory controller (for the purpose of
+        statistics reporting and memory protetion). This is a new
+        behavior that could regress existing setups, so it must be
+        explicitly opted in with this mount option.
+
+        A few caveats to keep in mind:
+
+        * There is no HugeTLB pool management involved in the memory
+          controller. The pre-allocated pool does not belong to anyone.
+          Specifically, when a new HugeTLB folio is allocated to
+          the pool, it is not accounted for from the perspective of the
+          memory controller. It is only charged to a cgroup when it is
+          actually used (for e.g at page fault time). Host memory
+          overcommit management has to consider this when configuring
+          hard limits. In general, HugeTLB pool management should be
+          done via other mechanisms (such as the HugeTLB controller).
+        * Failure to charge a HugeTLB folio to the memory controller
+          results in SIGBUS. This could happen even if the HugeTLB pool
+          still has pages available (but the cgroup limit is hit and
+          reclaim attempt fails).
+        * Charging HugeTLB memory towards the memory controller affects
+          memory protection and reclaim dynamics. Any userspace tuning
+          (of low, min limits for e.g) needs to take this into account.
+        * HugeTLB pages utilized while this option is not selected
+          will not be tracked by the memory controller (even if cgroup
+          v2 is remounted later on).
+
 
 Organizing Processes and Threads
 --------------------------------
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index f1b3151ac30b..8641f4320c98 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -115,6 +115,11 @@  enum {
 	 * Enable recursive subtree protection
 	 */
 	CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 18),
+
+	/*
+	 * Enable hugetlb accounting for the memory controller.
+	 */
+	 CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING = (1 << 19),
 };
 
 /* cftype->flags */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 42bf7e9b1a2f..a827e2129790 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -679,6 +679,9 @@  static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
 	return __mem_cgroup_charge(folio, mm, gfp);
 }
 
+int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
+		long nr_pages);
+
 int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
 				  gfp_t gfp, swp_entry_t entry);
 void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
@@ -1262,6 +1265,12 @@  static inline int mem_cgroup_charge(struct folio *folio,
 	return 0;
 }
 
+static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
+		gfp_t gfp, long nr_pages)
+{
+	return 0;
+}
+
 static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
 			struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
 {
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1fb7f562289d..f11488b18ceb 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1902,6 +1902,7 @@  enum cgroup2_param {
 	Opt_favordynmods,
 	Opt_memory_localevents,
 	Opt_memory_recursiveprot,
+	Opt_memory_hugetlb_accounting,
 	nr__cgroup2_params
 };
 
@@ -1910,6 +1911,7 @@  static const struct fs_parameter_spec cgroup2_fs_parameters[] = {
 	fsparam_flag("favordynmods",		Opt_favordynmods),
 	fsparam_flag("memory_localevents",	Opt_memory_localevents),
 	fsparam_flag("memory_recursiveprot",	Opt_memory_recursiveprot),
+	fsparam_flag("memory_hugetlb_accounting", Opt_memory_hugetlb_accounting),
 	{}
 };
 
@@ -1936,6 +1938,9 @@  static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param
 	case Opt_memory_recursiveprot:
 		ctx->flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
 		return 0;
+	case Opt_memory_hugetlb_accounting:
+		ctx->flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
+		return 0;
 	}
 	return -EINVAL;
 }
@@ -1960,6 +1965,11 @@  static void apply_cgroup_root_flags(unsigned int root_flags)
 			cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
 		else
 			cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_RECURSIVE_PROT;
+
+		if (root_flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
+			cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
+		else
+			cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
 	}
 }
 
@@ -1973,6 +1983,8 @@  static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
 		seq_puts(seq, ",memory_localevents");
 	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)
 		seq_puts(seq, ",memory_recursiveprot");
+	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
+		seq_puts(seq, ",memory_hugetlb_accounting");
 	return 0;
 }
 
@@ -7050,7 +7062,8 @@  static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr,
 			"nsdelegate\n"
 			"favordynmods\n"
 			"memory_localevents\n"
-			"memory_recursiveprot\n");
+			"memory_recursiveprot\n"
+			"memory_hugetlb_accounting\n");
 }
 static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index de220e3ff8be..74472e911b0a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1902,6 +1902,7 @@  void free_huge_folio(struct folio *folio)
 				     pages_per_huge_page(h), folio);
 	hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
 					  pages_per_huge_page(h), folio);
+	mem_cgroup_uncharge(folio);
 	if (restore_reserve)
 		h->resv_huge_pages++;
 
@@ -3009,11 +3010,20 @@  struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	struct hugepage_subpool *spool = subpool_vma(vma);
 	struct hstate *h = hstate_vma(vma);
 	struct folio *folio;
-	long map_chg, map_commit;
+	long map_chg, map_commit, nr_pages = pages_per_huge_page(h);
 	long gbl_chg;
-	int ret, idx;
+	int memcg_charge_ret, ret, idx;
 	struct hugetlb_cgroup *h_cg = NULL;
+	struct mem_cgroup *memcg;
 	bool deferred_reserve;
+	gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
+
+	memcg = get_mem_cgroup_from_current();
+	memcg_charge_ret = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages);
+	if (memcg_charge_ret == -ENOMEM) {
+		mem_cgroup_put(memcg);
+		return ERR_PTR(-ENOMEM);
+	}
 
 	idx = hstate_index(h);
 	/*
@@ -3022,8 +3032,12 @@  struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	 * code of zero indicates a reservation exists (no change).
 	 */
 	map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
-	if (map_chg < 0)
+	if (map_chg < 0) {
+		if (!memcg_charge_ret)
+			mem_cgroup_cancel_charge(memcg, nr_pages);
+		mem_cgroup_put(memcg);
 		return ERR_PTR(-ENOMEM);
+	}
 
 	/*
 	 * Processes that did not create the mapping will have no
@@ -3034,10 +3048,8 @@  struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	 */
 	if (map_chg || avoid_reserve) {
 		gbl_chg = hugepage_subpool_get_pages(spool, 1);
-		if (gbl_chg < 0) {
-			vma_end_reservation(h, vma, addr);
-			return ERR_PTR(-ENOSPC);
-		}
+		if (gbl_chg < 0)
+			goto out_end_reservation;
 
 		/*
 		 * Even though there was no reservation in the region/reserve
@@ -3119,6 +3131,11 @@  struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 			hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
 					pages_per_huge_page(h), folio);
 	}
+
+	if (!memcg_charge_ret)
+		mem_cgroup_commit_charge(folio, memcg);
+	mem_cgroup_put(memcg);
+
 	return folio;
 
 out_uncharge_cgroup:
@@ -3130,7 +3147,11 @@  struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 out_subpool_put:
 	if (map_chg || avoid_reserve)
 		hugepage_subpool_put_pages(spool, 1);
+out_end_reservation:
 	vma_end_reservation(h, vma, addr);
+	if (!memcg_charge_ret)
+		mem_cgroup_cancel_charge(memcg, nr_pages);
+	mem_cgroup_put(memcg);
 	return ERR_PTR(-ENOSPC);
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0219befeae38..6660684f6f97 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7085,6 +7085,41 @@  int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp)
 	return ret;
 }
 
+/**
+ * mem_cgroup_hugetlb_try_charge - try to charge the memcg for a hugetlb folio
+ * @memcg: memcg to charge.
+ * @gfp: reclaim mode.
+ * @nr_pages: number of pages to charge.
+ *
+ * This function is called when allocating a huge page folio to determine if
+ * the memcg has the capacity for it. It does not commit the charge yet,
+ * as the hugetlb folio itself has not been obtained from the hugetlb pool.
+ *
+ * Once we have obtained the hugetlb folio, we can call
+ * mem_cgroup_commit_charge() to commit the charge. If we fail to obtain the
+ * folio, we should instead call mem_cgroup_cancel_charge() to undo the effect
+ * of try_charge().
+ *
+ * Returns 0 on success. Otherwise, an error code is returned.
+ */
+int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
+			long nr_pages)
+{
+	/*
+	 * If hugetlb memcg charging is not enabled, do not fail hugetlb allocation,
+	 * but do not attempt to commit charge later (or cancel on error) either.
+	 */
+	if (mem_cgroup_disabled() || !memcg ||
+		!cgroup_subsys_on_dfl(memory_cgrp_subsys) ||
+		!(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING))
+		return -EOPNOTSUPP;
+
+	if (try_charge(memcg, gfp, nr_pages))
+		return -ENOMEM;
+
+	return 0;
+}
+
 /**
  * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin.
  * @folio: folio to charge.