Message ID | 20230928005723.1709119-1-nphamcs@gmail.com |
---|---|
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:cae8:0:b0:403:3b70:6f57 with SMTP id r8csp3018654vqu; Wed, 27 Sep 2023 19:02:20 -0700 (PDT) X-Google-Smtp-Source: AGHT+IH6mwZWWqj2+1635dFtvNIybJq2AYdd2PgOpUgqgeJPXnCjKQOjO/+N7/LSaRD2Xh+fU4JB X-Received: by 2002:a05:6a00:10d1:b0:692:cede:c902 with SMTP id d17-20020a056a0010d100b00692cedec902mr3635390pfu.9.1695866540372; Wed, 27 Sep 2023 19:02:20 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1695866540; cv=none; d=google.com; s=arc-20160816; b=dkUAol50OK9FOVZlZJBXMfv6n4IcMvnvOCWLXVoFlDis/W1s1hwn2yEcdMZS5h4kMy IQUvtnCrna4K5TMRIamEVzSIfaWJ0eqx/lGLPSxsA37XOoUe8XP4BYDBLQmNRlmSzX1i 7QII8QD1wAHEPS60Z0AIzLhaVsu5yEgng1SbAQMNAnS5CKBAjq0rFsodRHMregj1CwlN qn/qu80Tny1sQzh0qnHnKc87wIpwlzM8UDsBabHeGYCn5jLlhn0mVAwVCfEQEnU0dnn7 5ASLY2wNWr5aid7p0utuBQ9fW3oP3no38WU/9Wf3JPMqaChk6vnyl+bQ6NDnJ4OvI7eI CjpA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:dkim-signature; bh=sD2wWZIfPcmHhJT+GMtj+x1EEBi2NflbeYV6Occ/i+c=; fh=dr01barU9xhgfmkh9W9wZxPUiBGNGqZfFlS6t/GIbLY=; b=sjQlXaBluFoaxLJ3gqAmIQwb4NoIMX+n7iD3Jv4wFrKZLhmVoTv42FOEF+l86NbK3+ XN3EgRRa+heKzcRTmZzcqnoZqzi6awGNFgTpZMItj980XrcdVKlPwlE5qqJ9rCVtSehX vdzjzD2z6hA61h3z1JH4JjPf1Ff9ilVeRNxtfq7fHpipEzhhIGf319gHitKfaItMcyVX kDKKYPfQ5QI+E9/2K75MYN+9qWFKVgjMvOFx29h717M1n0LnoVlVrlGt9LrIwRixvaPk KlyudXnC6UjZxBp4N3UNyJbaxQoH2biQypHdMPsc32Al1GnVSkawwQxhvd0sSuyGBown BhFw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20230601 header.b=FVveHHzX; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.36 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: from pete.vger.email (pete.vger.email. [23.128.96.36]) by mx.google.com with ESMTPS id c2-20020a056a00248200b0068fe810e8a2si1027655pfv.185.2023.09.27.19.02.20 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 27 Sep 2023 19:02:20 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.36 as permitted sender) client-ip=23.128.96.36; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20230601 header.b=FVveHHzX; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.36 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by pete.vger.email (Postfix) with ESMTP id 3ACFD818597A; Wed, 27 Sep 2023 17:57:42 -0700 (PDT) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.10 at pete.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229793AbjI1A51 (ORCPT <rfc822;ruipengqi7@gmail.com> + 20 others); Wed, 27 Sep 2023 20:57:27 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52734 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229539AbjI1A50 (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Wed, 27 Sep 2023 20:57:26 -0400 Received: from mail-pg1-x52d.google.com (mail-pg1-x52d.google.com [IPv6:2607:f8b0:4864:20::52d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EB3CEBF; Wed, 27 Sep 2023 17:57:24 -0700 (PDT) Received: by mail-pg1-x52d.google.com with SMTP id 41be03b00d2f7-5827f6d60aaso5115715a12.3; Wed, 27 Sep 2023 17:57:24 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1695862644; x=1696467444; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:from:to:cc:subject:date:message-id:reply-to; bh=sD2wWZIfPcmHhJT+GMtj+x1EEBi2NflbeYV6Occ/i+c=; b=FVveHHzXpfOjGV2H1qN7CkdbKSJAZobytc4P6O3U2rvYHA4ei2/j4Zk3+rehgivJUM /l9CR6unAngMyhGrfjVubf51UatzVmj9TF7qpsHzqr63X2VJv6ARMThjv1Af3X63ZsLH YZEMMNq5BszHFA1rC3dIvAACk2jGKOmCoPZCt1bjb0Qv6JwiHj5/VEnh+pseeTNwk5WG gcjroof8llK7wzNEtrcAFwJQe7p/ObmihT56Qnmo6lWW+NoXRBZi3dXjhLTmKQs1Jd3+ 0tBsUjZWps+eD0CbyjeFS5ON77Xq8HN+9SwqgMf3Pb82NC6dGBlGiu8nr/pMKS2FlGLP 2Zig== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1695862644; x=1696467444; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=sD2wWZIfPcmHhJT+GMtj+x1EEBi2NflbeYV6Occ/i+c=; b=MpktOFvW71H+ZW4flGIy8atFUxuaqCttBmY/fEHXavQFxCUJbMbeLObrED4vqrKGJe bFBWsEWTf6g5nY6oANd24aMJZDcrPfUCVUhxL+YA/O3p+RJ+ggoUP8WlSLlbT/cFSz0V 56LCj6e1iNdhTsdMhT18bMax7mqhdzjXZH9CLrdFB6ouG9ZYmuS0NsOI1kE4zwN374SJ tIEDr7TgXMnU5mG6GjpmLx2yJ5EqF2dYpIh1J2dXnn2ZnAsfqYoo3F9MGe8ZoOKK9D0y MQGS5VMWhNSzHv4490KNpnwVU7KX0wgeQMbGc+12lCn/3gZZrEkaT6+aOQ4bj9UaVr7c s8WQ== X-Gm-Message-State: AOJu0YzqOaQ6Kp67kZwzZKcKkiFMfu2MBubXggCCLg3qqd8b8w8+v0Ab t1Zul7h7GyHSdrOYROOCNG8= X-Received: by 2002:a05:6a20:9147:b0:160:c1b9:a759 with SMTP id x7-20020a056a20914700b00160c1b9a759mr4043353pzc.20.1695862644240; Wed, 27 Sep 2023 17:57:24 -0700 (PDT) Received: from localhost (fwdproxy-prn-003.fbsv.net. [2a03:2880:ff:3::face:b00c]) by smtp.gmail.com with ESMTPSA id z1-20020a170902708100b001bb0eebd90asm13684774plk.245.2023.09.27.17.57.23 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 27 Sep 2023 17:57:23 -0700 (PDT) From: Nhat Pham <nphamcs@gmail.com> To: akpm@linux-foundation.org Cc: riel@surriel.com, hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev, shakeelb@google.com, muchun.song@linux.dev, tj@kernel.org, lizefan.x@bytedance.com, shuah@kernel.org, mike.kravetz@oracle.com, yosryahmed@google.com, linux-mm@kvack.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org Subject: [PATCH v2 0/2] hugetlb memcg accounting Date: Wed, 27 Sep 2023 17:57:21 -0700 Message-Id: <20230928005723.1709119-1-nphamcs@gmail.com> X-Mailer: git-send-email 2.34.1 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-0.6 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on pete.vger.email Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (pete.vger.email [0.0.0.0]); Wed, 27 Sep 2023 17:57:42 -0700 (PDT) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1778244953504574119 X-GMAIL-MSGID: 1778244953504574119 |
Series |
hugetlb memcg accounting
|
|
Message
Nhat Pham
Sept. 28, 2023, 12:57 a.m. UTC
Changelog: v2: * Add a cgroup mount option to enable/disable the new hugetlb memcg accounting behavior (patch 1) (suggested by Johannes Weiner). * Add a couple more ksft_print_msg() on error to aid debugging when the selftest fails. (patch 2) Currently, hugetlb memory usage is not acounted for in the memory controller, which could lead to memory overprotection for cgroups with hugetlb-backed memory. This has been observed in our production system. This patch series rectifies this issue by charging the memcg when the hugetlb folio is allocated, and uncharging when the folio is freed. In addition, a new selftest is added to demonstrate and verify this new behavior. Nhat Pham (2): hugetlb: memcg: account hugetlb-backed memory in memory controller selftests: add a selftest to verify hugetlb usage in memcg Documentation/admin-guide/cgroup-v2.rst | 9 + MAINTAINERS | 2 + fs/hugetlbfs/inode.c | 2 +- include/linux/cgroup-defs.h | 5 + include/linux/hugetlb.h | 6 +- include/linux/memcontrol.h | 8 + kernel/cgroup/cgroup.c | 15 +- mm/hugetlb.c | 23 +- mm/memcontrol.c | 41 +++ tools/testing/selftests/cgroup/.gitignore | 1 + tools/testing/selftests/cgroup/Makefile | 2 + .../selftests/cgroup/test_hugetlb_memcg.c | 234 ++++++++++++++++++ 12 files changed, 338 insertions(+), 10 deletions(-) create mode 100644 tools/testing/selftests/cgroup/test_hugetlb_memcg.c
Comments
On Thu, Sep 28, 2023 at 3:59 PM Frank van der Linden <fvdl@google.com> wrote: > > On Wed, Sep 27, 2023 at 5:57 PM Nhat Pham <nphamcs@gmail.com> wrote: >> >> Currently, hugetlb memory usage is not acounted for in the memory >> controller, which could lead to memory overprotection for cgroups with >> hugetlb-backed memory. This has been observed in our production system. >> >> This patch rectifies this issue by charging the memcg when the hugetlb >> folio is allocated, and uncharging when the folio is freed (analogous to >> the hugetlb controller). >> >> Signed-off-by: Nhat Pham <nphamcs@gmail.com> >> --- >> Documentation/admin-guide/cgroup-v2.rst | 9 ++++++ >> fs/hugetlbfs/inode.c | 2 +- >> include/linux/cgroup-defs.h | 5 +++ >> include/linux/hugetlb.h | 6 ++-- >> include/linux/memcontrol.h | 8 +++++ >> kernel/cgroup/cgroup.c | 15 ++++++++- >> mm/hugetlb.c | 23 ++++++++++---- >> mm/memcontrol.c | 41 +++++++++++++++++++++++++ >> 8 files changed, 99 insertions(+), 10 deletions(-) >> >> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst >> index 622a7f28db1f..e6267b8cbd1d 100644 >> --- a/Documentation/admin-guide/cgroup-v2.rst >> +++ b/Documentation/admin-guide/cgroup-v2.rst >> @@ -210,6 +210,15 @@ cgroup v2 currently supports the following mount options. >> relying on the original semantics (e.g. specifying bogusly >> high 'bypass' protection values at higher tree levels). >> >> + memory_hugetlb_accounting >> + Count hugetlb memory usage towards the cgroup's overall >> + memory usage for the memory controller. This is a new behavior >> + that could regress existing setups, so it must be explicitly >> + opted in with this mount option. Note that hugetlb pages >> + allocated while this option is not selected will not be >> + tracked by the memory controller (even if cgroup v2 is >> + remounted later on). >> + >> >> Organizing Processes and Threads >> -------------------------------- >> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c >> index 60fce26ff937..034967319955 100644 >> --- a/fs/hugetlbfs/inode.c >> +++ b/fs/hugetlbfs/inode.c >> @@ -902,7 +902,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, >> * to keep reservation accounting consistent. >> */ >> hugetlb_set_vma_policy(&pseudo_vma, inode, index); >> - folio = alloc_hugetlb_folio(&pseudo_vma, addr, 0); >> + folio = alloc_hugetlb_folio(&pseudo_vma, addr, 0, true); >> hugetlb_drop_vma_policy(&pseudo_vma); >> if (IS_ERR(folio)) { >> mutex_unlock(&hugetlb_fault_mutex_table[hash]); >> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h >> index f1b3151ac30b..8641f4320c98 100644 >> --- a/include/linux/cgroup-defs.h >> +++ b/include/linux/cgroup-defs.h >> @@ -115,6 +115,11 @@ enum { >> * Enable recursive subtree protection >> */ >> CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 18), >> + >> + /* >> + * Enable hugetlb accounting for the memory controller. >> + */ >> + CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING = (1 << 19), >> }; >> >> /* cftype->flags */ >> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h >> index a30686e649f7..9b73db1605a2 100644 >> --- a/include/linux/hugetlb.h >> +++ b/include/linux/hugetlb.h >> @@ -713,7 +713,8 @@ struct huge_bootmem_page { >> >> int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list); >> struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, >> - unsigned long addr, int avoid_reserve); >> + unsigned long addr, int avoid_reserve, >> + bool restore_reserve_on_memcg_failure); >> struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid, >> nodemask_t *nmask, gfp_t gfp_mask); >> struct folio *alloc_hugetlb_folio_vma(struct hstate *h, struct vm_area_struct *vma, >> @@ -1016,7 +1017,8 @@ static inline int isolate_or_dissolve_huge_page(struct page *page, >> >> static inline struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, >> unsigned long addr, >> - int avoid_reserve) >> + int avoid_reserve, >> + bool restore_reserve_on_memcg_failure) >> { >> return NULL; >> } >> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h >> index e0cfab58ab71..8094679c99dd 100644 >> --- a/include/linux/memcontrol.h >> +++ b/include/linux/memcontrol.h >> @@ -677,6 +677,8 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, >> return __mem_cgroup_charge(folio, mm, gfp); >> } >> >> +int mem_cgroup_hugetlb_charge_folio(struct folio *folio, gfp_t gfp); >> + >> int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, >> gfp_t gfp, swp_entry_t entry); >> void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry); >> @@ -1251,6 +1253,12 @@ static inline int mem_cgroup_charge(struct folio *folio, >> return 0; >> } >> >> +static inline int mem_cgroup_hugetlb_charge_folio(struct folio *folio, >> + gfp_t gfp) >> +{ >> + return 0; >> +} >> + >> static inline int mem_cgroup_swapin_charge_folio(struct folio *folio, >> struct mm_struct *mm, gfp_t gfp, swp_entry_t entry) >> { >> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c >> index 1fb7f562289d..f11488b18ceb 100644 >> --- a/kernel/cgroup/cgroup.c >> +++ b/kernel/cgroup/cgroup.c >> @@ -1902,6 +1902,7 @@ enum cgroup2_param { >> Opt_favordynmods, >> Opt_memory_localevents, >> Opt_memory_recursiveprot, >> + Opt_memory_hugetlb_accounting, >> nr__cgroup2_params >> }; >> >> @@ -1910,6 +1911,7 @@ static const struct fs_parameter_spec cgroup2_fs_parameters[] = { >> fsparam_flag("favordynmods", Opt_favordynmods), >> fsparam_flag("memory_localevents", Opt_memory_localevents), >> fsparam_flag("memory_recursiveprot", Opt_memory_recursiveprot), >> + fsparam_flag("memory_hugetlb_accounting", Opt_memory_hugetlb_accounting), >> {} >> }; >> >> @@ -1936,6 +1938,9 @@ static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param >> case Opt_memory_recursiveprot: >> ctx->flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT; >> return 0; >> + case Opt_memory_hugetlb_accounting: >> + ctx->flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING; >> + return 0; >> } >> return -EINVAL; >> } >> @@ -1960,6 +1965,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags) >> cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT; >> else >> cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_RECURSIVE_PROT; >> + >> + if (root_flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING) >> + cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING; >> + else >> + cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING; >> } >> } >> >> @@ -1973,6 +1983,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root >> seq_puts(seq, ",memory_localevents"); >> if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT) >> seq_puts(seq, ",memory_recursiveprot"); >> + if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING) >> + seq_puts(seq, ",memory_hugetlb_accounting"); >> return 0; >> } >> >> @@ -7050,7 +7062,8 @@ static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr, >> "nsdelegate\n" >> "favordynmods\n" >> "memory_localevents\n" >> - "memory_recursiveprot\n"); >> + "memory_recursiveprot\n" >> + "memory_hugetlb_accounting\n"); >> } >> static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features); >> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c >> index de220e3ff8be..ff88ea4df11a 100644 >> --- a/mm/hugetlb.c >> +++ b/mm/hugetlb.c >> @@ -1902,6 +1902,7 @@ void free_huge_folio(struct folio *folio) >> pages_per_huge_page(h), folio); >> hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h), >> pages_per_huge_page(h), folio); >> + mem_cgroup_uncharge(folio); >> if (restore_reserve) >> h->resv_huge_pages++; >> >> @@ -3004,7 +3005,8 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list) >> } >> >> struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, >> - unsigned long addr, int avoid_reserve) >> + unsigned long addr, int avoid_reserve, >> + bool restore_reserve_on_memcg_failure) >> { >> struct hugepage_subpool *spool = subpool_vma(vma); >> struct hstate *h = hstate_vma(vma); >> @@ -3119,6 +3121,15 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, >> hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h), >> pages_per_huge_page(h), folio); >> } >> + >> + /* undo allocation if memory controller disallows it. */ >> + if (mem_cgroup_hugetlb_charge_folio(folio, GFP_KERNEL)) { >> + if (restore_reserve_on_memcg_failure) >> + restore_reserve_on_error(h, vma, addr, folio); >> + folio_put(folio); >> + return ERR_PTR(-ENOMEM); >> + } >> + >> return folio; >> >> out_uncharge_cgroup: >> @@ -5179,7 +5190,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, >> spin_unlock(src_ptl); >> spin_unlock(dst_ptl); >> /* Do not use reserve as it's private owned */ >> - new_folio = alloc_hugetlb_folio(dst_vma, addr, 1); >> + new_folio = alloc_hugetlb_folio(dst_vma, addr, 1, false); >> if (IS_ERR(new_folio)) { >> folio_put(pte_folio); >> ret = PTR_ERR(new_folio); >> @@ -5656,7 +5667,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma, >> * be acquired again before returning to the caller, as expected. >> */ >> spin_unlock(ptl); >> - new_folio = alloc_hugetlb_folio(vma, haddr, outside_reserve); >> + new_folio = alloc_hugetlb_folio(vma, haddr, outside_reserve, true); >> >> if (IS_ERR(new_folio)) { >> /* >> @@ -5930,7 +5941,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, >> VM_UFFD_MISSING); >> } >> >> - folio = alloc_hugetlb_folio(vma, haddr, 0); >> + folio = alloc_hugetlb_folio(vma, haddr, 0, true); >> if (IS_ERR(folio)) { >> /* >> * Returning error will result in faulting task being >> @@ -6352,7 +6363,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte, >> goto out; >> } >> >> - folio = alloc_hugetlb_folio(dst_vma, dst_addr, 0); >> + folio = alloc_hugetlb_folio(dst_vma, dst_addr, 0, true); >> if (IS_ERR(folio)) { >> ret = -ENOMEM; >> goto out; >> @@ -6394,7 +6405,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte, >> goto out; >> } >> >> - folio = alloc_hugetlb_folio(dst_vma, dst_addr, 0); >> + folio = alloc_hugetlb_folio(dst_vma, dst_addr, 0, false); >> if (IS_ERR(folio)) { >> folio_put(*foliop); >> ret = -ENOMEM; >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >> index d1a322a75172..d5dfc9b36acb 100644 >> --- a/mm/memcontrol.c >> +++ b/mm/memcontrol.c >> @@ -7050,6 +7050,47 @@ int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp) >> return ret; >> } >> >> +static struct mem_cgroup *get_mem_cgroup_from_current(void) >> +{ >> + struct mem_cgroup *memcg; >> + >> +again: >> + rcu_read_lock(); >> + memcg = mem_cgroup_from_task(current); >> + if (!css_tryget(&memcg->css)) { >> + rcu_read_unlock(); >> + goto again; >> + } >> + rcu_read_unlock(); >> + return memcg; >> +} >> + >> +/** >> + * mem_cgroup_hugetlb_charge_folio - Charge a newly allocated hugetlb folio. >> + * @folio: folio to charge. >> + * @gfp: reclaim mode >> + * >> + * This function charges an allocated hugetlbf folio to the memcg of the >> + * current task. >> + * >> + * Returns 0 on success. Otherwise, an error code is returned. >> + */ >> +int mem_cgroup_hugetlb_charge_folio(struct folio *folio, gfp_t gfp) >> +{ >> + struct mem_cgroup *memcg; >> + int ret; >> + >> + if (mem_cgroup_disabled() || >> + !(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)) >> + return 0; >> + >> + memcg = get_mem_cgroup_from_current(); >> + ret = charge_memcg(folio, memcg, gfp); >> + mem_cgroup_put(memcg); >> + >> + return ret; >> +} >> + >> /** >> * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin. >> * @folio: folio to charge. >> -- >> 2.34.1 >> > > With the mount option added, I'm fine with this. There are reasons to want and reasons not to want this, so everybody's happy! And the default is no accounting, so this should be safe on impact! > > Out of curiosity: is anyone aware of any code that may behave badly when folio_memcg(hugetlb_folio) != NULL, not expecting it? Good point. My understanding of the memory controller mechanism is that it should be fine - we're just essentially storing some memcg metadata in the struct folio, and then charging values towards the memcg counters. I don't think we fiddle with anything else in the folio itself that could be ruinous? I also did my best to trace the code paths that go through alloc_hugetlb_folio and free_huge_folio (the places where charging and uncharging happens) to make sure no funny business is going on, and it seems a lot of these paths have special, dedicated handling for hugetlb folio. The usual pattern is checking if the folio is a hugetlb one first, so we're unlikely to even call folio_memcg on a hugetlb folio in existing code in the first place. But if anyone knows something I missed please let me know! And feel free to loop more people in if there's anyone I miss in the cc list :) > > - Frank
<snip> > > + > +/** > + * mem_cgroup_hugetlb_charge_folio - Charge a newly allocated hugetlb folio. > + * @folio: folio to charge. > + * @gfp: reclaim mode > + * > + * This function charges an allocated hugetlbf folio to the memcg of the > + * current task. > + * > + * Returns 0 on success. Otherwise, an error code is returned. > + */ > +int mem_cgroup_hugetlb_charge_folio(struct folio *folio, gfp_t gfp) > +{ > + struct mem_cgroup *memcg; > + int ret; > + > + if (mem_cgroup_disabled() || > + !(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)) What happens if the memory controller is mounted in a cgroup v1 hierarchy? It appears to me that we *will* go through with hugetlb charging in this case? > > + return 0; > + > + memcg = get_mem_cgroup_from_current(); > + ret = charge_memcg(folio, memcg, gfp); > + mem_cgroup_put(memcg); > + > + return ret; > +} > + > /** > * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin. > * @folio: folio to charge. > -- > 2.34.1
On Thu, Sep 28, 2023 at 5:38 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > <snip> > > > > > + > > +/** > > + * mem_cgroup_hugetlb_charge_folio - Charge a newly allocated hugetlb folio. > > + * @folio: folio to charge. > > + * @gfp: reclaim mode > > + * > > + * This function charges an allocated hugetlbf folio to the memcg of the > > + * current task. > > + * > > + * Returns 0 on success. Otherwise, an error code is returned. > > + */ > > +int mem_cgroup_hugetlb_charge_folio(struct folio *folio, gfp_t gfp) > > +{ > > + struct mem_cgroup *memcg; > > + int ret; > > + > > + if (mem_cgroup_disabled() || > > + !(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)) > > What happens if the memory controller is mounted in a cgroup v1 > hierarchy? It appears to me that we *will* go through with hugetlb > charging in this case? Ah right, cgroup v1. Does it not work with mount flag guarding? What's the behavior of cgroup v1 when it comes to memory recursive protection for e.g (which this mount flag is based on)? If it doesn't work, we'll have to add a separate knob for v1 - no biggies. Other than this concern, I don't have anything against cgroup v1 having this feature per se - everything should still work. But let I know if it can break cgroupv1 accounting otherwise :) > > > > > + return 0; > > + > > + memcg = get_mem_cgroup_from_current(); > > + ret = charge_memcg(folio, memcg, gfp); > > + mem_cgroup_put(memcg); > > + > > + return ret; > > +} > > + > > /** > > * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin. > > * @folio: folio to charge. > > -- > > 2.34.1
On Thu, Sep 28, 2023 at 5:58 PM Nhat Pham <nphamcs@gmail.com> wrote: > > On Thu, Sep 28, 2023 at 5:38 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > <snip> > > > > > > > > + > > > +/** > > > + * mem_cgroup_hugetlb_charge_folio - Charge a newly allocated hugetlb folio. > > > + * @folio: folio to charge. > > > + * @gfp: reclaim mode > > > + * > > > + * This function charges an allocated hugetlbf folio to the memcg of the > > > + * current task. > > > + * > > > + * Returns 0 on success. Otherwise, an error code is returned. > > > + */ > > > +int mem_cgroup_hugetlb_charge_folio(struct folio *folio, gfp_t gfp) > > > +{ > > > + struct mem_cgroup *memcg; > > > + int ret; > > > + > > > + if (mem_cgroup_disabled() || > > > + !(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)) > > > > What happens if the memory controller is mounted in a cgroup v1 > > hierarchy? It appears to me that we *will* go through with hugetlb > > charging in this case? > > Ah right, cgroup v1. Does it not work with mount flag guarding? > What's the behavior of cgroup v1 when it comes to memory > recursive protection for e.g (which this mount flag is based on)? > > If it doesn't work, we'll have to add a separate knob for v1 - > no biggies. But to be clear, my intention is that we're not adding this feature to v1 (which, to my understanding, has been deprecated). If it's added by virtue of it sharing infrastructure with v2, then it's fine, but only if the mount option still works to guard against unintentional enablement (if not we'll also short-circuit v1, or add knobs if ppl really want it in v1 as well). If it's not added at all, then I don't have any complaints :) > > Other than this concern, I don't have anything against cgroup v1 > having this feature per se - everything should still work. But let > I know if it can break cgroupv1 accounting otherwise :) > > > > > > > > > + return 0; > > > + > > > + memcg = get_mem_cgroup_from_current(); > > > + ret = charge_memcg(folio, memcg, gfp); > > > + mem_cgroup_put(memcg); > > > + > > > + return ret; > > > +} > > > + > > > /** > > > * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin. > > > * @folio: folio to charge. > > > -- > > > 2.34.1
On Thu, Sep 28, 2023 at 6:07 PM Nhat Pham <nphamcs@gmail.com> wrote: > > On Thu, Sep 28, 2023 at 5:58 PM Nhat Pham <nphamcs@gmail.com> wrote: > > > > On Thu, Sep 28, 2023 at 5:38 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > <snip> > > > > > > > > > > > + > > > > +/** > > > > + * mem_cgroup_hugetlb_charge_folio - Charge a newly allocated hugetlb folio. > > > > + * @folio: folio to charge. > > > > + * @gfp: reclaim mode > > > > + * > > > > + * This function charges an allocated hugetlbf folio to the memcg of the > > > > + * current task. > > > > + * > > > > + * Returns 0 on success. Otherwise, an error code is returned. > > > > + */ > > > > +int mem_cgroup_hugetlb_charge_folio(struct folio *folio, gfp_t gfp) > > > > +{ > > > > + struct mem_cgroup *memcg; > > > > + int ret; > > > > + > > > > + if (mem_cgroup_disabled() || > > > > + !(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)) > > > > > > What happens if the memory controller is mounted in a cgroup v1 > > > hierarchy? It appears to me that we *will* go through with hugetlb > > > charging in this case? > > > > Ah right, cgroup v1. Does it not work with mount flag guarding? > > What's the behavior of cgroup v1 when it comes to memory > > recursive protection for e.g (which this mount flag is based on)? > > > > If it doesn't work, we'll have to add a separate knob for v1 - > > no biggies. > > But to be clear, my intention is that we're not adding this > feature to v1 (which, to my understanding, has been > deprecated). > > If it's added by virtue of it sharing infrastructure with v2, > then it's fine, but only if the mount option still works to > guard against unintentional enablement (if not we'll > also short-circuit v1, or add knobs if ppl really want > it in v1 as well). > > If it's not added at all, then I don't have any complaints :) > > > > > Other than this concern, I don't have anything against cgroup v1 > > having this feature per se - everything should still work. But let > > I know if it can break cgroupv1 accounting otherwise :) > > My concern is the scenario where the memory controller is mounted in cgroup v1, and cgroup v2 is mounted with memory_hugetlb_accounting. In this case it seems like the current code will only check whether memory_hugetlb_accounting was set on cgroup v2 or not, disregarding the fact that cgroup v1 did not enable hugetlb accounting. I obviously prefer that any features are also added to cgroup v1, because we still didn't make it to cgroup v2, especially when the infrastructure is shared. On the other hand, I am pretty sure the maintainers will not like what I am saying :)
On Thu, Sep 28, 2023 at 06:18:19PM -0700, Yosry Ahmed wrote: > My concern is the scenario where the memory controller is mounted in > cgroup v1, and cgroup v2 is mounted with memory_hugetlb_accounting. > > In this case it seems like the current code will only check whether > memory_hugetlb_accounting was set on cgroup v2 or not, disregarding > the fact that cgroup v1 did not enable hugetlb accounting. > > I obviously prefer that any features are also added to cgroup v1, > because we still didn't make it to cgroup v2, especially when the > infrastructure is shared. On the other hand, I am pretty sure the > maintainers will not like what I am saying :) I have a weak preference. It's definitely a little weird that the v1 controller's behavior changes based on the v2 mount flag. And that if you want it as an otherwise exclusive v1 user, you'd have to mount a dummy v2. But I also don't see a scenario where it would hurt, or where there would be an unresolvable conflict between v1 and v2 in expressing desired behavior, since the memory controller is exclusive to one. While we could eliminate this quirk with a simple !cgroup_subsys_on_dfl(memory_cgrp_subsys) inside the charge function, it would seem almost punitive to add extra code just to take something away that isn't really a problem and could be useful to some people. If Tejun doesn't object, I'd say let's just keep implied v1 behavior.
On Fri, Sep 29, 2023 at 08:11:54AM -0700, Yosry Ahmed wrote: > On Fri, Sep 29, 2023 at 8:08 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Thu, Sep 28, 2023 at 06:18:19PM -0700, Yosry Ahmed wrote: > > > My concern is the scenario where the memory controller is mounted in > > > cgroup v1, and cgroup v2 is mounted with memory_hugetlb_accounting. > > > > > > In this case it seems like the current code will only check whether > > > memory_hugetlb_accounting was set on cgroup v2 or not, disregarding > > > the fact that cgroup v1 did not enable hugetlb accounting. > > > > > > I obviously prefer that any features are also added to cgroup v1, > > > because we still didn't make it to cgroup v2, especially when the > > > infrastructure is shared. On the other hand, I am pretty sure the > > > maintainers will not like what I am saying :) > > > > I have a weak preference. > > > > It's definitely a little weird that the v1 controller's behavior > > changes based on the v2 mount flag. And that if you want it as an > > otherwise exclusive v1 user, you'd have to mount a dummy v2. > > > > But I also don't see a scenario where it would hurt, or where there > > would be an unresolvable conflict between v1 and v2 in expressing > > desired behavior, since the memory controller is exclusive to one. > > > > While we could eliminate this quirk with a simple > > !cgroup_subsys_on_dfl(memory_cgrp_subsys) inside the charge function, > > it would seem almost punitive to add extra code just to take something > > away that isn't really a problem and could be useful to some people. > > > > If Tejun doesn't object, I'd say let's just keep implied v1 behavior. > > I agree that adding extra code to take a feature away from v1 is > probably too much, but I also think relying on a v2 mount option is > weird. Would it be too much to just have a v1-specific flag as well > and use cgroup_subsys_on_dfl(memory_cgrp_subsys) to decide which flag > to read? Yeah, let's not preemptively add explicit new features to cgroup1. Since we agree the incidental support is weird, let's filter hugetlb charging on cgroup_subsys_on_dfl(memory_cgrp_subsys) after all. If somebody wants this for v1 - and it doesn't sound like Google is even in that category according to Frank - they should send a separate patch and we can go through all the reasons why switching to v2 is not an option for them.
On Fri 29-09-23 13:42:21, Johannes Weiner wrote: > On Fri, Sep 29, 2023 at 08:11:54AM -0700, Yosry Ahmed wrote: > > On Fri, Sep 29, 2023 at 8:08 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > On Thu, Sep 28, 2023 at 06:18:19PM -0700, Yosry Ahmed wrote: > > > > My concern is the scenario where the memory controller is mounted in > > > > cgroup v1, and cgroup v2 is mounted with memory_hugetlb_accounting. > > > > > > > > In this case it seems like the current code will only check whether > > > > memory_hugetlb_accounting was set on cgroup v2 or not, disregarding > > > > the fact that cgroup v1 did not enable hugetlb accounting. > > > > > > > > I obviously prefer that any features are also added to cgroup v1, > > > > because we still didn't make it to cgroup v2, especially when the > > > > infrastructure is shared. On the other hand, I am pretty sure the > > > > maintainers will not like what I am saying :) > > > > > > I have a weak preference. > > > > > > It's definitely a little weird that the v1 controller's behavior > > > changes based on the v2 mount flag. And that if you want it as an > > > otherwise exclusive v1 user, you'd have to mount a dummy v2. > > > > > > But I also don't see a scenario where it would hurt, or where there > > > would be an unresolvable conflict between v1 and v2 in expressing > > > desired behavior, since the memory controller is exclusive to one. > > > > > > While we could eliminate this quirk with a simple > > > !cgroup_subsys_on_dfl(memory_cgrp_subsys) inside the charge function, > > > it would seem almost punitive to add extra code just to take something > > > away that isn't really a problem and could be useful to some people. > > > > > > If Tejun doesn't object, I'd say let's just keep implied v1 behavior. > > > > I agree that adding extra code to take a feature away from v1 is > > probably too much, but I also think relying on a v2 mount option is > > weird. Would it be too much to just have a v1-specific flag as well > > and use cgroup_subsys_on_dfl(memory_cgrp_subsys) to decide which flag > > to read? > > Yeah, let's not preemptively add explicit new features to cgroup1. > > Since we agree the incidental support is weird, let's filter hugetlb > charging on cgroup_subsys_on_dfl(memory_cgrp_subsys) after all. Agreed. It would be a bad idea to have an implicit behavior change based on v2 mounting options. And I really do not think we want to add this feature to v1. I am not supper thrilled about enabling this for v2 to be completely honest but I do see a demand so I will not object to that.
On Wed 27-09-23 17:57:22, Nhat Pham wrote: > Currently, hugetlb memory usage is not acounted for in the memory > controller, which could lead to memory overprotection for cgroups with > hugetlb-backed memory. This has been observed in our production system. > > This patch rectifies this issue by charging the memcg when the hugetlb > folio is allocated, and uncharging when the folio is freed (analogous to > the hugetlb controller). This changelog is missing a lot of information. Both about the usecase (we do not want to fish that out from archives in the future) and the actual implementation and the reasoning behind that. AFAICS you have decided to charge on the hugetlb use rather than hugetlb allocation to the pool. I suspect the underlying reasoning is that pool pages do not belong to anybody. This is a deliberate decision and it should be documented as such. It is also very important do describe subtle behavior properties that might be rather unintuitive to users. Most notably - there is no hugetlb pool management involved in the memcg controller. One has to use hugetlb controller for that purpose. Also the pre allocated pool as such doesn't belong to anybody so the memcg host overcommit management has to consider it when configuring hard limits. - memcg limit reclaim doesn't assist hugetlb pages allocation when hugetlb overcommit is configured (i.e. pages are not consumed from the pool) which means that the page allocation might disrupt workloads from other memcgs. - failure to charge a hugetlb page results in SIGBUS rather than memcg oom killer. That could be the case even if the hugetlb pool still has pages available and there is reclaimable memory in the memcg. - hugetlb pages are contributing to memory reclaim protection implicitly. This means that the low,min limits tunning has to consider hugetlb memory as well. I suspect there is more than the above. To be completely honest I am still not convinced this is a good idea. I do recognize that this might work in a very limited environments but hugetlb management is quite challenging on its own and this just adds another layer of complexity which is really hard to see through without an intimate understanding of both memcg and hugetlb. The reason that hugetlb has been living outside of the core MM (and memcg) is not just because we like it that way. And yes I do fully understand that users shouldn't really care about that because this is just a memory to them. We should also consider the global control for this functionality. I am especially worried about setups where a mixed bag of workloads (containers) is executed. While some of them will be ready for the new accounting mode many will leave in their own world without ever being modified. How do we deal with that situation? All that being said, I am not going to ack nor nack this but I really do prefer to be much more explicit about the motivation and current implementation specifics so that we can forward users to something they can digest. > Signed-off-by: Nhat Pham <nphamcs@gmail.com> [...] a minor implementation detail below. I couldn't spot anything obviously broken with the rest of the hugetlb specific code. restore_reserve_on_memcg_failure is rather clumsy and potentially error prone but I will leave that out to Mike as he is much more familiar with that behavior than me. > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index de220e3ff8be..ff88ea4df11a 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c [...] > @@ -3119,6 +3121,15 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, > hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h), > pages_per_huge_page(h), folio); > } > + > + /* undo allocation if memory controller disallows it. */ > + if (mem_cgroup_hugetlb_charge_folio(folio, GFP_KERNEL)) { htlb_alloc_mask(h) rather than GFP_KERNEL. Ideally with __GFP_RETRY_MAYFAIL which is a default allocation policy. > + if (restore_reserve_on_memcg_failure) > + restore_reserve_on_error(h, vma, addr, folio); > + folio_put(folio); > + return ERR_PTR(-ENOMEM); > + } > + > return folio; > > out_uncharge_cgroup:
On Mon, Oct 02, 2023 at 03:43:19PM +0200, Michal Hocko wrote: > On Wed 27-09-23 17:57:22, Nhat Pham wrote: > > Currently, hugetlb memory usage is not acounted for in the memory > > controller, which could lead to memory overprotection for cgroups with > > hugetlb-backed memory. This has been observed in our production system. > > > > This patch rectifies this issue by charging the memcg when the hugetlb > > folio is allocated, and uncharging when the folio is freed (analogous to > > the hugetlb controller). > > This changelog is missing a lot of information. Both about the usecase > (we do not want to fish that out from archives in the future) and the > actual implementation and the reasoning behind that. > > AFAICS you have decided to charge on the hugetlb use rather than hugetlb > allocation to the pool. I suspect the underlying reasoning is that pool > pages do not belong to anybody. This is a deliberate decision and it > should be documented as such. > > It is also very important do describe subtle behavior properties that > might be rather unintuitive to users. Most notably > - there is no hugetlb pool management involved in the memcg > controller. One has to use hugetlb controller for that purpose. > Also the pre allocated pool as such doesn't belong to anybody so the > memcg host overcommit management has to consider it when configuring > hard limits. +1 > - memcg limit reclaim doesn't assist hugetlb pages allocation when > hugetlb overcommit is configured (i.e. pages are not consumed from the > pool) which means that the page allocation might disrupt workloads > from other memcgs. > - failure to charge a hugetlb page results in SIGBUS rather > than memcg oom killer. That could be the case even if the > hugetlb pool still has pages available and there is > reclaimable memory in the memcg. Are these actually true? AFAICS, regardless of whether the page comes from the pool or the buddy allocator, the memcg code will go through the regular charge path, attempt reclaim, and OOM if that fails.
On Mon 02-10-23 10:50:26, Johannes Weiner wrote: > On Mon, Oct 02, 2023 at 03:43:19PM +0200, Michal Hocko wrote: > > On Wed 27-09-23 17:57:22, Nhat Pham wrote: [...] > > - memcg limit reclaim doesn't assist hugetlb pages allocation when > > hugetlb overcommit is configured (i.e. pages are not consumed from the > > pool) which means that the page allocation might disrupt workloads > > from other memcgs. > > - failure to charge a hugetlb page results in SIGBUS rather > > than memcg oom killer. That could be the case even if the > > hugetlb pool still has pages available and there is > > reclaimable memory in the memcg. > > Are these actually true? AFAICS, regardless of whether the page comes > from the pool or the buddy allocator, the memcg code will go through > the regular charge path, attempt reclaim, and OOM if that fails. OK, I should have been more explicit. Let me expand. Charges are accounted only _after_ the actual allocation is done. So the actual allocation is not constrained by the memcg context. It might reclaim from the memcg at that time but the disruption could have already happened. Not really any different from regular memory allocation attempt but much more visible with GB pages and one could reasonably expect that memcg should stop such a GB allocation if the local reclaim would be hopeless to free up enough from its own consumption. Makes more sense? With the later point I meant to say that the memcg OOM killer will not communicate the hugetlb request failure so the usual SIGBUS will be returned to the userspace. I can imagine a SIGBUS handler could check hugetlb availability to retry or something similar.
On Mon, Oct 02, 2023 at 05:08:34PM +0200, Michal Hocko wrote: > On Mon 02-10-23 10:50:26, Johannes Weiner wrote: > > On Mon, Oct 02, 2023 at 03:43:19PM +0200, Michal Hocko wrote: > > > On Wed 27-09-23 17:57:22, Nhat Pham wrote: > [...] > > > - memcg limit reclaim doesn't assist hugetlb pages allocation when > > > hugetlb overcommit is configured (i.e. pages are not consumed from the > > > pool) which means that the page allocation might disrupt workloads > > > from other memcgs. > > > - failure to charge a hugetlb page results in SIGBUS rather > > > than memcg oom killer. That could be the case even if the > > > hugetlb pool still has pages available and there is > > > reclaimable memory in the memcg. > > > > Are these actually true? AFAICS, regardless of whether the page comes > > from the pool or the buddy allocator, the memcg code will go through > > the regular charge path, attempt reclaim, and OOM if that fails. > > OK, I should have been more explicit. Let me expand. Charges are > accounted only _after_ the actual allocation is done. So the actual > allocation is not constrained by the memcg context. It might reclaim > from the memcg at that time but the disruption could have already > happened. Not really any different from regular memory allocation > attempt but much more visible with GB pages and one could reasonably > expect that memcg should stop such a GB allocation if the local reclaim > would be hopeless to free up enough from its own consumption. > > Makes more sense? Yes, that makes sense. This should be fairly easy to address by having hugetlb do the split transaction that charge_memcg() does in one go, similar to what we do for the hugetlb controller as well. IOW, alloc_hugetlb_folio() { if (mem_cgroup_hugetlb_try_charge()) return ERR_PTR(-ENOMEM); folio = dequeue(); if (!folio) { folio = alloc_buddy(); if (!folio) goto uncharge; } mem_cgroup_hugetlb_commit_charge(); }
On Mon, Oct 02, 2023 at 03:43:19PM +0200, Michal Hocko wrote: > We should also consider the global control for this functionality. I am > especially worried about setups where a mixed bag of workloads > (containers) is executed. While some of them will be ready for the new > accounting mode many will leave in their own world without ever being > modified. How do we deal with that situation? It's possible to add more localized control on top of the global flag should this come up. But this seems like a new and honestly pretty hypothetical usecase, given the host-level coordination already involved in real-world hugetlb setups. The same could be said about other mount options, such as nsdelegate, memory_localevents, and memory_recursiveprot. Those you'd expect to have a much broader audience, and nobody has asked for mixed use. Let's cross this bridge not when but if we have to.
On Mon, Oct 2, 2023 at 6:43 AM Michal Hocko <mhocko@suse.com> wrote: > > On Wed 27-09-23 17:57:22, Nhat Pham wrote: > > Currently, hugetlb memory usage is not acounted for in the memory > > controller, which could lead to memory overprotection for cgroups with > > hugetlb-backed memory. This has been observed in our production system. > > > > This patch rectifies this issue by charging the memcg when the hugetlb > > folio is allocated, and uncharging when the folio is freed (analogous to > > the hugetlb controller). > > This changelog is missing a lot of information. Both about the usecase > (we do not want to fish that out from archives in the future) and the > actual implementation and the reasoning behind that. > > AFAICS you have decided to charge on the hugetlb use rather than hugetlb > allocation to the pool. I suspect the underlying reasoning is that pool > pages do not belong to anybody. This is a deliberate decision and it > should be documented as such. Yep that was the intention behind placing the charging of the hugetlb folio in alloc_hugetlb_folio(). I'll document this in the changelog and/or code. > > It is also very important do describe subtle behavior properties that > might be rather unintuitive to users. Most notably If you don't mind, I'll summarize these into the next version of the patch's changelog :) > - there is no hugetlb pool management involved in the memcg > controller. One has to use hugetlb controller for that purpose. > Also the pre allocated pool as such doesn't belong to anybody so the > memcg host overcommit management has to consider it when configuring > hard limits. > - memcg limit reclaim doesn't assist hugetlb pages allocation when > hugetlb overcommit is configured (i.e. pages are not consumed from the > pool) which means that the page allocation might disrupt workloads > from other memcgs. > - failure to charge a hugetlb page results in SIGBUS rather > than memcg oom killer. That could be the case even if the > hugetlb pool still has pages available and there is > reclaimable memory in the memcg. Ah yes that should be documented indeed. > - hugetlb pages are contributing to memory reclaim protection > implicitly. This means that the low,min limits tunning has to consider > hugetlb memory as well. This was the original inspiration for this change. I'll expand on it in the new version's changelog. > > I suspect there is more than the above. To be completely honest I am > still not convinced this is a good idea. > > I do recognize that this might work in a very limited environments but > hugetlb management is quite challenging on its own and this just adds > another layer of complexity which is really hard to see through without > an intimate understanding of both memcg and hugetlb. The reason that > hugetlb has been living outside of the core MM (and memcg) is not just > because we like it that way. And yes I do fully understand that users > shouldn't really care about that because this is just a memory to them. > > We should also consider the global control for this functionality. I am > especially worried about setups where a mixed bag of workloads > (containers) is executed. While some of them will be ready for the new > accounting mode many will leave in their own world without ever being > modified. How do we deal with that situation? Johannes already responded to this, but I also think this hypothetical situation isn't super urgent to handle right now. That said, we can always revisit it if/when it proves to be an issue and add appropriate memcg-specific control for this feature as a follow-up. > > All that being said, I am not going to ack nor nack this but I really do > prefer to be much more explicit about the motivation and current > implementation specifics so that we can forward users to something > they can digest. > > > Signed-off-by: Nhat Pham <nphamcs@gmail.com> > [...] > > a minor implementation detail below. I couldn't spot anything obviously > broken with the rest of the hugetlb specific code. restore_reserve_on_memcg_failure > is rather clumsy and potentially error prone but I will leave that out > to Mike as he is much more familiar with that behavior than me. That part irks me too, but I'm trying to follow the error handling logic that follows each alloc_hugetlb_folio() call site. If anyone has any suggestions, I'd be happy to listen! > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index de220e3ff8be..ff88ea4df11a 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > [...] > > @@ -3119,6 +3121,15 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, > > hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h), > > pages_per_huge_page(h), folio); > > } > > + > > + /* undo allocation if memory controller disallows it. */ > > + if (mem_cgroup_hugetlb_charge_folio(folio, GFP_KERNEL)) { > > htlb_alloc_mask(h) rather than GFP_KERNEL. Ideally with > __GFP_RETRY_MAYFAIL which is a default allocation policy. Oh I wasn't aware of htlb_alloc_mask(h). So I'll fix this to: htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL > > > + if (restore_reserve_on_memcg_failure) > > + restore_reserve_on_error(h, vma, addr, folio); > > + folio_put(folio); > > + return ERR_PTR(-ENOMEM); > > + } > > + > > return folio; > > > > out_uncharge_cgroup: > > -- > Michal Hocko > SUSE Labs
On Mon 02-10-23 11:25:55, Johannes Weiner wrote: > On Mon, Oct 02, 2023 at 05:08:34PM +0200, Michal Hocko wrote: > > On Mon 02-10-23 10:50:26, Johannes Weiner wrote: > > > On Mon, Oct 02, 2023 at 03:43:19PM +0200, Michal Hocko wrote: > > > > On Wed 27-09-23 17:57:22, Nhat Pham wrote: > > [...] > > > > - memcg limit reclaim doesn't assist hugetlb pages allocation when > > > > hugetlb overcommit is configured (i.e. pages are not consumed from the > > > > pool) which means that the page allocation might disrupt workloads > > > > from other memcgs. > > > > - failure to charge a hugetlb page results in SIGBUS rather > > > > than memcg oom killer. That could be the case even if the > > > > hugetlb pool still has pages available and there is > > > > reclaimable memory in the memcg. > > > > > > Are these actually true? AFAICS, regardless of whether the page comes > > > from the pool or the buddy allocator, the memcg code will go through > > > the regular charge path, attempt reclaim, and OOM if that fails. > > > > OK, I should have been more explicit. Let me expand. Charges are > > accounted only _after_ the actual allocation is done. So the actual > > allocation is not constrained by the memcg context. It might reclaim > > from the memcg at that time but the disruption could have already > > happened. Not really any different from regular memory allocation > > attempt but much more visible with GB pages and one could reasonably > > expect that memcg should stop such a GB allocation if the local reclaim > > would be hopeless to free up enough from its own consumption. > > > > Makes more sense? > > Yes, that makes sense. > > This should be fairly easy to address by having hugetlb do the split > transaction that charge_memcg() does in one go, similar to what we do > for the hugetlb controller as well. IOW, > > alloc_hugetlb_folio() > { > if (mem_cgroup_hugetlb_try_charge()) > return ERR_PTR(-ENOMEM); > > folio = dequeue(); > if (!folio) { > folio = alloc_buddy(); > if (!folio) > goto uncharge; > } > > mem_cgroup_hugetlb_commit_charge(); > } yes, this makes sense. I still suspect we will need a better charge reclaim tuning for GB pages as those are just too huge and a simple MAX_RECLAIM_RETRIES * GB worth of reclaim targets might be just overly aggressive.