[0/2] hugetlb memcg accounting

Message ID: 20230926194949.2637078-1-nphamcs@gmail.com
Series: hugetlb memcg accounting

Message

Nhat Pham Sept. 26, 2023, 7:49 p.m. UTC
Currently, hugetlb memory usage is not accounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory. This has been observed in our production system.

This patch series rectifies this issue by charging the memcg when the
hugetlb folio is allocated, and uncharging when the folio is freed. In
addition, a new selftest is added to demonstrate and verify this new
behavior.
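
To illustrate what the new accounting looks like from userspace, here is
a minimal sketch in the spirit of the selftest (this is not the actual
test - the cgroup path, page size and checks below are assumptions made
for illustration only):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE (2UL << 20)  /* assumes 2M hugetlb pages */

/* Read memory.current of the cgroup this program runs in. */
static long read_memory_current(const char *cgroup)
{
        char path[256], buf[64] = { 0 };
        long val = -1;
        int fd;

        snprintf(path, sizeof(path), "%s/memory.current", cgroup);
        fd = open(path, O_RDONLY);
        if (fd < 0)
                return -1;
        if (read(fd, buf, sizeof(buf) - 1) > 0)
                val = atol(buf);
        close(fd);
        return val;
}

int main(void)
{
        const char *cg = "/sys/fs/cgroup/test";  /* assumed, pre-created cgroup */
        long before = read_memory_current(cg);
        char *p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

        if (p == MAP_FAILED)
                return 1;
        memset(p, 0, HPAGE_SIZE);  /* fault the hugetlb page in */

        /* With this series the delta is ~2M; without it, hugetlb is invisible here. */
        printf("memory.current delta: %ld\n", read_memory_current(cg) - before);
        munmap(p, HPAGE_SIZE);
        return 0;
}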

Nhat Pham (2):
  hugetlb: memcg: account hugetlb-backed memory in memory controller
  selftests: add a selftest to verify hugetlb usage in memcg

 MAINTAINERS                                   |   2 +
 fs/hugetlbfs/inode.c                          |   2 +-
 include/linux/hugetlb.h                       |   6 +-
 include/linux/memcontrol.h                    |   8 +
 mm/hugetlb.c                                  |  23 +-
 mm/memcontrol.c                               |  40 ++++
 tools/testing/selftests/cgroup/.gitignore     |   1 +
 tools/testing/selftests/cgroup/Makefile       |   2 +
 .../selftests/cgroup/test_hugetlb_memcg.c     | 222 ++++++++++++++++++
 9 files changed, 297 insertions(+), 9 deletions(-)
 create mode 100644 tools/testing/selftests/cgroup/test_hugetlb_memcg.c
  

Comments

Frank van der Linden Sept. 26, 2023, 8:50 p.m. UTC | #1
On Tue, Sep 26, 2023 at 12:49 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> Currently, hugetlb memory usage is not accounted for in the memory
> controller, which could lead to memory overprotection for cgroups with
> hugetlb-backed memory. This has been observed in our production system.
>
> This patch series rectifies this issue by charging the memcg when the
> hugetlb folio is allocated, and uncharging when the folio is freed. In
> addition, a new selftest is added to demonstrate and verify this new
> behavior.
>
> Nhat Pham (2):
>   hugetlb: memcg: account hugetlb-backed memory in memory controller
>   selftests: add a selftest to verify hugetlb usage in memcg
>
> [...]
>

We've had this behavior at Google for a long time, and we're actually
getting rid of it. hugetlb pages are a precious resource that should
be accounted for separately. They are not just any memory; they are
physically contiguous memory, and charging them the same as any other
region of the same size ended up not making sense, especially not for
larger hugetlb page sizes.

Additionally, if this behavior is changed just like that, there will
be quite a few workloads that will break badly because they'll hit
their limits immediately - imagine a container that uses 1G hugetlb
pages to back something large (a database, a VM), and 'plain' memory
for control processes.

What do your workloads do? Is it not possible for you to account for
hugetlb pages separately? Sure, it can be annoying to have to deal
with 2 separate totals that you need to take into account, but again,
hugetlb pages are a resource that is best dealt with separately.

- Frank
  
Johannes Weiner Sept. 26, 2023, 10:14 p.m. UTC | #2
Hi Frank,

On Tue, Sep 26, 2023 at 01:50:10PM -0700, Frank van der Linden wrote:
> On Tue, Sep 26, 2023 at 12:49 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > Currently, hugetlb memory usage is not accounted for in the memory
> > controller, which could lead to memory overprotection for cgroups with
> > hugetlb-backed memory. This has been observed in our production system.
> >
> > This patch series rectifies this issue by charging the memcg when the
> > hugetlb folio is allocated, and uncharging when the folio is freed. In
> > addition, a new selftest is added to demonstrate and verify this new
> > behavior.
> >
> > Nhat Pham (2):
> >   hugetlb: memcg: account hugetlb-backed memory in memory controller
> >   selftests: add a selftest to verify hugetlb usage in memcg
> >
> > [...]
> 
> We've had this behavior at Google for a long time, and we're actually
> getting rid of it. hugetlb pages are a precious resource that should
> be accounted for separately. They are not just any memory, they are
> physically contiguous memory, charging them the same as any other
> region of the same size ended up not making sense, especially not for
> larger hugetlb page sizes.

I agree that on one hand they're a limited resource, and some form of
access control makes sense. There is the hugetlb cgroup controller
that allows for tracking and apportioning them per cgroup.

But on the other hand they're also still just host memory that a
cgroup can consume, which is the domain of memcg.

Those two aren't mutually exclusive. It makes sense to set a limit on
a cgroup's access to hugetlb. It also makes sense that the huge pages
a cgroup IS using count toward its memory limit, where they displace
file cache and anonymous pages under pressure. Or that they're
considered when determining the degree of protection from global pressure.

This isn't unlike e.g. kernel memory being special in that it consumes
lowmem and isn't reclaimable. This shows up in total memory, while it
was also tracked and limited separately. (Separate control disappeared
for lack of a good enforcement mechanism - but hugetlb has that.)

The fact that memory consumed by hugetlb is currently not considered
inside memcg (host memory accounting and control) is inconsistent. It
has been quite confusing to our service owners and complicating things
for our containers team.

For example, jobs need to describe their overall memory size in order
to match them to machines and co-locate them. Based on that parameter,
the container limits as well as the protection (memory.low) from global
pressure are set. Currently, there are ugly hacks in place to subtract
any hugetlb quota from the container config - otherwise the limits and
protection settings would be way too big, because a large part of the
host memory consumption wouldn't be covered by them. This has been
quite cumbersome and error-prone.

> Additionally, if this behavior is changed just like that, there will
> be quite a few workloads that will break badly because they'll hit
> their limits immediately - imagine a container that uses 1G hugetlb
> pages to back something large (a database, a VM), and 'plain' memory
> for control processes.

I agree with you there. This could break existing setups. We've added
new consumers to memcg in the past without thinking too hard about it,
but hugetlb often makes up a huge portion of a group's overall memory
footprint. And we *do* have those subtraction hacks in place that
would then fail in the other direction.

A cgroup mount flag makes sense for this to ease the transition.
  
Nhat Pham Sept. 26, 2023, 11:31 p.m. UTC | #3
On Tue, Sep 26, 2023 at 1:50 PM Frank van der Linden <fvdl@google.com> wrote:
>
> On Tue, Sep 26, 2023 at 12:49 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > Currently, hugetlb memory usage is not accounted for in the memory
> > controller, which could lead to memory overprotection for cgroups with
> > hugetlb-backed memory. This has been observed in our production system.
> >
> > This patch series rectifies this issue by charging the memcg when the
> > hugetlb folio is allocated, and uncharging when the folio is freed. In
> > addition, a new selftest is added to demonstrate and verify this new
> > behavior.
> >
> > Nhat Pham (2):
> >   hugetlb: memcg: account hugetlb-backed memory in memory controller
> >   selftests: add a selftest to verify hugetlb usage in memcg
> >
> > [...]
>
> We've had this behavior at Google for a long time, and we're actually
> getting rid of it. hugetlb pages are a precious resource that should
> be accounted for separately. They are not just any memory, they are
> physically contiguous memory, charging them the same as any other
> region of the same size ended up not making sense, especially not for
> larger hugetlb page sizes.

I agree hugetlb is a special kind of resource. But as Johannes
pointed out, it is still a form of memory. Semantically, its usage
should be modulated by the memory controller.

We do have the HugeTLB controller for hugetlb-specific
restriction, and where appropriate we definitely should take
advantage of it. But it does not fix the hole we have in memory
usage reporting, as well as (over)protection and reclaim dynamics.
Hence the need for the userspace hack (as Johannes described):
manually adding/subtracting HugeTLB usage where applicable.
This is not only inelegant, but also cumbersome and buggy.

>
> Additionally, if this behavior is changed just like that, there will
> be quite a few workloads that will break badly because they'll hit
> their limits immediately - imagine a container that uses 1G hugetlb
> pages to back something large (a database, a VM), and 'plain' memory
> for control processes.
>
> What do your workloads do? Is it not possible for you to account for
> hugetlb pages separately? Sure, it can be annoying to have to deal
> with 2 separate totals that you need to take into account, but again,
> hugetlb pages are a resource that is best dealt with separately.
>

Johannes beat me to it - he described our use case, and what we
have hacked together to temporarily get around the issue.

A knob/flag to turn on/off this behavior sounds good to me.

> - Frank
Thanks for the comments, Frank!
  
Michal Hocko Sept. 27, 2023, 11:21 a.m. UTC | #4
On Tue 26-09-23 12:49:47, Nhat Pham wrote:
> Currently, hugetlb memory usage is not accounted for in the memory
> controller, which could lead to memory overprotection for cgroups with
> hugetlb-backed memory. This has been observed in our production system.
> 
> This patch series rectifies this issue by charging the memcg when the
> hugetlb folio is allocated, and uncharging when the folio is freed. In
> addition, a new selftest is added to demonstrate and verify this new
> behavior.

The primary reason why hugetlb is living outside of memcg (and the core
MM as well) is that it doesn't really fit the whole scheme, in several
aspects. First and foremost, it is an independently managed resource
with its own pool management, use and lifetime.

There is no notion of memory reclaim, and that makes a huge difference
for a pool that might consume a considerable amount of memory. While
this is the case for many kernel allocations as well, those usually do
not consume considerable portions of the accounted memory. This makes
it really tricky to handle limit enforcement gracefully.

Another important aspect comes from the lifetime semantics: proper
reservation accounting and management needs to happen at mmap time
rather than on the usual allocation path. While pages are allocated
they do not belong to anybody, and only later, at #PF time (or at read
time for an fs-backed mapping), is the ownership established. That
makes it really hard to manage the memory as a whole under memcg,
because a large part of that pool sits without an owner, yet it cannot
be used for any other purpose.
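
To make the two stages concrete, here is a minimal userspace sketch
(assuming a pre-populated 2M pool via /proc/sys/vm/nr_hugepages; watch
HugePages_Rsvd and HugePages_Free in /proc/meminfo at each pause):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 2UL << 20;  /* one 2M hugetlb page */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        /* Stage 1: a page is reserved (HugePages_Rsvd) but owned by nobody. */
        puts("reserved - check /proc/meminfo, then press enter");
        getchar();

        memset(p, 0, len);
        /* Stage 2: the #PF took a page from the pool; ownership exists only now. */
        puts("faulted in - check /proc/meminfo, then press enter");
        getchar();

        munmap(p, len);
        return 0;
}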

These and more reasons were behind the earlier decision to have a
dedicated hugetlb controller.

I will also Nack hugetlb pages being accounted by default. This would
break any setups which mix normal and hugetlb memory with memcg limits
applied.
  
Michal Hocko Sept. 27, 2023, 12:50 p.m. UTC | #5
On Tue 26-09-23 18:14:14, Johannes Weiner wrote:
[...]
> The fact that memory consumed by hugetlb is currently not considered
> inside memcg (host memory accounting and control) is inconsistent. It
> has been quite confusing to our service owners and complicating things
> for our containers team.

I do understand how that is confusing and inconsistent as well. Hugetlb
has been bringing confusion throughout its existence, I am afraid.

As noted in my other reply, though, I am not sure the hugetlb pool can
be reasonably incorporated with sane semantics. Neither the regular
allocation nor the hugetlb reservation/actual use can fall back to the
pool of the other. This makes them two different things, each hitting
its own failure cases that require dedicated handling.

Just off the top of my head, these are cases I do not see an easy way
out from:
  - hugetlb charge failure has two failure modes - pool empty or memcg
    limit reached. The former is not recoverable and should fail without
    any further intervention; the latter might benefit from reclaiming.
  - !hugetlb memory charge failure cannot consider any hugetlb pages -
    they act as implicit memory.min protection, so it is impossible to
    manage reclaim protection without knowledge of the hugetlb use.
  - there is no way to control the hugetlb pool distribution by memcg
    limits. How do we distinguish reservations from actual use?
  - the pre-allocated pool consumes memory without any actual owner
    until it is actually used, and even that has two stages (reserved
    and really used). This makes it really hard to manage memory as a
    whole when there is a considerable amount of hugetlb memory
    preallocated.
I am pretty sure there are many more interesting cases.
  
Johannes Weiner Sept. 27, 2023, 4:44 p.m. UTC | #6
On Wed, Sep 27, 2023 at 02:50:10PM +0200, Michal Hocko wrote:
> On Tue 26-09-23 18:14:14, Johannes Weiner wrote:
> [...]
> > The fact that memory consumed by hugetlb is currently not considered
> > inside memcg (host memory accounting and control) is inconsistent. It
> > has been quite confusing to our service owners and complicating things
> > for our containers team.
> 
> I do understand how that is confusing and inconsistent as well. Hugetlb
> has been bringing confusion throughout its existence, I am afraid.
> 
> As noted in other reply though I am not sure hugetlb pool can be
> reasonably incorporated with a sane semantic. Neither of the regular
> allocation nor the hugetlb reservation/actual use can fallback to the
> pool of the other. This makes them 2 different things each hitting their
> own failure cases that require a dedicated handling.
> 
> Just from top of my head these are cases I do not see easy way out from:
> 	- hugetlb charge failure has two failure modes - pool empty
> 	  or memcg limit reached. The former is not recoverable and
> 	  should fail without any further intervention the latter might
> 	  benefit from reclaiming.
> 	- !hugetlb memory charge failure cannot consider any hugetlb
> 	  pages - they are implicit memory.min protection so it is
>           impossible to manage reclaim protection without having a
>           knowledge of the hugetlb use.
> 	- there is no way to control the hugetlb pool distribution by
> 	  memcg limits. How do we distinguish reservations from actual
> 	  use?
> 	- pre-allocated pool is consuming memory without any actual
> 	  owner until it is actually used and even that has two stages
> 	  (reserved and really used). This makes it really hard to
> 	  manage memory as whole when there is a considerable amount of
> 	  hugetlb memory preallocated.

It's important to distinguish hugetlb access policy from memory use
policy. This patch isn't about hugetlb access, it's about general
memory use.

Hugetlb access policy is a separate domain with separate
answers. Preallocating is a privileged operation, for access control
there is the hugetlb cgroup controller etc.

What's missing is that once you get past the access restrictions and
legitimately get your hands on huge pages, that memory use gets
reflected in memory.current and exerts pressure on *other* memory
inside the group, such as anon or optimistic cache allocations.

Note that hugetlb *can* be allocated on demand. It's unexpected that
when an application optimistically allocates a couple of 2M hugetlb
pages those aren't reflected in its memory.current. The same is true
for hugetlb_cma. If the gigantic pages aren't currently allocated to a
cgroup, that CMA memory can be used for movable memory elsewhere.

The points you and Frank raise are reasons and scenarios where
additional hugetlb access control is necessary - preallocation,
limited availability of 1G pages etc. But they're not reasons against
charging faulted in hugetlb to the memcg *as well*.

My point is we need both. One to manage competition over hugetlb,
because it has unique limitations. The other to manage competition
over host memory which hugetlb is a part of.

Here is a usecase from our fleet.

Imagine a configuration with two 32G containers. The machine is booted
with hugetlb_cma=6G, and each container may or may not use up to 3
gigantic page, depending on the workload within it. The rest is anon,
cache, slab, etc. You set the hugetlb cgroup limit of each cgroup to
3G to enforce hugetlb fairness. But how do you configure memory.max to
keep *overall* consumption, including anon, cache, slab etc. fair?

If used hugetlb is charged, you can just set memory.max=32G regardless
of the workload inside.

Without it, you'd have to constantly poll hugetlb usage and readjust
memory.max!
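
As a rough sketch of that setup (cgroup v2; the container names here are
made up, hugetlb.1GB.max and memory.max are the documented interface
files, and enabling the controllers in the parent's subtree_control is
omitted for brevity):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Write a value to a cgroup v2 interface file; returns 0 on success. */
static int write_knob(const char *cg, const char *file, const char *val)
{
        char path[256];
        int fd, ret;

        snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/%s", cg, file);
        fd = open(path, O_WRONLY);
        if (fd < 0)
                return -1;
        ret = write(fd, val, strlen(val)) < 0 ? -1 : 0;
        close(fd);
        return ret;
}

int main(void)
{
        const char *containers[] = { "ct1", "ct2" };  /* made-up names */

        for (int i = 0; i < 2; i++) {
                char path[256];

                snprintf(path, sizeof(path), "/sys/fs/cgroup/%s", containers[i]);
                mkdir(path, 0755);
                /* hugetlb fairness: at most 3 gigantic pages each */
                write_knob(containers[i], "hugetlb.1GB.max", "3G");
                /* overall memory allowance, hugetlb included once it is charged */
                write_knob(containers[i], "memory.max", "32G");
        }
        return 0;
}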
  
Nhat Pham Sept. 27, 2023, 5:22 p.m. UTC | #7
On Wed, Sep 27, 2023 at 9:44 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Sep 27, 2023 at 02:50:10PM +0200, Michal Hocko wrote:
> > On Tue 26-09-23 18:14:14, Johannes Weiner wrote:
> > [...]
> > > The fact that memory consumed by hugetlb is currently not considered
> > > inside memcg (host memory accounting and control) is inconsistent. It
> > > has been quite confusing to our service owners and complicating things
> > > for our containers team.
> >
> > I do understand how that is confusing and inconsistent as well. Hugetlb
> > has been bringing confusion throughout its existence, I am afraid.
> >
> > As noted in other reply though I am not sure hugetlb pool can be
> > reasonably incorporated with a sane semantic. Neither of the regular
> > allocation nor the hugetlb reservation/actual use can fallback to the
> > pool of the other. This makes them 2 different things each hitting their
> > own failure cases that require a dedicated handling.
> >
> > Just from top of my head these are cases I do not see easy way out from:
> >       - hugetlb charge failure has two failure modes - pool empty
> >         or memcg limit reached. The former is not recoverable and
> >         should fail without any further intervention the latter might
> >         benefit from reclaiming.
> >       - !hugetlb memory charge failure cannot consider any hugetlb
> >         pages - they are implicit memory.min protection so it is
> >           impossible to manage reclaim protection without having a
> >           knowledge of the hugetlb use.
> >       - there is no way to control the hugetlb pool distribution by
> >         memcg limits. How do we distinguish reservations from actual
> >         use?
> >       - pre-allocated pool is consuming memory without any actual
> >         owner until it is actually used and even that has two stages
> >         (reserved and really used). This makes it really hard to
> >         manage memory as whole when there is a considerable amount of
> >         hugetlb memory preallocated.
>
> It's important to distinguish hugetlb access policy from memory use
> policy. This patch isn't about hugetlb access, it's about general
> memory use.
>
> Hugetlb access policy is a separate domain with separate
> answers. Preallocating is a privileged operation, for access control
> there is the hugetlb cgroup controller etc.
>
> What's missing is that once you get past the access restrictions and
> legitimately get your hands on huge pages, that memory use gets
> reflected in memory.current and exerts pressure on *other* memory
> inside the group, such as anon or optimistic cache allocations.
>
> Note that hugetlb *can* be allocated on demand. It's unexpected that
> when an application optimistically allocates a couple of 2M hugetlb
> pages those aren't reflected in its memory.current. The same is true
> for hugetlb_cma. If the gigantic pages aren't currently allocated to a
> cgroup, that CMA memory can be used for movable memory elsewhere.
>
> The points you and Frank raise are reasons and scenarios where
> additional hugetlb access control is necessary - preallocation,
> limited availability of 1G pages etc. But they're not reasons against
> charging faulted in hugetlb to the memcg *as well*.
>
> My point is we need both. One to manage competition over hugetlb,
> because it has unique limitations. The other to manage competition
> over host memory which hugetlb is a part of.
>
> Here is a usecase from our fleet.
>
> Imagine a configuration with two 32G containers. The machine is booted
> with hugetlb_cma=6G, and each container may or may not use up to 3
> gigantic page, depending on the workload within it. The rest is anon,
> cache, slab, etc. You set the hugetlb cgroup limit of each cgroup to
> 3G to enforce hugetlb fairness. But how do you configure memory.max to
> keep *overall* consumption, including anon, cache, slab etc. fair?
>
> If used hugetlb is charged, you can just set memory.max=32G regardless
> of the workload inside.
>
> Without it, you'd have to constantly poll hugetlb usage and readjust
> memory.max!

Yep, and I'd like to add that this could and has caused issues in our
production system when there is a delay in correcting the memory limits
(low or max). The userspace agent in charge of correcting these only
runs periodically, and between consecutive runs the system could be in
an over- or underprotected state. An instantaneous charge towards the
memory controller would close this gap.

I think we need both the HugeTLB controller and memory controller
accounting.
  
Johannes Weiner Sept. 27, 2023, 6:47 p.m. UTC | #8
On Wed, Sep 27, 2023 at 01:21:20PM +0200, Michal Hocko wrote:
> On Tue 26-09-23 12:49:47, Nhat Pham wrote:
> > Currently, hugetlb memory usage is not accounted for in the memory
> > controller, which could lead to memory overprotection for cgroups with
> > hugetlb-backed memory. This has been observed in our production system.
> > 
> > This patch series rectifies this issue by charging the memcg when the
> > hugetlb folio is allocated, and uncharging when the folio is freed. In
> > addition, a new selftest is added to demonstrate and verify this new
> > behavior.
> 
> The primary reason why hugetlb is living outside of memcg (and the core
> MM as well) is that it doesn't really fit the whole scheme. In several
> aspects. First and foremost, it is an independently managed resource
> with its own pool management, use and lifetime.

Honestly, the simpler explanation is that few people have used hugetlb
in regular, containerized non-HPC workloads.

Hugetlb has historically been much more special, and it retains a
specialness that warrants e.g. the hugetlb cgroup controller. But it
has also made strides with hugetlb_cma, migratability, madvise support
etc. that allows much more on-demand use. It's no longer the case that
you just put a static pool of memory aside during boot and only a few
blessed applications are using it.

For example, we're using hugetlb_cma very broadly with generic
containers. The CMA region is fully usable by movable non-huge stuff
until huge pages are allocated in it. With the hugetlb controller you
can define a maximum number of hugetlb pages that can be used per
container. But what if that container isn't using any? Why shouldn't
it be allowed to use its overall memory allowance for anon and cache
instead?

With hugetlb being more dynamic, it becomes the same problem that we
had with dedicated tcp and kmem pools. It didn't make sense to fail a
random slab allocation when you still have memory headroom or can
reclaim some cache. Nowadays, the same problem applies to hugetlb.

> There is no notion of memory reclaim and this makes a huge difference
> for the pool that might consume considerable amount of memory. While
> this is the case for many kernel allocations as well they usually do not
> consume considerable portions of the accounted memory. This makes it
> really tricky to handle limit enforcement gracefully.

I don't think that's true. For some workloads, network buffers can
absolutely dominate. And they work just fine with cgroup limits. It
isn't a problem that they aren't reclaimable themselves, it's just
important that they put pressure on stuff that is.

So that if you use 80% hugetlb, the other memory is forced to stay in
the remaining 20%, or it OOMs; and that if you don't use hugetlb, the
group is still allowed to use the full 100% of its host memory
allowance, without requiring some outside agent continuously
monitoring and adjusting the container limits.

> Another important aspect comes from the lifetime semantics when a proper
> reservations accounting and managing needs to handle mmap time rather
> than the usual allocation path. While pages are allocated they do not
> belong to anybody and only later at the #PF time (or read for the fs
> backed mapping) the ownership is established. That makes it really hard
> to manage memory as whole under the memcg anyway as a large part of
> that pool sits without an ownership yet it cannot be used for any other
> purpose.
>
> These and more reasons were behind the earlier decision to have a
> dedicated hugetlb controller.

Yeah, there is still a need for an actual hugetlb controller for the
static use cases (and even for dynamic access to hugetlb_cma).

But you need memcg coverage as well for the more dynamic cases to work
as expected. And having that doesn't really interfere with the static
usecases.

> Also I will also Nack involving hugetlb pages being accounted by
> default. This would break any setups which mix normal and hugetlb memory
> with memcg limits applied.

Yes, no disagreement there. I think we're all on the same page that
this needs to be opt-in, say with a cgroup mount option.
  
Roman Gushchin Sept. 27, 2023, 9:37 p.m. UTC | #9
On Wed, Sep 27, 2023 at 02:47:38PM -0400, Johannes Weiner wrote:
> On Wed, Sep 27, 2023 at 01:21:20PM +0200, Michal Hocko wrote:
> > On Tue 26-09-23 12:49:47, Nhat Pham wrote:
> > > Currently, hugetlb memory usage is not accounted for in the memory
> > > controller, which could lead to memory overprotection for cgroups with
> > > hugetlb-backed memory. This has been observed in our production system.
> > > 
> > > This patch series rectifies this issue by charging the memcg when the
> > > hugetlb folio is allocated, and uncharging when the folio is freed. In
> > > addition, a new selftest is added to demonstrate and verify this new
> > > behavior.
> > 
> > The primary reason why hugetlb is living outside of memcg (and the core
> > MM as well) is that it doesn't really fit the whole scheme. In several
> > aspects. First and foremost, it is an independently managed resource
> > with its own pool management, use and lifetime.
> 
> Honestly, the simpler explanation is that few people have used hugetlb
> in regular, containerized non-HPC workloads.
> 
> Hugetlb has historically been much more special, and it retains a
> specialness that warrants e.g. the hugetlb cgroup container. But it
> has also made strides with hugetlb_cma, migratability, madvise support
> etc. that allows much more on-demand use. It's no longer the case that
> you just put a static pool of memory aside during boot and only a few
> blessed applications are using it.
> 
> For example, we're using hugetlb_cma very broadly with generic
> containers. The CMA region is fully usable by movable non-huge stuff
> until huge pages are allocated in it. With the hugetlb controller you
> can define a maximum number of hugetlb pages that can be used per
> container. But what if that container isn't using any? Why shouldn't
> it be allowed to use its overall memory allowance for anon and cache
> instead?

Cool, I remember proposing hugetlb memcg stats several years ago, and
if I remember correctly, at that time you were opposing it based on the
idea that huge pages are not a part of the overall memcg flow: they are
not subject to memory pressure, can't be evicted, etc. And THPs were
seen as a long-term replacement. Even though all of the above is true,
hugetlb has its niche and I don't think THPs will realistically replace
it any time soon.

So I'm glad to see this effort (and I'm very supportive of it) to make
hugetlb more convenient and transparent for the end user.

Thanks!
  
Nhat Pham Sept. 27, 2023, 11:33 p.m. UTC | #10
On Wed, Sep 27, 2023 at 4:21 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 26-09-23 12:49:47, Nhat Pham wrote:
> > Currently, hugetlb memory usage is not accounted for in the memory
> > controller, which could lead to memory overprotection for cgroups with
> > hugetlb-backed memory. This has been observed in our production system.
> >
> > This patch series rectifies this issue by charging the memcg when the
> > hugetlb folio is allocated, and uncharging when the folio is freed. In
> > addition, a new selftest is added to demonstrate and verify this new
> > behavior.
>
> The primary reason why hugetlb is living outside of memcg (and the core
> MM as well) is that it doesn't really fit the whole scheme. In several
> aspects. First and foremost, it is an independently managed resource
> with its own pool management, use and lifetime.
>
> There is no notion of memory reclaim and this makes a huge difference
> for the pool that might consume considerable amount of memory. While
> this is the case for many kernel allocations as well they usually do not
> consume considerable portions of the accounted memory. This makes it
> really tricky to handle limit enforcement gracefully.
>
> Another important aspect comes from the lifetime semantics when a proper
> reservations accounting and managing needs to handle mmap time rather
> than the usual allocation path. While pages are allocated they do not
> belong to anybody and only later at the #PF time (or read for the fs
> backed mapping) the ownership is established. That makes it really hard
> to manage memory as whole under the memcg anyway as a large part of
> that pool sits without an ownership yet it cannot be used for any other
> purpose.
>
> These and more reasons were behind the earlier decision to have a
> dedicated hugetlb controller.

While I believe all of these are true, I think they are not reasons not to
have memcg accounting. As everyone has pointed out, memcg
accounting by itself cannot handle all situations - it is not a fix-all.
Other mechanisms, such as the HugeTLB controller, could be the better
solution in these cases, and hugetlb memcg accounting is definitely not
an attempt to infringe upon these control domains.

However, memcg accounting is still necessary for certain memory limit
enforcement to work cleanly and properly - such as for the use cases we
have (as Johannes has beautifully described). It will certainly help
administrators simplify their control workflow a lot (assuming we do
not surprise them with this change - a new opt-in mount option should
help with the transition).

>
> Also I will also Nack involving hugetlb pages being accounted by
> default. This would break any setups which mix normal and hugetlb memory
> with memcg limits applied.

Got it! I'll introduce an opt-in mechanism in the next version. This
was an oversight on my part.


> --
> Michal Hocko
> SUSE Labs
  
Nhat Pham Sept. 28, 2023, 1 a.m. UTC | #11
On Tue, Sep 26, 2023 at 12:49 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> Currently, hugetlb memory usage is not accounted for in the memory
> controller, which could lead to memory overprotection for cgroups with
> hugetlb-backed memory. This has been observed in our production system.
>
> This patch series rectifies this issue by charging the memcg when the
> hugetlb folio is allocated, and uncharging when the folio is freed. In
> addition, a new selftest is added to demonstrate and verify this new
> behavior.
>
> Nhat Pham (2):
>   hugetlb: memcg: account hugetlb-backed memory in memory controller
>   selftests: add a selftest to verify hugetlb usage in memcg
>
> [...]

Thanks for all the comments and suggestions, everyone!
FYI, I have sent out a second version of the patch series with the new
mount flag:

https://lore.kernel.org/lkml/20230928005723.1709119-1-nphamcs@gmail.com/T/#t

Feel free to check it out and discuss it over there too!
  
Johannes Weiner Sept. 28, 2023, 12:52 p.m. UTC | #12
On Wed, Sep 27, 2023 at 02:37:47PM -0700, Roman Gushchin wrote:
> On Wed, Sep 27, 2023 at 02:47:38PM -0400, Johannes Weiner wrote:
> > On Wed, Sep 27, 2023 at 01:21:20PM +0200, Michal Hocko wrote:
> > > On Tue 26-09-23 12:49:47, Nhat Pham wrote:
> > > > Currently, hugetlb memory usage is not accounted for in the memory
> > > > controller, which could lead to memory overprotection for cgroups with
> > > > hugetlb-backed memory. This has been observed in our production system.
> > > > 
> > > > This patch series rectifies this issue by charging the memcg when the
> > > > hugetlb folio is allocated, and uncharging when the folio is freed. In
> > > > addition, a new selftest is added to demonstrate and verify this new
> > > > behavior.
> > > 
> > > The primary reason why hugetlb is living outside of memcg (and the core
> > > MM as well) is that it doesn't really fit the whole scheme. In several
> > > aspects. First and foremost, it is an independently managed resource
> > > with its own pool management, use and lifetime.
> > 
> > Honestly, the simpler explanation is that few people have used hugetlb
> > in regular, containerized non-HPC workloads.
> > 
> > Hugetlb has historically been much more special, and it retains a
> > specialness that warrants e.g. the hugetlb cgroup container. But it
> > has also made strides with hugetlb_cma, migratability, madvise support
> > etc. that allows much more on-demand use. It's no longer the case that
> > you just put a static pool of memory aside during boot and only a few
> > blessed applications are using it.
> > 
> > For example, we're using hugetlb_cma very broadly with generic
> > containers. The CMA region is fully usable by movable non-huge stuff
> > until huge pages are allocated in it. With the hugetlb controller you
> > can define a maximum number of hugetlb pages that can be used per
> > container. But what if that container isn't using any? Why shouldn't
> > it be allowed to use its overall memory allowance for anon and cache
> > instead?
> 
> Cool, I remember proposing hugetlb memcg stats several years ago, and
> if I remember correctly, at that time you were opposing it based on the
> idea that huge pages are not a part of the overall memcg flow: they are
> not subject to memory pressure, can't be evicted, etc. And THPs were
> seen as a long-term replacement. Even though all of the above is true,
> hugetlb has its niche and I don't think THPs will realistically replace
> it any time soon.

Yeah, Michal's arguments very much reminded me of my stance then. I
stand corrected.

I'm still hopeful that we can make 2M work transparently. I'd expect
1G to remain in the hugetlb domain for some time to come, but even
those are mostly dynamic now with your hugetlb_cma feature!

> So I'm glad to see this effort (and I'm very supportive of it) to make
> hugetlb more convenient and transparent for the end user.

Thanks!
  
Mike Kravetz Oct. 1, 2023, 11:27 p.m. UTC | #13
On 09/27/23 14:47, Johannes Weiner wrote:
> On Wed, Sep 27, 2023 at 01:21:20PM +0200, Michal Hocko wrote:
> > On Tue 26-09-23 12:49:47, Nhat Pham wrote:
> 
> So that if you use 80% hugetlb, the other memory is forced to stay in
> the remaining 20%, or it OOMs; and that if you don't use hugetlb, the
> group is still allowed to use the full 100% of its host memory
> allowance, without requiring some outside agent continuously
> monitoring and adjusting the container limits.

Jumping in late here as I was traveling last week.  In addition, I want
to state my limited cgroup knowledge up front.

I was thinking of your scenario above a little differently.  Suppose a
group is up and running at almost 100% memory usage.  However, the majority
of that memory is reclaimable.  Now, someone wants to allocate a 2M hugetlb
page.  There is not 2MB free, but we could easily reclaim 2MB to make room
for the hugetlb page.  I may be missing something, but I do not see how that
is going to happen.  It seems like we would really want that behavior.
  
Johannes Weiner Oct. 2, 2023, 2:42 p.m. UTC | #14
On Sun, Oct 01, 2023 at 04:27:30PM -0700, Mike Kravetz wrote:
> On 09/27/23 14:47, Johannes Weiner wrote:
> > On Wed, Sep 27, 2023 at 01:21:20PM +0200, Michal Hocko wrote:
> > > On Tue 26-09-23 12:49:47, Nhat Pham wrote:
> > 
> > So that if you use 80% hugetlb, the other memory is forced to stay in
> > the remaining 20%, or it OOMs; and that if you don't use hugetlb, the
> > group is still allowed to use the full 100% of its host memory
> > allowance, without requiring some outside agent continuously
> > monitoring and adjusting the container limits.
> 
> Jumping in late here as I was traveling last week.  In addition, I want
> to state my limited cgroup knowledge up front.
> 
> I was thinking of your scenario above a little differently.  Suppose a
> group is up and running at almost 100% memory usage.  However, the majority
> of that memory is reclaimable.  Now, someone wants to allocate a 2M hugetlb
> page.  There is not 2MB free, but we could easily reclaim 2MB to make room
> for the hugetlb page.  I may be missing something, but I do not see how that
> is going to happen.  It seems like we would really want that behavior.

But that is actually what it does, no?

alloc_hugetlb_folio
  mem_cgroup_hugetlb_charge_folio
    charge_memcg
      try_charge
        !page_counter_try_charge ?
          !try_to_free_mem_cgroup_pages ?
            mem_cgroup_oom

So it does reclaim when the hugetlb hits the cgroup limit. And if that
fails to make room, it OOMs the cgroup.

Or maybe I'm missing something?
  
Michal Hocko Oct. 2, 2023, 2:58 p.m. UTC | #15
On Mon 02-10-23 10:42:50, Johannes Weiner wrote:
> On Sun, Oct 01, 2023 at 04:27:30PM -0700, Mike Kravetz wrote:
> > On 09/27/23 14:47, Johannes Weiner wrote:
> > > On Wed, Sep 27, 2023 at 01:21:20PM +0200, Michal Hocko wrote:
> > > > On Tue 26-09-23 12:49:47, Nhat Pham wrote:
> > > 
> > > So that if you use 80% hugetlb, the other memory is forced to stay in
> > > the remaining 20%, or it OOMs; and that if you don't use hugetlb, the
> > > group is still allowed to use the full 100% of its host memory
> > > allowance, without requiring some outside agent continuously
> > > monitoring and adjusting the container limits.
> > 
> > Jumping in late here as I was traveling last week.  In addition, I want
> > to state my limited cgroup knowledge up front.
> > 
> > I was thinking of your scenario above a little differently.  Suppose a
> > group is up and running at almost 100% memory usage.  However, the majority
> > of that memory is reclaimable.  Now, someone wants to allocate a 2M hugetlb
> > page.  There is not 2MB free, but we could easily reclaim 2MB to make room
> > for the hugetlb page.  I may be missing something, but I do not see how that
> > is going to happen.  It seems like we would really want that behavior.
> 
> But that is actually what it does, no?
> 
> alloc_hugetlb_folio
>   mem_cgroup_hugetlb_charge_folio
>     charge_memcg
>       try_charge
>         !page_counter_try_charge ?
>           !try_to_free_mem_cgroup_pages ?
>             mem_cgroup_oom
> 
> So it does reclaim when the hugetlb hits the cgroup limit. And if that
> fails to make room, it OOMs the cgroup.
> 
> Or maybe I'm missing something?

I believe that Mike alludes to what I have pointed out in my other email:
http://lkml.kernel.org/r/ZRrI90KcRBwVZn/r@dhcp22.suse.cz - the situation
where the hugetlb request results in an actual hugetlb allocation rather
than consumption from the pre-allocated pool. In that case memcg is not
involved, because the charge happens only after the allocation happens.
That, btw, means that this request could disrupt a different memcg even
if the current one is at its limit, or it could be reclaimed instead.

Also, there is no OOM, as hugetlb pages are costly requests and we do
not invoke the OOM killer for them.
  
Johannes Weiner Oct. 2, 2023, 3:36 p.m. UTC | #16
On Mon, Oct 02, 2023 at 04:58:21PM +0200, Michal Hocko wrote:
> Also, there is no OOM, as hugetlb pages are costly requests and we do
> not invoke the OOM killer for them.

Ah good point.

That seems like a policy choice we could make. However, since hugetlb
users are already set up for, and have come to expect, SIGBUS on
physical allocation failure as well as on hugetlb_cgroup limits, we
should have memcg follow the established precedent and leave the OOM
killer out.

Agreed that a sentence in the changelog about this makes sense, though.
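
For reference, a minimal userspace sketch of that existing SIGBUS
contract (illustrative only; MAP_NORESERVE is used so that the failure
surfaces at fault time rather than at mmap()):

#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static sigjmp_buf env;

static void on_sigbus(int sig)
{
        (void)sig;
        siglongjmp(env, 1);
}

int main(void)
{
        size_t len = 2UL << 20;  /* one 2M hugetlb page */
        struct sigaction sa;
        char *p;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_sigbus;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGBUS, &sa, NULL);

        /* MAP_NORESERVE: no up-front reservation, so failure shows up at fault time. */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_NORESERVE, -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        if (sigsetjmp(env, 1) == 0) {
                memset(p, 0, len);  /* may SIGBUS if no hugetlb page is available */
                puts("hugetlb page faulted in");
        } else {
                puts("got SIGBUS: no hugetlb page available");
        }

        munmap(p, len);
        return 0;
}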