[RFC,0/5] mm: Select victim memcg using BPF_OOM_POLICY

Message ID 20230727073632.44983-1-zhouchuyi@bytedance.com
Headers
Series mm: Select victim memcg using BPF_OOM_POLICY |

Message

Chuyi Zhou July 27, 2023, 7:36 a.m. UTC
  This patchset tries to add a new bpf prog type and use it to select
a victim memcg when global OOM is invoked. The mainly motivation is
the need to customizable OOM victim selection functionality so that
we can protect more important app from OOM killer.

Chuyi Zhou (5):
  bpf: Introduce BPF_PROG_TYPE_OOM_POLICY
  mm: Select victim memcg using bpf prog
  libbpf, bpftool: Support BPF_PROG_TYPE_OOM_POLICY
  bpf: Add a new bpf helper to get cgroup ino
  bpf: Sample BPF program to set oom policy

 include/linux/bpf_oom.h        |  22 ++++
 include/linux/bpf_types.h      |   2 +
 include/linux/memcontrol.h     |   6 ++
 include/uapi/linux/bpf.h       |  21 ++++
 kernel/bpf/core.c              |   1 +
 kernel/bpf/helpers.c           |  17 +++
 kernel/bpf/syscall.c           |  10 ++
 mm/memcontrol.c                |  50 +++++++++
 mm/oom_kill.c                  | 185 +++++++++++++++++++++++++++++++++
 samples/bpf/Makefile           |   3 +
 samples/bpf/oom_kern.c         |  42 ++++++++
 samples/bpf/oom_user.c         | 128 +++++++++++++++++++++++
 tools/bpf/bpftool/common.c     |   1 +
 tools/include/uapi/linux/bpf.h |  21 ++++
 tools/lib/bpf/libbpf.c         |   3 +
 tools/lib/bpf/libbpf_probes.c  |   2 +
 16 files changed, 514 insertions(+)
 create mode 100644 include/linux/bpf_oom.h
 create mode 100644 samples/bpf/oom_kern.c
 create mode 100644 samples/bpf/oom_user.c
  

Comments

Alan Maguire July 27, 2023, 11:43 a.m. UTC | #1
On 27/07/2023 08:36, Chuyi Zhou wrote:
> This patchset tries to add a new bpf prog type and use it to select
> a victim memcg when global OOM is invoked. The mainly motivation is
> the need to customizable OOM victim selection functionality so that
> we can protect more important app from OOM killer.
>

It's a nice use case, but at a high level, the approach pursued here
is, as I understand it, discouraged for new BPF program development.
Specifically, adding a new BPF program type with semantics like this
is not preferred. Instead, can you look at using something like

- using "fmod_ret" instead of a new program type
- use BPF kfuncs instead of helpers.
- add selftests in tools/testing/selftests/bpf not samples.

There's some examples of how solutions have evolved from the traditional
approach (adding a new program type, helpers etc) to using kfuncs etc on
this list - for example HID-BPF and the BPF scheduler series - which
should help orient you. There are presentations from Linux Plumbers 2022
that walk through some of this too.

Judging by the sample program example, all you should need here is a way
to override the return value of bpf_oom_set_policy() - a noinline
function that by default returns a no-op. It can then be overridden by
an "fmod_ret" BPF program.

One thing you lose is cgroup specificity at BPF attach time, but you can
always add predicates based on the cgroup to your BPF program if needed.

Alan

> Chuyi Zhou (5):
>   bpf: Introduce BPF_PROG_TYPE_OOM_POLICY
>   mm: Select victim memcg using bpf prog
>   libbpf, bpftool: Support BPF_PROG_TYPE_OOM_POLICY
>   bpf: Add a new bpf helper to get cgroup ino
>   bpf: Sample BPF program to set oom policy
> 
>  include/linux/bpf_oom.h        |  22 ++++
>  include/linux/bpf_types.h      |   2 +
>  include/linux/memcontrol.h     |   6 ++
>  include/uapi/linux/bpf.h       |  21 ++++
>  kernel/bpf/core.c              |   1 +
>  kernel/bpf/helpers.c           |  17 +++
>  kernel/bpf/syscall.c           |  10 ++
>  mm/memcontrol.c                |  50 +++++++++
>  mm/oom_kill.c                  | 185 +++++++++++++++++++++++++++++++++
>  samples/bpf/Makefile           |   3 +
>  samples/bpf/oom_kern.c         |  42 ++++++++
>  samples/bpf/oom_user.c         | 128 +++++++++++++++++++++++
>  tools/bpf/bpftool/common.c     |   1 +
>  tools/include/uapi/linux/bpf.h |  21 ++++
>  tools/lib/bpf/libbpf.c         |   3 +
>  tools/lib/bpf/libbpf_probes.c  |   2 +
>  16 files changed, 514 insertions(+)
>  create mode 100644 include/linux/bpf_oom.h
>  create mode 100644 samples/bpf/oom_kern.c
>  create mode 100644 samples/bpf/oom_user.c
>
  
Chuyi Zhou July 27, 2023, 12:12 p.m. UTC | #2
在 2023/7/27 16:15, Michal Hocko 写道:
> On Thu 27-07-23 15:36:27, Chuyi Zhou wrote:
>> This patchset tries to add a new bpf prog type and use it to select
>> a victim memcg when global OOM is invoked. The mainly motivation is
>> the need to customizable OOM victim selection functionality so that
>> we can protect more important app from OOM killer.
> 
> This is rather modest to give an idea how the whole thing is supposed to
> work. I have looked through patches very quickly but there is no overall
> design described anywhere either.
> 
> Please could you give us a high level design description and reasoning
> why certain decisions have been made? e.g. why is this limited to the
> global oom sitation, why is the BPF program forced to operate on memcgs
> as entities etc...
> Also it would be very helpful to call out limitations of the BPF
> program, if there are any.
> 
> Thanks!

Hi,

Thanks for your advice.

The global/memcg OOM victim selection uses process as the base search 
granularity. However, we can see a need for cgroup level protection and 
there's been some discussion[1]. It seems reasonable to consider using 
memcg as a search granularity in victim selection algorithm.

Besides, it seems pretty well fit for offloading policy decisions to a 
BPF program, since BPF is scalable and flexible. That's why the new BPF
program operate on memcgs as entities.

The idea is to let user choose which leaf in the memcg tree should be 
selected as the victim. At the first layer, if we choose A, then it 
protects the memcg under the B, C, and D subtrees.

         root
      /   | \  \
     A    B  C  D
    /\
   E F


Using the BPF prog, we are allowed to compare the OOM priority between
two siblings so that we can choose the best victim in each layer.
For example:

run_prog(B, C) -> choose B
run_prog(B, D) -> choose D
run_prog(A, D) -> choose A

Once we select A as the victim in the first layer, the victim in next 
layer would be selected among A's children. Finally, we select a leaf 
memcg as victim.

In our scenarios, the impact caused by global OOM's is much more common, 
so we only considered global in this patchset. But it seems that the 
idea can also be applied to memcg OOM.

[1]https://lore.kernel.org/lkml/ZIgodGWoC%2FR07eak@dhcp22.suse.cz/

Thanks!
  
Alexei Starovoitov July 27, 2023, 3:57 p.m. UTC | #3
On Thu, Jul 27, 2023 at 4:45 AM Alan Maguire <alan.maguire@oracle.com> wrote:
>
> On 27/07/2023 08:36, Chuyi Zhou wrote:
> > This patchset tries to add a new bpf prog type and use it to select
> > a victim memcg when global OOM is invoked. The mainly motivation is
> > the need to customizable OOM victim selection functionality so that
> > we can protect more important app from OOM killer.
> >
>
> It's a nice use case, but at a high level, the approach pursued here
> is, as I understand it, discouraged for new BPF program development.
> Specifically, adding a new BPF program type with semantics like this
> is not preferred. Instead, can you look at using something like
>
> - using "fmod_ret" instead of a new program type
> - use BPF kfuncs instead of helpers.
> - add selftests in tools/testing/selftests/bpf not samples.

+1 to what Alan said above and below.

Also as Michal said there needs to be a design doc with pros and cons.
We see far too often that people attempt to use BPF in places where it
shouldn't be.
If programmability is not strictly necessary then BPF is not a good fit.

> There's some examples of how solutions have evolved from the traditional
> approach (adding a new program type, helpers etc) to using kfuncs etc on
> this list - for example HID-BPF and the BPF scheduler series - which
> should help orient you. There are presentations from Linux Plumbers 2022
> that walk through some of this too.
>
> Judging by the sample program example, all you should need here is a way
> to override the return value of bpf_oom_set_policy() - a noinline
> function that by default returns a no-op. It can then be overridden by
> an "fmod_ret" BPF program.
>
> One thing you lose is cgroup specificity at BPF attach time, but you can
> always add predicates based on the cgroup to your BPF program if needed.
>
> Alan
>
> > Chuyi Zhou (5):
> >   bpf: Introduce BPF_PROG_TYPE_OOM_POLICY
> >   mm: Select victim memcg using bpf prog
> >   libbpf, bpftool: Support BPF_PROG_TYPE_OOM_POLICY
> >   bpf: Add a new bpf helper to get cgroup ino
> >   bpf: Sample BPF program to set oom policy
> >
> >  include/linux/bpf_oom.h        |  22 ++++
> >  include/linux/bpf_types.h      |   2 +
> >  include/linux/memcontrol.h     |   6 ++
> >  include/uapi/linux/bpf.h       |  21 ++++
> >  kernel/bpf/core.c              |   1 +
> >  kernel/bpf/helpers.c           |  17 +++
> >  kernel/bpf/syscall.c           |  10 ++
> >  mm/memcontrol.c                |  50 +++++++++
> >  mm/oom_kill.c                  | 185 +++++++++++++++++++++++++++++++++
> >  samples/bpf/Makefile           |   3 +
> >  samples/bpf/oom_kern.c         |  42 ++++++++
> >  samples/bpf/oom_user.c         | 128 +++++++++++++++++++++++
> >  tools/bpf/bpftool/common.c     |   1 +
> >  tools/include/uapi/linux/bpf.h |  21 ++++
> >  tools/lib/bpf/libbpf.c         |   3 +
> >  tools/lib/bpf/libbpf_probes.c  |   2 +
> >  16 files changed, 514 insertions(+)
> >  create mode 100644 include/linux/bpf_oom.h
> >  create mode 100644 samples/bpf/oom_kern.c
> >  create mode 100644 samples/bpf/oom_user.c
> >
  
Michal Hocko July 27, 2023, 5:17 p.m. UTC | #4
On Thu 27-07-23 08:57:03, Alexei Starovoitov wrote:
> On Thu, Jul 27, 2023 at 4:45 AM Alan Maguire <alan.maguire@oracle.com> wrote:
> >
> > On 27/07/2023 08:36, Chuyi Zhou wrote:
> > > This patchset tries to add a new bpf prog type and use it to select
> > > a victim memcg when global OOM is invoked. The mainly motivation is
> > > the need to customizable OOM victim selection functionality so that
> > > we can protect more important app from OOM killer.
> > >
> >
> > It's a nice use case, but at a high level, the approach pursued here
> > is, as I understand it, discouraged for new BPF program development.
> > Specifically, adding a new BPF program type with semantics like this
> > is not preferred. Instead, can you look at using something like
> >
> > - using "fmod_ret" instead of a new program type
> > - use BPF kfuncs instead of helpers.
> > - add selftests in tools/testing/selftests/bpf not samples.
> 
> +1 to what Alan said above and below.
> 
> Also as Michal said there needs to be a design doc with pros and cons.
> We see far too often that people attempt to use BPF in places where it
> shouldn't be.
> If programmability is not strictly necessary then BPF is not a good fit.

To be completely honest I am not sure whether BPF is the right fit
myself. It is definitely a path to be explored but maybe we will learn
not the proper one in the end. The primary reason for considering it
though is that there is endless (+1) different policies how to select a
victim to kill on OOM. So having an interface to hook into and make that
decision sounds very promissing. This might be something very static
like EXPORT_SYMBOL that a kernel module can hook into or a more broader
BPF callback that could be more generic and therefore allow for easier
to define policy AFAIU.
  
Michal Hocko July 27, 2023, 5:23 p.m. UTC | #5
On Thu 27-07-23 20:12:01, Chuyi Zhou wrote:
> 
> 
> 在 2023/7/27 16:15, Michal Hocko 写道:
> > On Thu 27-07-23 15:36:27, Chuyi Zhou wrote:
> > > This patchset tries to add a new bpf prog type and use it to select
> > > a victim memcg when global OOM is invoked. The mainly motivation is
> > > the need to customizable OOM victim selection functionality so that
> > > we can protect more important app from OOM killer.
> > 
> > This is rather modest to give an idea how the whole thing is supposed to
> > work. I have looked through patches very quickly but there is no overall
> > design described anywhere either.
> > 
> > Please could you give us a high level design description and reasoning
> > why certain decisions have been made? e.g. why is this limited to the
> > global oom sitation, why is the BPF program forced to operate on memcgs
> > as entities etc...
> > Also it would be very helpful to call out limitations of the BPF
> > program, if there are any.
> > 
> > Thanks!
> 
> Hi,
> 
> Thanks for your advice.
> 
> The global/memcg OOM victim selection uses process as the base search
> granularity. However, we can see a need for cgroup level protection and
> there's been some discussion[1]. It seems reasonable to consider using memcg
> as a search granularity in victim selection algorithm.

Yes, it can be reasonable for some policies but making it central to the
design is very limiting.

> Besides, it seems pretty well fit for offloading policy decisions to a BPF
> program, since BPF is scalable and flexible. That's why the new BPF
> program operate on memcgs as entities.

I do not follow your line of argumentation here. The same could be
argued for processes or beans.

> The idea is to let user choose which leaf in the memcg tree should be
> selected as the victim. At the first layer, if we choose A, then it protects
> the memcg under the B, C, and D subtrees.
> 
>         root
>      /   | \  \
>     A    B  C  D
>    /\
>   E F
> 
> 
> Using the BPF prog, we are allowed to compare the OOM priority between
> two siblings so that we can choose the best victim in each layer.

How is the priority defined and communicated to the userspace.

> For example:
> 
> run_prog(B, C) -> choose B
> run_prog(B, D) -> choose D
> run_prog(A, D) -> choose A
> 
> Once we select A as the victim in the first layer, the victim in next layer
> would be selected among A's children. Finally, we select a leaf memcg as
> victim.

This sounds like a very specific oom policy and that is fine. But the
interface shouldn't be bound to any concepts like priorities let alone
be bound to memcg based selection. Ideally the BPF program should get
the oom_control as an input and either get a hook to kill process or if
that is not possible then return an entity to kill (either process or
set of processes).

> In our scenarios, the impact caused by global OOM's is much more common, so
> we only considered global in this patchset. But it seems that the idea can
> also be applied to memcg OOM.

The global and memcg OOMs shouldn't have a different interface. If a
specific BPF program wants to implement a different policy for global
vs. memcg OOM then be it but this should be a decision of the said
program not an inherent limitation of the interface.

> 
> [1]https://lore.kernel.org/lkml/ZIgodGWoC%2FR07eak@dhcp22.suse.cz/
> 
> Thanks!
> -- 
> Chuyi Zhou
  
Chuyi Zhou July 28, 2023, 2:34 a.m. UTC | #6
Hi,

在 2023/7/27 19:43, Alan Maguire 写道:
> On 27/07/2023 08:36, Chuyi Zhou wrote:
>> This patchset tries to add a new bpf prog type and use it to select
>> a victim memcg when global OOM is invoked. The mainly motivation is
>> the need to customizable OOM victim selection functionality so that
>> we can protect more important app from OOM killer.
>>
> 
> It's a nice use case, but at a high level, the approach pursued here
> is, as I understand it, discouraged for new BPF program development.
> Specifically, adding a new BPF program type with semantics like this
> is not preferred. Instead, can you look at using something like
> 
> - using "fmod_ret" instead of a new program type
> - use BPF kfuncs instead of helpers.
> - add selftests in tools/testing/selftests/bpf not samples.
> 
> There's some examples of how solutions have evolved from the traditional
> approach (adding a new program type, helpers etc) to using kfuncs etc on
> this list - for example HID-BPF and the BPF scheduler series - which
> should help orient you. There are presentations from Linux Plumbers 2022
> that walk through some of this too.
> 
> Judging by the sample program example, all you should need here is a way
> to override the return value of bpf_oom_set_policy() - a noinline
> function that by default returns a no-op. It can then be overridden by
> an "fmod_ret" BPF program.
> 
Indeed, I'll try to use kfuncs & fmod_ret.

Thanks for your advice.
--
Chuyi Zhou
> One thing you lose is cgroup specificity at BPF attach time, but you can
> always add predicates based on the cgroup to your BPF program if needed.
> 
> Alan
> 
>> Chuyi Zhou (5):
>>    bpf: Introduce BPF_PROG_TYPE_OOM_POLICY
>>    mm: Select victim memcg using bpf prog
>>    libbpf, bpftool: Support BPF_PROG_TYPE_OOM_POLICY
>>    bpf: Add a new bpf helper to get cgroup ino
>>    bpf: Sample BPF program to set oom policy
>>
>>   include/linux/bpf_oom.h        |  22 ++++
>>   include/linux/bpf_types.h      |   2 +
>>   include/linux/memcontrol.h     |   6 ++
>>   include/uapi/linux/bpf.h       |  21 ++++
>>   kernel/bpf/core.c              |   1 +
>>   kernel/bpf/helpers.c           |  17 +++
>>   kernel/bpf/syscall.c           |  10 ++
>>   mm/memcontrol.c                |  50 +++++++++
>>   mm/oom_kill.c                  | 185 +++++++++++++++++++++++++++++++++
>>   samples/bpf/Makefile           |   3 +
>>   samples/bpf/oom_kern.c         |  42 ++++++++
>>   samples/bpf/oom_user.c         | 128 +++++++++++++++++++++++
>>   tools/bpf/bpftool/common.c     |   1 +
>>   tools/include/uapi/linux/bpf.h |  21 ++++
>>   tools/lib/bpf/libbpf.c         |   3 +
>>   tools/lib/bpf/libbpf_probes.c  |   2 +
>>   16 files changed, 514 insertions(+)
>>   create mode 100644 include/linux/bpf_oom.h
>>   create mode 100644 samples/bpf/oom_kern.c
>>   create mode 100644 samples/bpf/oom_user.c
>>
  
Roman Gushchin July 28, 2023, 4:30 a.m. UTC | #7
On Thu, Jul 27, 2023 at 10:15:16AM +0200, Michal Hocko wrote:
> On Thu 27-07-23 15:36:27, Chuyi Zhou wrote:
> > This patchset tries to add a new bpf prog type and use it to select
> > a victim memcg when global OOM is invoked. The mainly motivation is
> > the need to customizable OOM victim selection functionality so that
> > we can protect more important app from OOM killer.
> 
> This is rather modest to give an idea how the whole thing is supposed to
> work. I have looked through patches very quickly but there is no overall
> design described anywhere either.
> 
> Please could you give us a high level design description and reasoning
> why certain decisions have been made? e.g. why is this limited to the
> global oom sitation, why is the BPF program forced to operate on memcgs
> as entities etc...
> Also it would be very helpful to call out limitations of the BPF
> program, if there are any.

One thing I realized recently: we don't have to make a victim selection
during the OOM, we [almost always] can do it in advance.

Kernel OOM's must guarantee the forward progress under heavy memory pressure
and it creates a lot of limitations on what can and what can't be done in
these circumstances.

But in practice most policies except maybe those which aim to catch very fast
memory spikes rely on things which are fairly static: a logical importance of
several workloads in comparison to some other workloads, "age", memory footprint
etc.

So I wonder if the right path is to create a kernel interface which allows
to define a OOM victim (maybe several victims, also depending on if it's
a global or a memcg oom) and update it periodically from an userspace.
In fact, the second part is already implemented by tools like oomd, systemd-oomd etc.
Someone might say that the first part is also implemented by the oom_score
interface, but I don't think it's an example of a convenient interface.
It's also not a memcg-level interface.

Just some thoughts.

Thanks!
  
Michal Hocko July 28, 2023, 8:06 a.m. UTC | #8
On Thu 27-07-23 21:30:01, Roman Gushchin wrote:
> On Thu, Jul 27, 2023 at 10:15:16AM +0200, Michal Hocko wrote:
> > On Thu 27-07-23 15:36:27, Chuyi Zhou wrote:
> > > This patchset tries to add a new bpf prog type and use it to select
> > > a victim memcg when global OOM is invoked. The mainly motivation is
> > > the need to customizable OOM victim selection functionality so that
> > > we can protect more important app from OOM killer.
> > 
> > This is rather modest to give an idea how the whole thing is supposed to
> > work. I have looked through patches very quickly but there is no overall
> > design described anywhere either.
> > 
> > Please could you give us a high level design description and reasoning
> > why certain decisions have been made? e.g. why is this limited to the
> > global oom sitation, why is the BPF program forced to operate on memcgs
> > as entities etc...
> > Also it would be very helpful to call out limitations of the BPF
> > program, if there are any.
> 
> One thing I realized recently: we don't have to make a victim selection
> during the OOM, we [almost always] can do it in advance.
> 
> Kernel OOM's must guarantee the forward progress under heavy memory pressure
> and it creates a lot of limitations on what can and what can't be done in
> these circumstances.
> 
> But in practice most policies except maybe those which aim to catch very fast
> memory spikes rely on things which are fairly static: a logical importance of
> several workloads in comparison to some other workloads, "age", memory footprint
> etc.
> 
> So I wonder if the right path is to create a kernel interface which allows
> to define a OOM victim (maybe several victims, also depending on if it's
> a global or a memcg oom) and update it periodically from an userspace.

We already have that interface. Just echo OOM_SCORE_ADJ_MAX to any tasks
that are to be killed with a priority...
Not a great interface but still something available.

> In fact, the second part is already implemented by tools like oomd, systemd-oomd etc.
> Someone might say that the first part is also implemented by the oom_score
> interface, but I don't think it's an example of a convenient interface.
> It's also not a memcg-level interface.

What do you mean by not memcg-level interface? What kind of interface
would you propose instead?

> Just some thoughts.
> 
> Thanks!
  
Roman Gushchin July 28, 2023, 6:42 p.m. UTC | #9
On Fri, Jul 28, 2023 at 10:06:38AM +0200, Michal Hocko wrote:
> On Thu 27-07-23 21:30:01, Roman Gushchin wrote:
> > On Thu, Jul 27, 2023 at 10:15:16AM +0200, Michal Hocko wrote:
> > > On Thu 27-07-23 15:36:27, Chuyi Zhou wrote:
> > > > This patchset tries to add a new bpf prog type and use it to select
> > > > a victim memcg when global OOM is invoked. The mainly motivation is
> > > > the need to customizable OOM victim selection functionality so that
> > > > we can protect more important app from OOM killer.
> > > 
> > > This is rather modest to give an idea how the whole thing is supposed to
> > > work. I have looked through patches very quickly but there is no overall
> > > design described anywhere either.
> > > 
> > > Please could you give us a high level design description and reasoning
> > > why certain decisions have been made? e.g. why is this limited to the
> > > global oom sitation, why is the BPF program forced to operate on memcgs
> > > as entities etc...
> > > Also it would be very helpful to call out limitations of the BPF
> > > program, if there are any.
> > 
> > One thing I realized recently: we don't have to make a victim selection
> > during the OOM, we [almost always] can do it in advance.
> > 
> > Kernel OOM's must guarantee the forward progress under heavy memory pressure
> > and it creates a lot of limitations on what can and what can't be done in
> > these circumstances.
> > 
> > But in practice most policies except maybe those which aim to catch very fast
> > memory spikes rely on things which are fairly static: a logical importance of
> > several workloads in comparison to some other workloads, "age", memory footprint
> > etc.
> > 
> > So I wonder if the right path is to create a kernel interface which allows
> > to define a OOM victim (maybe several victims, also depending on if it's
> > a global or a memcg oom) and update it periodically from an userspace.
> 
> We already have that interface. Just echo OOM_SCORE_ADJ_MAX to any tasks
> that are to be killed with a priority...
> Not a great interface but still something available.
> 
> > In fact, the second part is already implemented by tools like oomd, systemd-oomd etc.
> > Someone might say that the first part is also implemented by the oom_score
> > interface, but I don't think it's an example of a convenient interface.
> > It's also not a memcg-level interface.
> 
> What do you mean by not memcg-level interface? What kind of interface
> would you propose instead?

Something like memory.oom.priority, which is 0 by default, but if set to 1,
the memory cgroup is considered a good oom victim. Idk if we need priorities
or just fine with a binary thing.

Under oom conditions the kernel can look if we have a pre-selected oom target
and if not fallback to the current behavior.

Thanks!
  
Chuyi Zhou July 31, 2023, 6 a.m. UTC | #10
Hello, Michal

在 2023/7/28 01:23, Michal Hocko 写道:
> On Thu 27-07-23 20:12:01, Chuyi Zhou wrote:
>>
>>
>> 在 2023/7/27 16:15, Michal Hocko 写道:
>>> On Thu 27-07-23 15:36:27, Chuyi Zhou wrote:
>>>> This patchset tries to add a new bpf prog type and use it to select
>>>> a victim memcg when global OOM is invoked. The mainly motivation is
>>>> the need to customizable OOM victim selection functionality so that
>>>> we can protect more important app from OOM killer.
>>>
>>> This is rather modest to give an idea how the whole thing is supposed to
>>> work. I have looked through patches very quickly but there is no overall
>>> design described anywhere either.
>>>
>>> Please could you give us a high level design description and reasoning
>>> why certain decisions have been made? e.g. why is this limited to the
>>> global oom sitation, why is the BPF program forced to operate on memcgs
>>> as entities etc...
>>> Also it would be very helpful to call out limitations of the BPF
>>> program, if there are any.
>>>
>>> Thanks!
>>
>> Hi,
>>
>> Thanks for your advice.
>>
>> The global/memcg OOM victim selection uses process as the base search
>> granularity. However, we can see a need for cgroup level protection and
>> there's been some discussion[1]. It seems reasonable to consider using memcg
>> as a search granularity in victim selection algorithm.
> 
> Yes, it can be reasonable for some policies but making it central to the
> design is very limiting.
> 
>> Besides, it seems pretty well fit for offloading policy decisions to a BPF
>> program, since BPF is scalable and flexible. That's why the new BPF
>> program operate on memcgs as entities.
> 
> I do not follow your line of argumentation here. The same could be
> argued for processes or beans.
> 
>> The idea is to let user choose which leaf in the memcg tree should be
>> selected as the victim. At the first layer, if we choose A, then it protects
>> the memcg under the B, C, and D subtrees.
>>
>>          root
>>       /   | \  \
>>      A    B  C  D
>>     /\
>>    E F
>>
>>
>> Using the BPF prog, we are allowed to compare the OOM priority between
>> two siblings so that we can choose the best victim in each layer.
> 
> How is the priority defined and communicated to the userspace.
> 
>> For example:
>>
>> run_prog(B, C) -> choose B
>> run_prog(B, D) -> choose D
>> run_prog(A, D) -> choose A
>>
>> Once we select A as the victim in the first layer, the victim in next layer
>> would be selected among A's children. Finally, we select a leaf memcg as
>> victim.
> 
> This sounds like a very specific oom policy and that is fine. But the
> interface shouldn't be bound to any concepts like priorities let alone
> be bound to memcg based selection. Ideally the BPF program should get
> the oom_control as an input and either get a hook to kill process or if
> that is not possible then return an entity to kill (either process or
> set of processes).

Here are two interfaces I can think of. I was wondering if you could 
give me some feedback.

1. Add a new hook in select_bad_process(), we can attach it and return a 
set of pids or cgroup_ids which are pre-selected by user-defined policy, 
  suggested by Roman. Then we could use oom_evaluate_task to find a 
final victim among them. It's user-friendly and we can offload the OOM 
policy to userspace.

2. Add a new hook in oom_evaluate_task() and return a point to override 
the default oom_badness return-value. The simplest way to use this is to 
protect certain processes by setting the minimum score.

Of course if you have a better idea, please let me know.

Thanks!
---
Chuyi Zhou
> 
>> In our scenarios, the impact caused by global OOM's is much more common, so
>> we only considered global in this patchset. But it seems that the idea can
>> also be applied to memcg OOM.
> 
> The global and memcg OOMs shouldn't have a different interface. If a
> specific BPF program wants to implement a different policy for global
> vs. memcg OOM then be it but this should be a decision of the said
> program not an inherent limitation of the interface.
> 
>>
>> [1]https://lore.kernel.org/lkml/ZIgodGWoC%2FR07eak@dhcp22.suse.cz/
>>
>> Thanks!
>> -- 
>> Chuyi Zhou
>
  
Michal Hocko July 31, 2023, 1:12 p.m. UTC | #11
On Fri 28-07-23 11:42:27, Roman Gushchin wrote:
> On Fri, Jul 28, 2023 at 10:06:38AM +0200, Michal Hocko wrote:
> > On Thu 27-07-23 21:30:01, Roman Gushchin wrote:
> > > On Thu, Jul 27, 2023 at 10:15:16AM +0200, Michal Hocko wrote:
> > > > On Thu 27-07-23 15:36:27, Chuyi Zhou wrote:
> > > > > This patchset tries to add a new bpf prog type and use it to select
> > > > > a victim memcg when global OOM is invoked. The mainly motivation is
> > > > > the need to customizable OOM victim selection functionality so that
> > > > > we can protect more important app from OOM killer.
> > > > 
> > > > This is rather modest to give an idea how the whole thing is supposed to
> > > > work. I have looked through patches very quickly but there is no overall
> > > > design described anywhere either.
> > > > 
> > > > Please could you give us a high level design description and reasoning
> > > > why certain decisions have been made? e.g. why is this limited to the
> > > > global oom sitation, why is the BPF program forced to operate on memcgs
> > > > as entities etc...
> > > > Also it would be very helpful to call out limitations of the BPF
> > > > program, if there are any.
> > > 
> > > One thing I realized recently: we don't have to make a victim selection
> > > during the OOM, we [almost always] can do it in advance.
> > > 
> > > Kernel OOM's must guarantee the forward progress under heavy memory pressure
> > > and it creates a lot of limitations on what can and what can't be done in
> > > these circumstances.
> > > 
> > > But in practice most policies except maybe those which aim to catch very fast
> > > memory spikes rely on things which are fairly static: a logical importance of
> > > several workloads in comparison to some other workloads, "age", memory footprint
> > > etc.
> > > 
> > > So I wonder if the right path is to create a kernel interface which allows
> > > to define a OOM victim (maybe several victims, also depending on if it's
> > > a global or a memcg oom) and update it periodically from an userspace.
> > 
> > We already have that interface. Just echo OOM_SCORE_ADJ_MAX to any tasks
> > that are to be killed with a priority...
> > Not a great interface but still something available.
> > 
> > > In fact, the second part is already implemented by tools like oomd, systemd-oomd etc.
> > > Someone might say that the first part is also implemented by the oom_score
> > > interface, but I don't think it's an example of a convenient interface.
> > > It's also not a memcg-level interface.
> > 
> > What do you mean by not memcg-level interface? What kind of interface
> > would you propose instead?
> 
> Something like memory.oom.priority, which is 0 by default, but if set to 1,
> the memory cgroup is considered a good oom victim. Idk if we need priorities
> or just fine with a binary thing.

Priorities as a general API have been discussed at several occasions
(e.g http://lkml.kernel.org/r/ZFkEqhAs7FELUO3a@dhcp22.suse.cz). Their usage
is rather limited, hiearchical semantic not trivial etc.
  
Michal Hocko July 31, 2023, 1:23 p.m. UTC | #12
On Mon 31-07-23 14:00:22, Chuyi Zhou wrote:
> Hello, Michal
> 
> 在 2023/7/28 01:23, Michal Hocko 写道:
[...]
> > This sounds like a very specific oom policy and that is fine. But the
> > interface shouldn't be bound to any concepts like priorities let alone
> > be bound to memcg based selection. Ideally the BPF program should get
> > the oom_control as an input and either get a hook to kill process or if
> > that is not possible then return an entity to kill (either process or
> > set of processes).
> 
> Here are two interfaces I can think of. I was wondering if you could give me
> some feedback.
> 
> 1. Add a new hook in select_bad_process(), we can attach it and return a set
> of pids or cgroup_ids which are pre-selected by user-defined policy,
> suggested by Roman. Then we could use oom_evaluate_task to find a final
> victim among them. It's user-friendly and we can offload the OOM policy to
> userspace.
> 
> 2. Add a new hook in oom_evaluate_task() and return a point to override the
> default oom_badness return-value. The simplest way to use this is to protect
> certain processes by setting the minimum score.
> 
> Of course if you have a better idea, please let me know.

Hooking into oom_evaluate_task seems the least disruptive to the
existing oom killer implementation. I would start by planing with that
and see whether useful oom policies could be defined this way. I am not
sure what is the best way to communicate user input so that a BPF prgram
can consume it though. The interface should be generic enough that it
doesn't really pre-define any specific class of policies. Maybe we can
add something completely opaque to each memcg/task? Does BPF
infrastructure allow anything like that already?
  
Chuyi Zhou July 31, 2023, 4:26 p.m. UTC | #13
在 2023/7/31 21:23, Michal Hocko 写道:
> On Mon 31-07-23 14:00:22, Chuyi Zhou wrote:
>> Hello, Michal
>>
>> 在 2023/7/28 01:23, Michal Hocko 写道:
> [...]
>>> This sounds like a very specific oom policy and that is fine. But the
>>> interface shouldn't be bound to any concepts like priorities let alone
>>> be bound to memcg based selection. Ideally the BPF program should get
>>> the oom_control as an input and either get a hook to kill process or if
>>> that is not possible then return an entity to kill (either process or
>>> set of processes).
>>
>> Here are two interfaces I can think of. I was wondering if you could give me
>> some feedback.
>>
>> 1. Add a new hook in select_bad_process(), we can attach it and return a set
>> of pids or cgroup_ids which are pre-selected by user-defined policy,
>> suggested by Roman. Then we could use oom_evaluate_task to find a final
>> victim among them. It's user-friendly and we can offload the OOM policy to
>> userspace.
>>
>> 2. Add a new hook in oom_evaluate_task() and return a point to override the
>> default oom_badness return-value. The simplest way to use this is to protect
>> certain processes by setting the minimum score.
>>
>> Of course if you have a better idea, please let me know.
> 
> Hooking into oom_evaluate_task seems the least disruptive to the
> existing oom killer implementation. I would start by planing with that
> and see whether useful oom policies could be defined this way. I am not
> sure what is the best way to communicate user input so that a BPF prgram
> can consume it though. The interface should be generic enough that it
> doesn't really pre-define any specific class of policies. Maybe we can
> add something completely opaque to each memcg/task? Does BPF
> infrastructure allow anything like that already?
> 

“Maybe we can add something completely opaque to each memcg/task?”
Sorry, I don't understand what you mean.

I think we probably don't need to expose too much to the user, the 
following might be sufficient:

noinline int bpf_get_score(struct oom_control *oc,
		struct task_struct *task);

static int oom_evaluate_task()
{
	...
	points = bpf_get_score(oc, task);
	if (!check_points_valid(points))
          	points = oom_badness(task, oc->totalpages);
	...
}

There are several reasons:

1. The implementation of use-defined OOM policy, such as iteration, 
sorting and other complex operations, is more suitable to be placed in 
the userspace rather than in the bpf program. It is more convenient to 
implement these operations in userspace in which the useful information 
(memory usage of each task and memcg, memory allocation speed, etc.) can 
also be captured. For example, oomd implements multiple policies[1] 
without kernel-space input.

2. Userspace apps, such as oomd, can import useful information into bpf 
program, e.g., through bpf_map, and update it periodically. For example, 
we can do the scoring directly in userspace and maintain a score hash, 
so that in the bpf program, we only need to look for the corresponding 
score of the process.

Userspace policy(oomd)
          bpf_map_update
          score_hash
       ------------------>  BPF program
                               look up score in
                                score_hash
                             ---------------> kernel space
Just some thoughts.
Thanks!

[1]https://github.com/facebookincubator/oomd/tree/main/src/oomd/plugins)
  
Abel Wu Aug. 1, 2023, 6:53 a.m. UTC | #14
On 7/28/23 12:30 PM, Roman Gushchin wrote:
> On Thu, Jul 27, 2023 at 10:15:16AM +0200, Michal Hocko wrote:
>> On Thu 27-07-23 15:36:27, Chuyi Zhou wrote:
>>> This patchset tries to add a new bpf prog type and use it to select
>>> a victim memcg when global OOM is invoked. The mainly motivation is
>>> the need to customizable OOM victim selection functionality so that
>>> we can protect more important app from OOM killer.
>>
>> This is rather modest to give an idea how the whole thing is supposed to
>> work. I have looked through patches very quickly but there is no overall
>> design described anywhere either.
>>
>> Please could you give us a high level design description and reasoning
>> why certain decisions have been made? e.g. why is this limited to the
>> global oom sitation, why is the BPF program forced to operate on memcgs
>> as entities etc...
>> Also it would be very helpful to call out limitations of the BPF
>> program, if there are any.
> 
> One thing I realized recently: we don't have to make a victim selection
> during the OOM, we [almost always] can do it in advance.

I agree. We take precautions against memory shortage on over-committed
machines through oomd-like userspace tools, to mitigate possible SLO
violations on important services. The kernel OOM-killer in our scenario
works as a last resort, since userspace tools are not that reliable.
IMHO it would be useful for kernel to provide such flexibility.

> 
> Kernel OOM's must guarantee the forward progress under heavy memory pressure
> and it creates a lot of limitations on what can and what can't be done in
> these circumstances.
> 
> But in practice most policies except maybe those which aim to catch very fast
> memory spikes rely on things which are fairly static: a logical importance of
> several workloads in comparison to some other workloads, "age", memory footprint
> etc.
> 
> So I wonder if the right path is to create a kernel interface which allows
> to define a OOM victim (maybe several victims, also depending on if it's
> a global or a memcg oom) and update it periodically from an userspace.

Something like [1] proposed by Chuyi? IIUC there is still lack of some
triggers to invoke the procedure so we can actually do this in advance.

[1] 
https://lore.kernel.org/lkml/f8f44103-afba-10ee-b14b-a8e60a7f33d8@bytedance.com/

Thanks & Best,
	Abel
  
Michal Hocko Aug. 1, 2023, 8:18 a.m. UTC | #15
On Tue 01-08-23 00:26:20, Chuyi Zhou wrote:
> 
> 
> 在 2023/7/31 21:23, Michal Hocko 写道:
> > On Mon 31-07-23 14:00:22, Chuyi Zhou wrote:
> > > Hello, Michal
> > > 
> > > 在 2023/7/28 01:23, Michal Hocko 写道:
> > [...]
> > > > This sounds like a very specific oom policy and that is fine. But the
> > > > interface shouldn't be bound to any concepts like priorities let alone
> > > > be bound to memcg based selection. Ideally the BPF program should get
> > > > the oom_control as an input and either get a hook to kill process or if
> > > > that is not possible then return an entity to kill (either process or
> > > > set of processes).
> > > 
> > > Here are two interfaces I can think of. I was wondering if you could give me
> > > some feedback.
> > > 
> > > 1. Add a new hook in select_bad_process(), we can attach it and return a set
> > > of pids or cgroup_ids which are pre-selected by user-defined policy,
> > > suggested by Roman. Then we could use oom_evaluate_task to find a final
> > > victim among them. It's user-friendly and we can offload the OOM policy to
> > > userspace.
> > > 
> > > 2. Add a new hook in oom_evaluate_task() and return a point to override the
> > > default oom_badness return-value. The simplest way to use this is to protect
> > > certain processes by setting the minimum score.
> > > 
> > > Of course if you have a better idea, please let me know.
> > 
> > Hooking into oom_evaluate_task seems the least disruptive to the
> > existing oom killer implementation. I would start by planing with that
> > and see whether useful oom policies could be defined this way. I am not
> > sure what is the best way to communicate user input so that a BPF prgram
> > can consume it though. The interface should be generic enough that it
> > doesn't really pre-define any specific class of policies. Maybe we can
> > add something completely opaque to each memcg/task? Does BPF
> > infrastructure allow anything like that already?
> > 
> 
> “Maybe we can add something completely opaque to each memcg/task?”
> Sorry, I don't understand what you mean.

What I meant to say is to add a very non-specific interface that would
would a specific BPF program understand. Mostly an opaque value from the
memcg POV.

> I think we probably don't need to expose too much to the user, the following
> might be sufficient:
> 
> noinline int bpf_get_score(struct oom_control *oc,
> 		struct task_struct *task);
> 
> static int oom_evaluate_task()
> {
> 	...
> 	points = bpf_get_score(oc, task);
> 	if (!check_points_valid(points))
>          	points = oom_badness(task, oc->totalpages);
> 	...
> }
> 
> There are several reasons:
> 
> 1. The implementation of use-defined OOM policy, such as iteration, sorting
> and other complex operations, is more suitable to be placed in the userspace
> rather than in the bpf program. It is more convenient to implement these
> operations in userspace in which the useful information (memory usage of
> each task and memcg, memory allocation speed, etc.) can also be captured.
> For example, oomd implements multiple policies[1] without kernel-space
> input.

I do agree that userspace can handle a lot on its own and provide the
input to the BPF program to make a decision.

> 2. Userspace apps, such as oomd, can import useful information into bpf
> program, e.g., through bpf_map, and update it periodically. For example, we
> can do the scoring directly in userspace and maintain a score hash, so that
> in the bpf program, we only need to look for the corresponding score of the
> process.

Sure, why not. But all that is an implementation detail. We are
currently talkin about a proper abstraction and layering that would
allow what you do currently but also much more.

> Userspace policy(oomd)
>          bpf_map_update
>          score_hash
>       ------------------>  BPF program
>                               look up score in
>                                score_hash
>                             ---------------> kernel space
> Just some thoughts.

I believe all the above should be possible if BPF program is hooked at
the oom_evaluate_task layer and allow to bypass the default logic. BPF
program can process whatever data it has available. The oom scope iteration
will be implemented already in the kernel so all the BPF program has to
do is to rank processes and/or memcgs if oom.group is enabled. Whould
that work for your usecase?

> Thanks!
> 
> [1]https://github.com/facebookincubator/oomd/tree/main/src/oomd/plugins)
  
Chuyi Zhou Aug. 2, 2023, 3:04 a.m. UTC | #16
Hello,
在 2023/8/1 16:18, Michal Hocko 写道:
> On Tue 01-08-23 00:26:20, Chuyi Zhou wrote:
>>
>>
>> 在 2023/7/31 21:23, Michal Hocko 写道:
>>> On Mon 31-07-23 14:00:22, Chuyi Zhou wrote:
>>>> Hello, Michal
>>>>
>>>> 在 2023/7/28 01:23, Michal Hocko 写道:
>>> [...]
>>>>> This sounds like a very specific oom policy and that is fine. But the
>>>>> interface shouldn't be bound to any concepts like priorities let alone
>>>>> be bound to memcg based selection. Ideally the BPF program should get
>>>>> the oom_control as an input and either get a hook to kill process or if
>>>>> that is not possible then return an entity to kill (either process or
>>>>> set of processes).
>>>>
>>>> Here are two interfaces I can think of. I was wondering if you could give me
>>>> some feedback.
>>>>
>>>> 1. Add a new hook in select_bad_process(), we can attach it and return a set
>>>> of pids or cgroup_ids which are pre-selected by user-defined policy,
>>>> suggested by Roman. Then we could use oom_evaluate_task to find a final
>>>> victim among them. It's user-friendly and we can offload the OOM policy to
>>>> userspace.
>>>>
>>>> 2. Add a new hook in oom_evaluate_task() and return a point to override the
>>>> default oom_badness return-value. The simplest way to use this is to protect
>>>> certain processes by setting the minimum score.
>>>>
>>>> Of course if you have a better idea, please let me know.
>>>
>>> Hooking into oom_evaluate_task seems the least disruptive to the
>>> existing oom killer implementation. I would start by planing with that
>>> and see whether useful oom policies could be defined this way. I am not
>>> sure what is the best way to communicate user input so that a BPF prgram
>>> can consume it though. The interface should be generic enough that it
>>> doesn't really pre-define any specific class of policies. Maybe we can
>>> add something completely opaque to each memcg/task? Does BPF
>>> infrastructure allow anything like that already?
>>>
>>
>> “Maybe we can add something completely opaque to each memcg/task?”
>> Sorry, I don't understand what you mean.
> 
> What I meant to say is to add a very non-specific interface that would
> would a specific BPF program understand. Mostly an opaque value from the
> memcg POV.
> 
>> I think we probably don't need to expose too much to the user, the following
>> might be sufficient:
>>
>> noinline int bpf_get_score(struct oom_control *oc,
>> 		struct task_struct *task);
>>
>> static int oom_evaluate_task()
>> {
>> 	...
>> 	points = bpf_get_score(oc, task);
>> 	if (!check_points_valid(points))
>>           	points = oom_badness(task, oc->totalpages);
>> 	...
>> }
>>
>> There are several reasons:
>>
>> 1. The implementation of use-defined OOM policy, such as iteration, sorting
>> and other complex operations, is more suitable to be placed in the userspace
>> rather than in the bpf program. It is more convenient to implement these
>> operations in userspace in which the useful information (memory usage of
>> each task and memcg, memory allocation speed, etc.) can also be captured.
>> For example, oomd implements multiple policies[1] without kernel-space
>> input.
> 
> I do agree that userspace can handle a lot on its own and provide the
> input to the BPF program to make a decision.
> 
>> 2. Userspace apps, such as oomd, can import useful information into bpf
>> program, e.g., through bpf_map, and update it periodically. For example, we
>> can do the scoring directly in userspace and maintain a score hash, so that
>> in the bpf program, we only need to look for the corresponding score of the
>> process.
> 
> Sure, why not. But all that is an implementation detail. We are
> currently talkin about a proper abstraction and layering that would
> allow what you do currently but also much more.
> 
>> Userspace policy(oomd)
>>           bpf_map_update
>>           score_hash
>>        ------------------>  BPF program
>>                                look up score in
>>                                 score_hash
>>                              ---------------> kernel space
>> Just some thoughts.
> 
> I believe all the above should be possible if BPF program is hooked at
> the oom_evaluate_task layer and allow to bypass the default logic. BPF
> program can process whatever data it has available. The oom scope iteration
> will be implemented already in the kernel so all the BPF program has to
> do is to rank processes and/or memcgs if oom.group is enabled. Whould
> that work for your usecase?

Yes, I think the above interface can works well for our usecase.

In our scenario, we want to protect the application with higher priority 
and try to select lower priority as the victim.

Specifically, We can set priority for memcgs in userspace. In BPF 
program, we can find the memcg to which the given process belongs, and 
then rank according to the memcg's priority.

Thanks.
> 
>> Thanks!
>>
>> [1]https://github.com/facebookincubator/oomd/tree/main/src/oomd/plugins)
>