Message ID: <20230727073632.44983-1-zhouchuyi@bytedance.com>
Headers:
From: Chuyi Zhou <zhouchuyi@bytedance.com>
To: hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev, ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org
Cc: bpf@vger.kernel.org, linux-kernel@vger.kernel.org, wuyun.abel@bytedance.com, robin.lu@bytedance.com, Chuyi Zhou <zhouchuyi@bytedance.com>
Subject: [RFC PATCH 0/5] mm: Select victim memcg using BPF_OOM_POLICY
Date: Thu, 27 Jul 2023 15:36:27 +0800
Message-Id: <20230727073632.44983-1-zhouchuyi@bytedance.com>
Series: mm: Select victim memcg using BPF_OOM_POLICY
Message
Chuyi Zhou
July 27, 2023, 7:36 a.m. UTC
This patchset tries to add a new bpf prog type and use it to select a
victim memcg when global OOM is invoked. The main motivation is the need
for customizable OOM victim selection functionality so that we can
protect more important apps from the OOM killer.

Chuyi Zhou (5):
  bpf: Introduce BPF_PROG_TYPE_OOM_POLICY
  mm: Select victim memcg using bpf prog
  libbpf, bpftool: Support BPF_PROG_TYPE_OOM_POLICY
  bpf: Add a new bpf helper to get cgroup ino
  bpf: Sample BPF program to set oom policy

 include/linux/bpf_oom.h        |  22 ++++
 include/linux/bpf_types.h      |   2 +
 include/linux/memcontrol.h     |   6 ++
 include/uapi/linux/bpf.h       |  21 ++++
 kernel/bpf/core.c              |   1 +
 kernel/bpf/helpers.c           |  17 +++
 kernel/bpf/syscall.c           |  10 ++
 mm/memcontrol.c                |  50 +++++++++
 mm/oom_kill.c                  | 185 +++++++++++++++++++++++++++++++++
 samples/bpf/Makefile           |   3 +
 samples/bpf/oom_kern.c         |  42 ++++++++
 samples/bpf/oom_user.c         | 128 +++++++++++++++++++++++
 tools/bpf/bpftool/common.c     |   1 +
 tools/include/uapi/linux/bpf.h |  21 ++++
 tools/lib/bpf/libbpf.c         |   3 +
 tools/lib/bpf/libbpf_probes.c  |   2 +
 16 files changed, 514 insertions(+)
 create mode 100644 include/linux/bpf_oom.h
 create mode 100644 samples/bpf/oom_kern.c
 create mode 100644 samples/bpf/oom_user.c
Comments
On 27/07/2023 08:36, Chuyi Zhou wrote:
> This patchset tries to add a new bpf prog type and use it to select
> a victim memcg when global OOM is invoked. The main motivation is
> the need for customizable OOM victim selection functionality so that
> we can protect more important apps from the OOM killer.

It's a nice use case, but at a high level, the approach pursued here
is, as I understand it, discouraged for new BPF program development.
Specifically, adding a new BPF program type with semantics like this
is not preferred. Instead, can you look at using something like

- using "fmod_ret" instead of a new program type
- use BPF kfuncs instead of helpers.
- add selftests in tools/testing/selftests/bpf not samples.

There's some examples of how solutions have evolved from the traditional
approach (adding a new program type, helpers etc) to using kfuncs etc on
this list - for example HID-BPF and the BPF scheduler series - which
should help orient you. There are presentations from Linux Plumbers 2022
that walk through some of this too.

Judging by the sample program example, all you should need here is a way
to override the return value of bpf_oom_set_policy() - a noinline
function that by default returns a no-op. It can then be overridden by
an "fmod_ret" BPF program.

One thing you lose is cgroup specificity at BPF attach time, but you can
always add predicates based on the cgroup to your BPF program if needed.
Alan
On 2023/7/27 16:15, Michal Hocko wrote:
> On Thu 27-07-23 15:36:27, Chuyi Zhou wrote:
>> This patchset tries to add a new bpf prog type and use it to select
>> a victim memcg when global OOM is invoked. The main motivation is
>> the need for customizable OOM victim selection functionality so that
>> we can protect more important apps from the OOM killer.
>
> This is rather modest to give an idea how the whole thing is supposed to
> work. I have looked through patches very quickly but there is no overall
> design described anywhere either.
>
> Please could you give us a high level design description and reasoning
> why certain decisions have been made? e.g. why is this limited to the
> global oom situation, why is the BPF program forced to operate on memcgs
> as entities etc...
> Also it would be very helpful to call out limitations of the BPF
> program, if there are any.
>
> Thanks!

Hi,

Thanks for your advice.

The global/memcg OOM victim selection uses process as the base search
granularity. However, we can see a need for cgroup-level protection, and
there's been some discussion[1]. It seems reasonable to consider using
memcg as the search granularity in the victim selection algorithm.

Besides, this seems to fit pretty well with offloading policy decisions
to a BPF program, since BPF is scalable and flexible. That's why the new
BPF program operates on memcgs as entities.

The idea is to let the user choose which leaf in the memcg tree should be
selected as the victim. At the first layer, if we choose A, then we
protect the memcgs under the B, C, and D subtrees.

          root
        /  |  \  \
       A   B   C   D
      / \
     E   F

Using the BPF prog, we are allowed to compare the OOM priority between
two siblings so that we can choose the best victim in each layer. For
example:

run_prog(B, C) -> choose B
run_prog(B, D) -> choose D
run_prog(A, D) -> choose A

Once we select A as the victim in the first layer, the victim in the next
layer is selected among A's children. Finally, we select a leaf memcg as
the victim.

In our scenarios, the impact caused by global OOMs is much more common, so
we only considered the global case in this patchset. But it seems that the
idea can also be applied to memcg OOM.

[1] https://lore.kernel.org/lkml/ZIgodGWoC%2FR07eak@dhcp22.suse.cz/

Thanks!
On Thu, Jul 27, 2023 at 4:45 AM Alan Maguire <alan.maguire@oracle.com> wrote:
>
> On 27/07/2023 08:36, Chuyi Zhou wrote:
> > This patchset tries to add a new bpf prog type and use it to select
> > a victim memcg when global OOM is invoked. The main motivation is
> > the need for customizable OOM victim selection functionality so that
> > we can protect more important apps from the OOM killer.
>
> It's a nice use case, but at a high level, the approach pursued here
> is, as I understand it, discouraged for new BPF program development.
> Specifically, adding a new BPF program type with semantics like this
> is not preferred. Instead, can you look at using something like
>
> - using "fmod_ret" instead of a new program type
> - use BPF kfuncs instead of helpers.
> - add selftests in tools/testing/selftests/bpf not samples.

+1 to what Alan said above and below.

Also as Michal said there needs to be a design doc with pros and cons.
We see far too often that people attempt to use BPF in places where it
shouldn't be. If programmability is not strictly necessary then BPF is
not a good fit.

> There's some examples of how solutions have evolved from the traditional
> approach (adding a new program type, helpers etc) to using kfuncs etc on
> this list - for example HID-BPF and the BPF scheduler series - which
> should help orient you. There are presentations from Linux Plumbers 2022
> that walk through some of this too.
>
> Judging by the sample program example, all you should need here is a way
> to override the return value of bpf_oom_set_policy() - a noinline
> function that by default returns a no-op. It can then be overridden by
> an "fmod_ret" BPF program.
>
> One thing you lose is cgroup specificity at BPF attach time, but you can
> always add predicates based on the cgroup to your BPF program if needed.
>
> Alan
On Thu 27-07-23 08:57:03, Alexei Starovoitov wrote:
> On Thu, Jul 27, 2023 at 4:45 AM Alan Maguire <alan.maguire@oracle.com> wrote:
> >
> > On 27/07/2023 08:36, Chuyi Zhou wrote:
> > > This patchset tries to add a new bpf prog type and use it to select
> > > a victim memcg when global OOM is invoked. The main motivation is
> > > the need for customizable OOM victim selection functionality so that
> > > we can protect more important apps from the OOM killer.
> >
> > It's a nice use case, but at a high level, the approach pursued here
> > is, as I understand it, discouraged for new BPF program development.
> > Specifically, adding a new BPF program type with semantics like this
> > is not preferred. Instead, can you look at using something like
> >
> > - using "fmod_ret" instead of a new program type
> > - use BPF kfuncs instead of helpers.
> > - add selftests in tools/testing/selftests/bpf not samples.
>
> +1 to what Alan said above and below.
>
> Also as Michal said there needs to be a design doc with pros and cons.
> We see far too often that people attempt to use BPF in places where it
> shouldn't be.
> If programmability is not strictly necessary then BPF is not a good fit.

To be completely honest I am not sure whether BPF is the right fit myself.
It is definitely a path to be explored, but maybe we will learn in the end
that it is not the proper one. The primary reason for considering it,
though, is that there are endless (+1) different policies for how to
select a victim to kill on OOM. So having an interface to hook into and
make that decision sounds very promising. This might be something very
static like an EXPORT_SYMBOL that a kernel module can hook into, or a
broader BPF callback that could be more generic and therefore allow for
easier-to-define policies, AFAIU.
On Thu 27-07-23 20:12:01, Chuyi Zhou wrote:
> On 2023/7/27 16:15, Michal Hocko wrote:
> > On Thu 27-07-23 15:36:27, Chuyi Zhou wrote:
> > > This patchset tries to add a new bpf prog type and use it to select
> > > a victim memcg when global OOM is invoked. The main motivation is
> > > the need for customizable OOM victim selection functionality so that
> > > we can protect more important apps from the OOM killer.
> >
> > This is rather modest to give an idea how the whole thing is supposed to
> > work. I have looked through patches very quickly but there is no overall
> > design described anywhere either.
> >
> > Please could you give us a high level design description and reasoning
> > why certain decisions have been made? e.g. why is this limited to the
> > global oom situation, why is the BPF program forced to operate on memcgs
> > as entities etc...
> > Also it would be very helpful to call out limitations of the BPF
> > program, if there are any.
> >
> > Thanks!
>
> Hi,
>
> Thanks for your advice.
>
> The global/memcg OOM victim selection uses process as the base search
> granularity. However, we can see a need for cgroup-level protection, and
> there's been some discussion[1]. It seems reasonable to consider using
> memcg as the search granularity in the victim selection algorithm.

Yes, it can be reasonable for some policies but making it central to the
design is very limiting.

> Besides, this seems to fit pretty well with offloading policy decisions
> to a BPF program, since BPF is scalable and flexible. That's why the new
> BPF program operates on memcgs as entities.

I do not follow your line of argumentation here. The same could be
argued for processes or beans.

> The idea is to let the user choose which leaf in the memcg tree should
> be selected as the victim. At the first layer, if we choose A, then we
> protect the memcgs under the B, C, and D subtrees.
>
>           root
>         /  |  \  \
>        A   B   C   D
>       / \
>      E   F
>
> Using the BPF prog, we are allowed to compare the OOM priority between
> two siblings so that we can choose the best victim in each layer.

How is the priority defined and communicated to userspace?

> For example:
>
> run_prog(B, C) -> choose B
> run_prog(B, D) -> choose D
> run_prog(A, D) -> choose A
>
> Once we select A as the victim in the first layer, the victim in the
> next layer is selected among A's children. Finally, we select a leaf
> memcg as the victim.

This sounds like a very specific oom policy and that is fine. But the
interface shouldn't be bound to any concepts like priorities, let alone
be bound to memcg-based selection. Ideally the BPF program should get
the oom_control as an input and either get a hook to kill a process or,
if that is not possible, return an entity to kill (either a process or a
set of processes).

> In our scenarios, the impact caused by global OOMs is much more common,
> so we only considered the global case in this patchset. But it seems
> that the idea can also be applied to memcg OOM.

The global and memcg OOMs shouldn't have a different interface. If a
specific BPF program wants to implement a different policy for global vs.
memcg OOM then be it, but this should be a decision of the said program,
not an inherent limitation of the interface.

> [1] https://lore.kernel.org/lkml/ZIgodGWoC%2FR07eak@dhcp22.suse.cz/
>
> Thanks!
> --
> Chuyi Zhou
Hi,

On 2023/7/27 19:43, Alan Maguire wrote:
> On 27/07/2023 08:36, Chuyi Zhou wrote:
>> This patchset tries to add a new bpf prog type and use it to select
>> a victim memcg when global OOM is invoked. The main motivation is
>> the need for customizable OOM victim selection functionality so that
>> we can protect more important apps from the OOM killer.
>
> It's a nice use case, but at a high level, the approach pursued here
> is, as I understand it, discouraged for new BPF program development.
> Specifically, adding a new BPF program type with semantics like this
> is not preferred. Instead, can you look at using something like
>
> - using "fmod_ret" instead of a new program type
> - use BPF kfuncs instead of helpers.
> - add selftests in tools/testing/selftests/bpf not samples.
>
> There's some examples of how solutions have evolved from the traditional
> approach (adding a new program type, helpers etc) to using kfuncs etc on
> this list - for example HID-BPF and the BPF scheduler series - which
> should help orient you. There are presentations from Linux Plumbers 2022
> that walk through some of this too.
>
> Judging by the sample program example, all you should need here is a way
> to override the return value of bpf_oom_set_policy() - a noinline
> function that by default returns a no-op. It can then be overridden by
> an "fmod_ret" BPF program.

Indeed, I'll try to use kfuncs & fmod_ret. Thanks for your advice.

--
Chuyi Zhou

> One thing you lose is cgroup specificity at BPF attach time, but you can
> always add predicates based on the cgroup to your BPF program if needed.
>
> Alan
On Thu, Jul 27, 2023 at 10:15:16AM +0200, Michal Hocko wrote:
> On Thu 27-07-23 15:36:27, Chuyi Zhou wrote:
> > This patchset tries to add a new bpf prog type and use it to select
> > a victim memcg when global OOM is invoked. The main motivation is
> > the need for customizable OOM victim selection functionality so that
> > we can protect more important apps from the OOM killer.
>
> This is rather modest to give an idea how the whole thing is supposed to
> work. I have looked through patches very quickly but there is no overall
> design described anywhere either.
>
> Please could you give us a high level design description and reasoning
> why certain decisions have been made? e.g. why is this limited to the
> global oom situation, why is the BPF program forced to operate on memcgs
> as entities etc...
> Also it would be very helpful to call out limitations of the BPF
> program, if there are any.

One thing I realized recently: we don't have to make the victim selection
during the OOM; we [almost always] can do it in advance.

Kernel OOMs must guarantee forward progress under heavy memory pressure,
and that creates a lot of limitations on what can and what can't be done
in these circumstances.

But in practice most policies, except maybe those which aim to catch very
fast memory spikes, rely on things which are fairly static: the logical
importance of several workloads in comparison to some other workloads,
"age", memory footprint, etc.

So I wonder if the right path is to create a kernel interface which
allows us to define an OOM victim (maybe several victims, also depending
on whether it's a global or a memcg oom) and update it periodically from
userspace.

In fact, the second part is already implemented by tools like oomd,
systemd-oomd, etc. Someone might say that the first part is also
implemented by the oom_score interface, but I don't think it's an example
of a convenient interface. It's also not a memcg-level interface.

Just some thoughts.

Thanks!
On Thu 27-07-23 21:30:01, Roman Gushchin wrote:
> On Thu, Jul 27, 2023 at 10:15:16AM +0200, Michal Hocko wrote:
> > On Thu 27-07-23 15:36:27, Chuyi Zhou wrote:
> > > This patchset tries to add a new bpf prog type and use it to select
> > > a victim memcg when global OOM is invoked. The main motivation is
> > > the need for customizable OOM victim selection functionality so that
> > > we can protect more important apps from the OOM killer.
> >
> > This is rather modest to give an idea how the whole thing is supposed to
> > work. I have looked through patches very quickly but there is no overall
> > design described anywhere either.
> >
> > Please could you give us a high level design description and reasoning
> > why certain decisions have been made? e.g. why is this limited to the
> > global oom situation, why is the BPF program forced to operate on memcgs
> > as entities etc...
> > Also it would be very helpful to call out limitations of the BPF
> > program, if there are any.
>
> One thing I realized recently: we don't have to make the victim selection
> during the OOM; we [almost always] can do it in advance.
>
> Kernel OOMs must guarantee forward progress under heavy memory pressure,
> and that creates a lot of limitations on what can and what can't be done
> in these circumstances.
>
> But in practice most policies, except maybe those which aim to catch very
> fast memory spikes, rely on things which are fairly static: the logical
> importance of several workloads in comparison to some other workloads,
> "age", memory footprint, etc.
>
> So I wonder if the right path is to create a kernel interface which
> allows us to define an OOM victim (maybe several victims, also depending
> on whether it's a global or a memcg oom) and update it periodically from
> userspace.

We already have that interface. Just echo OOM_SCORE_ADJ_MAX to any tasks
that are to be killed with a priority... Not a great interface but still
something available.

> In fact, the second part is already implemented by tools like oomd,
> systemd-oomd, etc. Someone might say that the first part is also
> implemented by the oom_score interface, but I don't think it's an example
> of a convenient interface. It's also not a memcg-level interface.

What do you mean by not a memcg-level interface? What kind of interface
would you propose instead?

> Just some thoughts.
>
> Thanks!
On Fri, Jul 28, 2023 at 10:06:38AM +0200, Michal Hocko wrote:
> On Thu 27-07-23 21:30:01, Roman Gushchin wrote:
> > On Thu, Jul 27, 2023 at 10:15:16AM +0200, Michal Hocko wrote:
> > > On Thu 27-07-23 15:36:27, Chuyi Zhou wrote:
> > > > This patchset tries to add a new bpf prog type and use it to select
> > > > a victim memcg when global OOM is invoked. The main motivation is
> > > > the need for customizable OOM victim selection functionality so that
> > > > we can protect more important apps from the OOM killer.
> > >
> > > This is rather modest to give an idea how the whole thing is supposed
> > > to work. I have looked through patches very quickly but there is no
> > > overall design described anywhere either.
> > >
> > > Please could you give us a high level design description and reasoning
> > > why certain decisions have been made? e.g. why is this limited to the
> > > global oom situation, why is the BPF program forced to operate on
> > > memcgs as entities etc...
> > > Also it would be very helpful to call out limitations of the BPF
> > > program, if there are any.
> >
> > One thing I realized recently: we don't have to make the victim
> > selection during the OOM; we [almost always] can do it in advance.
> >
> > Kernel OOMs must guarantee forward progress under heavy memory pressure,
> > and that creates a lot of limitations on what can and what can't be done
> > in these circumstances.
> >
> > But in practice most policies, except maybe those which aim to catch
> > very fast memory spikes, rely on things which are fairly static: the
> > logical importance of several workloads in comparison to some other
> > workloads, "age", memory footprint, etc.
> >
> > So I wonder if the right path is to create a kernel interface which
> > allows us to define an OOM victim (maybe several victims, also depending
> > on whether it's a global or a memcg oom) and update it periodically from
> > userspace.
>
> We already have that interface. Just echo OOM_SCORE_ADJ_MAX to any tasks
> that are to be killed with a priority... Not a great interface but still
> something available.
>
> > In fact, the second part is already implemented by tools like oomd,
> > systemd-oomd, etc. Someone might say that the first part is also
> > implemented by the oom_score interface, but I don't think it's an
> > example of a convenient interface. It's also not a memcg-level
> > interface.
>
> What do you mean by not a memcg-level interface? What kind of interface
> would you propose instead?

Something like memory.oom.priority, which is 0 by default, but if set to
1, the memory cgroup is considered a good oom victim. Idk if we need
priorities or are just fine with a binary thing.

Under oom conditions the kernel can look whether we have a pre-selected
oom target and, if not, fall back to the current behavior.

Thanks!
Hello, Michal 在 2023/7/28 01:23, Michal Hocko 写道: > On Thu 27-07-23 20:12:01, Chuyi Zhou wrote: >> >> >> 在 2023/7/27 16:15, Michal Hocko 写道: >>> On Thu 27-07-23 15:36:27, Chuyi Zhou wrote: >>>> This patchset tries to add a new bpf prog type and use it to select >>>> a victim memcg when global OOM is invoked. The mainly motivation is >>>> the need to customizable OOM victim selection functionality so that >>>> we can protect more important app from OOM killer. >>> >>> This is rather modest to give an idea how the whole thing is supposed to >>> work. I have looked through patches very quickly but there is no overall >>> design described anywhere either. >>> >>> Please could you give us a high level design description and reasoning >>> why certain decisions have been made? e.g. why is this limited to the >>> global oom sitation, why is the BPF program forced to operate on memcgs >>> as entities etc... >>> Also it would be very helpful to call out limitations of the BPF >>> program, if there are any. >>> >>> Thanks! >> >> Hi, >> >> Thanks for your advice. >> >> The global/memcg OOM victim selection uses process as the base search >> granularity. However, we can see a need for cgroup level protection and >> there's been some discussion[1]. It seems reasonable to consider using memcg >> as a search granularity in victim selection algorithm. > > Yes, it can be reasonable for some policies but making it central to the > design is very limiting. > >> Besides, it seems pretty well fit for offloading policy decisions to a BPF >> program, since BPF is scalable and flexible. That's why the new BPF >> program operate on memcgs as entities. > > I do not follow your line of argumentation here. The same could be > argued for processes or beans. > >> The idea is to let user choose which leaf in the memcg tree should be >> selected as the victim. At the first layer, if we choose A, then it protects >> the memcg under the B, C, and D subtrees. 
>>
>>          root
>>        /  |  \  \
>>       A   B  C   D
>>      /\
>>     E  F
>>
>> Using the BPF prog, we are allowed to compare the OOM priority between
>> two siblings so that we can choose the best victim in each layer.
>
> How is the priority defined and communicated to the userspace.
>
>> For example:
>>
>> run_prog(B, C) -> choose B
>> run_prog(B, D) -> choose D
>> run_prog(A, D) -> choose A
>>
>> Once we select A as the victim in the first layer, the victim in next layer
>> would be selected among A's children. Finally, we select a leaf memcg as
>> victim.
>
> This sounds like a very specific oom policy and that is fine. But the
> interface shouldn't be bound to any concepts like priorities let alone
> be bound to memcg based selection. Ideally the BPF program should get
> the oom_control as an input and either get a hook to kill process or if
> that is not possible then return an entity to kill (either process or
> set of processes).

Here are two interfaces I can think of. I was wondering if you could give me
some feedback.

1. Add a new hook in select_bad_process(); we can attach it and return a set
   of pids or cgroup_ids which are pre-selected by a user-defined policy, as
   suggested by Roman. Then we could use oom_evaluate_task to find a final
   victim among them. It's user-friendly and we can offload the OOM policy
   to userspace.

2. Add a new hook in oom_evaluate_task() and return a point to override the
   default oom_badness return-value. The simplest way to use this is to
   protect certain processes by setting the minimum score.

Of course if you have a better idea, please let me know.

Thanks!
--
Chuyi Zhou

>
>> In our scenarios, the impact caused by global OOM's is much more common, so
>> we only considered global in this patchset. But it seems that the idea can
>> also be applied to memcg OOM.
>
> The global and memcg OOMs shouldn't have a different interface. If a
> specific BPF program wants to implement a different policy for global
> vs. memcg OOM then be it but this should be a decision of the said
> program not an inherent limitation of the interface.
>
>> [1] https://lore.kernel.org/lkml/ZIgodGWoC%2FR07eak@dhcp22.suse.cz/
>>
>> Thanks!
>> --
>> Chuyi Zhou
>
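The layer-by-layer selection Chuyi describes above can be sketched as a small userspace C simulation. This is only an illustrative model, not kernel code: the `struct memcg`, the `prio` field, and `pick_victim()` (standing in for the proposed `run_prog()` BPF comparison) are all assumptions made for the sketch.

```c
#include <stddef.h>

/* Toy model of the memcg tree used in the mail: each node carries a
 * user-assigned priority (lower = better victim) and its children. */
struct memcg {
	int prio;                 /* hypothetical userspace-assigned priority */
	struct memcg *children;
	int nr_children;
};

/* Stand-in for run_prog(a, b): pick the preferred victim of two siblings. */
static struct memcg *pick_victim(struct memcg *a, struct memcg *b)
{
	return a->prio <= b->prio ? a : b;
}

/* Descend the tree: at each layer compare siblings pairwise, then recurse
 * into the chosen subtree until a leaf memcg remains. */
static struct memcg *select_leaf_victim(struct memcg *root)
{
	struct memcg *cur = root;

	while (cur->nr_children) {
		struct memcg *best = &cur->children[0];

		for (int i = 1; i < cur->nr_children; i++)
			best = pick_victim(best, &cur->children[i]);
		cur = best;
	}
	return cur;
}
```

With the tree from the mail (A/B/C/D under root, E/F under A), choosing A at the first layer confines the search to A's subtree, so B, C, and D are protected exactly as described.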
On Fri 28-07-23 11:42:27, Roman Gushchin wrote: > On Fri, Jul 28, 2023 at 10:06:38AM +0200, Michal Hocko wrote: > > On Thu 27-07-23 21:30:01, Roman Gushchin wrote: > > > On Thu, Jul 27, 2023 at 10:15:16AM +0200, Michal Hocko wrote: > > > > On Thu 27-07-23 15:36:27, Chuyi Zhou wrote: > > > > > This patchset tries to add a new bpf prog type and use it to select > > > > > a victim memcg when global OOM is invoked. The mainly motivation is > > > > > the need to customizable OOM victim selection functionality so that > > > > > we can protect more important app from OOM killer. > > > > > > > > This is rather modest to give an idea how the whole thing is supposed to > > > > work. I have looked through patches very quickly but there is no overall > > > > design described anywhere either. > > > > > > > > Please could you give us a high level design description and reasoning > > > > why certain decisions have been made? e.g. why is this limited to the > > > > global oom sitation, why is the BPF program forced to operate on memcgs > > > > as entities etc... > > > > Also it would be very helpful to call out limitations of the BPF > > > > program, if there are any. > > > > > > One thing I realized recently: we don't have to make a victim selection > > > during the OOM, we [almost always] can do it in advance. > > > > > > Kernel OOM's must guarantee the forward progress under heavy memory pressure > > > and it creates a lot of limitations on what can and what can't be done in > > > these circumstances. > > > > > > But in practice most policies except maybe those which aim to catch very fast > > > memory spikes rely on things which are fairly static: a logical importance of > > > several workloads in comparison to some other workloads, "age", memory footprint > > > etc. 
> > > So I wonder if the right path is to create a kernel interface which allows
> > > to define a OOM victim (maybe several victims, also depending on if it's
> > > a global or a memcg oom) and update it periodically from an userspace.
> >
> > We already have that interface. Just echo OOM_SCORE_ADJ_MAX to any tasks
> > that are to be killed with a priority...
> > Not a great interface but still something available.
> >
> > > In fact, the second part is already implemented by tools like oomd, systemd-oomd etc.
> > > Someone might say that the first part is also implemented by the oom_score
> > > interface, but I don't think it's an example of a convenient interface.
> > > It's also not a memcg-level interface.
> >
> > What do you mean by not memcg-level interface? What kind of interface
> > would you propose instead?
>
> Something like memory.oom.priority, which is 0 by default, but if set to 1,
> the memory cgroup is considered a good oom victim. Idk if we need priorities
> or just fine with a binary thing.

Priorities as a general API have been discussed on several occasions (e.g.
http://lkml.kernel.org/r/ZFkEqhAs7FELUO3a@dhcp22.suse.cz). Their usage is
rather limited, the hierarchical semantics are not trivial, etc.
On Mon 31-07-23 14:00:22, Chuyi Zhou wrote:
> Hello, Michal
>
> On 2023/7/28 01:23, Michal Hocko wrote:
[...]
> > This sounds like a very specific oom policy and that is fine. But the
> > interface shouldn't be bound to any concepts like priorities let alone
> > be bound to memcg based selection. Ideally the BPF program should get
> > the oom_control as an input and either get a hook to kill process or if
> > that is not possible then return an entity to kill (either process or
> > set of processes).
>
> Here are two interfaces I can think of. I was wondering if you could give me
> some feedback.
>
> 1. Add a new hook in select_bad_process(), we can attach it and return a set
> of pids or cgroup_ids which are pre-selected by user-defined policy,
> suggested by Roman. Then we could use oom_evaluate_task to find a final
> victim among them. It's user-friendly and we can offload the OOM policy to
> userspace.
>
> 2. Add a new hook in oom_evaluate_task() and return a point to override the
> default oom_badness return-value. The simplest way to use this is to protect
> certain processes by setting the minimum score.
>
> Of course if you have a better idea, please let me know.

Hooking into oom_evaluate_task seems the least disruptive to the existing
oom killer implementation. I would start by playing with that and see
whether useful oom policies could be defined this way. I am not sure what
is the best way to communicate user input so that a BPF program can
consume it, though. The interface should be generic enough that it doesn't
really pre-define any specific class of policies. Maybe we can add
something completely opaque to each memcg/task? Does the BPF
infrastructure allow anything like that already?
在 2023/7/31 21:23, Michal Hocko 写道: > On Mon 31-07-23 14:00:22, Chuyi Zhou wrote: >> Hello, Michal >> >> 在 2023/7/28 01:23, Michal Hocko 写道: > [...] >>> This sounds like a very specific oom policy and that is fine. But the >>> interface shouldn't be bound to any concepts like priorities let alone >>> be bound to memcg based selection. Ideally the BPF program should get >>> the oom_control as an input and either get a hook to kill process or if >>> that is not possible then return an entity to kill (either process or >>> set of processes). >> >> Here are two interfaces I can think of. I was wondering if you could give me >> some feedback. >> >> 1. Add a new hook in select_bad_process(), we can attach it and return a set >> of pids or cgroup_ids which are pre-selected by user-defined policy, >> suggested by Roman. Then we could use oom_evaluate_task to find a final >> victim among them. It's user-friendly and we can offload the OOM policy to >> userspace. >> >> 2. Add a new hook in oom_evaluate_task() and return a point to override the >> default oom_badness return-value. The simplest way to use this is to protect >> certain processes by setting the minimum score. >> >> Of course if you have a better idea, please let me know. > > Hooking into oom_evaluate_task seems the least disruptive to the > existing oom killer implementation. I would start by planing with that > and see whether useful oom policies could be defined this way. I am not > sure what is the best way to communicate user input so that a BPF prgram > can consume it though. The interface should be generic enough that it > doesn't really pre-define any specific class of policies. Maybe we can > add something completely opaque to each memcg/task? Does BPF > infrastructure allow anything like that already? > “Maybe we can add something completely opaque to each memcg/task?” Sorry, I don't understand what you mean. 
I think we probably don't need to expose too much to the user; the
following might be sufficient:

    noinline int bpf_get_score(struct oom_control *oc,
                               struct task_struct *task);

    static int oom_evaluate_task()
    {
        ...
        points = bpf_get_score(oc, task);
        if (!check_points_valid(points))
            points = oom_badness(task, oc->totalpages);
        ...
    }

There are several reasons:

1. The implementation of a user-defined OOM policy, such as iteration,
   sorting and other complex operations, is more suitable to be placed in
   userspace rather than in the bpf program. It is more convenient to
   implement these operations in userspace, in which the useful information
   (memory usage of each task and memcg, memory allocation speed, etc.) can
   also be captured. For example, oomd implements multiple policies[1]
   without kernel-space input.

2. Userspace apps, such as oomd, can import useful information into the bpf
   program, e.g., through a bpf_map, and update it periodically. For
   example, we can do the scoring directly in userspace and maintain a
   score hash, so that in the bpf program, we only need to look up the
   corresponding score of the process.

    Userspace policy (oomd)
        bpf_map_update score_hash
      --------------------------------> BPF program
                                            look up score in score_hash
                                          --------------------------------> kernel space

Just some thoughts. Thanks!

[1] https://github.com/facebookincubator/oomd/tree/main/src/oomd/plugins
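The control flow proposed above — consult a userspace-maintained score first and fall back to `oom_badness()` — can be mimicked as a plain userspace simulation. Everything here is a stand-in chosen for illustration (the array playing the role of a BPF map, the helper names, the placeholder badness heuristic); it is not the kernel or BPF API, just the shape of the fallback logic.

```c
#define SCORE_INVALID (-1L)
#define MAX_PIDS 64

/* Stand-in for a BPF map keyed by pid, filled by an oomd-like daemon. */
static long score_map[MAX_PIDS];

/* Stand-in for bpf_get_score(): consult the userspace-maintained map. */
static long bpf_get_score(int pid)
{
	if (pid < 0 || pid >= MAX_PIDS)
		return SCORE_INVALID;
	return score_map[pid];
}

/* Stand-in for the default kernel heuristic; arbitrary but deterministic. */
static long oom_badness(int pid)
{
	return 100 + pid;
}

/* Mirrors the flow sketched in the mail: a valid user-supplied score
 * overrides the default, otherwise fall back to oom_badness(). */
static long evaluate_task(int pid)
{
	long points = bpf_get_score(pid);

	if (points == SCORE_INVALID)
		points = oom_badness(pid);
	return points;
}
```

A daemon updating `score_map` periodically corresponds to the `bpf_map_update` arrow in the diagram above; the kernel side only performs the cheap lookup.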
On 7/28/23 12:30 PM, Roman Gushchin wrote:
> On Thu, Jul 27, 2023 at 10:15:16AM +0200, Michal Hocko wrote:
>> On Thu 27-07-23 15:36:27, Chuyi Zhou wrote:
>>> This patchset tries to add a new bpf prog type and use it to select
>>> a victim memcg when global OOM is invoked. The mainly motivation is
>>> the need to customizable OOM victim selection functionality so that
>>> we can protect more important app from OOM killer.
>>
>> This is rather modest to give an idea how the whole thing is supposed to
>> work. I have looked through patches very quickly but there is no overall
>> design described anywhere either.
>>
>> Please could you give us a high level design description and reasoning
>> why certain decisions have been made? e.g. why is this limited to the
>> global oom sitation, why is the BPF program forced to operate on memcgs
>> as entities etc...
>> Also it would be very helpful to call out limitations of the BPF
>> program, if there are any.
>
> One thing I realized recently: we don't have to make a victim selection
> during the OOM, we [almost always] can do it in advance.

I agree. We take precautions against memory shortage on over-committed
machines through oomd-like userspace tools, to mitigate possible SLO
violations on important services. The kernel OOM-killer in our scenario
works as a last resort, since userspace tools are not that reliable. IMHO
it would be useful for the kernel to provide such flexibility.

> Kernel OOM's must guarantee the forward progress under heavy memory pressure
> and it creates a lot of limitations on what can and what can't be done in
> these circumstances.
>
> But in practice most policies except maybe those which aim to catch very fast
> memory spikes rely on things which are fairly static: a logical importance of
> several workloads in comparison to some other workloads, "age", memory footprint
> etc.
>
> So I wonder if the right path is to create a kernel interface which allows
> to define a OOM victim (maybe several victims, also depending on if it's
> a global or a memcg oom) and update it periodically from an userspace.

Something like [1] proposed by Chuyi? IIUC there is still a lack of
triggers to invoke the procedure, so we can actually do this in advance.

[1] https://lore.kernel.org/lkml/f8f44103-afba-10ee-b14b-a8e60a7f33d8@bytedance.com/

Thanks & Best,
	Abel
On Tue 01-08-23 00:26:20, Chuyi Zhou wrote: > > > 在 2023/7/31 21:23, Michal Hocko 写道: > > On Mon 31-07-23 14:00:22, Chuyi Zhou wrote: > > > Hello, Michal > > > > > > 在 2023/7/28 01:23, Michal Hocko 写道: > > [...] > > > > This sounds like a very specific oom policy and that is fine. But the > > > > interface shouldn't be bound to any concepts like priorities let alone > > > > be bound to memcg based selection. Ideally the BPF program should get > > > > the oom_control as an input and either get a hook to kill process or if > > > > that is not possible then return an entity to kill (either process or > > > > set of processes). > > > > > > Here are two interfaces I can think of. I was wondering if you could give me > > > some feedback. > > > > > > 1. Add a new hook in select_bad_process(), we can attach it and return a set > > > of pids or cgroup_ids which are pre-selected by user-defined policy, > > > suggested by Roman. Then we could use oom_evaluate_task to find a final > > > victim among them. It's user-friendly and we can offload the OOM policy to > > > userspace. > > > > > > 2. Add a new hook in oom_evaluate_task() and return a point to override the > > > default oom_badness return-value. The simplest way to use this is to protect > > > certain processes by setting the minimum score. > > > > > > Of course if you have a better idea, please let me know. > > > > Hooking into oom_evaluate_task seems the least disruptive to the > > existing oom killer implementation. I would start by planing with that > > and see whether useful oom policies could be defined this way. I am not > > sure what is the best way to communicate user input so that a BPF prgram > > can consume it though. The interface should be generic enough that it > > doesn't really pre-define any specific class of policies. Maybe we can > > add something completely opaque to each memcg/task? Does BPF > > infrastructure allow anything like that already? 
> >
> > “Maybe we can add something completely opaque to each memcg/task?”
> Sorry, I don't understand what you mean.

What I meant to say is to add a very non-specific interface that a
specific BPF program would understand. Mostly an opaque value from the
memcg POV.

> I think we probably don't need to expose too much to the user, the following
> might be sufficient:
>
> noinline int bpf_get_score(struct oom_control *oc,
>                            struct task_struct *task);
>
> static int oom_evaluate_task()
> {
>     ...
>     points = bpf_get_score(oc, task);
>     if (!check_points_valid(points))
>         points = oom_badness(task, oc->totalpages);
>     ...
> }
>
> There are several reasons:
>
> 1. The implementation of use-defined OOM policy, such as iteration, sorting
> and other complex operations, is more suitable to be placed in the userspace
> rather than in the bpf program. It is more convenient to implement these
> operations in userspace in which the useful information (memory usage of
> each task and memcg, memory allocation speed, etc.) can also be captured.
> For example, oomd implements multiple policies[1] without kernel-space
> input.

I do agree that userspace can handle a lot on its own and provide the
input to the BPF program to make a decision.

> 2. Userspace apps, such as oomd, can import useful information into bpf
> program, e.g., through bpf_map, and update it periodically. For example, we
> can do the scoring directly in userspace and maintain a score hash, so that
> in the bpf program, we only need to look for the corresponding score of the
> process.

Sure, why not. But all that is an implementation detail. We are currently
talking about a proper abstraction and layering that would allow what you
do currently but also much more.

> Userspace policy (oomd)
>     bpf_map_update score_hash
>   ------------------> BPF program
>     look up score in score_hash
>   ---------------> kernel space
>
> Just some thoughts.
I believe all the above should be possible if the BPF program is hooked at
the oom_evaluate_task layer and allowed to bypass the default logic. The
BPF program can process whatever data it has available. The oom scope
iteration will be implemented already in the kernel, so all the BPF program
has to do is to rank processes and/or memcgs if oom.group is enabled. Would
that work for your usecase?

> Thanks!
>
> [1] https://github.com/facebookincubator/oomd/tree/main/src/oomd/plugins
Hello, 在 2023/8/1 16:18, Michal Hocko 写道: > On Tue 01-08-23 00:26:20, Chuyi Zhou wrote: >> >> >> 在 2023/7/31 21:23, Michal Hocko 写道: >>> On Mon 31-07-23 14:00:22, Chuyi Zhou wrote: >>>> Hello, Michal >>>> >>>> 在 2023/7/28 01:23, Michal Hocko 写道: >>> [...] >>>>> This sounds like a very specific oom policy and that is fine. But the >>>>> interface shouldn't be bound to any concepts like priorities let alone >>>>> be bound to memcg based selection. Ideally the BPF program should get >>>>> the oom_control as an input and either get a hook to kill process or if >>>>> that is not possible then return an entity to kill (either process or >>>>> set of processes). >>>> >>>> Here are two interfaces I can think of. I was wondering if you could give me >>>> some feedback. >>>> >>>> 1. Add a new hook in select_bad_process(), we can attach it and return a set >>>> of pids or cgroup_ids which are pre-selected by user-defined policy, >>>> suggested by Roman. Then we could use oom_evaluate_task to find a final >>>> victim among them. It's user-friendly and we can offload the OOM policy to >>>> userspace. >>>> >>>> 2. Add a new hook in oom_evaluate_task() and return a point to override the >>>> default oom_badness return-value. The simplest way to use this is to protect >>>> certain processes by setting the minimum score. >>>> >>>> Of course if you have a better idea, please let me know. >>> >>> Hooking into oom_evaluate_task seems the least disruptive to the >>> existing oom killer implementation. I would start by planing with that >>> and see whether useful oom policies could be defined this way. I am not >>> sure what is the best way to communicate user input so that a BPF prgram >>> can consume it though. The interface should be generic enough that it >>> doesn't really pre-define any specific class of policies. Maybe we can >>> add something completely opaque to each memcg/task? Does BPF >>> infrastructure allow anything like that already? 
>>> >> >> “Maybe we can add something completely opaque to each memcg/task?” >> Sorry, I don't understand what you mean. > > What I meant to say is to add a very non-specific interface that would > would a specific BPF program understand. Mostly an opaque value from the > memcg POV. > >> I think we probably don't need to expose too much to the user, the following >> might be sufficient: >> >> noinline int bpf_get_score(struct oom_control *oc, >> struct task_struct *task); >> >> static int oom_evaluate_task() >> { >> ... >> points = bpf_get_score(oc, task); >> if (!check_points_valid(points)) >> points = oom_badness(task, oc->totalpages); >> ... >> } >> >> There are several reasons: >> >> 1. The implementation of use-defined OOM policy, such as iteration, sorting >> and other complex operations, is more suitable to be placed in the userspace >> rather than in the bpf program. It is more convenient to implement these >> operations in userspace in which the useful information (memory usage of >> each task and memcg, memory allocation speed, etc.) can also be captured. >> For example, oomd implements multiple policies[1] without kernel-space >> input. > > I do agree that userspace can handle a lot on its own and provide the > input to the BPF program to make a decision. > >> 2. Userspace apps, such as oomd, can import useful information into bpf >> program, e.g., through bpf_map, and update it periodically. For example, we >> can do the scoring directly in userspace and maintain a score hash, so that >> in the bpf program, we only need to look for the corresponding score of the >> process. > > Sure, why not. But all that is an implementation detail. We are > currently talkin about a proper abstraction and layering that would > allow what you do currently but also much more. > >> Userspace policy(oomd) >> bpf_map_update >> score_hash >> ------------------> BPF program >> look up score in >> score_hash >> ---------------> kernel space >> Just some thoughts. 
>
> I believe all the above should be possible if the BPF program is hooked at
> the oom_evaluate_task layer and allowed to bypass the default logic. The
> BPF program can process whatever data it has available. The oom scope
> iteration will be implemented already in the kernel, so all the BPF program
> has to do is to rank processes and/or memcgs if oom.group is enabled. Would
> that work for your usecase?

Yes, I think the above interface works well for our usecase. In our
scenario, we want to protect the applications with a higher priority and
try to select a lower-priority one as the victim. Specifically, we can set
a priority for each memcg in userspace. In the BPF program, we can find the
memcg to which the given process belongs, and then rank according to the
memcg's priority.

Thanks.

>
>> Thanks!
>>
>> [1] https://github.com/facebookincubator/oomd/tree/main/src/oomd/plugins
>
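The usecase described at the end of the thread — rank tasks by the priority of the memcg they belong to — could look roughly like the following userspace simulation. All names and the priority convention (lower value = preferred victim) are assumptions made for this sketch; in a real deployment the task-to-memcg mapping and the per-memcg priorities would live in BPF maps updated from userspace.

```c
/* Pick the task whose memcg carries the lowest userspace-assigned
 * priority; ties resolve to the first task encountered.
 *
 * task_memcg[i] — index of the memcg that task i belongs to
 * memcg_prio[m] — priority of memcg m (lower = better victim)
 */
static int select_victim(const int *task_memcg, const int *memcg_prio,
			 int nr_tasks)
{
	int victim = 0;

	for (int i = 1; i < nr_tasks; i++)
		if (memcg_prio[task_memcg[i]] < memcg_prio[task_memcg[victim]])
			victim = i;
	return victim;
}
```

Tasks in a high-priority memcg are thereby protected: they can only be selected once every lower-priority memcg has no tasks left, which matches the protection goal stated above.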