Message ID | 20230404115509.14299-1-ligang.bdlg@bytedance.com |
---|---|
State | New |
Headers |
From: Gang Li <ligang.bdlg@bytedance.com>
To: mhocko@suse.com, rientjes@google.com
Cc: Gang Li <ligang.bdlg@bytedance.com>, linux-kernel@vger.kernel.org
Subject: [PATCH v2] mm: oom: introduce cpuset oom
Date: Tue, 4 Apr 2023 19:55:09 +0800
Message-Id: <20230404115509.14299-1-ligang.bdlg@bytedance.com> |
Series |
[v2] mm: oom: introduce cpuset oom
Commit Message
Gang Li
April 4, 2023, 11:55 a.m. UTC
When a process in a cpuset triggers an OOM, it may kill a completely
irrelevant process on another NUMA node, which will not release any
memory for this cpuset.

It seems that `CONSTRAINT_CPUSET` is not really doing much these
days. Using `CONSTRAINT_CPUSET`, we can easily achieve node-aware OOM
killing by selecting a victim from all cpusets with the same
mems_allowed as the current cpuset.
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
---
Changes in v2:
- Select victim from all cpusets with the same mems_allowed as the current cpuset.
(David Rientjes <rientjes@google.com>)
v1:
- https://lore.kernel.org/all/20220921064710.89663-1-ligang.bdlg@bytedance.com/
---
include/linux/cpuset.h | 6 ++++++
kernel/cgroup/cpuset.c | 28 ++++++++++++++++++++++++++++
mm/oom_kill.c | 4 ++++
3 files changed, 38 insertions(+)
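As a rough illustration of the selection policy this patch proposes — consider only tasks whose cpuset has exactly the same mems_allowed as the triggering task's cpuset — here is a userspace C model. The struct layout, the `pick_victim` helper, and the use of raw `rss` as the badness score are hypothetical simplifications, not the kernel's actual data structures or `oom_badness()` logic:

```c
#include <assert.h>
#include <string.h>

struct cpuset { unsigned long mems_allowed; };  /* nodemask modeled as a bitmask */
struct task   { const char *name; long rss; struct cpuset *cs; };

/*
 * Consider only tasks whose cpuset has exactly the same mems_allowed as
 * the triggering task's cpuset, then pick the one with the largest rss.
 * (The real kernel scores candidates with oom_badness(), not raw rss.)
 */
static struct task *pick_victim(struct task *tasks, int n,
                                const struct task *current_task)
{
    struct task *chosen = NULL;

    for (int i = 0; i < n; i++) {
        if (tasks[i].cs->mems_allowed != current_task->cs->mems_allowed)
            continue;                 /* different NUMA partition: ignore */
        if (!chosen || tasks[i].rss > chosen->rss)
            chosen = &tasks[i];
    }
    return chosen;
}
```

With two partitions (node 0 and node 1), a huge task in the node-1 partition is never chosen when the OOM is triggered from the node-0 partition — which is exactly the "irrelevant victim" problem the changelog describes.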
Comments
[CC cpuset people]

On Tue 04-04-23 19:55:09, Gang Li wrote:
> When a process in cpuset triggers oom, it may kill a completely
> irrelevant process on another numa node, which will not release any
> memory for this cpuset.
>
> It seems that `CONSTRAINT_CPUSET` is not really doing much these
> days. Using CONSTRAINT_CPUSET, we can easily achieve node aware oom
> by selecting victim from all cpusets with the same mems_allowed as
> the current cpuset.

This should go into more detail about the usecase and testing, and
ideally also spend a couple of words on how CONSTRAINT_CPUSET is
actually implemented, because this is not really immediately obvious. An
example of before/after behavior would have been really nice as well.

You should also go into much more detail about how oom victims are
actually evaluated.

> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
> ---
> [...]
>  include/linux/cpuset.h | 6 ++++++
>  kernel/cgroup/cpuset.c | 28 ++++++++++++++++++++++++++++
>  mm/oom_kill.c          | 4 ++++
>  3 files changed, 38 insertions(+)

As this is a userspace-visible change it should also be documented
somewhere in Documentation.

I am not really familiar with cpuset internals so I cannot really judge
the cpuset_cgroup_scan_tasks implementation.

The oom report should be explicit about this being CPUSET-specific oom
handling, so that unexpected behavior can be nailed down to this change;
with that, I do not see a major concern from the oom POV. Nevertheless
it would still be good to consider whether this should be opt-in
behavior. I personally do not see a major problem, because most cpuset
deployments I have seen tend to be well partitioned, so the new behavior
makes more sense.
> [full patch quoted; see the diff below]
On 4/4/23 10:31, Michal Hocko wrote:
> [CC cpuset people]
> [...]
>> +int cpuset_cgroup_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
>> +{
>> [...]
>> +	rcu_read_lock();
>> +	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
>> +		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
>> +			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
>> +			while (!ret && (task = css_task_iter_next(&it)))
>> +				ret = fn(task, arg);
>> +			css_task_iter_end(&it);
>> +		}
>> +	}
>> +	rcu_read_unlock();
>> +	return ret;
>> +}

You will also need to take cpuset_rwsem to make sure that cpusets are
stable. BTW, the cpuset_cgroup_scan_tasks() name is kind of redundant; I
would suggest you just name it cpuset_scan_tasks(). Please also add a
doctext comment about its purpose and how it should be used.

Cheers,
Longman
On 2023/4/4 22:31, Michal Hocko wrote:
> [CC cpuset people]
> [...]

On 2023/4/5 01:24, Waiman Long wrote:
> [...]

Thank you all. I will make the following changes and send v3.

1. Provide more details about the use case, testing, and implementation of
   CONSTRAINT_CPUSET, including an example of before/after behavior.
2. Provide more details about how OOM victims are evaluated.
3. Document the userspace-visible change in Documentation.
4. Add an option /proc/sys/vm/oom_in_cpuset for cpuset oom.
5. Rename cpuset_cgroup_scan_tasks() to cpuset_scan_tasks() and add a
   doctext comment about its purpose and how it should be used.
6. Take cpuset_rwsem to ensure that cpusets are stable.
On 2023/4/4 22:31, Michal Hocko wrote:
> [CC cpuset people]
>
> The oom report should be explicit about this being CPUSET specific oom
> handling so unexpected behavior could be nailed down to this change

Yes, the oom message looks like this:

```
[   65.470256] oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=test,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-4.scope,task=memkiller,pid=1968,uid=0
Apr  4 09:08:53 debian kernel: [   65.481992] Out of memory: Killed process 1968 (memkiller) total-vm:2099436kB, anon-rss:1971712kB, file-rss:1024kB, shmem-rss:0kB, UID:0 pgtables:3904kB oom_score_adj:0
```

> do not see a major concern from the oom POV. Nevertheless it would be
> still good to consider whether this should be an opt-in behavior. I
> personally do not see a major problem because most cpuset deployments I
> have seen tend to be well partitioned so the new behavior makes more
> sense.

Since memcg oom is mandatory, cpuset oom should preferably be mandatory
as well. But we can still consider adding an option for the user.

How about introducing `/proc/sys/vm/oom_in_cpuset`?
On Thu 06-04-23 11:22:16, Gang Li wrote:
> On 2023/4/4 22:31, Michal Hocko wrote:
> > [...]
> Yes, the oom message looks like this:
> [...]
>
> Since memcgroup oom is mandatory, cpuset oom should preferably be mandatory
> as well. But we can still consider adding an option to user.
>
> How about introduce `/proc/sys/vm/oom_in_cpuset`?

As I've said, I do not see any major concern having this behavior
implicit; the behavior makes semantic sense and it is also much more
likely that the selected oom victim will be a better choice than what we
do currently, especially on properly partitioned systems with large
memory consumers in each partition (cpuset).

That being said, I would just not add any sysctl at this stage and
rather document the decision. If we ever encounter usecase(s) which
would regress based on this change, we can introduce the sysctl later.
I see, then let's do this as you said.

On 2023/4/6 16:30, Michal Hocko wrote:
> That being said, I would just not add any sysctl at this stage and
> rather document the decision. If we ever encounter usecase(s) which
> would regress based on this change we can introduce the sysctl later.
On 2023/4/5 01:24, Waiman Long wrote:
> You will also need to take cpuset_rwsem to make sure that cpusets are
> stable. BTW, the cpuset_cgroup_scan_tasks() name is kind of redundant. I
> will suggest you just name it as cpuset_scan_tasks(). Please also add a

memcg oom uses `mem_cgroup_scan_tasks`. How about keeping
`cpuset_cgroup_scan_tasks` for naming consistency?

```
static void select_bad_process(struct oom_control *oc)
{
	oc->chosen_points = LONG_MIN;

	if (is_memcg_oom(oc))
		mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
	else if (oc->constraint == CONSTRAINT_CPUSET)
		cpuset_cgroup_scan_tasks(oom_evaluate_task, oc);
	else {
		...
	}
}
```
On 4/6/23 09:02, Gang Li wrote:
> On 2023/4/5 01:24, Waiman Long wrote:
>> [...]
>
> memcg oom uses `mem_cgroup_scan_tasks`.
> How about keeping `cpuset_cgroup_scan_tasks` for naming consistency?

The "memory cgroup" term is used to identify code related to the memory
cgroup controller. Cpuset, on the other hand, is a well-known cgroup
controller; "cpuset cgroup" is not a term that people would normally
use. That is why the name is awkward.

Cheers,
Longman
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 980b76a1237e..fc244141bd52 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -171,6 +171,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 	task_unlock(current);
 }
 
+int cpuset_cgroup_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg);
+
 #else /* !CONFIG_CPUSETS */
 
 static inline bool cpusets_enabled(void) { return false; }
@@ -287,6 +289,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
 	return false;
 }
 
+static inline int cpuset_cgroup_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
+{
+	return 0;
+}
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index bc4dcfd7bee5..b009c98ca19e 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4013,6 +4013,34 @@ void cpuset_print_current_mems_allowed(void)
 	rcu_read_unlock();
 }
 
+int cpuset_cgroup_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
+{
+	int ret = 0;
+	struct css_task_iter it;
+	struct task_struct *task;
+	struct cpuset *cs;
+	struct cgroup_subsys_state *pos_css;
+
+	/*
+	 * Situation gets complex with overlapping nodemasks in different cpusets.
+	 * TODO: Maybe we should calculate the "distance" between different mems_allowed.
+	 *
+	 * But for now, let's make it simple. Just iterate through all cpusets
+	 * with the same mems_allowed as the current cpuset.
+	 */
+	rcu_read_lock();
+	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
+			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
+			while (!ret && (task = css_task_iter_next(&it)))
+				ret = fn(task, arg);
+			css_task_iter_end(&it);
+		}
+	}
+	rcu_read_unlock();
+	return ret;
+}
+
 /*
  * Collection of memory_pressure is suppressed unless
  * this flag is enabled by writing "1" to the special
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 044e1eed720e..205982a69b30 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -367,6 +367,8 @@ static void select_bad_process(struct oom_control *oc)
 
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
+	else if (oc->constraint == CONSTRAINT_CPUSET)
+		cpuset_cgroup_scan_tasks(oom_evaluate_task, oc);
 	else {
 		struct task_struct *p;
 
@@ -427,6 +429,8 @@ static void dump_tasks(struct oom_control *oc)
 
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
+	else if (oc->constraint == CONSTRAINT_CPUSET)
+		cpuset_cgroup_scan_tasks(dump_task, oc);
 	else {
 		struct task_struct *p;
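The callback contract of the scan helper above — visit the tasks of every cpuset whose nodemask matches the caller's, and abort the whole scan as soon as the callback returns nonzero — can be modeled in plain userspace C. All names and the flat array layout here are illustrative stand-ins for the kernel's cpuset hierarchy and `css_task_iter`, not real kernel APIs:

```c
#include <assert.h>

struct cs { unsigned long mems; int ntasks; const int *tasks; };

/*
 * Visit every task of every cpuset whose nodemask matches cur_mems,
 * calling fn(task, arg); a nonzero return from fn aborts the whole scan
 * and becomes the return value (mirroring how mem_cgroup_scan_tasks()
 * and the proposed cpuset helper behave).
 */
static int scan_tasks(const struct cs *sets, int nsets, unsigned long cur_mems,
                      int (*fn)(int task, void *arg), void *arg)
{
    int ret = 0;

    for (int i = 0; i < nsets && !ret; i++) {
        if (sets[i].mems != cur_mems)
            continue;                          /* non-matching partition */
        for (int j = 0; j < sets[i].ntasks && !ret; j++)
            ret = fn(sets[i].tasks[j], arg);   /* nonzero stops everything */
    }
    return ret;
}

/* Example callback: count visited tasks, stop at a sentinel task id. */
static int visit(int task, void *arg)
{
    int *seen = arg;
    (*seen)++;
    return task == 42;   /* nonzero aborts the scan early */
}
```

The early-abort behavior is what lets a callback like `oom_evaluate_task` cut the scan short once it finds an already-dying task.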