Message ID | 20230404115509.14299-1-ligang.bdlg@bytedance.com |
---|---|
State | New |
Headers |
From: Gang Li <ligang.bdlg@bytedance.com>
To: mhocko@suse.com, rientjes@google.com
Cc: Gang Li <ligang.bdlg@bytedance.com>, linux-kernel@vger.kernel.org
Subject: [PATCH v2] mm: oom: introduce cpuset oom
Date: Tue, 4 Apr 2023 19:55:09 +0800
Message-Id: <20230404115509.14299-1-ligang.bdlg@bytedance.com> |
Series |
[v2] mm: oom: introduce cpuset oom
Commit Message
Gang Li
April 4, 2023, 11:55 a.m. UTC
When a process in a cpuset triggers an OOM, it may kill a completely
irrelevant process on another NUMA node, which will not release any
memory for this cpuset.

It seems that `CONSTRAINT_CPUSET` is not really doing much these
days. Using `CONSTRAINT_CPUSET`, we can easily achieve node-aware OOM
killing by selecting a victim from all cpusets with the same
mems_allowed as the current cpuset.
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
---
Changes in v2:
- Select victim from all cpusets with the same mems_allowed as the current cpuset.
(David Rientjes <rientjes@google.com>)
v1:
- https://lore.kernel.org/all/20220921064710.89663-1-ligang.bdlg@bytedance.com/
---
include/linux/cpuset.h | 6 ++++++
kernel/cgroup/cpuset.c | 28 ++++++++++++++++++++++++++++
mm/oom_kill.c | 4 ++++
3 files changed, 38 insertions(+)
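As a rough illustration of the selection policy this patch proposes — consider only tasks whose cpuset has exactly the same mems_allowed as the triggering task's cpuset — here is a userspace C model. The struct layout, the `pick_victim` helper, and the use of raw `rss` as the badness score are hypothetical simplifications, not the kernel's actual data structures or `oom_badness()` logic:

```c
#include <assert.h>
#include <string.h>

struct cpuset { unsigned long mems_allowed; };  /* nodemask modeled as a bitmask */
struct task   { const char *name; long rss; struct cpuset *cs; };

/*
 * Consider only tasks whose cpuset has exactly the same mems_allowed as
 * the triggering task's cpuset, then pick the one with the largest rss.
 * (The real kernel scores candidates with oom_badness(), not raw rss.)
 */
static struct task *pick_victim(struct task *tasks, int n,
                                const struct task *current_task)
{
    struct task *chosen = NULL;

    for (int i = 0; i < n; i++) {
        if (tasks[i].cs->mems_allowed != current_task->cs->mems_allowed)
            continue;                 /* different NUMA partition: ignore */
        if (!chosen || tasks[i].rss > chosen->rss)
            chosen = &tasks[i];
    }
    return chosen;
}
```

With two partitions (node 0 and node 1), a huge task in the node-1 partition is never chosen when the OOM is triggered from the node-0 partition — which is exactly the "irrelevant victim" problem the changelog describes.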
Comments
[CC cpuset people]

On Tue 04-04-23 19:55:09, Gang Li wrote:
> When a process in cpuset triggers oom, it may kill a completely
> irrelevant process on another numa node, which will not release any
> memory for this cpuset.
>
> It seems that `CONSTRAINT_CPUSET` is not really doing much these
> days. Using CONSTRAINT_CPUSET, we can easily achieve node aware oom
> by selecting victim from all cpusets with the same mems_allowed as
> the current cpuset.

This should go into more detail about the usecase and testing, and
ideally also spend a couple of words on how CONSTRAINT_CPUSET is
actually implemented, because this is not really immediately obvious. An
example of before/after behavior would have been really nice as well.

You should also go into much more detail about how oom victims are
actually evaluated.

> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
> ---
> [...]
>  include/linux/cpuset.h | 6 ++++++
>  kernel/cgroup/cpuset.c | 28 ++++++++++++++++++++++++++++
>  mm/oom_kill.c          | 4 ++++
>  3 files changed, 38 insertions(+)

As this is a userspace-visible change it should also be documented
somewhere in Documentation.

I am not really familiar with cpuset internals so I cannot really judge
the cpuset_cgroup_scan_tasks implementation.

The oom report should be explicit about this being CPUSET-specific oom
handling, so that unexpected behavior can be nailed down to this change;
with that, I do not see a major concern from the oom POV. Nevertheless
it would still be good to consider whether this should be opt-in
behavior. I personally do not see a major problem, because most cpuset
deployments I have seen tend to be well partitioned, so the new behavior
makes more sense.
> [full patch quoted; see the diff below]
On 4/4/23 10:31, Michal Hocko wrote:
> [CC cpuset people]
> [...]
>> +int cpuset_cgroup_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
>> +{
>> [...]
>> +	rcu_read_lock();
>> +	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
>> +		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
>> +			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
>> +			while (!ret && (task = css_task_iter_next(&it)))
>> +				ret = fn(task, arg);
>> +			css_task_iter_end(&it);
>> +		}
>> +	}
>> +	rcu_read_unlock();
>> +	return ret;
>> +}

You will also need to take cpuset_rwsem to make sure that cpusets are
stable. BTW, the cpuset_cgroup_scan_tasks() name is kind of redundant; I
would suggest you just name it cpuset_scan_tasks(). Please also add a
doctext comment about its purpose and how it should be used.

Cheers,
Longman
On 2023/4/4 22:31, Michal Hocko wrote:
> [CC cpuset people]
> [...]

On 2023/4/5 01:24, Waiman Long wrote:
> [...]

Thank you all. I will make the following changes and send v3.

1. Provide more details about the use case, testing, and implementation of
   CONSTRAINT_CPUSET, including an example of before/after behavior.
2. Provide more details about how OOM victims are evaluated.
3. Document the userspace-visible change in Documentation.
4. Add an option /proc/sys/vm/oom_in_cpuset for cpuset oom.
5. Rename cpuset_cgroup_scan_tasks() to cpuset_scan_tasks() and add a
   doctext comment about its purpose and how it should be used.
6. Take cpuset_rwsem to ensure that cpusets are stable.
On 2023/4/4 22:31, Michal Hocko wrote:
> [CC cpuset people]
>
> The oom report should be explicit about this being CPUSET specific oom
> handling so unexpected behavior could be nailed down to this change

Yes, the oom message looks like this:

```
[   65.470256] oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=test,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-4.scope,task=memkiller,pid=1968,uid=0
Apr  4 09:08:53 debian kernel: [   65.481992] Out of memory: Killed process 1968 (memkiller) total-vm:2099436kB, anon-rss:1971712kB, file-rss:1024kB, shmem-rss:0kB, UID:0 pgtables:3904kB oom_score_adj:0
```

> do not see a major concern from the oom POV. Nevertheless it would be
> still good to consider whether this should be an opt-in behavior. I
> personally do not see a major problem because most cpuset deployments I
> have seen tend to be well partitioned so the new behavior makes more
> sense.

Since memcg oom is mandatory, cpuset oom should preferably be mandatory
as well. But we can still consider adding an option for the user.

How about introducing `/proc/sys/vm/oom_in_cpuset`?
On Thu 06-04-23 11:22:16, Gang Li wrote:
> On 2023/4/4 22:31, Michal Hocko wrote:
> > [...]
> Yes, the oom message looks like this:
> [...]
>
> Since memcgroup oom is mandatory, cpuset oom should preferably be mandatory
> as well. But we can still consider adding an option to user.
>
> How about introduce `/proc/sys/vm/oom_in_cpuset`?

As I've said, I do not see any major concern having this behavior
implicit; the behavior makes semantic sense and it is also much more
likely that the selected oom victim will be a better choice than what we
do currently, especially on properly partitioned systems with large
memory consumers in each partition (cpuset).

That being said, I would just not add any sysctl at this stage and
rather document the decision. If we ever encounter usecase(s) which
would regress based on this change, we can introduce the sysctl later.
I see, then let's do this as you said.

On 2023/4/6 16:30, Michal Hocko wrote:
> That being said, I would just not add any sysctl at this stage and
> rather document the decision. If we ever encounter usecase(s) which
> would regress based on this change we can introduce the sysctl later.
On 2023/4/5 01:24, Waiman Long wrote:
> You will also need to take cpuset_rwsem to make sure that cpusets are
> stable. BTW, the cpuset_cgroup_scan_tasks() name is kind of redundant. I
> will suggest you just name it as cpuset_scan_tasks(). Please also add a

memcg oom uses `mem_cgroup_scan_tasks`. How about keeping
`cpuset_cgroup_scan_tasks` for naming consistency?

```
static void select_bad_process(struct oom_control *oc)
{
	oc->chosen_points = LONG_MIN;

	if (is_memcg_oom(oc))
		mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
	else if (oc->constraint == CONSTRAINT_CPUSET)
		cpuset_cgroup_scan_tasks(oom_evaluate_task, oc);
	else {
		...
	}
}
```
On 4/6/23 09:02, Gang Li wrote:
> On 2023/4/5 01:24, Waiman Long wrote:
>> [...]
>
> memcg oom uses `mem_cgroup_scan_tasks`.
> How about keeping `cpuset_cgroup_scan_tasks` for naming consistency?

The "memory cgroup" term is used to identify code related to the memory
cgroup controller. Cpuset, on the other hand, is a well-known cgroup
controller; "cpuset cgroup" is not a term that people would normally
use. That is why the name is awkward.

Cheers,
Longman
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 980b76a1237e..fc244141bd52 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -171,6 +171,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 	task_unlock(current);
 }
 
+int cpuset_cgroup_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg);
+
 #else /* !CONFIG_CPUSETS */
 
 static inline bool cpusets_enabled(void) { return false; }
@@ -287,6 +289,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
 	return false;
 }
 
+static inline int cpuset_cgroup_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
+{
+	return 0;
+}
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index bc4dcfd7bee5..b009c98ca19e 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4013,6 +4013,34 @@ void cpuset_print_current_mems_allowed(void)
 	rcu_read_unlock();
 }
 
+int cpuset_cgroup_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
+{
+	int ret = 0;
+	struct css_task_iter it;
+	struct task_struct *task;
+	struct cpuset *cs;
+	struct cgroup_subsys_state *pos_css;
+
+	/*
+	 * Situation gets complex with overlapping nodemasks in different cpusets.
+	 * TODO: Maybe we should calculate the "distance" between different mems_allowed.
+	 *
+	 * But for now, let's make it simple. Just iterate through all cpusets
+	 * with the same mems_allowed as the current cpuset.
+	 */
+	rcu_read_lock();
+	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
+			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
+			while (!ret && (task = css_task_iter_next(&it)))
+				ret = fn(task, arg);
+			css_task_iter_end(&it);
+		}
+	}
+	rcu_read_unlock();
+	return ret;
+}
+
 /*
  * Collection of memory_pressure is suppressed unless
  * this flag is enabled by writing "1" to the special
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 044e1eed720e..205982a69b30 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -367,6 +367,8 @@ static void select_bad_process(struct oom_control *oc)
 
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
+	else if (oc->constraint == CONSTRAINT_CPUSET)
+		cpuset_cgroup_scan_tasks(oom_evaluate_task, oc);
 	else {
 		struct task_struct *p;
 
@@ -427,6 +429,8 @@ static void dump_tasks(struct oom_control *oc)
 
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
+	else if (oc->constraint == CONSTRAINT_CPUSET)
+		cpuset_cgroup_scan_tasks(dump_task, oc);
 	else {
 		struct task_struct *p;
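The callback contract of the scan helper above — visit the tasks of every cpuset whose nodemask matches the caller's, and abort the whole scan as soon as the callback returns nonzero — can be modeled in plain userspace C. All names and the flat array layout here are illustrative stand-ins for the kernel's cpuset hierarchy and `css_task_iter`, not real kernel APIs:

```c
#include <assert.h>

struct cs { unsigned long mems; int ntasks; const int *tasks; };

/*
 * Visit every task of every cpuset whose nodemask matches cur_mems,
 * calling fn(task, arg); a nonzero return from fn aborts the whole scan
 * and becomes the return value (mirroring how mem_cgroup_scan_tasks()
 * and the proposed cpuset helper behave).
 */
static int scan_tasks(const struct cs *sets, int nsets, unsigned long cur_mems,
                      int (*fn)(int task, void *arg), void *arg)
{
    int ret = 0;

    for (int i = 0; i < nsets && !ret; i++) {
        if (sets[i].mems != cur_mems)
            continue;                          /* non-matching partition */
        for (int j = 0; j < sets[i].ntasks && !ret; j++)
            ret = fn(sets[i].tasks[j], arg);   /* nonzero stops everything */
    }
    return ret;
}

/* Example callback: count visited tasks, stop at a sentinel task id. */
static int visit(int task, void *arg)
{
    int *seen = arg;
    (*seen)++;
    return task == 42;   /* nonzero aborts the scan early */
}
```

The early-abort behavior is what lets a callback like `oom_evaluate_task` cut the scan short once it finds an already-dying task.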