[2/2] cgroup/cpuset: Optimize cpuset_attach() on v2
Commit Message
It was found that with the default hierarchy, enabling cpuset in the
child cgroups can trigger a cpuset_attach() call in each of the child
cgroups that have tasks, even when there is no change in effective cpus
and mems. If there are many processes in those child cgroups, iterating
over all of their tasks burns quite a lot of cpu cycles without doing
any useful work. Optimize this case by comparing the old and new
cpusets and skipping the useless update when there is no change in
effective cpus and mems.
Also, mems_allowed is less likely to change than cpus_allowed, so skip
changing mm if there is no change in effective_mems and
CS_MEMORY_MIGRATE is not set.
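In concrete terms (condensed from the diff at the bottom of this page;
cs is the destination cpuset, oldcs the source cpuset saved in
cpuset_attach_old_cs), the two skip conditions are:

	cpus_updated = !cpumask_equal(cs->effective_cpus,
				      oldcs->effective_cpus);
	mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);

	/* v2 fast path: nothing changed, skip the per-task iteration */
	if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
	    !cpus_updated && !mems_updated)
		goto out;

	...

	/* mm rebinding is expensive; skip it when it cannot matter */
	if (!is_memory_migrate(cs) && !mems_updated)
		goto out;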
By inserting some instrumentation code and running a simple command in
a container 200 times on a cgroup v2 system, it was found that all the
cpuset_attach() calls (401 in total) were skipped as there was no
change in effective cpus and mems.
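The instrumentation itself is not part of the posted patch. A
hypothetical version could be as small as a counter bumped on the fast
path, for example:

	/* hypothetical debug counter, not part of the posted patch */
	static atomic_t cpuset_attach_skips = ATOMIC_INIT(0);

	if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
	    !cpus_updated && !mems_updated) {
		pr_debug("cpuset_attach: %d calls skipped so far\n",
			 atomic_inc_return(&cpuset_attach_skips));
		cpuset_attach_nodemask_to = cs->effective_mems;
		goto out;
	}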
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 24 +++++++++++++++++++++++-
1 file changed, 23 insertions(+), 1 deletion(-)
Comments
On Sat, Nov 12, 2022 at 05:19:39PM -0500, Waiman Long <longman@redhat.com> wrote:
> + /*
> + * In the default hierarchy, enabling cpuset in the child cgroups
> + * will trigger a number of cpuset_attach() calls with no change
> + * in effective cpus and mems. In that case, we can optimize out
> + * by skipping the task iteration and update.
> + */
> + if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
> + !cpus_updated && !mems_updated) {
I'm just wondering -- why is this limited to the default hierarchy only?
IOW why can't v1 skip too (when favorable constness between cpusets).
Thanks,
Michal
On 11/21/22 13:50, Michal Koutný wrote:
> On Sat, Nov 12, 2022 at 05:19:39PM -0500, Waiman Long <longman@redhat.com> wrote:
>> + /*
>> + * In the default hierarchy, enabling cpuset in the child cgroups
>> + * will trigger a number of cpuset_attach() calls with no change
>> + * in effective cpus and mems. In that case, we can optimize out
>> + * by skipping the task iteration and update.
>> + */
>> + if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
>> + !cpus_updated && !mems_updated) {
> I'm just wondering -- why is this limited to the default hierarchy only?
> IOW why can't v1 skip too (when favorable constness between cpusets).
Cpuset v1 is a bit more complex. Besides the cpu and node masks, it
also has other flags, like the spread flags, that we would need to
check for changes. Unlike cpuset v2, I don't think it is likely that
cpuset_attach() will be called without changes in the cpu and node
masks. That is the reason why this patch focuses on v2. If it turns
out that this is not the case, we can always extend the support to v1.
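For illustration only, a v1 variant of the skip would have to fold
those extra flags into the test. A rough, untested sketch, reusing the
existing is_spread_page()/is_spread_slab() helpers in
kernel/cgroup/cpuset.c:

	/* hypothetical v1 extension, not part of this patch */
	if (!cpus_updated && !mems_updated &&
	    is_spread_page(cs) == is_spread_page(oldcs) &&
	    is_spread_slab(cs) == is_spread_slab(oldcs)) {
		cpuset_attach_nodemask_to = cs->effective_mems;
		goto out;
	}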
Cheers,
Longman
On 2022-11-12 17:19:39 [-0500], Waiman Long wrote:
> It was found that with the default hierarchy, enabling cpuset in the
> child cgroups can trigger a cpuset_attach() call in each of the child
> cgroups that have tasks with no change in effective cpus and mems. If
> there are many processes in those child cgroups, it will burn quite a
> lot of cpu cycles iterating all the tasks without doing useful work.
Thank you.
So this preserves the CPU mask upon attaching the cpuset container.
| ~# taskset -pc $$
| pid 1564's current affinity list: 0-2
default mask after boot due to isolcpus=
| ~# echo "+cpu" >> /sys/fs/cgroup/cgroup.subtree_control ; echo "+cpuset" >> /sys/fs/cgroup/cgroup.subtree_control
| ~# taskset -pc $$
| pid 1564's current affinity list: 0-2
okay.
| ~# echo 1-3 > /sys/fs/cgroup/user.slice/cpuset.cpus
| ~# taskset -pc $$
| pid 1564's current affinity list: 1-3
wiped away.
| ~# taskset -pc 2-3 $$
| pid 1564's current affinity list: 1-3
| pid 1564's new affinity list: 2,3
| ~# echo 2-4 > /sys/fs/cgroup/user.slice/cpuset.cpus
| ~# taskset -pc 2-3 $$
| pid 1564's current affinity list: 2,3
| pid 1564's new affinity list: 2,3
But the affinity is preserved if the mask was changed on purpose.
Sebastian
@@ -2513,12 +2513,28 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	struct cgroup_subsys_state *css;
 	struct cpuset *cs;
 	struct cpuset *oldcs = cpuset_attach_old_cs;
+	bool cpus_updated, mems_updated;
 
 	cgroup_taskset_first(tset, &css);
 	cs = css_cs(css);
 
 	lockdep_assert_cpus_held();	/* see cgroup_attach_lock() */
 	percpu_down_write(&cpuset_rwsem);
+	cpus_updated = !cpumask_equal(cs->effective_cpus,
+				      oldcs->effective_cpus);
+	mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
+
+	/*
+	 * In the default hierarchy, enabling cpuset in the child cgroups
+	 * will trigger a number of cpuset_attach() calls with no change
+	 * in effective cpus and mems. In that case, we can optimize out
+	 * by skipping the task iteration and update.
+	 */
+	if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
+	    !cpus_updated && !mems_updated) {
+		cpuset_attach_nodemask_to = cs->effective_mems;
+		goto out;
+	}
 
 	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
 
@@ -2539,9 +2555,14 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 
 	/*
 	 * Change mm for all threadgroup leaders. This is expensive and may
-	 * sleep and should be moved outside migration path proper.
+	 * sleep and should be moved outside migration path proper. Skip it
+	 * if there is no change in effective_mems and CS_MEMORY_MIGRATE is
+	 * not set.
 	 */
 	cpuset_attach_nodemask_to = cs->effective_mems;
+	if (!is_memory_migrate(cs) && !mems_updated)
+		goto out;
+
 	cgroup_taskset_for_each_leader(leader, css, tset) {
 		struct mm_struct *mm = get_task_mm(leader);
 
@@ -2564,6 +2585,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 		}
 	}
 
+out:
 	cs->old_mems_allowed = cpuset_attach_nodemask_to;
 
 	cs->attach_in_progress--;