[2/2] cgroup/cpuset: Optimize cpuset_attach() on v2
Commit Message
It was found that with the default hierarchy, enabling cpuset in the
child cgroups can trigger a cpuset_attach() call in each of the child
cgroups that have tasks, even when there is no change in effective cpus
and mems. If there are many processes in those child cgroups, iterating
over all of their tasks burns quite a lot of cpu cycles without doing
any useful work. Optimize this case by comparing the old and new
cpusets and skipping the useless update when there is no change in
effective cpus and mems.
Also, mems_allowed is less likely to change than cpus_allowed, so skip
changing mm if there is no change in effective_mems and
CS_MEMORY_MIGRATE is not set.
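In concrete terms (condensed from the diff at the bottom of this page;
cs is the destination cpuset, oldcs the source cpuset saved in
cpuset_attach_old_cs), the two skip conditions are:

	cpus_updated = !cpumask_equal(cs->effective_cpus,
				      oldcs->effective_cpus);
	mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);

	/* v2 fast path: nothing changed, skip the per-task iteration */
	if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
	    !cpus_updated && !mems_updated)
		goto out;

	...

	/* mm rebinding is expensive; skip it when it cannot matter */
	if (!is_memory_migrate(cs) && !mems_updated)
		goto out;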
By inserting some instrumentation code and running a simple command in
a container 200 times on a cgroup v2 system, it was found that all the
cpuset_attach() calls (401 in total) were skipped as there was no
change in effective cpus and mems.
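The instrumentation itself is not part of the posted patch. A
hypothetical version could be as small as a counter bumped on the fast
path, for example:

	/* hypothetical debug counter, not part of the posted patch */
	static atomic_t cpuset_attach_skips = ATOMIC_INIT(0);

	if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
	    !cpus_updated && !mems_updated) {
		pr_debug("cpuset_attach: %d calls skipped so far\n",
			 atomic_inc_return(&cpuset_attach_skips));
		cpuset_attach_nodemask_to = cs->effective_mems;
		goto out;
	}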
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 24 +++++++++++++++++++++++-
1 file changed, 23 insertions(+), 1 deletion(-)
Comments
On Sat, Nov 12, 2022 at 05:19:39PM -0500, Waiman Long <longman@redhat.com> wrote:
> + /*
> + * In the default hierarchy, enabling cpuset in the child cgroups
> + * will trigger a number of cpuset_attach() calls with no change
> + * in effective cpus and mems. In that case, we can optimize out
> + * by skipping the task iteration and update.
> + */
> + if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
> + !cpus_updated && !mems_updated) {
I'm just wondering -- why is this limited to the default hierarchy only?
IOW why can't v1 skip too (when favorable constness between cpusets).
Thanks,
Michal
On 11/21/22 13:50, Michal Koutný wrote:
> On Sat, Nov 12, 2022 at 05:19:39PM -0500, Waiman Long <longman@redhat.com> wrote:
>> + /*
>> + * In the default hierarchy, enabling cpuset in the child cgroups
>> + * will trigger a number of cpuset_attach() calls with no change
>> + * in effective cpus and mems. In that case, we can optimize out
>> + * by skipping the task iteration and update.
>> + */
>> + if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
>> + !cpus_updated && !mems_updated) {
> I'm just wondering -- why is this limited to the default hierarchy only?
> IOW why can't v1 skip too (when favorable constness between cpusets).
Cpuset v1 is a bit more complex. Besides the cpu and node masks, it
also has other flags, like the spread flags, that we would need to
check for changes. Unlike cpuset v2, I don't think it is likely that
cpuset_attach() will be called without changes in the cpu and node
masks. That is the reason why this patch focuses on v2. If it turns
out that this is not the case, we can always extend the support to v1.
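For illustration only, a v1 variant of the skip would have to fold
those extra flags into the test. A rough, untested sketch, reusing the
existing is_spread_page()/is_spread_slab() helpers in
kernel/cgroup/cpuset.c:

	/* hypothetical v1 extension, not part of this patch */
	if (!cpus_updated && !mems_updated &&
	    is_spread_page(cs) == is_spread_page(oldcs) &&
	    is_spread_slab(cs) == is_spread_slab(oldcs)) {
		cpuset_attach_nodemask_to = cs->effective_mems;
		goto out;
	}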
Cheers,
Longman
On 2022-11-12 17:19:39 [-0500], Waiman Long wrote:
> It was found that with the default hierarchy, enabling cpuset in the
> child cgroups can trigger a cpuset_attach() call in each of the child
> cgroups that have tasks with no change in effective cpus and mems. If
> there are many processes in those child cgroups, it will burn quite a
> lot of cpu cycles iterating all the tasks without doing useful work.
Thank you.
So this preserves the CPU mask upon attaching the cpuset container.
| ~# taskset -pc $$
| pid 1564's current affinity list: 0-2
default mask after boot due to isolcpus=
| ~# echo "+cpu" >> /sys/fs/cgroup/cgroup.subtree_control ; echo "+cpuset" >> /sys/fs/cgroup/cgroup.subtree_control
| ~# taskset -pc $$
| pid 1564's current affinity list: 0-2
okay.
| ~# echo 1-3 > /sys/fs/cgroup/user.slice/cpuset.cpus
| ~# taskset -pc $$
| pid 1564's current affinity list: 1-3
wiped away.
| ~# taskset -pc 2-3 $$
| pid 1564's current affinity list: 1-3
| pid 1564's new affinity list: 2,3
| ~# echo 2-4 > /sys/fs/cgroup/user.slice/cpuset.cpus
| ~# taskset -pc 2-3 $$
| pid 1564's current affinity list: 2,3
| pid 1564's new affinity list: 2,3
But the affinity is preserved if the mask was changed on purpose.
Sebastian
@@ -2513,12 +2513,28 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	struct cgroup_subsys_state *css;
 	struct cpuset *cs;
 	struct cpuset *oldcs = cpuset_attach_old_cs;
+	bool cpus_updated, mems_updated;
 
 	cgroup_taskset_first(tset, &css);
 	cs = css_cs(css);
 
 	lockdep_assert_cpus_held();	/* see cgroup_attach_lock() */
 	percpu_down_write(&cpuset_rwsem);
+	cpus_updated = !cpumask_equal(cs->effective_cpus,
+				      oldcs->effective_cpus);
+	mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
+
+	/*
+	 * In the default hierarchy, enabling cpuset in the child cgroups
+	 * will trigger a number of cpuset_attach() calls with no change
+	 * in effective cpus and mems. In that case, we can optimize out
+	 * by skipping the task iteration and update.
+	 */
+	if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
+	    !cpus_updated && !mems_updated) {
+		cpuset_attach_nodemask_to = cs->effective_mems;
+		goto out;
+	}
 
 	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
 
@@ -2539,9 +2555,14 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 
 	/*
 	 * Change mm for all threadgroup leaders. This is expensive and may
-	 * sleep and should be moved outside migration path proper.
+	 * sleep and should be moved outside migration path proper. Skip it
+	 * if there is no change in effective_mems and CS_MEMORY_MIGRATE is
+	 * not set.
 	 */
 	cpuset_attach_nodemask_to = cs->effective_mems;
+	if (!is_memory_migrate(cs) && !mems_updated)
+		goto out;
+
 	cgroup_taskset_for_each_leader(leader, css, tset) {
 		struct mm_struct *mm = get_task_mm(leader);
 
@@ -2564,6 +2585,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 		}
 	}
 
+out:
 	cs->old_mems_allowed = cpuset_attach_nodemask_to;
 
 	cs->attach_in_progress--;