Message ID | 20221123082157.71326-1-haifeng.xu@shopee.com
---|---
State | New
Headers |
From: "haifeng.xu" <haifeng.xu@shopee.com>
To: longman@redhat.com
Cc: lizefan.x@bytedance.com, tj@kernel.org, hannes@cmpxchg.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, "haifeng.xu" <haifeng.xu@shopee.com>
Subject: [PATCH] cgroup/cpuset: Optimize update_tasks_nodemask()
Date: Wed, 23 Nov 2022 08:21:57 +0000
Message-Id: <20221123082157.71326-1-haifeng.xu@shopee.com>
Series | cgroup/cpuset: Optimize update_tasks_nodemask()
Commit Message
Haifeng Xu
Nov. 23, 2022, 8:21 a.m. UTC
When changing 'cpuset.mems' under some cgroup, the system can hang
for a long time. From the dmesg, many processes or threads are
stuck in fork/exit. The reason is shown as follows.
thread A:
  cpuset_write_resmask /* takes cpuset_rwsem */
  ...
    update_tasks_nodemask
      mpol_rebind_mm /* waits mmap_lock */

thread B:
  worker_thread
  ...
    cpuset_migrate_mm_workfn
      do_migrate_pages /* takes mmap_lock */

thread C:
  cgroup_procs_write /* takes cgroup_mutex and cgroup_threadgroup_rwsem */
  ...
    cpuset_can_attach
      percpu_down_write /* waits cpuset_rwsem */
Once the nodemask of the cpuset is updated, thread A wakes up
thread B to migrate the mm. But when thread A iterates through all
tasks, including child threads and the group leader, it has to wait
for the mmap_lock, which has been taken by thread B. Unfortunately,
thread C wants to migrate tasks into the cgroup at this moment, so
it must wait for thread A to release cpuset_rwsem. If thread B
spends much time migrating the mm, the fork/exit paths, which
acquire cgroup_threadgroup_rwsem, also need to wait for a long time.
There is no need to migrate the mm of child threads since it is
shared with the group leader. Just iterate through the group
leaders only.
Signed-off-by: haifeng.xu <haifeng.xu@shopee.com>
---
kernel/cgroup/cpuset.c | 3 +++
1 file changed, 3 insertions(+)
Comments
On Wed, Nov 23, 2022 at 08:21:57AM +0000, haifeng.xu wrote:
> When changing 'cpuset.mems' under some cgroup, the system can hang
> for a long time. From the dmesg, many processes or threads are
> stuck in fork/exit. [...]
>
> There is no need to migrate the mm of child threads since it is
> shared with the group leader.

This is only a problem in cgroup1, and cgroup1 doesn't require the
threads of a given process to be in the same cgroup. I don't think you
can optimize it this way.

Thanks.
On 11/23/22 12:05, Tejun Heo wrote:
> On Wed, Nov 23, 2022 at 08:21:57AM +0000, haifeng.xu wrote:
>> [...]
> This is only a problem in cgroup1, and cgroup1 doesn't require the
> threads of a given process to be in the same cgroup. I don't think
> you can optimize it this way.

I think it is an issue anyway if different threads of a process are in
different cpusets with different node masks. It is not a configuration
that should be used at all.

This patch makes update_tasks_nodemask() somewhat similar to
cpuset_attach(), where all tasks are iterated to update the node mask
but only the thread group leaders are required to update the mm. For a
non-leader task, maybe we can check if the group leader is in the same
cpuset; if so, we can skip the mm update (sketched below). Do we need a
similar change in cpuset_attach()?

I do think the "migrate = is_memory_migrate(cs);" line can be moved
outside of the loop, though. Of course, that won't help much in this
case.

Cheers,
Longman
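A minimal sketch of the check suggested above, assuming the task_cs()
helper from kernel/cgroup/cpuset.c; locking and the page-migration path
are elided, and this is illustrative only, not the test patch attached
later in the thread:

static void update_tasks_nodemask_sketch(struct cpuset *cs,
					 nodemask_t *newmems)
{
	struct css_task_iter it;
	struct task_struct *task;

	css_task_iter_start(&cs->css, 0, &it);
	while ((task = css_task_iter_next(&it))) {
		struct mm_struct *mm;

		cpuset_change_task_nodemask(task, newmems);

		/*
		 * Skip the mm update for a non-leader whose group leader
		 * is in the same cpuset: the leader's own iteration
		 * covers the shared mm. task_cs() maps a task to its
		 * cpuset; the locking around it is elided here.
		 */
		if (!thread_group_leader(task) &&
		    task_cs(task->group_leader) == cs)
			continue;

		mm = get_task_mm(task);
		if (!mm)
			continue;
		mpol_rebind_mm(mm, &cs->mems_allowed);
		mmput(mm);	/* migration path elided */
	}
	css_task_iter_end(&it);
}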
On Wed, Nov 23, 2022 at 01:48:46PM -0500, Waiman Long wrote:
> I think it is an issue anyway if different threads of a process are
> in different cpusets with different node masks. It is not a
> configuration that should be used at all.

Anything memory related is in the same boat and people still use them,
reaching whatever end results they reach. Given that the whole thing is
pretty ill-defined, I don't wanna change the behavior now.

> This patch makes update_tasks_nodemask() somewhat similar to
> cpuset_attach(), where all tasks are iterated to update the node mask
> but only the thread group leaders are required to update the mm. [...]

The leader isn't special tho. We just wanna avoid visiting the same mm
more than once, right?

Thanks.
On 11/23/22 13:54, Tejun Heo wrote:
> On Wed, Nov 23, 2022 at 01:48:46PM -0500, Waiman Long wrote:
>> I think it is an issue anyway if different threads of a process are
>> in different cpusets with different node masks. It is not a
>> configuration that should be used at all.
> Anything memory related is in the same boat and people still use
> them, reaching whatever end results they reach. Given that the whole
> thing is pretty ill-defined, I don't wanna change the behavior now.

I am just saying that this is not a good config. I don't have any
intention to change the existing behavior at all.

> The leader isn't special tho. We just wanna avoid visiting the same
> mm more than once, right?

Right, the group leader is just a marker to make it easier to avoid
duplicating the work for the same mm. If the group leader happens to be
in another cpuset, it will suffer some performance consequence.

Cheers,
Longman
On Wed, Nov 23, 2022 at 02:05:59PM -0500, Waiman Long wrote:
> Right, the group leader is just a marker to make it easier to avoid
> duplicating the work for the same mm. If the group leader happens to
> be in another cpuset, it will suffer some performance consequence.

Yeah, I guess that's gonna be good enough for most cases.

Thanks.
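A hedged sketch of the "visit each mm once" idea discussed above.
MMF_NODEMASK_UPDATED is a hypothetical mm->flags bit, not an existing
kernel flag, and clearing it between successive updates is elided:

static void update_tasks_nodemask_once(struct cpuset *cs,
				       nodemask_t *newmems)
{
	struct css_task_iter it;
	struct task_struct *task;

	css_task_iter_start(&cs->css, 0, &it);
	while ((task = css_task_iter_next(&it))) {
		struct mm_struct *mm;

		cpuset_change_task_nodemask(task, newmems);

		mm = get_task_mm(task);
		if (!mm)
			continue;
		/*
		 * Dedup by mm rather than by group leader: the first
		 * task that reaches a given mm rebinds it; later tasks
		 * sharing that mm skip the work.
		 */
		if (test_and_set_bit(MMF_NODEMASK_UPDATED, &mm->flags)) {
			mmput(mm);
			continue;
		}
		mpol_rebind_mm(mm, newmems);
		mmput(mm);	/* migration path elided */
	}
	css_task_iter_end(&it);
}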
On 11/23/22 03:21, haifeng.xu wrote:
> When changing 'cpuset.mems' under some cgroup, the system can hang
> for a long time. [...]
>
> There is no need to migrate the mm of child threads since it is
> shared with the group leader. Just iterate through the group
> leaders only.
>
> Signed-off-by: haifeng.xu <haifeng.xu@shopee.com>
> ---
>  kernel/cgroup/cpuset.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 589827ccda8b..43cbd09546d0 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1968,6 +1968,9 @@ static void update_tasks_nodemask(struct cpuset *cs)
>
>  		cpuset_change_task_nodemask(task, &newmems);
>
> +		if (!thread_group_leader(task))
> +			continue;
> +
>  		mm = get_task_mm(task);
>  		if (!mm)
>  			continue;

Could you try the attached test patch to see if it can fix your
problem? Something along the lines of this patch will be more
acceptable.

Thanks,
Longman
On 2022/11/24 04:23, Waiman Long wrote:
> On 11/23/22 03:21, haifeng.xu wrote:
>> [...]
>
> Could you try the attached test patch to see if it can fix your
> problem? Something along the lines of this patch will be more
> acceptable.

Hi, Longman.
Thanks for your patch, but there are still some problems.

1)
     (group leader, node: 0,1)
            cgroup0
            /     \
           /       \
       cgroup1   cgroup2
      (threads) (threads)

If we set node 0 in cgroup1 and node 1 in cgroup2, both of them will
update the mm, and the nodemask of the mm depends on who set the node
last.

2)
      (process, node: 0,1)
            cgroup0
            /     \
           /       \
       cgroup1   cgroup2
      (node: 0)  (node: 1)

If we migrate a thread from cgroup0 to cgroup1 or cgroup2,
cpuset_attach() won't update the mm. So the nodemask of the thread,
including mems_allowed and mempolicy (updated in
cpuset_change_task_nodemask()), is different from the vm_policy in the
vma (updated in mpol_rebind_mm()).

In a word, if threads have different cpusets with different nodemasks,
it will cause inconsistent memory behavior.

Thanks,
Haifeng.
On 11/23/22 22:33, Haifeng Xu wrote:
> Hi, Longman.
> Thanks for your patch, but there are still some problems.
>
> 1) If we set node 0 in cgroup1 and node 1 in cgroup2, both of them
> will update the mm, and the nodemask of the mm depends on who set
> the node last. [...]

Yes, that is the existing behavior. It was not that well defined in
the past, and so it is somewhat ambiguous as to what we need to do
about it.

BTW, cgroup1 has a memory_migrate flag which will force page migration
if set. I guess you may have it set in your case, as it will introduce
a lot more delay since page migration takes time. That is probably the
reason why you are seeing a long delay. So one possible solution is to
turn this flag off. Cgroup v2 doesn't have this flag.

> 2) If we migrate a thread from cgroup0 to cgroup1 or cgroup2,
> cpuset_attach() won't update the mm. [...]

Yes, that can be the case.

> In a word, if threads have different cpusets with different
> nodemasks, it will cause inconsistent memory behavior.

So do you have a suggestion of what we need to do going forward?

Cheers,
Longman
On 2022/11/24 12:24, Waiman Long wrote:
> On 11/23/22 22:33, Haifeng Xu wrote:
>> [...]
>>
>> 1) If we set node 0 in cgroup1 and node 1 in cgroup2, both of them
>> will update the mm, and the nodemask of the mm depends on who set
>> the node last.
> Yes, that is the existing behavior. It was not that well defined in
> the past, and so it is somewhat ambiguous as to what we need to do
> about it.

The test patch works if the child threads are in the same cpuset as
the group leader, which is the same logic as my patch. But if they are
in different cpusets, the test patch will fail because the contention
on mmap_lock still exists and seems similar to the original logic.

> BTW, cgroup1 has a memory_migrate flag which will force page
> migration if set. [...] So one possible solution is to turn this
> flag off. Cgroup v2 doesn't have this flag.

Do you mean 'CS_MEMORY_MIGRATE'? This flag can be turned off in cgroup
v1, but it is set by default in cgroup v2 (in cpuset_css_alloc()) and
cannot be changed.

>> 2) If we migrate a thread from cgroup0 to cgroup1 or cgroup2,
>> cpuset_attach() won't update the mm. [...]
> Yes, that can be the case.
>
>> In a word, if threads have different cpusets with different
>> nodemasks, it will cause inconsistent memory behavior.
> So do you have a suggestion of what we need to do going forward?

Should we prevent threads from migrating to cgroups which have a
different nodemask than the cgroup that contains the group leader?

In addition, the group leader and child threads should be in the same
cgroup tree, and the level of the cgroup that contains the group
leader must be higher than the cgroups that contain the child threads,
so that update_tasks_nodemask() will work.

Or should we just disable thread migration in cpuset? It's easy to
achieve but will affect cpu binding.
On 11/24/22 02:49, Haifeng Xu wrote:
> On 2022/11/24 12:24, Waiman Long wrote:
>> [...]
>
> The test patch works if the child threads are in the same cpuset as
> the group leader, which is the same logic as my patch. But if they
> are in different cpusets, the test patch will fail because the
> contention on mmap_lock still exists and seems similar to the
> original logic.

That is true. I am thinking about adding a nodemask to mm_struct so
that we can figure out if we need to propagate the changes down to all
the VMAs and do the migration. That will enable us to avoid doing
wasteful work.

Current node mask handling isn't that efficient, especially for
distros that have a relatively large NODES_SHIFT value. Some work may
also be needed in this area.

> Do you mean 'CS_MEMORY_MIGRATE'? This flag can be turned off in
> cgroup v1, but it is set by default in cgroup v2 (in
> cpuset_css_alloc()) and cannot be changed.

You are right. Cgroup v2 has CS_MEMORY_MIGRATE enabled by default and
it can't be turned off.

> Should we prevent threads from migrating to cgroups which have a
> different nodemask than the cgroup that contains the group leader?
> [...]
>
> Or should we just disable thread migration in cpuset? It's easy to
> achieve but will affect cpu binding.

As said above, my current inclination is to add a nodemask to
mm_struct and revise the way the nodemask is being handled. That will
take some time.

Cheers,
Longman
On 2022/11/25 07:00, Waiman Long wrote:
> On 11/24/22 02:49, Haifeng Xu wrote:
>> [...]
>
> As said above, my current inclination is to add a nodemask to
> mm_struct and revise the way the nodemask is being handled. That
> will take some time.
>
> Cheers,
> Longman

OK, thanks.
On 2022/11/25 07:00, Waiman Long wrote:
> On 11/24/22 02:49, Haifeng Xu wrote:
>> [...]
>>
>> Do you mean 'CS_MEMORY_MIGRATE'? This flag can be turned off in
>> cgroup v1, but it is set by default in cgroup v2 (in
>> cpuset_css_alloc()) and cannot be changed.
> You are right. Cgroup v2 has CS_MEMORY_MIGRATE enabled by default
> and it can't be turned off.

Hi, Longman.
'dfl_files' is just a minimal set. Shall we enable the memory_migrate
feature in cgroup v2 so that it can be turned off and help to solve
the problem?

> As said above, my current inclination is to add a nodemask to
> mm_struct and revise the way the nodemask is being handled. That
> will take some time.
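For context on the memory_migrate point above: in cgroup v2 mode, page
migration is pinned on when each cpuset is created, and no
memory_migrate control file is exposed to clear it. The relevant logic
in cpuset_css_alloc() reads roughly as follows (paraphrased from
kernel/cgroup/cpuset.c of this era; exact code may differ by kernel
version):

	/* paraphrased excerpt of cpuset_css_alloc(); allocation elided */
	__set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
	nodes_clear(cs->mems_allowed);
	nodes_clear(cs->effective_mems);

	/* Set CS_MEMORY_MIGRATE for default hierarchy */
	if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys))
		__set_bit(CS_MEMORY_MIGRATE, &cs->flags);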
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 589827ccda8b..43cbd09546d0 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1968,6 +1968,9 @@ static void update_tasks_nodemask(struct cpuset *cs)
 
 		cpuset_change_task_nodemask(task, &newmems);
 
+		if (!thread_group_leader(task))
+			continue;
+
 		mm = get_task_mm(task);
 		if (!mm)
 			continue;
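For readers without the tree at hand, the hunk above lands in the
task-iteration loop of update_tasks_nodemask(); the surrounding code is
paraphrased and abbreviated from kernel/cgroup/cpuset.c, so details may
differ slightly by version:

	/* paraphrased loop context for the hunk above; setup elided */
	css_task_iter_start(&cs->css, 0, &it);
	while ((task = css_task_iter_next(&it))) {
		struct mm_struct *mm;
		bool migrate;

		cpuset_change_task_nodemask(task, &newmems);

		if (!thread_group_leader(task))	/* the added check */
			continue;

		mm = get_task_mm(task);
		if (!mm)
			continue;

		migrate = is_memory_migrate(cs);

		mpol_rebind_mm(mm, &cs->mems_allowed);
		if (migrate)
			cpuset_migrate_mm(mm, &cs->old_mems_allowed,
					  &newmems);
		else
			mmput(mm);
	}
	css_task_iter_end(&it);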