Message ID | 20230131221719.3176-2-will@kernel.org
State | New
Headers |
From: Will Deacon <will@kernel.org>
To: linux-kernel@vger.kernel.org
Cc: kernel-team@android.com, Will Deacon <will@kernel.org>, Peter Zijlstra <peterz@infradead.org>, Waiman Long <longman@redhat.com>, Zefan Li <lizefan.x@bytedance.com>, Tejun Heo <tj@kernel.org>, Johannes Weiner <hannes@cmpxchg.org>, cgroups@vger.kernel.org
Subject: [PATCH 1/2] cpuset: Fix cpuset_cpus_allowed() to not filter offline CPUs
Date: Tue, 31 Jan 2023 22:17:18 +0000
Message-Id: <20230131221719.3176-2-will@kernel.org>
In-Reply-To: <20230131221719.3176-1-will@kernel.org>
References: <20230131221719.3176-1-will@kernel.org>
Series | Fix broken cpuset affinity handling on heterogeneous systems
Commit Message
Will Deacon
Jan. 31, 2023, 10:17 p.m. UTC
From: Peter Zijlstra <peterz@infradead.org>

There is a difference in behaviour between CPUSET={y,n} that is now wreaking havoc with {relax,force}_compatible_cpus_allowed_ptr().

Specifically, since commit 8f9ea86fdf99 ("sched: Always preserve the user requested cpumask"), relax_compatible_cpus_allowed_ptr() calls __sched_setaffinity() unconditionally.

But the underlying problem goes back a lot further, possibly to commit ae1c802382f7 ("cpuset: apply cs->effective_{cpus,mems}"), which switched cpuset_cpus_allowed() from cs->cpus_allowed to cs->effective_cpus.

The problem is that for CPUSET=y, cpuset_cpus_allowed() will filter out all offline CPUs. For tasks that are part of a (!root) cpuset this is later fixed up by the cpuset hotplug notifiers, which re-evaluate and re-apply cs->effective_cpus; but for (normal) tasks in the root cpuset this does not happen, and they will forever after be excluded from CPUs onlined later.

As such, rewrite cpuset_cpus_allowed() to return a wider mask, including the offline CPUs.

Fixes: 8f9ea86fdf99 ("sched: Always preserve the user requested cpumask")
Reported-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230117160825.GA17756@willie-the-truck
Signed-off-by: Will Deacon <will@kernel.org>
---
 kernel/cgroup/cpuset.c | 39 ++++++++++++++++++++++++++++++++++-----
 1 file changed, 34 insertions(+), 5 deletions(-)
Comments
On 1/31/23 17:17, Will Deacon wrote:
> From: Peter Zijlstra <peterz@infradead.org>
>
> [...]
>
> As such, rewrite cpuset_cpus_allowed() to return a wider mask,
> including the offline CPUs.

Before cgroup v2, cpuset had only one cpumask - cpus_allowed. It tracked only online cpus and ignored the offline ones, so it behaves more like effective_cpus in cpuset v2. With v2, we have two cpumasks - cpus_allowed and effective_cpus. When cpuset v1 is mounted, cpus_allowed and effective_cpus are effectively the same and track online cpus. With cpuset v2, cpus_allowed contains whatever the user has written into it, and it won't change until another write happens. However, what the user wrote may not be what the system can give it, and effective_cpus is what the system decides a cpuset can use.

Cpuset v2 is able to handle hotplug correctly and updates the task's cpumask accordingly, so losing previously offline cpus won't happen with v2.

Since v1 keeps the old behavior, previously offlined cpus are lost from the cpuset's cpus_allowed. However, tasks in the root cpuset will still be fine with cpu hotplug, as its cpus_allowed should track cpu_online_mask. IOW, only tasks in a non-root cpuset suffer this problem. It was a known issue in v1 and, I believe, one of the major reasons for the cpuset v2 redesign.

A major concern I have is the overhead of creating a poor man's version of the v2 cpus_allowed. This issue can be worked around even for cpuset v1 if it is mounted with the cpuset_v2_mode option to behave more like v2 in its cpumask handling. Alternatively, we may be able to provide a config option to make this the default for v1 without the special mount option, if necessary.

Cheers,
Longman
On Tue, Jan 31, 2023 at 11:14:27PM -0500, Waiman Long wrote:
> [...]
> A major concern I have is the overhead of creating a poor man version of v2
> cpus_allowed. This issue can be worked around even for cpuset v1 if it is
> mounted with the cpuset_v2_mode option to behave more like v2 in its cpumask
> handling. Alternatively we may be able to provide a config option to make
> this the default for v1 without the special mount option, if necessary.

You're still not getting it -- even cpuset (be it v1 or v2) *MUST* *NOT* mask offline cpus for root cgroup tasks, ever. (And the only reason it gets away with masking offline for !root is that it re-applies the mask every time it changes.)

Yes, it did that for a fair while -- but it is wrong and broken and a very big behavioural difference between CONFIG_CPUSET={y,n}. This must not be.

Arguably cpuset-v2 is still wrong for masking offline cpus in its effective_cpus mask, but I really didn't want to go rewrite cpuset.c for something that needs to go into /urgent *now*. Hence this minimal patch, which at least lets sched_setaffinity() work as intended.
On 2/1/23 04:14, Peter Zijlstra wrote:
> [...]
> You're still not getting it -- even cpuset (be it v1 or v2) *MUST* *NOT*
> mask offline cpus for root cgroup tasks, ever. (And the only reason it
> gets away with masking offline for !root is that it re-applies the mask
> every time it changes.)
> [...]
> Hence this minimal patch that at least lets sched_setaffinity() work as
> intended.

I don't object to the general idea of keeping offline cpus in a task's cpu affinity. In the case of a cpu offline event, we can skip removing that offline cpu from the task's cpu affinity. That will partially solve the problem here and is also simpler.

I believe a main reason why effective_cpus holds only online cpus is the need to detect when no online cpus are available in a given cpuset. In that case, it will fall back to the nearest ancestor with online cpus.

This offline cpu problem with cpuset v1 has been a known problem for a long time. It is not a recent regression.

Cheers,
Longman
On 2/1/23 10:16, Waiman Long wrote:
> [...]
> This offline cpu problem with cpuset v1 is a known problem for a long
> time. It is not a recent regression.

Note that using cpus_allowed directly in cgroup v2 may not be right, because cpus_allowed may have no relationship to effective_cpus at all in some cases, e.g.

    root
     |
     V
     A (cpus_allowed = 1-4, effective_cpus = 1-4)
     |
     V
     B (cpus_allowed = 5-8, effective_cpus = 1-4)

In the case of cpuset B, passing back cpus 5-8 as the allowed cpus is wrong.

I wonder how often cpu hotplug happens on those arm64 systems where only a subset of cpus can run 32-bit programs.

Cheers,
Longman
On 2/1/23 13:46, Waiman Long wrote:
> [...]
> In the case of cpuset B, passing back cpus 5-8 as the allowed_cpus is
> wrong.

One possible solution is to use cpuset_cpus_allowed_fallback() in case none of the cpus in the current cpuset is allowed to be used to run a given task.

Cheers,
Longman
On 2/1/23 14:14, Waiman Long wrote:
> One possible solution is to use cpuset_cpus_allowed_fallback() in case
> none of the cpus in the current cpuset is allowed to be used to run a
> given task.

It looks like we will need to enhance cpuset_cpus_allowed_fallback() to walk up the cpuset tree to find one that has usable cpus.

Cheers,
Longman
On Wed, Feb 01, 2023 at 01:46:11PM -0500, Waiman Long wrote:
> Note that using cpus_allowed directly in cgroup v2 may not be right because
> cpus_allowed may have no relationship to effective_cpus at all in some
> cases, e.g.
>
>     root
>      |
>      V
>      A (cpus_allowed = 1-4, effective_cpus = 1-4)
>      |
>      V
>      B (cpus_allowed = 5-8, effective_cpus = 1-4)
>
> In the case of cpuset B, passing back cpus 5-8 as the allowed_cpus is wrong.

I think my patch as written does the right thing here. Since the intersection of (1-4) and (5-8) is empty, it will move up the hierarchy and we'll end up with (1-4) from the cgroup side of things.

So the purpose of __cs_cpus_allowed() is to override the cpus_allowed of the root set and force it to cpu_possible_mask.

Then cs_cpus_allowed() computes the intersection of cs->cpus_allowed and all its parents. This will, in the case of B above, result in the empty mask.

Then cpuset_cpus_allowed() has a loop that starts with task_cpu_possible_mask(), intersects that with cs_cpus_allowed(), and if the intersection of that and cpu_online_mask is empty, moves up the hierarchy. Given cs_cpus_allowed(B) is the empty mask, we'll move to A.

Note that since we force the mask of root to cpu_possible_mask, cs_cpus_allowed(root) will be a no-op, and if we guarantee (in arch code) that cpu_online_mask always has a non-empty intersection with task_cpu_possible_mask(), this loop is guaranteed to terminate with a viable mask.
On 2/1/23 16:10, Peter Zijlstra wrote:
> [...]
> Note that since we force the mask of root to cpu_possible_mask,
> cs_cpus_allowed(root) will be a no-op and if we guarantee (in arch code)
> that cpu_online_mask always has a non-empty intersection with
> task_cpu_possible_mask(), this loop is guaranteed to terminate with a
> viable mask.

I will take a closer look at that tomorrow. I will be more comfortable ack'ing it if this is specific to v1 cpuset, instead of applying it in both v1 and v2, since it is only v1 that is problematic.

Cheers,
Longman
On Tue, Jan 31, 2023 at 11:14:27PM -0500, Waiman Long wrote:
> A major concern I have is the overhead of creating a poor man's version
> of v2 cpus_allowed. This issue can be worked around even for cpuset v1 if
> it is mounted with the cpuset_v2_mode option to behave more like v2 in
> its cpumask handling. Alternatively we may be able to provide a config
> option to make this the default for v1 without the special mount option,
> if necessary.

It is equally broken for v2: it masks against effective_cpus. Not to
mention it explicitly starts with cpu_online_mask.
On 2/2/23 03:34, Peter Zijlstra wrote:
> On Tue, Jan 31, 2023 at 11:14:27PM -0500, Waiman Long wrote:
>
>> A major concern I have is the overhead of creating a poor man's version
>> of v2 cpus_allowed. This issue can be worked around even for cpuset v1 if
>> it is mounted with the cpuset_v2_mode option to behave more like v2 in
>> its cpumask handling. Alternatively we may be able to provide a config
>> option to make this the default for v1 without the special mount option,
>> if necessary.
> It is equally broken for v2: it masks against effective_cpus. Not to
> mention it explicitly starts with cpu_online_mask.

After taking a close look at the patch, my understanding of what it is
doing is as follows:

v2: cpus_allowed will not be affected by hotplug. So the new
cpuset_cpus_allowed() will return effective_cpus plus any offline cpus
that would have been part of effective_cpus had they been online, before
masking that with the allowable cpus and then going up the cpuset
hierarchy if necessary.

v1: cpus_allowed is equivalent to v2 effective_cpus. It starts at the
current cpuset and moves up the hierarchy if necessary to find a cpuset
that has at least one allowable cpu.

First of all, it does not take into account the v2 partition feature,
which may cause it to produce an incorrect result if a partition is
enabled somewhere. Secondly, I don't see any benefit other than having
some additional offline cpus available in a task's cpumask, which the
scheduler will ignore anyway. v2 is able to recover a previously offlined
cpu, so we don't gain any net benefit other than the going up the cpuset
hierarchy part.

For v1, I agree we should go up the cpuset hierarchy to find a usable
cpuset. Instead of introducing such complexity in cpuset_cpus_allowed(),
my current preference is to do the hierarchy-climbing part in an enhanced
cpuset_cpus_allowed_fallback() after an initial failure of
cpuset_cpus_allowed(). That will be easier to understand than having all
that complexity and overhead in cpuset_cpus_allowed() alone.

I will work on a patchset to do that as a counter-offer.

Cheers,
Longman
On Thu, Feb 02, 2023 at 11:06:51AM -0500, Waiman Long wrote:
> After taking a close look at the patch, my understanding of what it is
> doing is as follows:
>
> v2: cpus_allowed will not be affected by hotplug. So the new
> cpuset_cpus_allowed() will return effective_cpus plus any offline cpus
> that would have been part of effective_cpus had they been online, before
> masking that with the allowable cpus and then going up the cpuset
> hierarchy if necessary.
>
> v1: cpus_allowed is equivalent to v2 effective_cpus. It starts at the
> current cpuset and moves up the hierarchy if necessary to find a cpuset
> that has at least one allowable cpu.
>
> First of all, it does not take into account the v2 partition feature,
> which may cause it to produce an incorrect result if a partition is
> enabled somewhere.

How so? For a partition the cpus_allowed mask should be the partition
CPUs. The only magical bit about partitions is that any one CPU cannot
belong to two partitions and load-balancing is split.

> Secondly, I don't see any benefit other than having some additional
> offline cpus available in a task's cpumask, which the scheduler will
> ignore anyway.

Those CPUs can come online again -- you're *again* dismissing the true
bug :/

If you filter out the offline CPUs at sched_setaffinity() time, you
forever lose those CPUs; the task will never again move to those CPUs,
even if they do come online afterwards.

It is really simple to reproduce this:

 - boot machine
 - offline all CPUs except one
 - taskset -p ffffffff $$
 - online all CPUs

and observe your shell (and all its descendants) being stuck to the one
CPU. Do the same thing on a CPUSET=n build and note the difference (you
retain the full mask).

> v2 is able to recover a previously offlined cpu, so we don't gain any
> net benefit other than the going up the cpuset hierarchy part.

Only for !root tasks. Not even v2 will re-set the affinity of root tasks
afaict.

> For v1, I agree we should go up the cpuset hierarchy to find a usable
> cpuset. Instead of introducing such complexity in cpuset_cpus_allowed(),
> my current preference is to do the hierarchy-climbing part in an enhanced
> cpuset_cpus_allowed_fallback() after an initial failure of
> cpuset_cpus_allowed(). That will be easier to understand than having all
> that complexity and overhead in cpuset_cpus_allowed() alone.
>
> I will work on a patchset to do that as a counter-offer.

We will need a small and simple patch for /urgent, or I will need to
revert all your patches -- your call.

I also don't think you fully appreciate the ramifications of
task_cpu_possible_mask(); cpuset currently gets that quite wrong.
On 2/2/23 14:42, Peter Zijlstra wrote:
> On Thu, Feb 02, 2023 at 11:06:51AM -0500, Waiman Long wrote:
>
>> After taking a close look at the patch, my understanding of what it is
>> doing is as follows:
>>
>> v2: cpus_allowed will not be affected by hotplug. So the new
>> cpuset_cpus_allowed() will return effective_cpus plus any offline cpus
>> that would have been part of effective_cpus had they been online, before
>> masking that with the allowable cpus and then going up the cpuset
>> hierarchy if necessary.
>>
>> v1: cpus_allowed is equivalent to v2 effective_cpus. It starts at the
>> current cpuset and moves up the hierarchy if necessary to find a cpuset
>> that has at least one allowable cpu.
>>
>> First of all, it does not take into account the v2 partition feature,
>> which may cause it to produce an incorrect result if a partition is
>> enabled somewhere.
> How so? For a partition the cpus_allowed mask should be the partition
> CPUs. The only magical bit about partitions is that any one CPU cannot
> belong to two partitions and load-balancing is split.

There can be a child partition underneath it that uses up some of the
cpus in the cpus_allowed mask. So if you cascade up the cpuset tree from
another child, your code will wrongly include those cpus that are
dedicated to the other child partition.

>> Secondly, I don't see any benefit other than having some additional
>> offline cpus available in a task's cpumask, which the scheduler will
>> ignore anyway.
> Those CPUs can come online again -- you're *again* dismissing the true
> bug :/
>
> If you filter out the offline CPUs at sched_setaffinity() time, you
> forever lose those CPUs; the task will never again move to those CPUs,
> even if they do come online afterwards.
>
> It is really simple to reproduce this:
>
>  - boot machine
>  - offline all CPUs except one
>  - taskset -p ffffffff $$
>  - online all CPUs
>
> and observe your shell (and all its descendants) being stuck to the one
> CPU. Do the same thing on a CPUSET=n build and note the difference (you
> retain the full mask).

With the new sched_setaffinity(), the original mask should be stored in
user_cpus_ptr. The cpuset code is supposed to call set_cpus_allowed_ptr(),
which then sets the mask correctly. I will run your test and figure out
why it does not work as I would have expected.

>> v2 is able to recover a previously offlined cpu, so we don't gain any
>> net benefit other than the going up the cpuset hierarchy part.
> Only for !root tasks. Not even v2 will re-set the affinity of root tasks
> afaict.
>
>> For v1, I agree we should go up the cpuset hierarchy to find a usable
>> cpuset. Instead of introducing such complexity in cpuset_cpus_allowed(),
>> my current preference is to do the hierarchy-climbing part in an enhanced
>> cpuset_cpus_allowed_fallback() after an initial failure of
>> cpuset_cpus_allowed(). That will be easier to understand than having all
>> that complexity and overhead in cpuset_cpus_allowed() alone.
>>
>> I will work on a patchset to do that as a counter-offer.
> We will need a small and simple patch for /urgent, or I will need to
> revert all your patches -- your call.
>
> I also don't think you fully appreciate the ramifications of
> task_cpu_possible_mask(); cpuset currently gets that quite wrong.

OK, I didn't realize the urgency of that. If it is that urgent, I will
have no objection to getting it in for now. We can improve it later on.
So are you planning to get it into the current 6.2 rc or 6.3?

Tejun, are you OK with that as you are the cgroup maintainer?

Cheers,
Longman
On Thu, Feb 02, 2023 at 03:46:02PM -0500, Waiman Long wrote:
> > > I will work on a patchset to do that as a counter-offer.
> > We will need a small and simple patch for /urgent, or I will need to
> > revert all your patches -- your call.
> >
> > I also don't think you fully appreciate the ramifications of
> > task_cpu_possible_mask(); cpuset currently gets that quite wrong.
>
> OK, I didn't realize the urgency of that. If it is that urgent, I will
> have no objection to getting it in for now. We can improve it later on.
> So are you planning to get it into the current 6.2 rc or 6.3?
>
> Tejun, are you OK with that as you are the cgroup maintainer?

Yeah, gotta fix the regression, but is there currently a solution which
fixes the regression without further breaking other stuff?

Thanks.
On 2/2/23 15:48, Tejun Heo wrote:
> On Thu, Feb 02, 2023 at 03:46:02PM -0500, Waiman Long wrote:
>>>> I will work on a patchset to do that as a counter-offer.
>>> We will need a small and simple patch for /urgent, or I will need to
>>> revert all your patches -- your call.
>>>
>>> I also don't think you fully appreciate the ramifications of
>>> task_cpu_possible_mask(); cpuset currently gets that quite wrong.
>> OK, I didn't realize the urgency of that. If it is that urgent, I will
>> have no objection to getting it in for now. We can improve it later on.
>> So are you planning to get it into the current 6.2 rc or 6.3?
>>
>> Tejun, are you OK with that as you are the cgroup maintainer?
> Yeah, gotta fix the regression, but is there currently a solution which
> fixes the regression without further breaking other stuff?

I believe there is a better way to do that, but it will need more time to
flesh out. Since cpuset_cpus_allowed() is only used by
kernel/sched/core.c, Peter will be responsible if it somehow breaks other
stuff.

Thanks,
Longman
On 2/2/23 15:53, Waiman Long wrote:
>
> On 2/2/23 15:48, Tejun Heo wrote:
>> On Thu, Feb 02, 2023 at 03:46:02PM -0500, Waiman Long wrote:
>>>>> I will work on a patchset to do that as a counter-offer.
>>>> We will need a small and simple patch for /urgent, or I will need to
>>>> revert all your patches -- your call.
>>>>
>>>> I also don't think you fully appreciate the ramifications of
>>>> task_cpu_possible_mask(); cpuset currently gets that quite wrong.
>>> OK, I didn't realize the urgency of that. If it is that urgent, I
>>> will have no objection to getting it in for now. We can improve it
>>> later on. So are you planning to get it into the current 6.2 rc or
>>> 6.3?
>>>
>>> Tejun, are you OK with that as you are the cgroup maintainer?
>> Yeah, gotta fix the regression, but is there currently a solution
>> which fixes the regression without further breaking other stuff?
>
> I believe there is a better way to do that, but it will need more time
> to flesh out. Since cpuset_cpus_allowed() is only used by
> kernel/sched/core.c, Peter will be responsible if it somehow breaks
> other stuff.

Maybe my cpuset patch that doesn't update a task's cpumask on a cpu
offline event can help. However, I don't know the exact scenario where
the regression happens, so it may not.

Cheers,
Longman
On Thu, Feb 02, 2023 at 04:05:14PM -0500, Waiman Long wrote:
> On 2/2/23 15:53, Waiman Long wrote:
> >
> > On 2/2/23 15:48, Tejun Heo wrote:
> > > On Thu, Feb 02, 2023 at 03:46:02PM -0500, Waiman Long wrote:
> > > > > > I will work on a patchset to do that as a counter-offer.
> > > > > We will need a small and simple patch for /urgent, or I will need to
> > > > > revert all your patches -- your call.
> > > > >
> > > > > I also don't think you fully appreciate the ramifications of
> > > > > task_cpu_possible_mask(); cpuset currently gets that quite wrong.
> > > > OK, I didn't realize the urgency of that. If it is that urgent, I
> > > > will have no objection to getting it in for now. We can improve it
> > > > later on. So are you planning to get it into the current 6.2 rc or
> > > > 6.3?
> > > >
> > > > Tejun, are you OK with that as you are the cgroup maintainer?
> > > Yeah, gotta fix the regression, but is there currently a solution
> > > which fixes the regression without further breaking other stuff?
> >
> > I believe there is a better way to do that, but it will need more time
> > to flesh out. Since cpuset_cpus_allowed() is only used by
> > kernel/sched/core.c, Peter will be responsible if it somehow breaks
> > other stuff.
>
> Maybe my cpuset patch that doesn't update a task's cpumask on a cpu
> offline event can help. However, I don't know the exact scenario where
> the regression happens, so it may not.

Neither patch looks like it would break anything. That said, the patches
aren't trivial and we're really close to the merge window, so I'd really
appreciate it if you could take a look and test a bit before we send
these Linus's way. We can replace them with a better solution afterwards.

Thanks.
On 2/2/23 16:50, Tejun Heo wrote:
> On Thu, Feb 02, 2023 at 04:05:14PM -0500, Waiman Long wrote:
>> On 2/2/23 15:53, Waiman Long wrote:
>>> On 2/2/23 15:48, Tejun Heo wrote:
>>>> On Thu, Feb 02, 2023 at 03:46:02PM -0500, Waiman Long wrote:
>>>>>>> I will work on a patchset to do that as a counter-offer.
>>>>>> We will need a small and simple patch for /urgent, or I will need to
>>>>>> revert all your patches -- your call.
>>>>>>
>>>>>> I also don't think you fully appreciate the ramifications of
>>>>>> task_cpu_possible_mask(); cpuset currently gets that quite wrong.
>>>>> OK, I didn't realize the urgency of that. If it is that urgent, I
>>>>> will have no objection to getting it in for now. We can improve it
>>>>> later on. So are you planning to get it into the current 6.2 rc or
>>>>> 6.3?
>>>>>
>>>>> Tejun, are you OK with that as you are the cgroup maintainer?
>>>> Yeah, gotta fix the regression, but is there currently a solution
>>>> which fixes the regression without further breaking other stuff?
>>> I believe there is a better way to do that, but it will need more time
>>> to flesh out. Since cpuset_cpus_allowed() is only used by
>>> kernel/sched/core.c, Peter will be responsible if it somehow breaks
>>> other stuff.
>> Maybe my cpuset patch that doesn't update a task's cpumask on a cpu
>> offline event can help. However, I don't know the exact scenario where
>> the regression happens, so it may not.
> Neither patch looks like it would break anything. That said, the patches
> aren't trivial and we're really close to the merge window, so I'd really
> appreciate it if you could take a look and test a bit before we send
> these Linus's way. We can replace them with a better solution afterwards.

OK, will do.

Cheers,
Longman
On Wed, Feb 01, 2023 at 10:34:00PM -0500, Waiman Long wrote:
> On 2/1/23 16:10, Peter Zijlstra wrote:
> > On Wed, Feb 01, 2023 at 01:46:11PM -0500, Waiman Long wrote:
> >
> > > Note that using cpus_allowed directly in cgroup v2 may not be right because
> > > cpus_allowed may have no relationship to effective_cpus at all in some
> > > cases, e.g.
> > >
> > >       root
> > >        |
> > >        V
> > >    A (cpus_allowed = 1-4, effective_cpus = 1-4)
> > >        |
> > >        V
> > >    B (cpus_allowed = 5-8, effective_cpus = 1-4)
> > >
> > > In the case of cpuset B, passing back cpus 5-8 as the allowed_cpus is wrong.
> > I think my patch as written does the right thing here. Since the
> > intersection of (1-4) and (5-8) is empty it will move up the hierarchy
> > and we'll end up with (1-4) from the cgroup side of things.
> >
> > So the purpose of __cs_cpus_allowed() is to override the cpus_allowed of
> > the root set and force it to cpu_possible_mask.
> >
> > Then cs_cpus_allowed() computes the intersection of cs->cpus_allowed and
> > all its parents. This will, in the case of B above, result in the empty
> > mask.
> >
> > Then cpuset_cpus_allowed() has a loop that starts with
> > task_cpu_possible_mask(), intersects that with cs_cpus_allowed() and, if
> > the intersection of that and cpu_online_mask is empty, moves up the
> > hierarchy. Given cs_cpus_allowed(B) is the empty mask, we'll move to A.
> >
> > Note that since we force the mask of root to cpu_possible_mask,
> > cs_cpus_allowed(root) will be a no-op and, if we guarantee (in arch code)
> > that cpu_online_mask always has a non-empty intersection with
> > task_cpu_possible_mask(), this loop is guaranteed to terminate with a
> > viable mask.
>
> I will take a closer look at that tomorrow. I would be more comfortable
> ack'ing it if this were specific to v1 cpuset instead of being applied to
> both v1 and v2, since only v1 is problematic.

fwiw, the regression I'm seeing is with cgroup2. I haven't tried v1.

Will
On 2/3/23 06:50, Will Deacon wrote:
> On Wed, Feb 01, 2023 at 10:34:00PM -0500, Waiman Long wrote:
>> On 2/1/23 16:10, Peter Zijlstra wrote:
>>> On Wed, Feb 01, 2023 at 01:46:11PM -0500, Waiman Long wrote:
>>>
>>>> Note that using cpus_allowed directly in cgroup v2 may not be right because
>>>> cpus_allowed may have no relationship to effective_cpus at all in some
>>>> cases, e.g.
>>>>
>>>>       root
>>>>        |
>>>>        V
>>>>    A (cpus_allowed = 1-4, effective_cpus = 1-4)
>>>>        |
>>>>        V
>>>>    B (cpus_allowed = 5-8, effective_cpus = 1-4)
>>>>
>>>> In the case of cpuset B, passing back cpus 5-8 as the allowed_cpus is wrong.
>>> I think my patch as written does the right thing here. Since the
>>> intersection of (1-4) and (5-8) is empty it will move up the hierarchy
>>> and we'll end up with (1-4) from the cgroup side of things.
>>>
>>> So the purpose of __cs_cpus_allowed() is to override the cpus_allowed of
>>> the root set and force it to cpu_possible_mask.
>>>
>>> Then cs_cpus_allowed() computes the intersection of cs->cpus_allowed and
>>> all its parents. This will, in the case of B above, result in the empty
>>> mask.
>>>
>>> Then cpuset_cpus_allowed() has a loop that starts with
>>> task_cpu_possible_mask(), intersects that with cs_cpus_allowed() and, if
>>> the intersection of that and cpu_online_mask is empty, moves up the
>>> hierarchy. Given cs_cpus_allowed(B) is the empty mask, we'll move to A.
>>>
>>> Note that since we force the mask of root to cpu_possible_mask,
>>> cs_cpus_allowed(root) will be a no-op and, if we guarantee (in arch code)
>>> that cpu_online_mask always has a non-empty intersection with
>>> task_cpu_possible_mask(), this loop is guaranteed to terminate with a
>>> viable mask.
>> I will take a closer look at that tomorrow. I would be more comfortable
>> ack'ing it if this were specific to v1 cpuset instead of being applied to
>> both v1 and v2, since only v1 is problematic.
> fwiw, the regression I'm seeing is with cgroup2. I haven't tried v1.

I think I know where the problem is. It is due to the fact that the
cpuset hotplug code doesn't update the cpumasks of the tasks in the top
cpuset (root) at all when there is a cpu offline or online event,
probably because for some of the tasks in the top cpuset, especially the
percpu kthreads, changing their cpumasks can be catastrophic. The hotplug
code does update the cpumasks of the tasks that are not in the top
cpuset. This problem is irrespective of whether v1 or v2 is in use.

The partition code does try to update the cpumasks of the tasks in the
top cpuset, but it skips the percpu kthreads. My testing shows that is
probably OK.

For safety, I agree that it is better to extend the allowed cpu list to
all the possible cpus (including offline ones) for now, until more
testing is done to find a safe way to do that. That special case should
apply only to tasks in the top cpuset. For the rest, the current code
should be OK and is less risky than adopting the code in this patch.

Cheers,
Longman
On Fri, Feb 03, 2023 at 10:13:10AM -0500, Waiman Long wrote:
> I think I know where the problem is. It is due to the fact that the
> cpuset hotplug code doesn't update the cpumasks of the tasks in the top
> cpuset (root) at all when there is a cpu offline or online event,
> probably because for some of the tasks in the top cpuset, especially the
> percpu kthreads, changing their cpumasks can be catastrophic. The hotplug
> code does update the cpumasks of the tasks that are not in the top
> cpuset. This problem is irrespective of whether v1 or v2 is in use.

I've been saying this exact thing for how many mails now?
On 2/3/23 10:26, Peter Zijlstra wrote:
> On Fri, Feb 03, 2023 at 10:13:10AM -0500, Waiman Long wrote:
>
>> I think I know where the problem is. It is due to the fact that the
>> cpuset hotplug code doesn't update the cpumasks of the tasks in the top
>> cpuset (root) at all when there is a cpu offline or online event,
>> probably because for some of the tasks in the top cpuset, especially the
>> percpu kthreads, changing their cpumasks can be catastrophic. The hotplug
>> code does update the cpumasks of the tasks that are not in the top
>> cpuset. This problem is irrespective of whether v1 or v2 is in use.
> I've been saying this exact thing for how many mails now?

My bad. The fact that sched_getaffinity() masks off the offline cpus made
me think, incorrectly, that tasks in the top cpuset were also updated by
the hotplug code. Further testing indicates this is the case.

Thanks,
Longman
On Thu, Feb 02, 2023 at 11:50:55AM -1000, Tejun Heo wrote:
> On Thu, Feb 02, 2023 at 04:05:14PM -0500, Waiman Long wrote:
> >
> > On 2/2/23 15:53, Waiman Long wrote:
> > >
> > > On 2/2/23 15:48, Tejun Heo wrote:
> > > > On Thu, Feb 02, 2023 at 03:46:02PM -0500, Waiman Long wrote:
> > > > > > > I will work on a patchset to do that as a counter-offer.
> > > > > > We will need a small and simple patch for /urgent, or I will need to
> > > > > > revert all your patches -- your call.
> > > > > >
> > > > > > I also don't think you fully appreciate the ramifications of
> > > > > > task_cpu_possible_mask(); cpuset currently gets that quite wrong.
> > > > > OK, I didn't realize the urgency of that. If it is that urgent, I
> > > > > will have no objection to getting it in for now. We can improve it
> > > > > later on. So are you planning to get it into the current 6.2 rc or
> > > > > 6.3?
> > > > >
> > > > > Tejun, are you OK with that as you are the cgroup maintainer?
> > > > Yeah, gotta fix the regression, but is there currently a solution
> > > > which fixes the regression but doesn't further break other stuff?
> > >
> > > I believe there is a better way to do that, but it will need more time
> > > to flesh out. Since cpuset_cpus_allowed() is only used by
> > > kernel/sched/core.c, Peter will be responsible if it somehow breaks
> > > other stuff.
> >
> > Maybe my cpuset patch that doesn't update a task's cpumask on a cpu
> > offline event can help. However, I don't know the exact scenario where
> > the regression happens, so it may not.
>
> Neither patch looks like it would break anything. That said, the patches
> aren't trivial and we're really close to the merge window, so I'd really
> appreciate it if you could take a look and test a bit before we send
> these Linus's way. We can replace them with a better solution afterwards.

FWIW, I tested this series in an arm64 heterogeneous setup with things
like hotplug and exec()ing between 32-bit and 64-bit tasks and it all
seems good.

The alternative would be to revert Waiman's setaffinity changes, which
I've had a go at here:

  https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=ssa-reverts

and I _think_ I've rescued the UAF fix too.

What do people prefer?

Will
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a29c0b13706b..8552cc2c586a 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3683,23 +3683,52 @@ void __init cpuset_init_smp(void)
 	BUG_ON(!cpuset_migrate_mm_wq);
 }
 
+static const struct cpumask *__cs_cpus_allowed(struct cpuset *cs)
+{
+	const struct cpumask *cs_mask = cs->cpus_allowed;
+
+	if (!parent_cs(cs))
+		cs_mask = cpu_possible_mask;
+
+	return cs_mask;
+}
+
+static void cs_cpus_allowed(struct cpuset *cs, struct cpumask *pmask)
+{
+	do {
+		cpumask_and(pmask, pmask, __cs_cpus_allowed(cs));
+		cs = parent_cs(cs);
+	} while (cs);
+}
+
 /**
  * cpuset_cpus_allowed - return cpus_allowed mask from a tasks cpuset.
  * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed.
  * @pmask: pointer to struct cpumask variable to receive cpus_allowed set.
  *
- * Description: Returns the cpumask_var_t cpus_allowed of the cpuset
- * attached to the specified @tsk. Guaranteed to return some non-empty
- * subset of cpu_online_mask, even if this means going outside the
- * tasks cpuset.
+ * Description: Returns the cpumask_var_t cpus_allowed of the cpuset attached
+ * to the specified @tsk. Guaranteed to return some non-empty intersection
+ * with cpu_online_mask, even if this means going outside the tasks cpuset.
  **/
 void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
 {
 	unsigned long flags;
+	struct cpuset *cs;
 
 	spin_lock_irqsave(&callback_lock, flags);
-	guarantee_online_cpus(tsk, pmask);
+	rcu_read_lock();
+
+	cs = task_cs(tsk);
+	do {
+		cpumask_copy(pmask, task_cpu_possible_mask(tsk));
+		cs_cpus_allowed(cs, pmask);
+
+		if (cpumask_intersects(pmask, cpu_online_mask))
+			break;
+
+		cs = parent_cs(cs);
+	} while (cs);
+
+	rcu_read_unlock();
 	spin_unlock_irqrestore(&callback_lock, flags);
 }