[1/2] cpuset: Fix cpuset_cpus_allowed() to not filter offline CPUs

Message ID 20230131221719.3176-2-will@kernel.org
State New
Headers
Series Fix broken cpuset affinity handling on heterogeneous systems |

Commit Message

Will Deacon Jan. 31, 2023, 10:17 p.m. UTC
  From: Peter Zijlstra <peterz@infradead.org>

There is a difference in behaviour between CPUSET={y,n} that is now
wrecking havoc with {relax,force}_compatible_cpus_allowed_ptr().

Specifically, since commit 8f9ea86fdf99 ("sched: Always preserve the
user requested cpumask")  relax_compatible_cpus_allowed_ptr() is
calling __sched_setaffinity() unconditionally.

But the underlying problem goes back a lot further, possibly to
commit: ae1c802382f7 ("cpuset: apply cs->effective_{cpus,mems}") which
switched cpuset_cpus_allowed() from cs->cpus_allowed to
cs->effective_cpus.

The problem is that for CPUSET=y cpuset_cpus_allowed() will filter out
all offline CPUs. For tasks that are part of a (!root) cpuset this is
then later fixed up by the cpuset hotplug notifiers that re-evaluate
and re-apply cs->effective_cpus, but for (normal) tasks in the root
cpuset this does not happen and they will forever after be excluded
from CPUs onlined later.

As such, rewrite cpuset_cpus_allowed() to return a wider mask,
including the offline CPUs.

Fixes: 8f9ea86fdf99 ("sched: Always preserve the user requested cpumask")
Reported-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230117160825.GA17756@willie-the-truck
Signed-off-by: Will Deacon <will@kernel.org>
---
 kernel/cgroup/cpuset.c | 39 ++++++++++++++++++++++++++++++++++-----
 1 file changed, 34 insertions(+), 5 deletions(-)
  

Comments

Waiman Long Feb. 1, 2023, 4:14 a.m. UTC | #1
On 1/31/23 17:17, Will Deacon wrote:
> From: Peter Zijlstra <peterz@infradead.org>
>
> There is a difference in behaviour between CPUSET={y,n} that is now
> wrecking havoc with {relax,force}_compatible_cpus_allowed_ptr().
>
> Specifically, since commit 8f9ea86fdf99 ("sched: Always preserve the
> user requested cpumask")  relax_compatible_cpus_allowed_ptr() is
> calling __sched_setaffinity() unconditionally.
>
> But the underlying problem goes back a lot further, possibly to
> commit: ae1c802382f7 ("cpuset: apply cs->effective_{cpus,mems}") which
> switched cpuset_cpus_allowed() from cs->cpus_allowed to
> cs->effective_cpus.
>
> The problem is that for CPUSET=y cpuset_cpus_allowed() will filter out
> all offline CPUs. For tasks that are part of a (!root) cpuset this is
> then later fixed up by the cpuset hotplug notifiers that re-evaluate
> and re-apply cs->effective_cpus, but for (normal) tasks in the root
> cpuset this does not happen and they will forever after be excluded
> from CPUs onlined later.
>
> As such, rewrite cpuset_cpus_allowed() to return a wider mask,
> including the offline CPUs.
>
> Fixes: 8f9ea86fdf99 ("sched: Always preserve the user requested cpumask")
> Reported-by: Will Deacon <will@kernel.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/20230117160825.GA17756@willie-the-truck
> Signed-off-by: Will Deacon <will@kernel.org>

Before cgroup v2, cpuset had only one cpumask - cpus_allowed. It only 
tracked online cpus and ignored the offline ones. It behaves more like 
effective_cpus in cpuset v2. With v2, we have 2 cpumasks - cpus_allowed 
and effective_cpus. When cpuset v1 is mounted, cpus_allowed and 
effective_cpus are effectively the same and track online cpus. With 
cpuset v2, cpus_allowed contains what the user has written into and it 
won't be changed until another write happen. However, what the user 
written may not be what the system can give it and effective_cpus is 
what the system decides a cpuset can use.

Cpuset v2 is able to handle hotplug correctly and update the task's 
cpumask accordingly. So missing previously offline cpus won't happen 
with v2.

Since v1 keeps the old behavior, previously offlined cpus are lost in 
the cpuset's cpus_allowed. However tasks in the root cpuset will still 
be fine with cpu hotplug as its cpus_allowed should track 
cpu_online_mask. IOW, only tasks in a non-root cpuset suffer this problem.

It was a known issue in v1 and I believe is one of the major reasons of 
the cpuset v2 redesign.

A major concern I have is the overhead of creating a poor man version of 
v2 cpus_allowed. This issue can be worked around even for cpuset v1 if 
it is mounted with the cpuset_v2_mode option to behave more like v2 in 
its cpumask handling. Alternatively we may be able to provide a config 
option to make this the default for v1 without the special mount option, 
if necessary.

Cheers,
Longman
  
Peter Zijlstra Feb. 1, 2023, 9:14 a.m. UTC | #2
On Tue, Jan 31, 2023 at 11:14:27PM -0500, Waiman Long wrote:
> On 1/31/23 17:17, Will Deacon wrote:
> > From: Peter Zijlstra <peterz@infradead.org>
> > 
> > There is a difference in behaviour between CPUSET={y,n} that is now
> > wrecking havoc with {relax,force}_compatible_cpus_allowed_ptr().
> > 
> > Specifically, since commit 8f9ea86fdf99 ("sched: Always preserve the
> > user requested cpumask")  relax_compatible_cpus_allowed_ptr() is
> > calling __sched_setaffinity() unconditionally.
> > 
> > But the underlying problem goes back a lot further, possibly to
> > commit: ae1c802382f7 ("cpuset: apply cs->effective_{cpus,mems}") which
> > switched cpuset_cpus_allowed() from cs->cpus_allowed to
> > cs->effective_cpus.
> > 
> > The problem is that for CPUSET=y cpuset_cpus_allowed() will filter out
> > all offline CPUs. For tasks that are part of a (!root) cpuset this is
> > then later fixed up by the cpuset hotplug notifiers that re-evaluate
> > and re-apply cs->effective_cpus, but for (normal) tasks in the root
> > cpuset this does not happen and they will forever after be excluded
> > from CPUs onlined later.
> > 
> > As such, rewrite cpuset_cpus_allowed() to return a wider mask,
> > including the offline CPUs.
> > 
> > Fixes: 8f9ea86fdf99 ("sched: Always preserve the user requested cpumask")
> > Reported-by: Will Deacon <will@kernel.org>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Link: https://lkml.kernel.org/r/20230117160825.GA17756@willie-the-truck
> > Signed-off-by: Will Deacon <will@kernel.org>
> 
> Before cgroup v2, cpuset had only one cpumask - cpus_allowed. It only
> tracked online cpus and ignored the offline ones. It behaves more like
> effective_cpus in cpuset v2. With v2, we have 2 cpumasks - cpus_allowed and
> effective_cpus. When cpuset v1 is mounted, cpus_allowed and effective_cpus
> are effectively the same and track online cpus. With cpuset v2, cpus_allowed
> contains what the user has written into and it won't be changed until
> another write happen. However, what the user written may not be what the
> system can give it and effective_cpus is what the system decides a cpuset
> can use.
> 
> Cpuset v2 is able to handle hotplug correctly and update the task's cpumask
> accordingly. So missing previously offline cpus won't happen with v2.
> 
> Since v1 keeps the old behavior, previously offlined cpus are lost in the
> cpuset's cpus_allowed. However tasks in the root cpuset will still be fine
> with cpu hotplug as its cpus_allowed should track cpu_online_mask. IOW, only
> tasks in a non-root cpuset suffer this problem.
> 
> It was a known issue in v1 and I believe is one of the major reasons of the
> cpuset v2 redesign.
> 
> A major concern I have is the overhead of creating a poor man version of v2
> cpus_allowed. This issue can be worked around even for cpuset v1 if it is
> mounted with the cpuset_v2_mode option to behave more like v2 in its cpumask
> handling. Alternatively we may be able to provide a config option to make
> this the default for v1 without the special mount option, if necessary.

You're still not getting it -- even cpuset (be it v1 or v2) *MUST* *NOT*
mask offline cpus for root cgroup tasks, ever. (And the only reason it
gets away with masking offline for !root is that it re-applies the mask
every time it changes.)

Yes it did that for a fair while -- but it is wrong and broken and a
very big behavioural difference between CONFIG_CPUSET={y,n}. This must
not be.

Arguably cpuset-v2 is still wrong for masking offline cpus in it's
effective_cpus mask, but I really didn't want to go rewrite cpuset.c for
something that needs to go into /urgent *now*.

Hence this minimal patch that at least lets sched_setaffinity() work as
intended.
  
Waiman Long Feb. 1, 2023, 3:16 p.m. UTC | #3
On 2/1/23 04:14, Peter Zijlstra wrote:
> On Tue, Jan 31, 2023 at 11:14:27PM -0500, Waiman Long wrote:
>> On 1/31/23 17:17, Will Deacon wrote:
>>> From: Peter Zijlstra <peterz@infradead.org>
>>>
>>> There is a difference in behaviour between CPUSET={y,n} that is now
>>> wrecking havoc with {relax,force}_compatible_cpus_allowed_ptr().
>>>
>>> Specifically, since commit 8f9ea86fdf99 ("sched: Always preserve the
>>> user requested cpumask")  relax_compatible_cpus_allowed_ptr() is
>>> calling __sched_setaffinity() unconditionally.
>>>
>>> But the underlying problem goes back a lot further, possibly to
>>> commit: ae1c802382f7 ("cpuset: apply cs->effective_{cpus,mems}") which
>>> switched cpuset_cpus_allowed() from cs->cpus_allowed to
>>> cs->effective_cpus.
>>>
>>> The problem is that for CPUSET=y cpuset_cpus_allowed() will filter out
>>> all offline CPUs. For tasks that are part of a (!root) cpuset this is
>>> then later fixed up by the cpuset hotplug notifiers that re-evaluate
>>> and re-apply cs->effective_cpus, but for (normal) tasks in the root
>>> cpuset this does not happen and they will forever after be excluded
>>> from CPUs onlined later.
>>>
>>> As such, rewrite cpuset_cpus_allowed() to return a wider mask,
>>> including the offline CPUs.
>>>
>>> Fixes: 8f9ea86fdf99 ("sched: Always preserve the user requested cpumask")
>>> Reported-by: Will Deacon <will@kernel.org>
>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>> Link: https://lkml.kernel.org/r/20230117160825.GA17756@willie-the-truck
>>> Signed-off-by: Will Deacon <will@kernel.org>
>> Before cgroup v2, cpuset had only one cpumask - cpus_allowed. It only
>> tracked online cpus and ignored the offline ones. It behaves more like
>> effective_cpus in cpuset v2. With v2, we have 2 cpumasks - cpus_allowed and
>> effective_cpus. When cpuset v1 is mounted, cpus_allowed and effective_cpus
>> are effectively the same and track online cpus. With cpuset v2, cpus_allowed
>> contains what the user has written into and it won't be changed until
>> another write happen. However, what the user written may not be what the
>> system can give it and effective_cpus is what the system decides a cpuset
>> can use.
>>
>> Cpuset v2 is able to handle hotplug correctly and update the task's cpumask
>> accordingly. So missing previously offline cpus won't happen with v2.
>>
>> Since v1 keeps the old behavior, previously offlined cpus are lost in the
>> cpuset's cpus_allowed. However tasks in the root cpuset will still be fine
>> with cpu hotplug as its cpus_allowed should track cpu_online_mask. IOW, only
>> tasks in a non-root cpuset suffer this problem.
>>
>> It was a known issue in v1 and I believe is one of the major reasons of the
>> cpuset v2 redesign.
>>
>> A major concern I have is the overhead of creating a poor man version of v2
>> cpus_allowed. This issue can be worked around even for cpuset v1 if it is
>> mounted with the cpuset_v2_mode option to behave more like v2 in its cpumask
>> handling. Alternatively we may be able to provide a config option to make
>> this the default for v1 without the special mount option, if necessary.
> You're still not getting it -- even cpuset (be it v1 or v2) *MUST* *NOT*
> mask offline cpus for root cgroup tasks, ever. (And the only reason it
> gets away with masking offline for !root is that it re-applies the mask
> every time it changes.)
>
> Yes it did that for a fair while -- but it is wrong and broken and a
> very big behavioural difference between CONFIG_CPUSET={y,n}. This must
> not be.
>
> Arguably cpuset-v2 is still wrong for masking offline cpus in it's
> effective_cpus mask, but I really didn't want to go rewrite cpuset.c for
> something that needs to go into /urgent *now*.
>
> Hence this minimal patch that at least lets sched_setaffinity() work as
> intended.

I don't object to the general idea of keeping offline cpus in a task's 
cpu affinity. In the case of cpu offline event, we can skip removing 
that offline cpu from the task's cpu affinity. That will partially solve 
the problem here and is also simpler.

I believe a main reason why effective_cpus holds only online cpus is 
because of the need to detect when there is no online cpus available in 
a given cpuset. In this case, it will fall back to the nearest ancestors 
with online cpus.

This offline cpu problem with cpuset v1 is a known problem for a long 
time. It is not a recent regression.

Cheers,
Longman
  
Waiman Long Feb. 1, 2023, 6:46 p.m. UTC | #4
On 2/1/23 10:16, Waiman Long wrote:
> On 2/1/23 04:14, Peter Zijlstra wrote:
>> On Tue, Jan 31, 2023 at 11:14:27PM -0500, Waiman Long wrote:
>>> On 1/31/23 17:17, Will Deacon wrote:
>>>> From: Peter Zijlstra <peterz@infradead.org>
>>>>
>>>> There is a difference in behaviour between CPUSET={y,n} that is now
>>>> wrecking havoc with {relax,force}_compatible_cpus_allowed_ptr().
>>>>
>>>> Specifically, since commit 8f9ea86fdf99 ("sched: Always preserve the
>>>> user requested cpumask") relax_compatible_cpus_allowed_ptr() is
>>>> calling __sched_setaffinity() unconditionally.
>>>>
>>>> But the underlying problem goes back a lot further, possibly to
>>>> commit: ae1c802382f7 ("cpuset: apply cs->effective_{cpus,mems}") which
>>>> switched cpuset_cpus_allowed() from cs->cpus_allowed to
>>>> cs->effective_cpus.
>>>>
>>>> The problem is that for CPUSET=y cpuset_cpus_allowed() will filter out
>>>> all offline CPUs. For tasks that are part of a (!root) cpuset this is
>>>> then later fixed up by the cpuset hotplug notifiers that re-evaluate
>>>> and re-apply cs->effective_cpus, but for (normal) tasks in the root
>>>> cpuset this does not happen and they will forever after be excluded
>>>> from CPUs onlined later.
>>>>
>>>> As such, rewrite cpuset_cpus_allowed() to return a wider mask,
>>>> including the offline CPUs.
>>>>
>>>> Fixes: 8f9ea86fdf99 ("sched: Always preserve the user requested 
>>>> cpumask")
>>>> Reported-by: Will Deacon <will@kernel.org>
>>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>>> Link: 
>>>> https://lkml.kernel.org/r/20230117160825.GA17756@willie-the-truck
>>>> Signed-off-by: Will Deacon <will@kernel.org>
>>> Before cgroup v2, cpuset had only one cpumask - cpus_allowed. It only
>>> tracked online cpus and ignored the offline ones. It behaves more like
>>> effective_cpus in cpuset v2. With v2, we have 2 cpumasks - 
>>> cpus_allowed and
>>> effective_cpus. When cpuset v1 is mounted, cpus_allowed and 
>>> effective_cpus
>>> are effectively the same and track online cpus. With cpuset v2, 
>>> cpus_allowed
>>> contains what the user has written into and it won't be changed until
>>> another write happen. However, what the user written may not be what 
>>> the
>>> system can give it and effective_cpus is what the system decides a 
>>> cpuset
>>> can use.
>>>
>>> Cpuset v2 is able to handle hotplug correctly and update the task's 
>>> cpumask
>>> accordingly. So missing previously offline cpus won't happen with v2.
>>>
>>> Since v1 keeps the old behavior, previously offlined cpus are lost 
>>> in the
>>> cpuset's cpus_allowed. However tasks in the root cpuset will still 
>>> be fine
>>> with cpu hotplug as its cpus_allowed should track cpu_online_mask. 
>>> IOW, only
>>> tasks in a non-root cpuset suffer this problem.
>>>
>>> It was a known issue in v1 and I believe is one of the major reasons 
>>> of the
>>> cpuset v2 redesign.
>>>
>>> A major concern I have is the overhead of creating a poor man 
>>> version of v2
>>> cpus_allowed. This issue can be worked around even for cpuset v1 if 
>>> it is
>>> mounted with the cpuset_v2_mode option to behave more like v2 in its 
>>> cpumask
>>> handling. Alternatively we may be able to provide a config option to 
>>> make
>>> this the default for v1 without the special mount option, if necessary.
>> You're still not getting it -- even cpuset (be it v1 or v2) *MUST* *NOT*
>> mask offline cpus for root cgroup tasks, ever. (And the only reason it
>> gets away with masking offline for !root is that it re-applies the mask
>> every time it changes.)
>>
>> Yes it did that for a fair while -- but it is wrong and broken and a
>> very big behavioural difference between CONFIG_CPUSET={y,n}. This must
>> not be.
>>
>> Arguably cpuset-v2 is still wrong for masking offline cpus in it's
>> effective_cpus mask, but I really didn't want to go rewrite cpuset.c for
>> something that needs to go into /urgent *now*.
>>
>> Hence this minimal patch that at least lets sched_setaffinity() work as
>> intended.
>
> I don't object to the general idea of keeping offline cpus in a task's 
> cpu affinity. In the case of cpu offline event, we can skip removing 
> that offline cpu from the task's cpu affinity. That will partially 
> solve the problem here and is also simpler.
>
> I believe a main reason why effective_cpus holds only online cpus is 
> because of the need to detect when there is no online cpus available 
> in a given cpuset. In this case, it will fall back to the nearest 
> ancestors with online cpus.
>
> This offline cpu problem with cpuset v1 is a known problem for a long 
> time. It is not a recent regression.

Note that using cpus_allowed directly in cgroup v2 may not be right 
because cpus_allowed may have no relationship to effective_cpus at all 
in some cases, e.g.

    root
     |
     V
     A (cpus_allowed = 1-4, effective_cpus = 1-4)
     |
     V
     B (cpus_allowed = 5-8, effective_cpus = 1-4)

In the case of cpuset B, passing back cpus 5-8 as the allowed_cpus is wrong.

I wonder how often is cpu hotplug happening in those arm64 cpu systems 
that only have a subset of cpus that can run 32-bit programs.

Cheers,
Longman
  
Waiman Long Feb. 1, 2023, 7:14 p.m. UTC | #5
On 2/1/23 13:46, Waiman Long wrote:
> On 2/1/23 10:16, Waiman Long wrote:
>> On 2/1/23 04:14, Peter Zijlstra wrote:
>>> On Tue, Jan 31, 2023 at 11:14:27PM -0500, Waiman Long wrote:
>>>> On 1/31/23 17:17, Will Deacon wrote:
>>>>> From: Peter Zijlstra <peterz@infradead.org>
>>>>>
>>>>> There is a difference in behaviour between CPUSET={y,n} that is now
>>>>> wrecking havoc with {relax,force}_compatible_cpus_allowed_ptr().
>>>>>
>>>>> Specifically, since commit 8f9ea86fdf99 ("sched: Always preserve the
>>>>> user requested cpumask") relax_compatible_cpus_allowed_ptr() is
>>>>> calling __sched_setaffinity() unconditionally.
>>>>>
>>>>> But the underlying problem goes back a lot further, possibly to
>>>>> commit: ae1c802382f7 ("cpuset: apply cs->effective_{cpus,mems}") 
>>>>> which
>>>>> switched cpuset_cpus_allowed() from cs->cpus_allowed to
>>>>> cs->effective_cpus.
>>>>>
>>>>> The problem is that for CPUSET=y cpuset_cpus_allowed() will filter 
>>>>> out
>>>>> all offline CPUs. For tasks that are part of a (!root) cpuset this is
>>>>> then later fixed up by the cpuset hotplug notifiers that re-evaluate
>>>>> and re-apply cs->effective_cpus, but for (normal) tasks in the root
>>>>> cpuset this does not happen and they will forever after be excluded
>>>>> from CPUs onlined later.
>>>>>
>>>>> As such, rewrite cpuset_cpus_allowed() to return a wider mask,
>>>>> including the offline CPUs.
>>>>>
>>>>> Fixes: 8f9ea86fdf99 ("sched: Always preserve the user requested 
>>>>> cpumask")
>>>>> Reported-by: Will Deacon <will@kernel.org>
>>>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>>>> Link: 
>>>>> https://lkml.kernel.org/r/20230117160825.GA17756@willie-the-truck
>>>>> Signed-off-by: Will Deacon <will@kernel.org>
>>>> Before cgroup v2, cpuset had only one cpumask - cpus_allowed. It only
>>>> tracked online cpus and ignored the offline ones. It behaves more like
>>>> effective_cpus in cpuset v2. With v2, we have 2 cpumasks - 
>>>> cpus_allowed and
>>>> effective_cpus. When cpuset v1 is mounted, cpus_allowed and 
>>>> effective_cpus
>>>> are effectively the same and track online cpus. With cpuset v2, 
>>>> cpus_allowed
>>>> contains what the user has written into and it won't be changed until
>>>> another write happen. However, what the user written may not be 
>>>> what the
>>>> system can give it and effective_cpus is what the system decides a 
>>>> cpuset
>>>> can use.
>>>>
>>>> Cpuset v2 is able to handle hotplug correctly and update the task's 
>>>> cpumask
>>>> accordingly. So missing previously offline cpus won't happen with v2.
>>>>
>>>> Since v1 keeps the old behavior, previously offlined cpus are lost 
>>>> in the
>>>> cpuset's cpus_allowed. However tasks in the root cpuset will still 
>>>> be fine
>>>> with cpu hotplug as its cpus_allowed should track cpu_online_mask. 
>>>> IOW, only
>>>> tasks in a non-root cpuset suffer this problem.
>>>>
>>>> It was a known issue in v1 and I believe is one of the major 
>>>> reasons of the
>>>> cpuset v2 redesign.
>>>>
>>>> A major concern I have is the overhead of creating a poor man 
>>>> version of v2
>>>> cpus_allowed. This issue can be worked around even for cpuset v1 if 
>>>> it is
>>>> mounted with the cpuset_v2_mode option to behave more like v2 in 
>>>> its cpumask
>>>> handling. Alternatively we may be able to provide a config option 
>>>> to make
>>>> this the default for v1 without the special mount option, if 
>>>> necessary.
>>> You're still not getting it -- even cpuset (be it v1 or v2) *MUST* 
>>> *NOT*
>>> mask offline cpus for root cgroup tasks, ever. (And the only reason it
>>> gets away with masking offline for !root is that it re-applies the mask
>>> every time it changes.)
>>>
>>> Yes it did that for a fair while -- but it is wrong and broken and a
>>> very big behavioural difference between CONFIG_CPUSET={y,n}. This must
>>> not be.
>>>
>>> Arguably cpuset-v2 is still wrong for masking offline cpus in it's
>>> effective_cpus mask, but I really didn't want to go rewrite cpuset.c 
>>> for
>>> something that needs to go into /urgent *now*.
>>>
>>> Hence this minimal patch that at least lets sched_setaffinity() work as
>>> intended.
>>
>> I don't object to the general idea of keeping offline cpus in a 
>> task's cpu affinity. In the case of cpu offline event, we can skip 
>> removing that offline cpu from the task's cpu affinity. That will 
>> partially solve the problem here and is also simpler.
>>
>> I believe a main reason why effective_cpus holds only online cpus is 
>> because of the need to detect when there is no online cpus available 
>> in a given cpuset. In this case, it will fall back to the nearest 
>> ancestors with online cpus.
>>
>> This offline cpu problem with cpuset v1 is a known problem for a long 
>> time. It is not a recent regression.
>
> Note that using cpus_allowed directly in cgroup v2 may not be right 
> because cpus_allowed may have no relationship to effective_cpus at all 
> in some cases, e.g.
>
>    root
>     |
>     V
>     A (cpus_allowed = 1-4, effective_cpus = 1-4)
>     |
>     V
>     B (cpus_allowed = 5-8, effective_cpus = 1-4)
>
> In the case of cpuset B, passing back cpus 5-8 as the allowed_cpus is 
> wrong.
>
> I wonder how often is cpu hotplug happening in those arm64 cpu systems 
> that only have a subset of cpus that can run 32-bit programs.

One possible solution is to use cpuset_cpus_allowed_fallback() in case 
none of the cpus in the current cpuset is allowed to be used to run a 
given task.

Cheers,
Longman
  
Waiman Long Feb. 1, 2023, 7:17 p.m. UTC | #6
On 2/1/23 14:14, Waiman Long wrote:
> One possible solution is to use cpuset_cpus_allowed_fallback() in case 
> none of the cpus in the current cpuset is allowed to be used to run a 
> given task.

It looks like we will need to enhance cpuset_cpus_allowed_fallback() to 
walk up cpuset tree to find one that have useable cpus.

Cheers,
Longman
  
Peter Zijlstra Feb. 1, 2023, 9:10 p.m. UTC | #7
On Wed, Feb 01, 2023 at 01:46:11PM -0500, Waiman Long wrote:

> Note that using cpus_allowed directly in cgroup v2 may not be right because
> cpus_allowed may have no relationship to effective_cpus at all in some
> cases, e.g.
> 
>    root
>     |
>     V
>     A (cpus_allowed = 1-4, effective_cpus = 1-4)
>     |
>     V
>     B (cpus_allowed = 5-8, effective_cpus = 1-4)
> 
> In the case of cpuset B, passing back cpus 5-8 as the allowed_cpus is wrong.

I think my patch as written does the right thing here. Since the
intersection of (1-4) and (5-8) is empty it will move up the hierarchy
and we'll end up with (1-4) from the cgroup side of things.

So the purpose of __cs_cpus_allowed() is to override the cpus_allowed of
the root set and force it to cpu_possible_mask.

Then cs_cpus_allowed() computes the intersection of cs->cpus_allowed and
all it's parents. This will, in the case of B above, result in the empty
mask.

Then cpuset_cpus_allowed() has a loop that starts with
task_cpu_possible_mask(), intersects that with cs_cpus_allowed() and if
the intersection of that and cpu_online_mask is empty, moves up the
hierarchy. Given cs_cpus_allowed(B) is the empty mask, we'll move to A.

Note that since we force the mask of root to cpu_possible_mask,
cs_cpus_allowed(root) will be a no-op and if we guarantee (in arch code)
that cpu_online_mask always has a non-empty intersection with
task_cpu_possible_mask(), this loop is guaranteed to terminate with a
viable mask.
  
Waiman Long Feb. 2, 2023, 3:34 a.m. UTC | #8
On 2/1/23 16:10, Peter Zijlstra wrote:
> On Wed, Feb 01, 2023 at 01:46:11PM -0500, Waiman Long wrote:
>
>> Note that using cpus_allowed directly in cgroup v2 may not be right because
>> cpus_allowed may have no relationship to effective_cpus at all in some
>> cases, e.g.
>>
>>     root
>>      |
>>      V
>>      A (cpus_allowed = 1-4, effective_cpus = 1-4)
>>      |
>>      V
>>      B (cpus_allowed = 5-8, effective_cpus = 1-4)
>>
>> In the case of cpuset B, passing back cpus 5-8 as the allowed_cpus is wrong.
> I think my patch as written does the right thing here. Since the
> intersection of (1-4) and (5-8) is empty it will move up the hierarchy
> and we'll end up with (1-4) from the cgroup side of things.
>
> So the purpose of __cs_cpus_allowed() is to override the cpus_allowed of
> the root set and force it to cpu_possible_mask.
>
> Then cs_cpus_allowed() computes the intersection of cs->cpus_allowed and
> all it's parents. This will, in the case of B above, result in the empty
> mask.
>
> Then cpuset_cpus_allowed() has a loop that starts with
> task_cpu_possible_mask(), intersects that with cs_cpus_allowed() and if
> the intersection of that and cpu_online_mask is empty, moves up the
> hierarchy. Given cs_cpus_allowed(B) is the empty mask, we'll move to A.
>
> Note that since we force the mask of root to cpu_possible_mask,
> cs_cpus_allowed(root) will be a no-op and if we guarantee (in arch code)
> that cpu_online_mask always has a non-empty intersection with
> task_cpu_possible_mask(), this loop is guaranteed to terminate with a
> viable mask.

I will take a closer look at that tomorrow. I will be more comfortable 
ack'ing that if this is specific to v1 cpuset instead of applying this 
in both v1 and v2 since it is only v1 that is problematic.

Cheers,
Longman
  
Peter Zijlstra Feb. 2, 2023, 8:34 a.m. UTC | #9
On Tue, Jan 31, 2023 at 11:14:27PM -0500, Waiman Long wrote:

> A major concern I have is the overhead of creating a poor man version of v2
> cpus_allowed. This issue can be worked around even for cpuset v1 if it is
> mounted with the cpuset_v2_mode option to behave more like v2 in its cpumask
> handling. Alternatively we may be able to provide a config option to make
> this the default for v1 without the special mount option, if necessary.

It is equally broken for v2, it masks against effective_cpus. Not to
mention it explicitly starts with cpu_online_mask.
  
Waiman Long Feb. 2, 2023, 4:06 p.m. UTC | #10
On 2/2/23 03:34, Peter Zijlstra wrote:
> On Tue, Jan 31, 2023 at 11:14:27PM -0500, Waiman Long wrote:
>
>> A major concern I have is the overhead of creating a poor man version of v2
>> cpus_allowed. This issue can be worked around even for cpuset v1 if it is
>> mounted with the cpuset_v2_mode option to behave more like v2 in its cpumask
>> handling. Alternatively we may be able to provide a config option to make
>> this the default for v1 without the special mount option, if necessary.
> It is equally broken for v2, it masks against effective_cpus. Not to
> mention it explicitly starts with cpu_online_mask.

After taking a close look at the patch, my understanding of what it is 
doing is as follows:

v2: cpus_allowed will not be affected by hotplug. So the new 
cpuset_cpus_allowed() will return effective_cpus + offline cpus that 
should have been part of effective_cpus if online before masking it with 
allowable cpus and then go up the cpuset hierarchy if necessary.

v1: cpus_allowed is equivalent to v2 effective_cpus. It starts at the 
current cpuset and move up the hierarchy if necessary to find a cpuset 
that have at least one allowable cpu.

First of all, it does not take into account of the v2 partition feature 
that may cause it to produce incorrect result if partition is enabled 
somewhere. Secondly, I don't see any benefit other than having some 
additional offline cpu available in a task's cpumask which the scheduler 
will ignore anyway. v2 is able to recover a previously offlined cpu. So 
we don't gain any net benefit other than the going up the cpuset 
hierarchy part.

For v1, I agree we should go up the cpuset hierarchy to find a usable 
cpuset. Instead of introducing such a complexity in 
cpuset_cpus_allowed(), my current preference is to do the hierarchy 
climbing part in an enhanced cpuset_cpus_allowed_fallback() after an 
initial failure of cpuset_cpus_allowed(). That will be easier to 
understand than having such complexity and overhead in 
cpuset_cpus_allowed() alone.

I will work on a patchset to do that as a counter offer.

Cheers,
Longman
  
Peter Zijlstra Feb. 2, 2023, 7:42 p.m. UTC | #11
On Thu, Feb 02, 2023 at 11:06:51AM -0500, Waiman Long wrote:

> After taking a close look at the patch, my understanding of what it is doing
> is as follows:
> 
> v2: cpus_allowed will not be affected by hotplug. So the new
> cpuset_cpus_allowed() will return effective_cpus + offline cpus that should
> have been part of effective_cpus if online before masking it with allowable
> cpus and then go up the cpuset hierarchy if necessary.
> 
> v1: cpus_allowed is equivalent to v2 effective_cpus. It starts at the
> current cpuset and move up the hierarchy if necessary to find a cpuset that
> have at least one allowable cpu.
> 
> First of all, it does not take into account of the v2 partition feature that
> may cause it to produce incorrect result if partition is enabled somewhere.

How so? For a partition the cpus_allowed mask should be the parition
CPUs. The only magical bit about partitions is that any one CPU cannot
belong to two partitions and load-balancing is split.

> Secondly, I don't see any benefit other than having some additional offline
> cpu available in a task's cpumask which the scheduler will ignore anyway.

Those CPUs can come online again -- you're *again* dismissing the true
bug :/

If you filter out the offline CPUs at sched_setaffinity() time, you
forever lose those CPUs, the task will never again move to those CPUs,
even if they do come online after.

It is really simple to reproduce this:

 - boot machine
 - offline all CPUs except one
 - taskset -p ffffffff $$
 - online all CPUs

and observe your shell (and all its decendants) being stuck to the one
CPU. Do the same thing on a CPUSET=n build and note the difference (you
retain the full mask).

> v2 is able to recover a previously offlined cpu. So we don't gain any
> net benefit other than the going up the cpuset hierarchy part.

Only for !root tasks. Not even v2 will re-set the affinity of root tasks
afaict.

> For v1, I agree we should go up the cpuset hierarchy to find a usable
> cpuset. Instead of introducing such a complexity in cpuset_cpus_allowed(),
> my current preference is to do the hierarchy climbing part in an enhanced
> cpuset_cpus_allowed_fallback() after an initial failure of
> cpuset_cpus_allowed(). That will be easier to understand than having such
> complexity and overhead in cpuset_cpus_allowed() alone.
> 
> I will work on a patchset to do that as a counter offer.

We will need a small and simple patch for /urgent, or I will need to
revert all your patches -- your call.

I also don't tihnk you fully appreciate the ramifications of
task_cpu_possible_mask(), cpuset currently gets that quite wrong.
  
Waiman Long Feb. 2, 2023, 8:46 p.m. UTC | #12
On 2/2/23 14:42, Peter Zijlstra wrote:
> On Thu, Feb 02, 2023 at 11:06:51AM -0500, Waiman Long wrote:
>
>> After taking a close look at the patch, my understanding of what it is doing
>> is as follows:
>>
>> v2: cpus_allowed will not be affected by hotplug. So the new
>> cpuset_cpus_allowed() will return effective_cpus + offline cpus that should
>> have been part of effective_cpus if online before masking it with allowable
>> cpus and then go up the cpuset hierarchy if necessary.
>>
>> v1: cpus_allowed is equivalent to v2 effective_cpus. It starts at the
>> current cpuset and move up the hierarchy if necessary to find a cpuset that
>> have at least one allowable cpu.
>>
>> First of all, it does not take into account of the v2 partition feature that
>> may cause it to produce incorrect result if partition is enabled somewhere.
> How so? For a partition the cpus_allowed mask should be the parition
> CPUs. The only magical bit about partitions is that any one CPU cannot
> belong to two partitions and load-balancing is split.
There can be a child partition underneath it that uses up some of the 
cpus in cpus_allowed mask. So if you cascading up the cpuset tree from 
another child, your code will wrongly include those cpus that are 
dedicated to the other child partition.
>
>> Secondly, I don't see any benefit other than having some additional offline
>> cpu available in a task's cpumask which the scheduler will ignore anyway.
> Those CPUs can come online again -- you're *again* dismissing the true
> bug :/
>
> If you filter out the offline CPUs at sched_setaffinity() time, you
> forever lose those CPUs, the task will never again move to those CPUs,
> even if they do come online after.
>
> It is really simple to reproduce this:
>
>   - boot machine
>   - offline all CPUs except one
>   - taskset -p ffffffff $$
>   - online all CPUs
>
> and observe your shell (and all its decendants) being stuck to the one
> CPU. Do the same thing on a CPUSET=n build and note the difference (you
> retain the full mask).

With the new sched_setaffinity(), the original mask should be stored in 
user_cpus_ptr. The cpuset code is supposed to call 
set_cpus_allowed_ptr() which then set the mask correctly. I will run 
your test and figure out why it does not work as I would have expected.


>
>> v2 is able to recover a previously offlined cpu. So we don't gain any
>> net benefit other than the going up the cpuset hierarchy part.
> Only for !root tasks. Not even v2 will re-set the affinity of root tasks
> afaict.
>
>> For v1, I agree we should go up the cpuset hierarchy to find a usable
>> cpuset. Instead of introducing such a complexity in cpuset_cpus_allowed(),
>> my current preference is to do the hierarchy climbing part in an enhanced
>> cpuset_cpus_allowed_fallback() after an initial failure of
>> cpuset_cpus_allowed(). That will be easier to understand than having such
>> complexity and overhead in cpuset_cpus_allowed() alone.
>>
>> I will work on a patchset to do that as a counter offer.
> We will need a small and simple patch for /urgent, or I will need to
> revert all your patches -- your call.
>
> I also don't tihnk you fully appreciate the ramifications of
> task_cpu_possible_mask(), cpuset currently gets that quite wrong.

OK, I don't realize the urgency of that. If it is that urgent, I will 
have no objection to get it in for now. We can improve it later on. So 
are you planning to get it into the current 6.2 rc or 6.3?

Tejun, are you OK with that as you are the cgroup maintainer?

Cheers,
Longman
  
Tejun Heo Feb. 2, 2023, 8:48 p.m. UTC | #13
On Thu, Feb 02, 2023 at 03:46:02PM -0500, Waiman Long wrote:
> > > I will work on a patchset to do that as a counter offer.
> > We will need a small and simple patch for /urgent, or I will need to
> > revert all your patches -- your call.
> > 
> > I also don't tihnk you fully appreciate the ramifications of
> > task_cpu_possible_mask(), cpuset currently gets that quite wrong.
> 
> OK, I don't realize the urgency of that. If it is that urgent, I will have
> no objection to get it in for now. We can improve it later on. So are you
> planning to get it into the current 6.2 rc or 6.3?
> 
> Tejun, are you OK with that as you are the cgroup maintainer?

Yeah, gotta fix the regression but is there currently a solution which fixes
the regression but doesn't further break other stuff?

Thanks.
  
Waiman Long Feb. 2, 2023, 8:53 p.m. UTC | #14
On 2/2/23 15:48, Tejun Heo wrote:
> On Thu, Feb 02, 2023 at 03:46:02PM -0500, Waiman Long wrote:
>>>> I will work on a patchset to do that as a counter offer.
>>> We will need a small and simple patch for /urgent, or I will need to
>>> revert all your patches -- your call.
>>>
>>> I also don't tihnk you fully appreciate the ramifications of
>>> task_cpu_possible_mask(), cpuset currently gets that quite wrong.
>> OK, I don't realize the urgency of that. If it is that urgent, I will have
>> no objection to get it in for now. We can improve it later on. So are you
>> planning to get it into the current 6.2 rc or 6.3?
>>
>> Tejun, are you OK with that as you are the cgroup maintainer?
> Yeah, gotta fix the regression but is there currently a solution which fixes
> the regression but doesn't further break other stuff?

I believe there is a better way to do that, but it will need more time 
to flex out. Since cpuset_cpus_allowed() is only used by 
kernel/sched/core.c, Peter will be responsible if it somehow breaks 
other stuff.

Thanks,
Longman
  
Waiman Long Feb. 2, 2023, 9:05 p.m. UTC | #15
On 2/2/23 15:53, Waiman Long wrote:
>
> On 2/2/23 15:48, Tejun Heo wrote:
>> On Thu, Feb 02, 2023 at 03:46:02PM -0500, Waiman Long wrote:
>>>>> I will work on a patchset to do that as a counter offer.
>>>> We will need a small and simple patch for /urgent, or I will need to
>>>> revert all your patches -- your call.
>>>>
>>>> I also don't tihnk you fully appreciate the ramifications of
>>>> task_cpu_possible_mask(), cpuset currently gets that quite wrong.
>>> OK, I don't realize the urgency of that. If it is that urgent, I 
>>> will have
>>> no objection to get it in for now. We can improve it later on. So 
>>> are you
>>> planning to get it into the current 6.2 rc or 6.3?
>>>
>>> Tejun, are you OK with that as you are the cgroup maintainer?
>> Yeah, gotta fix the regression but is there currently a solution 
>> which fixes
>> the regression but doesn't further break other stuff?
>
> I believe there is a better way to do that, but it will need more time 
> to flex out. Since cpuset_cpus_allowed() is only used by 
> kernel/sched/core.c, Peter will be responsible if it somehow breaks 
> other stuff.

Maybe my cpuset patch that don't update task's cpumask on cpu offline 
event can help. However, I don't know the exact scenario where the 
regression happen, so it may not.

Cheers,
Longman
  
Tejun Heo Feb. 2, 2023, 9:50 p.m. UTC | #16
On Thu, Feb 02, 2023 at 04:05:14PM -0500, Waiman Long wrote:
> 
> On 2/2/23 15:53, Waiman Long wrote:
> > 
> > On 2/2/23 15:48, Tejun Heo wrote:
> > > On Thu, Feb 02, 2023 at 03:46:02PM -0500, Waiman Long wrote:
> > > > > > I will work on a patchset to do that as a counter offer.
> > > > > We will need a small and simple patch for /urgent, or I will need to
> > > > > revert all your patches -- your call.
> > > > > 
> > > > > I also don't tihnk you fully appreciate the ramifications of
> > > > > task_cpu_possible_mask(), cpuset currently gets that quite wrong.
> > > > OK, I don't realize the urgency of that. If it is that urgent, I
> > > > will have
> > > > no objection to get it in for now. We can improve it later on.
> > > > So are you
> > > > planning to get it into the current 6.2 rc or 6.3?
> > > > 
> > > > Tejun, are you OK with that as you are the cgroup maintainer?
> > > Yeah, gotta fix the regression but is there currently a solution
> > > which fixes
> > > the regression but doesn't further break other stuff?
> > 
> > I believe there is a better way to do that, but it will need more time
> > to flex out. Since cpuset_cpus_allowed() is only used by
> > kernel/sched/core.c, Peter will be responsible if it somehow breaks
> > other stuff.
> 
> Maybe my cpuset patch that don't update task's cpumask on cpu offline event
> can help. However, I don't know the exact scenario where the regression
> happen, so it may not.

Neither patch looks like they would break anything. That said, the patches
aren't trivial and we're really close to the merge window, so I'd really
appreciate if you can take a look and test a bit before we send these
Linus's way. We can replace it with a better solution afterwards.

Thanks.
  
Waiman Long Feb. 3, 2023, 12:54 a.m. UTC | #17
On 2/2/23 16:50, Tejun Heo wrote:
> On Thu, Feb 02, 2023 at 04:05:14PM -0500, Waiman Long wrote:
>> On 2/2/23 15:53, Waiman Long wrote:
>>> On 2/2/23 15:48, Tejun Heo wrote:
>>>> On Thu, Feb 02, 2023 at 03:46:02PM -0500, Waiman Long wrote:
>>>>>>> I will work on a patchset to do that as a counter offer.
>>>>>> We will need a small and simple patch for /urgent, or I will need to
>>>>>> revert all your patches -- your call.
>>>>>>
>>>>>> I also don't tihnk you fully appreciate the ramifications of
>>>>>> task_cpu_possible_mask(), cpuset currently gets that quite wrong.
>>>>> OK, I don't realize the urgency of that. If it is that urgent, I
>>>>> will have
>>>>> no objection to get it in for now. We can improve it later on.
>>>>> So are you
>>>>> planning to get it into the current 6.2 rc or 6.3?
>>>>>
>>>>> Tejun, are you OK with that as you are the cgroup maintainer?
>>>> Yeah, gotta fix the regression but is there currently a solution
>>>> which fixes
>>>> the regression but doesn't further break other stuff?
>>> I believe there is a better way to do that, but it will need more time
>>> to flex out. Since cpuset_cpus_allowed() is only used by
>>> kernel/sched/core.c, Peter will be responsible if it somehow breaks
>>> other stuff.
>> Maybe my cpuset patch that don't update task's cpumask on cpu offline event
>> can help. However, I don't know the exact scenario where the regression
>> happen, so it may not.
> Neither patch looks like they would break anything. That said, the patches
> aren't trivial and we're really close to the merge window, so I'd really
> appreciate if you can take a look and test a bit before we send these
> Linus's way. We can replace it with a better solution afterwards.

OK, will do.

Cheers,
Longman
  
Will Deacon Feb. 3, 2023, 11:50 a.m. UTC | #18
On Wed, Feb 01, 2023 at 10:34:00PM -0500, Waiman Long wrote:
> On 2/1/23 16:10, Peter Zijlstra wrote:
> > On Wed, Feb 01, 2023 at 01:46:11PM -0500, Waiman Long wrote:
> > 
> > > Note that using cpus_allowed directly in cgroup v2 may not be right because
> > > cpus_allowed may have no relationship to effective_cpus at all in some
> > > cases, e.g.
> > > 
> > >     root
> > >      |
> > >      V
> > >      A (cpus_allowed = 1-4, effective_cpus = 1-4)
> > >      |
> > >      V
> > >      B (cpus_allowed = 5-8, effective_cpus = 1-4)
> > > 
> > > In the case of cpuset B, passing back cpus 5-8 as the allowed_cpus is wrong.
> > I think my patch as written does the right thing here. Since the
> > intersection of (1-4) and (5-8) is empty it will move up the hierarchy
> > and we'll end up with (1-4) from the cgroup side of things.
> > 
> > So the purpose of __cs_cpus_allowed() is to override the cpus_allowed of
> > the root set and force it to cpu_possible_mask.
> > 
> > Then cs_cpus_allowed() computes the intersection of cs->cpus_allowed and
> > all it's parents. This will, in the case of B above, result in the empty
> > mask.
> > 
> > Then cpuset_cpus_allowed() has a loop that starts with
> > task_cpu_possible_mask(), intersects that with cs_cpus_allowed() and if
> > the intersection of that and cpu_online_mask is empty, moves up the
> > hierarchy. Given cs_cpus_allowed(B) is the empty mask, we'll move to A.
> > 
> > Note that since we force the mask of root to cpu_possible_mask,
> > cs_cpus_allowed(root) will be a no-op and if we guarantee (in arch code)
> > that cpu_online_mask always has a non-empty intersection with
> > task_cpu_possible_mask(), this loop is guaranteed to terminate with a
> > viable mask.
> 
> I will take a closer look at that tomorrow. I will be more comfortable
> ack'ing that if this is specific to v1 cpuset instead of applying this in
> both v1 and v2 since it is only v1 that is problematic.

fwiw, the regression I'm seeing is with cgroup2. I haven't tried v1.

WIll
  
Waiman Long Feb. 3, 2023, 3:13 p.m. UTC | #19
On 2/3/23 06:50, Will Deacon wrote:
> On Wed, Feb 01, 2023 at 10:34:00PM -0500, Waiman Long wrote:
>> On 2/1/23 16:10, Peter Zijlstra wrote:
>>> On Wed, Feb 01, 2023 at 01:46:11PM -0500, Waiman Long wrote:
>>>
>>>> Note that using cpus_allowed directly in cgroup v2 may not be right because
>>>> cpus_allowed may have no relationship to effective_cpus at all in some
>>>> cases, e.g.
>>>>
>>>>      root
>>>>       |
>>>>       V
>>>>       A (cpus_allowed = 1-4, effective_cpus = 1-4)
>>>>       |
>>>>       V
>>>>       B (cpus_allowed = 5-8, effective_cpus = 1-4)
>>>>
>>>> In the case of cpuset B, passing back cpus 5-8 as the allowed_cpus is wrong.
>>> I think my patch as written does the right thing here. Since the
>>> intersection of (1-4) and (5-8) is empty it will move up the hierarchy
>>> and we'll end up with (1-4) from the cgroup side of things.
>>>
>>> So the purpose of __cs_cpus_allowed() is to override the cpus_allowed of
>>> the root set and force it to cpu_possible_mask.
>>>
>>> Then cs_cpus_allowed() computes the intersection of cs->cpus_allowed and
>>> all it's parents. This will, in the case of B above, result in the empty
>>> mask.
>>>
>>> Then cpuset_cpus_allowed() has a loop that starts with
>>> task_cpu_possible_mask(), intersects that with cs_cpus_allowed() and if
>>> the intersection of that and cpu_online_mask is empty, moves up the
>>> hierarchy. Given cs_cpus_allowed(B) is the empty mask, we'll move to A.
>>>
>>> Note that since we force the mask of root to cpu_possible_mask,
>>> cs_cpus_allowed(root) will be a no-op and if we guarantee (in arch code)
>>> that cpu_online_mask always has a non-empty intersection with
>>> task_cpu_possible_mask(), this loop is guaranteed to terminate with a
>>> viable mask.
>> I will take a closer look at that tomorrow. I will be more comfortable
>> ack'ing that if this is specific to v1 cpuset instead of applying this in
>> both v1 and v2 since it is only v1 that is problematic.
> fwiw, the regression I'm seeing is with cgroup2. I haven't tried v1.

I think I know where the problem is. It is due to the fact the cpuset 
hotplug code doesn't update cpumasks of the tasks in the top cpuset 
(root) at all when there is a cpu offline or online event. It is 
probably because for some of the tasks in the top cpuset, especially the 
percpu kthread, changing their cpumasks can be catastrophic. The hotplug 
code does update the cpumasks of the tasks that are not in the top 
cpuset. This problem is irrespective of whether v1 or v2 is in use.

The partition code does try to update the cpumasks of the tasks in the 
top cpuset, but skip the percpu kthreads. My testing show that is 
probably OK. For safety, I agree that is better to extend the allowed 
cpu list to all the possible cpus (including offline ones) for now until 
more testings are done to find a safe way to do that. That special case 
should apply only to tasks in the top cpuset. For the rests, the current 
code should be OK and is less risky than adopting code in this patch.

Cheers,
Longman
  
Peter Zijlstra Feb. 3, 2023, 3:26 p.m. UTC | #20
On Fri, Feb 03, 2023 at 10:13:10AM -0500, Waiman Long wrote:

> I think I know where the problem is. It is due to the fact the cpuset
> hotplug code doesn't update cpumasks of the tasks in the top cpuset (root)
> at all when there is a cpu offline or online event. It is probably because
> for some of the tasks in the top cpuset, especially the percpu kthread,
> changing their cpumasks can be catastrophic. The hotplug code does update
> the cpumasks of the tasks that are not in the top cpuset. This problem is
> irrespective of whether v1 or v2 is in use.

I've been saying this exact thing for how many mails now?
  
Waiman Long Feb. 3, 2023, 3:35 p.m. UTC | #21
On 2/3/23 10:26, Peter Zijlstra wrote:
> On Fri, Feb 03, 2023 at 10:13:10AM -0500, Waiman Long wrote:
>
>> I think I know where the problem is. It is due to the fact the cpuset
>> hotplug code doesn't update cpumasks of the tasks in the top cpuset (root)
>> at all when there is a cpu offline or online event. It is probably because
>> for some of the tasks in the top cpuset, especially the percpu kthread,
>> changing their cpumasks can be catastrophic. The hotplug code does update
>> the cpumasks of the tasks that are not in the top cpuset. This problem is
>> irrespective of whether v1 or v2 is in use.
> I've been saying this exact thing for how many mails now?

My bad. The fact that sched_getaffinity() masks off the offline cpus 
makes me thought incorrectly that tasks in the top cpuset were also 
updated by the hotplug code. Further testing indicates this is the case.

Thanks,
Longman
  
Will Deacon Feb. 3, 2023, 4:31 p.m. UTC | #22
On Thu, Feb 02, 2023 at 11:50:55AM -1000, Tejun Heo wrote:
> On Thu, Feb 02, 2023 at 04:05:14PM -0500, Waiman Long wrote:
> > 
> > On 2/2/23 15:53, Waiman Long wrote:
> > > 
> > > On 2/2/23 15:48, Tejun Heo wrote:
> > > > On Thu, Feb 02, 2023 at 03:46:02PM -0500, Waiman Long wrote:
> > > > > > > I will work on a patchset to do that as a counter offer.
> > > > > > We will need a small and simple patch for /urgent, or I will need to
> > > > > > revert all your patches -- your call.
> > > > > > 
> > > > > > I also don't tihnk you fully appreciate the ramifications of
> > > > > > task_cpu_possible_mask(), cpuset currently gets that quite wrong.
> > > > > OK, I don't realize the urgency of that. If it is that urgent, I
> > > > > will have
> > > > > no objection to get it in for now. We can improve it later on.
> > > > > So are you
> > > > > planning to get it into the current 6.2 rc or 6.3?
> > > > > 
> > > > > Tejun, are you OK with that as you are the cgroup maintainer?
> > > > Yeah, gotta fix the regression but is there currently a solution
> > > > which fixes
> > > > the regression but doesn't further break other stuff?
> > > 
> > > I believe there is a better way to do that, but it will need more time
> > > to flex out. Since cpuset_cpus_allowed() is only used by
> > > kernel/sched/core.c, Peter will be responsible if it somehow breaks
> > > other stuff.
> > 
> > Maybe my cpuset patch that don't update task's cpumask on cpu offline event
> > can help. However, I don't know the exact scenario where the regression
> > happen, so it may not.
> 
> Neither patch looks like they would break anything. That said, the patches
> aren't trivial and we're really close to the merge window, so I'd really
> appreciate if you can take a look and test a bit before we send these
> Linus's way. We can replace it with a better solution afterwards.

FWIW, I tested this series in an arm64 heterogeneous setup with things
like hotplug and exec()ing between 32-bit and 64-bit tasks and it all
seems good.

The alternative would be to revert Waiman's setaffinity changes, which
I've had a go at here:

https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=ssa-reverts

and I _think_ I've rescued the UAF fix too.

What do people prefer?

Will
  

Patch

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a29c0b13706b..8552cc2c586a 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3683,23 +3683,52 @@  void __init cpuset_init_smp(void)
 	BUG_ON(!cpuset_migrate_mm_wq);
 }
 
+static const struct cpumask *__cs_cpus_allowed(struct cpuset *cs)
+{
+	const struct cpumask *cs_mask = cs->cpus_allowed;
+	if (!parent_cs(cs))
+		cs_mask = cpu_possible_mask;
+	return cs_mask;
+}
+
+static void cs_cpus_allowed(struct cpuset *cs, struct cpumask *pmask)
+{
+	do {
+		cpumask_and(pmask, pmask, __cs_cpus_allowed(cs));
+		cs = parent_cs(cs);
+	} while (cs);
+}
+
 /**
  * cpuset_cpus_allowed - return cpus_allowed mask from a tasks cpuset.
  * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed.
  * @pmask: pointer to struct cpumask variable to receive cpus_allowed set.
  *
- * Description: Returns the cpumask_var_t cpus_allowed of the cpuset
- * attached to the specified @tsk.  Guaranteed to return some non-empty
- * subset of cpu_online_mask, even if this means going outside the
- * tasks cpuset.
+ * Description: Returns the cpumask_var_t cpus_allowed of the cpuset attached
+ * to the specified @tsk.  Guaranteed to return some non-empty intersection
+ * with cpu_online_mask, even if this means going outside the tasks cpuset.
  **/
 
 void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
 {
 	unsigned long flags;
+	struct cpuset *cs;
 
 	spin_lock_irqsave(&callback_lock, flags);
-	guarantee_online_cpus(tsk, pmask);
+	rcu_read_lock();
+
+	cs = task_cs(tsk);
+	do {
+		cpumask_copy(pmask, task_cpu_possible_mask(tsk));
+		cs_cpus_allowed(cs, pmask);
+
+		if (cpumask_intersects(pmask, cpu_online_mask))
+			break;
+
+		cs = parent_cs(cs);
+	} while (cs);
+
+	rcu_read_unlock();
 	spin_unlock_irqrestore(&callback_lock, flags);
 }