cgroup/cpuset: update parent subparts cpumask while holding css refcnt

Message ID 20230701065049.1758266-1-linmiaohe@huawei.com
State New
Series cgroup/cpuset: update parent subparts cpumask while holding css refcnt |

Commit Message

Miaohe Lin July 1, 2023, 6:50 a.m. UTC
  update_parent_subparts_cpumask() is called outside the RCU read-side
critical section without holding an extra css refcnt of cp. In theory, cp
could be freed at any time. Hold an extra css refcnt to ensure cp stays
valid while updating the parent subparts cpumask.

Fixes: d7c8142d5a55 ("cgroup/cpuset: Make partition invalid if cpumask change violates exclusivity rule")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 kernel/cgroup/cpuset.c | 3 +++
 1 file changed, 3 insertions(+)
  

Comments

Waiman Long July 1, 2023, 11:38 p.m. UTC | #1
On 7/1/23 02:50, Miaohe Lin wrote:
> update_parent_subparts_cpumask() is called outside RCU read-side critical
> section without holding extra css refcnt of cp. In theroy, cp could be
> freed at any time. Holding extra css refcnt to ensure cp is valid while
> updating parent subparts cpumask.
>
> Fixes: d7c8142d5a55 ("cgroup/cpuset: Make partition invalid if cpumask change violates exclusivity rule")
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> ---
>   kernel/cgroup/cpuset.c | 3 +++
>   1 file changed, 3 insertions(+)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 58e6f18f01c1..632a9986d5de 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1806,9 +1806,12 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
>   		cpuset_for_each_child(cp, css, parent)
>   			if (is_partition_valid(cp) &&
>   			    cpumask_intersects(trialcs->cpus_allowed, cp->cpus_allowed)) {
> +				if (!css_tryget_online(&cp->css))
> +					continue;
>   				rcu_read_unlock();
>   				update_parent_subparts_cpumask(cp, partcmd_invalidate, NULL, &tmp);
>   				rcu_read_lock();
> +				css_put(&cp->css);
>   			}
>   		rcu_read_unlock();
>   		retval = 0;

Thanks for finding that. It looks good to me.

Reviewed-by: Waiman Long <longman@redhat.com>
  
Waiman Long July 1, 2023, 11:46 p.m. UTC | #2
On 7/1/23 19:38, Waiman Long wrote:
> On 7/1/23 02:50, Miaohe Lin wrote:
>> update_parent_subparts_cpumask() is called outside RCU read-side 
>> critical
>> section without holding extra css refcnt of cp. In theroy, cp could be
>> freed at any time. Holding extra css refcnt to ensure cp is valid while
>> updating parent subparts cpumask.
>>
>> Fixes: d7c8142d5a55 ("cgroup/cpuset: Make partition invalid if 
>> cpumask change violates exclusivity rule")
>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>> ---
>>   kernel/cgroup/cpuset.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 58e6f18f01c1..632a9986d5de 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -1806,9 +1806,12 @@ static int update_cpumask(struct cpuset *cs, 
>> struct cpuset *trialcs,
>>           cpuset_for_each_child(cp, css, parent)
>>               if (is_partition_valid(cp) &&
>>                   cpumask_intersects(trialcs->cpus_allowed, 
>> cp->cpus_allowed)) {
>> +                if (!css_tryget_online(&cp->css))
>> +                    continue;
>>                   rcu_read_unlock();
>>                   update_parent_subparts_cpumask(cp, 
>> partcmd_invalidate, NULL, &tmp);
>>                   rcu_read_lock();
>> +                css_put(&cp->css);
>>               }
>>           rcu_read_unlock();
>>           retval = 0;
>
> Thanks for finding that. It looks good to me.
>
> Reviewed-by: Waiman Long <longman@redhat.com>

Though, I will say that an offline cpuset cannot be a valid partition 
root, so it is not really a problem. For correctness' sake and for 
consistency with other similar code, I am in favor of getting it merged.

Cheers,
Longman
  
Miaohe Lin July 3, 2023, 2:58 a.m. UTC | #3
On 2023/7/2 7:46, Waiman Long wrote:
> On 7/1/23 19:38, Waiman Long wrote:
>> On 7/1/23 02:50, Miaohe Lin wrote:
>>> update_parent_subparts_cpumask() is called outside RCU read-side critical
>>> section without holding extra css refcnt of cp. In theroy, cp could be
>>> freed at any time. Holding extra css refcnt to ensure cp is valid while
>>> updating parent subparts cpumask.
>>>
>>> Fixes: d7c8142d5a55 ("cgroup/cpuset: Make partition invalid if cpumask change violates exclusivity rule")
>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>> ---
>>>   kernel/cgroup/cpuset.c | 3 +++
>>>   1 file changed, 3 insertions(+)
>>>
>>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>>> index 58e6f18f01c1..632a9986d5de 100644
>>> --- a/kernel/cgroup/cpuset.c
>>> +++ b/kernel/cgroup/cpuset.c
>>> @@ -1806,9 +1806,12 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
>>>           cpuset_for_each_child(cp, css, parent)
>>>               if (is_partition_valid(cp) &&
>>>                   cpumask_intersects(trialcs->cpus_allowed, cp->cpus_allowed)) {
>>> +                if (!css_tryget_online(&cp->css))
>>> +                    continue;
>>>                   rcu_read_unlock();
>>>                   update_parent_subparts_cpumask(cp, partcmd_invalidate, NULL, &tmp);
>>>                   rcu_read_lock();
>>> +                css_put(&cp->css);
>>>               }
>>>           rcu_read_unlock();
>>>           retval = 0;
>>
>> Thanks for finding that. It looks good to me.
>>
>> Reviewed-by: Waiman Long <longman@redhat.com>
> 
> Though, I will say that an offline cpuset cannot be a valid partition root. So it is not really a problem. For correctness sake and consistency with other similar code, I am in favor of getting it merged.

Yes, cpuset_mutex will prevent the cpuset from going offline while the cpumask is being updated. And as you mentioned, this patch at least makes the code more consistent.
Thanks for your review and comment.
  
Tejun Heo July 3, 2023, 7:18 p.m. UTC | #4
On Mon, Jul 03, 2023 at 10:58:19AM +0800, Miaohe Lin wrote:
> On 2023/7/2 7:46, Waiman Long wrote:
> > On 7/1/23 19:38, Waiman Long wrote:
> >> On 7/1/23 02:50, Miaohe Lin wrote:
> >>> update_parent_subparts_cpumask() is called outside RCU read-side critical
> >>> section without holding extra css refcnt of cp. In theroy, cp could be
> >>> freed at any time. Holding extra css refcnt to ensure cp is valid while
> >>> updating parent subparts cpumask.
> >>>
> >>> Fixes: d7c8142d5a55 ("cgroup/cpuset: Make partition invalid if cpumask change violates exclusivity rule")
> >>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> >>> ---
> >>>   kernel/cgroup/cpuset.c | 3 +++
> >>>   1 file changed, 3 insertions(+)
> >>>
> >>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> >>> index 58e6f18f01c1..632a9986d5de 100644
> >>> --- a/kernel/cgroup/cpuset.c
> >>> +++ b/kernel/cgroup/cpuset.c
> >>> @@ -1806,9 +1806,12 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
> >>>           cpuset_for_each_child(cp, css, parent)
> >>>               if (is_partition_valid(cp) &&
> >>>                   cpumask_intersects(trialcs->cpus_allowed, cp->cpus_allowed)) {
> >>> +                if (!css_tryget_online(&cp->css))
> >>> +                    continue;
> >>>                   rcu_read_unlock();
> >>>                   update_parent_subparts_cpumask(cp, partcmd_invalidate, NULL, &tmp);
> >>>                   rcu_read_lock();
> >>> +                css_put(&cp->css);
> >>>               }
> >>>           rcu_read_unlock();
> >>>           retval = 0;
> >>
> >> Thanks for finding that. It looks good to me.
> >>
> >> Reviewed-by: Waiman Long <longman@redhat.com>
> > 
> > Though, I will say that an offline cpuset cannot be a valid partition root. So it is not really a problem. For correctness sake and consistency with other similar code, I am in favor of getting it merged.
> 
> Yes, cpuset_mutex will prevent cpuset from being offline while update cpumask. And as you mentioned, this patch makes code more consistency at least.

Can you update the patch description to note that this isn't required for
correctness?

Thanks.
  
Michal Koutný July 10, 2023, 3:11 p.m. UTC | #5
Hello.

On Sat, Jul 01, 2023 at 02:50:49PM +0800, Miaohe Lin <linmiaohe@huawei.com> wrote:
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1806,9 +1806,12 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
>  		cpuset_for_each_child(cp, css, parent)
>  			if (is_partition_valid(cp) &&
>  			    cpumask_intersects(trialcs->cpus_allowed, cp->cpus_allowed)) {
> +				if (!css_tryget_online(&cp->css))
> +					continue;
>  				rcu_read_unlock();
>  				update_parent_subparts_cpumask(cp, partcmd_invalidate, NULL, &tmp);
>  				rcu_read_lock();
> +				css_put(&cp->css);

Apologies for a possibly noob question -- why is RCU read lock
temporarily dropped within the loop?
(Is it only because of callback_lock or cgroup_file_kn_lock (via
notify_partition_change()) on PREEMPT_RT?)



[
OT question:
	cpuset_for_each_child(cp, css, parent)				(1)
		if (is_partition_valid(cp) &&
		    cpumask_intersects(trialcs->cpus_allowed, cp->cpus_allowed)) {
			if (!css_tryget_online(&cp->css))
				continue;
			rcu_read_unlock();
			update_parent_subparts_cpumask(cp, partcmd_invalidate, NULL, &tmp);
			  ...
			  update_tasks_cpumask(cp->parent)
			    ...
			    css_task_iter_start(&cp->parent->css, 0, &it);	(2)
			      ...
			rcu_read_lock();
			css_put(&cp->css);
		}

May this touch each task the same number of times as its depth within
the hierarchy?
]

Thanks,
Michal
  
Waiman Long July 10, 2023, 3:40 p.m. UTC | #6
On 7/10/23 11:11, Michal Koutný wrote:
> Hello.
>
> On Sat, Jul 01, 2023 at 02:50:49PM +0800, Miaohe Lin <linmiaohe@huawei.com> wrote:
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -1806,9 +1806,12 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
>>   		cpuset_for_each_child(cp, css, parent)
>>   			if (is_partition_valid(cp) &&
>>   			    cpumask_intersects(trialcs->cpus_allowed, cp->cpus_allowed)) {
>> +				if (!css_tryget_online(&cp->css))
>> +					continue;
>>   				rcu_read_unlock();
>>   				update_parent_subparts_cpumask(cp, partcmd_invalidate, NULL, &tmp);
>>   				rcu_read_lock();
>> +				css_put(&cp->css);
> Apologies for a possibly noob question -- why is RCU read lock
> temporarily dropped within the loop?
> (Is it only because of callback_lock or cgroup_file_kn_lock (via
> notify_partition_change()) on PREEMPT_RT?)
>
>
>
> [
> OT question:
> 	cpuset_for_each_child(cp, css, parent)				(1)
> 		if (is_partition_valid(cp) &&
> 		    cpumask_intersects(trialcs->cpus_allowed, cp->cpus_allowed)) {
> 			if (!css_tryget_online(&cp->css))
> 				continue;
> 			rcu_read_unlock();
> 			update_parent_subparts_cpumask(cp, partcmd_invalidate, NULL, &tmp);
> 			  ...
> 			  update_tasks_cpumask(cp->parent)
> 			    ...
> 			    css_task_iter_start(&cp->parent->css, 0, &it);	(2)
> 			      ...
> 			rcu_read_lock();
> 			css_put(&cp->css);
> 		}
>
> May this touch each task same number of times as its depth within
> herarchy?

I believe the primary reason is that update_parent_subparts_cpumask() 
can potentially run for quite a while, so we don't want to hold the 
rcu_read_lock for too long. There is also a potential that schedule() 
may be called.

Cheers,
Longman
  
Michal Koutný July 10, 2023, 3:51 p.m. UTC | #7
On Mon, Jul 10, 2023 at 11:40:36AM -0400, Waiman Long <longman@redhat.com> wrote:
> I believe the primary reason is because update_parent_subparts_cpumask() can
> potential run for quite a while. So we don't want to hold the rcu_read_lock
> for too long.

But holding cpuset_mutex is even worse than rcu_read_lock()? IOW, is the
relief for this reason worth it?

> There may also be a potential that schedule() may be called.

Do you mean the spinlocks with PREEMPT_RT or anything else? (That seems
like the actual reason IIUC.)

Thanks,
Michal
  
Miaohe Lin July 11, 2023, 2:52 a.m. UTC | #8
On 2023/7/10 23:40, Waiman Long wrote:
> On 7/10/23 11:11, Michal Koutný wrote:
>> Hello.
>>
>> On Sat, Jul 01, 2023 at 02:50:49PM +0800, Miaohe Lin <linmiaohe@huawei.com> wrote:
>>> --- a/kernel/cgroup/cpuset.c
>>> +++ b/kernel/cgroup/cpuset.c
>>> @@ -1806,9 +1806,12 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
>>>           cpuset_for_each_child(cp, css, parent)
>>>               if (is_partition_valid(cp) &&
>>>                   cpumask_intersects(trialcs->cpus_allowed, cp->cpus_allowed)) {
>>> +                if (!css_tryget_online(&cp->css))
>>> +                    continue;
>>>                   rcu_read_unlock();
>>>                   update_parent_subparts_cpumask(cp, partcmd_invalidate, NULL, &tmp);
>>>                   rcu_read_lock();
>>> +                css_put(&cp->css);
>> Apologies for a possibly noob question -- why is RCU read lock
>> temporarily dropped within the loop?
>> (Is it only because of callback_lock or cgroup_file_kn_lock (via
>> notify_partition_change()) on PREEMPT_RT?)
>>
>>
>>
>> [
>> OT question:
>>     cpuset_for_each_child(cp, css, parent)                (1)
>>         if (is_partition_valid(cp) &&
>>             cpumask_intersects(trialcs->cpus_allowed, cp->cpus_allowed)) {
>>             if (!css_tryget_online(&cp->css))
>>                 continue;
>>             rcu_read_unlock();
>>             update_parent_subparts_cpumask(cp, partcmd_invalidate, NULL, &tmp);
>>               ...
>>               update_tasks_cpumask(cp->parent)
>>                 ...
>>                 css_task_iter_start(&cp->parent->css, 0, &it);    (2)
>>                   ...
>>             rcu_read_lock();
>>             css_put(&cp->css);
>>         }
>>
>> May this touch each task same number of times as its depth within
>> herarchy?
> 
> I believe the primary reason is because update_parent_subparts_cpumask() can potential run for quite a while. So we don't want to hold the rcu_read_lock for too long. There may also be a potential that schedule() may be called.

IMHO, the reason should be the same as in the below commit:

commit 2bdfd2825c9662463371e6691b1a794e97fa36b4
Author: Waiman Long <longman@redhat.com>
Date:   Wed Feb 2 22:31:03 2022 -0500

    cgroup/cpuset: Fix "suspicious RCU usage" lockdep warning

    It was found that a "suspicious RCU usage" lockdep warning was issued
    with the rcu_read_lock() call in update_sibling_cpumasks().  It is
    because the update_cpumasks_hier() function may sleep. So we have
    to release the RCU lock, call update_cpumasks_hier() and reacquire
    it afterward.

    Also add a percpu_rwsem_assert_held() in update_sibling_cpumasks()
    instead of stating that in the comment.

Thanks both.
  
Michal Koutný July 11, 2023, 11:52 a.m. UTC | #9
On Tue, Jul 11, 2023 at 10:52:02AM +0800, Miaohe Lin <linmiaohe@huawei.com> wrote:
> commit 2bdfd2825c9662463371e6691b1a794e97fa36b4
> Author: Waiman Long <longman@redhat.com>
> Date:   Wed Feb 2 22:31:03 2022 -0500
> 
>     cgroup/cpuset: Fix "suspicious RCU usage" lockdep warning

Aha, thanks for the pointer.

I've also found a paragraph in [1]:
> In addition, the -rt patchset turns spinlocks into a sleeping locks so
> that the corresponding critical sections can be preempted, which also
> means that these sleeplockified spinlocks (but not other sleeping
> locks!) may be acquire within -rt-Linux-kernel RCU read-side critical
> sections.

That suggests (together with practical use) that the discussed spinlocks
should be fine in an RCU read section. And the possible reason is deeper in
generate_sched_domains(), which does kmalloc(..., GFP_KERNEL).

Alas, update_cpumasks_hier() still calls generate_sched_domains(); OTOH,
update_parent_subparts_cpumask() doesn't seem to.

The idea to not relieve rcu_read_lock() in update_cpumask() iteration
(instead of the technically unneeded refcnt bump) would have to be
verified with CONFIG_PROVE_RCU && CONFIG_LOCKDEP. WDYT?

Michal

[1] https://www.kernel.org/doc/html/latest/RCU/Design/Requirements/Requirements.html?highlight=rcu+read+section#specialization
  
Miaohe Lin July 12, 2023, 1:56 a.m. UTC | #10
On 2023/7/11 19:52, Michal Koutný wrote:
> On Tue, Jul 11, 2023 at 10:52:02AM +0800, Miaohe Lin <linmiaohe@huawei.com> wrote:
>> commit 2bdfd2825c9662463371e6691b1a794e97fa36b4
>> Author: Waiman Long <longman@redhat.com>
>> Date:   Wed Feb 2 22:31:03 2022 -0500
>>
>>     cgroup/cpuset: Fix "suspicious RCU usage" lockdep warning
> 
> Aha, thanks for the pointer.
> 
> I've also found a paragraph in [1]:
>> In addition, the -rt patchset turns spinlocks into a sleeping locks so
>> that the corresponding critical sections can be preempted, which also
>> means that these sleeplockified spinlocks (but not other sleeping
>> locks!) may be acquire within -rt-Linux-kernel RCU read-side critical
>> sections.
> 
> That suggests (together with practical use) that dicussed spinlocks
> should be fine in RCU read section. And the possible reason is deeper in
> generate_sched_domains() that do kmalloc(..., GFP_KERNEL).

update_parent_subparts_cpumask() would call update_flag(), which does kmemdup(..., GFP_KERNEL)?

> 
> Alas update_cpumask_hier() still calls generate_sched_domains(), OTOH,
> update_parent_subparts_cpumask() doesn't seem so.

It seems update_parent_subparts_cpumask() doesn't call generate_sched_domains().

> 
> The idea to not relieve rcu_read_lock() in update_cpumask() iteration
> (instead of the technically unneeded refcnt bump) would have to be
> verified with CONFIG_PROVE_RCU && CONFIG_LOCKDEP. WDYT?

The idea to release rcu_read_lock() in the update_cpumask() iteration was initially
introduced via the below commit:

commit d7c8142d5a5534c3c7de214e35a40a493a32b98e
Author: Waiman Long <longman@redhat.com>
Date:   Thu Sep 1 16:57:43 2022 -0400

    cgroup/cpuset: Make partition invalid if cpumask change violates exclusivity rule

    Currently, changes in "cpust.cpus" of a partition root is not allowed if
    it violates the sibling cpu exclusivity rule when the check is done
    in the validate_change() function. That is inconsistent with the
    other cpuset changes that are always allowed but may make a partition
    invalid.

    Update the cpuset code to allow cpumask change even if it violates the
    sibling cpu exclusivity rule, but invalidate the partition instead
    just like the other changes. However, other sibling partitions with
    conflicting cpumask will also be invalidated in order to not violating
    the exclusivity rule. This behavior is specific to this partition
    rule violation.

    Note that a previous commit has made sibling cpu exclusivity rule check
    the last check of validate_change(). So if -EINVAL is returned, we can
    be sure that sibling cpu exclusivity rule violation is the only rule
    that is broken.

It would be really helpful if @Waiman could figure this out.

Thanks both.
  

Patch

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 58e6f18f01c1..632a9986d5de 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1806,9 +1806,12 @@  static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 		cpuset_for_each_child(cp, css, parent)
 			if (is_partition_valid(cp) &&
 			    cpumask_intersects(trialcs->cpus_allowed, cp->cpus_allowed)) {
+				if (!css_tryget_online(&cp->css))
+					continue;
 				rcu_read_unlock();
 				update_parent_subparts_cpumask(cp, partcmd_invalidate, NULL, &tmp);
 				rcu_read_lock();
+				css_put(&cp->css);
 			}
 		rcu_read_unlock();
 		retval = 0;