[v3,1/2] x86/resctrl: IPI all CPUs for group updates

Message ID 20221115141953.816851-2-peternewman@google.com
State New
Series x86/resctrl: fix task CLOSID update race

Commit Message

Peter Newman Nov. 15, 2022, 2:19 p.m. UTC
  To rule out needing to update a CPU when deleting an rdtgroup, we must
search the entire tasklist for group members which could be running on
that CPU. This needs to be done while blocking updates to the tasklist
to avoid leaving newly-created child tasks assigned to the old
CLOSID/RMID.

The cost of reliably propagating a CLOSID or RMID update to a single
task is higher than originally thought. The present understanding is
that we must obtain the task_rq_lock() on each task to ensure that it
observes CLOSID/RMID updates in the case that it migrates away from its
current CPU before the update IPI reaches it.

For now, just notify all the CPUs after updating the closid/rmid fields
in impacted tasks task_structs rather than paying the cost of obtaining
a more precise cpu mask.

Signed-off-by: Peter Newman <peternewman@google.com>
Reviewed-by: James Morse <james.morse@arm.com>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 52 +++++++-------------------
 1 file changed, 13 insertions(+), 39 deletions(-)
  

Comments

Reinette Chatre Nov. 21, 2022, 9:53 p.m. UTC | #1
Hi Peter,

On 11/15/2022 6:19 AM, Peter Newman wrote:
> To rule out needing to update a CPU when deleting an rdtgroup, we must

Please do not impersonate code in changelog and comments (do not
use "we"). This is required for resctrl changes to be considered for
inclusion because resctrl patches are routed via the "tip" repo
and are thus required to follow the "tip tree handbook"
(Documentation/process/maintainer-tip.rst). Please also
stick to a clear "context-problem-solution" changelog as is the custom
in this area.

> search the entire tasklist for group members which could be running on
> that CPU. This needs to be done while blocking updates to the tasklist
> to avoid leaving newly-created child tasks assigned to the old
> CLOSID/RMID.

This is not clear to me. rdt_move_group_tasks() obtains a read lock,
read_lock(&tasklist_lock), so concurrent modifications to the tasklist
are indeed possible. Should this perhaps be write_lock() instead?
It sounds like the scenario you are describing may be a concern. That is,
if a task belonging to a group that is being removed happens to
call fork()/clone() during the move then the child may end up being
created with old closid.

> 
> The cost of reliably propagating a CLOSID or RMID update to a single
> task is higher than originally thought. The present understanding is
> that we must obtain the task_rq_lock() on each task to ensure that it
> observes CLOSID/RMID updates in the case that it migrates away from its
> current CPU before the update IPI reaches it.

I find this confusing since it describes why a potential solution does
not solve a problem, neither problem nor solution is well described at this
point. 
 
What if you switch the order of the two patches? Patch #2 provides
the potential solution mentioned here so that may be helpful to have as
reference in this changelog.

> For now, just notify all the CPUs after updating the closid/rmid fields

For now? If you anticipate changes then there should be a plan for that, 
otherwise this is the fix without further speculation.

> in impacted tasks task_structs rather than paying the cost of obtaining
> a more precise cpu mask.

s/cpu/CPU/
It may be helpful to add that an accurate CPU mask cannot be guaranteed and
the more tasks moved the less accurate it could be (if I understand correctly).

> 
> Signed-off-by: Peter Newman <peternewman@google.com>
> Reviewed-by: James Morse <james.morse@arm.com>
> ---
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 52 +++++++-------------------
>  1 file changed, 13 insertions(+), 39 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index e5a48f05e787..049971efea2f 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2385,12 +2385,10 @@ static int reset_all_ctrls(struct rdt_resource *r)
>   * Move tasks from one to the other group. If @from is NULL, then all tasks
>   * in the systems are moved unconditionally (used for teardown).
>   *
> - * If @mask is not NULL the cpus on which moved tasks are running are set
> - * in that mask so the update smp function call is restricted to affected
> - * cpus.
> + * Following this operation, the caller is required to update the MSRs on all
> + * CPUs.
>   */

On x86 only one MSR needs updating, the PQR_ASSOC MSR. The above could be
summarized as:
"Caller should update per CPU storage and PQR_ASSOC."

> -static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
> -				 struct cpumask *mask)
> +static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to)
>  {
>  	struct task_struct *p, *t;
>  
> @@ -2400,16 +2398,6 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
>  		    is_rmid_match(t, from)) {
>  			WRITE_ONCE(t->closid, to->closid);
>  			WRITE_ONCE(t->rmid, to->mon.rmid);
> -
> -			/*
> -			 * If the task is on a CPU, set the CPU in the mask.
> -			 * The detection is inaccurate as tasks might move or
> -			 * schedule before the smp function call takes place.
> -			 * In such a case the function call is pointless, but
> -			 * there is no other side effect.
> -			 */
> -			if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t))
> -				cpumask_set_cpu(task_cpu(t), mask);
>  		}
>  	}
>  	read_unlock(&tasklist_lock);
> @@ -2440,7 +2428,7 @@ static void rmdir_all_sub(void)
>  	struct rdtgroup *rdtgrp, *tmp;
>  
>  	/* Move all tasks to the default resource group */
> -	rdt_move_group_tasks(NULL, &rdtgroup_default, NULL);
> +	rdt_move_group_tasks(NULL, &rdtgroup_default);
>  
>  	list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) {
>  		/* Free any child rmids */
> @@ -3099,23 +3087,19 @@ static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
>  	return -EPERM;
>  }
>  
> -static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
> +static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp)
>  {
>  	struct rdtgroup *prdtgrp = rdtgrp->mon.parent;
>  	int cpu;
>  
>  	/* Give any tasks back to the parent group */
> -	rdt_move_group_tasks(rdtgrp, prdtgrp, tmpmask);
> +	rdt_move_group_tasks(rdtgrp, prdtgrp);
>  
>  	/* Update per cpu rmid of the moved CPUs first */
>  	for_each_cpu(cpu, &rdtgrp->cpu_mask)
>  		per_cpu(pqr_state.default_rmid, cpu) = prdtgrp->mon.rmid;
> -	/*
> -	 * Update the MSR on moved CPUs and CPUs which have moved
> -	 * task running on them.
> -	 */
> -	cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
> -	update_closid_rmid(tmpmask, NULL);
> +
> +	update_closid_rmid(cpu_online_mask, NULL);
>  
>  	rdtgrp->flags = RDT_DELETED;
>  	free_rmid(rdtgrp->mon.rmid);
> @@ -3140,12 +3124,12 @@ static int rdtgroup_ctrl_remove(struct rdtgroup *rdtgrp)
>  	return 0;
>  }
>  
> -static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
> +static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp)
>  {
>  	int cpu;
>  
>  	/* Give any tasks back to the default group */
> -	rdt_move_group_tasks(rdtgrp, &rdtgroup_default, tmpmask);
> +	rdt_move_group_tasks(rdtgrp, &rdtgroup_default);
>  
>  	/* Give any CPUs back to the default group */
>  	cpumask_or(&rdtgroup_default.cpu_mask,
> @@ -3157,12 +3141,7 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
>  		per_cpu(pqr_state.default_rmid, cpu) = rdtgroup_default.mon.rmid;
>  	}
>  
> -	/*
> -	 * Update the MSR on moved CPUs and CPUs which have moved
> -	 * task running on them.
> -	 */
> -	cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
> -	update_closid_rmid(tmpmask, NULL);
> +	update_closid_rmid(cpu_online_mask, NULL);
>  
>  	closid_free(rdtgrp->closid);
>  	free_rmid(rdtgrp->mon.rmid);
> @@ -3181,12 +3160,8 @@ static int rdtgroup_rmdir(struct kernfs_node *kn)
>  {
>  	struct kernfs_node *parent_kn = kn->parent;
>  	struct rdtgroup *rdtgrp;
> -	cpumask_var_t tmpmask;
>  	int ret = 0;
>  
> -	if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
> -		return -ENOMEM;
> -
>  	rdtgrp = rdtgroup_kn_lock_live(kn);
>  	if (!rdtgrp) {
>  		ret = -EPERM;
> @@ -3206,18 +3181,17 @@ static int rdtgroup_rmdir(struct kernfs_node *kn)
>  		    rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) {
>  			ret = rdtgroup_ctrl_remove(rdtgrp);
>  		} else {
> -			ret = rdtgroup_rmdir_ctrl(rdtgrp, tmpmask);
> +			ret = rdtgroup_rmdir_ctrl(rdtgrp);
>  		}
>  	} else if (rdtgrp->type == RDTMON_GROUP &&
>  		 is_mon_groups(parent_kn, kn->name)) {
> -		ret = rdtgroup_rmdir_mon(rdtgrp, tmpmask);
> +		ret = rdtgroup_rmdir_mon(rdtgrp);
>  	} else {
>  		ret = -EPERM;
>  	}
>  
>  out:
>  	rdtgroup_kn_unlock(kn);
> -	free_cpumask_var(tmpmask);
>  	return ret;
>  }
>  

The fix looks good to me. I do think its motivation and description
need to improve to make it palatable to folks not familiar with this area.

Reinette
  
Peter Newman Nov. 23, 2022, 11:09 a.m. UTC | #2
Hi Reinette,

On Mon, Nov 21, 2022 at 10:53 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
> On 11/15/2022 6:19 AM, Peter Newman wrote:
> > To rule out needing to update a CPU when deleting an rdtgroup, we must
>
> Please do not impersonate code in changelog and comments (do not
> use "we"). This is required for resctrl changes to be considered for
> inclusion because resctrl patches are routed via the "tip" repo
> and are thus required to follow the "tip tree handbook"
> (Documentation/process/maintainer-tip.rst). Please also
> stick to a clear "context-problem-solution" changelog as is the custom
> in this area.

Thanks, I forgot about the "tip tree handbook" and was trying to
remember where the pointers about wording from the reply to the other
patch came from. Unfortunately I didn't read your replies in FIFO
order.


>
> > search the entire tasklist for group members which could be running on
> > that CPU. This needs to be done while blocking updates to the tasklist
> > to avoid leaving newly-created child tasks assigned to the old
> > CLOSID/RMID.
>
> This is not clear to me. rdt_move_group_tasks() obtains a read lock,
> read_lock(&tasklist_lock), so concurrent modifications to the tasklist
> are indeed possible. Should this perhaps be write_lock() instead?
> It sounds like the scenario you are describing may be a concern. That is,
> if a task belonging to a group that is being removed happens to
> call fork()/clone() during the move then the child may end up being
> created with old closid.

Shouldn't read_lock(&tasklist_lock) cause write_lock(&tasklist_lock) to
block?

Maybe I paraphrased too much in the explanation.
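
To spell out what I meant (a rough sketch paraphrased from memory, not
copied verbatim from the tree): the walk in rdt_move_group_tasks() holds
the reader side of tasklist_lock, while copy_process() links the new
child into the tasklist under the writer side of the same lock, so the
two cannot run concurrently:

	/* rdt_move_group_tasks(), as in this patch: */
	read_lock(&tasklist_lock);
	for_each_process_thread(p, t) {
		if (!from || is_closid_match(t, from) ||
		    is_rmid_match(t, from)) {
			WRITE_ONCE(t->closid, to->closid);
			WRITE_ONCE(t->rmid, to->mon.rmid);
		}
	}
	read_unlock(&tasklist_lock);

	/* kernel/fork.c, copy_process(), paraphrased: */
	write_lock_irq(&tasklist_lock);	/* blocks while readers hold the lock */
	/* ... new child is linked into the tasklist here ... */
	write_unlock_irq(&tasklist_lock);

So a child is either already on the tasklist when the walk runs, or it
is only added once the walk has released the lock.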


> > The cost of reliably propagating a CLOSID or RMID update to a single
> > task is higher than originally thought. The present understanding is
> > that we must obtain the task_rq_lock() on each task to ensure that it
> > observes CLOSID/RMID updates in the case that it migrates away from its
> > current CPU before the update IPI reaches it.
>
> I find this confusing since it describes why a potential solution does
> not solve a problem, neither problem nor solution is well described at this
> point.
>
> What if you switch the order of the two patches? Patch #2 provides
> the potential solution mentioned here so that may be helpful to have as
> reference in this changelog.

Yes, I will try that. The single-task and multi-task cases should be
independent enough that I can handle them in either order.
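
For reference, the shape I have in mind for the single-task case is
roughly the following (a sketch of the ordering argument only, using
illustrative names; it is not the actual code in patch 2):

	struct rq_flags rf;
	struct rq *rq;

	/*
	 * Hold the task's rq lock across the update so it cannot be
	 * switched in on another CPU while the fields are written; it
	 * then observes the new values when it next schedules in.
	 */
	rq = task_rq_lock(t, &rf);
	WRITE_ONCE(t->closid, to->closid);
	WRITE_ONCE(t->rmid, to->mon.rmid);
	task_rq_unlock(rq, t, &rf);

	/* A currently-running task still needs an IPI to reload PQR_ASSOC. */
	if (task_curr(t))
		update_closid_rmid(cpumask_of(task_cpu(t)), NULL);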


> > For now, just notify all the CPUs after updating the closid/rmid fields
>
> For now? If you anticipate changes then there should be a plan for that,
> otherwise this is the fix without further speculation.

It seems like I have to either do something to acknowledge the tradeoff,
or make a case that the negatively affected usage isn't important.
Should I assert that deleting groups while a realtime, core-isolated
application is running isn't a legitimate use case?


>
> > in impacted tasks task_structs rather than paying the cost of obtaining
> > a more precise cpu mask.
>
> s/cpu/CPU/
> It may be helpful to add that an accurate CPU mask cannot be guaranteed and
> the more tasks moved the less accurate it could be (if I understand correctly).

Ok.
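
To make the inaccuracy concrete (this is just the hunk removed above,
annotated): each task's CPU is sampled while the tasklist is walked, but
the IPIs only go out after the walk finishes, so a task that migrates in
between has its task_struct updated while its new CPU never makes it
into the mask:

	/* during the walk, under read_lock(&tasklist_lock): */
	if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t))
		cpumask_set_cpu(task_cpu(t), mask);	/* CPU sampled here */

	/* later, after read_unlock(), in rdtgroup_rmdir_mon()/_ctrl(): */
	cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
	update_closid_rmid(tmpmask, NULL);		/* IPIs sent here */

and the more tasks are being moved, the longer that window is.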


> > diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > index e5a48f05e787..049971efea2f 100644
> > --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > @@ -2385,12 +2385,10 @@ static int reset_all_ctrls(struct rdt_resource *r)
> >   * Move tasks from one to the other group. If @from is NULL, then all tasks
> >   * in the systems are moved unconditionally (used for teardown).
> >   *
> > - * If @mask is not NULL the cpus on which moved tasks are running are set
> > - * in that mask so the update smp function call is restricted to affected
> > - * cpus.
> > + * Following this operation, the caller is required to update the MSRs on all
> > + * CPUs.
> >   */
>
> On x86 only one MSR needs updating, the PQR_ASSOC MSR. The above could be
> summarized as:
> "Caller should update per CPU storage and PQR_ASSOC."

Sounds good.


> [...]
>
> The fix looks good to me. I do think its motivation and description
> need to improve to make it palatable to folks not familiar with this area.
>
> Reinette

Once again, thanks for your review. I will work on clarifying the
comments. The explanation is usually most of the work with changes like
these, which is tough to do without some feedback from an actual reader.
I've been looking at this section of code too long to be able to judge
my own explanation anymore.

-Peter
  
Reinette Chatre Nov. 23, 2022, 4:45 p.m. UTC | #3
Hi Peter,

On 11/23/2022 3:09 AM, Peter Newman wrote:
> On Mon, Nov 21, 2022 at 10:53 PM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>> On 11/15/2022 6:19 AM, Peter Newman wrote:

...

>>> search the entire tasklist for group members which could be running on
>>> that CPU. This needs to be done while blocking updates to the tasklist
>>> to avoid leaving newly-created child tasks assigned to the old
>>> CLOSID/RMID.
>>
>> This is not clear to me. rdt_move_group_tasks() obtains a read lock,
>> read_lock(&tasklist_lock), so concurrent modifications to the tasklist
>> are indeed possible. Should this perhaps be write_lock() instead?
>> It sounds like the scenario you are describing may be a concern. That is,
>> if a task belonging to a group that is being removed happens to
>> call fork()/clone() during the move then the child may end up being
>> created with old closid.
> 
> Shouldn't read_lock(&tasklist_lock) cause write_lock(&tasklist_lock) to
> block?

Apologies, yes, you are right. I was focusing too much on the detail of
task_struct::tasks being an RCU list and lost sight of the actual lock
obtained (rcu_read_lock() vs read_lock()) in this instance. 

Reinette
  

Patch

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index e5a48f05e787..049971efea2f 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2385,12 +2385,10 @@  static int reset_all_ctrls(struct rdt_resource *r)
  * Move tasks from one to the other group. If @from is NULL, then all tasks
  * in the systems are moved unconditionally (used for teardown).
  *
- * If @mask is not NULL the cpus on which moved tasks are running are set
- * in that mask so the update smp function call is restricted to affected
- * cpus.
+ * Following this operation, the caller is required to update the MSRs on all
+ * CPUs.
  */
-static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
-				 struct cpumask *mask)
+static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to)
 {
 	struct task_struct *p, *t;
 
@@ -2400,16 +2398,6 @@  static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
 		    is_rmid_match(t, from)) {
 			WRITE_ONCE(t->closid, to->closid);
 			WRITE_ONCE(t->rmid, to->mon.rmid);
-
-			/*
-			 * If the task is on a CPU, set the CPU in the mask.
-			 * The detection is inaccurate as tasks might move or
-			 * schedule before the smp function call takes place.
-			 * In such a case the function call is pointless, but
-			 * there is no other side effect.
-			 */
-			if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t))
-				cpumask_set_cpu(task_cpu(t), mask);
 		}
 	}
 	read_unlock(&tasklist_lock);
@@ -2440,7 +2428,7 @@  static void rmdir_all_sub(void)
 	struct rdtgroup *rdtgrp, *tmp;
 
 	/* Move all tasks to the default resource group */
-	rdt_move_group_tasks(NULL, &rdtgroup_default, NULL);
+	rdt_move_group_tasks(NULL, &rdtgroup_default);
 
 	list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) {
 		/* Free any child rmids */
@@ -3099,23 +3087,19 @@  static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
 	return -EPERM;
 }
 
-static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
+static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp)
 {
 	struct rdtgroup *prdtgrp = rdtgrp->mon.parent;
 	int cpu;
 
 	/* Give any tasks back to the parent group */
-	rdt_move_group_tasks(rdtgrp, prdtgrp, tmpmask);
+	rdt_move_group_tasks(rdtgrp, prdtgrp);
 
 	/* Update per cpu rmid of the moved CPUs first */
 	for_each_cpu(cpu, &rdtgrp->cpu_mask)
 		per_cpu(pqr_state.default_rmid, cpu) = prdtgrp->mon.rmid;
-	/*
-	 * Update the MSR on moved CPUs and CPUs which have moved
-	 * task running on them.
-	 */
-	cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
-	update_closid_rmid(tmpmask, NULL);
+
+	update_closid_rmid(cpu_online_mask, NULL);
 
 	rdtgrp->flags = RDT_DELETED;
 	free_rmid(rdtgrp->mon.rmid);
@@ -3140,12 +3124,12 @@  static int rdtgroup_ctrl_remove(struct rdtgroup *rdtgrp)
 	return 0;
 }
 
-static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
+static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp)
 {
 	int cpu;
 
 	/* Give any tasks back to the default group */
-	rdt_move_group_tasks(rdtgrp, &rdtgroup_default, tmpmask);
+	rdt_move_group_tasks(rdtgrp, &rdtgroup_default);
 
 	/* Give any CPUs back to the default group */
 	cpumask_or(&rdtgroup_default.cpu_mask,
@@ -3157,12 +3141,7 @@  static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
 		per_cpu(pqr_state.default_rmid, cpu) = rdtgroup_default.mon.rmid;
 	}
 
-	/*
-	 * Update the MSR on moved CPUs and CPUs which have moved
-	 * task running on them.
-	 */
-	cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
-	update_closid_rmid(tmpmask, NULL);
+	update_closid_rmid(cpu_online_mask, NULL);
 
 	closid_free(rdtgrp->closid);
 	free_rmid(rdtgrp->mon.rmid);
@@ -3181,12 +3160,8 @@  static int rdtgroup_rmdir(struct kernfs_node *kn)
 {
 	struct kernfs_node *parent_kn = kn->parent;
 	struct rdtgroup *rdtgrp;
-	cpumask_var_t tmpmask;
 	int ret = 0;
 
-	if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
-		return -ENOMEM;
-
 	rdtgrp = rdtgroup_kn_lock_live(kn);
 	if (!rdtgrp) {
 		ret = -EPERM;
@@ -3206,18 +3181,17 @@  static int rdtgroup_rmdir(struct kernfs_node *kn)
 		    rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) {
 			ret = rdtgroup_ctrl_remove(rdtgrp);
 		} else {
-			ret = rdtgroup_rmdir_ctrl(rdtgrp, tmpmask);
+			ret = rdtgroup_rmdir_ctrl(rdtgrp);
 		}
 	} else if (rdtgrp->type == RDTMON_GROUP &&
 		 is_mon_groups(parent_kn, kn->name)) {
-		ret = rdtgroup_rmdir_mon(rdtgrp, tmpmask);
+		ret = rdtgroup_rmdir_mon(rdtgrp);
 	} else {
 		ret = -EPERM;
 	}
 
 out:
 	rdtgroup_kn_unlock(kn);
-	free_cpumask_var(tmpmask);
 	return ret;
 }