[v4,1/2] x86/resctrl: Update task closid/rmid with task_call_func()

Message ID 20221129111055.953833-2-peternewman@google.com
State New
Series x86/resctrl: Fix task CLOSID update race

Commit Message

Peter Newman Nov. 29, 2022, 11:10 a.m. UTC
  When the user moves a running task to a new rdtgroup using the tasks
file interface, the resulting change in CLOSID/RMID must be immediately
propagated to the PQR_ASSOC MSR on the task's CPU.

It is possible for a task to wake up or migrate while it is being moved
to a new group. If __rdtgroup_move_task() fails to observe that a task
has begun running or misses that it migrated to a new CPU, the task will
continue to use the old CLOSID or RMID until it switches in again.

__rdtgroup_move_task() assumes that if the task migrates off of its CPU
before it can IPI the task, then the task has already observed the
updated CLOSID/RMID. Because this is done locklessly and an x86 CPU can
delay stores until after loads, the following incorrect scenarios are
possible:

 1. __rdtgroup_move_task() stores the new closid and rmid in
    the task structure after it loads task_curr() and task_cpu().
 2. resctrl_sched_in() loads t->{closid,rmid} before the calling context
    switch stores new task_curr() and task_cpu() values.

Use task_call_func() in __rdtgroup_move_task() to serialize updates to
the closid and rmid fields in the task_struct with context switch.

Signed-off-by: Peter Newman <peternewman@google.com>
Reviewed-by: James Morse <james.morse@arm.com>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 78 ++++++++++++++++----------
 1 file changed, 47 insertions(+), 31 deletions(-)
  

Comments

Reinette Chatre Dec. 6, 2022, 6:56 p.m. UTC | #1
Hi Peter,

On 11/29/2022 3:10 AM, Peter Newman wrote:
> When the user moves a running task to a new rdtgroup using the tasks
> file interface, the resulting change in CLOSID/RMID must be immediately
> propagated to the PQR_ASSOC MSR on the task's CPU.
> 
> It is possible for a task to wake up or migrate while it is being moved
> to a new group. If __rdtgroup_move_task() fails to observe that a task
> has begun running or misses that it migrated to a new CPU, the task will
> continue to use the old CLOSID or RMID until it switches in again.
> 
> __rdtgroup_move_task() assumes that if the task migrates off of its CPU
> before it can IPI the task, then the task has already observed the
> updated CLOSID/RMID. Because this is done locklessly and an x86 CPU can
> delay stores until after loads, the following incorrect scenarios are
> possible:
> 
>  1. __rdtgroup_move_task() stores the new closid and rmid in
>     the task structure after it loads task_curr() and task_cpu().

Stating how this scenario encounters the problem would help
so perhaps something like (please feel free to change):
"If the task starts running between a reordered task_curr() check and
the CLOSID/RMID update then it will start running with the old CLOSID/RMID
until it is switched again because __rdtgroup_move_task() failed to determine 
that it needs to be interrupted to obtain the new CLOSID/RMID."

>  2. resctrl_sched_in() loads t->{closid,rmid} before the calling context
>     switch stores new task_curr() and task_cpu() values.

This scenario is not clear to me. Could you please provide more detail about it?
I was trying to follow the context_switch() flow and resctrl_sched_in() is
one of the last things done (context_switch()->switch_to()->resctrl_sched_in()).
From what I can tell rq->curr, as used by task_curr() is set before
even context_switch() is called ... and since the next task is picked from
the CPU's runqueue (and set_task_cpu() sets the task's cpu when moved to
a runqueue) it seems to me that the value used by task_cpu() would also
be set early (before context_switch() is called). It is thus not clear to
me how the above reordering could occur so an example would help a lot.

> 
> Use task_call_func() in __rdtgroup_move_task() to serialize updates to
> the closid and rmid fields in the task_struct with context switch.

Is there a reason why there is a switch from the all caps CLOSID/RMID
at the beginning to the lowercase here?

> Signed-off-by: Peter Newman <peternewman@google.com>
> Reviewed-by: James Morse <james.morse@arm.com>
> ---
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 78 ++++++++++++++++----------
>  1 file changed, 47 insertions(+), 31 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index e5a48f05e787..59b7ffcd53bb 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -528,6 +528,31 @@ static void rdtgroup_remove(struct rdtgroup *rdtgrp)
>  	kfree(rdtgrp);
>  }
>  
> +static int update_locked_task_closid_rmid(struct task_struct *t, void *arg)
> +{
> +	struct rdtgroup *rdtgrp = arg;
> +
> +	/*
> +	 * Although task_call_func() serializes the writes below with the paired
> +	 * reads in resctrl_sched_in(), resctrl_sched_in() still needs
> +	 * READ_ONCE() due to rdt_move_group_tasks(), so use WRITE_ONCE() here
> +	 * to conform.
> +	 */
> +	if (rdtgrp->type == RDTCTRL_GROUP) {
> +		WRITE_ONCE(t->closid, rdtgrp->closid);
> +		WRITE_ONCE(t->rmid, rdtgrp->mon.rmid);
> +	} else if (rdtgrp->type == RDTMON_GROUP) {
> +		WRITE_ONCE(t->rmid, rdtgrp->mon.rmid);
> +	}
> +
> +	/*
> +	 * If the task is current on a CPU, the PQR_ASSOC MSR needs to be
> +	 * updated to make the resource group go into effect. If the task is not
> +	 * current, the MSR will be updated when the task is scheduled in.
> +	 */
> +	return task_curr(t);
> +}
> +
>  static void _update_task_closid_rmid(void *task)
>  {
>  	/*
> @@ -538,10 +563,24 @@ static void _update_task_closid_rmid(void *task)
>  		resctrl_sched_in();
>  }
>  
> -static void update_task_closid_rmid(struct task_struct *t)
> +static void update_task_closid_rmid(struct task_struct *t,
> +				    struct rdtgroup *rdtgrp)
>  {
> -	if (IS_ENABLED(CONFIG_SMP) && task_curr(t))
> -		smp_call_function_single(task_cpu(t), _update_task_closid_rmid, t, 1);
> +	/*
> +	 * Serialize the closid and rmid update with context switch. If
> +	 * task_call_func() indicates that the task was running during
> +	 * update_locked_task_closid_rmid(), then interrupt it.
> +	 */
> +	if (task_call_func(t, update_locked_task_closid_rmid, rdtgrp) &&
> +	    IS_ENABLED(CONFIG_SMP))
> +		/*
> +		 * If the task has migrated away from the CPU indicated by
> +		 * task_cpu() below, then it has already switched in on the
> +		 * new CPU using the updated closid and rmid and the call below
> +		 * is unnecessary, but harmless.
> +		 */
> +		smp_call_function_single(task_cpu(t),
> +					 _update_task_closid_rmid, t, 1);
>  	else
>  		_update_task_closid_rmid(t);
>  }
> @@ -557,39 +596,16 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
>  		return 0;
>  
>  	/*
> -	 * Set the task's closid/rmid before the PQR_ASSOC MSR can be
> -	 * updated by them.
> -	 *
> -	 * For ctrl_mon groups, move both closid and rmid.
>  	 * For monitor groups, can move the tasks only from
>  	 * their parent CTRL group.
>  	 */
> -
> -	if (rdtgrp->type == RDTCTRL_GROUP) {
> -		WRITE_ONCE(tsk->closid, rdtgrp->closid);
> -		WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
> -	} else if (rdtgrp->type == RDTMON_GROUP) {
> -		if (rdtgrp->mon.parent->closid == tsk->closid) {
> -			WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
> -		} else {
> -			rdt_last_cmd_puts("Can't move task to different control group\n");
> -			return -EINVAL;
> -		}
> +	if (rdtgrp->type == RDTMON_GROUP &&
> +	    rdtgrp->mon.parent->closid != tsk->closid) {
> +		rdt_last_cmd_puts("Can't move task to different control group\n");
> +		return -EINVAL;
>  	}
>  
> -	/*
> -	 * Ensure the task's closid and rmid are written before determining if
> -	 * the task is current that will decide if it will be interrupted.
> -	 */
> -	barrier();
> -
> -	/*
> -	 * By now, the task's closid and rmid are set. If the task is current
> -	 * on a CPU, the PQR_ASSOC MSR needs to be updated to make the resource
> -	 * group go into effect. If the task is not current, the MSR will be
> -	 * updated when the task is scheduled in.
> -	 */
> -	update_task_closid_rmid(tsk);
> +	update_task_closid_rmid(tsk, rdtgrp);
>  
>  	return 0;
>  }

The change looks good to me.

Reinette
  
Peter Newman Dec. 7, 2022, 10:58 a.m. UTC | #2
Hi Reinette,

On Tue, Dec 6, 2022 at 7:57 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
> On 11/29/2022 3:10 AM, Peter Newman wrote:
> > When the user moves a running task to a new rdtgroup using the tasks
> > file interface, the resulting change in CLOSID/RMID must be immediately
> > propagated to the PQR_ASSOC MSR on the task's CPU.
> >
> > It is possible for a task to wake up or migrate while it is being moved
> > to a new group. If __rdtgroup_move_task() fails to observe that a task
> > has begun running or misses that it migrated to a new CPU, the task will
> > continue to use the old CLOSID or RMID until it switches in again.
> >
> > __rdtgroup_move_task() assumes that if the task migrates off of its CPU
> > before it can IPI the task, then the task has already observed the
> > updated CLOSID/RMID. Because this is done locklessly and an x86 CPU can
> > delay stores until after loads, the following incorrect scenarios are
> > possible:
> >
> >  1. __rdtgroup_move_task() stores the new closid and rmid in
> >     the task structure after it loads task_curr() and task_cpu().
>
> Stating how this scenario encounters the problem would help
> so perhaps something like (please feel free to change):
> "If the task starts running between a reordered task_curr() check and
> the CLOSID/RMID update then it will start running with the old CLOSID/RMID
> until it is switched again because __rdtgroup_move_task() failed to determine
> that it needs to be interrupted to obtain the new CLOSID/RMID."

That is largely what I was trying to state in paragraph 2 above, though
at a higher level. I hoped the paragraph following it would do enough to
connect the high-level description with the low-level problem scenarios.

>
> >  2. resctrl_sched_in() loads t->{closid,rmid} before the calling context
> >     switch stores new task_curr() and task_cpu() values.
>
> This scenario is not clear to me. Could you please provide more detail about it?
> I was trying to follow the context_switch() flow and resctrl_sched_in() is
> one of the last things done (context_switch()->switch_to()->resctrl_sched_in()).
> From what I can tell rq->curr, as used by task_curr() is set before
> even context_switch() is called ... and since the next task is picked from
> the CPU's runqueue (and set_task_cpu() sets the task's cpu when moved to
> a runqueue) it seems to me that the value used by task_cpu() would also
> be set early (before context_switch() is called). It is thus not clear to
> me how the above reordering could occur so an example would help a lot.

Perhaps in both scenarios I didn't make it clear that reordering in the
CPU can cause the incorrect behavior rather than the program order. In
this explanation, items 1. and 2. are supposed to be completing the
sentence ending with a ':' at the end of paragraph 3, so I thought that
would keep focus on the CPU.

I had assumed that the ordering requirements were well-understood, since
they're stated in existing code comments a few times, and that making a
case for how the expected ordering could be violated would be enough,
but I'm happy to draw up a side-by-side example.

>
> >
> > Use task_call_func() in __rdtgroup_move_task() to serialize updates to
> > the closid and rmid fields in the task_struct with context switch.
>
> Is there a reason why there is a switch from the all caps CLOSID/RMID
> at the beginning to the lowercase here?

It's because I referred to the task_struct fields explicitly here.

-Peter
  
Reinette Chatre Dec. 7, 2022, 6:38 p.m. UTC | #3
Hi Peter,

On 12/7/2022 2:58 AM, Peter Newman wrote:
> Hi Reinette,
> 
> On Tue, Dec 6, 2022 at 7:57 PM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>> On 11/29/2022 3:10 AM, Peter Newman wrote:
>>> When the user moves a running task to a new rdtgroup using the tasks
>>> file interface, the resulting change in CLOSID/RMID must be immediately
>>> propagated to the PQR_ASSOC MSR on the task's CPU.
>>>
>>> It is possible for a task to wake up or migrate while it is being moved
>>> to a new group. If __rdtgroup_move_task() fails to observe that a task
>>> has begun running or misses that it migrated to a new CPU, the task will
>>> continue to use the old CLOSID or RMID until it switches in again.
>>>
>>> __rdtgroup_move_task() assumes that if the task migrates off of its CPU
>>> before it can IPI the task, then the task has already observed the
>>> updated CLOSID/RMID. Because this is done locklessly and an x86 CPU can
>>> delay stores until after loads, the following incorrect scenarios are
>>> possible:
>>>
>>>  1. __rdtgroup_move_task() stores the new closid and rmid in
>>>     the task structure after it loads task_curr() and task_cpu().
>>
>> Stating how this scenario encounters the problem would help
>> so perhaps something like (please feel free to change):
>> "If the task starts running between a reordered task_curr() check and
>> the CLOSID/RMID update then it will start running with the old CLOSID/RMID
>> until it is switched again because __rdtgroup_move_task() failed to determine
>> that it needs to be interrupted to obtain the new CLOSID/RMID."
> 
> That is largely what I was trying to state in paragraph 2 above, though
> at a higher level. I hoped the paragraph following it would do enough to
> connect the high-level description with the low-level problem scenarios.

There is no need to require the reader to connect various snippets to create
a problematic scenario themselves. The changelog should make the problem
obvious. I understand that it is what you wanted to say; that is why I moved
existing snippets to form a coherent problem scenario. It is ok if you do not
like the way I wrote it; it was only an example of how it can be done.

>>>  2. resctrl_sched_in() loads t->{closid,rmid} before the calling context
>>>     switch stores new task_curr() and task_cpu() values.
>>
>> This scenario is not clear to me. Could you please provide more detail about it?
>> I was trying to follow the context_switch() flow and resctrl_sched_in() is
>> one of the last things done (context_switch()->switch_to()->resctrl_sched_in()).
>> From what I can tell rq->curr, as used by task_curr() is set before
>> even context_switch() is called ... and since the next task is picked from
>> the CPU's runqueue (and set_task_cpu() sets the task's cpu when moved to
>> a runqueue) it seems to me that the value used by task_cpu() would also
>> be set early (before context_switch() is called). It is thus not clear to
>> me how the above reordering could occur so an example would help a lot.
> 
> Perhaps in both scenarios I didn't make it clear that reordering in the
> CPU can cause the incorrect behavior rather than the program order. In
> this explanation, items 1. and 2. are supposed to be completing the
> sentence ending with a ':' at the end of paragraph 3, so I thought that
> would keep focus on the CPU.

You did make it clear that the cause is reordering in the CPU. I am just
not able to see where the reordering is occurring in your scenario (2).

> I had assumed that the ordering requirements were well-understood, since
> they're stated in existing code comments a few times, and that making a
> case for how the expected ordering could be violated would be enough,
> but I'm happy to draw up a side-by-side example.

Please do. Could you start by highlighting which resctrl_sched_in()
you are referring to? I am trying to dissect (2) with the given information:
Through "the calling context switch" the scenario is written to create
understanding that it refers to:
context_switch()->switch_to()->resctrl_sched_in() - so the calling context
switch is the first in the above call path ... where does it (context_switch())
store the new task_curr() and task_cpu() values and how does that reorder with
resctrl_sched_in() further down in the call path?

>>> Use task_call_func() in __rdtgroup_move_task() to serialize updates to
>>> the closid and rmid fields in the task_struct with context switch.
>>
>> Is there a reason why there is a switch from the all caps CLOSID/RMID
>> at the beginning to the lowercase here?
> 
> It's because I referred to the task_struct fields explicitly here.

You can use task_struct::closid and task_struct::rmid to make this clear.

Reinette
  
Peter Newman Dec. 8, 2022, 10:30 p.m. UTC | #4
Hi Reinette,

On Wed, Dec 7, 2022 at 7:41 PM Reinette Chatre <reinette.chatre@intel.com> wrote:
> On 12/7/2022 2:58 AM, Peter Newman wrote:
> >>>  2. resctrl_sched_in() loads t->{closid,rmid} before the calling context
> >>>     switch stores new task_curr() and task_cpu() values.
> >>
> >> This scenario is not clear to me. Could you please provide more detail about it?
> >> I was trying to follow the context_switch() flow and resctrl_sched_in() is
> >> one of the last things done (context_switch()->switch_to()->resctrl_sched_in()).
> >> From what I can tell rq->curr, as used by task_curr() is set before
> >> even context_switch() is called ... and since the next task is picked from
> >> the CPU's runqueue (and set_task_cpu() sets the task's cpu when moved to
> >> a runqueue) it seems to me that the value used by task_cpu() would also
> >> be set early (before context_switch() is called). It is thus not clear to
> >> me how the above reordering could occur so an example would help a lot.
> >
> > Perhaps in both scenarios I didn't make it clear that reordering in the
> > CPU can cause the incorrect behavior rather than the program order. In
> > this explanation, items 1. and 2. are supposed to be completing the
> > sentence ending with a ':' at the end of paragraph 3, so I thought that
> > would keep focus on the CPU.
>
> You did make it clear that the cause is reordering in the CPU. I am just
> not able to see where the reordering is occurring in your scenario (2).

It will all come down to whether it can get from updating rq->curr to
reading task_struct::{closid,rmid} without encountering a full barrier.

I'll go into the details below.

> Please do. Could you start by highlighting which resctrl_sched_in()
> you are referring to? I am trying to dissect (2) with the given information:
> Through "the calling context switch" the scenario is written to create
> understanding that it refers to:
> context_switch()->switch_to()->resctrl_sched_in() - so the calling context
> switch is the first in the above call path ... where does it (context_switch())
> store the new task_curr() and task_cpu() values and how does that reorder with
> resctrl_sched_in() further down in the call path?

Yes, the rq->curr update is actually in __schedule(). I was probably
still thinking it was in prepare_task_switch() (called from
context_switch()) because of older kernels where __rdtgroup_move_task()
is still reading task_struct::on_cpu.

There is an interesting code comment under the rq->curr update site in
__schedule():

	/*
	 * RCU users of rcu_dereference(rq->curr) may not see
	 * changes to task_struct made by pick_next_task().
	 */
	RCU_INIT_POINTER(rq->curr, next);
	/*
	 * The membarrier system call requires each architecture
	 * to have a full memory barrier after updating
	 * rq->curr, before returning to user-space.
	 *
	 * Here are the schemes providing that barrier on the
	 * various architectures:
	 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
	 *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
	 * - finish_lock_switch() for weakly-ordered
	 *   architectures where spin_unlock is a full barrier,
	 * - switch_to() for arm64 (weakly-ordered, spin_unlock
	 *   is a RELEASE barrier),
	 */

Based on this, I believe (2) doesn't happen on x86 because switch_mm()
provides the required ordering.

On arm64, it won't happen as long as it calls its resctrl_sched_in()
after the dsb(ish) in __switch_to(), which seems to be the case:

https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/tree/arch/arm64/kernel/process.c?h=mpam/snapshot/v6.0-rc1#n561

Based on this, I'll just sketch out the first scenario below and drop
(2) from the changelog. This also implies that the group update cases
can use a single smp_mb() to provide all the necessary ordering, because
there's a full barrier on context switch for it to pair with, so I don't
need to broadcast IPI anymore.  I don't know whether task_call_func() is
faster than an smp_mb(). I'll take some measurements to see.

The presumed behavior is that __rdtgroup_move_task() not seeing t1 running
yet implies that t1 observes the updated values:

CPU 0                                   CPU 1
-----                                   -----
(t1->{closid,rmid} -> {1,1})            (rq->curr -> t0)

__rdtgroup_move_task():
  t1->{closid,rmid} <- {2,2}
  curr <- t1->cpu->rq->curr
                                        __schedule():
                                          rq->curr <- t1
                                        resctrl_sched_in():
                                          t1->{closid,rmid} -> {2,2}
  if (curr == t1) // false
    IPI(t1->cpu)

In (1), CPU 0 is allowed to store {closid,rmid} after reading whether t1
is current:

CPU 0                                   CPU 1
-----                                   -----
__rdtgroup_move_task():
  curr <- t1->cpu->rq->curr
                                        __schedule():
                                          rq->curr <- t1
                                        resctrl_sched_in():
                                          t1->{closid,rmid} -> {1,1}
  t1->{closid,rmid} <- {2,2}
  if (curr == t1) // false
   IPI(t1->cpu)

Please let me know if these diagrams clarify things.

-Peter
  
Reinette Chatre Dec. 9, 2022, 11:54 p.m. UTC | #5
Hi Peter,

On 12/8/2022 2:30 PM, Peter Newman wrote:
> On Wed, Dec 7, 2022 at 7:41 PM Reinette Chatre <reinette.chatre@intel.com> wrote:
>> On 12/7/2022 2:58 AM, Peter Newman wrote:
>>>>>  2. resctrl_sched_in() loads t->{closid,rmid} before the calling context
>>>>>     switch stores new task_curr() and task_cpu() values.

...

> 
> Based on this, I'll just sketch out the first scenario below and drop
> (2) from the changelog. This also implies that the group update cases

ok, thank you for doing that analysis.

> can use a single smp_mb() to provide all the necessary ordering, because
> there's a full barrier on context switch for it to pair with, so I don't
> need to broadcast IPI anymore.  I don't know whether task_call_func() is

This is not clear to me because rdt_move_group_tasks() seems to have the
same code as shown below as vulnerable to re-ordering. Only difference
is that it uses the "//false" checks to set a bit in the cpumask for a
later IPI instead of an immediate IPI.

> faster than an smp_mb(). I'll take some measurements to see.
> 
> The presumed behavior is that __rdtgroup_move_task() not seeing t1 running
> yet implies that t1 observes the updated values:
> 
> CPU 0                                   CPU 1
> -----                                   -----
> (t1->{closid,rmid} -> {1,1})            (rq->curr -> t0)
> 
> __rdtgroup_move_task():
>   t1->{closid,rmid} <- {2,2}
>   curr <- t1->cpu->rq->curr
>                                         __schedule():
>                                           rq->curr <- t1
>                                         resctrl_sched_in():
>                                           t1->{closid,rmid} -> {2,2}
>   if (curr == t1) // false
>     IPI(t1->cpu)

I understand that the test is false when it may be expected to be true, but
there does not seem to be a problem because of that. t1 was scheduled in with
the correct CLOSID/RMID and its CPU did not get an unnecessary IPI.

> In (1), CPU 0 is allowed to store {closid,rmid} after reading whether t1
> is current:
> 
> CPU 0                                   CPU 1
> -----                                   -----
> __rdtgroup_move_task():
>   curr <- t1->cpu->rq->curr
>                                         __schedule():
>                                           rq->curr <- t1
>                                         resctrl_sched_in():
>                                           t1->{closid,rmid} -> {1,1}
>   t1->{closid,rmid} <- {2,2}
>   if (curr == t1) // false
>    IPI(t1->cpu)

Yes, this I understand to be the problematic scenario.
 
> Please let me know if these diagrams clarify things.

They do, thank you very much.

Reinette
  
Peter Newman Dec. 12, 2022, 5:36 p.m. UTC | #6
Hi Reinette,

On Sat, Dec 10, 2022 at 12:54 AM Reinette Chatre <reinette.chatre@intel.com> wrote:
> On 12/8/2022 2:30 PM, Peter Newman wrote:
> > Based on this, I'll just sketch out the first scenario below and drop
> > (2) from the changelog. This also implies that the group update cases
>
> ok, thank you for doing that analysis.
>
> > can use a single smp_mb() to provide all the necessary ordering, because
> > there's a full barrier on context switch for it to pair with, so I don't
> > need to broadcast IPI anymore.  I don't know whether task_call_func() is
>
> This is not clear to me because rdt_move_group_tasks() seems to have the
> same code as shown below as vulnerable to re-ordering. Only difference
> is that it uses the "//false" checks to set a bit in the cpumask for a
> later IPI instead of an immediate IPI.

An smp_mb() between writing the new task_struct::{closid,rmid} and
calling task_curr() would prevent the reordering I described, but I
was worried about the cost of executing a full barrier for every
matching task.

I tried something like this:

for_each_process_thread(p, t) {
	if (!from || is_closid_match(t, from) ||
	    is_rmid_match(t, from)) {
		WRITE_ONCE(t->closid, to->closid);
		WRITE_ONCE(t->rmid, to->mon.rmid);
		/* group moves are serialized by rdt */
		t->resctrl_dirty = true;
	}
}
if (IS_ENABLED(CONFIG_SMP) && mask) {
	/* Order t->{closid,rmid} stores before loads in task_curr() */
	smp_mb();
	for_each_process_thread(p, t) {
		if (t->resctrl_dirty) {
			if (task_curr(t))
				cpumask_set_cpu(task_cpu(t), mask);
			t->resctrl_dirty = false;
		}
	}
}

I repeated the `perf bench sched messaging -g 40 -l100000` benchmark
from before[1] to compare this with the baseline, and found that it
only increased the time to delete the benchmark's group from 1.65ms to
1.66ms, so it's an alternative to what I last posted.

I could do something similar in the single-task move, but I don't think
it makes much of a performance difference in that case. It's only a win
for the group move because the synchronization cost doesn't grow with
the group size.

[1] https://lore.kernel.org/lkml/20221129111055.953833-3-peternewman@google.com/


>
> > faster than an smp_mb(). I'll take some measurements to see.
> >
> > The presumed behavior is that __rdtgroup_move_task() not seeing t1 running
> > yet implies that t1 observes the updated values:
> >
> > CPU 0                                   CPU 1
> > -----                                   -----
> > (t1->{closid,rmid} -> {1,1})            (rq->curr -> t0)
> >
> > __rdtgroup_move_task():
> >   t1->{closid,rmid} <- {2,2}
> >   curr <- t1->cpu->rq->curr
> >                                         __schedule():
> >                                           rq->curr <- t1
> >                                         resctrl_sched_in():
> >                                           t1->{closid,rmid} -> {2,2}
> >   if (curr == t1) // false
> >     IPI(t1->cpu)
>
> I understand that the test is false when it may be expected to be true, but
> there does not seem to be a problem because of that. t1 was scheduled in with
> the correct CLOSID/RMID and its CPU did not get an unnecessary IPI.

Yes, this one was reminding the reader of the correct behavior. I can
just leave it out.

-Peter
  
Reinette Chatre Dec. 13, 2022, 6:33 p.m. UTC | #7
Hi Peter,

On 12/12/2022 9:36 AM, Peter Newman wrote:
> On Sat, Dec 10, 2022 at 12:54 AM Reinette Chatre <reinette.chatre@intel.com> wrote:
>> On 12/8/2022 2:30 PM, Peter Newman wrote:
>>> Based on this, I'll just sketch out the first scenario below and drop
>>> (2) from the changelog. This also implies that the group update cases
>>
>> ok, thank you for doing that analysis.
>>
>>> can use a single smp_mb() to provide all the necessary ordering, because
>>> there's a full barrier on context switch for it to pair with, so I don't
>>> need to broadcast IPI anymore.  I don't know whether task_call_func() is
>>
>> This is not clear to me because rdt_move_group_tasks() seems to have the
>> same code as shown below as vulnerable to re-ordering. Only difference
>> is that it uses the "//false" checks to set a bit in the cpumask for a
>> later IPI instead of an immediate IPI.
> 
> An smp_mb() between writing the new task_struct::{closid,rmid} and
> calling task_curr() would prevent the reordering I described, but I
> was worried about the cost of executing a full barrier for every
> matching task.

So for moving a single task the solution may just be to change
the current barrier() to smp_mb()?

> 
> I tried something like this:
> 
> for_each_process_thread(p, t) {
> 	if (!from || is_closid_match(t, from) ||
> 	    is_rmid_match(t, from)) {
> 		WRITE_ONCE(t->closid, to->closid);
> 		WRITE_ONCE(t->rmid, to->mon.rmid);
> 		/* group moves are serialized by rdt */
> 		t->resctrl_dirty = true;
> 	}
> }
> if (IS_ENABLED(CONFIG_SMP) && mask) {
> 	/* Order t->{closid,rmid} stores before loads in task_curr() */
> 	smp_mb();
> 	for_each_process_thread(p, t) {
> 		if (t->resctrl_dirty) {
> 			if (task_curr(t))
> 				cpumask_set_cpu(task_cpu(t), mask);
> 			t->resctrl_dirty = false;
> 		}
> 	}
> }
> 

struct task_struct would not welcome a new member dedicated to resctrl's
rare use for convenience. Another option may be to use a flag within
the variables themselves, but that seems like significant complication
(the flag would need to be dealt with during scheduling) for which the benefit
is not clear to me. I would prefer that we start with the simplest
solution first (I see that as an IPI to all CPUs). The changelog needs a clear
description of the problem being solved and the solution chosen, noting
the tradeoffs with other possible solutions. You can submit that, as an RFC
if the "IPI to all CPUs" remains a concern, after which we can bring that
submission to the attention of the experts, who would then have the information
needed to point us in the right direction.
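
For illustration, the "IPI to all CPUs" variant could be something as small as
the (untested) sketch below. The helper name is only a placeholder, and
resctrl's existing update_closid_rmid() helper could probably be reused for
this instead:

static void _resctrl_refresh_pqr_assoc(void *unused)
{
	/*
	 * Re-evaluate the current task's closid/rmid on this CPU so that
	 * PQR_ASSOC reflects any group move that just happened.
	 */
	resctrl_sched_in();
}

/* After the task_struct::{closid,rmid} values have been updated: */
on_each_cpu(_resctrl_refresh_pqr_assoc, NULL, 1);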

Reinette
  
Peter Newman Dec. 14, 2022, 10:05 a.m. UTC | #8
Hi Reinette,

On Tue, Dec 13, 2022 at 7:34 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
> On 12/12/2022 9:36 AM, Peter Newman wrote:
> > On Sat, Dec 10, 2022 at 12:54 AM Reinette Chatre <reinette.chatre@intel.com> wrote:
> >> On 12/8/2022 2:30 PM, Peter Newman wrote:
> >>> Based on this, I'll just sketch out the first scenario below and drop
> >>> (2) from the changelog. This also implies that the group update cases
> >>
> >> ok, thank you for doing that analysis.
> >>
> >>> can use a single smp_mb() to provide all the necessary ordering, because
> >>> there's a full barrier on context switch for it to pair with, so I don't
> >>> need to broadcast IPI anymore.  I don't know whether task_call_func() is
> >>
> >> This is not clear to me because rdt_move_group_tasks() seems to have the
> >> same code as shown below as vulnerable to re-ordering. Only difference
> >> is that it uses the "//false" checks to set a bit in the cpumask for a
> >> later IPI instead of an immediate IPI.
> >
> > An smp_mb() between writing the new task_struct::{closid,rmid} and
> > calling task_curr() would prevent the reordering I described, but I
> > was worried about the cost of executing a full barrier for every
> > matching task.
>
> So for moving a single task the solution may just be to change
> the current barrier() to smp_mb()?

Yes, that's a simpler change, so I'll just do that instead.
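
Concretely, that would keep the structure the unpatched code has today and just
strengthen the barrier; a rough (untested) sketch of the tail of
__rdtgroup_move_task(), with the error handling omitted:

	if (rdtgrp->type == RDTCTRL_GROUP) {
		WRITE_ONCE(tsk->closid, rdtgrp->closid);
		WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
	} else if (rdtgrp->type == RDTMON_GROUP) {
		WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
	}

	/*
	 * Order the closid/rmid stores above before the loads in task_curr().
	 * Pairs with the full barrier between the rq->curr update and
	 * resctrl_sched_in() during context switch.
	 */
	smp_mb();

	update_task_closid_rmid(tsk);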

>
> >
> > I tried something like this:
> >
> > for_each_process_thread(p, t) {
> >       if (!from || is_closid_match(t, from) ||
> >           is_rmid_match(t, from)) {
> >               WRITE_ONCE(t->closid, to->closid);
> >               WRITE_ONCE(t->rmid, to->mon.rmid);
> >               /* group moves are serialized by rdt */
> >               t->resctrl_dirty = true;
> >       }
> > }
> > if (IS_ENABLED(CONFIG_SMP) && mask) {
> >       /* Order t->{closid,rmid} stores before loads in task_curr() */
> >       smp_mb();
> >       for_each_process_thread(p, t) {
> >               if (t->resctrl_dirty) {
> >                       if (task_curr(t))
> >                               cpumask_set_cpu(task_cpu(t), mask);
> >                       t->resctrl_dirty = false;
> >               }
> >       }
> > }
> >
>
> struct task_struct would not welcome a new member dedicated to resctrl's
> rare use for convenience. Another option may be to use a flag within
> the variables themselves, but that seems like significant complication
> (the flag would need to be dealt with during scheduling) for which the benefit
> is not clear to me. I would prefer that we start with the simplest
> solution first (I see that as an IPI to all CPUs). The changelog needs a clear
> description of the problem being solved and the solution chosen, noting
> the tradeoffs with other possible solutions. You can submit that, as an RFC
> if the "IPI to all CPUs" remains a concern, after which we can bring that
> submission to the attention of the experts, who would then have the information
> needed to point us in the right direction.

To be complete, I did the benchmark again with the simple addition of
an smp_mb() on every iteration with a matching CLOSID/RMID and found
that it didn't result in a substantial performance impact (1.57ms -> 1.65ms).

This isn't as significant as the 2x slowdown I saw when using
task_call_func(), so maybe task_call_func() is just really expensive.
That's more reason to just upgrade the barrier in the single-task move
case.

While I agree with your points on the IPI broadcast, it seems like a
discussion I would prefer to just avoid given these measurements.

-Peter
  

Patch

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index e5a48f05e787..59b7ffcd53bb 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -528,6 +528,31 @@  static void rdtgroup_remove(struct rdtgroup *rdtgrp)
 	kfree(rdtgrp);
 }
 
+static int update_locked_task_closid_rmid(struct task_struct *t, void *arg)
+{
+	struct rdtgroup *rdtgrp = arg;
+
+	/*
+	 * Although task_call_func() serializes the writes below with the paired
+	 * reads in resctrl_sched_in(), resctrl_sched_in() still needs
+	 * READ_ONCE() due to rdt_move_group_tasks(), so use WRITE_ONCE() here
+	 * to conform.
+	 */
+	if (rdtgrp->type == RDTCTRL_GROUP) {
+		WRITE_ONCE(t->closid, rdtgrp->closid);
+		WRITE_ONCE(t->rmid, rdtgrp->mon.rmid);
+	} else if (rdtgrp->type == RDTMON_GROUP) {
+		WRITE_ONCE(t->rmid, rdtgrp->mon.rmid);
+	}
+
+	/*
+	 * If the task is current on a CPU, the PQR_ASSOC MSR needs to be
+	 * updated to make the resource group go into effect. If the task is not
+	 * current, the MSR will be updated when the task is scheduled in.
+	 */
+	return task_curr(t);
+}
+
 static void _update_task_closid_rmid(void *task)
 {
 	/*
@@ -538,10 +563,24 @@  static void _update_task_closid_rmid(void *task)
 		resctrl_sched_in();
 }
 
-static void update_task_closid_rmid(struct task_struct *t)
+static void update_task_closid_rmid(struct task_struct *t,
+				    struct rdtgroup *rdtgrp)
 {
-	if (IS_ENABLED(CONFIG_SMP) && task_curr(t))
-		smp_call_function_single(task_cpu(t), _update_task_closid_rmid, t, 1);
+	/*
+	 * Serialize the closid and rmid update with context switch. If
+	 * task_call_func() indicates that the task was running during
+	 * update_locked_task_closid_rmid(), then interrupt it.
+	 */
+	if (task_call_func(t, update_locked_task_closid_rmid, rdtgrp) &&
+	    IS_ENABLED(CONFIG_SMP))
+		/*
+		 * If the task has migrated away from the CPU indicated by
+		 * task_cpu() below, then it has already switched in on the
+		 * new CPU using the updated closid and rmid and the call below
+		 * is unnecessary, but harmless.
+		 */
+		smp_call_function_single(task_cpu(t),
+					 _update_task_closid_rmid, t, 1);
 	else
 		_update_task_closid_rmid(t);
 }
@@ -557,39 +596,16 @@  static int __rdtgroup_move_task(struct task_struct *tsk,
 		return 0;
 
 	/*
-	 * Set the task's closid/rmid before the PQR_ASSOC MSR can be
-	 * updated by them.
-	 *
-	 * For ctrl_mon groups, move both closid and rmid.
 	 * For monitor groups, can move the tasks only from
 	 * their parent CTRL group.
 	 */
-
-	if (rdtgrp->type == RDTCTRL_GROUP) {
-		WRITE_ONCE(tsk->closid, rdtgrp->closid);
-		WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
-	} else if (rdtgrp->type == RDTMON_GROUP) {
-		if (rdtgrp->mon.parent->closid == tsk->closid) {
-			WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
-		} else {
-			rdt_last_cmd_puts("Can't move task to different control group\n");
-			return -EINVAL;
-		}
+	if (rdtgrp->type == RDTMON_GROUP &&
+	    rdtgrp->mon.parent->closid != tsk->closid) {
+		rdt_last_cmd_puts("Can't move task to different control group\n");
+		return -EINVAL;
 	}
 
-	/*
-	 * Ensure the task's closid and rmid are written before determining if
-	 * the task is current that will decide if it will be interrupted.
-	 */
-	barrier();
-
-	/*
-	 * By now, the task's closid and rmid are set. If the task is current
-	 * on a CPU, the PQR_ASSOC MSR needs to be updated to make the resource
-	 * group go into effect. If the task is not current, the MSR will be
-	 * updated when the task is scheduled in.
-	 */
-	update_task_closid_rmid(tsk);
+	update_task_closid_rmid(tsk, rdtgrp);
 
 	return 0;
 }