[v5,1/1] x86/resctrl: Fix task CLOSID/RMID update race

Message ID 20221214114447.1935755-2-peternewman@google.com
State New
Series: Subject: x86/resctrl: Fix task CLOSID update race

Commit Message

Peter Newman Dec. 14, 2022, 11:44 a.m. UTC
When the user moves a running task to a new rdtgroup using the tasks
file interface or by deleting its rdtgroup, the resulting change in
CLOSID/RMID must be immediately propagated to the PQR_ASSOC MSR on the
CPUs the task(s) are running on.

x86 allows loads to be reordered with earlier stores, so if the task
starts running in between a task_curr() check that the CPU hoisted above
the stores in the CLOSID/RMID update, then it can keep running with the
old CLOSID/RMID until it is context-switched again, because
__rdtgroup_move_task() failed to determine that it needs to be
interrupted to obtain the new CLOSID/RMID.

Refer to the diagram below:

CPU 0                                   CPU 1
-----                                   -----
__rdtgroup_move_task():
  curr <- t1->cpu->rq->curr
                                        __schedule():
                                          rq->curr <- t1
                                        resctrl_sched_in():
                                          t1->{closid,rmid} -> {1,1}
  t1->{closid,rmid} <- {2,2}
  if (curr == t1) // false
   IPI(t1->cpu)

A similar race impacts rdt_move_group_tasks(), which updates tasks in a
deleted rdtgroup.
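
At the hardware level this is the classic store-buffering pattern: each
side stores to one location and then loads the other, and x86 may let
the load complete before the store becomes globally visible. The
standalone userspace sketch below (illustration only, not part of the
patch; file name and thread names are made up) can occasionally observe
exactly that outcome, and a seq_cst fence at the commented position on
each side, the C11 analogue of smp_mb(), forbids it:

/*
 * Standalone illustration (not kernel code): store-buffering reordering.
 * Each thread stores its flag and then loads the other thread's flag
 * with no fence in between, so on x86 both loads can occasionally read
 * the stale value 0 -- the same way task_curr()'s load can pass the
 * closid/rmid stores. Build with: gcc -O2 -pthread sb_demo.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int x, y;
static int r0, r1;

static void *thread0(void *arg)
{
	atomic_store_explicit(&x, 1, memory_order_relaxed);	/* store */
	/* a seq_cst fence here (the smp_mb() analogue) forbids the 0/0 case */
	r0 = atomic_load_explicit(&y, memory_order_relaxed);	/* load */
	return NULL;
}

static void *thread1(void *arg)
{
	atomic_store_explicit(&y, 1, memory_order_relaxed);	/* store */
	/* the same fence would be needed here */
	r1 = atomic_load_explicit(&x, memory_order_relaxed);	/* load */
	return NULL;
}

int main(void)
{
	for (int i = 0; i < 100000; i++) {
		pthread_t a, b;

		atomic_store(&x, 0);
		atomic_store(&y, 0);
		pthread_create(&a, NULL, thread0, NULL);
		pthread_create(&b, NULL, thread1, NULL);
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		if (r0 == 0 && r1 == 0)
			printf("store->load reordering seen at iteration %d\n", i);
	}
	return 0;
}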

In both cases, use smp_mb() to order the task_struct::{closid,rmid}
stores before the loads in task_curr().  In particular, in the
rdt_move_group_tasks() case, simply execute an smp_mb() on every
iteration with a matching task.
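
The write side of the fix then looks roughly like the sketch below
(simplified; the IPI handler name is illustrative, see the diff below
for the exact change):

	WRITE_ONCE(tsk->closid, rdtgrp->closid);
	WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);

	/*
	 * Full barrier: order the closid/rmid stores above before the
	 * rq->curr load inside task_curr() below. Pairs with the full
	 * barrier between the rq->curr update and resctrl_sched_in()
	 * on the context-switch path.
	 */
	smp_mb();

	if (task_curr(tsk))
		/* make the remote CPU reload PQR_ASSOC with the new values */
		smp_call_function_single(task_cpu(tsk),
					 resctrl_reload_pqr_assoc, /* illustrative name */
					 tsk, 1);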

It is possible to use a single smp_mb() in rdt_move_group_tasks(), but
this would require two passes and a means of remembering which
task_structs were updated in the first loop. However, the benchmark
results below show that the cost of the simple approach is too small to
justify implementing the two-pass alternative.
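
For comparison, the rejected two-pass variant might look something like
this sketch (illustrative only; mark_updated()/was_updated() stand in
for whatever bookkeeping would be needed to remember the updated tasks):

	/* Pass 1: move every matching task and remember which ones. */
	for_each_process_thread(p, t) {
		if (is_closid_match(t, from) || is_rmid_match(t, from)) {
			WRITE_ONCE(t->closid, to->closid);
			WRITE_ONCE(t->rmid, to->mon.rmid);
			mark_updated(t);		/* hypothetical helper */
		}
	}

	/* One barrier orders all of the stores above before the loads below. */
	smp_mb();

	/* Pass 2: find which of the updated tasks are currently running. */
	for_each_process_thread(p, t) {
		if (was_updated(t) && task_curr(t))	/* hypothetical helper */
			cpumask_set_cpu(task_cpu(t), mask);
	}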

Times below were collected using `perf stat` to measure the time to
remove a group containing a 1600-task, parallel workload.

CPU: Intel(R) Xeon(R) Platinum P-8136 CPU @ 2.00GHz (112 threads)

 # mkdir /sys/fs/resctrl/test
 # echo $$ > /sys/fs/resctrl/test/tasks
 # perf bench sched messaging -g 40 -l 100000

task-clock time ranges collected using:

 # perf stat rmdir /sys/fs/resctrl/test

Baseline:                     1.54 - 1.60 ms
smp_mb() every matching task: 1.57 - 1.67 ms

Signed-off-by: Peter Newman <peternewman@google.com>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)
  

Comments

Reinette Chatre Dec. 15, 2022, 11:51 p.m. UTC | #1
Hi Peter,

On 12/14/2022 3:44 AM, Peter Newman wrote:
> When the user moves a running task to a new rdtgroup using the tasks
> file interface or by deleting its rdtgroup, the resulting change in
> CLOSID/RMID must be immediately propagated to the PQR_ASSOC MSR on the
> task(s) CPUs.
> 
> x86 allows reordering loads with prior stores, so if the task starts
> running between a task_curr() check that the CPU hoisted before the
> stores in the CLOSID/RMID update then it can start running with the old
> CLOSID/RMID until it is switched again because __rdtgroup_move_task()
> failed to determine that it needs to be interrupted to obtain the new
> CLOSID/RMID.
> 
> Refer to the diagram below:
> 
> CPU 0                                   CPU 1
> -----                                   -----
> __rdtgroup_move_task():
>   curr <- t1->cpu->rq->curr
>                                         __schedule():
>                                           rq->curr <- t1
>                                         resctrl_sched_in():
>                                           t1->{closid,rmid} -> {1,1}
>   t1->{closid,rmid} <- {2,2}
>   if (curr == t1) // false
>    IPI(t1->cpu)
> 
> A similar race impacts rdt_move_group_tasks(), which updates tasks in a
> deleted rdtgroup.
> 
> In both cases, use smp_mb() to order the task_struct::{closid,rmid}
> stores before the loads in task_curr().  In particular, in the
> rdt_move_group_tasks() case, simply execute an smp_mb() on every
> iteration with a matching task.
> 
> It is possible to use a single smp_mb() in rdt_move_group_tasks(), but
> this would require two passes and a means of remembering which
> task_structs were updated in the first loop. However, benchmarking
> results below showed too little performance impact in the simple
> approach to justify implementing the two-pass approach.
> 
> Times below were collected using `perf stat` to measure the time to
> remove a group containing a 1600-task, parallel workload.
> 
> CPU: Intel(R) Xeon(R) Platinum P-8136 CPU @ 2.00GHz (112 threads)
> 
>  # mkdir /sys/fs/resctrl/test
>  # echo $$ > /sys/fs/resctrl/test/tasks
>  # perf bench sched messaging -g 40 -l 100000
> 
> task-clock time ranges collected using:
> 
>  # perf stat rmdir /sys/fs/resctrl/test
> 
> Baseline:                     1.54 - 1.60 ms
> smp_mb() every matching task: 1.57 - 1.67 ms
> 

For a fix, a Fixes: tag is expected. It looks like the following
may be relevant:
Fixes: ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR")
Fixes: 0efc89be9471 ("x86/intel_rdt: Update task closid immediately on CPU in rmdir and unmount")

> Signed-off-by: Peter Newman <peternewman@google.com>

Also, please do let the stable team know about this via:
Cc: stable@vger.kernel.org

> ---

There is no need to submit with a cover letter, but please do keep the history with this patch
by including it here below the "---".

>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index e5a48f05e787..5993da21d822 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -580,8 +580,10 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
>  	/*
>  	 * Ensure the task's closid and rmid are written before determining if
>  	 * the task is current that will decide if it will be interrupted.
> +	 * This pairs with the full barrier between the rq->curr update and
> +	 * resctrl_sched_in() during context switch.
>  	 */
> -	barrier();
> +	smp_mb();
>  
>  	/*
>  	 * By now, the task's closid and rmid are set. If the task is current
> @@ -2401,6 +2403,14 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
>  			WRITE_ONCE(t->closid, to->closid);
>  			WRITE_ONCE(t->rmid, to->mon.rmid);
>  
> +			/*
> +			 * Order the closid/rmid stores above before the loads
> +			 * in task_curr(). This pairs with the full barrier
> +			 * between the rq->curr update and resctrl_sched_in()
> +			 * during context switch.
> +			 */
> +			smp_mb();
> +
>  			/*
>  			 * If the task is on a CPU, set the CPU in the mask.
>  			 * The detection is inaccurate as tasks might move or


Thank you very much for sticking with this and always paying attention
to the details along the way.

Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette
  
Peter Newman Dec. 16, 2022, 10:26 a.m. UTC | #2
Hi Reinette,

On Fri, Dec 16, 2022 at 12:52 AM Reinette Chatre
<reinette.chatre@intel.com> wrote:
>
> For a fix a Fixes: tag is expected. It looks like the following
> may be relevant:
> Fixes: ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR")
> Fixes: 0efc89be9471 ("x86/intel_rdt: Update task closid immediately on CPU in rmdir and unmount")

Thanks for preparing these lines. I'll include them.

>
> > Signed-off-by: Peter Newman <peternewman@google.com>
>
> Also, please do let the stable team know about this via:
> Cc: stable@vger.kernel.org

I wasn't sure if this fix met the criteria for backporting to stable,
because I found it by code inspection, so it doesn't meet the "bothers
people" criterion.

However I can make a case that it's exploitable:

"In a memory bandwidth-metered compute host, malicious jobs could
exploit this race to remain in a previous CLOSID or RMID in order to
dodge a class-of-service downgrade imposed by an admin or steal
bandwidth."


>
> Thank you very much for sticking with this and always paying attention
> to the details along the way.
>
> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Thank you, Reinette! This has been a learning experience for me.

-Peter
  
Reinette Chatre Dec. 16, 2022, 7:36 p.m. UTC | #3
Hi Peter,

On 12/16/2022 2:26 AM, Peter Newman wrote:
> Hi Reinette,
> 
> On Fri, Dec 16, 2022 at 12:52 AM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>>
>> For a fix a Fixes: tag is expected. It looks like the following
>> may be relevant:
>> Fixes: ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR")
>> Fixes: 0efc89be9471 ("x86/intel_rdt: Update task closid immediately on CPU in rmdir and unmount")
> 
> Thanks for preparing these lines. I'll include them.
> 
>>
>>> Signed-off-by: Peter Newman <peternewman@google.com>
>>
>> Also, please do let the stable team know about this via:
>> Cc: stable@vger.kernel.org
> 
> I wasn't sure if this fix met the criteria for backporting to stable,
> because I found it by code inspection, so it doesn't meet the "bothers
> people" criterion.

That is fair. Encountering the issue does not produce an obvious error;
the consequence is that there could be intervals during which tasks do
not get the resources/measurements they are entitled to. I do think that
this will be hard to test in a way that demonstrates the impact.

My understanding was that this was encountered in your environment where
actions are taken at large scale. If this remains theoretical then no need
to include the stable team. With the Fixes tags they can decide if it is
something they would like to carry.

> 
> However I can make a case that it's exploitable:
> 
> "In a memory bandwidth-metered compute host, malicious jobs could
> exploit this race to remain in a previous CLOSID or RMID in order to
> dodge a class-of-service downgrade imposed by an admin or steal
> bandwidth."
> 

I am not comfortable with such high-level speculation. For this
exploit to work, the malicious jobs need to control scheduler decisions
as well as time the exploit to coincide with the admin's decision to move
the target task.


Reinette
  
Peter Newman Dec. 19, 2022, 10:22 a.m. UTC | #4
Hi Reinette,

On Fri, Dec 16, 2022 at 8:36 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
> On 12/16/2022 2:26 AM, Peter Newman wrote:
> > However I can make a case that it's exploitable:
> >
> > "In a memory bandwidth-metered compute host, malicious jobs could
> > exploit this race to remain in a previous CLOSID or RMID in order to
> > dodge a class-of-service downgrade imposed by an admin or steal
> > bandwidth."
> >
>
> I am not comfortable with such high level speculation. For this
> exploit to work the malicious jobs needs to control scheduler decisions
> as well as time the exploit with the admin's decision to move the target task.

I imagined that if the malicious job maintained a large pool of threads
in short sleep-loops, then after it sees a drop in bandwidth it could cue
the threads to measure their memory bandwidth and see whether any of them
got past the CLOSID change.

I don't know whether having fast, unmetered bandwidth until the next
context switch is enough of a payoff to bother with this, though. Our
workloads have too many context switches for this to be worth very much,
so I'm fine with letting others decide how important this fix is to
them.

-Peter
  

Patch

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index e5a48f05e787..5993da21d822 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -580,8 +580,10 @@  static int __rdtgroup_move_task(struct task_struct *tsk,
 	/*
 	 * Ensure the task's closid and rmid are written before determining if
 	 * the task is current that will decide if it will be interrupted.
+	 * This pairs with the full barrier between the rq->curr update and
+	 * resctrl_sched_in() during context switch.
 	 */
-	barrier();
+	smp_mb();
 
 	/*
 	 * By now, the task's closid and rmid are set. If the task is current
@@ -2401,6 +2403,14 @@  static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
 			WRITE_ONCE(t->closid, to->closid);
 			WRITE_ONCE(t->rmid, to->mon.rmid);
 
+			/*
+			 * Order the closid/rmid stores above before the loads
+			 * in task_curr(). This pairs with the full barrier
+			 * between the rq->curr update and resctrl_sched_in()
+			 * during context switch.
+			 */
+			smp_mb();
+
 			/*
 			 * If the task is on a CPU, set the CPU in the mask.
 			 * The detection is inaccurate as tasks might move or
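
The read side that both of these new barriers pair with runs on the
context-switch path; in heavily simplified form (actual code is spread
across __schedule() and __resctrl_sched_in(), with branches and the
default-CLOSID/RMID handling omitted), it amounts to:

	/* __schedule(), heavily simplified: */
	rq->curr = next;			/* store */

	/* the scheduler provides a full memory barrier here */

	/* __resctrl_sched_in(), reached from the context-switch path: */
	closid = READ_ONCE(current->closid);	/* load */
	rmid = READ_ONCE(current->rmid);	/* load */
	wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);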