Message ID | 20221214114447.1935755-2-peternewman@google.com |
---|---|
State | New |
Date | Wed, 14 Dec 2022 12:44:47 +0100
From | Peter Newman <peternewman@google.com>
To | reinette.chatre@intel.com, fenghua.yu@intel.com
Cc | bp@alien8.de, derkling@google.com, eranian@google.com, hpa@zytor.com, james.morse@arm.com, jannh@google.com, kpsingh@google.com, linux-kernel@vger.kernel.org, mingo@redhat.com, tglx@linutronix.de, x86@kernel.org, Peter Newman <peternewman@google.com>
In-Reply-To | <20221214114447.1935755-1-peternewman@google.com>
Series | x86/resctrl: Fix task CLOSID update race |
Commit Message
Peter Newman
Dec. 14, 2022, 11:44 a.m. UTC
When the user moves a running task to a new rdtgroup using the tasks
file interface or by deleting its rdtgroup, the resulting change in
CLOSID/RMID must be immediately propagated to the PQR_ASSOC MSR on the
task's CPU(s).

x86 allows reordering loads with prior stores, so if the task starts
running between a task_curr() check that the CPU hoisted before the
stores in the CLOSID/RMID update, then it can start running with the
old CLOSID/RMID until it is switched again, because
__rdtgroup_move_task() failed to determine that it needs to be
interrupted to obtain the new CLOSID/RMID.
Refer to the diagram below:
  CPU 0                                   CPU 1
  -----                                   -----
  __rdtgroup_move_task():
    curr <- t1->cpu->rq->curr
                                          __schedule():
                                            rq->curr <- t1
                                          resctrl_sched_in():
                                            t1->{closid,rmid} -> {1,1}
    t1->{closid,rmid} <- {2,2}
    if (curr == t1) // false
      IPI(t1->cpu)
A similar race impacts rdt_move_group_tasks(), which updates tasks in a
deleted rdtgroup.
In both cases, use smp_mb() to order the task_struct::{closid,rmid}
stores before the loads in task_curr(). In particular, in the
rdt_move_group_tasks() case, simply execute an smp_mb() on every
iteration with a matching task.
It is possible to use a single smp_mb() in rdt_move_group_tasks(), but
this would require two passes and a means of remembering which
task_structs were updated in the first loop. However, the benchmarking
results below showed too little performance impact from the simple
approach to justify implementing the two-pass approach.
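For comparison, the rejected two-pass variant would look roughly like the following untested sketch. The "remembered task" bookkeeping is the hypothetical part — `remember_task()` and `for_each_remembered_task()` do not exist and would require new state (a per-task flag or a temporary list), which is the cost the commit message weighs against the simple per-task smp_mb():

```c
/* Untested sketch only -- not the patch. Pass 1 updates every matching
 * task; a single smp_mb() then orders all of those stores before the
 * task_curr() checks in pass 2.
 */
read_lock(&tasklist_lock);
for_each_process_thread(p, t) {
	if (!from || is_closid_match(t, from) || is_rmid_match(t, from)) {
		WRITE_ONCE(t->closid, to->closid);
		WRITE_ONCE(t->rmid, to->mon.rmid);
		remember_task(t);		/* hypothetical bookkeeping */
	}
}
smp_mb();	/* one barrier covering all of the stores above */
for_each_remembered_task(t) {			/* hypothetical iterator */
	if (task_curr(t))
		cpumask_set_cpu(task_cpu(t), mask);
}
read_unlock(&tasklist_lock);
```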
Times below were collected using `perf stat` to measure the time to
remove a group containing a 1600-task, parallel workload.
CPU: Intel(R) Xeon(R) Platinum P-8136 CPU @ 2.00GHz (112 threads)
# mkdir /sys/fs/resctrl/test
# echo $$ > /sys/fs/resctrl/test/tasks
# perf bench sched messaging -g 40 -l 100000
task-clock time ranges collected using:
# perf stat rmdir /sys/fs/resctrl/test
Baseline: 1.54 - 1.60 ms
smp_mb() every matching task: 1.57 - 1.67 ms
Signed-off-by: Peter Newman <peternewman@google.com>
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
Comments
Hi Peter,

On 12/14/2022 3:44 AM, Peter Newman wrote:
> When the user moves a running task to a new rdtgroup using the tasks
> file interface or by deleting its rdtgroup, the resulting change in
> CLOSID/RMID must be immediately propagated to the PQR_ASSOC MSR on the
> task(s) CPUs.
[...]
> Baseline: 1.54 - 1.60 ms
> smp_mb() every matching task: 1.57 - 1.67 ms

For a fix a Fixes: tag is expected. It looks like the following may be
relevant:

Fixes: ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR")
Fixes: 0efc89be9471 ("x86/intel_rdt: Update task closid immediately on CPU in rmdir and unmount")

> Signed-off-by: Peter Newman <peternewman@google.com>

Also, please do let the stable team know about this via:
Cc: stable@vger.kernel.org

> ---

There is no need to submit with a cover letter, but please do keep the
history with this patch by including it here below the "---".

> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 12 +++++++++++-
> 1 file changed, 11 insertions(+), 1 deletion(-)
[patch quoted in full, trimmed]

Thank you very much for sticking with this and always paying attention
to the details along the way.

Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette
Hi Reinette,

On Fri, Dec 16, 2022 at 12:52 AM Reinette Chatre
<reinette.chatre@intel.com> wrote:
>
> For a fix a Fixes: tag is expected. It looks like the following
> may be relevant:
> Fixes: ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR")
> Fixes: 0efc89be9471 ("x86/intel_rdt: Update task closid immediately on CPU in rmdir and unmount")

Thanks for preparing these lines. I'll include them.

> > Signed-off-by: Peter Newman <peternewman@google.com>
>
> Also, please do let the stable team know about this via:
> Cc: stable@vger.kernel.org

I wasn't sure if this fix met the criteria for backporting to stable,
because I found it by code inspection, so it doesn't meet the "bothers
people" criterion. However I can make a case that it's exploitable:

"In a memory bandwidth-metered compute host, malicious jobs could
exploit this race to remain in a previous CLOSID or RMID in order to
dodge a class-of-service downgrade imposed by an admin or steal
bandwidth."

> Thank you very much for sticking with this and always paying attention
> to the details along the way.
>
> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Thank you, Reinette! This has been a learning experience for me.

-Peter
Hi Peter,

On 12/16/2022 2:26 AM, Peter Newman wrote:
> On Fri, Dec 16, 2022 at 12:52 AM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>>
>> Also, please do let the stable team know about this via:
>> Cc: stable@vger.kernel.org
>
> I wasn't sure if this fix met the criteria for backporting to stable,
> because I found it by code inspection, so it doesn't meet the "bothers
> people" criterion.

That is fair. Encountering the issue does not produce an obvious error;
the consequence is that there could be intervals during which tasks do
not get the resources/measurements they are entitled to. I do think
that this will be hard to test in order to demonstrate the impact. My
understanding was that this was encountered in your environment where
actions are taken at large scale. If this remains theoretical then
there is no need to include the stable team. With the Fixes tags they
can decide if it is something they would like to carry.

> However I can make a case that it's exploitable:
>
> "In a memory bandwidth-metered compute host, malicious jobs could
> exploit this race to remain in a previous CLOSID or RMID in order to
> dodge a class-of-service downgrade imposed by an admin or steal
> bandwidth."

I am not comfortable with such high level speculation. For this exploit
to work the malicious job needs to control scheduler decisions as well
as time the exploit with the admin's decision to move the target task.

Reinette
Hi Reinette,

On Fri, Dec 16, 2022 at 8:36 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
> On 12/16/2022 2:26 AM, Peter Newman wrote:
> > However I can make a case that it's exploitable:
> > [...]
>
> I am not comfortable with such high level speculation. For this
> exploit to work the malicious job needs to control scheduler decisions
> as well as time the exploit with the admin's decision to move the
> target task.

I imagined that if the malicious job maintained a large pool of threads
in short sleep-loops, then after it sees a drop in bandwidth, it could
cue the threads to measure their memory bandwidth to see if any got
past the CLOSID change.

I don't know whether having fast, unmetered bandwidth until the next
context switch is enough of a payoff to bother with this, though. Our
workloads have too many context switches for this to be worth very
much, so I'm fine with letting others decide how important this fix is
to them.

-Peter
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index e5a48f05e787..5993da21d822 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -580,8 +580,10 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
 	/*
 	 * Ensure the task's closid and rmid are written before determining if
 	 * the task is current that will decide if it will be interrupted.
+	 * This pairs with the full barrier between the rq->curr update and
+	 * resctrl_sched_in() during context switch.
 	 */
-	barrier();
+	smp_mb();
 
 	/*
 	 * By now, the task's closid and rmid are set. If the task is current
@@ -2401,6 +2403,14 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
 		WRITE_ONCE(t->closid, to->closid);
 		WRITE_ONCE(t->rmid, to->mon.rmid);
 
+		/*
+		 * Order the closid/rmid stores above before the loads
+		 * in task_curr(). This pairs with the full barrier
+		 * between the rq->curr update and resctrl_sched_in()
+		 * during context switch.
+		 */
+		smp_mb();
+
 		/*
 		 * If the task is on a CPU, set the CPU in the mask.
 		 * The detection is inaccurate as tasks might move or