Message ID | 20221129111055.953833-2-peternewman@google.com |
---|---|
State | New |
Series | x86/resctrl: Fix task CLOSID update race |
Commit Message
Peter Newman
Nov. 29, 2022, 11:10 a.m. UTC
When the user moves a running task to a new rdtgroup using the tasks
file interface, the resulting change in CLOSID/RMID must be immediately
propagated to the PQR_ASSOC MSR on the task's CPU.

It is possible for a task to wake up or migrate while it is being moved
to a new group. If __rdtgroup_move_task() fails to observe that a task
has begun running or misses that it migrated to a new CPU, the task will
continue to use the old CLOSID or RMID until it switches in again.

__rdtgroup_move_task() assumes that if the task migrates off of its CPU
before it can IPI the task, then the task has already observed the
updated CLOSID/RMID. Because this is done locklessly and an x86 CPU can
delay stores until after loads, the following incorrect scenarios are
possible:

1. __rdtgroup_move_task() stores the new closid and rmid in
   the task structure after it loads task_curr() and task_cpu().

2. resctrl_sched_in() loads t->{closid,rmid} before the calling context
   switch stores new task_curr() and task_cpu() values.

Use task_call_func() in __rdtgroup_move_task() to serialize updates to
the closid and rmid fields in the task_struct with context switch.

Signed-off-by: Peter Newman <peternewman@google.com>
Reviewed-by: James Morse <james.morse@arm.com>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 78 ++++++++++++++++----------
 1 file changed, 47 insertions(+), 31 deletions(-)
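The reordering the changelog depends on, an earlier store being overtaken by a later load, can be observed from userspace with the classic store-buffering litmus test. The program below is an illustrative analogue only (plain C11 with pthreads, not kernel code, and none of the resctrl structures appear): each thread plays one side of the race, storing its flag and then loading the other's. On x86, some rounds end with both threads having read the pre-store values, and a compiler-only barrier such as the kernel's barrier() between the store and the load would not eliminate those rounds.

/* sb_demo.c: store-buffering litmus test. Build: cc -O2 -pthread sb_demo.c
 * Counts rounds in which both threads read 0, an outcome forbidden under
 * sequential consistency but permitted (and observable) on x86 because
 * each CPU's store can still sit in its store buffer when the other
 * CPU's load executes.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define ROUNDS 1000000

static atomic_int x, y;			/* shared flags, reset each round */
static atomic_int round0, round1;	/* per-thread round rendezvous */
static int r0, r1;			/* value each thread loaded */

static void *thread0(void *arg)
{
	(void)arg;
	for (int i = 1; i <= ROUNDS; i++) {
		while (atomic_load(&round0) != i)	/* wait for round i */
			;
		atomic_store_explicit(&x, 1, memory_order_relaxed); /* store, */
		r0 = atomic_load_explicit(&y, memory_order_relaxed); /* then load */
		atomic_store(&round0, -i);		/* signal round done */
	}
	return NULL;
}

static void *thread1(void *arg)
{
	(void)arg;
	for (int i = 1; i <= ROUNDS; i++) {
		while (atomic_load(&round1) != i)
			;
		atomic_store_explicit(&y, 1, memory_order_relaxed);
		r1 = atomic_load_explicit(&x, memory_order_relaxed);
		atomic_store(&round1, -i);
	}
	return NULL;
}

int main(void)
{
	pthread_t t0, t1;
	long both_zero = 0;

	pthread_create(&t0, NULL, thread0, NULL);
	pthread_create(&t1, NULL, thread1, NULL);
	for (int i = 1; i <= ROUNDS; i++) {
		atomic_store(&x, 0);
		atomic_store(&y, 0);
		atomic_store(&round0, i);	/* release both threads */
		atomic_store(&round1, i);
		while (atomic_load(&round0) != -i || atomic_load(&round1) != -i)
			;			/* wait for both to finish */
		if (r0 == 0 && r1 == 0)
			both_zero++;		/* neither saw the other's store */
	}
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);
	printf("store buffering observed in %ld of %d rounds\n",
	       both_zero, ROUNDS);
	return 0;
}

The count is typically nonzero on multicore x86; making the store or the load sequentially consistent (which the compiler implements with a full fence) makes it drop to zero, which is the same strengthening the patch discussion below converges on.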
Comments
Hi Peter,

On 11/29/2022 3:10 AM, Peter Newman wrote:
> When the user moves a running task to a new rdtgroup using the tasks
> file interface, the resulting change in CLOSID/RMID must be immediately
> propagated to the PQR_ASSOC MSR on the task's CPU.
>
> It is possible for a task to wake up or migrate while it is being moved
> to a new group. If __rdtgroup_move_task() fails to observe that a task
> has begun running or misses that it migrated to a new CPU, the task will
> continue to use the old CLOSID or RMID until it switches in again.
>
> __rdtgroup_move_task() assumes that if the task migrates off of its CPU
> before it can IPI the task, then the task has already observed the
> updated CLOSID/RMID. Because this is done locklessly and an x86 CPU can
> delay stores until after loads, the following incorrect scenarios are
> possible:
>
> 1. __rdtgroup_move_task() stores the new closid and rmid in
>    the task structure after it loads task_curr() and task_cpu().

Stating how this scenario encounters the problem would help so perhaps
something like (please feel free to change):

"If the task starts running between a reordered task_curr() check and
the CLOSID/RMID update then it will start running with the old
CLOSID/RMID until it is switched again because __rdtgroup_move_task()
failed to determine that it needs to be interrupted to obtain the new
CLOSID/RMID."

> 2. resctrl_sched_in() loads t->{closid,rmid} before the calling context
>    switch stores new task_curr() and task_cpu() values.

This scenario is not clear to me. Could you please provide more detail
about it? I was trying to follow the context_switch() flow and
resctrl_sched_in() is one of the last things done
(context_switch()->switch_to()->resctrl_sched_in()). From what I can
tell rq->curr, as used by task_curr(), is set before even
context_switch() is called ... and since the next task is picked from
the CPU's runqueue (and set_task_cpu() sets the task's cpu when moved to
a runqueue) it seems to me that the value used by task_cpu() would also
be set early (before context_switch() is called). It is thus not clear
to me how the above reordering could occur so an example would help a
lot.

>
> Use task_call_func() in __rdtgroup_move_task() to serialize updates to
> the closid and rmid fields in the task_struct with context switch.

Is there a reason why there is a switch between the all caps CLOSID/RMID
at the beginning to the no caps here?

> Signed-off-by: Peter Newman <peternewman@google.com>
> Reviewed-by: James Morse <james.morse@arm.com>
> ---
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 78 ++++++++++++++++----------
>  1 file changed, 47 insertions(+), 31 deletions(-)

...

The change looks good to me.

Reinette
Hi Reinette,

On Tue, Dec 6, 2022 at 7:57 PM Reinette Chatre <reinette.chatre@intel.com> wrote:
> On 11/29/2022 3:10 AM, Peter Newman wrote:
> > When the user moves a running task to a new rdtgroup using the tasks
> > file interface, the resulting change in CLOSID/RMID must be immediately
> > propagated to the PQR_ASSOC MSR on the task's CPU.
> >
> > It is possible for a task to wake up or migrate while it is being moved
> > to a new group. If __rdtgroup_move_task() fails to observe that a task
> > has begun running or misses that it migrated to a new CPU, the task will
> > continue to use the old CLOSID or RMID until it switches in again.
> >
> > __rdtgroup_move_task() assumes that if the task migrates off of its CPU
> > before it can IPI the task, then the task has already observed the
> > updated CLOSID/RMID. Because this is done locklessly and an x86 CPU can
> > delay stores until after loads, the following incorrect scenarios are
> > possible:
> >
> > 1. __rdtgroup_move_task() stores the new closid and rmid in
> >    the task structure after it loads task_curr() and task_cpu().
>
> Stating how this scenario encounters the problem would help
> so perhaps something like (please feel free to change):
> "If the task starts running between a reordered task_curr() check and
> the CLOSID/RMID update then it will start running with the old CLOSID/RMID
> until it is switched again because __rdtgroup_move_task() failed to determine
> that it needs to be interrupted to obtain the new CLOSID/RMID."

That is largely what I was trying to state in paragraph 2 above, though
at a higher level. I hoped the paragraph following it would do enough to
connect the high-level description with the low-level problem scenarios.

> > 2. resctrl_sched_in() loads t->{closid,rmid} before the calling context
> >    switch stores new task_curr() and task_cpu() values.
>
> This scenario is not clear to me. Could you please provide more detail about it?
> I was trying to follow the context_switch() flow and resctrl_sched_in() is
> one of the last things done (context_switch()->switch_to()->resctrl_sched_in()).
> From what I can tell rq->curr, as used by task_curr() is set before
> even context_switch() is called ... and since the next task is picked from
> the CPU's runqueue (and set_task_cpu() sets the task's cpu when moved to
> a runqueue) it seems to me that the value used by task_cpu() would also
> be set early (before context_switch() is called). It is thus not clear to
> me how the above reordering could occur so an example would help a lot.

Perhaps in both scenarios I didn't make it clear that reordering in the
CPU can cause the incorrect behavior rather than the program order. In
this explanation, items 1. and 2. are supposed to be completing the
sentence ending with a ':' at the end of paragraph 3, so I thought that
would keep focus on the CPU.

I had assumed that the ordering requirements were well-understood, since
they're stated in existing code comments a few times, and that making a
case for how the expected ordering could be violated would be enough,
but I'm happy to draw up a side-by-side example.

> >
> > Use task_call_func() in __rdtgroup_move_task() to serialize updates to
> > the closid and rmid fields in the task_struct with context switch.
>
> Is there a reason why there is a switch between the all caps CLOSID/RMID
> at the beginning to the no caps here?

It's because I referred to the task_struct fields explicitly here.

-Peter
Hi Peter,

On 12/7/2022 2:58 AM, Peter Newman wrote:
> Hi Reinette,
>
> On Tue, Dec 6, 2022 at 7:57 PM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>> On 11/29/2022 3:10 AM, Peter Newman wrote:
>>> When the user moves a running task to a new rdtgroup using the tasks
>>> file interface, the resulting change in CLOSID/RMID must be immediately
>>> propagated to the PQR_ASSOC MSR on the task's CPU.
>>>
>>> It is possible for a task to wake up or migrate while it is being moved
>>> to a new group. If __rdtgroup_move_task() fails to observe that a task
>>> has begun running or misses that it migrated to a new CPU, the task will
>>> continue to use the old CLOSID or RMID until it switches in again.
>>>
>>> __rdtgroup_move_task() assumes that if the task migrates off of its CPU
>>> before it can IPI the task, then the task has already observed the
>>> updated CLOSID/RMID. Because this is done locklessly and an x86 CPU can
>>> delay stores until after loads, the following incorrect scenarios are
>>> possible:
>>>
>>> 1. __rdtgroup_move_task() stores the new closid and rmid in
>>>    the task structure after it loads task_curr() and task_cpu().
>>
>> Stating how this scenario encounters the problem would help
>> so perhaps something like (please feel free to change):
>> "If the task starts running between a reordered task_curr() check and
>> the CLOSID/RMID update then it will start running with the old CLOSID/RMID
>> until it is switched again because __rdtgroup_move_task() failed to determine
>> that it needs to be interrupted to obtain the new CLOSID/RMID."
>
> That is largely what I was trying to state in paragraph 2 above, though
> at a higher level. I hoped the paragraph following it would do enough to
> connect the high-level description with the low-level problem scenarios.

There is no need to require the reader to connect various snippets to
create a problematic scenario themselves. The changelog should make the
problem obvious. I understand that it is what you wanted to say; that is
why I moved existing snippets to form a coherent problem scenario. It is
ok if you do not like the way I wrote it, it was only an example of how
it can be done.

>>> 2. resctrl_sched_in() loads t->{closid,rmid} before the calling context
>>>    switch stores new task_curr() and task_cpu() values.
>>
>> This scenario is not clear to me. Could you please provide more detail about it?
>> I was trying to follow the context_switch() flow and resctrl_sched_in() is
>> one of the last things done (context_switch()->switch_to()->resctrl_sched_in()).
>> From what I can tell rq->curr, as used by task_curr() is set before
>> even context_switch() is called ... and since the next task is picked from
>> the CPU's runqueue (and set_task_cpu() sets the task's cpu when moved to
>> a runqueue) it seems to me that the value used by task_cpu() would also
>> be set early (before context_switch() is called). It is thus not clear to
>> me how the above reordering could occur so an example would help a lot.
>
> Perhaps in both scenarios I didn't make it clear that reordering in the
> CPU can cause the incorrect behavior rather than the program order. In
> this explanation, items 1. and 2. are supposed to be completing the
> sentence ending with a ':' at the end of paragraph 3, so I thought that
> would keep focus on the CPU.

You did make it clear that the cause is reordering in the CPU. I am just
not able to see where the reordering is occurring in your scenario (2).

> I had assumed that the ordering requirements were well-understood, since
> they're stated in existing code comments a few times, and that making a
> case for how the expected ordering could be violated would be enough,
> but I'm happy to draw up a side-by-side example.

Please do. Could you start by highlighting which resctrl_sched_in() you
are referring to? I am trying to dissect (2) with the given information:
Through "the calling context switch" the scenario is written to create
understanding that it refers to:
context_switch()->switch_to()->resctrl_sched_in() - so the calling
context switch is the first in the above call path ... where does it
(context_switch()) store the new task_curr() and task_cpu() values and
how does that reorder with resctrl_sched_in() further down in the call
path?

>>> Use task_call_func() in __rdtgroup_move_task() to serialize updates to
>>> the closid and rmid fields in the task_struct with context switch.
>>
>> Is there a reason why there is a switch between the all caps CLOSID/RMID
>> at the beginning to the no caps here?
>
> It's because I referred to the task_struct fields explicitly here.

You can use task_struct::closid and task_struct::rmid to make this
clear.

Reinette
Hi Reinette,

On Wed, Dec 7, 2022 at 7:41 PM Reinette Chatre <reinette.chatre@intel.com> wrote:
> On 12/7/2022 2:58 AM, Peter Newman wrote:
> >>> 2. resctrl_sched_in() loads t->{closid,rmid} before the calling context
> >>> switch stores new task_curr() and task_cpu() values.
> >>
> >> This scenario is not clear to me. Could you please provide more detail about it?
> >> I was trying to follow the context_switch() flow and resctrl_sched_in() is
> >> one of the last things done (context_switch()->switch_to()->resctrl_sched_in()).
> >> From what I can tell rq->curr, as used by task_curr() is set before
> >> even context_switch() is called ... and since the next task is picked from
> >> the CPU's runqueue (and set_task_cpu() sets the task's cpu when moved to
> >> a runqueue) it seems to me that the value used by task_cpu() would also
> >> be set early (before context_switch() is called). It is thus not clear to
> >> me how the above reordering could occur so an example would help a lot.
> >
> > Perhaps in both scenarios I didn't make it clear that reordering in the
> > CPU can cause the incorrect behavior rather than the program order. In
> > this explanation, items 1. and 2. are supposed to be completing the
> > sentence ending with a ':' at the end of paragraph 3, so I thought that
> > would keep focus on the CPU.
>
> You did make it clear that the cause is reordering in the CPU. I am just
> not able to see where the reordering is occurring in your scenario (2).

It will all come down to whether it can get from updating rq->curr to
reading task_struct::{closid,rmid} without encountering a full barrier.
I'll go into the details below.

> Please do. Could you start by highlighting which resctrl_sched_in()
> you are referring to? I am trying to dissect (2) with the given information:
> Through "the calling context switch" the scenario is written to create
> understanding that it refers to:
> context_switch()->switch_to()->resctrl_sched_in() - so the calling context
> switch is the first in the above call path ... where does it (context_switch())
> store the new task_curr() and task_cpu() values and how does that reorder with
> resctrl_sched_in() further down in the call path?

Yes, the rq->curr update is actually in __schedule(). I was probably
still thinking it was in prepare_task_switch() (called from
context_switch()) because of older kernels where __rdtgroup_move_task()
is still reading task_struct::on_cpu.

There is an interesting code comment under the rq->curr update site in
__schedule():

	/*
	 * RCU users of rcu_dereference(rq->curr) may not see
	 * changes to task_struct made by pick_next_task().
	 */
	RCU_INIT_POINTER(rq->curr, next);
	/*
	 * The membarrier system call requires each architecture
	 * to have a full memory barrier after updating
	 * rq->curr, before returning to user-space.
	 *
	 * Here are the schemes providing that barrier on the
	 * various architectures:
	 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
	 *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
	 * - finish_lock_switch() for weakly-ordered
	 *   architectures where spin_unlock is a full barrier,
	 * - switch_to() for arm64 (weakly-ordered, spin_unlock
	 *   is a RELEASE barrier),
	 */

Based on this, I believe (2) doesn't happen on x86 because switch_mm()
provides the required ordering.

On arm64, it won't happen as long as it calls its resctrl_sched_in()
after the dsb(ish) in __switch_to(), which seems to be the case:

https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/tree/arch/arm64/kernel/process.c?h=mpam/snapshot/v6.0-rc1#n561

Based on this, I'll just sketch out the first scenario below and drop
(2) from the changelog. This also implies that the group update cases
can use a single smp_mb() to provide all the necessary ordering, because
there's a full barrier on context switch for it to pair with, so I don't
need to broadcast IPI anymore. I don't know whether task_call_func() is
faster than an smp_mb(). I'll take some measurements to see.

The presumed behavior is __rdtgroup_move_task() not seeing t1 running
yet implies that it observes the updated values:

  CPU 0                                CPU 1
  -----                                -----
  (t1->{closid,rmid} -> {1,1})         (rq->curr -> t0)

  __rdtgroup_move_task():
    t1->{closid,rmid} <- {2,2}
    curr <- t1->cpu->rq->curr
                                       __schedule():
                                         rq->curr <- t1
                                       resctrl_sched_in():
                                         t1->{closid,rmid} -> {2,2}
    if (curr == t1) // false
      IPI(t1->cpu)

In (1), CPU 0 is allowed to store {closid,rmid} after reading whether t1
is current:

  CPU 0                                CPU 1
  -----                                -----
  __rdtgroup_move_task():
    curr <- t1->cpu->rq->curr
                                       __schedule():
                                         rq->curr <- t1
                                       resctrl_sched_in():
                                         t1->{closid,rmid} -> {1,1}
    t1->{closid,rmid} <- {2,2}
    if (curr == t1) // false
      IPI(t1->cpu)

Please let me know if these diagrams clarify things.

-Peter
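The barrier pairing invoked above can be written out explicitly. The following is an illustrative sketch only: it reuses identifiers from the patch under discussion but was not posted in this thread, and it assumes the context-switch side's full barrier is one of the mechanisms listed in the __schedule() comment quoted above.

	/* Writer side (task move), sketched: */
	WRITE_ONCE(t->closid, rdtgrp->closid);
	WRITE_ONCE(t->rmid, rdtgrp->mon.rmid);
	/*
	 * Order the stores above before the task_curr() and task_cpu()
	 * loads below; pairs with the full barrier that executes between
	 * __schedule() storing rq->curr and resctrl_sched_in() loading
	 * t->{closid,rmid}.
	 */
	smp_mb();
	if (task_curr(t))
		smp_call_function_single(task_cpu(t),
					 _update_task_closid_rmid, t, 1);

	/*
	 * Reader side (context switch), for reference:
	 *   __schedule():        rq->curr = next;
	 *   <full barrier>       e.g. switch_mm() on x86
	 *   resctrl_sched_in():  loads t->{closid,rmid}
	 */

With a full barrier on each side, at least one side must observe the other's store: either the mover sees task_curr() return true and sends the IPI, or the incoming task loads the new closid/rmid, so neither scenario in the diagrams can leave the task running with stale values.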
Hi Peter,

On 12/8/2022 2:30 PM, Peter Newman wrote:
> On Wed, Dec 7, 2022 at 7:41 PM Reinette Chatre <reinette.chatre@intel.com> wrote:
>> On 12/7/2022 2:58 AM, Peter Newman wrote:
>>>>> 2. resctrl_sched_in() loads t->{closid,rmid} before the calling context
>>>>> switch stores new task_curr() and task_cpu() values.

...

> Based on this, I'll just sketch out the first scenario below and drop
> (2) from the changelog. This also implies that the group update cases

ok, thank you for doing that analysis.

> can use a single smp_mb() to provide all the necessary ordering, because
> there's a full barrier on context switch for it to pair with, so I don't
> need to broadcast IPI anymore. I don't know whether task_call_func() is

This is not clear to me because rdt_move_group_tasks() seems to have the
same code as shown below as vulnerable to re-ordering. Only difference
is that it uses the "//false" checks to set a bit in the cpumask for a
later IPI instead of an immediate IPI.

> faster than an smp_mb(). I'll take some measurements to see.
>
> The presumed behavior is __rdtgroup_move_task() not seeing t1 running
> yet implies that it observes the updated values:
>
>   CPU 0                                CPU 1
>   -----                                -----
>   (t1->{closid,rmid} -> {1,1})         (rq->curr -> t0)
>
>   __rdtgroup_move_task():
>     t1->{closid,rmid} <- {2,2}
>     curr <- t1->cpu->rq->curr
>                                        __schedule():
>                                          rq->curr <- t1
>                                        resctrl_sched_in():
>                                          t1->{closid,rmid} -> {2,2}
>     if (curr == t1) // false
>       IPI(t1->cpu)

I understand that the test is false when it may be expected to be true,
but there does not seem to be a problem because of that. t1 was
scheduled in with the correct CLOSID/RMID and its CPU did not get an
unnecessary IPI.

> In (1), CPU 0 is allowed to store {closid,rmid} after reading whether t1
> is current:
>
>   CPU 0                                CPU 1
>   -----                                -----
>   __rdtgroup_move_task():
>     curr <- t1->cpu->rq->curr
>                                        __schedule():
>                                          rq->curr <- t1
>                                        resctrl_sched_in():
>                                          t1->{closid,rmid} -> {1,1}
>     t1->{closid,rmid} <- {2,2}
>     if (curr == t1) // false
>       IPI(t1->cpu)

Yes, this I understand to be the problematic scenario.

> Please let me know if these diagrams clarify things.

They do, thank you very much.

Reinette
Hi Reinette,

On Sat, Dec 10, 2022 at 12:54 AM Reinette Chatre <reinette.chatre@intel.com> wrote:
> On 12/8/2022 2:30 PM, Peter Newman wrote:
> > Based on this, I'll just sketch out the first scenario below and drop
> > (2) from the changelog. This also implies that the group update cases
>
> ok, thank you for doing that analysis.
>
> > can use a single smp_mb() to provide all the necessary ordering, because
> > there's a full barrier on context switch for it to pair with, so I don't
> > need to broadcast IPI anymore. I don't know whether task_call_func() is
>
> This is not clear to me because rdt_move_group_tasks() seems to have the
> same code as shown below as vulnerable to re-ordering. Only difference
> is that it uses the "//false" checks to set a bit in the cpumask for a
> later IPI instead of an immediate IPI.

An smp_mb() between writing the new task_struct::{closid,rmid} and
calling task_curr() would prevent the reordering I described, but I was
worried about the cost of executing a full barrier for every matching
task.

I tried something like this:

	for_each_process_thread(p, t) {
		if (!from || is_closid_match(t, from) ||
		    is_rmid_match(t, from)) {
			WRITE_ONCE(t->closid, to->closid);
			WRITE_ONCE(t->rmid, to->mon.rmid);

			/* group moves are serialized by rdt */
			t->resctrl_dirty = true;
		}
	}
	if (IS_ENABLED(CONFIG_SMP) && mask) {
		/* Order t->{closid,rmid} stores before loads in task_curr() */
		smp_mb();
		for_each_process_thread(p, t) {
			if (t->resctrl_dirty) {
				if (task_curr(t))
					cpumask_set_cpu(task_cpu(t), mask);
				t->resctrl_dirty = false;
			}
		}
	}

I repeated the `perf bench sched messaging -g 40 -l100000` benchmark
from before[1] to compare this with the baseline, and found that it only
increased the time to delete the benchmark's group from 1.65ms to
1.66ms, so it's an alternative to what I last posted.

I could do something similar in the single-task move, but I don't think
it makes much of a performance difference in that case. It's only a win
for the group move because the synchronization cost doesn't grow with
the group size.

[1] https://lore.kernel.org/lkml/20221129111055.953833-3-peternewman@google.com/

> > faster than an smp_mb(). I'll take some measurements to see.
> >
> > The presumed behavior is __rdtgroup_move_task() not seeing t1 running
> > yet implies that it observes the updated values:
> >
> >   CPU 0                                CPU 1
> >   -----                                -----
> >   (t1->{closid,rmid} -> {1,1})         (rq->curr -> t0)
> >
> >   __rdtgroup_move_task():
> >     t1->{closid,rmid} <- {2,2}
> >     curr <- t1->cpu->rq->curr
> >                                        __schedule():
> >                                          rq->curr <- t1
> >                                        resctrl_sched_in():
> >                                          t1->{closid,rmid} -> {2,2}
> >     if (curr == t1) // false
> >       IPI(t1->cpu)
>
> I understand that the test is false when it may be expected to be true,
> but there does not seem to be a problem because of that. t1 was
> scheduled in with the correct CLOSID/RMID and its CPU did not get an
> unnecessary IPI.

Yes, this one was reminding the reader of the correct behavior. I can
just leave it out.

-Peter
Hi Peter,

On 12/12/2022 9:36 AM, Peter Newman wrote:
> On Sat, Dec 10, 2022 at 12:54 AM Reinette Chatre <reinette.chatre@intel.com> wrote:
>> On 12/8/2022 2:30 PM, Peter Newman wrote:
>>> Based on this, I'll just sketch out the first scenario below and drop
>>> (2) from the changelog. This also implies that the group update cases
>>
>> ok, thank you for doing that analysis.
>>
>>> can use a single smp_mb() to provide all the necessary ordering, because
>>> there's a full barrier on context switch for it to pair with, so I don't
>>> need to broadcast IPI anymore. I don't know whether task_call_func() is
>>
>> This is not clear to me because rdt_move_group_tasks() seems to have the
>> same code as shown below as vulnerable to re-ordering. Only difference
>> is that it uses the "//false" checks to set a bit in the cpumask for a
>> later IPI instead of an immediate IPI.
>
> An smp_mb() between writing the new task_struct::{closid,rmid} and
> calling task_curr() would prevent the reordering I described, but I was
> worried about the cost of executing a full barrier for every matching
> task.

So for moving a single task the solution may just be to change the
current barrier() to smp_mb()?

> I tried something like this:
>
>	for_each_process_thread(p, t) {
>		if (!from || is_closid_match(t, from) ||
>		    is_rmid_match(t, from)) {
>			WRITE_ONCE(t->closid, to->closid);
>			WRITE_ONCE(t->rmid, to->mon.rmid);
>
>			/* group moves are serialized by rdt */
>			t->resctrl_dirty = true;
>		}
>	}
>	if (IS_ENABLED(CONFIG_SMP) && mask) {
>		/* Order t->{closid,rmid} stores before loads in task_curr() */
>		smp_mb();
>		for_each_process_thread(p, t) {
>			if (t->resctrl_dirty) {
>				if (task_curr(t))
>					cpumask_set_cpu(task_cpu(t), mask);
>				t->resctrl_dirty = false;
>			}
>		}
>	}

struct task_struct would not welcome a new member dedicated to resctrl's
rare use for convenience. Another option may be to use a flag within the
variables themselves, but that seems like significant complication (the
flag would need to be dealt with during scheduling) for which the
benefit is not clear to me. I would prefer that we start with the
simplest solution first (I see that as IPI to all CPUs). The changelog
needs a clear description of the problem needing to be solved and the
solution chosen, noting the tradeoffs with other possible solutions. You
can submit that, as an RFC if the "IPI to all CPUs" remains a concern,
after which we can bring that submission to the attention of the experts
who would have the needed information to point us in the right
direction.

Reinette
Hi Reinette,

On Tue, Dec 13, 2022 at 7:34 PM Reinette Chatre <reinette.chatre@intel.com> wrote:
> On 12/12/2022 9:36 AM, Peter Newman wrote:
> > On Sat, Dec 10, 2022 at 12:54 AM Reinette Chatre <reinette.chatre@intel.com> wrote:
> >> On 12/8/2022 2:30 PM, Peter Newman wrote:
> >>> Based on this, I'll just sketch out the first scenario below and drop
> >>> (2) from the changelog. This also implies that the group update cases
> >>
> >> ok, thank you for doing that analysis.
> >>
> >>> can use a single smp_mb() to provide all the necessary ordering, because
> >>> there's a full barrier on context switch for it to pair with, so I don't
> >>> need to broadcast IPI anymore. I don't know whether task_call_func() is
> >>
> >> This is not clear to me because rdt_move_group_tasks() seems to have the
> >> same code as shown below as vulnerable to re-ordering. Only difference
> >> is that it uses the "//false" checks to set a bit in the cpumask for a
> >> later IPI instead of an immediate IPI.
> >
> > An smp_mb() between writing the new task_struct::{closid,rmid} and
> > calling task_curr() would prevent the reordering I described, but I was
> > worried about the cost of executing a full barrier for every matching
> > task.
>
> So for moving a single task the solution may just be to change the
> current barrier() to smp_mb()?

Yes, that's a simpler change, so I'll just do that instead.

> > I tried something like this:
> >
> >	for_each_process_thread(p, t) {
> >		if (!from || is_closid_match(t, from) ||
> >		    is_rmid_match(t, from)) {
> >			WRITE_ONCE(t->closid, to->closid);
> >			WRITE_ONCE(t->rmid, to->mon.rmid);
> >
> >			/* group moves are serialized by rdt */
> >			t->resctrl_dirty = true;
> >		}
> >	}
> >	if (IS_ENABLED(CONFIG_SMP) && mask) {
> >		/* Order t->{closid,rmid} stores before loads in task_curr() */
> >		smp_mb();
> >		for_each_process_thread(p, t) {
> >			if (t->resctrl_dirty) {
> >				if (task_curr(t))
> >					cpumask_set_cpu(task_cpu(t), mask);
> >				t->resctrl_dirty = false;
> >			}
> >		}
> >	}
>
> struct task_struct would not welcome a new member dedicated to resctrl's
> rare use for convenience. Another option may be to use a flag within the
> variables themselves, but that seems like significant complication (the
> flag would need to be dealt with during scheduling) for which the
> benefit is not clear to me. I would prefer that we start with the
> simplest solution first (I see that as IPI to all CPUs). The changelog
> needs a clear description of the problem needing to be solved and the
> solution chosen, noting the tradeoffs with other possible solutions. You
> can submit that, as an RFC if the "IPI to all CPUs" remains a concern,
> after which we can bring that submission to the attention of the experts
> who would have the needed information to point us in the right
> direction.

To be complete, I did the benchmark again with the simple addition of an
smp_mb() on every iteration with a matching CLOSID/RMID and found that
it didn't result in a substantial performance impact (1.57ms ->
1.65ms). This isn't as significant as the 2x slowdown I saw when using
task_call_func(), so maybe task_call_func() is just really expensive.
That's more reason to just upgrade the barrier in the single-task move
case.

While I agree with your points on the IPI broadcast, it seems like a
discussion I would prefer to just avoid given these measurements.

-Peter
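For readers tracking the outcome: the simpler change agreed on above would amount to strengthening the existing compiler-only barrier in __rdtgroup_move_task() (visible among the removed lines of the diff below) to a full memory barrier. A sketch of what that might look like follows; it is illustrative only, no such hunk was posted in this thread, and the replacement comment wording is an assumption.

	-	/*
	-	 * Ensure the task's closid and rmid are written before determining if
	-	 * the task is current that will decide if it will be interrupted.
	-	 */
	-	barrier();
	+	/*
	+	 * Order the closid/rmid stores above before the task_curr() and
	+	 * task_cpu() loads in update_task_closid_rmid(): pairs with the
	+	 * full barrier between rq->curr being updated and
	+	 * resctrl_sched_in() reading the task's closid/rmid on the
	+	 * context-switch path.
	+	 */
	+	smp_mb();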
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index e5a48f05e787..59b7ffcd53bb 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -528,6 +528,31 @@ static void rdtgroup_remove(struct rdtgroup *rdtgrp)
 	kfree(rdtgrp);
 }
 
+static int update_locked_task_closid_rmid(struct task_struct *t, void *arg)
+{
+	struct rdtgroup *rdtgrp = arg;
+
+	/*
+	 * Although task_call_func() serializes the writes below with the paired
+	 * reads in resctrl_sched_in(), resctrl_sched_in() still needs
+	 * READ_ONCE() due to rdt_move_group_tasks(), so use WRITE_ONCE() here
+	 * to conform.
+	 */
+	if (rdtgrp->type == RDTCTRL_GROUP) {
+		WRITE_ONCE(t->closid, rdtgrp->closid);
+		WRITE_ONCE(t->rmid, rdtgrp->mon.rmid);
+	} else if (rdtgrp->type == RDTMON_GROUP) {
+		WRITE_ONCE(t->rmid, rdtgrp->mon.rmid);
+	}
+
+	/*
+	 * If the task is current on a CPU, the PQR_ASSOC MSR needs to be
+	 * updated to make the resource group go into effect. If the task is not
+	 * current, the MSR will be updated when the task is scheduled in.
+	 */
+	return task_curr(t);
+}
+
 static void _update_task_closid_rmid(void *task)
 {
 	/*
@@ -538,10 +563,24 @@ static void _update_task_closid_rmid(void *task)
 	resctrl_sched_in();
 }
 
-static void update_task_closid_rmid(struct task_struct *t)
+static void update_task_closid_rmid(struct task_struct *t,
+				    struct rdtgroup *rdtgrp)
 {
-	if (IS_ENABLED(CONFIG_SMP) && task_curr(t))
-		smp_call_function_single(task_cpu(t), _update_task_closid_rmid, t, 1);
+	/*
+	 * Serialize the closid and rmid update with context switch. If
+	 * task_call_func() indicates that the task was running during
+	 * update_locked_task_closid_rmid(), then interrupt it.
+	 */
+	if (task_call_func(t, update_locked_task_closid_rmid, rdtgrp) &&
+	    IS_ENABLED(CONFIG_SMP))
+		/*
+		 * If the task has migrated away from the CPU indicated by
+		 * task_cpu() below, then it has already switched in on the
+		 * new CPU using the updated closid and rmid and the call below
+		 * is unnecessary, but harmless.
+		 */
+		smp_call_function_single(task_cpu(t),
+					 _update_task_closid_rmid, t, 1);
 	else
 		_update_task_closid_rmid(t);
 }
@@ -557,39 +596,16 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
 		return 0;
 
 	/*
-	 * Set the task's closid/rmid before the PQR_ASSOC MSR can be
-	 * updated by them.
-	 *
-	 * For ctrl_mon groups, move both closid and rmid.
 	 * For monitor groups, can move the tasks only from
 	 * their parent CTRL group.
 	 */
-
-	if (rdtgrp->type == RDTCTRL_GROUP) {
-		WRITE_ONCE(tsk->closid, rdtgrp->closid);
-		WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
-	} else if (rdtgrp->type == RDTMON_GROUP) {
-		if (rdtgrp->mon.parent->closid == tsk->closid) {
-			WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
-		} else {
-			rdt_last_cmd_puts("Can't move task to different control group\n");
-			return -EINVAL;
-		}
+	if (rdtgrp->type == RDTMON_GROUP &&
+	    rdtgrp->mon.parent->closid != tsk->closid) {
+		rdt_last_cmd_puts("Can't move task to different control group\n");
+		return -EINVAL;
 	}
 
-	/*
-	 * Ensure the task's closid and rmid are written before determining if
-	 * the task is current that will decide if it will be interrupted.
-	 */
-	barrier();
-
-	/*
-	 * By now, the task's closid and rmid are set. If the task is current
-	 * on a CPU, the PQR_ASSOC MSR needs to be updated to make the resource
-	 * group go into effect. If the task is not current, the MSR will be
-	 * updated when the task is scheduled in.
-	 */
-	update_task_closid_rmid(tsk);
+	update_task_closid_rmid(tsk, rdtgrp);
 
 	return 0;
 }