Message ID | 20231027144050.110601-3-frederic@kernel.org |
---|---|
State | New |
Series | rcu: Fix PF_IDLE related issues v3 |
Headers |
From: Frederic Weisbecker <frederic@kernel.org>
To: LKML <linux-kernel@vger.kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>, Boqun Feng <boqun.feng@gmail.com>, Joel Fernandes <joel@joelfernandes.org>, Josh Triplett <josh@joshtriplett.org>, Mathieu Desnoyers <mathieu.desnoyers@efficios.com>, Neeraj Upadhyay <neeraj.upadhyay@amd.com>, "Paul E. McKenney" <paulmck@kernel.org>, Steven Rostedt <rostedt@goodmis.org>, Uladzislau Rezki <urezki@gmail.com>, rcu <rcu@vger.kernel.org>, Zqiang <qiang.zhang1211@gmail.com>, "Liam R. Howlett" <Liam.Howlett@oracle.com>, Peter Zijlstra <peterz@infradead.org>
Subject: [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics
Date: Fri, 27 Oct 2023 16:40:48 +0200
Message-Id: <20231027144050.110601-3-frederic@kernel.org>
In-Reply-To: <20231027144050.110601-1-frederic@kernel.org>
References: <20231027144050.110601-1-frederic@kernel.org>
Commit Message
Frederic Weisbecker
Oct. 27, 2023, 2:40 p.m. UTC
The commit:
cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup")
changed the semantics of what is considered an idle task, such that CPU
boot code preceding the actual idle loop is no longer part of it.
However, this has introduced new potential RCU-tasks stalls when:

1) A grace period is started before init/0 has had a chance to set
   PF_IDLE, keeping it stuck in the holdout list until the boot idle
   task eventually schedules.

2) A grace period is started while some possible CPUs have never been
   online, keeping their idle tasks stuck in the holdout list until
   those CPUs eventually boot up.

3) Similar to 1) but with secondary CPUs: a grace period is started
   concurrently with a secondary CPU booting, putting its idle task in
   the holdout list because PF_IDLE has not yet been observed on it.
   It then stays stuck in the holdout list until that CPU eventually
   schedules. The effect is mitigated here by the hotplug AP thread,
   which must run to bring the CPU up.
Fix this by handling the new semantics of PF_IDLE, keeping in mind that
it may or may not be set on an idle task. Take advantage of this to
strengthen the coverage of an RCU-tasks quiescent state within an idle
task, excluding the CPU boot code from it: only the code running within
the idle loop is now a quiescent state, along with offline CPUs.
Fixes: cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup")
Suggested-by: Joel Fernandes <joel@joelfernandes.org>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/rcu/tasks.h | 30 ++++++++++++++++++++++++++++--
1 file changed, 28 insertions(+), 2 deletions(-)
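In short, the patch makes the RCU-tasks holdout test aware of the idle loop and of CPU hotplug state. A condensed restatement of the check it introduces is shown below (the full hunk is quoted at the end of this page); rcu_cpu_online() is assumed to come from an earlier patch in this series.

	/*
	 * Condensed restatement of the check added by this patch;
	 * see the full kernel/rcu/tasks.h hunk at the end of this page.
	 */
	static bool rcu_tasks_is_holdout(struct task_struct *t)
	{
		int cpu = task_cpu(t);

		if (!READ_ONCE(t->on_rq))	/* voluntarily blocked */
			return false;
		if (is_idle_task(t))		/* within the idle loop */
			return false;
		if (t == idle_task(cpu) && !rcu_cpu_online(cpu))
			return false;		/* idle task of an offline CPU */
		return true;
	}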
Comments
On Fri, Oct 27, 2023 at 04:40:48PM +0200, Frederic Weisbecker wrote:
> +	/* Has the task been seen voluntarily sleeping? */
> +	if (!READ_ONCE(t->on_rq))
> +		return false;
>
> -	if (t != current && READ_ONCE(t->on_rq) && !is_idle_task(t)) {

AFAICT this ->on_rq usage is outside of scheduler locks and that READ_ONCE isn't going to help much.

Obviously a pre-existing issue, and I suppose all it cares about is seeing a 0 or not, irrespective of the races, but urgh..
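For reference, the ->on_rq write side this remark is about is a plain store on the scheduler side, abridged here from the kernel/sched/core.c lines visible in the WRITE_ONCE() patch Paul posts later in this thread:

	/*
	 * Abridged from kernel/sched/core.c: ->on_rq is updated with plain
	 * stores while holding the runqueue lock, with nothing pairing the
	 * lockless READ_ONCE() readers in RCU Tasks.
	 */
	void activate_task(struct rq *rq, struct task_struct *p, int flags)
	{
		enqueue_task(rq, p, flags);
		p->on_rq = TASK_ON_RQ_QUEUED;
	}

	void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
	{
		p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;
		dequeue_task(rq, p, flags);
	}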
On Fri, Oct 27, 2023 at 09:20:26PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 27, 2023 at 04:40:48PM +0200, Frederic Weisbecker wrote:
> > +	/* Has the task been seen voluntarily sleeping? */
> > +	if (!READ_ONCE(t->on_rq))
> > +		return false;
> >
> > -	if (t != current && READ_ONCE(t->on_rq) && !is_idle_task(t)) {
>
> AFAICT this ->on_rq usage is outside of scheduler locks and that READ_ONCE isn't going to help much.
>
> Obviously a pre-existing issue, and I suppose all it cares about is seeing a 0 or not, irrespective of the races, but urgh..

The trick is that RCU Tasks only needs to spot a task voluntarily blocked once at any point in the grace period. The beginning and end of the grace-period process have full barriers, so if this code sees t->on_rq equal to zero, we know that the task was voluntarily blocked at some point during the grace period, as required.

In theory, we could acquire a scheduler lock, but in practice this would cause CPU-latency problems at a certain set of large datacenters, and for once, not the datacenters operated by my employer.

In theory, we could make separate lists of tasks that we need to wait on, thus avoiding the need to scan the full task list, but in practice this would require a synchronized linked-list operation on every voluntary context switch, both in and out.

In theory, the task list could be sharded, so that it could be scanned incrementally, but in practice, this is a bit non-trivial. Though this particular use case doesn't care about new tasks, so it could live with something simpler than would be required for certain types of signal delivery.

In theory, we could place rcu_segcblist-like mid pointers into the task list, so that scans could restart from any mid pointer. Care is required because the mid pointers would likely need to be recycled as new tasks are added. Plus care is needed because it has been a good long time since I have looked at the code managing the tasks list, and I am probably woefully out of date on how it all works.

So, is there a better way?

							Thanx, Paul
On Fri, Oct 27, 2023 at 02:23:56PM -0700, Paul E. McKenney wrote: > On Fri, Oct 27, 2023 at 09:20:26PM +0200, Peter Zijlstra wrote: > > On Fri, Oct 27, 2023 at 04:40:48PM +0200, Frederic Weisbecker wrote: > > > > > + /* Has the task been seen voluntarily sleeping? */ > > > + if (!READ_ONCE(t->on_rq)) > > > + return false; > > > > > - if (t != current && READ_ONCE(t->on_rq) && !is_idle_task(t)) { > > > > AFAICT this ->on_rq usage is outside of scheduler locks and that > > READ_ONCE isn't going to help much. > > > > Obviously a pre-existing issue, and I suppose all it cares about is > > seeing a 0 or not, irrespective of the races, but urgh.. > > The trick is that RCU Tasks only needs to spot a task voluntarily blocked > once at any point in the grace period. The beginning and end of the > grace-period process have full barriers, so if this code sees t->on_rq > equal to zero, we know that the task was voluntarily blocked at some > point during the grace period, as required. > > In theory, we could acquire a scheduler lock, but in practice this would > cause CPU-latency problems at a certain set of large datacenters, and > for once, not the datacenters operated by my employer. > > In theory, we could make separate lists of tasks that we need to wait on, > thus avoiding the need to scan the full task list, but in practice this > would require a synchronized linked-list operation on every voluntary > context switch, both in and out. > > In theory, the task list could sharded, so that it could be scanned > incrementally, but in practice, this is a bit non-trivial. Though this > particular use case doesn't care about new tasks, so it could live with > something simpler than would be required for certain types of signal > delivery. > > In theory, we could place rcu_segcblist-like mid pointers into the > task list, so that scans could restart from any mid pointer. Care is > required because the mid pointers would likely need to be recycled as > new tasks are added. Plus care is needed because it has been a good > long time since I have looked at the code managing the tasks list, > and I am probably woefully out of date on how it all works. > > So, is there a better way? Nah, this is more or less what I feared. I just worry people will come around and put WRITE_ONCE() on the other end. I don't think that'll buy us much. Nor do I think the current READ_ONCE()s actually matter. But perhaps put a comment there, that we don't care for the races and only need to observe a 0 once or something.
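For illustration, a comment of the kind Peter is suggesting could look roughly like this above the new test (hypothetical wording, not something posted in this thread):

	/*
	 * Racy lockless read: nothing on the scheduler side pairs with this
	 * READ_ONCE(), and that is fine.  RCU Tasks only needs to observe a
	 * zero here once at any time during the grace period; the
	 * synchronize_rcu() calls at grace-period start and end provide the
	 * needed ordering.
	 */
	if (!READ_ONCE(t->on_rq))
		return false;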
On Sat, Oct 28, 2023 at 12:46:28AM +0200, Peter Zijlstra wrote:
> On Fri, Oct 27, 2023 at 02:23:56PM -0700, Paul E. McKenney wrote:
> > On Fri, Oct 27, 2023 at 09:20:26PM +0200, Peter Zijlstra wrote:
> > > On Fri, Oct 27, 2023 at 04:40:48PM +0200, Frederic Weisbecker wrote:
> > > > +	/* Has the task been seen voluntarily sleeping? */
> > > > +	if (!READ_ONCE(t->on_rq))
> > > > +		return false;
> > > >
> > > > -	if (t != current && READ_ONCE(t->on_rq) && !is_idle_task(t)) {
> > >
> > > AFAICT this ->on_rq usage is outside of scheduler locks and that READ_ONCE isn't going to help much.
> > >
> > > Obviously a pre-existing issue, and I suppose all it cares about is seeing a 0 or not, irrespective of the races, but urgh..
> >
> > The trick is that RCU Tasks only needs to spot a task voluntarily blocked once at any point in the grace period. The beginning and end of the grace-period process have full barriers, so if this code sees t->on_rq equal to zero, we know that the task was voluntarily blocked at some point during the grace period, as required.
> >
> > In theory, we could acquire a scheduler lock, but in practice this would cause CPU-latency problems at a certain set of large datacenters, and for once, not the datacenters operated by my employer.
> >
> > In theory, we could make separate lists of tasks that we need to wait on, thus avoiding the need to scan the full task list, but in practice this would require a synchronized linked-list operation on every voluntary context switch, both in and out.
> >
> > In theory, the task list could be sharded, so that it could be scanned incrementally, but in practice, this is a bit non-trivial. Though this particular use case doesn't care about new tasks, so it could live with something simpler than would be required for certain types of signal delivery.
> >
> > In theory, we could place rcu_segcblist-like mid pointers into the task list, so that scans could restart from any mid pointer. Care is required because the mid pointers would likely need to be recycled as new tasks are added. Plus care is needed because it has been a good long time since I have looked at the code managing the tasks list, and I am probably woefully out of date on how it all works.
> >
> > So, is there a better way?
>
> Nah, this is more or less what I feared. I just worry people will come around and put WRITE_ONCE() on the other end. I don't think that'll buy us much. Nor do I think the current READ_ONCE()s actually matter.

My friend, you trust compilers more than I ever will. ;-)

> But perhaps put a comment there, that we don't care for the races and only need to observe a 0 once or something.

There are these two passages in the big block comment preceding the RCU Tasks code:

	// rcu_tasks_pregp_step():
	//	Invokes synchronize_rcu() in order to wait for all in-flight
	//	t->on_rq and t->nvcsw transitions to complete.  This works because
	//	all such transitions are carried out with interrupts disabled.

and:

	// rcu_tasks_postgp():
	//	Invokes synchronize_rcu() in order to ensure that all prior
	//	t->on_rq and t->nvcsw transitions are seen by all CPUs and tasks
	//	to have happened before the end of this RCU Tasks grace period.
	//	Again, this works because all such transitions are carried out
	//	with interrupts disabled.

The rcu_tasks_pregp_step() function contains this comment:

	/*
	 * Wait for all pre-existing t->on_rq and t->nvcsw transitions
	 * to complete.  Invoking synchronize_rcu() suffices because all
	 * these transitions occur with interrupts disabled.  Without this
	 * synchronize_rcu(), a read-side critical section that started
	 * before the grace period might be incorrectly seen as having
	 * started after the grace period.
	 *
	 * This synchronize_rcu() also dispenses with the need for a
	 * memory barrier on the first store to t->rcu_tasks_holdout,
	 * as it forces the store to happen after the beginning of the
	 * grace period.
	 */

And the rcu_tasks_postgp() function contains this comment:

	/*
	 * Because ->on_rq and ->nvcsw are not guaranteed to have a full
	 * memory barriers prior to them in the schedule() path, memory
	 * reordering on other CPUs could cause their RCU-tasks read-side
	 * critical sections to extend past the end of the grace period.
	 * However, because these ->nvcsw updates are carried out with
	 * interrupts disabled, we can use synchronize_rcu() to force the
	 * needed ordering on all such CPUs.
	 *
	 * This synchronize_rcu() also confines all ->rcu_tasks_holdout
	 * accesses to be within the grace period, avoiding the need for
	 * memory barriers for ->rcu_tasks_holdout accesses.
	 *
	 * In addition, this synchronize_rcu() waits for exiting tasks
	 * to complete their final preempt_disable() region of execution,
	 * cleaning up after synchronize_srcu(&tasks_rcu_exit_srcu),
	 * enforcing the whole region before tasklist removal until
	 * the final schedule() with TASK_DEAD state to be an RCU TASKS
	 * read side critical section.
	 */

Does that suffice, or should we add more?

							Thanx, Paul
On Fri, Oct 27, 2023 at 04:41:30PM -0700, Paul E. McKenney wrote: > On Sat, Oct 28, 2023 at 12:46:28AM +0200, Peter Zijlstra wrote: > > Nah, this is more or less what I feared. I just worry people will come > > around and put WRITE_ONCE() on the other end. I don't think that'll buy > > us much. Nor do I think the current READ_ONCE()s actually matter. > > My friend, you trust compilers more than I ever will. ;-) Well, we only use the values {0,1,2}, that's contained in the first byte. Are we saying compiler will not only byte-split but also bit-split the loads? But again, lacking the WRITE_ONCE() counterpart, this READ_ONCE() isn't getting you anything, and if you really worried about it, shouldn't you have proposed a patch making it all WRITE_ONCE() back when you did this tasks-rcu stuff? > > But perhaps put a comment there, that we don't care for the races and > > only need to observe a 0 once or something. > > There are these two passagers in the big lock comment preceding the > RCU Tasks code: > // rcu_tasks_pregp_step(): > // Invokes synchronize_rcu() in order to wait for all in-flight > // t->on_rq and t->nvcsw transitions to complete. This works because > // all such transitions are carried out with interrupts disabled. > Does that suffice, or should we add more? Probably sufficient. If one were to have used the search option :-) Anyway, this brings me to nvcsw, exact same problem there, except possibly worse, because now we actually do care about the full word. No WRITE_ONCE() write side, so the READ_ONCE() don't help against store-tearing (however unlikely that actually is in this case). Also, I'm not entirely sure I see why you need on_rq and nvcsw. Would not nvcsw increasing be enough to know it passed through a quiescent state? Are you trying to say that if nvcsw hasn't advanced but on_rq is still 0, nothing has changed and you can proceed? Or rather, looking at the code it seems use the inverse, if on_rq, nvcsw must change. Makes sense I suppose, no point waiting for nvcsw to change if the task never did anything.
On Mon, Oct 30, 2023 at 09:21:38AM +0100, Peter Zijlstra wrote: > On Fri, Oct 27, 2023 at 04:41:30PM -0700, Paul E. McKenney wrote: > > On Sat, Oct 28, 2023 at 12:46:28AM +0200, Peter Zijlstra wrote: > > > > Nah, this is more or less what I feared. I just worry people will come > > > around and put WRITE_ONCE() on the other end. I don't think that'll buy > > > us much. Nor do I think the current READ_ONCE()s actually matter. > > > > My friend, you trust compilers more than I ever will. ;-) > > Well, we only use the values {0,1,2}, that's contained in the first > byte. Are we saying compiler will not only byte-split but also > bit-split the loads? > > But again, lacking the WRITE_ONCE() counterpart, this READ_ONCE() isn't > getting you anything, and if you really worried about it, shouldn't you > have proposed a patch making it all WRITE_ONCE() back when you did this > tasks-rcu stuff? There are not all that many of them. If such a WRITE_ONCE() patch would be welcome, I would be happy to put it together. > > > But perhaps put a comment there, that we don't care for the races and > > > only need to observe a 0 once or something. > > > > There are these two passagers in the big lock comment preceding the > > RCU Tasks code: > > > // rcu_tasks_pregp_step(): > > // Invokes synchronize_rcu() in order to wait for all in-flight > > // t->on_rq and t->nvcsw transitions to complete. This works because > > // all such transitions are carried out with interrupts disabled. > > > Does that suffice, or should we add more? > > Probably sufficient. If one were to have used the search option :-) > > Anyway, this brings me to nvcsw, exact same problem there, except > possibly worse, because now we actually do care about the full word. > > No WRITE_ONCE() write side, so the READ_ONCE() don't help against > store-tearing (however unlikely that actually is in this case). Again, if such a WRITE_ONCE() patch would be welcome, I would be happy to put it together. > Also, I'm not entirely sure I see why you need on_rq and nvcsw. Would > not nvcsw increasing be enough to know it passed through a quiescent > state? Are you trying to say that if nvcsw hasn't advanced but on_rq is > still 0, nothing has changed and you can proceed? > > Or rather, looking at the code it seems use the inverse, if on_rq, nvcsw > must change. > > Makes sense I suppose, no point waiting for nvcsw to change if the task > never did anything. Exactly, the on_rq check is needed to avoid excessively long grace periods for tasks that are blocked for long periods of time. Thanx, Paul
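The pairing Paul describes is what the holdout re-check ends up implementing. Condensed from check_holdout_task() as it appears in the hunk at the end of this page (the NO_HZ_FULL special case is omitted), a task drops out of the holdout list once any of these holds:

	if (!READ_ONCE(t->rcu_tasks_holdout) ||
	    t->rcu_tasks_nvcsw != READ_ONCE(t->nvcsw) ||  /* voluntarily context-switched */
	    !rcu_tasks_is_holdout(t)) {                   /* blocked, in the idle loop, or on an offline CPU */
		WRITE_ONCE(t->rcu_tasks_holdout, false);
		/* ...and the task is removed from the holdout list. */
	}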
On Mon, Oct 30, 2023 at 01:11:41PM -0700, Paul E. McKenney wrote: > On Mon, Oct 30, 2023 at 09:21:38AM +0100, Peter Zijlstra wrote: > > On Fri, Oct 27, 2023 at 04:41:30PM -0700, Paul E. McKenney wrote: > > > On Sat, Oct 28, 2023 at 12:46:28AM +0200, Peter Zijlstra wrote: > > > > > > Nah, this is more or less what I feared. I just worry people will come > > > > around and put WRITE_ONCE() on the other end. I don't think that'll buy > > > > us much. Nor do I think the current READ_ONCE()s actually matter. > > > > > > My friend, you trust compilers more than I ever will. ;-) > > > > Well, we only use the values {0,1,2}, that's contained in the first > > byte. Are we saying compiler will not only byte-split but also > > bit-split the loads? > > > > But again, lacking the WRITE_ONCE() counterpart, this READ_ONCE() isn't > > getting you anything, and if you really worried about it, shouldn't you > > have proposed a patch making it all WRITE_ONCE() back when you did this > > tasks-rcu stuff? > > There are not all that many of them. If such a WRITE_ONCE() patch would > be welcome, I would be happy to put it together. > > > > > But perhaps put a comment there, that we don't care for the races and > > > > only need to observe a 0 once or something. > > > > > > There are these two passagers in the big lock comment preceding the > > > RCU Tasks code: > > > > > // rcu_tasks_pregp_step(): > > > // Invokes synchronize_rcu() in order to wait for all in-flight > > > // t->on_rq and t->nvcsw transitions to complete. This works because > > > // all such transitions are carried out with interrupts disabled. > > > > > Does that suffice, or should we add more? > > > > Probably sufficient. If one were to have used the search option :-) > > > > Anyway, this brings me to nvcsw, exact same problem there, except > > possibly worse, because now we actually do care about the full word. > > > > No WRITE_ONCE() write side, so the READ_ONCE() don't help against > > store-tearing (however unlikely that actually is in this case). > > Again, if such a WRITE_ONCE() patch would be welcome, I would be happy > to put it together. Welcome is not the right word. What bugs me most is that this was never raised when this code was written :/ Mostly my problem is that GCC generates such utter shite when you mention volatile. See, the below patch changes the perfectly fine and non-broken: 0148 1d8: 49 83 06 01 addq $0x1,(%r14) into: 0148 1d8: 49 8b 06 mov (%r14),%rax 014b 1db: 48 83 c0 01 add $0x1,%rax 014f 1df: 49 89 06 mov %rax,(%r14) For absolutely no reason :-( At least clang doesn't do this, it stays: 0403 413: 49 ff 45 00 incq 0x0(%r13) irrespective of the volatile. --- diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 802551e0009b..d616211b9151 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6575,8 +6575,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) */ static void __sched notrace __schedule(unsigned int sched_mode) { struct task_struct *prev, *next; - unsigned long *switch_count; + volatile unsigned long *switch_count; unsigned long prev_state; struct rq_flags rf; struct rq *rq;
Hello, On Tue, 31 Oct 2023, Peter Zijlstra wrote: (I can't say anything about the WRITE_ONCE/rcu code, just about the below codegen part) > Welcome is not the right word. What bugs me most is that this was never > raised when this code was written :/ > > Mostly my problem is that GCC generates such utter shite when you > mention volatile. See, the below patch changes the perfectly fine and > non-broken: > > 0148 1d8: 49 83 06 01 addq $0x1,(%r14) What is non-broken here that is ... > into: > > 0148 1d8: 49 8b 06 mov (%r14),%rax > 014b 1db: 48 83 c0 01 add $0x1,%rax > 014f 1df: 49 89 06 mov %rax,(%r14) ... broken here? (Sure code size and additional register use, but I don't think you mean this with broken). > For absolutely no reason :-( The reason is simple (and should be obvious): to adhere to the abstract machine regarding volatile. When x is volatile then x++ consists of a read and a write, in this order. The easiest way to ensure this is to actually generate a read and a write instruction. Anything else is an optimization, and for each such optimization you need to actively find an argument why this optimization is correct to start with (and then if it's an optimization at all). In this case the argument needs to somehow involve arguing that an rmw instruction on x86 is in fact completely equivalent to the separate instructions, from read cycle to write cycle over all pipeline stages, on all implementations of x86. I.e. that a rmw instruction is spec'ed to be equivalent. You most probably can make that argument in this specific case, I'll give you that. But why bother to start with, in a piece of software that is already fairly complex (the compiler)? It's much easier to just not do much anything with volatile accesses at all and be guaranteed correct. Even more so as the software author, when using volatile, most likely is much more interested in correct code (even from a abstract machine perspective) than micro optimizations. > At least clang doesn't do this, it stays: > > 0403 413: 49 ff 45 00 incq 0x0(%r13) > > irrespective of the volatile. And, are you 100% sure that this is correct? Even for x86 CPU pipeline implementations that you aren't intimately knowing about? ;-) But all that seems to be a side-track anyway, what's your real worry with the code sequence generated by GCC? Ciao, Michael.
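A minimal example of the difference being debated, assuming a small standalone C file rather than the scheduler code itself: the volatile object forces a load and then a store in the abstract machine, and per the listings earlier in the thread GCC emits them literally while clang folds them back into a single x86 read-modify-write.

	unsigned long plain_count;
	volatile unsigned long volatile_count;

	void tick(void)
	{
		plain_count++;     /* compilers are free to emit: addq $1, plain_count(%rip) */
		volatile_count++;  /* abstract machine: load then store; per the thread, GCC
		                    * emits mov/add/mov while clang still emits a single incq */
	}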
On Tue, Oct 31, 2023 at 10:52:02AM +0100, Peter Zijlstra wrote: > On Mon, Oct 30, 2023 at 01:11:41PM -0700, Paul E. McKenney wrote: > > On Mon, Oct 30, 2023 at 09:21:38AM +0100, Peter Zijlstra wrote: > > > On Fri, Oct 27, 2023 at 04:41:30PM -0700, Paul E. McKenney wrote: > > > > On Sat, Oct 28, 2023 at 12:46:28AM +0200, Peter Zijlstra wrote: > > > > > > > > Nah, this is more or less what I feared. I just worry people will come > > > > > around and put WRITE_ONCE() on the other end. I don't think that'll buy > > > > > us much. Nor do I think the current READ_ONCE()s actually matter. > > > > > > > > My friend, you trust compilers more than I ever will. ;-) > > > > > > Well, we only use the values {0,1,2}, that's contained in the first > > > byte. Are we saying compiler will not only byte-split but also > > > bit-split the loads? > > > > > > But again, lacking the WRITE_ONCE() counterpart, this READ_ONCE() isn't > > > getting you anything, and if you really worried about it, shouldn't you > > > have proposed a patch making it all WRITE_ONCE() back when you did this > > > tasks-rcu stuff? > > > > There are not all that many of them. If such a WRITE_ONCE() patch would > > be welcome, I would be happy to put it together. > > > > > > > But perhaps put a comment there, that we don't care for the races and > > > > > only need to observe a 0 once or something. > > > > > > > > There are these two passagers in the big lock comment preceding the > > > > RCU Tasks code: > > > > > > > // rcu_tasks_pregp_step(): > > > > // Invokes synchronize_rcu() in order to wait for all in-flight > > > > // t->on_rq and t->nvcsw transitions to complete. This works because > > > > // all such transitions are carried out with interrupts disabled. > > > > > > > Does that suffice, or should we add more? > > > > > > Probably sufficient. If one were to have used the search option :-) > > > > > > Anyway, this brings me to nvcsw, exact same problem there, except > > > possibly worse, because now we actually do care about the full word. > > > > > > No WRITE_ONCE() write side, so the READ_ONCE() don't help against > > > store-tearing (however unlikely that actually is in this case). > > > > Again, if such a WRITE_ONCE() patch would be welcome, I would be happy > > to put it together. > > Welcome is not the right word. What bugs me most is that this was never > raised when this code was written :/ Me, I consider those READ_ONCE() calls to be documentation as well as defense against overly enthusiastic optimizers. "This access is racy." > Mostly my problem is that GCC generates such utter shite when you > mention volatile. See, the below patch changes the perfectly fine and > non-broken: > > 0148 1d8: 49 83 06 01 addq $0x1,(%r14) > > into: > > 0148 1d8: 49 8b 06 mov (%r14),%rax > 014b 1db: 48 83 c0 01 add $0x1,%rax > 014f 1df: 49 89 06 mov %rax,(%r14) > > For absolutely no reason :-( > > At least clang doesn't do this, it stays: > > 0403 413: 49 ff 45 00 incq 0x0(%r13) > > irrespective of the volatile. Sounds like a bug in GCC, perhaps depending on the microarchitecture in question. And it was in fact reported in the past, but closed as not-a-bug. Perhaps clang's fix for this will help GCC along. And yes, I do see that ++*switch_count in __schedule(). So, at least until GCC catches up to clang's code generation, I take it that you don't want WRITE_ONCE() for that ->nvcsw increment. Thoughts on ->on_rq? 
Thanx, Paul > --- > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 802551e0009b..d616211b9151 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -6575,8 +6575,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) > */ > static void __sched notrace __schedule(unsigned int sched_mode) > { > struct task_struct *prev, *next; > - unsigned long *switch_count; > + volatile unsigned long *switch_count; > unsigned long prev_state; > struct rq_flags rf; > struct rq *rq;
On Tue, Oct 31, 2023 at 02:16:34PM +0000, Michael Matz wrote: > Hello, > > On Tue, 31 Oct 2023, Peter Zijlstra wrote: > > (I can't say anything about the WRITE_ONCE/rcu code, just about the below > codegen part) > > > Welcome is not the right word. What bugs me most is that this was never > > raised when this code was written :/ > > > > Mostly my problem is that GCC generates such utter shite when you > > mention volatile. See, the below patch changes the perfectly fine and > > non-broken: > > > > 0148 1d8: 49 83 06 01 addq $0x1,(%r14) > > What is non-broken here that is ... > > > into: > > > > 0148 1d8: 49 8b 06 mov (%r14),%rax > > 014b 1db: 48 83 c0 01 add $0x1,%rax > > 014f 1df: 49 89 06 mov %rax,(%r14) > > ... broken here? (Sure code size and additional register use, but I don't > think you mean this with broken). > > > For absolutely no reason :-( > > The reason is simple (and should be obvious): to adhere to the abstract > machine regarding volatile. When x is volatile then x++ consists of a > read and a write, in this order. The easiest way to ensure this is to > actually generate a read and a write instruction. Anything else is an > optimization, and for each such optimization you need to actively find an > argument why this optimization is correct to start with (and then if it's > an optimization at all). In this case the argument needs to somehow > involve arguing that an rmw instruction on x86 is in fact completely > equivalent to the separate instructions, from read cycle to write cycle > over all pipeline stages, on all implementations of x86. I.e. that a rmw > instruction is spec'ed to be equivalent. > > You most probably can make that argument in this specific case, I'll give > you that. But why bother to start with, in a piece of software that is > already fairly complex (the compiler)? It's much easier to just not do > much anything with volatile accesses at all and be guaranteed correct. > Even more so as the software author, when using volatile, most likely is > much more interested in correct code (even from a abstract machine > perspective) than micro optimizations. I suspect that Peter cares because the code in question is on a fast path within the scheduler. > > At least clang doesn't do this, it stays: > > > > 0403 413: 49 ff 45 00 incq 0x0(%r13) > > > > irrespective of the volatile. > > And, are you 100% sure that this is correct? Even for x86 CPU > pipeline implementations that you aren't intimately knowing about? ;-) Me, I am not even sure that the load-increment-store is always faithfully executed by the hardware. ;-) > But all that seems to be a side-track anyway, what's your real worry with > the code sequence generated by GCC? Again, my guess is that Peter cares deeply about the speed of the fast path on which this increment occurs. Thanx, Paul
On Tue, Oct 31, 2023 at 02:16:34PM +0000, Michael Matz wrote: > > Mostly my problem is that GCC generates such utter shite when you > > mention volatile. See, the below patch changes the perfectly fine and > > non-broken: > > > > 0148 1d8: 49 83 06 01 addq $0x1,(%r14) > > What is non-broken here that is ... > > > into: > > > > 0148 1d8: 49 8b 06 mov (%r14),%rax > > 014b 1db: 48 83 c0 01 add $0x1,%rax > > 014f 1df: 49 89 06 mov %rax,(%r14) > > ... broken here? (Sure code size and additional register use, but I don't > think you mean this with broken). The point was that the code was perfectly fine without adding the volatile, and adding volatile makes it worse. > > For absolutely no reason :-( > > The reason is simple (and should be obvious): to adhere to the abstract > machine regarding volatile. When x is volatile then x++ consists of a > read and a write, in this order. The easiest way to ensure this is to > actually generate a read and a write instruction. Anything else is an > optimization, and for each such optimization you need to actively find an > argument why this optimization is correct to start with (and then if it's > an optimization at all). In this case the argument needs to somehow > involve arguing that an rmw instruction on x86 is in fact completely > equivalent to the separate instructions, from read cycle to write cycle > over all pipeline stages, on all implementations of x86. I.e. that a rmw > instruction is spec'ed to be equivalent. > > You most probably can make that argument in this specific case, I'll give > you that. But why bother to start with, in a piece of software that is > already fairly complex (the compiler)? It's much easier to just not do > much anything with volatile accesses at all and be guaranteed correct. > Even more so as the software author, when using volatile, most likely is > much more interested in correct code (even from a abstract machine > perspective) than micro optimizations. There's a pile of situations where a RmW instruction is actively different vs a load-store split, esp for volatile variables that are explicitly expected to change asynchronously. The original RmW instruction is IRQ-safe, while the load-store version is not. If an interrupt lands in between the load and store and also modifies the variable then the store after interrupt-return will over-write said modification. These are not equivalent. In this case that's not relevant because the increment happens to happen with IRQs disabled. But the point is that these forms are very much not equivalent. > > At least clang doesn't do this, it stays: > > > > 0403 413: 49 ff 45 00 incq 0x0(%r13) > > > > irrespective of the volatile. > > And, are you 100% sure that this is correct? Even for x86 CPU > pipeline implementations that you aren't intimately knowing about? ;-) It so happens that the x86 architecture does guarantee RmW ops are IRQ-safe or locally atomic. SMP/concurrent loads will observe either pre or post but no intermediate state as well. > But all that seems to be a side-track anyway, what's your real worry with > the code sequence generated by GCC? In this case it's sub-optimal code, both larger and possibly slower for having two memops. The reason to have volatile is because that's what Linux uses to dis-allow store-tearing, something that doesn't happen in this case. 
A suitably insane but conforming compiler could compile a non-volatile memory increment into something insane like:

	load byte-0, r1
	increment r1
	store r1, byte-0
	jno done
	load byte-1, r1
	increment r1
	store r1, byte-1
	jno done
	...
done:

We want to explicitly dis-allow this. I know C has recently (2011) grown this _Atomic thing, but that has other problems.
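Background for the store-tearing remark: Linux's READ_ONCE()/WRITE_ONCE() boil down to volatile accesses, which is why the volatile codegen discussion applies to them at all. A simplified sketch of the idea follows; the real definitions live in include/asm-generic/rwonce.h and handle additional details.

	/* Simplified sketch only; not the actual kernel definitions. */
	#define READ_ONCE_SKETCH(x)						\
		(*(const volatile typeof(x) *)&(x))

	#define WRITE_ONCE_SKETCH(x, val)					\
		do { *(volatile typeof(x) *)&(x) = (val); } while (0)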
On Tue, Oct 31, 2023 at 07:24:13AM -0700, Paul E. McKenney wrote: > So, at least until GCC catches up to clang's code generation, I take it > that you don't want WRITE_ONCE() for that ->nvcsw increment. Thoughts on > ->on_rq? I've not done the patch yet, but I suspect those would be fine, those are straight up stores, hard to get wrong (famous last words).
Hello, On Tue, 31 Oct 2023, Peter Zijlstra wrote: > > > For absolutely no reason :-( > > > > The reason is simple (and should be obvious): to adhere to the abstract > > machine regarding volatile. When x is volatile then x++ consists of a > > read and a write, in this order. The easiest way to ensure this is to > > actually generate a read and a write instruction. Anything else is an > > optimization, and for each such optimization you need to actively find an > > argument why this optimization is correct to start with (and then if it's > > an optimization at all). In this case the argument needs to somehow > > involve arguing that an rmw instruction on x86 is in fact completely > > equivalent to the separate instructions, from read cycle to write cycle > > over all pipeline stages, on all implementations of x86. I.e. that a rmw > > instruction is spec'ed to be equivalent. > > > > You most probably can make that argument in this specific case, I'll give > > you that. But why bother to start with, in a piece of software that is > > already fairly complex (the compiler)? It's much easier to just not do > > much anything with volatile accesses at all and be guaranteed correct. > > Even more so as the software author, when using volatile, most likely is > > much more interested in correct code (even from a abstract machine > > perspective) than micro optimizations. > > There's a pile of situations where a RmW instruction is actively > different vs a load-store split, esp for volatile variables that are > explicitly expected to change asynchronously. > > The original RmW instruction is IRQ-safe, while the load-store version > is not. If an interrupt lands in between the load and store and also > modifies the variable then the store after interrupt-return will > over-write said modification. > > These are not equivalent. Okay, then there you have it. Namely that LLVM has a bug (but see next paragraph). For volatile x, x++ _must_ expand to a separate read and write, because the abstract machine of C says so. If a RmW isn't equivalent to that, then it can't be used in this situation. If you _have_ to use a RmW for other reasons like interrupt safety, then a volatile variable is not the way to force this, as C simply doesn't have that concept and hence can't talk about it. (Of course it can't, as not all architectures could implement such, if it were required). (If an RmW merely gives you more guarantees than a split load-store then of course LLVM doesn't have a bug, but you said not-equivalent, so I'm assuming the worst, that RmW also has fewer (other) guarantees) > > > At least clang doesn't do this, it stays: > > > > > > 0403 413: 49 ff 45 00 incq 0x0(%r13) > > > > > > irrespective of the volatile. > > > > And, are you 100% sure that this is correct? Even for x86 CPU > > pipeline implementations that you aren't intimately knowing about? ;-) > > It so happens that the x86 architecture does guarantee RmW ops are > IRQ-safe or locally atomic. SMP/concurrent loads will observe either > pre or post but no intermediate state as well. So, are RMW ops a strict superset (vis the guarantees they give) of split load-store? If so we can at least say that using RMW is a valid optimization :) Still, an optmization only. > > But all that seems to be a side-track anyway, what's your real worry with > > the code sequence generated by GCC? > > In this case it's sub-optimal code, both larger and possibly slower for > having two memops. 
> > The reason to have volatile is because that's what Linux uses to > dis-allow store-tearing, something that doesn't happen in this case. A > suitably insane but conforming compiler could compile a non-volatile > memory increment into something insane like: > > load byte-0, r1 > increment r1 > store r1, byte-0 > jno done > load byte-1, r1 > increment ri > store r1, byte 1 > jno done > ... > done: > > We want to explicitly dis-allow this. Yeah, I see. Within C you don't have much choice than volatile for this :-/ Funny thing: on some architectures this is actually what is generated sometimes, even if it has multi-byte loads/stores. This came up recently on the gcc list and the byte-per-byte sequence was faster ;-) (it was rather: load-by-bytes, form whole value via shifts, increment, store-by-bytes) Insane codegen for insane micro-architectures! > I know C has recently (2011) grown this _Atomic thing, but that has > other problems. Yeah. So, hmm, I don't quite know what to say, you're between a rock and a hard place, I guess. You have to use volatile for its effects but then are unhappy about its effects :) If you can confirm the above about validity of the optimization, then at least there'd by a point for adding a peephole in GCC for this, even if current codegen isn't a bug, but I still wouldn't hold my breath. volatile is so ... ewww, it's best left alone. Ciao, Michael.
On Tue, Oct 31, 2023 at 03:55:20PM +0000, Michael Matz wrote: > > There's a pile of situations where a RmW instruction is actively > > different vs a load-store split, esp for volatile variables that are > > explicitly expected to change asynchronously. > > > > The original RmW instruction is IRQ-safe, while the load-store version > > is not. If an interrupt lands in between the load and store and also > > modifies the variable then the store after interrupt-return will > > over-write said modification. > > > > These are not equivalent. > > Okay, then there you have it. Namely that LLVM has a bug (but see next > paragraph). For volatile x, x++ _must_ expand to a separate read and > write, because the abstract machine of C says so. If a RmW isn't > equivalent to that, then it can't be used in this situation. If you > _have_ to use a RmW for other reasons like interrupt safety, then a > volatile variable is not the way to force this, as C simply doesn't have > that concept and hence can't talk about it. (Of course it can't, as not > all architectures could implement such, if it were required). Yeah, RISC archs typically lack the RmW ops. I can understand C not mandating their use. However, on architectures that do have them, using them makes a ton of sense. For us living in the real world, this C abstract machine is mostly a pain in the arse :-) > (If an RmW merely gives you more guarantees than a split load-store then > of course LLVM doesn't have a bug, but you said not-equivalent, so I'm > assuming the worst, that RmW also has fewer (other) guarantees) RmW is strict superset of load-store, and as such not equivalent :-) Specifically, using volatile degrades the guarantees -- which is counter intuitive. > So, are RMW ops a strict superset (vis the guarantees they give) of split > load-store? If so we can at least say that using RMW is a valid > optimization :) Still, an optmization only. This. > > > But all that seems to be a side-track anyway, what's your real worry with > > > the code sequence generated by GCC? > > > > In this case it's sub-optimal code, both larger and possibly slower for > > having two memops. > > > > The reason to have volatile is because that's what Linux uses to > > dis-allow store-tearing, something that doesn't happen in this case. A > > suitably insane but conforming compiler could compile a non-volatile > > memory increment into something insane like: > > > > load byte-0, r1 > > increment r1 > > store r1, byte-0 > > jno done > > load byte-1, r1 > > increment ri > > store r1, byte 1 > > jno done > > ... > > done: > > > > We want to explicitly dis-allow this. > > Yeah, I see. Within C you don't have much choice than volatile for this > :-/ Funny thing: on some architectures this is actually what is generated > sometimes, even if it has multi-byte loads/stores. This came up > recently on the gcc list and the byte-per-byte sequence was faster ;-) > (it was rather: load-by-bytes, form whole value via shifts, increment, > store-by-bytes) > Insane codegen for insane micro-architectures! *groan* > > I know C has recently (2011) grown this _Atomic thing, but that has > > other problems. > > Yeah. > > So, hmm, I don't quite know what to say, you're between a rock and a hard > place, I guess. You have to use volatile for its effects but then are > unhappy about its effects :) Notably, Linux uses a *ton* of volatile and there has historically been a lot of grumbling about the GCC stance of 'stupid' codegen the moment it sees volatile. 
It really would help us (the Linux community) if GCC were to be less offended by the whole volatile thing and would try to generate better code.

Paul has been on the C/C++ committee meetings and keeps telling me them folks hate volatile with a passion up to the point of proposing to remove it from the language or somesuch. But the reality is that Linux very heavily relies on it and _Atomic simply cannot replace it.

> If you can confirm the above about validity of the optimization, then at least there'd by a point for adding a peephole in GCC for this, even if current codegen isn't a bug, but I still wouldn't hold my breath. volatile is so ... ewww, it's best left alone.

Confirmed, and please, your SMP computer only works because of volatile, it *is* important.
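Peter does not spell out the "other problems" with _Atomic here; one commonly cited obstacle, offered purely as illustration, is that C11 atomics change the declared type of the object and route every access through the atomics API, so they cannot simply be retrofitted onto existing fields such as ->nvcsw or ->on_rq that the rest of the kernel accesses as plain integers.

	#include <stdatomic.h>

	/* Hypothetical: making the counter _Atomic changes its type... */
	_Atomic unsigned long nvcsw_atomic;

	void voluntary_switch(void)
	{
		/* ...and every reader and writer must then use the atomics API. */
		atomic_fetch_add_explicit(&nvcsw_atomic, 1, memory_order_relaxed);
	}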
Hey, On Tue, 31 Oct 2023, Peter Zijlstra wrote: > > equivalent to that, then it can't be used in this situation. If you > > _have_ to use a RmW for other reasons like interrupt safety, then a > > volatile variable is not the way to force this, as C simply doesn't have > > that concept and hence can't talk about it. (Of course it can't, as not > > all architectures could implement such, if it were required). > > Yeah, RISC archs typically lack the RmW ops. I can understand C not > mandating their use. However, on architectures that do have them, using > them makes a ton of sense. > > For us living in the real world, this C abstract machine is mostly a > pain in the arse :-) Believe me, without it you would live in a world where only languages like ECMA script or Rust existed, without any reliable spec at all ("it's obvious, the language should behave like this-and-that compiler from 2010 implements it! Or was it 2012?"). Even if it sometimes would make life easier without (also for compilers!), at least you _have_ an arse to feel pain in! :-) Ahem. > > So, hmm, I don't quite know what to say, you're between a rock and a hard > > place, I guess. You have to use volatile for its effects but then are > > unhappy about its effects :) > > Notably, Linux uses a *ton* of volatile and there has historically been > a lot of grumbling about the GCC stance of 'stupid' codegen the moment > it sees volatile. > > It really would help us (the Linux community) if GCC were to be less > offended by the whole volatile thing and would try to generate better > code. > > Paul has been on the C/C++ committee meetings and keeps telling me them > folks hate volatile with a passion up to the point of proposing to > remove it from the language or somesuch. But the reality is that Linux > very heavily relies on it and _Atomic simply cannot replace it. Oh yeah, I agree. People trying to get rid of volatile are misguided. Of course one can try to capture all the individual aspects of it, and make individual language constructs for them (_Atomic is one). It makes arguing about and precisely specifying the aspects much easier. But it also makes the feature-interoperability matrix explode and the language more complicated in areas that were already arcane to start with. But it's precisely _because_ of the large-scale feature set of volatile and the compilers wish to cater for the real world, that it's mostly left alone, as is, as written by the author. Sure, one can wish for better codegen related to volatile. But it's a slippery slope: "here I have volatile, because I don't want optimizations to happen." - "but here I want a little optimization to happen" - "but not these" - ... It's ... not so easy :) In this specific case I think we have now sufficiently argued that "load-modify-store --> rmw" on x86 even for volatile accesses is a correct transformation (and one that has sufficiently local effects that our heads don't explode while thinking about all consequences). You'd have to do that for each and every transformation where volatile stuff is involved, just so to not throw out the baby with the water. > > If you can confirm the above about validity of the optimization, then at > > least there'd by a point for adding a peephole in GCC for this, even if > > current codegen isn't a bug, but I still wouldn't hold my breath. > > volatile is so ... ewww, it's best left alone. > > Confirmed, and please, your SMP computer only works becuase of volatile, > it *is* important. Agreed. Ciao, Michael.
On Tue, Oct 31, 2023 at 04:49:56PM +0000, Michael Matz wrote: > Hey, > > On Tue, 31 Oct 2023, Peter Zijlstra wrote: > > > > equivalent to that, then it can't be used in this situation. If you > > > _have_ to use a RmW for other reasons like interrupt safety, then a > > > volatile variable is not the way to force this, as C simply doesn't have > > > that concept and hence can't talk about it. (Of course it can't, as not > > > all architectures could implement such, if it were required). > > > > Yeah, RISC archs typically lack the RmW ops. I can understand C not > > mandating their use. However, on architectures that do have them, using > > them makes a ton of sense. > > > > For us living in the real world, this C abstract machine is mostly a > > pain in the arse :-) > > Believe me, without it you would live in a world where only languages like > ECMA script or Rust existed, without any reliable spec at all ("it's > obvious, the language should behave like this-and-that compiler from 2010 > implements it! Or was it 2012?"). Even if it sometimes would make life > easier without (also for compilers!), at least you _have_ an arse to feel > pain in! :-) Ahem. You mean like Rust volatiles considering conflicting accesses to be data races? That certainly leads me to wonder how a Rust-language device driver is supposed to interoperate with Rust-language device firmware. They currently propose atomics and things like the barrier() asm to make that work, and their definition of atomic might just allow it. > > > So, hmm, I don't quite know what to say, you're between a rock and a hard > > > place, I guess. You have to use volatile for its effects but then are > > > unhappy about its effects :) > > > > Notably, Linux uses a *ton* of volatile and there has historically been > > a lot of grumbling about the GCC stance of 'stupid' codegen the moment > > it sees volatile. > > > > It really would help us (the Linux community) if GCC were to be less > > offended by the whole volatile thing and would try to generate better > > code. > > > > Paul has been on the C/C++ committee meetings and keeps telling me them > > folks hate volatile with a passion up to the point of proposing to > > remove it from the language or somesuch. But the reality is that Linux > > very heavily relies on it and _Atomic simply cannot replace it. > > Oh yeah, I agree. People trying to get rid of volatile are misguided. > Of course one can try to capture all the individual aspects of it, and > make individual language constructs for them (_Atomic is one). It makes > arguing about and precisely specifying the aspects much easier. But it > also makes the feature-interoperability matrix explode and the language > more complicated in areas that were already arcane to start with. Agreed, and I have personally witnessed some primal-scream therapy undertaken in response to attempts to better define volatile. > But it's precisely _because_ of the large-scale feature set of volatile > and the compilers wish to cater for the real world, that it's mostly left > alone, as is, as written by the author. Sure, one can wish for better > codegen related to volatile. But it's a slippery slope: "here I have > volatile, because I don't want optimizations to happen." - "but here I > want a little optimization to happen" - "but not these" - ... It's ... not > so easy :) And to your point, there really have been optimization bugs that have broken volatile. So I do very much appreciate your careful attention to this matter. 
> In this specific case I think we have now sufficiently argued that > "load-modify-store --> rmw" on x86 even for volatile accesses is a correct > transformation (and one that has sufficiently local effects that our heads > don't explode while thinking about all consequences). You'd have to do > that for each and every transformation where volatile stuff is involved, > just so to not throw out the baby with the water. Understood! > > > If you can confirm the above about validity of the optimization, then at > > > least there'd by a point for adding a peephole in GCC for this, even if > > > current codegen isn't a bug, but I still wouldn't hold my breath. > > > volatile is so ... ewww, it's best left alone. > > > > Confirmed, and please, your SMP computer only works becuase of volatile, > > it *is* important. > > Agreed. Good to hear!!! Thanx, Paul
On Tue, Oct 31, 2023 at 04:20:33PM +0100, Peter Zijlstra wrote:
> On Tue, Oct 31, 2023 at 07:24:13AM -0700, Paul E. McKenney wrote:
> > So, at least until GCC catches up to clang's code generation, I take it that you don't want WRITE_ONCE() for that ->nvcsw increment. Thoughts on ->on_rq?
>
> I've not done the patch yet, but I suspect those would be fine, those are straight up stores, hard to get wrong (famous last words).

Assuming that the reads are already either marked with READ_ONCE() or are under appropriate locks, my immediate thought would be something like the all-too-lightly tested patch shown below. The ASSERT_EXCLUSIVE_WRITER() causes KCSAN to complain if there is a concurrent store of any kind to the location.

Of course, please feel free to ignore. Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 81885748871d..aeace19ad7f5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2124,12 +2124,14 @@ void activate_task(struct rq *rq, struct task_struct *p, int flags)
 
 	enqueue_task(rq, p, flags);
 
-	p->on_rq = TASK_ON_RQ_QUEUED;
+	WRITE_ONCE(p->on_rq, TASK_ON_RQ_QUEUED);
+	ASSERT_EXCLUSIVE_WRITER(p->on_rq);
 }
 
 void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
 {
-	p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;
+	WRITE_ONCE(p->on_rq, (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING);
+	ASSERT_EXCLUSIVE_WRITER(p->on_rq);
 
 	dequeue_task(rq, p, flags);
 }
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index bf5f178fe723..a604f59aee0b 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -895,10 +895,36 @@ static void rcu_tasks_pregp_step(struct list_head *hop)
 	synchronize_rcu();
 }
 
+/* Check for quiescent states since the pregp's synchronize_rcu() */
+static bool rcu_tasks_is_holdout(struct task_struct *t)
+{
+	int cpu;
+
+	/* Has the task been seen voluntarily sleeping? */
+	if (!READ_ONCE(t->on_rq))
+		return false;
+
+	/*
+	 * Idle tasks (or idle injection) within the idle loop are RCU-tasks
+	 * quiescent states. But CPU boot code performed by the idle task
+	 * isn't a quiescent state.
+	 */
+	if (is_idle_task(t))
+		return false;
+
+	cpu = task_cpu(t);
+
+	/* Idle tasks on offline CPUs are RCU-tasks quiescent states. */
+	if (t == idle_task(cpu) && !rcu_cpu_online(cpu))
+		return false;
+
+	return true;
+}
+
 /* Per-task initial processing. */
 static void rcu_tasks_pertask(struct task_struct *t, struct list_head *hop)
 {
-	if (t != current && READ_ONCE(t->on_rq) && !is_idle_task(t)) {
+	if (t != current && rcu_tasks_is_holdout(t)) {
 		get_task_struct(t);
 		t->rcu_tasks_nvcsw = READ_ONCE(t->nvcsw);
 		WRITE_ONCE(t->rcu_tasks_holdout, true);
@@ -947,7 +973,7 @@ static void check_holdout_task(struct task_struct *t,
 
 	if (!READ_ONCE(t->rcu_tasks_holdout) ||
 	    t->rcu_tasks_nvcsw != READ_ONCE(t->nvcsw) ||
-	    !READ_ONCE(t->on_rq) ||
+	    !rcu_tasks_is_holdout(t) ||
 	    (IS_ENABLED(CONFIG_NO_HZ_FULL) && !is_idle_task(t) &&
 	     READ_ONCE(t->rcu_tasks_idle_cpu) >= 0)) {
 		WRITE_ONCE(t->rcu_tasks_holdout, false);