| Message ID | 20231127103235.28442-3-parri.andrea@gmail.com |
|---|---|
| State | New |
From: Andrea Parri <parri.andrea@gmail.com>
To: mathieu.desnoyers@efficios.com, paulmck@kernel.org, palmer@dabbelt.com, paul.walmsley@sifive.com, aou@eecs.berkeley.edu
Cc: mmaas@google.com, hboehm@google.com, striker@us.ibm.com, charlie@rivosinc.com, rehn@rivosinc.com, linux-riscv@lists.infradead.org, linux-kernel@vger.kernel.org, Andrea Parri <parri.andrea@gmail.com>
Subject: [PATCH 2/2] membarrier: riscv: Provide core serializing command
Date: Mon, 27 Nov 2023 11:32:35 +0100
Message-Id: <20231127103235.28442-3-parri.andrea@gmail.com>
In-Reply-To: <20231127103235.28442-1-parri.andrea@gmail.com>
References: <20231127103235.28442-1-parri.andrea@gmail.com>
Series: membarrier: riscv: Provide core serializing command
Commit Message
Andrea Parri
Nov. 27, 2023, 10:32 a.m. UTC
RISC-V uses xRET instructions on return from interrupt and to go back
to user-space; the xRET instruction is not core serializing.
Use FENCE.I for providing core serialization as follows:
- by calling sync_core_before_usermode() on return from interrupt (cf.
ipi_sync_core()),
- via switch_mm() and sync_core_before_usermode() (respectively, for
uthread->uthread and kthread->uthread transitions) to go back to
user-space.
On RISC-V, the serialization in switch_mm() is activated by resetting
the icache_stale_mask of the mm at prepare_sync_core_cmd().
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Suggested-by: Palmer Dabbelt <palmer@dabbelt.com>
---
.../membarrier-sync-core/arch-support.txt | 2 +-
arch/riscv/Kconfig | 3 ++
arch/riscv/include/asm/sync_core.h | 29 +++++++++++++++++++
3 files changed, 33 insertions(+), 1 deletion(-)
create mode 100644 arch/riscv/include/asm/sync_core.h
Comments
On 2023-11-27 05:32, Andrea Parri wrote:

> RISC-V uses xRET instructions on return from interrupt and to go back to user-space; the xRET instruction is not core serializing.
>
> Use FENCE.I for providing core serialization as follows:
>
> - by calling sync_core_before_usermode() on return from interrupt (cf. ipi_sync_core()),
>
> - via switch_mm() and sync_core_before_usermode() (respectively, for uthread->uthread and kthread->uthread transitions) to go back to user-space.
>
> On RISC-V, the serialization in switch_mm() is activated by resetting the icache_stale_mask of the mm at prepare_sync_core_cmd().
>
> Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
> Suggested-by: Palmer Dabbelt <palmer@dabbelt.com>
> ---

[...]

> +
> +#ifdef CONFIG_SMP
> +/*
> + * Ensure the next switch_mm() on every CPU issues a core serializing
> + * instruction for the given @mm.
> + */
> +static inline void prepare_sync_core_cmd(struct mm_struct *mm)
> +{
> +	cpumask_setall(&mm->context.icache_stale_mask);

I am concerned about the possibility that this change lacks two barriers in the following scenario:

On a transition from uthread -> uthread on [CPU 0], from a thread belonging to another mm to a thread belonging to the mm [!mm -> mm] for which a concurrent membarrier sync-core is done on [CPU 1]:

- [CPU 1] sets all bits in the mm icache_stale_mask [A]. There are no barriers associated with these stores.

- [CPU 0] store to rq->curr [B] (by the scheduler) vs [CPU 1] loads rq->curr [C] within membarrier to decide if the IPI should be skipped. Let's say CPU 1 observes cpu_rq(0)->curr->mm != mm, so it skips the IPI.

- This means membarrier relies on switch_mm() to issue the sync-core.

- [CPU 0] switch_mm() loads [D] the icache_stale_mask. If the bit is zero, switch_mm() may incorrectly skip the sync-core.

AFAIU, [C] can be reordered before [A] because there is no barrier between those operations within membarrier. I suspect it can cause the switch_mm() code to skip a needed sync-core.

AFAIU, [D] can be reordered before [B] because there is no documented barrier between those operations within the scheduler, which can also cause switch_mm() to skip a needed sync-core.

We possibly have a similar scenario for uthread->uthread when the scheduler switches between mm -> !mm.

One way to fix this would be to add the following barriers:

- A smp_mb() between [A] and [C], possibly just after cpumask_setall() in prepare_sync_core_cmd(), with comments detailing the ordering it guarantees,

- A smp_mb() between [B] and [D], possibly just before cpumask_test_cpu() in flush_icache_deferred(), with appropriate comments.

Am I missing something ?

Thanks,

Mathieu

> +}
> +#else
> +static inline void prepare_sync_core_cmd(struct mm_struct *mm)
> +{
> +}
> +#endif /* CONFIG_SMP */
> +
> +#endif /* _ASM_RISCV_SYNC_CORE_H */
> I am concerned about the possibility that this change lacks two barriers in the following scenario:
>
> On a transition from uthread -> uthread on [CPU 0], from a thread belonging to another mm to a thread belonging to the mm [!mm -> mm] for which a concurrent membarrier sync-core is done on [CPU 1]:
>
> - [CPU 1] sets all bits in the mm icache_stale_mask [A]. There are no barriers associated with these stores.
>
> - [CPU 0] store to rq->curr [B] (by the scheduler) vs [CPU 1] loads rq->curr [C] within membarrier to decide if the IPI should be skipped. Let's say CPU 1 observes cpu_rq(0)->curr->mm != mm, so it skips the IPI.
>
> - This means membarrier relies on switch_mm() to issue the sync-core.
>
> - [CPU 0] switch_mm() loads [D] the icache_stale_mask. If the bit is zero, switch_mm() may incorrectly skip the sync-core.
>
> AFAIU, [C] can be reordered before [A] because there is no barrier between those operations within membarrier. I suspect it can cause the switch_mm() code to skip a needed sync-core.
>
> AFAIU, [D] can be reordered before [B] because there is no documented barrier between those operations within the scheduler, which can also cause switch_mm() to skip a needed sync-core.
>
> We possibly have a similar scenario for uthread->uthread when the scheduler switches between mm -> !mm.
>
> One way to fix this would be to add the following barriers:
>
> - A smp_mb() between [A] and [C], possibly just after cpumask_setall() in prepare_sync_core_cmd(), with comments detailing the ordering it guarantees,
>
> - A smp_mb() between [B] and [D], possibly just before cpumask_test_cpu() in flush_icache_deferred(), with appropriate comments.
>
> Am I missing something ?

Thanks for the detailed analysis.

AFAIU, the following barrier (in membarrier_private_expedited())

	/*
	 * Matches memory barriers around rq->curr modification in
	 * scheduler.
	 */
	smp_mb();	/* system call entry is not a mb. */

can serve the purpose of ordering [A] before [C] (to be documented in v2).

But I agree that [B] and [D] are unordered/missing suitable synchronization. Worse, RISC-V currently has no full barrier after [B] and before returning to user-space. I'm thinking (inspired by the PowerPC implementation),

diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
index 217fd4de61342..f63222513076d 100644
--- a/arch/riscv/mm/context.c
+++ b/arch/riscv/mm/context.c
@@ -323,6 +323,23 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	if (unlikely(prev == next))
 		return;
 
+#if defined(CONFIG_MEMBARRIER) && defined(CONFIG_SMP)
+	/*
+	 * The membarrier system call requires a full memory barrier
+	 * after storing to rq->curr, before going back to user-space.
+	 *
+	 * Only need the full barrier when switching between processes:
+	 * barrier when switching from kernel to userspace is not
+	 * required here, given that it is implied by mmdrop(); barrier
+	 * when switching from userspace to kernel is not needed after
+	 * store to rq->curr.
+	 */
+	if (unlikely(atomic_read(&next->membarrier_state) &
+		     (MEMBARRIER_STATE_PRIVATE_EXPEDITED |
+		      MEMBARRIER_STATE_GLOBAL_EXPEDITED)) && prev)
+		smp_mb();
+#endif
+
 	/*
 	 * Mark the current MM context as inactive, and the next as
 	 * active. This is at least used by the icache flushing

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a708d225c28e8..a1c749fddd095 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6670,8 +6670,9 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 	 *
 	 * Here are the schemes providing that barrier on the
 	 * various architectures:
-	 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
-	 *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
+	 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC,
+	 *   RISC-V. switch_mm() relies on membarrier_arch_switch_mm()
+	 *   on PowerPC.
 	 * - finish_lock_switch() for weakly-ordered
 	 *   architectures where spin_unlock is a full barrier,
 	 * - switch_to() for arm64 (weakly-ordered, spin_unlock

The silver lining is that similar changes (probably as a separate/preliminary patch) also restore the desired order between [B] and [D] AFAIU; so with them, 2/2 would just need additions to document the above SYNC_CORE scenario.

Thoughts?

Andrea
On 2023-11-28 10:13, Andrea Parri wrote:

>> I am concerned about the possibility that this change lacks two barriers in the following scenario:
>>
>> On a transition from uthread -> uthread on [CPU 0], from a thread belonging to another mm to a thread belonging to the mm [!mm -> mm] for which a concurrent membarrier sync-core is done on [CPU 1]:
>>
>> - [CPU 1] sets all bits in the mm icache_stale_mask [A]. There are no barriers associated with these stores.
>>
>> - [CPU 0] store to rq->curr [B] (by the scheduler) vs [CPU 1] loads rq->curr [C] within membarrier to decide if the IPI should be skipped. Let's say CPU 1 observes cpu_rq(0)->curr->mm != mm, so it skips the IPI.
>>
>> - This means membarrier relies on switch_mm() to issue the sync-core.
>>
>> - [CPU 0] switch_mm() loads [D] the icache_stale_mask. If the bit is zero, switch_mm() may incorrectly skip the sync-core.
>>
>> AFAIU, [C] can be reordered before [A] because there is no barrier between those operations within membarrier. I suspect it can cause the switch_mm() code to skip a needed sync-core.
>>
>> AFAIU, [D] can be reordered before [B] because there is no documented barrier between those operations within the scheduler, which can also cause switch_mm() to skip a needed sync-core.
>>
>> We possibly have a similar scenario for uthread->uthread when the scheduler switches between mm -> !mm.
>>
>> One way to fix this would be to add the following barriers:
>>
>> - A smp_mb() between [A] and [C], possibly just after cpumask_setall() in prepare_sync_core_cmd(), with comments detailing the ordering it guarantees,
>>
>> - A smp_mb() between [B] and [D], possibly just before cpumask_test_cpu() in flush_icache_deferred(), with appropriate comments.
>>
>> Am I missing something ?
>
> Thanks for the detailed analysis.
>
> AFAIU, the following barrier (in membarrier_private_expedited())
>
> 	/*
> 	 * Matches memory barriers around rq->curr modification in
> 	 * scheduler.
> 	 */
> 	smp_mb();	/* system call entry is not a mb. */
>
> can serve the purpose of ordering [A] before [C] (to be documented in v2).

Agreed. Yes it should be documented.

> But I agree that [B] and [D] are unordered /missing suitable synchronization. Worse, RISC-V has currently no full barrier after [B] and before returning to user-space: I'm thinking (inspired by the PowerPC implementation),

If RISC-V currently supports the membarrier private cmd and lacks the appropriate smp_mb() in switch_mm(), then it's a bug. This initial patch should be a "Fix" and fast-tracked as such.

Indeed, looking at how ASID is used to implement switch_mm, it appears to not require a full smp_mb() at all as long as there are no ASID rollovers.

> diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
> index 217fd4de61342..f63222513076d 100644
> --- a/arch/riscv/mm/context.c
> +++ b/arch/riscv/mm/context.c
> @@ -323,6 +323,23 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>  	if (unlikely(prev == next))
>  		return;
>  
> +#if defined(CONFIG_MEMBARRIER) && defined(CONFIG_SMP)
> +	/*
> +	 * The membarrier system call requires a full memory barrier
> +	 * after storing to rq->curr, before going back to user-space.
> +	 *
> +	 * Only need the full barrier when switching between processes:
> +	 * barrier when switching from kernel to userspace is not
> +	 * required here, given that it is implied by mmdrop(); barrier
> +	 * when switching from userspace to kernel is not needed after
> +	 * store to rq->curr.
> +	 */
> +	if (unlikely(atomic_read(&next->membarrier_state) &
> +		     (MEMBARRIER_STATE_PRIVATE_EXPEDITED |
> +		      MEMBARRIER_STATE_GLOBAL_EXPEDITED)) && prev)
> +		smp_mb();
> +#endif

The approach looks good. Please implement it within a separate membarrier_arch_switch_mm() as done on powerpc.

> +
>  	/*
>  	 * Mark the current MM context as inactive, and the next as
>  	 * active. This is at least used by the icache flushing
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a708d225c28e8..a1c749fddd095 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6670,8 +6670,9 @@ static void __sched notrace __schedule(unsigned int sched_mode)
>  	 *
>  	 * Here are the schemes providing that barrier on the
>  	 * various architectures:
> -	 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
> -	 *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
> +	 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC,
> +	 *   RISC-V. switch_mm() relies on membarrier_arch_switch_mm()
> +	 *   on PowerPC.
>  	 * - finish_lock_switch() for weakly-ordered
>  	 *   architectures where spin_unlock is a full barrier,
>  	 * - switch_to() for arm64 (weakly-ordered, spin_unlock
>
> The silver lining is that similar changes (probably as a separate/preliminary patch) also restore the desired order between [B] and [D] AFAIU; so with them, 2/2 would just need additions to document the above SYNC_CORE scenario.

Exactly.

> Thoughts?

I think we should be OK with the changes you suggest.

Thanks!

Mathieu
> > diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
> > index 217fd4de61342..f63222513076d 100644
> > --- a/arch/riscv/mm/context.c
> > +++ b/arch/riscv/mm/context.c
> > @@ -323,6 +323,23 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> >  	if (unlikely(prev == next))
> >  		return;
> >  
> > +#if defined(CONFIG_MEMBARRIER) && defined(CONFIG_SMP)
> > +	/*
> > +	 * The membarrier system call requires a full memory barrier
> > +	 * after storing to rq->curr, before going back to user-space.
> > +	 *
> > +	 * Only need the full barrier when switching between processes:
> > +	 * barrier when switching from kernel to userspace is not
> > +	 * required here, given that it is implied by mmdrop(); barrier
> > +	 * when switching from userspace to kernel is not needed after
> > +	 * store to rq->curr.
> > +	 */
> > +	if (unlikely(atomic_read(&next->membarrier_state) &
> > +		     (MEMBARRIER_STATE_PRIVATE_EXPEDITED |
> > +		      MEMBARRIER_STATE_GLOBAL_EXPEDITED)) && prev)
> > +		smp_mb();
> > +#endif
>
> The approach looks good. Please implement it within a separate membarrier_arch_switch_mm() as done on powerpc.

Will do. Thanks.

As regards the Fixes: tag, I guess it boils down to what we want or we need to take for commit "riscv: Support membarrier private cmd". :-)

FWIW, a quick git-log search confirmed that MEMBARRIER has been around for quite some time in the RISC-V world (though I'm not familiar with any of its mainstream uses): commit 1464d00b27b2 says (at least) since 93917ad50972 ("RISC-V: Add support for restartable sequence"). I am currently inclined to pick the latter commit (and check it w/ Palmer), but other suggestions are welcome.

Andrea
On 2023-11-29 13:29, Andrea Parri wrote:

>>> diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
>>> index 217fd4de61342..f63222513076d 100644
>>> --- a/arch/riscv/mm/context.c
>>> +++ b/arch/riscv/mm/context.c
>>> @@ -323,6 +323,23 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>>>  	if (unlikely(prev == next))
>>>  		return;
>>>  
>>> +#if defined(CONFIG_MEMBARRIER) && defined(CONFIG_SMP)
>>> +	/*
>>> +	 * The membarrier system call requires a full memory barrier
>>> +	 * after storing to rq->curr, before going back to user-space.
>>> +	 *
>>> +	 * Only need the full barrier when switching between processes:
>>> +	 * barrier when switching from kernel to userspace is not
>>> +	 * required here, given that it is implied by mmdrop(); barrier
>>> +	 * when switching from userspace to kernel is not needed after
>>> +	 * store to rq->curr.
>>> +	 */
>>> +	if (unlikely(atomic_read(&next->membarrier_state) &
>>> +		     (MEMBARRIER_STATE_PRIVATE_EXPEDITED |
>>> +		      MEMBARRIER_STATE_GLOBAL_EXPEDITED)) && prev)
>>> +		smp_mb();
>>> +#endif
>>
>> The approach looks good. Please implement it within a separate membarrier_arch_switch_mm() as done on powerpc.
>
> Will do. Thanks.
>
> As regards the Fixes: tag, I guess it boils down to what we want or we need to take for commit "riscv: Support membarrier private cmd". :-)

I'm not seeing this commit in the Linux master branch, am I missing something ?

> FWIW, a quick git-log search confirmed that MEMBARRIER has been around for quite some time in the RISC-V world (though I'm not familiar with any of its mainstream uses): commit 1464d00b27b2 says (at least) since 93917ad50972 ("RISC-V: Add support for restartable sequence"). I am currently inclined to pick the latter commit (and check it w/ Palmer), but other suggestions are welcome.

Supporting membarrier private expedited is not optional since Linux 4.14; see kernel/sched/core.c:

	membarrier_switch_mm(rq, prev->active_mm, next->mm);
	/*
	 * sys_membarrier() requires an smp_mb() between setting
	 * rq->curr / membarrier_switch_mm() and returning to userspace.
	 *
	 * The below provides this either through switch_mm(), or in
	 * case 'prev->active_mm == next->mm' through
	 * finish_task_switch()'s mmdrop().
	 */
	switch_mm_irqs_off(prev->active_mm, next->mm, next);

Failure to provide the required barrier is a bug in the architecture's switch_mm implementation when CONFIG_MEMBARRIER=y.

We should probably introduce a new Documentation/features/sched/membarrier/arch-support.txt to clarify this requirement.

Userspace code such as liburcu [1] heavily relies on membarrier private expedited (when available) to speed up RCU read-side critical sections. Various DNS servers, including BIND 9, use liburcu.

Thanks,

Mathieu

[1] https://liburcu.org
> > As regards the Fixes: tag, I guess it boils down to what we want or we need to take for commit "riscv: Support membarrier private cmd". :-)
>
> I'm not seeing this commit in the Linux master branch, am I missing something ?

I don't think you're missing something: I was wondering "what/where is this commit"? Sorry for the confusion.

> > FWIW, a quick git-log search confirmed that MEMBARRIER has been around for quite some time in the RISC-V world (though I'm not familiar with any of its mainstream uses): commit 1464d00b27b2 says (at least) since 93917ad50972 ("RISC-V: Add support for restartable sequence"). I am currently inclined to pick the latter commit (and check it w/ Palmer), but other suggestions are welcome.
>
> Supporting membarrier private expedited is not optional since Linux 4.14; see kernel/sched/core.c:
>
> 	membarrier_switch_mm(rq, prev->active_mm, next->mm);
> 	/*
> 	 * sys_membarrier() requires an smp_mb() between setting
> 	 * rq->curr / membarrier_switch_mm() and returning to userspace.
> 	 *
> 	 * The below provides this either through switch_mm(), or in
> 	 * case 'prev->active_mm == next->mm' through
> 	 * finish_task_switch()'s mmdrop().
> 	 */
> 	switch_mm_irqs_off(prev->active_mm, next->mm, next);
>
> Failure to provide the required barrier is a bug in the architecture's switch_mm implementation when CONFIG_MEMBARRIER=y.
>
> We should probably introduce a new Documentation/features/sched/membarrier/arch-support.txt to clarify this requirement.
>
> Userspace code such as liburcu [1] heavily relies on membarrier private expedited (when available) to speed up RCU read-side critical sections. Various DNS servers, including BIND 9, use liburcu.

Thanks for the information.

So I should probably stick to 93917ad50972, which apparently selected CONFIG_MEMBARRIER on RISC-V, for the Fixes: tag in question.

I'll look into adding the membarrier feature you mention (as a final/follow-up patch), unless you or someone else want to take care of it.

Andrea
On 2023-11-29 16:25, Andrea Parri wrote:

>>> As regards the Fixes: tag, I guess it boils down to what we want or we need to take for commit "riscv: Support membarrier private cmd". :-)
>>
>> I'm not seeing this commit in the Linux master branch, am I missing something ?
>
> I don't think you're missing something: I was wondering "what/where is this commit"? Sorry for the confusion.
>
>>> FWIW, a quick git-log search confirmed that MEMBARRIER has been around for quite some time in the RISC-V world (though I'm not familiar with any of its mainstream uses): commit 1464d00b27b2 says (at least) since 93917ad50972 ("RISC-V: Add support for restartable sequence"). I am currently inclined to pick the latter commit (and check it w/ Palmer), but other suggestions are welcome.
>>
>> Supporting membarrier private expedited is not optional since Linux 4.14; see kernel/sched/core.c:
>>
>> 	membarrier_switch_mm(rq, prev->active_mm, next->mm);
>> 	/*
>> 	 * sys_membarrier() requires an smp_mb() between setting
>> 	 * rq->curr / membarrier_switch_mm() and returning to userspace.
>> 	 *
>> 	 * The below provides this either through switch_mm(), or in
>> 	 * case 'prev->active_mm == next->mm' through
>> 	 * finish_task_switch()'s mmdrop().
>> 	 */
>> 	switch_mm_irqs_off(prev->active_mm, next->mm, next);
>>
>> Failure to provide the required barrier is a bug in the architecture's switch_mm implementation when CONFIG_MEMBARRIER=y.
>>
>> We should probably introduce a new Documentation/features/sched/membarrier/arch-support.txt to clarify this requirement.
>>
>> Userspace code such as liburcu [1] heavily relies on membarrier private expedited (when available) to speed up RCU read-side critical sections. Various DNS servers, including BIND 9, use liburcu.
>
> Thanks for the information.
>
> So I should probably stick to 93917ad50972, which apparently selected CONFIG_MEMBARRIER on RISC-V, for the Fixes: tag in question.

I think it goes further than that, because you can explicitly CONFIG_MEMBARRIER=y, see init/Kconfig:

	config MEMBARRIER
		bool "Enable membarrier() system call" if EXPERT
		default y
		help
		  Enable the membarrier() system call that allows issuing memory
		  barriers across all running threads, which can be used to distribute
		  the cost of user-space memory barriers asymmetrically by transforming
		  pairs of memory barriers into pairs consisting of membarrier() and a
		  compiler barrier.

		  If unsure, say Y.

Before 1464d00b27b2, riscv just happened to set it to =n in the defconfig.

I suspect the initial port of riscv merged after v4.14 was already broken.

> I'll look into adding the membarrier feature you mention (as a final/follow-up patch), unless you or someone else want to take care of it.

I'll be happy to review it :)

Thanks,

Mathieu
> > So I should probably stick to 93917ad50972, which apparently selected CONFIG_MEMBARRIER on RISC-V, for the Fixes: tag in question.
>
> I think it goes further than that, because you can explicitly CONFIG_MEMBARRIER=y, see init/Kconfig:
>
> 	config MEMBARRIER
> 		bool "Enable membarrier() system call" if EXPERT
> 		default y
> 		help
> 		  Enable the membarrier() system call that allows issuing memory
> 		  barriers across all running threads, which can be used to distribute
> 		  the cost of user-space memory barriers asymmetrically by transforming
> 		  pairs of memory barriers into pairs consisting of membarrier() and a
> 		  compiler barrier.
>
> 		  If unsure, say Y.
>
> Before 1464d00b27b2, riscv just happened to set it to =n in the defconfig.
>
> I suspect the initial port of riscv merged after v4.14 was already broken.

I see. Oh well, guess I'll have to leave this up to the maintainers then (I believe I've never managed to build riscv that far), Palmer?

> > I'll look into adding the membarrier feature you mention (as a final/follow-up patch), unless you or someone else want to take care of it.
>
> I'll be happy to review it :)

Sweet! :-)

Andrea
On Wed, 29 Nov 2023 14:43:34 PST (-0800), parri.andrea@gmail.com wrote:

>>> So I should probably stick to 93917ad50972, which apparently selected CONFIG_MEMBARRIER on RISC-V, for the Fixes: tag in question.
>>
>> I think it goes further than that, because you can explicitly CONFIG_MEMBARRIER=y, see init/Kconfig:
>>
>> 	config MEMBARRIER
>> 		bool "Enable membarrier() system call" if EXPERT
>> 		default y
>> 		help
>> 		  Enable the membarrier() system call that allows issuing memory
>> 		  barriers across all running threads, which can be used to distribute
>> 		  the cost of user-space memory barriers asymmetrically by transforming
>> 		  pairs of memory barriers into pairs consisting of membarrier() and a
>> 		  compiler barrier.
>>
>> 		  If unsure, say Y.
>>
>> Before 1464d00b27b2, riscv just happened to set it to =n in the defconfig.
>>
>> I suspect the initial port of riscv merged after v4.14 was already broken.
>
> I see. Oh well, guess I'll have to leave this up to the maintainers then (I believe I've never managed to build riscv that far), Palmer?

I see

	$ git grep "config MEMBARRIER" fab957c11efe2f405e08b9f0d080524bc2631428
	fab957c11efe2f405e08b9f0d080524bc2631428:init/Kconfig:config MEMBARRIER

so IMO this is just one of those forever bugs. So I'd lean towards

	Fixes: fab957c11efe ("RISC-V: Atomic and Locking Code")

(or anything in that original patch set). It's not that big of a backport, so I think it's safe enough?

>>> I'll look into adding the membarrier feature you mention (as a final/follow-up patch), unless you or someone else want to take care of it.
>>
>> I'll be happy to review it :)
>
> Sweet! :-)
>
> Andrea
> I see
>
> $ git grep "config MEMBARRIER" fab957c11efe2f405e08b9f0d080524bc2631428
> fab957c11efe2f405e08b9f0d080524bc2631428:init/Kconfig:config MEMBARRIER
>
> so IMO this is just one of those forever bugs.  So I'd lean towards
>
> Fixes: fab957c11efe ("RISC-V: Atomic and Locking Code")

Works for me, will apply in v2.

> (or anything in that original patch set).  It's not that big of a backport,
> so I think it's safe enough?

Indeed, I think so.

The final version of this fix will likely depend on some machinery/code
introduced by 3ccfebedd8cf54 ("powerpc, membarrier: Skip memory barrier
in switch_mm()"); but, yes, nothing we can't safely adjust I think.

Thanks,
  Andrea
On Wed, 06 Dec 2023 06:11:07 PST (-0800), parri.andrea@gmail.com wrote:
>> I see
>>
>> $ git grep "config MEMBARRIER" fab957c11efe2f405e08b9f0d080524bc2631428
>> fab957c11efe2f405e08b9f0d080524bc2631428:init/Kconfig:config MEMBARRIER
>>
>> so IMO this is just one of those forever bugs.  So I'd lean towards
>>
>> Fixes: fab957c11efe ("RISC-V: Atomic and Locking Code")
>
> Works for me, will apply in v2.
>
>> (or anything in that original patch set).  It's not that big of a backport,
>> so I think it's safe enough?
>
> Indeed, I think so.
>
> The final version of this fix will likely depend on some machinery/code
> introduced by 3ccfebedd8cf54 ("powerpc, membarrier: Skip memory barrier
> in switch_mm()"); but, yes, nothing we can't safely adjust I think.

Ya, I guess we'll have to look to know for sure but hopefully it's
manageable.

> Thanks,
> Andrea
> > The final version of this fix will likely depend on some machinery/code
> > introduced by 3ccfebedd8cf54 ("powerpc, membarrier: Skip memory barrier
> > in switch_mm()"); but, yes, nothing we can't safely adjust I think.
>
> Ya, I guess we'll have to look to know for sure but hopefully it's
> manageable.

Absolutely.  One approach would be to follow what PowerPC did: AFAIU, before
3ccfebedd8cf54, membarrier/powerpc used to hard code the required barrier in
finish_task_switch(), "masking" it as an smp_mb__after_unlock_lock(); riscv
could use a similar approach (though with a different/new mask function).

Alternatively, we could maybe keep the barrier in switch_mm().

But let me complete and send out v2 with the fix at stake...  this should
give us a more concrete basis to discuss these matters.

  Andrea
On Wed, 06 Dec 2023 09:42:44 PST (-0800), parri.andrea@gmail.com wrote:
>> > The final version of this fix will likely depend on some machinery/code
>> > introduced by 3ccfebedd8cf54 ("powerpc, membarrier: Skip memory barrier
>> > in switch_mm()"); but, yes, nothing we can't safely adjust I think.
>>
>> Ya, I guess we'll have to look to know for sure but hopefully it's
>> manageable.
>
> Absolutely.  One approach would be to follow what PowerPC did: AFAIU, before
> 3ccfebedd8cf54, membarrier/powerpc used to hard code the required barrier in
> finish_task_switch(), "masking" it as an smp_mb__after_unlock_lock(); riscv
> could use a similar approach (though with a different/new mask function).

IIRC we patterned our MMIOWB/after-spinlock barriers after PPC, so that's
probably a good place to start here as well.

> Alternatively, we could maybe keep the barrier in switch_mm().
>
> But let me complete and send out v2 with the fix at stake...  this should
> give us a more concrete basis to discuss these matters.
>
> Andrea
diff --git a/Documentation/features/sched/membarrier-sync-core/arch-support.txt b/Documentation/features/sched/membarrier-sync-core/arch-support.txt
index d96b778b87ed8..f6f0bd871a578 100644
--- a/Documentation/features/sched/membarrier-sync-core/arch-support.txt
+++ b/Documentation/features/sched/membarrier-sync-core/arch-support.txt
@@ -43,7 +43,7 @@
     |    openrisc: | TODO |
     |      parisc: | TODO |
     |     powerpc: |  ok  |
-    |       riscv: | TODO |
+    |       riscv: |  ok  |
     |        s390: |  ok  |
     |          sh: | TODO |
     |       sparc: | TODO |
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 95a2a06acc6a6..3e2734a3f2957 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -27,14 +27,17 @@ config RISCV
 	select ARCH_HAS_GCOV_PROFILE_ALL
 	select ARCH_HAS_GIGANTIC_PAGE
 	select ARCH_HAS_KCOV
+	select ARCH_HAS_MEMBARRIER_SYNC_CORE
 	select ARCH_HAS_MMIOWB
 	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	select ARCH_HAS_PMEM_API
+	select ARCH_HAS_PREPARE_SYNC_CORE_CMD
 	select ARCH_HAS_PTE_SPECIAL
 	select ARCH_HAS_SET_DIRECT_MAP if MMU
 	select ARCH_HAS_SET_MEMORY if MMU
 	select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL
 	select ARCH_HAS_STRICT_MODULE_RWX if MMU && !XIP_KERNEL
+	select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
 	select ARCH_HAS_SYSCALL_WRAPPER
 	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
 	select ARCH_HAS_UBSAN_SANITIZE_ALL
diff --git a/arch/riscv/include/asm/sync_core.h b/arch/riscv/include/asm/sync_core.h
new file mode 100644
index 0000000000000..9153016da8f14
--- /dev/null
+++ b/arch/riscv/include/asm/sync_core.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_RISCV_SYNC_CORE_H
+#define _ASM_RISCV_SYNC_CORE_H
+
+/*
+ * RISC-V implements return to user-space through an xRET instruction,
+ * which is not core serializing.
+ */
+static inline void sync_core_before_usermode(void)
+{
+	asm volatile ("fence.i" ::: "memory");
+}
+
+#ifdef CONFIG_SMP
+/*
+ * Ensure the next switch_mm() on every CPU issues a core serializing
+ * instruction for the given @mm.
+ */
+static inline void prepare_sync_core_cmd(struct mm_struct *mm)
+{
+	cpumask_setall(&mm->context.icache_stale_mask);
+}
+#else
+static inline void prepare_sync_core_cmd(struct mm_struct *mm)
+{
+}
+#endif /* CONFIG_SMP */
+
+#endif /* _ASM_RISCV_SYNC_CORE_H */