Message ID | 20221111075902.798571-1-guoren@kernel.org |
---|---|
State | New |
Headers |
From: guoren@kernel.org
To: anup@brainfault.org, paul.walmsley@sifive.com, palmer@dabbelt.com, conor.dooley@microchip.com, heiko@sntech.de, philipp.tomsich@vrull.eu, alex@ghiti.fr, hch@lst.de, ajones@ventanamicro.com, gary@garyguo.net, jszhang@kernel.org
Cc: linux-riscv@lists.infradead.org, linux-kernel@vger.kernel.org, Guo Ren <guoren@linux.alibaba.com>, Guo Ren <guoren@kernel.org>, Anup Patel <apatel@ventanamicro.com>, Palmer Dabbelt <palmer@rivosinc.com>
Subject: [PATCH V3] riscv: asid: Fixup stale TLB entry cause application crash
Date: Fri, 11 Nov 2022 02:59:02 -0500
Message-Id: <20221111075902.798571-1-guoren@kernel.org>
X-Mailer: git-send-email 2.36.1
List-ID: <linux-kernel.vger.kernel.org> |
Series |
[V3] riscv: asid: Fixup stale TLB entry cause application crash
|
|
Commit Message
Guo Ren
Nov. 11, 2022, 7:59 a.m. UTC
From: Guo Ren <guoren@linux.alibaba.com>

After use_asid_allocator is enabled, userspace applications can crash
because of stale TLB entries: calling cpumask_clear_cpu() without
local_flush_tlb_all() cannot guarantee that the CPU's TLB entries are
fresh. With set_mm_asid() the application may then read a stale value
through a stale TLB entry; set_mm_noasid() is not affected.

Here is the symptom of the bug:
unhandled signal 11 code 0x1 (coredump)
   0x0000003fd6d22524 <+4>:     auipc   s0,0x70
   0x0000003fd6d22528 <+8>:     ld      s0,-148(s0) # 0x3fd6d92490
=> 0x0000003fd6d2252c <+12>:    ld      a5,0(s0)
(gdb) i r s0
s0          0x8082ed1cc3198b21       0x8082ed1cc3198b21
(gdb) x /2x 0x3fd6d92490
0x3fd6d92490:   0xd80ac8a8      0x0000003f

The core dump shows that register s0 holds a wrong value although the
value in memory is correct: 'ld s0, -148(s0)' used a stale mapping entry
in the TLB and loaded from an incorrect physical address.

When the task ran on CPU0, it loaded (or speculatively loaded) the value
at address 0x3fd6d92490, and the first version of the mapping entry was
filled into CPU0's TLB by a page-table walk.
When the task switched from CPU0 to CPU1 (with ASID, no
local_flush_tlb_all() happens here), it happened to write a value to the
same address (0x3fd6d92490). That caused do_page_fault -> wp_page_copy ->
ptep_clear_flush -> ptep_get_and_clear & flush_tlb_page.
flush_tlb_page used mm_cpumask(mm) to determine which CPUs need a TLB
flush, but CPU0 had already cleared its bit in mm_cpumask during the
previous switch_mm. So only CPU1's TLB was flushed before the second
version of the PTE mapping was set. When the task switched from CPU1 back
to CPU0, CPU0 still used the stale TLB mapping entry, which contained a
wrong target physical address. The bug showed up when the task happened
to read that value.

          CPU0                               CPU1
- switch 'task' in
- read addr (Fill stale mapping
  entry into TLB)
- switch 'task' out (no tlb_flush)
                                     - switch 'task' in (no tlb_flush)
                                     - write addr cause pagefault
                                       do_page_fault() (change to
                                       new addr mapping)
                                         wp_page_copy()
                                           ptep_clear_flush()
                                             ptep_get_and_clear()
                                             & flush_tlb_page()
                                       write new value into addr
                                     - switch 'task' out (no tlb_flush)
- switch 'task' in (no tlb_flush)
- read addr again (Use stale
  mapping entry in TLB)
  get wrong value from old physical
  addr, BUG!

The solution is to keep all CPUs' footmarks in mm_cpumask(mm) across
switch_mm, which guarantees that all stale TLB entries are invalidated
during a TLB flush.

Fixes: 65d4b9c53017 ("RISC-V: Implement ASID allocator")
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Signed-off-by: Guo Ren <guoren@kernel.org>
Cc: Anup Patel <apatel@ventanamicro.com>
Cc: Palmer Dabbelt <palmer@rivosinc.com>
---
Changes in v3:
 - Move set/clear cpumask(mm) into set_mm (Make code more pretty
   with Andrew's advice)
 - Optimize comment description

Changes in v2:
 - Fixup nommu compile problem (Thx Conor, Also Reported-by: kernel
   test robot <lkp@intel.com>)
 - Keep cpumask_clear_cpu for noasid
---
 arch/riscv/mm/context.c | 30 ++++++++++++++++++++----------
 1 file changed, 20 insertions(+), 10 deletions(-)
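The CPU0/CPU1 interleaving described in the commit message can be replayed in a small user-space model. This is only a sketch under simplifying assumptions, not kernel code: there is a single virtual address, each CPU has a one-entry TLB, and a plain bitmask stands in for mm_cpumask(mm). All names (toy_mm, toy_cpu, replay and the helpers) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 2

/* Hypothetical stand-ins for mm_struct and a per-hart TLB. */
struct toy_mm {
	uint32_t cpumask;	/* bit n set: TLB flushes are sent to CPU n */
	int pte_page;		/* physical page the address currently maps to */
};

struct toy_cpu {
	bool tlb_valid;		/* one cached translation per CPU */
	int tlb_page;
};

static int phys_mem[2];		/* two physical pages, one word each */
static struct toy_cpu cpus[NR_CPUS];

static void switch_in(struct toy_mm *mm, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	mm->cpumask |= 1u << cpu;	/* no TLB flush: entries are ASID-tagged */
}

static void switch_out(struct toy_mm *mm, int cpu, bool keep_footmark)
{
	if (!keep_footmark)
		mm->cpumask &= ~(1u << cpu);	/* models the pre-patch clear */
}

/* Use the cached translation if present, otherwise "walk" the page table. */
static int translate(struct toy_mm *mm, int cpu)
{
	if (!cpus[cpu].tlb_valid) {
		cpus[cpu].tlb_page = mm->pte_page;
		cpus[cpu].tlb_valid = true;
	}
	return cpus[cpu].tlb_page;
}

/* Copy-on-write fault: remap, then flush only the CPUs left in the mask. */
static void cow_write(struct toy_mm *mm, int cpu, int value)
{
	int new_page = 1 - mm->pte_page;

	phys_mem[new_page] = phys_mem[mm->pte_page];
	mm->pte_page = new_page;
	for (int i = 0; i < NR_CPUS; i++)
		if (mm->cpumask & (1u << i))
			cpus[i].tlb_valid = false;
	phys_mem[translate(mm, cpu)] = value;
}

/* Replays the trace and returns what CPU0 reads at the end. */
int replay(bool keep_footmark)
{
	struct toy_mm mm = { .cpumask = 0, .pte_page = 0 };

	phys_mem[0] = 1;	/* old value */
	phys_mem[1] = 0;
	for (int i = 0; i < NR_CPUS; i++)
		cpus[i].tlb_valid = false;

	switch_in(&mm, 0);
	(void)phys_mem[translate(&mm, 0)];	/* CPU0 reads: TLB gets filled */
	switch_out(&mm, 0, keep_footmark);

	switch_in(&mm, 1);
	cow_write(&mm, 1, 2);			/* CPU1 writes the new value */
	switch_out(&mm, 1, keep_footmark);

	switch_in(&mm, 0);
	return phys_mem[translate(&mm, 0)];	/* CPU0 reads again */
}
```

In this model replay(false) returns the old value, because CPU0 is no longer in the mask when the flush is issued, while replay(true) returns the value written on CPU1.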
Comments
On Fri, Nov 11, 2022 at 02:59:02AM -0500, guoren@kernel.org wrote:
> From: Guo Ren <guoren@linux.alibaba.com>
>
> After use_asid_allocator is enabled, the userspace application will
> crash by stale TLB entries. Because only using cpumask_clear_cpu without
> local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
> Then set_mm_asid would cause the user space application to get a stale
> value by stale TLB entry, but set_mm_noasid is okay.
>
> Here is the symptom of the bug:
> unhandled signal 11 code 0x1 (coredump)
>    0x0000003fd6d22524 <+4>:     auipc   s0,0x70
>    0x0000003fd6d22528 <+8>:     ld      s0,-148(s0) # 0x3fd6d92490
> => 0x0000003fd6d2252c <+12>:    ld      a5,0(s0)
> (gdb) i r s0
> s0          0x8082ed1cc3198b21       0x8082ed1cc3198b21
> (gdb) x /2x 0x3fd6d92490
> 0x3fd6d92490:   0xd80ac8a8      0x0000003f
> The core dump file shows that register s0 is wrong, but the value in
> memory is correct. Because 'ld s0, -148(s0)' used a stale mapping entry
> in TLB and got a wrong result from an incorrect physical address.
>
> When the task ran on CPU0, which loaded/speculative-loaded the value of
> address(0x3fd6d92490), then the first version of the mapping entry was
> PTWed into CPU0's TLB.
> When the task switched from CPU0 to CPU1 (No local_tlb_flush_all here by
> asid), it happened to write a value on the address (0x3fd6d92490). It
> caused do_page_fault -> wp_page_copy -> ptep_clear_flush ->
> ptep_get_and_clear & flush_tlb_page.
> The flush_tlb_page used mm_cpumask(mm) to determine which CPUs need TLB
> flush, but CPU0 had cleared the CPU0's mm_cpumask in the previous
> switch_mm. So we only flushed the CPU1 TLB and set the second version
> mapping of the PTE. When the task switched from CPU1 to CPU0 again, CPU0
> still used a stale TLB mapping entry which contained a wrong target
> physical address. It raised a bug when the task happened to read that
> value.
>
>           CPU0                               CPU1
> - switch 'task' in
> - read addr (Fill stale mapping
>   entry into TLB)
> - switch 'task' out (no tlb_flush)
>                                      - switch 'task' in (no tlb_flush)
>                                      - write addr cause pagefault
>                                        do_page_fault() (change to
>                                        new addr mapping)
>                                          wp_page_copy()
>                                            ptep_clear_flush()
>                                              ptep_get_and_clear()
>                                              & flush_tlb_page()
>                                        write new value into addr
>                                      - switch 'task' out (no tlb_flush)
> - switch 'task' in (no tlb_flush)
> - read addr again (Use stale
>   mapping entry in TLB)
>   get wrong value from old phyical
>   addr, BUG!
>
> The solution is to keep all CPUs' footmarks of cpumask(mm) in switch_mm,
> which could guarantee to invalidate all stale TLB entries during TLB
> flush.
>
> Fixes: 65d4b9c53017 ("RISC-V: Implement ASID allocator")
> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
> Signed-off-by: Guo Ren <guoren@kernel.org>
> Cc: Anup Patel <apatel@ventanamicro.com>
> Cc: Palmer Dabbelt <palmer@rivosinc.com>
> ---
> Changes in v3:
>  - Move set/clear cpumask(mm) into set_mm (Make code more pretty
>    with Andrew's advice)
>  - Optimize comment description
>
> Changes in v2:
>  - Fixup nommu compile problem (Thx Conor, Also Reported-by: kernel
>    test robot <lkp@intel.com>)
>  - Keep cpumask_clear_cpu for noasid
> ---
>  arch/riscv/mm/context.c | 30 ++++++++++++++++++++----------
>  1 file changed, 20 insertions(+), 10 deletions(-)
>
> diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
> index 7acbfbd14557..0f784e3d307b 100644
> --- a/arch/riscv/mm/context.c
> +++ b/arch/riscv/mm/context.c
> @@ -205,12 +205,24 @@ static void set_mm_noasid(struct mm_struct *mm)
>          local_flush_tlb_all();
>  }
>
> -static inline void set_mm(struct mm_struct *mm, unsigned int cpu)
> +static inline void set_mm(struct mm_struct *prev,
> +                          struct mm_struct *next, unsigned int cpu)
>  {
> -        if (static_branch_unlikely(&use_asid_allocator))
> -                set_mm_asid(mm, cpu);
> -        else
> -                set_mm_noasid(mm);
> +        /*
> +         * The mm_cpumask indicates which harts' TLBs contain the virtual
> +         * address mapping of the mm. Compared to noasid, using asid
> +         * can't guarantee that stale TLB entries are invalidated because
> +         * the asid mechanism wouldn't flush TLB for every switch_mm for
> +         * performance. So when using asid, keep all CPUs footmarks in
> +         * cpumask() until mm reset.
> +         */
> +        cpumask_set_cpu(cpu, mm_cpumask(next));
> +        if (static_branch_unlikely(&use_asid_allocator)) {
> +                set_mm_asid(next, cpu);
> +        } else {
> +                cpumask_clear_cpu(cpu, mm_cpumask(prev));
> +                set_mm_noasid(next);
> +        }
>  }
>
>  static int __init asids_init(void)
> @@ -264,7 +276,8 @@ static int __init asids_init(void)
>  }
>  early_initcall(asids_init);
>  #else
> -static inline void set_mm(struct mm_struct *mm, unsigned int cpu)
> +static inline void set_mm(struct mm_struct *prev,
> +                          struct mm_struct *next, unsigned int cpu)
>  {
>          /* Nothing to do here when there is no MMU */
>  }
> @@ -317,10 +330,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>           */
>          cpu = smp_processor_id();
>
> -        cpumask_clear_cpu(cpu, mm_cpumask(prev));
> -        cpumask_set_cpu(cpu, mm_cpumask(next));
> -
> -        set_mm(next, cpu);
> +        set_mm(prev, next, cpu);
>
>          flush_icache_deferred(next, cpu);
>  }
> --
> 2.36.1
>

Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Hi Guo Ren,

> After use_asid_allocator is enabled, the userspace application will
> crash by stale TLB entries. Because only using cpumask_clear_cpu without
> local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
> Then set_mm_asid would cause the user space application to get a stale
> value by stale TLB entry, but set_mm_noasid is okay.

... [snip]

> +        /*
> +         * The mm_cpumask indicates which harts' TLBs contain the virtual
> +         * address mapping of the mm. Compared to noasid, using asid
> +         * can't guarantee that stale TLB entries are invalidated because
> +         * the asid mechanism wouldn't flush TLB for every switch_mm for
> +         * performance. So when using asid, keep all CPUs footmarks in
> +         * cpumask() until mm reset.
> +         */
> +        cpumask_set_cpu(cpu, mm_cpumask(next));
> +        if (static_branch_unlikely(&use_asid_allocator)) {
> +                set_mm_asid(next, cpu);
> +        } else {
> +                cpumask_clear_cpu(cpu, mm_cpumask(prev));
> +                set_mm_noasid(next);
> +        }
>  }

I observe similar user-space crashes on my SMP systems with ASID enabled.
My attempt to fix the issue was a bit different, see the following patch:

https://lore.kernel.org/linux-riscv/20220829205219.283543-1-geomatsi@gmail.com/

In brief, the idea was borrowed from flush_icache_mm handling:
- keep track of CPUs not running the task
- perform per-ASID TLB flush on such CPUs only if the task is switched there

Your patch also works fine in my tests, fixing those crashes. I have a
question though, regarding the removed cpumask_clear_cpu. How are CPUs
that no longer run the task removed from its mm_cpumask? If they are not
removed, then flush_tlb_mm/flush_tlb_page will broadcast unnecessary
TLB flushes to those CPUs when ASID is enabled.

Regards,
Sergey
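The two steps listed in the message above (track CPUs not running the task, flush them only when the task arrives there) can be sketched as a user-space model. This is an illustration under stated assumptions, not the code from the linked patch: CPU sets are 32-bit masks, and the names toy_mm, toy_flush_mm, toy_switch_in and toy_switch_out are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-mm state; bit n of each mask stands for CPU n. */
struct toy_mm {
	uint32_t running;	/* CPUs currently executing this mm */
	uint32_t stale;		/* CPUs that owe a flush before reusing the mm */
};

/*
 * A mapping changed: CPUs running the mm are flushed right away,
 * every other CPU in `online` is only marked stale.
 * Returns the mask of CPUs that would be flushed immediately.
 */
uint32_t toy_flush_mm(struct toy_mm *mm, uint32_t online)
{
	mm->stale |= online & ~mm->running;
	return mm->running & online;
}

/* The task is switched in: pay the deferred flush if this CPU is stale. */
bool toy_switch_in(struct toy_mm *mm, unsigned int cpu)
{
	assert(cpu < 32);

	uint32_t bit = 1u << cpu;
	bool flushed = (mm->stale & bit) != 0;

	mm->stale &= ~bit;
	mm->running |= bit;
	return flushed;		/* true: a local per-ASID flush would run here */
}

void toy_switch_out(struct toy_mm *mm, unsigned int cpu)
{
	assert(cpu < 32);
	mm->running &= ~(1u << cpu);
}
```

With this bookkeeping a CPU that is not running the task receives no remote flush; it pays with one local flush the next time the task is scheduled on it.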
On Sat, Nov 19, 2022 at 4:57 AM Sergey Matyukevich <geomatsi@gmail.com> wrote:
>
> Hi Guo Ren,
>
>
> > After use_asid_allocator is enabled, the userspace application will
> > crash by stale TLB entries. Because only using cpumask_clear_cpu without
> > local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
> > Then set_mm_asid would cause the user space application to get a stale
> > value by stale TLB entry, but set_mm_noasid is okay.
>
> ... [snip]
>
> > +        /*
> > +         * The mm_cpumask indicates which harts' TLBs contain the virtual
> > +         * address mapping of the mm. Compared to noasid, using asid
> > +         * can't guarantee that stale TLB entries are invalidated because
> > +         * the asid mechanism wouldn't flush TLB for every switch_mm for
> > +         * performance. So when using asid, keep all CPUs footmarks in
> > +         * cpumask() until mm reset.
> > +         */
> > +        cpumask_set_cpu(cpu, mm_cpumask(next));
> > +        if (static_branch_unlikely(&use_asid_allocator)) {
> > +                set_mm_asid(next, cpu);
> > +        } else {
> > +                cpumask_clear_cpu(cpu, mm_cpumask(prev));
> > +                set_mm_noasid(next);
> > +        }
> >  }
>
> I observe similar user-space crashes on my SMP systems with enabled ASID.
> My attempt to fix the issue was a bit different, see the following patch:
>
> https://lore.kernel.org/linux-riscv/20220829205219.283543-1-geomatsi@gmail.com/
>
> In brief, the idea was borrowed from flush_icache_mm handling:
> - keep track of CPUs not running the task
> - perform per-ASID TLB flush on such CPUs only if the task is switched there
>
> Your patch also works fine in my tests fixing those crashes. I have a
> question though, regarding removed cpumask_clear_cpu. How CPUs no more
> running the task are removed from its mm_cpumask ? If they are not
> removed, then flush_tlb_mm/flush_tlb_page will broadcast unnecessary
> TLB flushes to those CPUs when ASID is enabled.
A task can be migrated to any CPU by the scheduler, so keeping TLB
contents in sync with cpumask_set/clear needs an additional tlb_flush,
just like noasid, and your patch still follows that style. The value of
ASID is avoiding tlb_flush during the context switch. Yes, my patch
would increase some tlb_flush IPI costs. But when the mapping is stable,
no tlb_flush is needed during switch_mm (hackbench would benefit because
no TLB flush is needed on its hot path).

Here are my points:
 - We copied the arm64 globally unique asid mechanism into riscv,
   which depends on hardware broadcast TLB flush. My fixup patch is
   closer to the original design principle, proven in the arm64 world.
 - If riscv continues with the local TLB flush hw design in the ISA
   spec, please try x86's per-CPU array of ASIDs. But that is a
   significant change; let's fix the current issue with the smallest
   patch first.

Finally, thanks for your review and testing.

--
Best Regards
 Guo Ren
> After use_asid_allocator is enabled, the userspace application will
> crash by stale TLB entries. Because only using cpumask_clear_cpu without
> local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
> Then set_mm_asid would cause the user space application to get a stale
> value by stale TLB entry, but set_mm_noasid is okay.

... [snip]

Tested-by: Sergey Matyukevich <sergey.matyukevich@syntacore.com>

Thanks,
Sergey
Hi Guo Ren,

> > > After use_asid_allocator is enabled, the userspace application will
> > > crash by stale TLB entries. Because only using cpumask_clear_cpu without
> > > local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
> > > Then set_mm_asid would cause the user space application to get a stale
> > > value by stale TLB entry, but set_mm_noasid is okay.
> >
> > ... [snip]
> >
> > > +        /*
> > > +         * The mm_cpumask indicates which harts' TLBs contain the virtual
> > > +         * address mapping of the mm. Compared to noasid, using asid
> > > +         * can't guarantee that stale TLB entries are invalidated because
> > > +         * the asid mechanism wouldn't flush TLB for every switch_mm for
> > > +         * performance. So when using asid, keep all CPUs footmarks in
> > > +         * cpumask() until mm reset.
> > > +         */
> > > +        cpumask_set_cpu(cpu, mm_cpumask(next));
> > > +        if (static_branch_unlikely(&use_asid_allocator)) {
> > > +                set_mm_asid(next, cpu);
> > > +        } else {
> > > +                cpumask_clear_cpu(cpu, mm_cpumask(prev));
> > > +                set_mm_noasid(next);
> > > +        }
> > >  }
> >
> > I observe similar user-space crashes on my SMP systems with enabled ASID.
> > My attempt to fix the issue was a bit different, see the following patch:
> >
> > https://lore.kernel.org/linux-riscv/20220829205219.283543-1-geomatsi@gmail.com/
> >
> > In brief, the idea was borrowed from flush_icache_mm handling:
> > - keep track of CPUs not running the task
> > - perform per-ASID TLB flush on such CPUs only if the task is switched there
> >
> > Your patch also works fine in my tests fixing those crashes. I have a
> > question though, regarding removed cpumask_clear_cpu. How CPUs no more
> > running the task are removed from its mm_cpumask ? If they are not
> > removed, then flush_tlb_mm/flush_tlb_page will broadcast unnecessary
> > TLB flushes to those CPUs when ASID is enabled.
> A task would be migrated to any CPU by the scheduler. So keeping TLB
> contents synced with cpumask_set/clear needs additional tlb_flush just
> like noasid, and your patch still follows that style. The worth of
> ASID is avoiding tlb_flush during the context switch. Yes, my patch
> would increase some tlb_flush IPI costs. But when mapping is stable,
> no tlb_flush is needed during the switch_mm (Hackbench would be
> beneficiary because no more TLB flush is needed at his hot point
> path). Here are my points:
>  - We copied the arm64 globally unique asid mechanism into riscv,
>    which depends on hardware broadcast TLB flush. My fixup patch is
>    closer to the original principle design, proven in the arm64 world.
>  - If riscv continues local TLB flush hw design in ISA spec, please
>    try x86's per-CPU array of ASID. But that is a significant change;
>    let's fix the current issue with the smallest patch first.
>
> In the end, thx your review and test.

By the way, how did you verify the patch ? Do you have any good
reproducer for this issue ?

Regards,
Sergey
On Fri, 18 Nov 2022 12:57:21 PST (-0800), geomatsi@gmail.com wrote:
> Hi Guo Ren,
>
>
>> After use_asid_allocator is enabled, the userspace application will
>> crash by stale TLB entries. Because only using cpumask_clear_cpu without
>> local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
>> Then set_mm_asid would cause the user space application to get a stale
>> value by stale TLB entry, but set_mm_noasid is okay.
>
> ... [snip]
>
>> +        /*
>> +         * The mm_cpumask indicates which harts' TLBs contain the virtual
>> +         * address mapping of the mm. Compared to noasid, using asid
>> +         * can't guarantee that stale TLB entries are invalidated because
>> +         * the asid mechanism wouldn't flush TLB for every switch_mm for
>> +         * performance. So when using asid, keep all CPUs footmarks in
>> +         * cpumask() until mm reset.
>> +         */
>> +        cpumask_set_cpu(cpu, mm_cpumask(next));
>> +        if (static_branch_unlikely(&use_asid_allocator)) {
>> +                set_mm_asid(next, cpu);
>> +        } else {
>> +                cpumask_clear_cpu(cpu, mm_cpumask(prev));
>> +                set_mm_noasid(next);
>> +        }
>>  }
>
> I observe similar user-space crashes on my SMP systems with enabled ASID.
> My attempt to fix the issue was a bit different, see the following patch:
>
> https://lore.kernel.org/linux-riscv/20220829205219.283543-1-geomatsi@gmail.com/
>
> In brief, the idea was borrowed from flush_icache_mm handling:
> - keep track of CPUs not running the task
> - perform per-ASID TLB flush on such CPUs only if the task is switched there

That way looks better to me: leaking hartids in the ASID allocator might
make the crashes go away, but it's just going to end up trending towards
flushing everything and that doesn't seem like the right long-term
solution.

So I've got that one on for-next, sorry I missed it before.

Thanks!

>
> Your patch also works fine in my tests fixing those crashes. I have a
> question though, regarding removed cpumask_clear_cpu. How CPUs no more
> running the task are removed from its mm_cpumask ? If they are not
> removed, then flush_tlb_mm/flush_tlb_page will broadcast unnecessary
> TLB flushes to those CPUs when ASID is enabled.
>
> Regards,
> Sergey
On Fri, Dec 9, 2022 at 7:30 AM Palmer Dabbelt <palmer@rivosinc.com> wrote:
>
> On Fri, 18 Nov 2022 12:57:21 PST (-0800), geomatsi@gmail.com wrote:
> > Hi Guo Ren,
> >
> >
> >> After use_asid_allocator is enabled, the userspace application will
> >> crash by stale TLB entries. Because only using cpumask_clear_cpu without
> >> local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
> >> Then set_mm_asid would cause the user space application to get a stale
> >> value by stale TLB entry, but set_mm_noasid is okay.
> >
> > ... [snip]
> >
> >> +        /*
> >> +         * The mm_cpumask indicates which harts' TLBs contain the virtual
> >> +         * address mapping of the mm. Compared to noasid, using asid
> >> +         * can't guarantee that stale TLB entries are invalidated because
> >> +         * the asid mechanism wouldn't flush TLB for every switch_mm for
> >> +         * performance. So when using asid, keep all CPUs footmarks in
> >> +         * cpumask() until mm reset.
> >> +         */
> >> +        cpumask_set_cpu(cpu, mm_cpumask(next));
> >> +        if (static_branch_unlikely(&use_asid_allocator)) {
> >> +                set_mm_asid(next, cpu);
> >> +        } else {
> >> +                cpumask_clear_cpu(cpu, mm_cpumask(prev));
> >> +                set_mm_noasid(next);
> >> +        }
> >>  }
> >
> > I observe similar user-space crashes on my SMP systems with enabled ASID.
> > My attempt to fix the issue was a bit different, see the following patch:
> >
> > https://lore.kernel.org/linux-riscv/20220829205219.283543-1-geomatsi@gmail.com/
> >
> > In brief, the idea was borrowed from flush_icache_mm handling:
> > - keep track of CPUs not running the task
> > - perform per-ASID TLB flush on such CPUs only if the task is switched there
>
> That way looks better to me: leaking hartids in the ASID allocator might
> make the crashes go away, but it's just going to end up trending towards
> flushing everything and that doesn't seem like the right long-term
> solution.
The penalty in switch_mm is too heavy!!!
 - If the system has multiple NUMA nodes, it will cause switch_mm_fast
   to flush unnecessary harts.
 - If flush_range is just 1 entry, it would still cause a
   flush_tlb_all_asid.

switch_mm_fast:
        csr_write(CSR_SATP, virt_to_pfn(mm->pgd) |
                  ((cntx & asid_mask) << SATP_ASID_SHIFT) |
                  satp_mode);

        if (need_flush_tlb)
                local_flush_tlb_all();
+#ifdef CONFIG_SMP
+       else {
+               cpumask_t *mask = &mm->context.tlb_stale_mask;
+
+               if (cpumask_test_cpu(cpu, mask)) {
+                       cpumask_clear_cpu(cpu, mask);
+                       local_flush_tlb_all_asid(cntx & asid_mask);
// penalty in switch_mm fast path
+               }
+       }
+#endif

And see:
 static void __sbi_tlb_flush_range(struct mm_struct *mm, unsigned long start,
                                  unsigned long size, unsigned long stride)
 {
+       struct cpumask *pmask = &mm->context.tlb_stale_mask;
        struct cpumask *cmask = mm_cpumask(mm);
        unsigned int cpuid;
        bool broadcast;
@@ -44,6 +29,15 @@ static void __sbi_tlb_flush_range(struct mm_struct *mm, unsigned long start,
        if (static_branch_unlikely(&use_asid_allocator)) {
                unsigned long asid = atomic_long_read(&mm->context.id);

+               /*
+                * TLB will be immediately flushed on harts concurrently
+                * executing this MM context. TLB flush on other harts
+                * is deferred until this MM context migrates there.
+                */
+               cpumask_setall(pmask);
                ^^^^^^^^^^^^^^^^^^^^^^^ It would flush all harts for
all NUMA nodes!!! Most of them deferred to switch_mm_fast. The penalty
contains unnecessary harts!
+               cpumask_clear_cpu(cpuid, pmask);
+               cpumask_andnot(pmask, pmask, cmask);

Please reconsider a bit, and make a smart decision. Penalize only the
harts that touched the mm, not all of them. And only flush the whole TLB
when some entries are needed.

The __sbi_tlb_flush_range is the slow path; keeping the fast path
performance is worth more than improving a slow one.

>
> So I've got that one on for-next, sorry I missed it before.
>
> Thanks!
>
> >
> > Your patch also works fine in my tests fixing those crashes. I have a
> > question though, regarding removed cpumask_clear_cpu. How CPUs no more
> > running the task are removed from its mm_cpumask ? If they are not
> > removed, then flush_tlb_mm/flush_tlb_page will broadcast unnecessary
> > TLB flushes to those CPUs when ASID is enabled.
> >
> > Regards,
> > Sergey
On Fri, Dec 9, 2022 at 11:13 AM Guo Ren <guoren@kernel.org> wrote:
>
> On Fri, Dec 9, 2022 at 7:30 AM Palmer Dabbelt <palmer@rivosinc.com> wrote:
> >
> > That way looks better to me: leaking hartids in the ASID allocator might
> > make the crashes go away, but it's just going to end up trending towards
> > flushing everything and that doesn't seem like the right long-term
> > solution.
>
> ... [snip]
>
> Please reconsider a bit, and make a smart decision. Just penalize the
> harts that touched the mm, not all of them. And only flush the whole TLB
                                              ^^^^^^^^ Don't
> when some entries are needed.
>
> The __sbi_tlb_flush_range is the slow path; keeping the fast path
> performance is worth more than improving a slow one.
>
> ... [snip]
>
> --
> Best Regards
>  Guo Ren
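To make the two positions above easier to compare, here is a minimal userspace sketch in C (not kernel code) of the mask arithmetic being argued about. It assumes 8 harts and models a cpumask as a plain bitmask; NR_HARTS, stale_mask_after_flush(), footmark_flush_targets() and hart_count() are hypothetical names used for illustration only. The first function mirrors the three cpumask operations in the quoted __sbi_tlb_flush_range() hunk, the second models the alternative in which mm_cpumask keeps every hart the mm has run on. A hart marked stale only pays for a flush if the mm is later switched onto it, so the first count is an upper bound rather than a measured cost.

```c
#include <assert.h>
#include <stdint.h>

#define NR_HARTS  8u
#define ALL_HARTS ((1u << NR_HARTS) - 1u)

/*
 * Deferred-flush scheme: mirrors the three cpumask operations in the quoted
 * __sbi_tlb_flush_range() hunk. cmask stands for mm_cpumask(mm), i.e. the
 * harts running the mm right now; cpuid is the hart issuing the flush.
 * Returns the harts marked stale (flush deferred until the mm runs there).
 */
uint32_t stale_mask_after_flush(uint32_t cmask, unsigned int cpuid)
{
	uint32_t pmask;

	assert(cpuid < NR_HARTS);
	pmask = ALL_HARTS;          /* cpumask_setall(pmask) */
	pmask &= ~(1u << cpuid);    /* cpumask_clear_cpu(cpuid, pmask) */
	pmask &= ~cmask;            /* cpumask_andnot(pmask, pmask, cmask) */
	return pmask;
}

/*
 * Footmark scheme: mm_cpumask keeps every hart the mm has ever run on, and
 * each of them except the caller receives an immediate remote flush.
 */
uint32_t footmark_flush_targets(uint32_t footmarks, unsigned int cpuid)
{
	assert(cpuid < NR_HARTS);
	return footmarks & ~(1u << cpuid);
}

/* Number of harts set in a mask. */
unsigned int hart_count(uint32_t mask)
{
	unsigned int n = 0;

	for (; mask; mask >>= 1)
		n += mask & 1u;
	return n;
}
```

For an mm that previously ran on hart 0 and now flushes from hart 1, the first function marks 7 of the 8 harts stale, while the second targets only hart 0. Which cost dominates in practice depends on how often the mm actually migrates, which this sketch does not model.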
Hi Guo,

Thank you for the patch.

On Fri, Nov 11, 2022 at 8:00 AM <guoren@kernel.org> wrote:
>
> From: Guo Ren <guoren@linux.alibaba.com>
>
> After use_asid_allocator is enabled, the userspace application will
> crash by stale TLB entries. Because only using cpumask_clear_cpu without
> local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
> Then set_mm_asid would cause the user space application to get a stale
> value by stale TLB entry, but set_mm_noasid is okay.
>
> Here is the symptom of the bug:
> unhandled signal 11 code 0x1 (coredump)
>    0x0000003fd6d22524 <+4>:     auipc   s0,0x70
>    0x0000003fd6d22528 <+8>:     ld      s0,-148(s0) # 0x3fd6d92490
> => 0x0000003fd6d2252c <+12>:    ld      a5,0(s0)
> (gdb) i r s0
> s0          0x8082ed1cc3198b21      0x8082ed1cc3198b21
> (gdb) x /2x 0x3fd6d92490
> 0x3fd6d92490:   0xd80ac8a8      0x0000003f
> The core dump file shows that register s0 is wrong, but the value in
> memory is correct. Because 'ld s0, -148(s0)' used a stale mapping entry
> in TLB and got a wrong result from an incorrect physical address.
>
> When the task ran on CPU0, which loaded/speculative-loaded the value of
> address(0x3fd6d92490), then the first version of the mapping entry was
> PTWed into CPU0's TLB.
> When the task switched from CPU0 to CPU1 (No local_tlb_flush_all here by
> asid), it happened to write a value on the address (0x3fd6d92490). It
> caused do_page_fault -> wp_page_copy -> ptep_clear_flush ->
> ptep_get_and_clear & flush_tlb_page.
> The flush_tlb_page used mm_cpumask(mm) to determine which CPUs need TLB
> flush, but CPU0 had cleared the CPU0's mm_cpumask in the previous
> switch_mm. So we only flushed the CPU1 TLB and set the second version
> mapping of the PTE. When the task switched from CPU1 to CPU0 again, CPU0
> still used a stale TLB mapping entry which contained a wrong target
> physical address. It raised a bug when the task happened to read that
> value.
>
>          CPU0                               CPU1
>  - switch 'task' in
>  - read addr (Fill stale mapping
>    entry into TLB)
>  - switch 'task' out (no tlb_flush)
>                                       - switch 'task' in (no tlb_flush)
>                                       - write addr cause pagefault
>                                         do_page_fault() (change to
>                                         new addr mapping)
>                                           wp_page_copy()
>                                             ptep_clear_flush()
>                                               ptep_get_and_clear()
>                                               & flush_tlb_page()
>                                         write new value into addr
>                                       - switch 'task' out (no tlb_flush)
>  - switch 'task' in (no tlb_flush)
>  - read addr again (Use stale
>    mapping entry in TLB)
>    get wrong value from old physical
>    addr, BUG!
>
> The solution is to keep all CPUs' footmarks of cpumask(mm) in switch_mm,
> which could guarantee to invalidate all stale TLB entries during TLB
> flush.
>
> Fixes: 65d4b9c53017 ("RISC-V: Implement ASID allocator")
> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
> Signed-off-by: Guo Ren <guoren@kernel.org>
> Cc: Anup Patel <apatel@ventanamicro.com>
> Cc: Palmer Dabbelt <palmer@rivosinc.com>
> ---
> Changes in v3:
>  - Move set/clear cpumask(mm) into set_mm (Make code more pretty
>    with Andrew's advice)
>  - Optimize comment description
>
> Changes in v2:
>  - Fixup nommu compile problem (Thx Conor, Also Reported-by: kernel
>    test robot <lkp@intel.com>)
>  - Keep cpumask_clear_cpu for noasid
> ---
>  arch/riscv/mm/context.c | 30 ++++++++++++++++++++----------
>  1 file changed, 20 insertions(+), 10 deletions(-)
>
As reported on the patch [0] I was seeing consistent failures on the
RZ/Five SoC while running bonnie++ utility. After applying this patch
on top of Palmer's for-next branch (eb67d239f3aa) I am no longer
seeing this issue.

Tested-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>

[0] https://patchwork.kernel.org/project/linux-riscv/patch/20220829205219.283543-1-geomatsi@gmail.com/

Cheers,
Prabhakar

> diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
> index 7acbfbd14557..0f784e3d307b 100644
> --- a/arch/riscv/mm/context.c
> +++ b/arch/riscv/mm/context.c
> @@ -205,12 +205,24 @@ static void set_mm_noasid(struct mm_struct *mm)
>         local_flush_tlb_all();
>  }
>
> -static inline void set_mm(struct mm_struct *mm, unsigned int cpu)
> +static inline void set_mm(struct mm_struct *prev,
> +                         struct mm_struct *next, unsigned int cpu)
>  {
> -       if (static_branch_unlikely(&use_asid_allocator))
> -               set_mm_asid(mm, cpu);
> -       else
> -               set_mm_noasid(mm);
> +       /*
> +        * The mm_cpumask indicates which harts' TLBs contain the virtual
> +        * address mapping of the mm. Compared to noasid, using asid
> +        * can't guarantee that stale TLB entries are invalidated because
> +        * the asid mechanism wouldn't flush TLB for every switch_mm for
> +        * performance. So when using asid, keep all CPUs footmarks in
> +        * cpumask() until mm reset.
> +        */
> +       cpumask_set_cpu(cpu, mm_cpumask(next));
> +       if (static_branch_unlikely(&use_asid_allocator)) {
> +               set_mm_asid(next, cpu);
> +       } else {
> +               cpumask_clear_cpu(cpu, mm_cpumask(prev));
> +               set_mm_noasid(next);
> +       }
>  }
>
>  static int __init asids_init(void)
> @@ -264,7 +276,8 @@ static int __init asids_init(void)
>  }
>  early_initcall(asids_init);
>  #else
> -static inline void set_mm(struct mm_struct *mm, unsigned int cpu)
> +static inline void set_mm(struct mm_struct *prev,
> +                         struct mm_struct *next, unsigned int cpu)
>  {
>         /* Nothing to do here when there is no MMU */
>  }
> @@ -317,10 +330,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>          */
>         cpu = smp_processor_id();
>
> -       cpumask_clear_cpu(cpu, mm_cpumask(prev));
> -       cpumask_set_cpu(cpu, mm_cpumask(next));
> -
> -       set_mm(next, cpu);
> +       set_mm(prev, next, cpu);
>
>         flush_icache_deferred(next, cpu);
>  }
> --
> 2.36.1
>
>
> _______________________________________________
> linux-riscv mailing list
> linux-riscv@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv
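The CPU0/CPU1 sequence described in the commit message can be replayed with a toy userspace model. This is only a sketch under simplifying assumptions: one mm, one virtual page, two harts, and a one-entry "TLB" per hart that caches the physical page number. All names (toy_mm, toy_hart, toy_replay() and so on) are hypothetical and do not exist in the kernel. The keep_footmark flag selects between clearing the hart's bit on switch-out (the old behaviour with the ASID allocator) and keeping it (what the patch does in the ASID case).

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 2u

/* One mm with a single virtual page; pte is the physical page it maps to. */
struct toy_mm {
	unsigned int cpumask;   /* bit n set: hart n is a TLB flush target */
	int pte;
};

/* One-entry TLB per hart, caching the physical page for that virtual page. */
struct toy_hart {
	bool tlb_valid;
	int tlb_ppn;
};

static struct toy_hart harts[NR_CPUS];

static void toy_switch_in(struct toy_mm *mm, unsigned int cpu)
{
	assert(cpu < NR_CPUS);
	mm->cpumask |= 1u << cpu;
}

static void toy_switch_out(struct toy_mm *mm, unsigned int cpu, bool keep_footmark)
{
	assert(cpu < NR_CPUS);
	if (!keep_footmark)
		mm->cpumask &= ~(1u << cpu);
	/* With ASIDs there is no TLB flush here, so the hart's entry survives. */
}

/* Read through the TLB, walking the "page table" only on a miss. */
static int toy_read(const struct toy_mm *mm, unsigned int cpu)
{
	struct toy_hart *h = &harts[cpu];

	if (!h->tlb_valid) {
		h->tlb_ppn = mm->pte;
		h->tlb_valid = true;
	}
	return h->tlb_ppn;
}

/* Copy-on-write fault: remap the page, then flush only harts in cpumask. */
static void toy_cow_write(struct toy_mm *mm, int new_ppn)
{
	mm->pte = new_ppn;
	for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++)
		if (mm->cpumask & (1u << cpu))
			harts[cpu].tlb_valid = false;
}

/* Replays the sequence; returns the physical page CPU0 sees at the end. */
int toy_replay(bool keep_footmark)
{
	struct toy_mm mm = { .cpumask = 0, .pte = 1 };

	for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++)
		harts[cpu] = (struct toy_hart){ .tlb_valid = false, .tlb_ppn = 0 };

	toy_switch_in(&mm, 0);
	(void)toy_read(&mm, 0);                 /* CPU0 caches page 1 */
	toy_switch_out(&mm, 0, keep_footmark);
	toy_switch_in(&mm, 1);
	toy_cow_write(&mm, 2);                  /* fault on CPU1 remaps to page 2 */
	toy_switch_out(&mm, 1, keep_footmark);
	toy_switch_in(&mm, 0);
	return toy_read(&mm, 0);                /* stale if CPU0 was not flushed */
}
```

In this model toy_replay(false) returns the old page 1, because CPU0 was no longer in the mask when the flush was issued, while toy_replay(true) returns the new page 2. The model ignores ASID rollover, speculation and everything else the real code has to handle.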
Lad, Prabhakar <prabhakar.csengg@gmail.com> wrote on Fri, Dec 23, 2022 at 8:54 PM:
>
> Hi Guo,
>
> Thank you for the patch.
>
> On Fri, Nov 11, 2022 at 8:00 AM <guoren@kernel.org> wrote:
> >
> > From: Guo Ren <guoren@linux.alibaba.com>
> >
> > ... [snip]
> >
> As reported on the patch [0] I was seeing consistent failures on the
> RZ/Five SoC while running bonnie++ utility. After applying this patch
> on top of Palmer's for-next branch (eb67d239f3aa) I am no longer
> seeing this issue.
>
> Tested-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
>
> [0] https://patchwork.kernel.org/project/linux-riscv/patch/20220829205219.283543-1-geomatsi@gmail.com/
>

Hi all,
I hit the same problem (i.e. unhandled signal 11) on our internal
multi-core system. I tried patches [0] and [1], but it still doesn't
work, so I guess there are still some potential problems. After applying
this patch, the problem disappeared. I also took some time to look at
other arches' implementations: some, such as arc, don't clear the
mm_cpumask because of a similar issue. I can't say which approach might
be better, but I'd like to point out that this patch works for me.
Thanks.

Tested-by: Zong Li <zong.li@sifive.com>

[0] https://lore.kernel.org/linux-riscv/20220829205219.283543-1-geomatsi@gmail.com/
[1] https://lore.kernel.org/linux-riscv/20230129211818.686557-1-geomatsi@gmail.com/

> Cheers,
> Prabhakar
>
> ... [snip]
On Fri, Feb 24, 2023 at 1:58 AM Zong Li <zongbox@gmail.com> wrote:
>
> Lad, Prabhakar <prabhakar.csengg@gmail.com> wrote on Fri, Dec 23, 2022 at 8:54 PM:
> >
> > ... [snip]
> >
>
> Hi all,
> I got the same situation (i.e. unhandle signal 11) on our internal
> multi-core system, I tried the patch[0] & [1], but it still doesn't
> work, I guess there are still some potential problems. After applying
> this patch, the situation disappeared, I took some time to look at
> other arches' implementations, such as arc, they don't clear the
> mm_cpumask due to the similar issue. I can't say which approach might
> be better, but I'd like to point out that this patch works to me.
> Thanks.

Thanks for the report. I think this patch is simpler and addresses the
problem more directly than the current fixup approach, although the
maintainer considered it a temporary solution. This patch has been
tested in our production environment for thousands of hours. You can
trust it :)

>
> Tested-by: Zong Li <zong.li@sifive.com>
>
> [0] https://lore.kernel.org/linux-riscv/20220829205219.283543-1-geomatsi@gmail.com/
> [1] https://lore.kernel.org/linux-riscv/20230129211818.686557-1-geomatsi@gmail.com/
>
> ... [snip]
On Fri, Feb 24, 2023 at 01:57:55AM +0800, Zong Li wrote:
> Lad, Prabhakar <prabhakar.csengg@gmail.com> wrote on Fri, Dec 23, 2022 at 8:54 PM:
> >
> > ... [snip]
> >
>
> Hi all,
> I got the same situation (i.e. unhandle signal 11) on our internal
> multi-core system, I tried the patch[0] & [1], but it still doesn't
> work, I guess there are still some potential problems. After applying
> this patch, the situation disappeared, I took some time to look at
> other arches' implementations, such as arc, they don't clear the
> mm_cpumask due to the similar issue. I can't say which approach might
> be better, but I'd like to point out that this patch works to me.
> Thanks.
>
> Tested-by: Zong Li <zong.li@sifive.com>
>
> [0] https://lore.kernel.org/linux-riscv/20220829205219.283543-1-geomatsi@gmail.com/
> [1] https://lore.kernel.org/linux-riscv/20230129211818.686557-1-geomatsi@gmail.com/

Thanks for the report! By the way, could you please share some
information about the workload that reproduces the issue?

The initial idea was to reduce the number of TLB flushes by deferring
(and possibly avoiding) some of them. But we already have bug reports
from two different vendors, so apparently something was overlooked
here. Let's switch to the 'aggregating' mm_cpumask approach suggested
by Guo Ren.

@Guo Ren, do you mind if I re-send your v3 patch together with the
remaining reverts of my changes?

Regards,
Sergey
On Sat, Feb 25, 2023 at 2:29 PM Sergey Matyukevich <geomatsi@gmail.com> wrote:
>
> On Fri, Feb 24, 2023 at 01:57:55AM +0800, Zong Li wrote:
> > Lad, Prabhakar <prabhakar.csengg@gmail.com> wrote on Fri, Dec 23, 2022 at 8:54 PM:
> > >
> > > Hi Guo,
> > >
> > > Thank you for the patch.
> > >
> > > On Fri, Nov 11, 2022 at 8:00 AM <guoren@kernel.org> wrote:
> > > >
> > > > From: Guo Ren <guoren@linux.alibaba.com>
> > > >
> > > > After use_asid_allocator is enabled, the userspace application will
> > > > crash by stale TLB entries. Because only using cpumask_clear_cpu without
> > > > local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
> > > > Then set_mm_asid would cause the user space application to get a stale
> > > > value by stale TLB entry, but set_mm_noasid is okay.
> > > >
> > > > Here is the symptom of the bug:
> > > > unhandled signal 11 code 0x1 (coredump)
> > > >    0x0000003fd6d22524 <+4>:     auipc   s0,0x70
> > > >    0x0000003fd6d22528 <+8>:     ld      s0,-148(s0) # 0x3fd6d92490
> > > > => 0x0000003fd6d2252c <+12>:    ld      a5,0(s0)
> > > > (gdb) i r s0
> > > > s0          0x8082ed1cc3198b21       0x8082ed1cc3198b21
> > > > (gdb) x /2x 0x3fd6d92490
> > > > 0x3fd6d92490:   0xd80ac8a8      0x0000003f
> > > > The core dump file shows that register s0 is wrong, but the value in
> > > > memory is correct. Because 'ld s0, -148(s0)' used a stale mapping entry
> > > > in TLB and got a wrong result from an incorrect physical address.
> > > >
> > > > When the task ran on CPU0, which loaded/speculative-loaded the value of
> > > > address(0x3fd6d92490), then the first version of the mapping entry was
> > > > PTWed into CPU0's TLB.
> > > > When the task switched from CPU0 to CPU1 (No local_tlb_flush_all here by
> > > > asid), it happened to write a value on the address (0x3fd6d92490). It
> > > > caused do_page_fault -> wp_page_copy -> ptep_clear_flush ->
> > > > ptep_get_and_clear & flush_tlb_page.
> > > > The flush_tlb_page used mm_cpumask(mm) to determine which CPUs need TLB
> > > > flush, but CPU0 had cleared the CPU0's mm_cpumask in the previous
> > > > switch_mm. So we only flushed the CPU1 TLB and set the second version
> > > > mapping of the PTE. When the task switched from CPU1 to CPU0 again, CPU0
> > > > still used a stale TLB mapping entry which contained a wrong target
> > > > physical address. It raised a bug when the task happened to read that
> > > > value.
> > > >
> > > >    CPU0                               CPU1
> > > >    - switch 'task' in
> > > >    - read addr (Fill stale mapping
> > > >      entry into TLB)
> > > >    - switch 'task' out (no tlb_flush)
> > > >                                       - switch 'task' in (no tlb_flush)
> > > >                                       - write addr cause pagefault
> > > >                                         do_page_fault() (change to
> > > >                                         new addr mapping)
> > > >                                           wp_page_copy()
> > > >                                             ptep_clear_flush()
> > > >                                               ptep_get_and_clear()
> > > >                                               & flush_tlb_page()
> > > >                                         write new value into addr
> > > >                                       - switch 'task' out (no tlb_flush)
> > > >    - switch 'task' in (no tlb_flush)
> > > >    - read addr again (Use stale
> > > >      mapping entry in TLB)
> > > >      get wrong value from old physical
> > > >      addr, BUG!
> > > >
> > > > The solution is to keep all CPUs' footmarks of cpumask(mm) in switch_mm,
> > > > which could guarantee to invalidate all stale TLB entries during TLB
> > > > flush.
> > > >
> > > > Fixes: 65d4b9c53017 ("RISC-V: Implement ASID allocator")
> > > > Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
> > > > Signed-off-by: Guo Ren <guoren@kernel.org>
> > > > Cc: Anup Patel <apatel@ventanamicro.com>
> > > > Cc: Palmer Dabbelt <palmer@rivosinc.com>
> > > > ---
> > > > Changes in v3:
> > > >  - Move set/clear cpumask(mm) into set_mm (Make code more pretty
> > > >    with Andrew's advice)
> > > >  - Optimize comment description
> > > >
> > > > Changes in v2:
> > > >  - Fixup nommu compile problem (Thx Conor, Also Reported-by: kernel
> > > >    test robot <lkp@intel.com>)
> > > >  - Keep cpumask_clear_cpu for noasid
> > > > ---
> > > >  arch/riscv/mm/context.c | 30 ++++++++++++++++++++----------
> > > >  1 file changed, 20 insertions(+), 10 deletions(-)
> > > >
> > > As reported on the patch [0] I was seeing consistent failures on the
> > > RZ/Five SoC while running bonnie++ utility. After applying this patch
> > > on top of Palmer's for-next branch (eb67d239f3aa) I am no longer
> > > seeing this issue.
> > >
> > > Tested-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
> > >
> > > [0] https://patchwork.kernel.org/project/linux-riscv/patch/20220829205219.283543-1-geomatsi@gmail.com/
> > >
> >
> > Hi all,
> > I got the same situation (i.e. unhandled signal 11) on our internal
> > multi-core system, I tried the patch[0] & [1], but it still doesn't
> > work, I guess there are still some potential problems. After applying
> > this patch, the situation disappeared, I took some time to look at
> > other arches' implementations, such as arc, they don't clear the
> > mm_cpumask due to the similar issue. I can't say which approach might
> > be better, but I'd like to point out that this patch works to me.
> > Thanks.
> >
> > Tested-by: Zong Li <zong.li@sifive.com>
> >
> > [0] https://lore.kernel.org/linux-riscv/20220829205219.283543-1-geomatsi@gmail.com/
> > [1] https://lore.kernel.org/linux-riscv/20230129211818.686557-1-geomatsi@gmail.com/
>
> Thanks for the report! By the way, could you please share some
> information about the reproducing workload?
>
> The initial idea was to reduce the number of TLB flushes by deferring
> (and possibly avoiding) some of them. But we already have bug reports
> from two different vendors, so apparently something was overlooked here.
> Let's switch to the 'aggregating' mm_cpumask approach suggested by Guo Ren.
>
> @Guo Ren, do you mind if I re-send your v3 patch together with the
> remaining reverts of my changes?

Okay, thanks for taking care of it. Let's get it working first and then
improve it.

Actually, the current riscv ASID allocator comes from arm64, which relies
on hardware broadcast. Maybe we need to consider the x86 per-CPU ASID pool
approach.

>
> Regards,
> Sergey
On Sat, 25 Feb 2023 23:24:40 -0500
Guo Ren <guoren@kernel.org> wrote:

> On Sat, Feb 25, 2023 at 2:29 PM Sergey Matyukevich <geomatsi@gmail.com> wrote:
> >
> > [...]
> >
> > @Guo Ren, do you mind if I re-send your v3 patch together with the
> > remaining reverts of my changes?
> Okay, thanks for taking care of it. Let's get it working first and then
> improve it.
>
> Actually, the current riscv ASID allocator comes from arm64, which relies
> on hardware broadcast. Maybe we need to consider the x86 per-CPU ASID pool
> approach.

It should be noted that the spec expects supervisor software to
use a consistent meaning of non-zero ASIDs across different harts.

Also, a vendor could implement efficient hardware broadcasting ASID
invalidation with a custom instruction and expose it via SBI.

Best,
Gary
On Tue, Feb 28, 2023 at 6:40 AM Gary Guo <gary@garyguo.net> wrote:
>
> On Sat, 25 Feb 2023 23:24:40 -0500
> Guo Ren <guoren@kernel.org> wrote:
>
> > [...]
> >
> > Actually, the current riscv ASID allocator comes from arm64, which relies
> > on hardware broadcast. Maybe we need to consider the x86 per-CPU ASID pool
> > approach.
>
> It should be noted that the spec expects supervisor software to
> use a consistent meaning of non-zero ASIDs across different harts.
>
> Also, a vendor could implement efficient hardware broadcasting ASID
> invalidation with a custom instruction and expose it via SBI.

I agree with you. In fact, our XuanTie cores support hardware-broadcast
invalidation, and we expect it to be exposed SBI-style. The x86 style would
be another choice for the future, so riscv would combine different TLB
maintenance styles in one Linux architecture.

----
Let me outline the current riscv state:
1. Linux riscv currently uses a unified ASID design, which comes from
   the arm hardware-broadcast design. Some riscv vendors (XuanTie) could
   also support TLB hardware broadcast.
2. The riscv spec doesn't suggest hardware broadcast because it's
   unsuitable for large-scale systems. So the x86 style would be another
   choice for Linux riscv in the future.

Is that correct? Feedback is welcome.

>
> Best,
> Gary
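For readers unfamiliar with the "x86 per-cpu asid pool" idea raised in the thread, the general technique can be sketched roughly as follows. This is only an illustrative user-space toy, loosely inspired by the idea of each CPU tracking a context id and a TLB generation per local ASID slot; it is not x86 or RISC-V kernel code and not a proposal for this patch. All demo_* names, the pool size of 4, and the round-robin eviction are assumptions invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_NR_ASIDS 4 /* hypothetical size of each CPU's private pool */

/* One CPU-local ASID slot: which mm owns it and how fresh it is. */
struct demo_slot {
	uint64_t ctx_id;  /* owner mm; 0 means the slot is free */
	uint64_t tlb_gen; /* mm generation this CPU has caught up to */
};

struct demo_cpu {
	struct demo_slot slots[DEMO_NR_ASIDS];
	unsigned int next; /* round-robin victim for eviction */
};

struct demo_mm {
	uint64_t ctx_id;  /* unique and non-zero */
	uint64_t tlb_gen; /* bumped whenever a mapping change needs a flush */
};

/*
 * Pick a CPU-local ASID for 'mm' on switch-in. '*need_flush' tells the
 * caller whether entries tagged with that local ASID must be flushed
 * before use (slot was recycled, or this CPU missed a generation bump).
 */
unsigned int demo_choose_asid(struct demo_cpu *cpu, const struct demo_mm *mm,
			      bool *need_flush)
{
	assert(mm->ctx_id != 0);

	for (unsigned int asid = 0; asid < DEMO_NR_ASIDS; asid++) {
		struct demo_slot *slot = &cpu->slots[asid];

		if (slot->ctx_id != mm->ctx_id)
			continue;
		/* Seen before: flush only if the mm changed while away. */
		*need_flush = slot->tlb_gen < mm->tlb_gen;
		slot->tlb_gen = mm->tlb_gen;
		return asid;
	}

	/* Not cached on this CPU: recycle a slot and flush it locally. */
	unsigned int asid = cpu->next;

	cpu->next = (cpu->next + 1) % DEMO_NR_ASIDS;
	cpu->slots[asid].ctx_id = mm->ctx_id;
	cpu->slots[asid].tlb_gen = mm->tlb_gen;
	*need_flush = true;
	return asid;
}
```

In this model a CPU that was not running the mm when its mappings changed does not need a remote flush; it notices the stale generation the next time it switches the mm in. Whether such a scheme fits the RISC-V requirement quoted above about consistent ASID meaning across harts is exactly the open question in the thread.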
diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
index 7acbfbd14557..0f784e3d307b 100644
--- a/arch/riscv/mm/context.c
+++ b/arch/riscv/mm/context.c
@@ -205,12 +205,24 @@ static void set_mm_noasid(struct mm_struct *mm)
 	local_flush_tlb_all();
 }
 
-static inline void set_mm(struct mm_struct *mm, unsigned int cpu)
+static inline void set_mm(struct mm_struct *prev,
+			  struct mm_struct *next, unsigned int cpu)
 {
-	if (static_branch_unlikely(&use_asid_allocator))
-		set_mm_asid(mm, cpu);
-	else
-		set_mm_noasid(mm);
+	/*
+	 * The mm_cpumask indicates which harts' TLBs contain the virtual
+	 * address mapping of the mm. Compared to noasid, using asid
+	 * can't guarantee that stale TLB entries are invalidated because
+	 * the asid mechanism wouldn't flush TLB for every switch_mm for
+	 * performance. So when using asid, keep all CPUs footmarks in
+	 * cpumask() until mm reset.
+	 */
+	cpumask_set_cpu(cpu, mm_cpumask(next));
+	if (static_branch_unlikely(&use_asid_allocator)) {
+		set_mm_asid(next, cpu);
+	} else {
+		cpumask_clear_cpu(cpu, mm_cpumask(prev));
+		set_mm_noasid(next);
+	}
 }
 
 static int __init asids_init(void)
@@ -264,7 +276,8 @@ static int __init asids_init(void)
 }
 early_initcall(asids_init);
 #else
-static inline void set_mm(struct mm_struct *mm, unsigned int cpu)
+static inline void set_mm(struct mm_struct *prev,
+			  struct mm_struct *next, unsigned int cpu)
 {
 	/* Nothing to do here when there is no MMU */
 }
@@ -317,10 +330,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	 */
 	cpu = smp_processor_id();
 
-	cpumask_clear_cpu(cpu, mm_cpumask(prev));
-	cpumask_set_cpu(cpu, mm_cpumask(next));
-
-	set_mm(next, cpu);
+	set_mm(prev, next, cpu);
 
 	flush_icache_deferred(next, cpu);
 }
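
To make the race described in the commit message concrete, here is a minimal user-space sketch that replays the CPU0/CPU1 sequence. It is a toy model, not kernel code: every toy_* name, the single cached translation per CPU, and the 0x1000/0x2000 page addresses are assumptions made up for illustration, and a plain bitmask stands in for mm_cpumask(mm).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 2
#define NO_ENTRY UINT64_MAX

/* Toy model: one mm with a single virtual page. */
struct toy_mm {
	uint64_t pte_pa;      /* physical page the PTE currently points at */
	unsigned int cpumask; /* bit n set: CPU n may hold TLB entries */
};

static uint64_t tlb[NR_CPUS]; /* cached physical page, or NO_ENTRY */

/* Switch the task in: no TLB flush, as with the ASID allocator. */
static void toy_switch_in(struct toy_mm *mm, unsigned int cpu)
{
	assert(cpu < NR_CPUS);
	mm->cpumask |= 1u << cpu;
}

/* Switch the task out; 'keep' selects the aggregating policy. */
static void toy_switch_out(struct toy_mm *mm, unsigned int cpu, bool keep)
{
	assert(cpu < NR_CPUS);
	if (!keep)
		mm->cpumask &= ~(1u << cpu); /* old behaviour: forget CPU */
}

/* A read walks the page table only on a TLB miss. */
static uint64_t toy_read(struct toy_mm *mm, unsigned int cpu)
{
	if (tlb[cpu] == NO_ENTRY)
		tlb[cpu] = mm->pte_pa;
	return tlb[cpu];
}

/* Copy-on-write: repoint the PTE, then flush only CPUs in the mask. */
static void toy_cow_write(struct toy_mm *mm, uint64_t new_pa)
{
	mm->pte_pa = new_pa;
	for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++)
		if (mm->cpumask & (1u << cpu))
			tlb[cpu] = NO_ENTRY;
}

/* Replay the sequence; returns the page CPU0 reads at the end. */
uint64_t toy_replay(bool keep)
{
	struct toy_mm mm = { .pte_pa = 0x1000, .cpumask = 0 };

	tlb[0] = tlb[1] = NO_ENTRY;

	toy_switch_in(&mm, 0);
	(void)toy_read(&mm, 0);      /* CPU0 caches the old mapping */
	toy_switch_out(&mm, 0, keep);

	toy_switch_in(&mm, 1);
	toy_cow_write(&mm, 0x2000);  /* fault on CPU1 remaps the page */
	toy_switch_out(&mm, 1, keep);

	toy_switch_in(&mm, 0);
	return toy_read(&mm, 0);
}
```

With keep set to false (the clear-on-switch behaviour before this patch), toy_replay() returns 0x1000: CPU0 was no longer in the mask when the flush ran, so it reads through its stale entry. With keep set to true (the aggregating behaviour of this patch), the flush reaches CPU0 as well and toy_replay() returns 0x2000.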