[V3] riscv: asid: Fixup stale TLB entry cause application crash

Message ID 20221111075902.798571-1-guoren@kernel.org
State New
Series [V3] riscv: asid: Fixup stale TLB entry cause application crash

Commit Message

Guo Ren Nov. 11, 2022, 7:59 a.m. UTC
  From: Guo Ren <guoren@linux.alibaba.com>

After use_asid_allocator is enabled, userspace applications can crash
due to stale TLB entries. Using cpumask_clear_cpu alone, without
local_flush_tlb_all, cannot guarantee that a CPU's TLB entries are
fresh. set_mm_asid can then let a user space application read a stale
value through a stale TLB entry; set_mm_noasid is not affected.

Here is the symptom of the bug:
unhandled signal 11 code 0x1 (coredump)
   0x0000003fd6d22524 <+4>:     auipc   s0,0x70
   0x0000003fd6d22528 <+8>:     ld      s0,-148(s0) # 0x3fd6d92490
=> 0x0000003fd6d2252c <+12>:    ld      a5,0(s0)
(gdb) i r s0
s0          0x8082ed1cc3198b21       0x8082ed1cc3198b21
(gdb) x /2x 0x3fd6d92490
0x3fd6d92490:   0xd80ac8a8      0x0000003f
The core dump shows that register s0 is wrong while the value in memory
is correct: 'ld s0,-148(s0)' went through a stale TLB mapping entry and
loaded from the wrong physical address.

While the task ran on CPU0, a load (or speculative load) of address
0x3fd6d92490 caused the first version of the mapping entry to be walked
into CPU0's TLB by the page table walker.
The task then migrated from CPU0 to CPU1 (no local_flush_tlb_all
happens here when ASID is in use) and wrote to the same address
(0x3fd6d92490). The write triggered do_page_fault -> wp_page_copy ->
ptep_clear_flush -> ptep_get_and_clear & flush_tlb_page.
flush_tlb_page uses mm_cpumask(mm) to decide which CPUs need a TLB
flush, but CPU0 had already been cleared from mm_cpumask in the earlier
switch_mm. So only CPU1's TLB was flushed before the second version of
the PTE mapping was installed. When the task migrated back from CPU1 to
CPU0, CPU0 still held the stale TLB mapping entry pointing at the old
physical address, and the bug surfaced as soon as the task read through
it.

   CPU0                               CPU1
   - switch 'task' in
   - read addr (Fill stale mapping
     entry into TLB)
   - switch 'task' out (no tlb_flush)
                                      - switch 'task' in (no tlb_flush)
                                      - write addr cause pagefault
                                        do_page_fault() (change to
                                        new addr mapping)
                                          wp_page_copy()
                                            ptep_clear_flush()
                                              ptep_get_and_clear()
                                              & flush_tlb_page()
                                        write new value into addr
                                      - switch 'task' out (no tlb_flush)
   - switch 'task' in (no tlb_flush)
   - read addr again (Use stale
     mapping entry in TLB)
     get wrong value from old physical
     addr, BUG!

The fix is to keep every CPU's footprint in mm_cpumask(mm) across
switch_mm, which guarantees that a later TLB flush reaches every CPU
that may still hold stale entries.
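The interleaving above can be replayed with a minimal userspace model. This is a sketch, not kernel code: two CPUs, one virtual address, and each TLB reduced to a single cached physical address; every name here (`run_race`, `pte_pa`, `keep_footprint`, ...) is invented for illustration.

```c
/* Toy model of the race described above. Nothing here is real kernel API. */
#include <assert.h>

#define NO_ENTRY (-1)

static int tlb[2];       /* one cached PA per CPU: the "TLB" */
static int mm_cpumask;   /* bitmask of CPUs flush_tlb_page will target */
static int pte_pa;       /* current PA in the page table */

static void switch_in(int cpu)
{
    mm_cpumask |= 1 << cpu;
}

static void switch_out(int cpu, int keep_footprint)
{
    if (!keep_footprint)               /* buggy path: cpumask_clear_cpu */
        mm_cpumask &= ~(1 << cpu);     /* ...with no local TLB flush */
}

static int read_addr(int cpu)
{
    if (tlb[cpu] == NO_ENTRY)          /* miss: walk page table, fill TLB */
        tlb[cpu] = pte_pa;
    return tlb[cpu];
}

static void flush_tlb_page(void)       /* flushes only CPUs in mm_cpumask */
{
    int cpu;

    for (cpu = 0; cpu < 2; cpu++)
        if (mm_cpumask & (1 << cpu))
            tlb[cpu] = NO_ENTRY;
}

/* Replay the diagram; returns the PA the final read on CPU0 observes. */
static int run_race(int keep_footprint)
{
    tlb[0] = tlb[1] = NO_ENTRY;
    mm_cpumask = 0;
    pte_pa = 100;                      /* first version of the mapping */

    switch_in(0);
    read_addr(0);                      /* CPU0 caches PA 100 */
    switch_out(0, keep_footprint);

    switch_in(1);
    pte_pa = 200;                      /* wp_page_copy installs a new page */
    flush_tlb_page();                  /* ptep_clear_flush */
    switch_out(1, keep_footprint);

    switch_in(0);
    return read_addr(0);               /* stale 100, or fresh 200? */
}
```

With `keep_footprint == 0` (the pre-patch behavior) the final read returns the stale PA 100; keeping the footprint makes flush_tlb_page reach CPU0 as well, and the read returns the fresh PA 200.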

Fixes: 65d4b9c53017 ("RISC-V: Implement ASID allocator")
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Signed-off-by: Guo Ren <guoren@kernel.org>
Cc: Anup Patel <apatel@ventanamicro.com>
Cc: Palmer Dabbelt <palmer@rivosinc.com>
---
Changes in v3:
 - Move set/clear cpumask(mm) into set_mm (Make code more pretty
   with Andrew's advice)
 - Optimize comment description

Changes in v2:
 - Fixup nommu compile problem (Thx Conor, Also Reported-by: kernel
   test robot <lkp@intel.com>)
 - Keep cpumask_clear_cpu for noasid
---
 arch/riscv/mm/context.c | 30 ++++++++++++++++++++----------
 1 file changed, 20 insertions(+), 10 deletions(-)
  

Comments

Andrew Jones Nov. 11, 2022, 8:29 a.m. UTC | #1
On Fri, Nov 11, 2022 at 02:59:02AM -0500, guoren@kernel.org wrote:
> From: Guo Ren <guoren@linux.alibaba.com>
> 
> After use_asid_allocator is enabled, the userspace application will
> crash by stale TLB entries. Because only using cpumask_clear_cpu without
> local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
> Then set_mm_asid would cause the user space application to get a stale
> value by stale TLB entry, but set_mm_noasid is okay.
> 
> Here is the symptom of the bug:
> unhandled signal 11 code 0x1 (coredump)
>    0x0000003fd6d22524 <+4>:     auipc   s0,0x70
>    0x0000003fd6d22528 <+8>:     ld      s0,-148(s0) # 0x3fd6d92490
> => 0x0000003fd6d2252c <+12>:    ld      a5,0(s0)
> (gdb) i r s0
> s0          0x8082ed1cc3198b21       0x8082ed1cc3198b21
> (gdb) x /2x 0x3fd6d92490
> 0x3fd6d92490:   0xd80ac8a8      0x0000003f
> The core dump file shows that register s0 is wrong, but the value in
> memory is correct. Because 'ld s0, -148(s0)' used a stale mapping entry
> in TLB and got a wrong result from an incorrect physical address.
> 
> When the task ran on CPU0, which loaded/speculative-loaded the value of
> address(0x3fd6d92490), then the first version of the mapping entry was
> PTWed into CPU0's TLB.
> When the task switched from CPU0 to CPU1 (No local_tlb_flush_all here by
> asid), it happened to write a value on the address (0x3fd6d92490). It
> caused do_page_fault -> wp_page_copy -> ptep_clear_flush ->
> ptep_get_and_clear & flush_tlb_page.
> The flush_tlb_page used mm_cpumask(mm) to determine which CPUs need TLB
> flush, but CPU0 had cleared the CPU0's mm_cpumask in the previous
> switch_mm. So we only flushed the CPU1 TLB and set the second version
> mapping of the PTE. When the task switched from CPU1 to CPU0 again, CPU0
> still used a stale TLB mapping entry which contained a wrong target
> physical address. It raised a bug when the task happened to read that
> value.
> 
>    CPU0                               CPU1
>    - switch 'task' in
>    - read addr (Fill stale mapping
>      entry into TLB)
>    - switch 'task' out (no tlb_flush)
>                                       - switch 'task' in (no tlb_flush)
>                                       - write addr cause pagefault
>                                         do_page_fault() (change to
>                                         new addr mapping)
>                                           wp_page_copy()
>                                             ptep_clear_flush()
>                                               ptep_get_and_clear()
>                                               & flush_tlb_page()
>                                         write new value into addr
>                                       - switch 'task' out (no tlb_flush)
>    - switch 'task' in (no tlb_flush)
>    - read addr again (Use stale
>      mapping entry in TLB)
>      get wrong value from old phyical
>      addr, BUG!
> 
> The solution is to keep all CPUs' footmarks of cpumask(mm) in switch_mm,
> which could guarantee to invalidate all stale TLB entries during TLB
> flush.
> 
> Fixes: 65d4b9c53017 ("RISC-V: Implement ASID allocator")
> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
> Signed-off-by: Guo Ren <guoren@kernel.org>
> Cc: Anup Patel <apatel@ventanamicro.com>
> Cc: Palmer Dabbelt <palmer@rivosinc.com>
> ---
> Changes in v3:
>  - Move set/clear cpumask(mm) into set_mm (Make code more pretty
>    with Andrew's advice)
>  - Optimize comment description
> 
> Changes in v2:
>  - Fixup nommu compile problem (Thx Conor, Also Reported-by: kernel
>    test robot <lkp@intel.com>)
>  - Keep cpumask_clear_cpu for noasid
> ---
>  arch/riscv/mm/context.c | 30 ++++++++++++++++++++----------
>  1 file changed, 20 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
> index 7acbfbd14557..0f784e3d307b 100644
> --- a/arch/riscv/mm/context.c
> +++ b/arch/riscv/mm/context.c
> @@ -205,12 +205,24 @@ static void set_mm_noasid(struct mm_struct *mm)
>  	local_flush_tlb_all();
>  }
>  
> -static inline void set_mm(struct mm_struct *mm, unsigned int cpu)
> +static inline void set_mm(struct mm_struct *prev,
> +			  struct mm_struct *next, unsigned int cpu)
>  {
> -	if (static_branch_unlikely(&use_asid_allocator))
> -		set_mm_asid(mm, cpu);
> -	else
> -		set_mm_noasid(mm);
> +	/*
> +	 * The mm_cpumask indicates which harts' TLBs contain the virtual
> +	 * address mapping of the mm. Compared to noasid, using asid
> +	 * can't guarantee that stale TLB entries are invalidated because
> +	 * the asid mechanism wouldn't flush TLB for every switch_mm for
> +	 * performance. So when using asid, keep all CPUs footmarks in
> +	 * cpumask() until mm reset.
> +	 */
> +	cpumask_set_cpu(cpu, mm_cpumask(next));
> +	if (static_branch_unlikely(&use_asid_allocator)) {
> +		set_mm_asid(next, cpu);
> +	} else {
> +		cpumask_clear_cpu(cpu, mm_cpumask(prev));
> +		set_mm_noasid(next);
> +	}
>  }
>  
>  static int __init asids_init(void)
> @@ -264,7 +276,8 @@ static int __init asids_init(void)
>  }
>  early_initcall(asids_init);
>  #else
> -static inline void set_mm(struct mm_struct *mm, unsigned int cpu)
> +static inline void set_mm(struct mm_struct *prev,
> +			  struct mm_struct *next, unsigned int cpu)
>  {
>  	/* Nothing to do here when there is no MMU */
>  }
> @@ -317,10 +330,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>  	 */
>  	cpu = smp_processor_id();
>  
> -	cpumask_clear_cpu(cpu, mm_cpumask(prev));
> -	cpumask_set_cpu(cpu, mm_cpumask(next));
> -
> -	set_mm(next, cpu);
> +	set_mm(prev, next, cpu);
>  
>  	flush_icache_deferred(next, cpu);
>  }
> -- 
> 2.36.1
>

Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
  
Sergey Matyukevich Nov. 18, 2022, 8:57 p.m. UTC | #2
Hi Guo Ren,


> After use_asid_allocator is enabled, the userspace application will
> crash by stale TLB entries. Because only using cpumask_clear_cpu without
> local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
> Then set_mm_asid would cause the user space application to get a stale
> value by stale TLB entry, but set_mm_noasid is okay.

... [snip]

> +	/*
> +	 * The mm_cpumask indicates which harts' TLBs contain the virtual
> +	 * address mapping of the mm. Compared to noasid, using asid
> +	 * can't guarantee that stale TLB entries are invalidated because
> +	 * the asid mechanism wouldn't flush TLB for every switch_mm for
> +	 * performance. So when using asid, keep all CPUs footmarks in
> +	 * cpumask() until mm reset.
> +	 */
> +	cpumask_set_cpu(cpu, mm_cpumask(next));
> +	if (static_branch_unlikely(&use_asid_allocator)) {
> +		set_mm_asid(next, cpu);
> +	} else {
> +		cpumask_clear_cpu(cpu, mm_cpumask(prev));
> +		set_mm_noasid(next);
> +	}
>  }

I observe similar user-space crashes on my SMP systems with enabled ASID.
My attempt to fix the issue was a bit different, see the following patch:

https://lore.kernel.org/linux-riscv/20220829205219.283543-1-geomatsi@gmail.com/

In brief, the idea was borrowed from flush_icache_mm handling:
- keep track of CPUs not running the task
- perform per-ASID TLB flush on such CPUs only if the task is switched there
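The deferred scheme sketched in these two bullets can be illustrated with a small userspace model (invented names, not the actual patch): flush_tlb_page flushes the CPUs currently in mm_cpumask and records a "stale" bit for the others, and a CPU with its stale bit set pays a local per-ASID flush at switch-in.

```c
/* Toy model of the deferred per-ASID flush idea. stale_mask,
 * deferred_race, etc. are hypothetical names for illustration. */
#include <assert.h>

#define NO_ENTRY (-1)

static int tlb[2];       /* one cached PA per CPU */
static int mm_cpumask;   /* CPUs currently running the mm */
static int stale_mask;   /* CPUs with a deferred flush pending */
static int pte_pa;       /* current PA in the page table */

static void flush_tlb_page(void)
{
    int cpu;

    for (cpu = 0; cpu < 2; cpu++) {
        if (mm_cpumask & (1 << cpu))
            tlb[cpu] = NO_ENTRY;       /* flush now on running CPUs */
        else
            stale_mask |= 1 << cpu;    /* defer for the others */
    }
}

static void switch_in(int cpu)
{
    mm_cpumask |= 1 << cpu;
    if (stale_mask & (1 << cpu)) {     /* pay the deferred flush here */
        stale_mask &= ~(1 << cpu);
        tlb[cpu] = NO_ENTRY;           /* models local_flush_tlb_all_asid */
    }
}

static void switch_out(int cpu)
{
    mm_cpumask &= ~(1 << cpu);         /* cpumask_clear_cpu is kept */
}

static int read_addr(int cpu)
{
    if (tlb[cpu] == NO_ENTRY)
        tlb[cpu] = pte_pa;
    return tlb[cpu];
}

static int deferred_race(void)         /* same interleaving as the bug */
{
    tlb[0] = tlb[1] = NO_ENTRY;
    mm_cpumask = stale_mask = 0;
    pte_pa = 100;

    switch_in(0);
    read_addr(0);
    switch_out(0);

    switch_in(1);
    pte_pa = 200;
    flush_tlb_page();                  /* CPU1 flushed, CPU0 marked stale */
    switch_out(1);

    switch_in(0);                      /* stale bit forces a local flush */
    return read_addr(0);
}
```

The final read observes the fresh PA: the stale entry on CPU0 is invalidated at switch-in time rather than at broadcast time.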

Your patch also works fine in my tests, fixing those crashes. I have a
question, though, about the removed cpumask_clear_cpu: how are CPUs
that no longer run the task removed from its mm_cpumask? If they are
never removed, flush_tlb_mm/flush_tlb_page will broadcast unnecessary
TLB flushes to those CPUs when ASID is enabled.

Regards,
Sergey
  
Guo Ren Nov. 19, 2022, 3:37 a.m. UTC | #3
On Sat, Nov 19, 2022 at 4:57 AM Sergey Matyukevich <geomatsi@gmail.com> wrote:
>
> Hi Guo Ren,
>
>
> > After use_asid_allocator is enabled, the userspace application will
> > crash by stale TLB entries. Because only using cpumask_clear_cpu without
> > local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
> > Then set_mm_asid would cause the user space application to get a stale
> > value by stale TLB entry, but set_mm_noasid is okay.
>
> ... [snip]
>
> > +     /*
> > +      * The mm_cpumask indicates which harts' TLBs contain the virtual
> > +      * address mapping of the mm. Compared to noasid, using asid
> > +      * can't guarantee that stale TLB entries are invalidated because
> > +      * the asid mechanism wouldn't flush TLB for every switch_mm for
> > +      * performance. So when using asid, keep all CPUs footmarks in
> > +      * cpumask() until mm reset.
> > +      */
> > +     cpumask_set_cpu(cpu, mm_cpumask(next));
> > +     if (static_branch_unlikely(&use_asid_allocator)) {
> > +             set_mm_asid(next, cpu);
> > +     } else {
> > +             cpumask_clear_cpu(cpu, mm_cpumask(prev));
> > +             set_mm_noasid(next);
> > +     }
> >  }
>
> I observe similar user-space crashes on my SMP systems with enabled ASID.
> My attempt to fix the issue was a bit different, see the following patch:
>
> https://lore.kernel.org/linux-riscv/20220829205219.283543-1-geomatsi@gmail.com/
>
> In brief, the idea was borrowed from flush_icache_mm handling:
> - keep track of CPUs not running the task
> - perform per-ASID TLB flush on such CPUs only if the task is switched there
>
> Your patch also works fine in my tests fixing those crashes. I have a
> question though, regarding removed cpumask_clear_cpu. How CPUs no more
> running the task are removed from its mm_cpumask ? If they are not
> removed, then flush_tlb_mm/flush_tlb_page will broadcast unnecessary
> TLB flushes to those CPUs when ASID is enabled.
The scheduler can migrate a task to any CPU. So keeping TLB contents in
sync with cpumask_set/clear requires additional TLB flushes, just as in
the noasid case, and your patch still follows that style. The value of
ASID is avoiding TLB flushes during the context switch. Yes, my patch
adds some TLB-flush IPI cost, but once the mapping is stable, no flush
at all is needed in switch_mm (hackbench would benefit, since its hot
path no longer triggers TLB flushes). Here are my points:
 - We copied arm64's globally unique ASID mechanism into riscv, which
depends on hardware-broadcast TLB flush. My fixup patch stays closer to
that original design, proven in the arm64 world.
 - If riscv keeps the local-TLB-flush hardware design in the ISA spec,
consider x86's per-CPU array of ASIDs instead. But that is a significant
change; let's fix the current issue with the smallest patch first.
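For readers unfamiliar with the x86 scheme mentioned here: each CPU keeps a small private table of (mm, local ASID) slots, recycling a slot (and flushing that ASID locally) when it runs out, so no cross-CPU bookkeeping is needed. A minimal sketch with invented names (NR_SLOTS, choose_asid are not proposed kernel API):

```c
/* Hypothetical per-CPU ASID slot table in the x86 style. */
#include <assert.h>

#define NR_SLOTS 4

struct slot { int mm_id; };            /* which mm owns this local ASID */

static struct slot slots[NR_SLOTS] = { {-1}, {-1}, {-1}, {-1} };
static unsigned int next_victim;

/* Pick the local ASID slot for mm_id on this CPU. *need_flush tells the
 * caller to do a local per-ASID flush because the slot was recycled and
 * may still hold another mm's translations. */
static int choose_asid(int mm_id, int *need_flush)
{
    int i;

    for (i = 0; i < NR_SLOTS; i++) {
        if (slots[i].mm_id == mm_id) {
            *need_flush = 0;           /* still ours: no flush needed */
            return i;
        }
    }
    i = next_victim++ % NR_SLOTS;      /* recycle round-robin */
    slots[i].mm_id = mm_id;
    *need_flush = 1;                   /* stale entries may exist */
    return i;
}
```

switch_mm would program the returned slot into satp.ASID and run a local per-ASID flush only when need_flush is set, which is why this style fits an ISA with local-only TLB flushes.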

Thanks again for your review and testing.
--
Best Regards
 Guo Ren
  
Sergey Matyukevich Nov. 21, 2022, 7:59 p.m. UTC | #4
> After use_asid_allocator is enabled, the userspace application will
> crash by stale TLB entries. Because only using cpumask_clear_cpu without
> local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
> Then set_mm_asid would cause the user space application to get a stale
> value by stale TLB entry, but set_mm_noasid is okay.
> 
> Here is the symptom of the bug:
> unhandled signal 11 code 0x1 (coredump)
>    0x0000003fd6d22524 <+4>:     auipc   s0,0x70
>    0x0000003fd6d22528 <+8>:     ld      s0,-148(s0) # 0x3fd6d92490
> => 0x0000003fd6d2252c <+12>:    ld      a5,0(s0)
> (gdb) i r s0
> s0          0x8082ed1cc3198b21       0x8082ed1cc3198b21
> (gdb) x /2x 0x3fd6d92490
> 0x3fd6d92490:   0xd80ac8a8      0x0000003f
> The core dump file shows that register s0 is wrong, but the value in
> memory is correct. Because 'ld s0, -148(s0)' used a stale mapping entry
> in TLB and got a wrong result from an incorrect physical address.
> 
> When the task ran on CPU0, which loaded/speculative-loaded the value of
> address(0x3fd6d92490), then the first version of the mapping entry was
> PTWed into CPU0's TLB.
> When the task switched from CPU0 to CPU1 (No local_tlb_flush_all here by
> asid), it happened to write a value on the address (0x3fd6d92490). It
> caused do_page_fault -> wp_page_copy -> ptep_clear_flush ->
> ptep_get_and_clear & flush_tlb_page.
> The flush_tlb_page used mm_cpumask(mm) to determine which CPUs need TLB
> flush, but CPU0 had cleared the CPU0's mm_cpumask in the previous
> switch_mm. So we only flushed the CPU1 TLB and set the second version
> mapping of the PTE. When the task switched from CPU1 to CPU0 again, CPU0
> still used a stale TLB mapping entry which contained a wrong target
> physical address. It raised a bug when the task happened to read that
> value.
> 
>    CPU0                               CPU1
>    - switch 'task' in
>    - read addr (Fill stale mapping
>      entry into TLB)
>    - switch 'task' out (no tlb_flush)
>                                       - switch 'task' in (no tlb_flush)
>                                       - write addr cause pagefault
>                                         do_page_fault() (change to
>                                         new addr mapping)
>                                           wp_page_copy()
>                                             ptep_clear_flush()
>                                               ptep_get_and_clear()
>                                               & flush_tlb_page()
>                                         write new value into addr
>                                       - switch 'task' out (no tlb_flush)
>    - switch 'task' in (no tlb_flush)
>    - read addr again (Use stale
>      mapping entry in TLB)
>      get wrong value from old phyical
>      addr, BUG!
> 
> The solution is to keep all CPUs' footmarks of cpumask(mm) in switch_mm,
> which could guarantee to invalidate all stale TLB entries during TLB
> flush.
> 
> Fixes: 65d4b9c53017 ("RISC-V: Implement ASID allocator")
> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
> Signed-off-by: Guo Ren <guoren@kernel.org>
> Cc: Anup Patel <apatel@ventanamicro.com>
> Cc: Palmer Dabbelt <palmer@rivosinc.com>
> ---
> Changes in v3:
>  - Move set/clear cpumask(mm) into set_mm (Make code more pretty
>    with Andrew's advice)
>  - Optimize comment description
> 
> Changes in v2:
>  - Fixup nommu compile problem (Thx Conor, Also Reported-by: kernel
>    test robot <lkp@intel.com>)
>  - Keep cpumask_clear_cpu for noasid
> ---
>  arch/riscv/mm/context.c | 30 ++++++++++++++++++++----------
>  1 file changed, 20 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
> index 7acbfbd14557..0f784e3d307b 100644
> --- a/arch/riscv/mm/context.c
> +++ b/arch/riscv/mm/context.c
> @@ -205,12 +205,24 @@ static void set_mm_noasid(struct mm_struct *mm)
>  	local_flush_tlb_all();
>  }
>  
> -static inline void set_mm(struct mm_struct *mm, unsigned int cpu)
> +static inline void set_mm(struct mm_struct *prev,
> +			  struct mm_struct *next, unsigned int cpu)
>  {
> -	if (static_branch_unlikely(&use_asid_allocator))
> -		set_mm_asid(mm, cpu);
> -	else
> -		set_mm_noasid(mm);
> +	/*
> +	 * The mm_cpumask indicates which harts' TLBs contain the virtual
> +	 * address mapping of the mm. Compared to noasid, using asid
> +	 * can't guarantee that stale TLB entries are invalidated because
> +	 * the asid mechanism wouldn't flush TLB for every switch_mm for
> +	 * performance. So when using asid, keep all CPUs footmarks in
> +	 * cpumask() until mm reset.
> +	 */
> +	cpumask_set_cpu(cpu, mm_cpumask(next));
> +	if (static_branch_unlikely(&use_asid_allocator)) {
> +		set_mm_asid(next, cpu);
> +	} else {
> +		cpumask_clear_cpu(cpu, mm_cpumask(prev));
> +		set_mm_noasid(next);
> +	}
>  }
>  
>  static int __init asids_init(void)
> @@ -264,7 +276,8 @@ static int __init asids_init(void)
>  }
>  early_initcall(asids_init);
>  #else
> -static inline void set_mm(struct mm_struct *mm, unsigned int cpu)
> +static inline void set_mm(struct mm_struct *prev,
> +			  struct mm_struct *next, unsigned int cpu)
>  {
>  	/* Nothing to do here when there is no MMU */
>  }
> @@ -317,10 +330,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>  	 */
>  	cpu = smp_processor_id();
>  
> -	cpumask_clear_cpu(cpu, mm_cpumask(prev));
> -	cpumask_set_cpu(cpu, mm_cpumask(next));
> -
> -	set_mm(next, cpu);
> +	set_mm(prev, next, cpu);
>  
>  	flush_icache_deferred(next, cpu);
>  }

Tested-by: Sergey Matyukevich <sergey.matyukevich@syntacore.com>

Thanks,
Sergey
  
Sergey Matyukevich Nov. 21, 2022, 8:03 p.m. UTC | #5
Hi Guo Ren,

> > > After use_asid_allocator is enabled, the userspace application will
> > > crash by stale TLB entries. Because only using cpumask_clear_cpu without
> > > local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
> > > Then set_mm_asid would cause the user space application to get a stale
> > > value by stale TLB entry, but set_mm_noasid is okay.
> >
> > ... [snip]
> >
> > > +     /*
> > > +      * The mm_cpumask indicates which harts' TLBs contain the virtual
> > > +      * address mapping of the mm. Compared to noasid, using asid
> > > +      * can't guarantee that stale TLB entries are invalidated because
> > > +      * the asid mechanism wouldn't flush TLB for every switch_mm for
> > > +      * performance. So when using asid, keep all CPUs footmarks in
> > > +      * cpumask() until mm reset.
> > > +      */
> > > +     cpumask_set_cpu(cpu, mm_cpumask(next));
> > > +     if (static_branch_unlikely(&use_asid_allocator)) {
> > > +             set_mm_asid(next, cpu);
> > > +     } else {
> > > +             cpumask_clear_cpu(cpu, mm_cpumask(prev));
> > > +             set_mm_noasid(next);
> > > +     }
> > >  }
> >
> > I observe similar user-space crashes on my SMP systems with enabled ASID.
> > My attempt to fix the issue was a bit different, see the following patch:
> >
> > https://lore.kernel.org/linux-riscv/20220829205219.283543-1-geomatsi@gmail.com/
> >
> > In brief, the idea was borrowed from flush_icache_mm handling:
> > - keep track of CPUs not running the task
> > - perform per-ASID TLB flush on such CPUs only if the task is switched there
> >
> > Your patch also works fine in my tests fixing those crashes. I have a
> > question though, regarding removed cpumask_clear_cpu. How CPUs no more
> > running the task are removed from its mm_cpumask ? If they are not
> > removed, then flush_tlb_mm/flush_tlb_page will broadcast unnecessary
> > TLB flushes to those CPUs when ASID is enabled.
> A task would be migrated to any CPU by the scheduler. So keeping TLB
> contents synced with cpumask_set/clear needs additional tlb_flush just
> like noasid, and your patch still follows that style. The worth of
> ASID is avoiding tlb_flush during the context switch. Yes, my patch
> would increase some tlb_flush IPI costs. But when mapping is stable,
> no tlb_flush is needed during the switch_mm (Hackbench would be
> beneficiary because no more TLB flush is needed at his hot point
> path). Here are my points:
>  - We copied the arm64 globally unique asid mechanism into riscv,
> which depends on hardware broadcast TLB flush. My fixup patch is
> closer to the original principle design, proven in the arm64 world.
>  - If riscv continues local TLB flush hw design in ISA spec, please
> try x86's per-CPU array of ASID. But that is a significant change;
> let's fix the current issue with the smallest patch first.
> 
> In the end, thx your review and test.

By the way, how did you verify the patch? Do you have a good
reproducer for this issue?

Regards,
Sergey
  
Palmer Dabbelt Dec. 8, 2022, 11:30 p.m. UTC | #6
On Fri, 18 Nov 2022 12:57:21 PST (-0800), geomatsi@gmail.com wrote:
> Hi Guo Ren,
>
>
>> After use_asid_allocator is enabled, the userspace application will
>> crash by stale TLB entries. Because only using cpumask_clear_cpu without
>> local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
>> Then set_mm_asid would cause the user space application to get a stale
>> value by stale TLB entry, but set_mm_noasid is okay.
>
> ... [snip]
>
>> +	/*
>> +	 * The mm_cpumask indicates which harts' TLBs contain the virtual
>> +	 * address mapping of the mm. Compared to noasid, using asid
>> +	 * can't guarantee that stale TLB entries are invalidated because
>> +	 * the asid mechanism wouldn't flush TLB for every switch_mm for
>> +	 * performance. So when using asid, keep all CPUs footmarks in
>> +	 * cpumask() until mm reset.
>> +	 */
>> +	cpumask_set_cpu(cpu, mm_cpumask(next));
>> +	if (static_branch_unlikely(&use_asid_allocator)) {
>> +		set_mm_asid(next, cpu);
>> +	} else {
>> +		cpumask_clear_cpu(cpu, mm_cpumask(prev));
>> +		set_mm_noasid(next);
>> +	}
>>  }
>
> I observe similar user-space crashes on my SMP systems with enabled ASID.
> My attempt to fix the issue was a bit different, see the following patch:
>
> https://lore.kernel.org/linux-riscv/20220829205219.283543-1-geomatsi@gmail.com/
>
> In brief, the idea was borrowed from flush_icache_mm handling:
> - keep track of CPUs not running the task
> - perform per-ASID TLB flush on such CPUs only if the task is switched there

That way looks better to me: leaking hartids in the ASID allocator might 
make the crashes go away, but it's just going to end up trending towards 
flushing everything and that doesn't seem like the right long-term 
solution.

So I've got that one on for-next, sorry I missed it before.

Thanks!

>
> Your patch also works fine in my tests fixing those crashes. I have a
> question though, regarding removed cpumask_clear_cpu. How CPUs no more
> running the task are removed from its mm_cpumask ? If they are not
> removed, then flush_tlb_mm/flush_tlb_page will broadcast unnecessary
> TLB flushes to those CPUs when ASID is enabled.
>
> Regards,
> Sergey
  
Guo Ren Dec. 9, 2022, 3:13 a.m. UTC | #7
On Fri, Dec 9, 2022 at 7:30 AM Palmer Dabbelt <palmer@rivosinc.com> wrote:
>
> On Fri, 18 Nov 2022 12:57:21 PST (-0800), geomatsi@gmail.com wrote:
> > Hi Guo Ren,
> >
> >
> >> After use_asid_allocator is enabled, the userspace application will
> >> crash by stale TLB entries. Because only using cpumask_clear_cpu without
> >> local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
> >> Then set_mm_asid would cause the user space application to get a stale
> >> value by stale TLB entry, but set_mm_noasid is okay.
> >
> > ... [snip]
> >
> >> +    /*
> >> +     * The mm_cpumask indicates which harts' TLBs contain the virtual
> >> +     * address mapping of the mm. Compared to noasid, using asid
> >> +     * can't guarantee that stale TLB entries are invalidated because
> >> +     * the asid mechanism wouldn't flush TLB for every switch_mm for
> >> +     * performance. So when using asid, keep all CPUs footmarks in
> >> +     * cpumask() until mm reset.
> >> +     */
> >> +    cpumask_set_cpu(cpu, mm_cpumask(next));
> >> +    if (static_branch_unlikely(&use_asid_allocator)) {
> >> +            set_mm_asid(next, cpu);
> >> +    } else {
> >> +            cpumask_clear_cpu(cpu, mm_cpumask(prev));
> >> +            set_mm_noasid(next);
> >> +    }
> >>  }
> >
> > I observe similar user-space crashes on my SMP systems with enabled ASID.
> > My attempt to fix the issue was a bit different, see the following patch:
> >
> > https://lore.kernel.org/linux-riscv/20220829205219.283543-1-geomatsi@gmail.com/
> >
> > In brief, the idea was borrowed from flush_icache_mm handling:
> > - keep track of CPUs not running the task
> > - perform per-ASID TLB flush on such CPUs only if the task is switched there
>
> That way looks better to me: leaking hartids in the ASID allocator might
> make the crashes go away, but it's just going to end up trending towards
> flushing everything and that doesn't seem like the right long-term
> solution.
The penalty in switch_mm is too heavy!!!
 - On a system with multiple NUMA nodes, it makes switch_mm_fast flush
harts unnecessarily.
 - Even when flush_range covers only one entry, it causes a full
flush_tlb_all_asid.

switch_mm_fast:
        csr_write(CSR_SATP, virt_to_pfn(mm->pgd) |
                  ((cntx & asid_mask) << SATP_ASID_SHIFT) |
                  satp_mode);

        if (need_flush_tlb)
                local_flush_tlb_all();
+#ifdef CONFIG_SMP
+        else {
+                cpumask_t *mask = &mm->context.tlb_stale_mask;+
+
+               if (cpumask_test_cpu(cpu, mask)) {
+                        cpumask_clear_cpu(cpu, mask);
+                        local_flush_tlb_all_asid(cntx & asid_mask);
// penalty in switch_mm fast path
+                }
+        }
+#endif

And See:
 static void __sbi_tlb_flush_range(struct mm_struct *mm, unsigned long start,
                                  unsigned long size, unsigned long stride)
 {
+       struct cpumask *pmask = &mm->context.tlb_stale_mask;
        struct cpumask *cmask = mm_cpumask(mm);
        unsigned int cpuid;
        bool broadcast;
@@ -44,6 +29,15 @@ static void __sbi_tlb_flush_range(struct mm_struct
*mm, unsigned long start,
        if (static_branch_unlikely(&use_asid_allocator)) {
                unsigned long asid = atomic_long_read(&mm->context.id);

+               /*
+                * TLB will be immediately flushed on harts concurrently
+                * executing this MM context. TLB flush on other harts
+                * is deferred until this MM context migrates there.
+                */
+               cpumask_setall(pmask);
                 ^^^^^^^^^^^^^^^^^^^^^^^ This defers a flush to every
hart in every NUMA node, even ones that never touched the mm. Most of
those deferred flushes then land in switch_mm_fast as pure overhead on
harts that did not need them!
+               cpumask_clear_cpu(cpuid, pmask);
+               cpumask_andnot(pmask, pmask, cmask);

Please reconsider and make a smarter decision: penalize only the harts
that have touched the mm, not all of them, and flush the whole TLB only
when it is actually needed.

The __sbi_tlb_flush_range is the slow path; keeping the fast path fast
is worth more than improving a slow one.
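The narrowing asked for here is plain mask arithmetic: defer flushes only to harts that have run the mm and are not running it now, instead of cpumask_setall over every hart. A hedged sketch (the helper name is hypothetical):

```c
/* Hypothetical helper: which harts should get a deferred flush?
 * ran_mask - harts that have ever run this mm (its mm_cpumask footprint)
 * cur_mask - harts running it right now (these are flushed immediately)
 * cpumask_setall would effectively produce ~0UL & ~cur_mask here,
 * deferring work to every hart in every NUMA node; this version defers
 * only to harts that may actually hold entries for the mm. */
#include <assert.h>

static unsigned long deferred_flush_mask(unsigned long ran_mask,
                                         unsigned long cur_mask)
{
    return ran_mask & ~cur_mask;
}
```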

>
> So I've got that one on for-next, sorry I missed it before.
>
> Thanks!
>
> >
> > Your patch also works fine in my tests fixing those crashes. I have a
> > question though, regarding removed cpumask_clear_cpu. How CPUs no more
> > running the task are removed from its mm_cpumask ? If they are not
> > removed, then flush_tlb_mm/flush_tlb_page will broadcast unnecessary
> > TLB flushes to those CPUs when ASID is enabled.
> >
> > Regards,
> > Sergey
  
Guo Ren Dec. 9, 2022, 3:15 a.m. UTC | #8
On Fri, Dec 9, 2022 at 11:13 AM Guo Ren <guoren@kernel.org> wrote:
>
> On Fri, Dec 9, 2022 at 7:30 AM Palmer Dabbelt <palmer@rivosinc.com> wrote:
> >
> > On Fri, 18 Nov 2022 12:57:21 PST (-0800), geomatsi@gmail.com wrote:
> > > Hi Guo Ren,
> > >
> > >
> > >> After use_asid_allocator is enabled, the userspace application will
> > >> crash by stale TLB entries. Because only using cpumask_clear_cpu without
> > >> local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
> > >> Then set_mm_asid would cause the user space application to get a stale
> > >> value by stale TLB entry, but set_mm_noasid is okay.
> > >
> > > ... [snip]
> > >
> > >> +    /*
> > >> +     * The mm_cpumask indicates which harts' TLBs contain the virtual
> > >> +     * address mapping of the mm. Compared to noasid, using asid
> > >> +     * can't guarantee that stale TLB entries are invalidated because
> > >> +     * the asid mechanism wouldn't flush TLB for every switch_mm for
> > >> +     * performance. So when using asid, keep all CPUs footmarks in
> > >> +     * cpumask() until mm reset.
> > >> +     */
> > >> +    cpumask_set_cpu(cpu, mm_cpumask(next));
> > >> +    if (static_branch_unlikely(&use_asid_allocator)) {
> > >> +            set_mm_asid(next, cpu);
> > >> +    } else {
> > >> +            cpumask_clear_cpu(cpu, mm_cpumask(prev));
> > >> +            set_mm_noasid(next);
> > >> +    }
> > >>  }
> > >
> > > I observe similar user-space crashes on my SMP systems with enabled ASID.
> > > My attempt to fix the issue was a bit different, see the following patch:
> > >
> > > https://lore.kernel.org/linux-riscv/20220829205219.283543-1-geomatsi@gmail.com/
> > >
> > > In brief, the idea was borrowed from flush_icache_mm handling:
> > > - keep track of CPUs not running the task
> > > - perform per-ASID TLB flush on such CPUs only if the task is switched there
> >
> > That way looks better to me: leaking hartids in the ASID allocator might
> > make the crashes go away, but it's just going to end up trending towards
> > flushing everything and that doesn't seem like the right long-term
> > solution.
> The penalty in switch_mm is too heavy!!!
>  - If the system has multiple NUMA nodes, it will cause switch_mm_fast
> to flush unnecessary harts.
>  - If flush_range covers just 1 entry, it would still cause flush_tlb_all_asid.
>
> switch_mm_fast:
>         csr_write(CSR_SATP, virt_to_pfn(mm->pgd) |
>                   ((cntx & asid_mask) << SATP_ASID_SHIFT) |
>                   satp_mode);
>
>         if (need_flush_tlb)
>                 local_flush_tlb_all();
> +#ifdef CONFIG_SMP
> +       else {
> +               cpumask_t *mask = &mm->context.tlb_stale_mask;
> +
> +               if (cpumask_test_cpu(cpu, mask)) {
> +                       cpumask_clear_cpu(cpu, mask);
> +                       local_flush_tlb_all_asid(cntx & asid_mask);
> // penalty in switch_mm fast path
> +               }
> +       }
> +#endif
>
> And See:
>  static void __sbi_tlb_flush_range(struct mm_struct *mm, unsigned long start,
>                                   unsigned long size, unsigned long stride)
>  {
> +       struct cpumask *pmask = &mm->context.tlb_stale_mask;
>         struct cpumask *cmask = mm_cpumask(mm);
>         unsigned int cpuid;
>         bool broadcast;
> @@ -44,6 +29,15 @@ static void __sbi_tlb_flush_range(struct mm_struct
> *mm, unsigned long start,
>         if (static_branch_unlikely(&use_asid_allocator)) {
>                 unsigned long asid = atomic_long_read(&mm->context.id);
>
> +               /*
> +                * TLB will be immediately flushed on harts concurrently
> +                * executing this MM context. TLB flush on other harts
> +                * is deferred until this MM context migrates there.
> +                */
> +               cpumask_setall(pmask);
>                  ^^^^^^^^^^^^^^^^^^^^^^^ It would flush all harts across
> all NUMA nodes!!! Most of those flushes are deferred to switch_mm_fast,
> so the penalty hits unnecessary harts!
> +               cpumask_clear_cpu(cpuid, pmask);
> +               cpumask_andnot(pmask, pmask, cmask);
>
> Please reconsider a bit, and make a smart decision. Only penalize the
> harts that touched the mm, not all of them. And don't flush the whole
> TLB when only some entries need to be invalidated.
>
> The __sbi_tlb_flush_range is the slow path; keeping the fast path
> performance is worth more than improving a slow one.
>
> >
> > So I've got that one on for-next, sorry I missed it before.
> >
> > Thanks!
> >
> > >
> > > Your patch also works fine in my tests fixing those crashes. I have a
> > > question though, regarding the removed cpumask_clear_cpu. How are CPUs
> > > no longer running the task removed from its mm_cpumask? If they are not
> > > removed, then flush_tlb_mm/flush_tlb_page will broadcast unnecessary
> > > TLB flushes to those CPUs when ASID is enabled.
> > >
> > > Regards,
> > > Sergey
>
>
>
> --
> Best Regards
>  Guo Ren
  
Lad, Prabhakar Dec. 23, 2022, 12:53 p.m. UTC | #9
Hi Guo,

Thank you for the patch.

On Fri, Nov 11, 2022 at 8:00 AM <guoren@kernel.org> wrote:
>
> From: Guo Ren <guoren@linux.alibaba.com>
>
> After use_asid_allocator is enabled, the userspace application will
> crash by stale TLB entries. Because only using cpumask_clear_cpu without
> local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
> Then set_mm_asid would cause the user space application to get a stale
> value by stale TLB entry, but set_mm_noasid is okay.
>
> Here is the symptom of the bug:
> unhandled signal 11 code 0x1 (coredump)
>    0x0000003fd6d22524 <+4>:     auipc   s0,0x70
>    0x0000003fd6d22528 <+8>:     ld      s0,-148(s0) # 0x3fd6d92490
> => 0x0000003fd6d2252c <+12>:    ld      a5,0(s0)
> (gdb) i r s0
> s0          0x8082ed1cc3198b21       0x8082ed1cc3198b21
> (gdb) x /2x 0x3fd6d92490
> 0x3fd6d92490:   0xd80ac8a8      0x0000003f
> The core dump file shows that register s0 is wrong, but the value in
> memory is correct. Because 'ld s0, -148(s0)' used a stale mapping entry
> in TLB and got a wrong result from an incorrect physical address.
>
> When the task ran on CPU0, which loaded/speculative-loaded the value of
> address(0x3fd6d92490), then the first version of the mapping entry was
> PTWed into CPU0's TLB.
> When the task switched from CPU0 to CPU1 (No local_tlb_flush_all here by
> asid), it happened to write a value on the address (0x3fd6d92490). It
> caused do_page_fault -> wp_page_copy -> ptep_clear_flush ->
> ptep_get_and_clear & flush_tlb_page.
> The flush_tlb_page used mm_cpumask(mm) to determine which CPUs need TLB
> flush, but CPU0 had cleared the CPU0's mm_cpumask in the previous
> switch_mm. So we only flushed the CPU1 TLB and set the second version
> mapping of the PTE. When the task switched from CPU1 to CPU0 again, CPU0
> still used a stale TLB mapping entry which contained a wrong target
> physical address. It raised a bug when the task happened to read that
> value.
>
>    CPU0                               CPU1
>    - switch 'task' in
>    - read addr (Fill stale mapping
>      entry into TLB)
>    - switch 'task' out (no tlb_flush)
>                                       - switch 'task' in (no tlb_flush)
>                                       - write addr cause pagefault
>                                         do_page_fault() (change to
>                                         new addr mapping)
>                                           wp_page_copy()
>                                             ptep_clear_flush()
>                                               ptep_get_and_clear()
>                                               & flush_tlb_page()
>                                         write new value into addr
>                                       - switch 'task' out (no tlb_flush)
>    - switch 'task' in (no tlb_flush)
>    - read addr again (Use stale
>      mapping entry in TLB)
>      get wrong value from old physical
>      addr, BUG!
>
> The solution is to keep all CPUs' footmarks in cpumask(mm) in switch_mm,
> which guarantees that all stale TLB entries are invalidated during TLB
> flush.
>
> Fixes: 65d4b9c53017 ("RISC-V: Implement ASID allocator")
> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
> Signed-off-by: Guo Ren <guoren@kernel.org>
> Cc: Anup Patel <apatel@ventanamicro.com>
> Cc: Palmer Dabbelt <palmer@rivosinc.com>
> ---
> Changes in v3:
>  - Move set/clear cpumask(mm) into set_mm (Make code more pretty
>    with Andrew's advice)
>  - Optimize comment description
>
> Changes in v2:
>  - Fixup nommu compile problem (Thx Conor, Also Reported-by: kernel
>    test robot <lkp@intel.com>)
>  - Keep cpumask_clear_cpu for noasid
> ---
>  arch/riscv/mm/context.c | 30 ++++++++++++++++++++----------
>  1 file changed, 20 insertions(+), 10 deletions(-)
>
As reported on patch [0], I was seeing consistent failures on the
RZ/Five SoC while running the bonnie++ utility. After applying this patch
on top of Palmer's for-next branch (eb67d239f3aa) I am no longer
seeing this issue.

Tested-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>

[0] https://patchwork.kernel.org/project/linux-riscv/patch/20220829205219.283543-1-geomatsi@gmail.com/

Cheers,
Prabhakar

> diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
> index 7acbfbd14557..0f784e3d307b 100644
> --- a/arch/riscv/mm/context.c
> +++ b/arch/riscv/mm/context.c
> @@ -205,12 +205,24 @@ static void set_mm_noasid(struct mm_struct *mm)
>         local_flush_tlb_all();
>  }
>
> -static inline void set_mm(struct mm_struct *mm, unsigned int cpu)
> +static inline void set_mm(struct mm_struct *prev,
> +                         struct mm_struct *next, unsigned int cpu)
>  {
> -       if (static_branch_unlikely(&use_asid_allocator))
> -               set_mm_asid(mm, cpu);
> -       else
> -               set_mm_noasid(mm);
> +       /*
> +        * The mm_cpumask indicates which harts' TLBs contain the virtual
> +        * address mapping of the mm. Compared to noasid, using asid
> +        * can't guarantee that stale TLB entries are invalidated because
> +        * the asid mechanism wouldn't flush TLB for every switch_mm for
> +        * performance. So when using asid, keep all CPUs footmarks in
> +        * cpumask() until mm reset.
> +        */
> +       cpumask_set_cpu(cpu, mm_cpumask(next));
> +       if (static_branch_unlikely(&use_asid_allocator)) {
> +               set_mm_asid(next, cpu);
> +       } else {
> +               cpumask_clear_cpu(cpu, mm_cpumask(prev));
> +               set_mm_noasid(next);
> +       }
>  }
>
>  static int __init asids_init(void)
> @@ -264,7 +276,8 @@ static int __init asids_init(void)
>  }
>  early_initcall(asids_init);
>  #else
> -static inline void set_mm(struct mm_struct *mm, unsigned int cpu)
> +static inline void set_mm(struct mm_struct *prev,
> +                         struct mm_struct *next, unsigned int cpu)
>  {
>         /* Nothing to do here when there is no MMU */
>  }
> @@ -317,10 +330,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>          */
>         cpu = smp_processor_id();
>
> -       cpumask_clear_cpu(cpu, mm_cpumask(prev));
> -       cpumask_set_cpu(cpu, mm_cpumask(next));
> -
> -       set_mm(next, cpu);
> +       set_mm(prev, next, cpu);
>
>         flush_icache_deferred(next, cpu);
>  }
> --
> 2.36.1
>
>
> _______________________________________________
> linux-riscv mailing list
> linux-riscv@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv
  
Zong Li Feb. 23, 2023, 5:57 p.m. UTC | #10
Lad, Prabhakar <prabhakar.csengg@gmail.com> 於 2022年12月23日 週五 下午8:54寫道:
>
> Hi Guo,
>
> Thank you for the patch.
>
> On Fri, Nov 11, 2022 at 8:00 AM <guoren@kernel.org> wrote:
> >
> > From: Guo Ren <guoren@linux.alibaba.com>
> >
> > ... [snip]
> >
> As reported on the patch [0] I was seeing consistent failures on the
> RZ/Five SoC while running bonnie++ utility. After applying this patch
> on top of Palmer's for-next branch (eb67d239f3aa) I am no longer
> seeing this issue.
>
> Tested-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
>
> [0] https://patchwork.kernel.org/project/linux-riscv/patch/20220829205219.283543-1-geomatsi@gmail.com/
>

Hi all,
I got the same situation (i.e. unhandled signal 11) on our internal
multi-core system. I tried patches [0] & [1], but the crashes still
occur, so I guess there are still some potential problems. After
applying this patch, the crashes disappeared. I took some time to look
at other arches' implementations; for example, arc doesn't clear the
mm_cpumask due to a similar issue. I can't say which approach might be
better, but I'd like to point out that this patch works for me.
Thanks.

Tested-by: Zong Li <zong.li@sifive.com>

[0] https://lore.kernel.org/linux-riscv/20220829205219.283543-1-geomatsi@gmail.com/
[1] https://lore.kernel.org/linux-riscv/20230129211818.686557-1-geomatsi@gmail.com/

> Cheers,
> Prabhakar
>
> > ... [snip]
  
Guo Ren Feb. 23, 2023, 11:20 p.m. UTC | #11
On Fri, Feb 24, 2023 at 1:58 AM Zong Li <zongbox@gmail.com> wrote:
>
> Lad, Prabhakar <prabhakar.csengg@gmail.com> 於 2022年12月23日 週五 下午8:54寫道:
> >
> > ... [snip]
> >
>
> Hi all,
> I got the same situation (i.e. unhandle signal 11) on our internal
> multi-core system, I tried the patch[0] & [1], but it still doesn't
> work, I guess there are still some potential problems. After applying
> this patch, the situation disappeared, I took some time to look at
> other arches' implementations, such as arc, they don't clear the
> mm_cpumask due to the similar issue. I can't say which approach might
> be better, but I'd like to point out that this patch works to me.
> Thanks.
Thx for the report.

I think the patch is simpler and more direct to the problem than the
current fixup approach, although the maintainer considered it a
temporary solution.

This patch has been tested in our production environment for many
thousands of hours. You can trust it :)

>
> Tested-by: Zong Li <zong.li@sifive.com>


>
> [0] https://lore.kernel.org/linux-riscv/20220829205219.283543-1-geomatsi@gmail.com/
> [1] https://lore.kernel.org/linux-riscv/20230129211818.686557-1-geomatsi@gmail.com/
>
> > Cheers,
> > Prabhakar
> >
> > > ... [snip]
  
Sergey Matyukevich Feb. 25, 2023, 7:29 p.m. UTC | #12
On Fri, Feb 24, 2023 at 01:57:55AM +0800, Zong Li wrote:
> Lad, Prabhakar <prabhakar.csengg@gmail.com> 於 2022年12月23日 週五 下午8:54寫道:
> >
> > ... [snip]
> >
> 
> Hi all,
> I got the same situation (i.e. unhandled signal 11) on our internal
> multi-core system. I tried patches [0] & [1], but the crashes still
> occur, so I guess there are still some potential problems. After
> applying this patch, the crashes disappeared. I took some time to look
> at other arches' implementations; for example, arc doesn't clear the
> mm_cpumask due to a similar issue. I can't say which approach might be
> better, but I'd like to point out that this patch works for me.
> Thanks.
> 
> Tested-by: Zong Li <zong.li@sifive.com>
> 
> [0] https://lore.kernel.org/linux-riscv/20220829205219.283543-1-geomatsi@gmail.com/
> [1] https://lore.kernel.org/linux-riscv/20230129211818.686557-1-geomatsi@gmail.com/

Thanks for the report! By the way, could you please share some
information about the reproducing workload?

The initial idea was to reduce the number of TLB flushes by deferring
(and possibly avoiding) some of them. But we already have bug reports
from two different vendors, so apparently something is overlooked here.
Let's switch to the 'aggregating' mm_cpumask approach suggested by Guo Ren.

@Guo Ren, do you mind if I re-send your v3 patch together with the
remaining reverts of my changes?
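For reference, here is a toy model of how the deferred scheme was meant
to behave (plain Python, my simplification; the names below merely stand
in for the kernel primitives): remote harts are only marked stale on a
flush, and each pays a local ASID flush the next time the mm is switched
in there, so the stale read is avoided at the cost of the fast-path check.

```python
ADDR = 0x3fd6d92490


class Hart:
    def __init__(self):
        self.tlb = {}                      # va -> pa, private per-hart TLB

    def load(self, page_table, va):
        if va not in self.tlb:             # TLB miss: page-table walk
            self.tlb[va] = page_table[va]
        return self.tlb[va]


def switch_in(hart_id, hart, stale_mask):
    # switch_mm_fast: pay a local flush only if this hart was marked stale
    if hart_id in stale_mask:
        stale_mask.discard(hart_id)
        hart.tlb.clear()                   # local_flush_tlb_all_asid()


page_table = {ADDR: "pa_v1"}
harts = [Hart(), Hart()]
stale_mask = set()

switch_in(0, harts[0], stale_mask)
harts[0].load(page_table, ADDR)            # CPU0 caches the v1 mapping
switch_in(1, harts[1], stale_mask)         # task migrates to CPU1

page_table[ADDR] = "pa_v2"                 # CoW fault installs a new PTE
stale_mask.update({0})                     # defer: mark the other hart stale
harts[1].tlb.pop(ADDR, None)               # flush only the current hart

switch_in(0, harts[0], stale_mask)         # back on CPU0: deferred local flush
print(harts[0].load(page_table, ADDR))     # pa_v2 -- no stale read
```

The open question raised in this thread is not correctness of this path
but the cost of marking too many harts stale on NUMA systems.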

Regards,
Sergey
  
Guo Ren Feb. 26, 2023, 4:24 a.m. UTC | #13
On Sat, Feb 25, 2023 at 2:29 PM Sergey Matyukevich <geomatsi@gmail.com> wrote:
>
> [...]
>
> @Guo Ren, do you mind if I re-send your v3 patch together with the
> remaining reverts of my changes ?
Okay, thanks for taking care of it. Let's land the workaround first and then improve it.

Actually, the current riscv asid allocator was taken from arm64, which
assumes hardware broadcast invalidation. Maybe we need to consider the
x86 per-cpu asid pool approach.

>
> Regards,
> Sergey
  
Gary Guo Feb. 27, 2023, 10:40 p.m. UTC | #14
On Sat, 25 Feb 2023 23:24:40 -0500
Guo Ren <guoren@kernel.org> wrote:

> [...]
>
> Actually, the current riscv asid is from arm64 with hardware broadcast
> requirements. Maybe we need to consider x86 per-cpu asid pool way.

It should be noted that the spec expects supervisor software to
use a consistent meaning of non-zero ASIDs across different harts.

Also, a vendor could implement efficient hardware broadcasting ASID
invalidation with a custom instruction and expose it via SBI.

Best,
Gary
  
Guo Ren Feb. 28, 2023, 3:12 a.m. UTC | #15
On Tue, Feb 28, 2023 at 6:40 AM Gary Guo <gary@garyguo.net> wrote:
>
> [...]
>
> It should be noted that the spec expects supervisor software to
> use a consistent meaning of non-zero ASIDs across different harts.
>
> Also, a vendor could implement efficient hardware broadcasting ASID
> invalidation with a custom instruction and expose it via SBI.
I agree with you; actually, our XuanTie cores support hardware broadcast
invalidation, and we expect an SBI-style interface for it.

The x86 style would be another choice for the future, and riscv would
then combine different TLB maintenance styles into one Linux
architecture.

----

Let me give an outline of the current riscv state:

1. Current Linux riscv uses a unified ASID design, which comes from the
arm64 hw-broadcast one. Some riscv vendors (XuanTie) also support TLB
hw broadcast.

2. The riscv spec doesn't recommend hw broadcast because it's unsuitable
for large-scale systems. So the x86 style would be another choice for
future Linux riscv.
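As a very rough user-space illustration of the per-cpu pool idea (all
names and sizes here are hypothetical, not kernel code): each hart hands
out ASIDs from its own small pool, and only a *local* flush is needed
when a slot gets recycled, so correctness no longer depends on
mm_cpumask bookkeeping across CPUs:

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS   4
#define NR_ASIDS  8                     /* tiny pool to force recycling */

/* Per-CPU ASID pool: each CPU hands out ASIDs independently. */
struct cpu_asid_pool {
	unsigned int next;              /* round-robin allocation cursor */
	const void *owner[NR_ASIDS];    /* mm currently bound to each slot */
};

static struct cpu_asid_pool pools[NR_CPUS];
static int flush_count[NR_CPUS];        /* counts local TLB flushes */

static void local_flush_tlb_all(int cpu)
{
	flush_count[cpu]++;
}

/*
 * Get an ASID for 'mm' on 'cpu'. When a slot is recycled from another
 * mm, a local flush is enough to kill stale entries tagged with that
 * ASID -- no cross-CPU shootdown and no shared-generation bookkeeping.
 */
static unsigned int cpu_asid_get(int cpu, const void *mm)
{
	struct cpu_asid_pool *p = &pools[cpu];
	unsigned int asid;

	for (asid = 0; asid < NR_ASIDS; asid++)
		if (p->owner[asid] == mm)
			return asid;    /* mm already has a slot here */

	asid = p->next++ % NR_ASIDS;
	if (p->owner[asid] != NULL)
		local_flush_tlb_all(cpu); /* recycled slot: drop stale entries */
	p->owner[asid] = mm;
	return asid;
}
```

This is only a sketch of the allocation policy; a real implementation
would also have to deal with ASID 0, preemption, and rollover counters.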

Correct? Welcome feedback.

>
> Best,
> Gary
  

Patch

diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
index 7acbfbd14557..0f784e3d307b 100644
--- a/arch/riscv/mm/context.c
+++ b/arch/riscv/mm/context.c
@@ -205,12 +205,24 @@  static void set_mm_noasid(struct mm_struct *mm)
 	local_flush_tlb_all();
 }
 
-static inline void set_mm(struct mm_struct *mm, unsigned int cpu)
+static inline void set_mm(struct mm_struct *prev,
+			  struct mm_struct *next, unsigned int cpu)
 {
-	if (static_branch_unlikely(&use_asid_allocator))
-		set_mm_asid(mm, cpu);
-	else
-		set_mm_noasid(mm);
+	/*
+	 * The mm_cpumask indicates which harts' TLBs contain the virtual
+	 * address mapping of the mm. Compared to noasid, using asid
+	 * can't guarantee that stale TLB entries are invalidated because
+	 * the asid mechanism wouldn't flush TLB for every switch_mm for
+	 * performance. So when using asid, keep all CPUs footmarks in
+	 * cpumask() until mm reset.
+	 */
+	cpumask_set_cpu(cpu, mm_cpumask(next));
+	if (static_branch_unlikely(&use_asid_allocator)) {
+		set_mm_asid(next, cpu);
+	} else {
+		cpumask_clear_cpu(cpu, mm_cpumask(prev));
+		set_mm_noasid(next);
+	}
 }
 
 static int __init asids_init(void)
@@ -264,7 +276,8 @@  static int __init asids_init(void)
 }
 early_initcall(asids_init);
 #else
-static inline void set_mm(struct mm_struct *mm, unsigned int cpu)
+static inline void set_mm(struct mm_struct *prev,
+			  struct mm_struct *next, unsigned int cpu)
 {
 	/* Nothing to do here when there is no MMU */
 }
@@ -317,10 +330,7 @@  void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	 */
 	cpu = smp_processor_id();
 
-	cpumask_clear_cpu(cpu, mm_cpumask(prev));
-	cpumask_set_cpu(cpu, mm_cpumask(next));
-
-	set_mm(next, cpu);
+	set_mm(prev, next, cpu);
 
 	flush_icache_deferred(next, cpu);
 }
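To make the race in the commit message concrete, here is a tiny
user-space simulation (a sketch with made-up names and a simplified COW
sequence, not kernel code): two "CPUs" each cache one translation, and
flush_tlb_page() only invalidates CPUs still present in the cpumask.
Clearing the mask on switch-out reproduces the stale read; keeping the
footmarks lets the flush reach CPU0.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NR_CPUS 2

/* One-entry "TLB" per CPU: a cached translation for our single page. */
struct tlb_entry { bool valid; int pa; };

static struct tlb_entry tlb[NR_CPUS];
static bool cpumask[NR_CPUS];   /* which CPUs may cache this mm's entries */
static int pte_pa;              /* the page table's current mapping */
static int memory[2];           /* two "physical" frames */

static int cpu_read(int cpu)
{
	if (!tlb[cpu].valid) {  /* TLB miss: walk the page table */
		tlb[cpu].pa = pte_pa;
		tlb[cpu].valid = true;
	}
	return memory[tlb[cpu].pa]; /* a TLB hit may use a stale pa! */
}

/* flush_tlb_page(): only CPUs present in the cpumask get flushed. */
static void flush_tlb_page(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpumask[cpu])
			tlb[cpu].valid = false;
}

/* wp_page_copy(): copy to a new frame, flush, remap, write new value. */
static void cow_write(int val)
{
	memory[1] = memory[0];
	flush_tlb_page();       /* ~ptep_clear_flush() */
	pte_pa = 1;             /* remap the page to the new frame */
	memory[1] = val;
}

/* Run the CPU0 -> CPU1 -> CPU0 schedule from the commit message. */
static int run(bool keep_footmarks)
{
	memset(tlb, 0, sizeof(tlb));
	memset(cpumask, 0, sizeof(cpumask));
	pte_pa = 0;
	memory[0] = 1;

	cpumask[0] = true;      /* switch 'task' in on CPU0 */
	cpu_read(0);            /* CPU0 caches the old translation */
	if (!keep_footmarks)
		cpumask[0] = false; /* buggy path: forget CPU0 on switch-out */

	cpumask[1] = true;      /* switch 'task' in on CPU1 */
	cow_write(2);           /* write fault: COW remaps the page */

	return cpu_read(0);     /* back on CPU0: what do we see? */
}
```

With the mask cleared, CPU0's cached entry survives the flush and the
read returns the value from the old frame, matching the observed bug.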