Message ID | 20221122195329.252654-4-namit@vmware.com |
---|---|
State | New |
Series | kprobes: notrace enhancements |
Commit Message
Nadav Amit
Nov. 22, 2022, 7:53 p.m. UTC
From: Nadav Amit <namit@vmware.com>

Functions that are marked as "inline" are currently also not traceable. Apparently, this was done to prevent differences between configs that caused different functions to be traceable on different platforms.

In any case, this consideration is not very strong, and tying "inline" to "notrace" does not seem very beneficial. The "inline" keyword is just a hint, and many functions are currently not traceable for this reason alone.

Disconnect "inline" from "notrace".

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 include/linux/compiler_types.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
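[Editorial note: a minimal sketch of the "inline is just a hint" point above. The helper name and the address-taking are hypothetical, not part of the patch.]

/*
 * "inline" is only a request.  The compiler may still emit an
 * out-of-line copy, e.g. when the function's address is taken, and
 * that copy gets the regular -pg/fentry prologue, making it a valid
 * ftrace target unless notrace (previously implied by "inline")
 * suppresses the instrumentation.
 */
static inline int hypothetical_helper(int x)
{
	return x * 2 + 1;
}

/* Taking the address forces an out-of-line instance to be emitted. */
static int (*hypothetical_fp)(int) = hypothetical_helper;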
Comments
On Tue, Nov 22, 2022, at 20:53, Nadav Amit wrote:

> From: Nadav Amit <namit@vmware.com>
>
> Functions that are marked as "inline" are currently also not traceable.
> Apparently, this was done to prevent differences between configs that
> caused different functions to be traceable on different platforms.
>
> In any case, this consideration is not very strong, and tying "inline"
> to "notrace" does not seem very beneficial. The "inline" keyword is
> just a hint, and many functions are currently not traceable for this
> reason alone.

The original reason was given in commit 93b3cca1ccd3 ("ftrace: Make all inline tags also include notrace"), which describes:

    Commit 5963e317b1e9d2a ("ftrace/x86: Do not change stacks in DEBUG when
    calling lockdep") prevented lockdep calls from the int3 breakpoint handler
    from reseting the stack if a function that was called was in the process
    of being converted for tracing and had a breakpoint on it. The idea is,
    before calling the lockdep code, do a load_idt() to the special IDT that
    kept the breakpoint stack from reseting. This worked well as a quick fix
    for this kernel release, until a certain config caused a lockup in the
    function tracer start up tests.

    Investigating it, I found that the load_idt that was used to prevent
    the int3 from changing stacks was itself being traced!

This sounds like a much stronger reason than the one you describe, and I would expect your change to cause regressions in similar places.

It's possible that the right answer is that the affected functions should be marked as __always_inline.

     Arnd
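[Editorial note: a minimal sketch of what Arnd's suggestion could look like. The function name is hypothetical; __always_inline is the kernel's forced-inlining annotation.]

#include <linux/compiler.h>

/*
 * Forcing inlining means no out-of-line, patchable copy of the
 * function can exist, so it can never become an ftrace target - the
 * property that load_idt() relied on in the commit quoted above.
 */
static __always_inline void critical_int3_helper(void)
{
	/* code that must never be traced */
}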
On Tue, 22 Nov 2022 21:09:08 +0100, "Arnd Bergmann" <arnd@arndb.de> wrote:

> On Tue, Nov 22, 2022, at 20:53, Nadav Amit wrote:
> > Functions that are marked as "inline" are currently also not
> > traceable. [ ... ]
>
> The original reason was given in commit 93b3cca1ccd3 ("ftrace: Make all
> inline tags also include notrace"). [ commit description quoted above ]
>
> This sounds like a much stronger reason than the one you describe, and
> I would expect your change to cause regressions in similar places.
>
> It's possible that the right answer is that the affected functions
> should be marked as __always_inline.

Actually, this requirement may not be needed as much today. There has been a lot of work in the last 10 years (since that commit was added) to make ftrace much more robust. We could remove the notrace from inline and then see where it breaks ;-) But I'm guessing that it's probably not as much of an issue as it was before.

Although it may cause some splats with noinstr, I think those will be caught at compile time.

-- Steve
On Nov 22, 2022, at 12:09 PM, Arnd Bergmann <arnd@arndb.de> wrote:

> On Tue, Nov 22, 2022, at 20:53, Nadav Amit wrote:
> > Functions that are marked as "inline" are currently also not
> > traceable. [ ... ]
>
> The original reason was given in commit 93b3cca1ccd3 ("ftrace: Make all
> inline tags also include notrace"). [ commit description quoted above ]
>
> This sounds like a much stronger reason than the one you describe, and
> I would expect your change to cause regressions in similar places.

I had no intention of misrepresenting. That was my understanding from my previous discussion with Steven.

I assume that this patch might cause some regressions. The first patch in this series was intended to prevent some regressions that I encountered. This patch also marks a function that was missing "notrace" before, and I did get kernel hangs due to the missing "notrace".

Anyhow, I believe that the alternative - leaving things as they are ("inline" -> "notrace") - is even worse. Obviously it prevents proper tracing, as there are even system calls that use inline, for instance __do_sys_process_madvise() and __do_sys_mremap(). But more importantly, the current "inline" -> "notrace" solution just papers over missing "notrace" annotations. Anyone can remove the "inline" at any given moment, since there is no direct (or indirect) relationship between "inline" and "notrace". It all seems random to me and bound to fail at some point.

> It's possible that the right answer is that the affected functions
> should be marked as __always_inline.

I think that it is probably better to mark them as notrace. People might remove __always_inline. If we want two versions - one traceable and one not - we can also do that. But I am not sure how many people are aware of the relationship between inline/__always_inline and tracing.
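[Editorial note: a sketch of the explicit annotation Nadav argues for instead. The function name is hypothetical; notrace is the kernel's no_instrument_function attribute.]

#include <linux/compiler.h>

/*
 * With an explicit notrace, the no-trace property survives even if
 * someone later removes the "inline" hint, because it no longer
 * depends on the implicit inline -> notrace coupling.
 */
static inline notrace void idt_switch_helper(void)
{
	/* safe whether or not the compiler emits an out-of-line copy */
}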
On Nov 22, 2022, at 12:51 PM, Nadav Amit <namit@vmware.com> wrote:

> But more importantly, the current "inline" -> "notrace" solution just
> papers over missing "notrace" annotations. Anyone can remove the
> "inline" at any given moment, since there is no direct (or indirect)
> relationship between "inline" and "notrace". It all seems random to me
> and bound to fail at some point.

Peter, Steven, (and others),

Beyond the issues that are addressed in this patch-set, I encountered one more, which reiterates the fact that the heuristic of marking "inline" functions as "notrace" is not good enough.

Before I send a patch, I would like to get your feedback. I include a splat below. It appears the execution might get stuck, since some functions that are used for function tracing can be traced themselves.

For example, __kernel_text_address() and unwind_get_return_address() are traceable. I think that we need to disallow the tracing of all functions that are called directly or indirectly from function_stack_trace_call() (i.e., all functions in the dynamic extent of function_stack_trace_call()).

In the absence of a proper automated static-analysis tool for the matter, I suggest the following solution, but I would like to check that you are ok with the granularity of the "notrace" that I propose. Again, note that this is not caused by this "inline" patch; it is an issue that existed before.

-- >8 --

Author: Nadav Amit <namit@vmware.com>
Date:   Tue Nov 29 02:25:12 2022 +0000

    trace: Disable tracing of code called from function_stack_trace_call()

    Signed-off-by: Nadav Amit <namit@vmware.com>

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index a20a5ebfacd7..185933222d08 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -21,6 +21,10 @@ CFLAGS_REMOVE_ftrace.o = -pg
 CFLAGS_REMOVE_early_printk.o = -pg
 CFLAGS_REMOVE_head64.o = -pg
 CFLAGS_REMOVE_sev.o = -pg
+CFLAGS_REMOVE_unwind_frame.o = -pg
+CFLAGS_REMOVE_unwind_guess.o = -pg
+CFLAGS_REMOVE_unwind_orc.o = -pg
+CFLAGS_REMOVE_stacktrace.o = -pg
 endif
 
 KASAN_SANITIZE_head$(BITS).o := n
diff --git a/kernel/Makefile b/kernel/Makefile
index 318789c728d3..d688eab1e1f8 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -19,6 +19,8 @@ obj-$(CONFIG_MULTIUSER) += groups.o
 ifdef CONFIG_FUNCTION_TRACER
 # Do not trace internal ftrace files
 CFLAGS_REMOVE_irq_work.o = $(CC_FLAGS_FTRACE)
+CFLAGS_REMOVE_stacktrace.o = $(CC_FLAGS_FTRACE)
+CFLAGS_REMOVE_extable.o = $(CC_FLAGS_FTRACE)
 endif
 
 # Prevents flicker of uninteresting __do_softirq()/__local_bh_disable_ip()

---

[531394.447185] sched: RT throttling activated
[531394.874541] NMI watchdog: Watchdog detected hard LOCKUP on cpu 26
[531394.874745] Modules linked in: zram
[531394.875224] CPU: 26 PID: 0 Comm: swapper/26 Not tainted 6.0.0-rc6umem+ #29
[531394.875426] Hardware name: Cisco Systems Inc UCSC-C220-M5SX/UCSC-C220-M5SX, BIOS C220M5.4.0.1i.0.0522190226 05/22/2019
[531394.875623] RIP: 0010:poke_int3_handler (arch/x86/kernel/alternative.c:1435)
[531394.875903] Code: 45 01 48 8b 0d c8 0a 45 01 49 8d 70 ff 83 f8 01 7f 1c 48 63 39 31 c0 48 81 c7 00 00 00 81 48 39 fe 74 3c f0 ff 0d b3 0a 45 01 <c3> 31 c0 c3 49 89 ca 49 89 c1 49 d1 e9 4c 89 c9 48 c1 e1 04 4c 01

All code
========
   0: 45 01 48 8b             add    %r9d,-0x75(%r8)
   4: 0d c8 0a 45 01          or     $0x1450ac8,%eax
   9: 49 8d 70 ff             lea    -0x1(%r8),%rsi
   d: 83 f8 01                cmp    $0x1,%eax
  10: 7f 1c                   jg     0x2e
  12: 48 63 39                movslq (%rcx),%rdi
  15: 31 c0                   xor    %eax,%eax
  17: 48 81 c7 00 00 00 81    add    $0xffffffff81000000,%rdi
  1e: 48 39 fe                cmp    %rdi,%rsi
  21: 74 3c                   je     0x5f
  23: f0 ff 0d b3 0a 45 01    lock decl 0x1450ab3(%rip)        # 0x1450add
  2a:* c3                     ret    <-- trapping instruction
  2b: 31 c0                   xor    %eax,%eax
  2d: c3                      ret
  2e: 49 89 ca                mov    %rcx,%r10
  31: 49 89 c1                mov    %rax,%r9
  34: 49 d1 e9                shr    %r9
  37: 4c 89 c9                mov    %r9,%rcx
  3a: 48 c1 e1 04             shl    $0x4,%rcx
  3e: 4c                      rex.WR
  3f: 01                      .byte 0x1

Code starting with the faulting instruction
===========================================
   0: c3                      ret
   1: 31 c0                   xor    %eax,%eax
   3: c3                      ret
   4: 49 89 ca                mov    %rcx,%r10
   7: 49 89 c1                mov    %rax,%r9
   a: 49 d1 e9                shr    %r9
   d: 4c 89 c9                mov    %r9,%rcx
  10: 48 c1 e1 04             shl    $0x4,%rcx
  14: 4c                      rex.WR
  15: 01                      .byte 0x1

[531394.876031] RSP: 0018:ffffc9000cd387b0 EFLAGS: 00000003
[531394.876320] RAX: 0000000000000001 RBX: 0000000000000000 RCX: ffffc9000cd388a8
[531394.876482] RDX: ffffc9000cd387d8 RSI: ffffffff812ba310 RDI: ffffffffc0409094
[531394.876652] RBP: ffffc9000cd387c8 R08: ffffffffc0409099 R09: 0000000000000000
[531394.876782] R10: 0000000000000000 R11: 0000000000000000 R12: ffffc9000cd387d8
[531394.876927] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[531394.877180] FS: 0000000000000000(0000) GS:ffff88afdf900000(0000) knlGS:0000000000000000
[531394.877371] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[531394.877503] CR2: 00007efd1e47b01c CR3: 0000000006a0a001 CR4: 00000000007706e0
[531394.877641] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[531394.877822] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[531394.877993] PKRU: 55555554
[531394.878164] Call Trace:
[531394.878356]  <IRQ>
[531394.878573]  ? exc_int3 (arch/x86/kernel/traps.c:817)
[531394.879306]  asm_exc_int3 (./arch/x86/include/asm/idtentry.h:569)
[531394.879725] RIP: 0010:function_stack_trace_call (kernel/trace/trace_functions.c:219)
[531394.880058] Code: 2b 81 48 c7 c7 e0 e2 33 83 e8 1c fa fd ff 48 c7 c7 80 e6 1c 83 e8 40 61 fe 00 5d c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc <55> 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec 08 4c 8b 62 18 41

All code
========
   0: 2b 81 48 c7 c7 e0       sub    -0x1f3838b8(%rcx),%eax
   6: e2 33                   loop   0x3b
   8: 83 e8 1c                sub    $0x1c,%eax
   b: fa                      cli
   c: fd                      std
   d: ff 48 c7                decl   -0x39(%rax)
  10: c7 80 e6 1c 83 e8 40    movl   $0xfe6140,-0x177ce31a(%rax)
  17: 61 fe 00
  1a: 5d                      pop    %rbp
  1b: c3                      ret
  1c: cc                      int3
  1d: cc                      int3
  1e: cc                      int3
  1f: cc                      int3
  20: cc                      int3
  21: cc                      int3
  22: cc                      int3
  23: cc                      int3
  24: cc                      int3
  25: cc                      int3
  26: cc                      int3
  27: cc                      int3
  28: cc                      int3
  29: cc                      int3
  2a:* 55                     push   %rbp        <-- trapping instruction
  2b: 48 89 e5                mov    %rsp,%rbp
  2e: 41 57                   push   %r15
  30: 41 56                   push   %r14
  32: 41 55                   push   %r13
  34: 41 54                   push   %r12
  36: 53                      push   %rbx
  37: 48 83 ec 08             sub    $0x8,%rsp
  3b: 4c 8b 62 18             mov    0x18(%rdx),%r12
  3f: 41                      rex.B

Code starting with the faulting instruction
===========================================
   0: 55                      push   %rbp
   1: 48 89 e5                mov    %rsp,%rbp
   4: 41 57                   push   %r15
   6: 41 56                   push   %r14
   8: 41 55                   push   %r13
   a: 41 54                   push   %r12
   c: 53                      push   %rbx
   d: 48 83 ec 08             sub    $0x8,%rsp
  11: 4c 8b 62 18             mov    0x18(%rdx),%r12
  15: 41                      rex.B

[531394.880225] RSP: 0018:ffffc9000cd388a8 EFLAGS: 00000082
[531394.880538] RAX: 0000000000000002 RBX: ffffc9000cd389a8 RCX: ffffc9000cd388b0
[531394.880702] RDX: ffffffff831cbec0 RSI: ffffffff8111893f RDI: ffffffff811b2c80
[531394.880884] RBP: ffffc9000cd38958 R08: ffffc900082f7f48 R09: 0000000000000001
[531394.881055] R10: 0000000000000e00 R11: 0000000000000001 R12: ffffc9000cd38a50
[531394.881225] R13: 0000000000000000 R14: ffff889846ecc080 R15: 0000000000000004
[531394.883694]  ? unwind_get_return_address (arch/x86/kernel/unwind_frame.c:19 arch/x86/kernel/unwind_frame.c:14)
[531394.883984]  ? kernel_text_address (kernel/extable.c:78)
[531394.885410]  ? 0xffffffffc0409095
[531394.886432]  ? 0xffffffffc0409099
[531394.887278]  ? 0xffffffffc0409099
[531394.889561]  ? __trace_stack (kernel/trace/trace.c:3119)
[531394.890036]  ? __kernel_text_address (kernel/extable.c:78)
[531394.891196]  __kernel_text_address (kernel/extable.c:78)
[531394.891658]  unwind_get_return_address (arch/x86/kernel/unwind_frame.c:19 arch/x86/kernel/unwind_frame.c:14)
[531394.892112]  ? __kernel_text_address (kernel/extable.c:78)
[531394.892382]  ? unwind_get_return_address (arch/x86/kernel/unwind_frame.c:19 arch/x86/kernel/unwind_frame.c:14)
[531394.892789]  ? write_profile (kernel/stacktrace.c:83)
[531394.893260]  arch_stack_walk (arch/x86/kernel/stacktrace.c:26)
[531394.895650]  ? __trace_stack (kernel/trace/trace.c:3119)
[531394.897209]  stack_trace_save (kernel/stacktrace.c:123)
[531394.898344]  __ftrace_trace_stack (kernel/trace/trace.c:3061)
[531394.899500]  ? rt_mutex_postunlock (kernel/printk/printk.c:2894)
[531394.899975]  __trace_stack (kernel/trace/trace.c:3119)
[531394.901025]  function_stack_trace_call (./arch/x86/include/asm/atomic.h:108 ./include/linux/atomic/atomic-instrumented.h:258 kernel/trace/trace_functions.c:245)
[531394.901356]  ? fbcon_redraw.constprop.0 (drivers/video/fbdev/core/fbcon.c:1661)
[531394.902748]  0xffffffffc0409099
[531394.904289]  ? fb_get_color_depth (drivers/video/fbdev/core/fbmem.c:92)
[531394.906173]  ? console_conditional_schedule (kernel/printk/printk.c:2895)
[531394.907244]  console_conditional_schedule (kernel/printk/printk.c:2895)
[531394.907727]  fbcon_redraw.constprop.0 (drivers/video/fbdev/core/fbcon.c:1661)
[531394.908200]  ? console_conditional_schedule (kernel/printk/printk.c:2895)
[531394.908506]  ? fbcon_redraw.constprop.0 (drivers/video/fbdev/core/fbcon.c:1661)
[531394.911230]  fbcon_scroll (drivers/video/fbdev/core/fbcon.c:1838)
[531394.912743]  con_scroll (drivers/tty/vt/vt.c:630 (discriminator 1))
[531394.914360]  lf (drivers/tty/vt/vt.c:1507)
[531394.914680]  ? lf (drivers/tty/vt/vt.c:1502)
[531394.915779]  vt_console_print (drivers/tty/vt/vt.c:3126)
[531394.918248]  console_emit_next_record.constprop.0 (kernel/printk/printk.c:1945 kernel/printk/printk.c:2732)
[531394.922365]  console_unlock (kernel/printk/printk.c:2794 kernel/printk/printk.c:2861)
[531394.924129]  vprintk_emit (kernel/printk/printk.c:2272)
[531394.925782]  vprintk_default (kernel/printk/printk.c:2283)
[531394.926306]  vprintk (kernel/printk/printk_safe.c:50)
[531394.926943]  _printk (kernel/printk/printk.c:2296)
[531394.929016]  perf_duration_warn.cold (kernel/events/core.c:510 kernel/events/core.c:508)
[531394.929518]  irq_work_single (kernel/irq_work.c:211)
[531394.930449]  irq_work_run_list (kernel/irq_work.c:241 (discriminator 3))
[531394.931169]  irq_work_run (kernel/irq_work.c:253)
[531394.931816]  __sysvec_irq_work (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:207 ./arch/x86/include/asm/trace/irq_vectors.h:64 arch/x86/kernel/irq_work.c:23)
[531394.932259]  sysvec_irq_work (arch/x86/kernel/irq_work.c:17 (discriminator 14))
[531394.932660]  </IRQ>
[531394.932854]  <TASK>
[531394.933607]  asm_sysvec_irq_work (./arch/x86/include/asm/idtentry.h:675)
On Tue, 29 Nov 2022 02:36:22 +0000, Nadav Amit <namit@vmware.com> wrote:

> Peter, Steven, (and others),
>
> Beyond the issues that are addressed in this patch-set, I encountered
> one more, which reiterates the fact that the heuristic of marking
> "inline" functions as "notrace" is not good enough.
>
> Before I send a patch, I would like to get your feedback. I include a
> splat below. It appears the execution might get stuck, since some
> functions that are used for function tracing can be traced themselves.
>
> For example, __kernel_text_address() and unwind_get_return_address()
> are traceable. I think that we need to disallow the tracing of all
> functions that are called directly or indirectly from
> function_stack_trace_call() (i.e., all functions in the dynamic extent
> of function_stack_trace_call()).

How did this happen? It should be able to handle recursion:

static void
function_stack_trace_call(unsigned long ip, unsigned long parent_ip,
			  struct ftrace_ops *op, struct ftrace_regs *fregs)
{
	struct trace_array *tr = op->private;
	struct trace_array_cpu *data;
	unsigned long flags;
	long disabled;
	int cpu;
	unsigned int trace_ctx;

	if (unlikely(!tr->function_enabled))
		return;

	/*
	 * Need to use raw, since this must be called before the
	 * recursive protection is performed.
	 */
	local_irq_save(flags);
	cpu = raw_smp_processor_id();
	data = per_cpu_ptr(tr->array_buffer.data, cpu);
	disabled = atomic_inc_return(&data->disabled);

	if (likely(disabled == 1)) {		<<<---- This stops recursion
		trace_ctx = tracing_gen_ctx_flags(flags);
		trace_function(tr, ip, parent_ip, trace_ctx);
		__trace_stack(tr, trace_ctx, STACK_SKIP);
	}

	atomic_dec(&data->disabled);
	local_irq_restore(flags);
}

Each of the stack trace functions may recurse back into this function, but it will not recurse further. How did it crash?

-- Steve
> [ rest of the quoted mail (the proposed patch and the splat) trimmed ]
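[Editorial note: a distilled sketch of the recursion guard Steven points at above, simplified to a single counter. The real tracer uses the per-CPU data->disabled field shown in full in his reply; the function name here is hypothetical.]

#include <linux/atomic.h>

static atomic_t callback_active;	/* per-CPU in the real tracer */

/*
 * Only the outermost entry does real work.  If tracing the work
 * recurses back into this callback, the counter is already above one
 * and the nested call falls through to the decrement.
 */
static void guarded_callback(void)
{
	if (atomic_inc_return(&callback_active) == 1) {
		/* real work; nested re-entry becomes a no-op */
	}
	atomic_dec(&callback_active);
}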
On Nov 28, 2022, at 8:15 PM, Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 29 Nov 2022 02:36:22 +0000, Nadav Amit <namit@vmware.com> wrote:
>
> > Beyond the issues that are addressed in this patch-set, I encountered
> > one more, which reiterates the fact that the heuristic of marking
> > "inline" functions as "notrace" is not good enough.
> >
> > [ ... ]
>
> How did this happen? It should be able to handle recursion:
>
> [ function_stack_trace_call() quoted in full above ]
>
> Each of the stack trace functions may recurse back into this function,
> but it will not recurse further. How did it crash?

Ugh. Thanks. I didn't know that - so your input is really helpful.

I will need to further debug it, but this issue does not occur every time.

The kernel didn't crash exactly - it's more of a deadlock. I have lockdep enabled, so it is not a deadlock that lockdep knows about. Could it be that somehow things just slowed down due to IPIs and mostly-disabled IRQs? I have no idea. I would need to recreate the scenario.

For the record, I tried to save some details in the previous email. It was kind of hard to understand what's going on on the other cores, since the traces of the different cores were interleaved.
I extracted the parts that I think refer to another CPU (yes, the output is really slow, as seen in the timestamps):

[531413.923628] Code: 00 00 31 c0 eb f1 0f 1f 80 00 00 00 00 e8 1b 2e 16 3e 55 48 89 e5 c6 07 00 0f 1f 00 f7 c6 00 02 00 00 74 06 fb 0f 1f 44 00 00 <bf> 01 00 00 00 e8 99 da f1 fe 65 8b 05 f2 99 d7 7d 85 c0 74 02 5d

All code
========
   0: 00 00                   add    %al,(%rax)
   2: 31 c0                   xor    %eax,%eax
   4: eb f1                   jmp    0xfffffffffffffff7
   6: 0f 1f 80 00 00 00 00    nopl   0x0(%rax)
   d: e8 1b 2e 16 3e          call   0x3e162e2d
  12: 55                      push   %rbp
  13: 48 89 e5                mov    %rsp,%rbp
  16: c6 07 00                movb   $0x0,(%rdi)
  19: 0f 1f 00                nopl   (%rax)
  1c: f7 c6 00 02 00 00       test   $0x200,%esi
  22: 74 06                   je     0x2a
  24: fb                      sti
  25: 0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  2a:* bf 01 00 00 00         mov    $0x1,%edi        <-- trapping instruction
  2f: e8 99 da f1 fe          call   0xfffffffffef1dacd
  34: 65 8b 05 f2 99 d7 7d    mov    %gs:0x7dd799f2(%rip),%eax        # 0x7dd79a2d
  3b: 85 c0                   test   %eax,%eax
  3d: 74 02                   je     0x41
  3f: 5d                      pop    %rbp

Code starting with the faulting instruction
===========================================
   0: bf 01 00 00 00          mov    $0x1,%edi
   5: e8 99 da f1 fe          call   0xfffffffffef1daa3
   a: 65 8b 05 f2 99 d7 7d    mov    %gs:0x7dd799f2(%rip),%eax        # 0x7dd79a03
  11: 85 c0                   test   %eax,%eax
  13: 74 02                   je     0x17
  15: 5d                      pop    %rbp

[531414.066765] RSP: 0018:ffffc9000c9a77d8 EFLAGS: 00000206
[531414.077943] RIP: 0010:smp_call_function_many_cond (kernel/smp.c:443 kernel/smp.c:988)
[531416.987351]  on_each_cpu_cond_mask (kernel/smp.c:1155)
[531416.205862]  ? text_poke_memset (arch/x86/kernel/alternative.c:1296)
[531416.681294]  ? text_poke_memset (arch/x86/kernel/alternative.c:1296)
[531417.468443]  text_poke_bp_batch (arch/x86/kernel/alternative.c:1553)
[531418.939923]  arch_ftrace_update_trampoline (arch/x86/kernel/ftrace.c:500)
[531419.882055]  ? ftrace_no_pid_write (kernel/trace/ftrace.c:7864)
[531420.510376]  ftrace_update_pid_func (kernel/trace/ftrace.c:374 (discriminator 1))
[531420.784703]  ftrace_pid_open (kernel/trace/ftrace.c:2918 kernel/trace/ftrace.c:2932 kernel/trace/ftrace.c:7725 kernel/trace/ftrace.c:7835 kernel/trace/ftrace.c:7865)
[531421.851294]  vfs_open (fs/open.c:1017)
On Tue, 29 Nov 2022 04:25:38 +0000, Nadav Amit <namit@vmware.com> wrote:

> I will need to further debug it, but this issue does not occur every
> time.
>
> The kernel didn't crash exactly - it's more of a deadlock. I have
> lockdep enabled, so it is not a deadlock that lockdep knows about.
> Could it be that somehow things just slowed down due to IPIs and
> mostly-disabled IRQs? I have no idea. I would need to recreate the
> scenario.

You have lockdep enabled and you are running function tracing with stack tracing on? So you are doing a stack trace on *every* function that is traced?

I don't think you hit a deadlock, I think you hit a live lock. You could possibly slow the system down so much that by the time an interrupt finishes, it is time for it to be triggered again, and you never make forward progress.

> For the record, I tried to save some details in the previous email. It
> was kind of hard to understand what's going on on the other cores,
> since the traces of the different cores were interleaved. I extracted
> the parts that I think refer to another CPU (yes, the output is really
> slow, as seen in the timestamps):
>
> [ quoted CPU trace trimmed ]

Do you have an issue with normal function tracing, when you are not tracing every function?

I should also add this, because it detects recursion faster than the atomic_inc_return() does.
-- Steve

diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index 9f1bfbe105e8..93ec756dc24b 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -221,12 +221,18 @@ function_stack_trace_call(unsigned long ip, unsigned long parent_ip,
 	struct trace_array_cpu *data;
 	unsigned long flags;
 	long disabled;
+	int bit;
 	int cpu;
 	unsigned int trace_ctx;
 
 	if (unlikely(!tr->function_enabled))
 		return;
 
+	/* Faster than atomic_inc_return() */
+	bit = ftrace_test_recursion_trylock(ip, parent_ip);
+	if (bit < 0)
+		return;
+
 	/*
 	 * Need to use raw, since this must be called before the
 	 * recursive protection is performed.
@@ -244,6 +250,7 @@ function_stack_trace_call(unsigned long ip, unsigned long parent_ip,
 
 	atomic_dec(&data->disabled);
 	local_irq_restore(flags);
+	ftrace_test_recursion_unlock(bit);
 }
 
 static inline bool is_repeat_check(struct trace_array *tr,
On Nov 29, 2022, at 7:06 AM, Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 29 Nov 2022 04:25:38 +0000, Nadav Amit <namit@vmware.com> wrote:
>
> > I will need to further debug it, but this issue does not occur every
> > time.
> >
> > The kernel didn't crash exactly - it's more of a deadlock. I have
> > lockdep enabled, so it is not a deadlock that lockdep knows about.
> > Could it be that somehow things just slowed down due to IPIs and
> > mostly-disabled IRQs? I have no idea. I would need to recreate the
> > scenario.
>
> You have lockdep enabled and you are running function tracing with
> stack tracing on? So you are doing a stack trace on *every* function
> that is traced?
>
> I don't think you hit a deadlock, I think you hit a live lock. You
> could possibly slow the system down so much that by the time an
> interrupt finishes, it is time for it to be triggered again, and you
> never make forward progress.

That might be the issue. Perhaps I have a bug, because my code was supposed to either enable stack tracing for selected functions, or to trace all functions but *without* stack tracing.

Thanks for the pointer, and sorry for the noise.

Regards,
Nadav
diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index eb0466236661..36a99ef03a1a 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -158,7 +158,7 @@ struct ftrace_likely_data {
  * of extern inline functions at link time.
  * A lot of inline functions can cause havoc with function tracing.
  */
-#define inline inline __gnu_inline __inline_maybe_unused notrace
+#define inline inline __gnu_inline __inline_maybe_unused
 
 /*
  * gcc provides both __inline__ and __inline as alternate spellings of
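[Editorial note: a hedged sketch of what the one-line change means in practice, on a hypothetical function; the expansions shown are what the compiler_types.h macro produces.]

/*
 * Before the patch,
 *     static inline int f(int x) { ... }
 * expanded to roughly
 *     static inline __gnu_inline __inline_maybe_unused notrace int f(int x) { ... }
 * After the patch the implicit notrace is gone, so any out-of-line
 * copy of f() the compiler emits carries the usual mcount/fentry call
 * site and becomes visible to the function tracer.
 */
static inline int hypothetical_f(int x)
{
	return x + 1;
}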