Message ID | 20221122195329.252654-4-namit@vmware.com |
---|---|
State | New |
Series | kprobes: notrace enhancements |
Commit Message
Nadav Amit
Nov. 22, 2022, 7:53 p.m. UTC
From: Nadav Amit <namit@vmware.com>

Functions that are marked as "inline" are currently also not traceable. Apparently, this was done to prevent differences between configs that caused different functions to be traceable on different platforms.

In any case, this consideration is not very strong, and tying "inline" to "notrace" does not seem very beneficial. The "inline" keyword is just a hint, and many functions are currently not traceable for this reason alone.

Disconnect "inline" from "notrace".

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 include/linux/compiler_types.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
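[Editorial note: a minimal sketch of the "inline is just a hint" point above. The helper name and the address-taking are hypothetical, not part of the patch.]

/*
 * "inline" is only a request.  The compiler may still emit an
 * out-of-line copy, e.g. when the function's address is taken, and
 * that copy gets the regular -pg/fentry prologue, making it a valid
 * ftrace target unless notrace (previously implied by "inline")
 * suppresses the instrumentation.
 */
static inline int hypothetical_helper(int x)
{
	return x * 2 + 1;
}

/* Taking the address forces an out-of-line instance to be emitted. */
static int (*hypothetical_fp)(int) = hypothetical_helper;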
Comments
On Tue, Nov 22, 2022, at 20:53, Nadav Amit wrote:

> From: Nadav Amit <namit@vmware.com>
>
> Functions that are marked as "inline" are currently also not traceable.
> Apparently, this was done to prevent differences between configs that
> caused different functions to be traceable on different platforms.
>
> In any case, this consideration is not very strong, and tying "inline"
> to "notrace" does not seem very beneficial. The "inline" keyword is
> just a hint, and many functions are currently not traceable for this
> reason alone.

The original reason was given in commit 93b3cca1ccd3 ("ftrace: Make all inline tags also include notrace"), which describes:

    Commit 5963e317b1e9d2a ("ftrace/x86: Do not change stacks in DEBUG when
    calling lockdep") prevented lockdep calls from the int3 breakpoint handler
    from reseting the stack if a function that was called was in the process
    of being converted for tracing and had a breakpoint on it. The idea is,
    before calling the lockdep code, do a load_idt() to the special IDT that
    kept the breakpoint stack from reseting. This worked well as a quick fix
    for this kernel release, until a certain config caused a lockup in the
    function tracer start up tests.

    Investigating it, I found that the load_idt that was used to prevent
    the int3 from changing stacks was itself being traced!

This sounds like a much stronger reason than the one you describe, and I would expect your change to cause regressions in similar places.

It's possible that the right answer is that the affected functions should be marked as __always_inline.

     Arnd
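[Editorial note: a minimal sketch of what Arnd's suggestion could look like. The function name is hypothetical; __always_inline is the kernel's forced-inlining annotation.]

#include <linux/compiler.h>

/*
 * Forcing inlining means no out-of-line, patchable copy of the
 * function can exist, so it can never become an ftrace target - the
 * property that load_idt() relied on in the commit quoted above.
 */
static __always_inline void critical_int3_helper(void)
{
	/* code that must never be traced */
}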
On Tue, 22 Nov 2022 21:09:08 +0100, "Arnd Bergmann" <arnd@arndb.de> wrote:

> On Tue, Nov 22, 2022, at 20:53, Nadav Amit wrote:
> > Functions that are marked as "inline" are currently also not
> > traceable. [ ... ]
>
> The original reason was given in commit 93b3cca1ccd3 ("ftrace: Make all
> inline tags also include notrace"). [ commit description quoted above ]
>
> This sounds like a much stronger reason than the one you describe, and
> I would expect your change to cause regressions in similar places.
>
> It's possible that the right answer is that the affected functions
> should be marked as __always_inline.

Actually, this requirement may not be needed as much today. There has been a lot of work in the last 10 years (since that commit was added) to make ftrace much more robust. We could remove the notrace from inline and then see where it breaks ;-) But I'm guessing that it's probably not as much of an issue as it was before.

Although it may cause some splats with noinstr, I think those will be caught at compile time.

-- Steve
On Nov 22, 2022, at 12:09 PM, Arnd Bergmann <arnd@arndb.de> wrote:

> On Tue, Nov 22, 2022, at 20:53, Nadav Amit wrote:
> > Functions that are marked as "inline" are currently also not
> > traceable. [ ... ]
>
> The original reason was given in commit 93b3cca1ccd3 ("ftrace: Make all
> inline tags also include notrace"). [ commit description quoted above ]
>
> This sounds like a much stronger reason than the one you describe, and
> I would expect your change to cause regressions in similar places.

I had no intention of misrepresenting. That was my understanding from my previous discussion with Steven.

I assume that this patch might cause some regressions. The first patch in this series was intended to prevent some regressions that I encountered. This patch also marks a function that was missing "notrace" before, and I did get kernel hangs due to the missing "notrace".

Anyhow, I believe that the alternative - leaving things as they are ("inline" -> "notrace") - is even worse. Obviously it prevents proper tracing, as there are even system calls that use inline, for instance __do_sys_process_madvise() and __do_sys_mremap(). But more importantly, the current "inline" -> "notrace" solution just papers over missing "notrace" annotations. Anyone can remove the "inline" at any given moment, since there is no direct (or indirect) relationship between "inline" and "notrace". It all seems random to me and bound to fail at some point.

> It's possible that the right answer is that the affected functions
> should be marked as __always_inline.

I think that it is probably better to mark them as notrace. People might remove __always_inline. If we want two versions - one traceable and one not - we can also do that. But I am not sure how many people are aware of the relationship between inline/__always_inline and tracing.
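[Editorial note: a sketch of the explicit annotation Nadav argues for instead. The function name is hypothetical; notrace is the kernel's no_instrument_function attribute.]

#include <linux/compiler.h>

/*
 * With an explicit notrace, the no-trace property survives even if
 * someone later removes the "inline" hint, because it no longer
 * depends on the implicit inline -> notrace coupling.
 */
static inline notrace void idt_switch_helper(void)
{
	/* safe whether or not the compiler emits an out-of-line copy */
}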
On Nov 22, 2022, at 12:51 PM, Nadav Amit <namit@vmware.com> wrote:

> But more importantly, the current "inline" -> "notrace" solution just
> papers over missing "notrace" annotations. Anyone can remove the
> "inline" at any given moment, since there is no direct (or indirect)
> relationship between "inline" and "notrace". It all seems random to me
> and bound to fail at some point.

Peter, Steven, (and others),

Beyond the issues that are addressed in this patch-set, I encountered one more, which reiterates the fact that the heuristic of marking "inline" functions as "notrace" is not good enough.

Before I send a patch, I would like to get your feedback. I include a splat below. It appears the execution might get stuck, since some functions that are used for function tracing can be traced themselves.

For example, __kernel_text_address() and unwind_get_return_address() are traceable. I think that we need to disallow the tracing of all functions that are called directly or indirectly from function_stack_trace_call() (i.e., all functions in the dynamic extent of function_stack_trace_call()).

In the absence of a proper automated static-analysis tool for the matter, I suggest the following solution, but I would like to check that you are ok with the granularity of the "notrace" that I propose. Again, note that this is not caused by this "inline" patch; it is an issue that existed before.

-- >8 --

Author: Nadav Amit <namit@vmware.com>
Date:   Tue Nov 29 02:25:12 2022 +0000

    trace: Disable tracing of code called from function_stack_trace_call()

    Signed-off-by: Nadav Amit <namit@vmware.com>

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index a20a5ebfacd7..185933222d08 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -21,6 +21,10 @@ CFLAGS_REMOVE_ftrace.o = -pg
 CFLAGS_REMOVE_early_printk.o = -pg
 CFLAGS_REMOVE_head64.o = -pg
 CFLAGS_REMOVE_sev.o = -pg
+CFLAGS_REMOVE_unwind_frame.o = -pg
+CFLAGS_REMOVE_unwind_guess.o = -pg
+CFLAGS_REMOVE_unwind_orc.o = -pg
+CFLAGS_REMOVE_stacktrace.o = -pg
 endif
 
 KASAN_SANITIZE_head$(BITS).o := n
diff --git a/kernel/Makefile b/kernel/Makefile
index 318789c728d3..d688eab1e1f8 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -19,6 +19,8 @@ obj-$(CONFIG_MULTIUSER) += groups.o
 ifdef CONFIG_FUNCTION_TRACER
 # Do not trace internal ftrace files
 CFLAGS_REMOVE_irq_work.o = $(CC_FLAGS_FTRACE)
+CFLAGS_REMOVE_stacktrace.o = $(CC_FLAGS_FTRACE)
+CFLAGS_REMOVE_extable.o = $(CC_FLAGS_FTRACE)
 endif
 
 # Prevents flicker of uninteresting __do_softirq()/__local_bh_disable_ip()

---

[531394.447185] sched: RT throttling activated
[531394.874541] NMI watchdog: Watchdog detected hard LOCKUP on cpu 26
[531394.874745] Modules linked in: zram
[531394.875224] CPU: 26 PID: 0 Comm: swapper/26 Not tainted 6.0.0-rc6umem+ #29
[531394.875426] Hardware name: Cisco Systems Inc UCSC-C220-M5SX/UCSC-C220-M5SX, BIOS C220M5.4.0.1i.0.0522190226 05/22/2019
[531394.875623] RIP: 0010:poke_int3_handler (arch/x86/kernel/alternative.c:1435)
[531394.875903] Code: 45 01 48 8b 0d c8 0a 45 01 49 8d 70 ff 83 f8 01 7f 1c 48 63 39 31 c0 48 81 c7 00 00 00 81 48 39 fe 74 3c f0 ff 0d b3 0a 45 01 <c3> 31 c0 c3 49 89 ca 49 89 c1 49 d1 e9 4c 89 c9 48 c1 e1 04 4c 01

All code
========
   0: 45 01 48 8b             add    %r9d,-0x75(%r8)
   4: 0d c8 0a 45 01          or     $0x1450ac8,%eax
   9: 49 8d 70 ff             lea    -0x1(%r8),%rsi
   d: 83 f8 01                cmp    $0x1,%eax
  10: 7f 1c                   jg     0x2e
  12: 48 63 39                movslq (%rcx),%rdi
  15: 31 c0                   xor    %eax,%eax
  17: 48 81 c7 00 00 00 81    add    $0xffffffff81000000,%rdi
  1e: 48 39 fe                cmp    %rdi,%rsi
  21: 74 3c                   je     0x5f
  23: f0 ff 0d b3 0a 45 01    lock decl 0x1450ab3(%rip)        # 0x1450add
  2a:* c3                     ret    <-- trapping instruction
  2b: 31 c0                   xor    %eax,%eax
  2d: c3                      ret
  2e: 49 89 ca                mov    %rcx,%r10
  31: 49 89 c1                mov    %rax,%r9
  34: 49 d1 e9                shr    %r9
  37: 4c 89 c9                mov    %r9,%rcx
  3a: 48 c1 e1 04             shl    $0x4,%rcx
  3e: 4c                      rex.WR
  3f: 01                      .byte 0x1

Code starting with the faulting instruction
===========================================
   0: c3                      ret
   1: 31 c0                   xor    %eax,%eax
   3: c3                      ret
   4: 49 89 ca                mov    %rcx,%r10
   7: 49 89 c1                mov    %rax,%r9
   a: 49 d1 e9                shr    %r9
   d: 4c 89 c9                mov    %r9,%rcx
  10: 48 c1 e1 04             shl    $0x4,%rcx
  14: 4c                      rex.WR
  15: 01                      .byte 0x1

[531394.876031] RSP: 0018:ffffc9000cd387b0 EFLAGS: 00000003
[531394.876320] RAX: 0000000000000001 RBX: 0000000000000000 RCX: ffffc9000cd388a8
[531394.876482] RDX: ffffc9000cd387d8 RSI: ffffffff812ba310 RDI: ffffffffc0409094
[531394.876652] RBP: ffffc9000cd387c8 R08: ffffffffc0409099 R09: 0000000000000000
[531394.876782] R10: 0000000000000000 R11: 0000000000000000 R12: ffffc9000cd387d8
[531394.876927] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[531394.877180] FS: 0000000000000000(0000) GS:ffff88afdf900000(0000) knlGS:0000000000000000
[531394.877371] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[531394.877503] CR2: 00007efd1e47b01c CR3: 0000000006a0a001 CR4: 00000000007706e0
[531394.877641] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[531394.877822] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[531394.877993] PKRU: 55555554
[531394.878164] Call Trace:
[531394.878356]  <IRQ>
[531394.878573]  ? exc_int3 (arch/x86/kernel/traps.c:817)
[531394.879306]  asm_exc_int3 (./arch/x86/include/asm/idtentry.h:569)
[531394.879725] RIP: 0010:function_stack_trace_call (kernel/trace/trace_functions.c:219)
[531394.880058] Code: 2b 81 48 c7 c7 e0 e2 33 83 e8 1c fa fd ff 48 c7 c7 80 e6 1c 83 e8 40 61 fe 00 5d c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc <55> 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec 08 4c 8b 62 18 41

All code
========
   0: 2b 81 48 c7 c7 e0       sub    -0x1f3838b8(%rcx),%eax
   6: e2 33                   loop   0x3b
   8: 83 e8 1c                sub    $0x1c,%eax
   b: fa                      cli
   c: fd                      std
   d: ff 48 c7                decl   -0x39(%rax)
  10: c7 80 e6 1c 83 e8 40    movl   $0xfe6140,-0x177ce31a(%rax)
  17: 61 fe 00
  1a: 5d                      pop    %rbp
  1b: c3                      ret
  1c: cc                      int3
  1d: cc                      int3
  1e: cc                      int3
  1f: cc                      int3
  20: cc                      int3
  21: cc                      int3
  22: cc                      int3
  23: cc                      int3
  24: cc                      int3
  25: cc                      int3
  26: cc                      int3
  27: cc                      int3
  28: cc                      int3
  29: cc                      int3
  2a:* 55                     push   %rbp        <-- trapping instruction
  2b: 48 89 e5                mov    %rsp,%rbp
  2e: 41 57                   push   %r15
  30: 41 56                   push   %r14
  32: 41 55                   push   %r13
  34: 41 54                   push   %r12
  36: 53                      push   %rbx
  37: 48 83 ec 08             sub    $0x8,%rsp
  3b: 4c 8b 62 18             mov    0x18(%rdx),%r12
  3f: 41                      rex.B

Code starting with the faulting instruction
===========================================
   0: 55                      push   %rbp
   1: 48 89 e5                mov    %rsp,%rbp
   4: 41 57                   push   %r15
   6: 41 56                   push   %r14
   8: 41 55                   push   %r13
   a: 41 54                   push   %r12
   c: 53                      push   %rbx
   d: 48 83 ec 08             sub    $0x8,%rsp
  11: 4c 8b 62 18             mov    0x18(%rdx),%r12
  15: 41                      rex.B

[531394.880225] RSP: 0018:ffffc9000cd388a8 EFLAGS: 00000082
[531394.880538] RAX: 0000000000000002 RBX: ffffc9000cd389a8 RCX: ffffc9000cd388b0
[531394.880702] RDX: ffffffff831cbec0 RSI: ffffffff8111893f RDI: ffffffff811b2c80
[531394.880884] RBP: ffffc9000cd38958 R08: ffffc900082f7f48 R09: 0000000000000001
[531394.881055] R10: 0000000000000e00 R11: 0000000000000001 R12: ffffc9000cd38a50
[531394.881225] R13: 0000000000000000 R14: ffff889846ecc080 R15: 0000000000000004
[531394.883694]  ? unwind_get_return_address (arch/x86/kernel/unwind_frame.c:19 arch/x86/kernel/unwind_frame.c:14)
[531394.883984]  ? kernel_text_address (kernel/extable.c:78)
[531394.885410]  ? 0xffffffffc0409095
[531394.886432]  ? 0xffffffffc0409099
[531394.887278]  ? 0xffffffffc0409099
[531394.889561]  ? __trace_stack (kernel/trace/trace.c:3119)
[531394.890036]  ? __kernel_text_address (kernel/extable.c:78)
[531394.891196]  __kernel_text_address (kernel/extable.c:78)
[531394.891658]  unwind_get_return_address (arch/x86/kernel/unwind_frame.c:19 arch/x86/kernel/unwind_frame.c:14)
[531394.892112]  ? __kernel_text_address (kernel/extable.c:78)
[531394.892382]  ? unwind_get_return_address (arch/x86/kernel/unwind_frame.c:19 arch/x86/kernel/unwind_frame.c:14)
[531394.892789]  ? write_profile (kernel/stacktrace.c:83)
[531394.893260]  arch_stack_walk (arch/x86/kernel/stacktrace.c:26)
[531394.895650]  ? __trace_stack (kernel/trace/trace.c:3119)
[531394.897209]  stack_trace_save (kernel/stacktrace.c:123)
[531394.898344]  __ftrace_trace_stack (kernel/trace/trace.c:3061)
[531394.899500]  ? rt_mutex_postunlock (kernel/printk/printk.c:2894)
[531394.899975]  __trace_stack (kernel/trace/trace.c:3119)
[531394.901025]  function_stack_trace_call (./arch/x86/include/asm/atomic.h:108 ./include/linux/atomic/atomic-instrumented.h:258 kernel/trace/trace_functions.c:245)
[531394.901356]  ? fbcon_redraw.constprop.0 (drivers/video/fbdev/core/fbcon.c:1661)
[531394.902748]  0xffffffffc0409099
[531394.904289]  ? fb_get_color_depth (drivers/video/fbdev/core/fbmem.c:92)
[531394.906173]  ? console_conditional_schedule (kernel/printk/printk.c:2895)
[531394.907244]  console_conditional_schedule (kernel/printk/printk.c:2895)
[531394.907727]  fbcon_redraw.constprop.0 (drivers/video/fbdev/core/fbcon.c:1661)
[531394.908200]  ? console_conditional_schedule (kernel/printk/printk.c:2895)
[531394.908506]  ? fbcon_redraw.constprop.0 (drivers/video/fbdev/core/fbcon.c:1661)
[531394.911230]  fbcon_scroll (drivers/video/fbdev/core/fbcon.c:1838)
[531394.912743]  con_scroll (drivers/tty/vt/vt.c:630 (discriminator 1))
[531394.914360]  lf (drivers/tty/vt/vt.c:1507)
[531394.914680]  ? lf (drivers/tty/vt/vt.c:1502)
[531394.915779]  vt_console_print (drivers/tty/vt/vt.c:3126)
[531394.918248]  console_emit_next_record.constprop.0 (kernel/printk/printk.c:1945 kernel/printk/printk.c:2732)
[531394.922365]  console_unlock (kernel/printk/printk.c:2794 kernel/printk/printk.c:2861)
[531394.924129]  vprintk_emit (kernel/printk/printk.c:2272)
[531394.925782]  vprintk_default (kernel/printk/printk.c:2283)
[531394.926306]  vprintk (kernel/printk/printk_safe.c:50)
[531394.926943]  _printk (kernel/printk/printk.c:2296)
[531394.929016]  perf_duration_warn.cold (kernel/events/core.c:510 kernel/events/core.c:508)
[531394.929518]  irq_work_single (kernel/irq_work.c:211)
[531394.930449]  irq_work_run_list (kernel/irq_work.c:241 (discriminator 3))
[531394.931169]  irq_work_run (kernel/irq_work.c:253)
[531394.931816]  __sysvec_irq_work (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:207 ./arch/x86/include/asm/trace/irq_vectors.h:64 arch/x86/kernel/irq_work.c:23)
[531394.932259]  sysvec_irq_work (arch/x86/kernel/irq_work.c:17 (discriminator 14))
[531394.932660]  </IRQ>
[531394.932854]  <TASK>
[531394.933607]  asm_sysvec_irq_work (./arch/x86/include/asm/idtentry.h:675)
On Tue, 29 Nov 2022 02:36:22 +0000, Nadav Amit <namit@vmware.com> wrote:

> Peter, Steven, (and others),
>
> Beyond the issues that are addressed in this patch-set, I encountered
> one more, which reiterates the fact that the heuristic of marking
> "inline" functions as "notrace" is not good enough.
>
> Before I send a patch, I would like to get your feedback. I include a
> splat below. It appears the execution might get stuck, since some
> functions that are used for function tracing can be traced themselves.
>
> For example, __kernel_text_address() and unwind_get_return_address()
> are traceable. I think that we need to disallow the tracing of all
> functions that are called directly or indirectly from
> function_stack_trace_call() (i.e., all functions in the dynamic extent
> of function_stack_trace_call()).

How did this happen? It should be able to handle recursion:

static void
function_stack_trace_call(unsigned long ip, unsigned long parent_ip,
			  struct ftrace_ops *op, struct ftrace_regs *fregs)
{
	struct trace_array *tr = op->private;
	struct trace_array_cpu *data;
	unsigned long flags;
	long disabled;
	int cpu;
	unsigned int trace_ctx;

	if (unlikely(!tr->function_enabled))
		return;

	/*
	 * Need to use raw, since this must be called before the
	 * recursive protection is performed.
	 */
	local_irq_save(flags);
	cpu = raw_smp_processor_id();
	data = per_cpu_ptr(tr->array_buffer.data, cpu);
	disabled = atomic_inc_return(&data->disabled);

	if (likely(disabled == 1)) {		<<<---- This stops recursion
		trace_ctx = tracing_gen_ctx_flags(flags);
		trace_function(tr, ip, parent_ip, trace_ctx);
		__trace_stack(tr, trace_ctx, STACK_SKIP);
	}

	atomic_dec(&data->disabled);
	local_irq_restore(flags);
}

Each of the stack trace functions may recurse back into this function, but it will not recurse further. How did it crash?

-- Steve
> [ rest of the quoted mail (the proposed patch and the splat) trimmed ]
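[Editorial note: a distilled sketch of the recursion guard Steven points at above, simplified to a single counter. The real tracer uses the per-CPU data->disabled field shown in full in his reply; the function name here is hypothetical.]

#include <linux/atomic.h>

static atomic_t callback_active;	/* per-CPU in the real tracer */

/*
 * Only the outermost entry does real work.  If tracing the work
 * recurses back into this callback, the counter is already above one
 * and the nested call falls through to the decrement.
 */
static void guarded_callback(void)
{
	if (atomic_inc_return(&callback_active) == 1) {
		/* real work; nested re-entry becomes a no-op */
	}
	atomic_dec(&callback_active);
}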
On Nov 28, 2022, at 8:15 PM, Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 29 Nov 2022 02:36:22 +0000, Nadav Amit <namit@vmware.com> wrote:
>
> > Beyond the issues that are addressed in this patch-set, I encountered
> > one more, which reiterates the fact that the heuristic of marking
> > "inline" functions as "notrace" is not good enough.
> >
> > [ ... ]
>
> How did this happen? It should be able to handle recursion:
>
> [ function_stack_trace_call() quoted in full above ]
>
> Each of the stack trace functions may recurse back into this function,
> but it will not recurse further. How did it crash?

Ugh. Thanks. I didn't know that - so your input is really helpful.

I will need to further debug it, but this issue does not occur every time.

The kernel didn't crash exactly - it's more of a deadlock. I have lockdep enabled, so it is not a deadlock that lockdep knows about. Could it be that somehow things just slowed down due to IPIs and mostly-disabled IRQs? I have no idea. I would need to recreate the scenario.

For the record, I tried to save some details in the previous email. It was kind of hard to understand what's going on on the other cores, since the traces of the different cores were interleaved.
I extracted the parts that I think refer to another CPU (yes, the output is really slow, as seen in the timestamps):

[531413.923628] Code: 00 00 31 c0 eb f1 0f 1f 80 00 00 00 00 e8 1b 2e 16 3e 55 48 89 e5 c6 07 00 0f 1f 00 f7 c6 00 02 00 00 74 06 fb 0f 1f 44 00 00 <bf> 01 00 00 00 e8 99 da f1 fe 65 8b 05 f2 99 d7 7d 85 c0 74 02 5d

All code
========
   0: 00 00                   add    %al,(%rax)
   2: 31 c0                   xor    %eax,%eax
   4: eb f1                   jmp    0xfffffffffffffff7
   6: 0f 1f 80 00 00 00 00    nopl   0x0(%rax)
   d: e8 1b 2e 16 3e          call   0x3e162e2d
  12: 55                      push   %rbp
  13: 48 89 e5                mov    %rsp,%rbp
  16: c6 07 00                movb   $0x0,(%rdi)
  19: 0f 1f 00                nopl   (%rax)
  1c: f7 c6 00 02 00 00       test   $0x200,%esi
  22: 74 06                   je     0x2a
  24: fb                      sti
  25: 0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  2a:* bf 01 00 00 00         mov    $0x1,%edi        <-- trapping instruction
  2f: e8 99 da f1 fe          call   0xfffffffffef1dacd
  34: 65 8b 05 f2 99 d7 7d    mov    %gs:0x7dd799f2(%rip),%eax        # 0x7dd79a2d
  3b: 85 c0                   test   %eax,%eax
  3d: 74 02                   je     0x41
  3f: 5d                      pop    %rbp

Code starting with the faulting instruction
===========================================
   0: bf 01 00 00 00          mov    $0x1,%edi
   5: e8 99 da f1 fe          call   0xfffffffffef1daa3
   a: 65 8b 05 f2 99 d7 7d    mov    %gs:0x7dd799f2(%rip),%eax        # 0x7dd79a03
  11: 85 c0                   test   %eax,%eax
  13: 74 02                   je     0x17
  15: 5d                      pop    %rbp

[531414.066765] RSP: 0018:ffffc9000c9a77d8 EFLAGS: 00000206
[531414.077943] RIP: 0010:smp_call_function_many_cond (kernel/smp.c:443 kernel/smp.c:988)
[531416.987351]  on_each_cpu_cond_mask (kernel/smp.c:1155)
[531416.205862]  ? text_poke_memset (arch/x86/kernel/alternative.c:1296)
[531416.681294]  ? text_poke_memset (arch/x86/kernel/alternative.c:1296)
[531417.468443]  text_poke_bp_batch (arch/x86/kernel/alternative.c:1553)
[531418.939923]  arch_ftrace_update_trampoline (arch/x86/kernel/ftrace.c:500)
[531419.882055]  ? ftrace_no_pid_write (kernel/trace/ftrace.c:7864)
[531420.510376]  ftrace_update_pid_func (kernel/trace/ftrace.c:374 (discriminator 1))
[531420.784703]  ftrace_pid_open (kernel/trace/ftrace.c:2918 kernel/trace/ftrace.c:2932 kernel/trace/ftrace.c:7725 kernel/trace/ftrace.c:7835 kernel/trace/ftrace.c:7865)
[531421.851294]  vfs_open (fs/open.c:1017)
On Tue, 29 Nov 2022 04:25:38 +0000, Nadav Amit <namit@vmware.com> wrote:

> I will need to further debug it, but this issue does not occur every
> time.
>
> The kernel didn't crash exactly - it's more of a deadlock. I have
> lockdep enabled, so it is not a deadlock that lockdep knows about.
> Could it be that somehow things just slowed down due to IPIs and
> mostly-disabled IRQs? I have no idea. I would need to recreate the
> scenario.

You have lockdep enabled and you are running function tracing with stack tracing on? So you are doing a stack trace on *every* function that is traced?

I don't think you hit a deadlock, I think you hit a live lock. You could possibly slow the system down so much that by the time an interrupt finishes, it is time for it to be triggered again, and you never make forward progress.

> For the record, I tried to save some details in the previous email. It
> was kind of hard to understand what's going on on the other cores,
> since the traces of the different cores were interleaved. I extracted
> the parts that I think refer to another CPU (yes, the output is really
> slow, as seen in the timestamps):
>
> [ quoted CPU trace trimmed ]

Do you have an issue with normal function tracing, when you are not tracing every function?

I should also add this, because it detects recursion faster than the atomic_inc_return() does.
-- Steve

diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index 9f1bfbe105e8..93ec756dc24b 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -221,12 +221,18 @@ function_stack_trace_call(unsigned long ip, unsigned long parent_ip,
 	struct trace_array_cpu *data;
 	unsigned long flags;
 	long disabled;
+	int bit;
 	int cpu;
 	unsigned int trace_ctx;
 
 	if (unlikely(!tr->function_enabled))
 		return;
 
+	/* Faster than atomic_inc_return() */
+	bit = ftrace_test_recursion_trylock(ip, parent_ip);
+	if (bit < 0)
+		return;
+
 	/*
 	 * Need to use raw, since this must be called before the
 	 * recursive protection is performed.
@@ -244,6 +250,7 @@ function_stack_trace_call(unsigned long ip, unsigned long parent_ip,
 
 	atomic_dec(&data->disabled);
 	local_irq_restore(flags);
+	ftrace_test_recursion_unlock(bit);
 }
 
 static inline bool is_repeat_check(struct trace_array *tr,
On Nov 29, 2022, at 7:06 AM, Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 29 Nov 2022 04:25:38 +0000, Nadav Amit <namit@vmware.com> wrote:
>
> > I will need to further debug it, but this issue does not occur every
> > time.
> >
> > The kernel didn't crash exactly - it's more of a deadlock. I have
> > lockdep enabled, so it is not a deadlock that lockdep knows about.
> > Could it be that somehow things just slowed down due to IPIs and
> > mostly-disabled IRQs? I have no idea. I would need to recreate the
> > scenario.
>
> You have lockdep enabled and you are running function tracing with
> stack tracing on? So you are doing a stack trace on *every* function
> that is traced?
>
> I don't think you hit a deadlock, I think you hit a live lock. You
> could possibly slow the system down so much that by the time an
> interrupt finishes, it is time for it to be triggered again, and you
> never make forward progress.

That might be the issue. Perhaps I have a bug, because my code was supposed to either enable stack tracing for selected functions, or to trace all functions but *without* stack tracing.

Thanks for the pointer, and sorry for the noise.

Regards,
Nadav
diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index eb0466236661..36a99ef03a1a 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -158,7 +158,7 @@ struct ftrace_likely_data {
  * of extern inline functions at link time.
  * A lot of inline functions can cause havoc with function tracing.
  */
-#define inline inline __gnu_inline __inline_maybe_unused notrace
+#define inline inline __gnu_inline __inline_maybe_unused
 
 /*
  * gcc provides both __inline__ and __inline as alternate spellings of
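[Editorial note: a hedged sketch of what the one-line change means in practice, on a hypothetical function; the expansions shown are what the compiler_types.h macro produces.]

/*
 * Before the patch,
 *     static inline int f(int x) { ... }
 * expanded to roughly
 *     static inline __gnu_inline __inline_maybe_unused notrace int f(int x) { ... }
 * After the patch the implicit notrace is gone, so any out-of-line
 * copy of f() the compiler emits carries the usual mcount/fentry call
 * site and becomes visible to the function tracer.
 */
static inline int hypothetical_f(int x)
{
	return x + 1;
}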