Message ID | baafcd3cc1abb14cb757fe081fa696012a5265ee.1676068346.git.jpoimboe@kernel.org |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp1214332wrn; Fri, 10 Feb 2023 14:46:41 -0800 (PST) X-Google-Smtp-Source: AK7set/7sOgutxURWuxoeqwRMos0ccOQVbStwe+OJ2H6cOHjAXiuUfud3HqqFDb50f21Q+4DYzTL X-Received: by 2002:a17:907:385:b0:878:67ad:cd8d with SMTP id ss5-20020a170907038500b0087867adcd8dmr11477287ejb.19.1676069201001; Fri, 10 Feb 2023 14:46:41 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676069200; cv=none; d=google.com; s=arc-20160816; b=gdFXIOypnXrplWjSC4NYvyOwOgQ+2IEMKLqTiIyam1E1whTD1+xayJ6ENhWuR9zz8U uKJUcUQypwoH9SN+WFUz1tgpAqCcaJSPyoIzn/tInlcrz4iDjpqgQ3b/JFMO4LCqcTpe yMUmkH+PzT9sheGEQaGh/n/NJoIhWlytjEFTJ0i5n84U9geppCrYpEWN5sa736teZIao M597fKZqGirgN8GBp549gwmPF9sUO0oGIKNSbn2/EPsuBbXKFsyp8fSR3q/NKRSkjkLD L2hPBxwqDW7rx+1r+PAFqydfRAIGxCgeGQBUUPsjXWnabbXSVGXh5Tqf+1X8+7xWVStW G/7A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=rmswaymSQnnrqSSRo7otrE8zQyeOAsw/WERYFcTlbFs=; b=FmmC1zAINLD9fD6O9FxO7sSF1132KGYEpe24/RuZRTAWGfEKcPaNT52glUM8W+pRf0 JTahP50exaJSCjIQ7ELuKz/+SJgGyHN4V3EJEJp3yc8r1E1Rbu8iRKwdWRfqvNtPKd7V 5xiOQF3jC4CtwCOUSKb1jTk7lGvVev2sLtHavLCGckgthK6jtldp17SqtjaZD1eDUxYl 00qY24ZYCA33CfiFIkuEi43Pqd8PEUZSmAv/ohHdrafA2zgVaRwkFQ1CxXgxOT3zNOMm OoGSalFVhsPhva5uu4Mh2MpTtK0lScU++409GOT6wx8UkF1d1GugNsEht1kFwkw4S1Nx pEOg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=FteLrbx7; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id j11-20020aa7c40b000000b004a20747cde6si6499421edq.192.2023.02.10.14.46.17; Fri, 10 Feb 2023 14:46:40 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=FteLrbx7; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233968AbjBJWmY (ORCPT <rfc822;ybw1215001957@gmail.com> + 99 others); Fri, 10 Feb 2023 17:42:24 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36728 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233954AbjBJWmT (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Fri, 10 Feb 2023 17:42:19 -0500 Received: from ams.source.kernel.org (ams.source.kernel.org [145.40.68.75]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8EBCF55E72 for <linux-kernel@vger.kernel.org>; Fri, 10 Feb 2023 14:42:14 -0800 (PST) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ams.source.kernel.org (Postfix) with ESMTPS id 360D7B82616 for <linux-kernel@vger.kernel.org>; Fri, 10 Feb 2023 22:42:13 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 90465C433A8; Fri, 10 Feb 2023 22:42:11 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1676068931; bh=LXh7cCWgskiu1lOCHbKoWebvfAUhqVsfAvFuxIYpy5k=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=FteLrbx76zQKCC/KPwhIke7oMow6ISwTy6SILIJCVjLWipk90bh3gczA4nnV6Y4vQ 216+36ARPC6NR/y0K9b9zoKJkjY0i2P4jbJVI9X0r0zK9EeT4jr6+7CcgKpgfkZWsy lFGbT1FgR12esskj57a+fCtomcG21FDzi70/neyUJ7dGk+ahSqSeQLgfwucmS9xmTo TsjJspy9Z7vf1HH/Fl9iC1BnRZ6bKMQhDG0ASF4q1rijWVOKl/wFzRwIePRTPFc52G 05a0Kjnsoq/97ko3FlHmi3B9kjYYOvqU32ViPEhUc8CytwUVpnNEHogOu3Im+O9WHU efJjvaauzJ/sw== From: Josh Poimboeuf <jpoimboe@kernel.org> To: x86@kernel.org Cc: linux-kernel@vger.kernel.org, Peter Zijlstra <peterz@infradead.org>, Chen Zhongjin <chenzhongjin@huawei.com>, "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>, Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>, "David S. Miller" <davem@davemloft.net>, Masami Hiramatsu <mhiramat@kernel.org> Subject: [PATCH 2/2] x86/entry: Fix unwinding from kprobe on PUSH/POP instruction Date: Fri, 10 Feb 2023 14:42:02 -0800 Message-Id: <baafcd3cc1abb14cb757fe081fa696012a5265ee.1676068346.git.jpoimboe@kernel.org> X-Mailer: git-send-email 2.39.1 In-Reply-To: <cover.1676068346.git.jpoimboe@kernel.org> References: <cover.1676068346.git.jpoimboe@kernel.org> MIME-Version: 1.0 Content-type: text/plain Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-7.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_HI, SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1757485938817464999?= X-GMAIL-MSGID: =?utf-8?q?1757485938817464999?= |
Series |
x86/unwind/orc: Fix unwinding from kprobe on PUSH/POP instruction
|
|
Commit Message
Josh Poimboeuf
Feb. 10, 2023, 10:42 p.m. UTC
If a kprobe (INT3) is set on a stack-modifying single-byte instruction,
like a single-byte PUSH/POP or a LEAVE, ORC fails to unwind past it:
Call Trace:
<TASK>
dump_stack_lvl+0x57/0x90
handler_pre+0x33/0x40 [kprobe_example]
aggr_pre_handler+0x49/0x90
kprobe_int3_handler+0xe3/0x180
do_int3+0x3a/0x80
exc_int3+0x7d/0xc0
asm_exc_int3+0x35/0x40
RIP: 0010:kernel_clone+0xe/0x3a0
Code: cc e8 16 b2 bf 00 66 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 57 41 56 41 55 41 54 cc <53> 48 89 fb 48 83 ec 68 4c 8b 27 65 48 8b 04 25 28 00 00 00 48 89
RSP: 0018:ffffc9000074fda0 EFLAGS: 00000206
RAX: 0000000000808100 RBX: ffff888109de9d80 RCX: 0000000000000000
RDX: 0000000000000011 RSI: ffff888109de9d80 RDI: ffffc9000074fdc8
RBP: ffff8881019543c0 R08: ffffffff81127e30 R09: 00000000e71742a5
R10: ffff888104764a18 R11: 0000000071742a5e R12: ffff888100078800
R13: ffff888100126000 R14: 0000000000000000 R15: ffff888100126005
? __pfx_call_usermodehelper_exec_async+0x10/0x10
? kernel_clone+0xe/0x3a0
? user_mode_thread+0x5b/0x80
? __pfx_call_usermodehelper_exec_async+0x10/0x10
? call_usermodehelper_exec_work+0x77/0xb0
? process_one_work+0x299/0x5f0
? worker_thread+0x4f/0x3a0
? __pfx_worker_thread+0x10/0x10
? kthread+0xf2/0x120
? __pfx_kthread+0x10/0x10
? ret_from_fork+0x29/0x50
</TASK>
The problem is that #BP saves the pointer to the instruction immediately
*after* the INT3, rather than to the INT3 itself. The instruction
replaced by the INT3 hasn't actually run, but ORC assumes otherwise and
expects the wrong stack layout.
Fix it by annotating the #BP exception as a non-signal stack frame,
which tells the ORC unwinder to decrement the instruction pointer before
looking up the corresponding ORC entry.
Reported-by: Chen Zhongjin <chenzhongjin@huawei.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/x86/entry/entry_64.S | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
Comments
On Fri, 10 Feb 2023 14:42:02 -0800 Josh Poimboeuf <jpoimboe@kernel.org> wrote: > If a kprobe (INT3) is set on a stack-modifying single-byte instruction, > like a single-byte PUSH/POP or a LEAVE, ORC fails to unwind past it: > > Call Trace: > <TASK> > dump_stack_lvl+0x57/0x90 > handler_pre+0x33/0x40 [kprobe_example] > aggr_pre_handler+0x49/0x90 > kprobe_int3_handler+0xe3/0x180 > do_int3+0x3a/0x80 > exc_int3+0x7d/0xc0 > asm_exc_int3+0x35/0x40 > RIP: 0010:kernel_clone+0xe/0x3a0 > Code: cc e8 16 b2 bf 00 66 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 57 41 56 41 55 41 54 cc <53> 48 89 fb 48 83 ec 68 4c 8b 27 65 48 8b 04 25 28 00 00 00 48 89 > RSP: 0018:ffffc9000074fda0 EFLAGS: 00000206 > RAX: 0000000000808100 RBX: ffff888109de9d80 RCX: 0000000000000000 > RDX: 0000000000000011 RSI: ffff888109de9d80 RDI: ffffc9000074fdc8 > RBP: ffff8881019543c0 R08: ffffffff81127e30 R09: 00000000e71742a5 > R10: ffff888104764a18 R11: 0000000071742a5e R12: ffff888100078800 > R13: ffff888100126000 R14: 0000000000000000 R15: ffff888100126005 > ? __pfx_call_usermodehelper_exec_async+0x10/0x10 > ? kernel_clone+0xe/0x3a0 > ? user_mode_thread+0x5b/0x80 > ? __pfx_call_usermodehelper_exec_async+0x10/0x10 > ? call_usermodehelper_exec_work+0x77/0xb0 > ? process_one_work+0x299/0x5f0 > ? worker_thread+0x4f/0x3a0 > ? __pfx_worker_thread+0x10/0x10 > ? kthread+0xf2/0x120 > ? __pfx_kthread+0x10/0x10 > ? ret_from_fork+0x29/0x50 > </TASK> > > The problem is that #BP saves the pointer to the instruction immediately > *after* the INT3, rather than to the INT3 itself. The instruction > replaced by the INT3 hasn't actually run, but ORC assumes otherwise and > expects the wrong stack layout. Ah, regs->ip is not adjusted. Yes. kprobes user usually use kp->addr. Hmm, maybe we also can adjust regs->ip in kprobes, but this change may help future use of stackdump in the int3 code. So I agree. > Fix it by annotating the #BP exception as a non-signal stack frame, > which tells the ORC unwinder to decrement the instruction pointer before > looking up the corresponding ORC entry. Just to make it clear, this sounds like a 'hack' use of non-signal stack frame. If so, can we change the flag name as 'literal' or 'non-literal' etc? I concern that the 'signal' flag is used differently in the future. Thank you, > > Reported-by: Chen Zhongjin <chenzhongjin@huawei.com> > Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> > --- > arch/x86/entry/entry_64.S | 9 ++++++++- > 1 file changed, 8 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S > index 15739a2c0983..8d21881adf86 100644 > --- a/arch/x86/entry/entry_64.S > +++ b/arch/x86/entry/entry_64.S > @@ -385,7 +385,14 @@ SYM_CODE_END(xen_error_entry) > */ > .macro idtentry vector asmsym cfunc has_error_code:req > SYM_CODE_START(\asmsym) > - UNWIND_HINT_IRET_REGS offset=\has_error_code*8 > + > + .if \vector == X86_TRAP_BP > + /* #BP advances %rip to the next instruction */ > + UNWIND_HINT_IRET_REGS offset=\has_error_code*8 signal=0 > + .else > + UNWIND_HINT_IRET_REGS offset=\has_error_code*8 > + .endif > + > ENDBR > ASM_CLAC > cld > -- > 2.39.1 >
On Mon, Feb 13, 2023 at 11:43:57PM +0900, Masami Hiramatsu wrote: > > Fix it by annotating the #BP exception as a non-signal stack frame, > > which tells the ORC unwinder to decrement the instruction pointer before > > looking up the corresponding ORC entry. > > Just to make it clear, this sounds like a 'hack' use of non-signal stack > frame. If so, can we change the flag name as 'literal' or 'non-literal' etc? > I concern that the 'signal' flag is used differently in the future. Oooh, bike-shed :-) Let me suggest trap=1, where a trap is a fault with a different return address, specifically the instruction after the faulting instruction.
On Tue, Feb 14, 2023 at 12:35:04PM +0100, Peter Zijlstra wrote: > On Mon, Feb 13, 2023 at 11:43:57PM +0900, Masami Hiramatsu wrote: > > > > Fix it by annotating the #BP exception as a non-signal stack frame, > > > which tells the ORC unwinder to decrement the instruction pointer before > > > looking up the corresponding ORC entry. > > > > Just to make it clear, this sounds like a 'hack' use of non-signal stack > > frame. If so, can we change the flag name as 'literal' or 'non-literal' etc? > > I concern that the 'signal' flag is used differently in the future. Agreed, though I'm having trouble coming up with a succinct yet scrutable name. If length wasn't an issue it would be something like "decrement_return_address_when_looking_up_the_next_orc_entry" > Oooh, bike-shed :-) Let me suggest trap=1, where a trap is a fault with > a different return address, specifically the instruction after the > faulting instruction. I think "trap" doesn't work because 1) It's more than just traps, it's also function calls. We have traps/calls in one bucket (decrement IP); and everything else (faults, aborts, irqs) in the other (don't decrement IP). 2) It's not necessarily all traps which need the flag, just those that affect a previously-but-now-overwritten stack-modifying instruction. So #OF (which we don't use?) and trap-class #DB don't seem to be affected. In practice maybe this distinction doesn't matter, but for example there's no reason for ORC try to distinguish trap #DB from non-trap #DB at runtime.
On Tue, Feb 14, 2023 at 09:05:52AM -0800, Josh Poimboeuf wrote: > On Tue, Feb 14, 2023 at 12:35:04PM +0100, Peter Zijlstra wrote: > > On Mon, Feb 13, 2023 at 11:43:57PM +0900, Masami Hiramatsu wrote: > > > > > > Fix it by annotating the #BP exception as a non-signal stack frame, > > > > which tells the ORC unwinder to decrement the instruction pointer before > > > > looking up the corresponding ORC entry. > > > > > > Just to make it clear, this sounds like a 'hack' use of non-signal stack > > > frame. If so, can we change the flag name as 'literal' or 'non-literal' etc? > > > I concern that the 'signal' flag is used differently in the future. > > Agreed, though I'm having trouble coming up with a succinct yet > scrutable name. If length wasn't an issue it would be something like > > "decrement_return_address_when_looking_up_the_next_orc_entry" > > > Oooh, bike-shed :-) Let me suggest trap=1, where a trap is a fault with > > a different return address, specifically the instruction after the > > faulting instruction. > > I think "trap" doesn't work because > > 1) It's more than just traps, it's also function calls. We have > traps/calls in one bucket (decrement IP); and everything else > (faults, aborts, irqs) in the other (don't decrement IP). > > 2) It's not necessarily all traps which need the flag, just those that > affect a previously-but-now-overwritten stack-modifying instruction. > So #OF (which we don't use?) and trap-class #DB don't seem to be > affected. In practice maybe this distinction doesn't matter, but > for example there's no reason for ORC try to distinguish trap #DB > from non-trap #DB at runtime. Well, I was specifically thinking about #DB, why don't we need to decrement when we put a hardware breakpoint on a stack modifying op?
On Wed, Feb 15, 2023 at 11:25:54AM +0100, Peter Zijlstra wrote: > On Tue, Feb 14, 2023 at 09:05:52AM -0800, Josh Poimboeuf wrote: > > On Tue, Feb 14, 2023 at 12:35:04PM +0100, Peter Zijlstra wrote: > > > On Mon, Feb 13, 2023 at 11:43:57PM +0900, Masami Hiramatsu wrote: > > > > > > > > Fix it by annotating the #BP exception as a non-signal stack frame, > > > > > which tells the ORC unwinder to decrement the instruction pointer before > > > > > looking up the corresponding ORC entry. > > > > > > > > Just to make it clear, this sounds like a 'hack' use of non-signal stack > > > > frame. If so, can we change the flag name as 'literal' or 'non-literal' etc? > > > > I concern that the 'signal' flag is used differently in the future. > > > > Agreed, though I'm having trouble coming up with a succinct yet > > scrutable name. If length wasn't an issue it would be something like > > > > "decrement_return_address_when_looking_up_the_next_orc_entry" > > > > > Oooh, bike-shed :-) Let me suggest trap=1, where a trap is a fault with > > > a different return address, specifically the instruction after the > > > faulting instruction. > > > > I think "trap" doesn't work because > > > > 1) It's more than just traps, it's also function calls. We have > > traps/calls in one bucket (decrement IP); and everything else > > (faults, aborts, irqs) in the other (don't decrement IP). > > > > 2) It's not necessarily all traps which need the flag, just those that > > affect a previously-but-now-overwritten stack-modifying instruction. > > So #OF (which we don't use?) and trap-class #DB don't seem to be > > affected. In practice maybe this distinction doesn't matter, but > > for example there's no reason for ORC try to distinguish trap #DB > > from non-trap #DB at runtime. > > Well, I was specifically thinking about #DB, why don't we need to > decrement when we put a hardware breakpoint on a stack modifying op? I assume you mean the INT1 instruction. Yeah, maybe we should care about that. I'm struggling to come up with any decent ideas about how to implement that. Presumably the #DB handler would have to communicate to the unwinder somehow whether the given frame is a trap. Alternatively I was thinking the unwinder could read the instruction, but then it doesn't know whether to read regs->ip or the previous instruction.
On Wed, Feb 15, 2023 at 03:16:37PM -0800, Josh Poimboeuf wrote: > On Wed, Feb 15, 2023 at 11:25:54AM +0100, Peter Zijlstra wrote: > > Well, I was specifically thinking about #DB, why don't we need to > > decrement when we put a hardware breakpoint on a stack modifying op? > > I assume you mean the INT1 instruction. Yeah, maybe we should care > about that. Nah, I was thinking #DB from DR7, but ... > I'm struggling to come up with any decent ideas about how to implement > that. Presumably the #DB handler would have to communicate to the > unwinder somehow whether the given frame is a trap. ... I had forgotten that #DB is not unconditionally trap :/ The worst part seems to be that code breakpoints are faults while data breakpoints are traps. And you so don't want to go decode the DR registers in the unwinder, quality mess this :/ Put a breakpoint on the stack and you've got PUSH doing a trap, put a breakpoint on the PUSH instruction and you get a fault, and lo and behold, you get a different unwind :-(
On Thu, Feb 16, 2023 at 11:46:30AM +0100, Peter Zijlstra wrote: > On Wed, Feb 15, 2023 at 03:16:37PM -0800, Josh Poimboeuf wrote: > > On Wed, Feb 15, 2023 at 11:25:54AM +0100, Peter Zijlstra wrote: > > > > Well, I was specifically thinking about #DB, why don't we need to > > > decrement when we put a hardware breakpoint on a stack modifying op? > > > > I assume you mean the INT1 instruction. Yeah, maybe we should care > > about that. > > Nah, I was thinking #DB from DR7, but ... > > > I'm struggling to come up with any decent ideas about how to implement > > that. Presumably the #DB handler would have to communicate to the > > unwinder somehow whether the given frame is a trap. > > ... I had forgotten that #DB is not unconditionally trap :/ The worst > part seems to be that code breakpoints are faults while data breakpoints > are traps. > > And you so don't want to go decode the DR registers in the unwinder, > quality mess this :/ > > Put a breakpoint on the stack and you've got PUSH doing a trap, put a > breakpoint on the PUSH instruction and you get a fault, and lo and > behold, you get a different unwind :-( It could be I'm just confusing things... when #DB traps it is actually because the instruction is complete, so looking up the ORC based on the next instruction is correct, while when #DB faults, it is because the instruction has not yet completed and again ORC lookup on IP just works. So while determining if #DB is trap or fault is a giant pain in the arse, it does not actually matter for the unwinder in this case. And with the INT3 thing the problem is that we've replaced an instruction that was supposed to do a stack op.
On Fri, Feb 10, 2023 at 02:42:02PM -0800, Josh Poimboeuf wrote: > The problem is that #BP saves the pointer to the instruction immediately > *after* the INT3, rather than to the INT3 itself. The instruction > replaced by the INT3 hasn't actually run, but ORC assumes otherwise and > expects the wrong stack layout. > > Fix it by annotating the #BP exception as a non-signal stack frame, > which tells the ORC unwinder to decrement the instruction pointer before > looking up the corresponding ORC entry. > > Reported-by: Chen Zhongjin <chenzhongjin@huawei.com> > Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> > --- > arch/x86/entry/entry_64.S | 9 ++++++++- > 1 file changed, 8 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S > index 15739a2c0983..8d21881adf86 100644 > --- a/arch/x86/entry/entry_64.S > +++ b/arch/x86/entry/entry_64.S > @@ -385,7 +385,14 @@ SYM_CODE_END(xen_error_entry) > */ > .macro idtentry vector asmsym cfunc has_error_code:req > SYM_CODE_START(\asmsym) > - UNWIND_HINT_IRET_REGS offset=\has_error_code*8 > + > + .if \vector == X86_TRAP_BP > + /* #BP advances %rip to the next instruction */ > + UNWIND_HINT_IRET_REGS offset=\has_error_code*8 signal=0 So the fact that INT3 is trap like is not the problem, the problem is that we use INT3 to overwrite stack modifying instruction and we should not assume those instructions have completed when in the #BP handler. Now, the reason all this actually works is because INT3 itself does not modify the stack so rewinding on non-overwrite INT3 instructions is invariant wrt stack state. > + .else > + UNWIND_HINT_IRET_REGS offset=\has_error_code*8 > + .endif
On Thu, 16 Feb 2023 12:30:24 +0100 Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, Feb 16, 2023 at 11:46:30AM +0100, Peter Zijlstra wrote: > > On Wed, Feb 15, 2023 at 03:16:37PM -0800, Josh Poimboeuf wrote: > > > On Wed, Feb 15, 2023 at 11:25:54AM +0100, Peter Zijlstra wrote: > > > > > > Well, I was specifically thinking about #DB, why don't we need to > > > > decrement when we put a hardware breakpoint on a stack modifying op? > > > > > > I assume you mean the INT1 instruction. Yeah, maybe we should care > > > about that. > > > > Nah, I was thinking #DB from DR7, but ... > > > > > I'm struggling to come up with any decent ideas about how to implement > > > that. Presumably the #DB handler would have to communicate to the > > > unwinder somehow whether the given frame is a trap. > > > > ... I had forgotten that #DB is not unconditionally trap :/ The worst > > part seems to be that code breakpoints are faults while data breakpoints > > are traps. > > > > And you so don't want to go decode the DR registers in the unwinder, > > quality mess this :/ > > > > Put a breakpoint on the stack and you've got PUSH doing a trap, put a > > breakpoint on the PUSH instruction and you get a fault, and lo and > > behold, you get a different unwind :-( > > It could be I'm just confusing things... when #DB traps it is actually > because the instruction is complete, so looking up the ORC based on the > next instruction is correct, while when #DB faults, it is because the > instruction has not yet completed and again ORC lookup on IP just works. > > So while determining if #DB is trap or fault is a giant pain in the > arse, it does not actually matter for the unwinder in this case. > > And with the INT3 thing the problem is that we've replaced an > instruction that was supposed to do a stack op. > If the kprobe checks whether the original instruction do a stack op and if so, setting a flag on current_kprobe will help unwinder finds that case? Of course all INT3 user may need to do this but it should be limited. Thank you,
On Thu, Feb 16, 2023 at 12:58:58PM +0100, Peter Zijlstra wrote: > On Fri, Feb 10, 2023 at 02:42:02PM -0800, Josh Poimboeuf wrote: > > > The problem is that #BP saves the pointer to the instruction immediately > > *after* the INT3, rather than to the INT3 itself. The instruction > > replaced by the INT3 hasn't actually run, but ORC assumes otherwise and > > expects the wrong stack layout. > > > > Fix it by annotating the #BP exception as a non-signal stack frame, > > which tells the ORC unwinder to decrement the instruction pointer before > > looking up the corresponding ORC entry. > > > > Reported-by: Chen Zhongjin <chenzhongjin@huawei.com> > > Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> > > --- > > arch/x86/entry/entry_64.S | 9 ++++++++- > > 1 file changed, 8 insertions(+), 1 deletion(-) > > > > diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S > > index 15739a2c0983..8d21881adf86 100644 > > --- a/arch/x86/entry/entry_64.S > > +++ b/arch/x86/entry/entry_64.S > > @@ -385,7 +385,14 @@ SYM_CODE_END(xen_error_entry) > > */ > > .macro idtentry vector asmsym cfunc has_error_code:req > > SYM_CODE_START(\asmsym) > > - UNWIND_HINT_IRET_REGS offset=\has_error_code*8 > > + > > + .if \vector == X86_TRAP_BP > > + /* #BP advances %rip to the next instruction */ > > + UNWIND_HINT_IRET_REGS offset=\has_error_code*8 signal=0 > > So the fact that INT3 is trap like is not the problem, the problem is > that we use INT3 to overwrite stack modifying instruction and we should > not assume those instructions have completed when in the #BP handler. > > Now, the reason all this actually works is because INT3 itself does not > modify the stack so rewinding on non-overwrite INT3 instructions is > invariant wrt stack state. Right, that's what my patch description attempting to say. That's also why I was asking about INT1, which is a trap. Do we care about INT1?
On Thu, Feb 16, 2023 at 11:35:19PM +0900, Masami Hiramatsu wrote: > > It could be I'm just confusing things... when #DB traps it is actually > > because the instruction is complete, so looking up the ORC based on the > > next instruction is correct, while when #DB faults, it is because the > > instruction has not yet completed and again ORC lookup on IP just works. > > > > So while determining if #DB is trap or fault is a giant pain in the > > arse, it does not actually matter for the unwinder in this case. > > > > And with the INT3 thing the problem is that we've replaced an > > instruction that was supposed to do a stack op. > > > > If the kprobe checks whether the original instruction do a stack op and > if so, setting a flag on current_kprobe will help unwinder finds that case? > > Of course all INT3 user may need to do this but it should be limited. No, for INT3, even if the original instruction wasn't a stack op, we can treat it the same way. Either way, we know the instruction hasn't executed so we can still use that address to look up the ORC entry.
On Thu, Feb 16, 2023 at 08:06:19AM -0800, Josh Poimboeuf wrote: > On Thu, Feb 16, 2023 at 12:58:58PM +0100, Peter Zijlstra wrote: > > On Fri, Feb 10, 2023 at 02:42:02PM -0800, Josh Poimboeuf wrote: > > > > > The problem is that #BP saves the pointer to the instruction immediately > > > *after* the INT3, rather than to the INT3 itself. The instruction > > > replaced by the INT3 hasn't actually run, but ORC assumes otherwise and > > > expects the wrong stack layout. > > > > > > Fix it by annotating the #BP exception as a non-signal stack frame, > > > which tells the ORC unwinder to decrement the instruction pointer before > > > looking up the corresponding ORC entry. > > > > > > Reported-by: Chen Zhongjin <chenzhongjin@huawei.com> > > > Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> > > > --- > > > arch/x86/entry/entry_64.S | 9 ++++++++- > > > 1 file changed, 8 insertions(+), 1 deletion(-) > > > > > > diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S > > > index 15739a2c0983..8d21881adf86 100644 > > > --- a/arch/x86/entry/entry_64.S > > > +++ b/arch/x86/entry/entry_64.S > > > @@ -385,7 +385,14 @@ SYM_CODE_END(xen_error_entry) > > > */ > > > .macro idtentry vector asmsym cfunc has_error_code:req > > > SYM_CODE_START(\asmsym) > > > - UNWIND_HINT_IRET_REGS offset=\has_error_code*8 > > > + > > > + .if \vector == X86_TRAP_BP > > > + /* #BP advances %rip to the next instruction */ > > > + UNWIND_HINT_IRET_REGS offset=\has_error_code*8 signal=0 > > > > So the fact that INT3 is trap like is not the problem, the problem is > > that we use INT3 to overwrite stack modifying instruction and we should > > not assume those instructions have completed when in the #BP handler. > > > > Now, the reason all this actually works is because INT3 itself does not > > modify the stack so rewinding on non-overwrite INT3 instructions is > > invariant wrt stack state. > > Right, that's what my patch description attempting to say. > > That's also why I was asking about INT1, which is a trap. Do we care > about INT1? We do not care about INT1, #DB is an IST and an allround pain in the backside, INT3 is where it's at.
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 15739a2c0983..8d21881adf86 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -385,7 +385,14 @@ SYM_CODE_END(xen_error_entry) */ .macro idtentry vector asmsym cfunc has_error_code:req SYM_CODE_START(\asmsym) - UNWIND_HINT_IRET_REGS offset=\has_error_code*8 + + .if \vector == X86_TRAP_BP + /* #BP advances %rip to the next instruction */ + UNWIND_HINT_IRET_REGS offset=\has_error_code*8 signal=0 + .else + UNWIND_HINT_IRET_REGS offset=\has_error_code*8 + .endif + ENDBR ASM_CLAC cld