diff mbox series

[-tip,v2] x86/kprobes: Drop removed INT3 handling code

Message ID	166981518895.1131462.4693062055762912734.stgit@devnote3
State	New
Headers	Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; From: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> To: x86@kernel.org Cc: Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>, Peter Zijlstra <peterz@infradead.org>, Masami Hiramatsu <mhiramat@kernel.org>, linux-kernel@vger.kernel.org, Steven Rostedt <rostedt@goodmis.org>, Huacai Chen <chenhuacai@loongson.cn>, Jinyang He <hejinyang@loongson.cn>, Tiezhu Yang <yangtiezhu@loongson.cn>, "Naveen N . Rao" <naveen.n.rao@linux.ibm.com> Subject: [PATCH -tip v2] x86/kprobes: Drop removed INT3 handling code Date: Wed, 30 Nov 2022 22:33:09 +0900 Message-Id: <166981518895.1131462.4693062055762912734.stgit@devnote3> User-Agent: StGit/0.19 MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 8bit Precedence: bulk
Series	[-tip,v2] x86/kprobes: Drop removed INT3 handling code \| [-tip,v2] x86/kprobes: Drop removed INT3 handling code

Commit Message

Masami Hiramatsu (Google) Nov. 30, 2022, 1:33 p.m. UTC

  From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Drop removed INT3 handling code from kprobe_int3_handler() because this
case (get_kprobe() doesn't return corresponding kprobe AND the INT3 is
removed) must not happen with the kprobe managed INT3, but can happen
with the non-kprobe INT3, which should be handled by other callbacks.

For the kprobe managed INT3, it is already safe. The commit 5c02ece81848d
("x86/kprobes: Fix ordering while text-patching") introduced
text_poke_sync() to the arch_disarm_kprobe() right after removing INT3.
Since this text_poke_sync() uses IPI to call sync_core() on all online
cpus, that ensures that all running INT3 exception handlers have done.
And, the unregister_kprobe() will remove the kprobe from the hash table
after arch_disarm_kprobe().

Thus, when the kprobe managed INT3 hits, kprobe_int3_handler() should
be able to find corresponding kprobe always by get_kprobe(). If it can
not find any kprobe, this means that is NOT a kprobe managed INT3.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 Changes in v2:
  - update comment to mention that the first safe commit.
---
 arch/x86/kernel/kprobes/core.c |   14 --------------
 1 file changed, 14 deletions(-)

Comments

Matthieu Baerts (NGI0) Jan. 20, 2024, 5:44 p.m. UTC | #1

Hi Masami and the x86 list,

On 30/11/2022 14:33, Masami Hiramatsu (Google) wrote:
> From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> 
> Drop removed INT3 handling code from kprobe_int3_handler() because this
> case (get_kprobe() doesn't return corresponding kprobe AND the INT3 is
> removed) must not happen with the kprobe managed INT3, but can happen
> with the non-kprobe INT3, which should be handled by other callbacks.
> 
> For the kprobe managed INT3, it is already safe. The commit 5c02ece81848d
> ("x86/kprobes: Fix ordering while text-patching") introduced
> text_poke_sync() to the arch_disarm_kprobe() right after removing INT3.
> Since this text_poke_sync() uses IPI to call sync_core() on all online
> cpus, that ensures that all running INT3 exception handlers have done.
> And, the unregister_kprobe() will remove the kprobe from the hash table
> after arch_disarm_kprobe().
> 
> Thus, when the kprobe managed INT3 hits, kprobe_int3_handler() should
> be able to find corresponding kprobe always by get_kprobe(). If it can
> not find any kprobe, this means that is NOT a kprobe managed INT3.
> 
> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
>  Changes in v2:
>   - update comment to mention that the first safe commit.
> ---
>  arch/x86/kernel/kprobes/core.c |   14 --------------
>  1 file changed, 14 deletions(-)


I'm sorry to reply on a patch that is more than one year old, but in
short, it looks like it is causing a kernel panic on our side. To be
honest, I don't understand the link, but I probably missed something.

A bit of context: for the MPTCP net subsystem, we are testing a new CI
service to launch a VM and run our test suite (kselftests, kunit, etc.).
This CI (Github Action) doesn't support KVM acceleration, and QEmu
(v6.2.0) falls back to TCG ("-machine accel=kvm:tcg"). Before, we were
always running the tests with QEmu and KVM support, and I don't think we
had this issue before. Now, in two weeks, this new CI reported 5 kernel
panic in ~30 runs.

I initially reported the issue on netdev [1], because the CI always got
the panic when doing some pings between different netns, not using MPTCP
yet. Eric Dumazet mentioned that it looks like it is an x86 issue, maybe
with the jump labels. Since then, I tried to 'git bisect' the issue on
my side, but it was not easy: hard to reproduce and unclear to me what
is causing it.

After a few (long) sessions, 'git bisect' gave me this commit:

  8e791f7eba4c ("x86/kprobes: Drop removed INT3 handling code")

I thought it was another false positive, but I was not able to reproduce
the issue when testing the parent commit, while I could still do that
when validating this 8e791f7eba4c commit. I also went back to our MPTCP
tree [2], reproduced the issue again, just to be sure. Then I tried to
reproduce it after having reverted 8e791f7eba4c. I didn't have any
panic, and I tried for a long time: ~2000 iterations, while I usually
have the kernel panic after ~50 iterations.

So it looks like the modifications done by this patch is linked to the
kernel panic we have:

> diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
> index 66299682b6b7..33390ed4dcf3 100644
> --- a/arch/x86/kernel/kprobes/core.c
> +++ b/arch/x86/kernel/kprobes/core.c
> @@ -986,20 +986,6 @@ int kprobe_int3_handler(struct pt_regs *regs)
>  			kprobe_post_process(p, regs, kcb);
>  			return 1;
>  		}
> -	}
> -
> -	if (*addr != INT3_INSN_OPCODE) {
> -		/*
> -		 * The breakpoint instruction was removed right
> -		 * after we hit it.  Another cpu has removed
> -		 * either a probepoint or a debugger breakpoint
> -		 * at this address.  In either case, no further
> -		 * handling of this interrupt is appropriate.
> -		 * Back up over the (now missing) int3 and run
> -		 * the original instruction.
> -		 */
> -		regs->ip = (unsigned long)addr;
> -		return 1;
>  	} /* else: not a kprobe fault; let the kernel handle it */
>  
>  	return 0;
> 
> 


As far as I know, our 'mptcp_connect.c' selftest that is being used to
reproduce the issue, is not using KProbes, tracing, dynamic debug, bpf,
etc. It simply creates netns with IP and routes, then do some pings
between the different IPs. Any idea why this commit could be linked to a
kernel panic?


Here is the last call trace I had:

> # INFO: validating network environment with pings
> [ 1985.073189] int3: 0000 [#1] PREEMPT SMP NOPTI
> [ 1985.073246] CPU: 0 PID: 3203 Comm: ping Not tainted 6.7.0-113761-g5e006770879c-dirty #250
> [ 1985.073246] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> [ 1985.073246] RIP: 0010:netif_rx_internal (arch/x86/include/asm/jump_label.h:27)
> [ 1985.073246] Code: 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f 44 00 00 55 48 89 fd 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 44 24 18 31 c0 e9 <c9> 00 00 00 66 90 66 90 48 8d 54 24 10 48 89 ef 65 8b 35 67 48 d0
> All code
> ========
>    0:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
>    7:   00
>    8:   0f 1f 40 00             nopl   0x0(%rax)
>    c:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>   11:   55                      push   %rbp
>   12:   48 89 fd                mov    %rdi,%rbp
>   15:   48 83 ec 20             sub    $0x20,%rsp
>   19:   65 48 8b 04 25 28 00    mov    %gs:0x28,%rax
>   20:   00 00
>   22:   48 89 44 24 18          mov    %rax,0x18(%rsp)
>   27:   31 c0                   xor    %eax,%eax
>   29:*  e9 c9 00 00 00          jmp    0xf7             <-- trapping instruction
>   2e:   66 90                   xchg   %ax,%ax
>   30:   66 90                   xchg   %ax,%ax
>   32:   48 8d 54 24 10          lea    0x10(%rsp),%rdx
>   37:   48 89 ef                mov    %rbp,%rdi
>   3a:   65                      gs
>   3b:   8b                      .byte 0x8b
>   3c:   35                      .byte 0x35
>   3d:   67                      addr32
>   3e:   48                      rex.W
>   3f:   d0                      .byte 0xd0
> 
> Code starting with the faulting instruction
> ===========================================
>    0:   c9                      leave
>    1:   00 00                   add    %al,(%rax)
>    3:   00 66 90                add    %ah,-0x70(%rsi)
>    6:   66 90                   xchg   %ax,%ax
>    8:   48 8d 54 24 10          lea    0x10(%rsp),%rdx
>    d:   48 89 ef                mov    %rbp,%rdi
>   10:   65                      gs
>   11:   8b                      .byte 0x8b
>   12:   35                      .byte 0x35
>   13:   67                      addr32
>   14:   48                      rex.W
>   15:   d0                      .byte 0xd0
> [ 1985.073246] RSP: 0018:ffffb36d40003c08 EFLAGS: 00000246
> [ 1985.073246] RAX: 0000000000000000 RBX: ffff9580825ca000 RCX: 0000000000000001
> [ 1985.073246] RDX: 0000000000000002 RSI: ffff9580825c8000 RDI: ffff9580821cca00
> [ 1985.073246] RBP: ffff9580821cca00 R08: 0000000000000000 R09: 000000000000001c
> [ 1985.073246] R10: ffff9580812dcf10 R11: ffff9580812dcf00 R12: ffff9580821cca00
> [ 1985.073246] R13: 000000000000002a R14: 0000000000000000 R15: ffff9580825e1800
> [ 1985.073246] FS:  00007fa7c46be1c0(0000) GS:ffff9580fdc00000(0000) knlGS:0000000000000000
> [ 1985.073246] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1985.073246] CR2: 00005584236b2200 CR3: 0000000002704000 CR4: 00000000000006f0
> [ 1985.073246] Call Trace:
> [ 1985.073246]  <IRQ>
> [ 1985.073246] ? die (arch/x86/kernel/dumpstack.c:421)
> [ 1985.073246] ? exc_int3 (arch/x86/kernel/traps.c:762)
> [ 1985.073246] ? asm_exc_int3 (arch/x86/include/asm/idtentry.h:569)
> [ 1985.073246] ? netif_rx_internal (arch/x86/include/asm/jump_label.h:27)
> [ 1985.073246] ? netif_rx_internal (arch/x86/include/asm/jump_label.h:27)
> [ 1985.073246] ? kmem_cache_alloc_node (mm/slub.c:3843)
> [ 1985.073246] __netif_rx (net/core/dev.c:5084)
> [ 1985.073246] veth_xmit (drivers/net/veth.c:321)
> [ 1985.073246] dev_hard_start_xmit (include/linux/netdevice.h:4989)
> [ 1985.073246] __dev_queue_xmit (include/linux/netdevice.h:3367)
> [ 1985.073246] ? arp_create (net/ipv4/arp.c:577)
> [ 1985.073246] arp_solicit (net/ipv4/arp.c:392)
> [ 1985.073246] ? kmem_cache_alloc (mm/slub.c:3843)
> [ 1985.073246] ? arp_constructor (net/ipv4/arp.c:249)
> [ 1985.073246] neigh_probe (arch/x86/include/asm/atomic.h:53)
> [ 1985.073246] __neigh_event_send (net/core/neighbour.c:1242)
> [ 1985.073246] neigh_resolve_output (net/core/neighbour.c:1547)
> [ 1985.073246] ip_finish_output2 (include/net/neighbour.h:542)
> [ 1985.073246] ? __ip_finish_output.part.0 (include/linux/skbuff.h:4884)
> [ 1985.073246] __netif_receive_skb_one_core (net/core/dev.c:5537)
> [ 1985.073246] process_backlog (include/linux/rcupdate.h:782)
> [ 1985.073246] __napi_poll (net/core/dev.c:6576)
> [ 1985.073246] net_rx_action (net/core/dev.c:6647)
> [ 1985.073246] __do_softirq (arch/x86/include/asm/jump_label.h:27)
> [ 1985.073246] do_softirq (kernel/softirq.c:454)
> [ 1985.073246]  </IRQ>
> [ 1985.073246]  <TASK>
> [ 1985.073246] __local_bh_enable_ip (kernel/softirq.c:381)
> [ 1985.073246] __dev_queue_xmit (net/core/dev.c:4379)
> [ 1985.073246] ip_finish_output2 (include/linux/netdevice.h:3171)
> [ 1985.073246] ? __ip_finish_output.part.0 (include/linux/skbuff.h:4884)
> [ 1985.073246] ip_push_pending_frames (net/ipv4/ip_output.c:1490)
> [ 1985.073246] raw_sendmsg (net/ipv4/raw.c:647)
> [ 1985.073246] ? netfs_rreq_assess (fs/netfs/io.c:101)
> [ 1985.073246] ? folio_add_file_rmap_ptes (arch/x86/include/asm/bitops.h:206)
> [ 1985.073246] ? set_pte_range (mm/memory.c:4529)
> [ 1985.073246] ? __sock_sendmsg (net/socket.c:733)
> [ 1985.073246] __sock_sendmsg (net/socket.c:733)
> [ 1985.073246] ? move_addr_to_kernel.part.0 (net/socket.c:253)
> [ 1985.073246] __sys_sendto (net/socket.c:2191)
> [ 1985.073246] ? __rseq_handle_notify_resume (kernel/rseq.c:257)
> [ 1985.073246] ? ktime_get_real_ts64 (kernel/time/timekeeping.c:292 (discriminator 3))
> [ 1985.073246] __x64_sys_sendto (net/socket.c:2203)
> [ 1985.073246] do_syscall_64 (arch/x86/entry/common.c:52)
> [ 1985.073246] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
> [ 1985.073246] RIP: 0033:0x7fa7c499081a
> [ 1985.073246] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
> All code
> ========
>    0:   d8 64 89 02             fsubs  0x2(%rcx,%rcx,4)
>    4:   48 c7 c0 ff ff ff ff    mov    $0xffffffffffffffff,%rax
>    b:   eb b8                   jmp    0xffffffffffffffc5
>    d:   0f 1f 00                nopl   (%rax)
>   10:   f3 0f 1e fa             endbr64
>   14:   41 89 ca                mov    %ecx,%r10d
>   17:   64 8b 04 25 18 00 00    mov    %fs:0x18,%eax
>   1e:   00
>   1f:   85 c0                   test   %eax,%eax
>   21:   75 15                   jne    0x38
>   23:   b8 2c 00 00 00          mov    $0x2c,%eax
>   28:   0f 05                   syscall
>   2a:*  48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax         <-- trapping instruction
>   30:   77 7e                   ja     0xb0
>   32:   c3                      ret
>   33:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>   38:   41 54                   push   %r12
>   3a:   48 83 ec 30             sub    $0x30,%rsp
>   3e:   44                      rex.R
>   3f:   89                      .byte 0x89
> 
> Code starting with the faulting instruction
> ===========================================
>    0:   48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax
>    6:   77 7e                   ja     0x86
>    8:   c3                      ret
>    9:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>    e:   41 54                   push   %r12
>   10:   48 83 ec 30             sub    $0x30,%rsp
>   14:   44                      rex.R
>   15:   89                      .byte 0x89
> [ 1985.073246] RSP: 002b:00007ffce269b368 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
> [ 1985.073246] RAX: ffffffffffffffda RBX: 00007ffce269ca10 RCX: 00007fa7c499081a
> [ 1985.073246] RDX: 0000000000000040 RSI: 0000558423f7c300 RDI: 0000000000000003
> [ 1985.073246] RBP: 0000558423f7c300 R08: 00007ffce269ec90 R09: 0000000000000010
> [ 1985.073246] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000040
> [ 1985.073246] R13: 00007ffce269c5a8 R14: 00007ffce269b370 R15: 00007ffce269ca10
> [ 1985.073246]  </TASK>
> [ 1985.073246] Modules linked in:
> [ 1985.073246] ---[ end trace 0000000000000000 ]---
> [ 1985.073246] RIP: 0010:netif_rx_internal (arch/x86/include/asm/jump_label.h:27)
> [ 1985.073246] Code: 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f 44 00 00 55 48 89 fd 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 44 24 18 31 c0 e9 <c9> 00 00 00 66 90 66 90 48 8d 54 24 10 48 89 ef 65 8b 35 67 48 d0
> All code
> ========
>    0:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
>    7:   00
>    8:   0f 1f 40 00             nopl   0x0(%rax)
>    c:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>   11:   55                      push   %rbp
>   12:   48 89 fd                mov    %rdi,%rbp
>   15:   48 83 ec 20             sub    $0x20,%rsp
>   19:   65 48 8b 04 25 28 00    mov    %gs:0x28,%rax
>   20:   00 00
>   22:   48 89 44 24 18          mov    %rax,0x18(%rsp)
>   27:   31 c0                   xor    %eax,%eax
>   29:*  e9 c9 00 00 00          jmp    0xf7             <-- trapping instruction
>   2e:   66 90                   xchg   %ax,%ax
>   30:   66 90                   xchg   %ax,%ax
>   32:   48 8d 54 24 10          lea    0x10(%rsp),%rdx
>   37:   48 89 ef                mov    %rbp,%rdi
>   3a:   65                      gs
>   3b:   8b                      .byte 0x8b
>   3c:   35                      .byte 0x35
>   3d:   67                      addr32
>   3e:   48                      rex.W
>   3f:   d0                      .byte 0xd0
> 
> Code starting with the faulting instruction
> ===========================================
>    0:   c9                      leave
>    1:   00 00                   add    %al,(%rax)
>    3:   00 66 90                add    %ah,-0x70(%rsi)
>    6:   66 90                   xchg   %ax,%ax
>    8:   48 8d 54 24 10          lea    0x10(%rsp),%rdx
>    d:   48 89 ef                mov    %rbp,%rdi
>   10:   65                      gs
>   11:   8b                      .byte 0x8b
>   12:   35                      .byte 0x35
>   13:   67                      addr32
>   14:   48                      rex.W
>   15:   d0                      .byte 0xd0
> [ 1985.073246] RSP: 0018:ffffb36d40003c08 EFLAGS: 00000246
> [ 1985.073246] RAX: 0000000000000000 RBX: ffff9580825ca000 RCX: 0000000000000001
> [ 1985.073246] RDX: 0000000000000002 RSI: ffff9580825c8000 RDI: ffff9580821cca00
> [ 1985.073246] RBP: ffff9580821cca00 R08: 0000000000000000 R09: 000000000000001c
> [ 1985.073246] R10: ffff9580812dcf10 R11: ffff9580812dcf00 R12: ffff9580821cca00
> [ 1985.073246] R13: 000000000000002a R14: 0000000000000000 R15: ffff9580825e1800
> [ 1985.073246] FS:  00007fa7c46be1c0(0000) GS:ffff9580fdc00000(0000) knlGS:0000000000000000
> [ 1985.073246] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1985.073246] CR2: 00005584236b2200 CR3: 0000000002704000 CR4: 00000000000006f0
> [ 1985.073246] Kernel panic - not syncing: Fatal exception in interrupt
> [ 1985.073246] Kernel Offset: 0x19a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> Unexpected stop of the VM


My kconfig is available in [3], here is an extract:

$ grep JUMP .config
# CONFIG_KEXEC_JUMP is not set
CONFIG_ARCH_SUPPORTS_KEXEC_JUMP=y
CONFIG_JUMP_LABEL=y
CONFIG_HAVE_ARCH_JUMP_LABEL=y
CONFIG_HAVE_ARCH_JUMP_LABEL_RELATIVE=y
CONFIG_HAVE_JUMP_LABEL_HACK=y

$ grep PROBES .config
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_KPROBES=y
CONFIG_OPTPROBES=y
CONFIG_KPROBES_ON_FTRACE=y
CONFIG_UPROBES=y
CONFIG_KRETPROBES=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_KPROBES_ON_FTRACE=y
# CONFIG_KPROBES_SANITY_TEST is not set

$ grep DYNAMIC .config
CONFIG_PREEMPT_DYNAMIC=y
CONFIG_DYNAMIC_MEMORY_LAYOUT=y
CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT=y
CONFIG_HAVE_PREEMPT_DYNAMIC=y
CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
CONFIG_DYNAMIC_SIGFRAME=y
# CONFIG_SWIOTLB_DYNAMIC is not set
CONFIG_DYNAMIC_DEBUG=y
CONFIG_DYNAMIC_DEBUG_CORE=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS=y
CONFIG_HAVE_DYNAMIC_FTRACE_NO_PATCHABLE=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS=y
CONFIG_DYNAMIC_FTRACE_WITH_ARGS=y
CONFIG_DYNAMIC_EVENTS=y
# CONFIG_TEST_DYNAMIC_DEBUG is not set


I reported the tests I did, userspace commands before triggering the
bug, call traces, kernel config, vmlinux files, etc. on a ticket that is
publicly accessible [4].


Do you mind having a look at this issue, and tell me what you think
about it, please?

I can run more tests and debug patches if it can help. Please note that
it looks like it is even harder to reproduce the kernel panic when added
more debug mechanisms, but I can always try.


[1]
https://lore.kernel.org/netdev/98724dcd-ddf3-4f78-a386-f966ffbc9528@kernel.org/T/
[2] on top of Linus' tree from 2 days ago: 736b5545d39c ("Merge tag
'net-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
[3] https://github.com/multipath-tcp/mptcp_net-next/files/13998625/config.gz
[4] https://github.com/multipath-tcp/mptcp_net-next/issues/471


Cheers,
Matt

Steven Rostedt Jan. 20, 2024, 10:05 p.m. UTC | #2

On Sat, 20 Jan 2024 18:44:38 +0100
Matthieu Baerts <matttbe@kernel.org> wrote:

> 
> I'm sorry to reply on a patch that is more than one year old, but in

No problem, I've done the same.

> short, it looks like it is causing a kernel panic on our side. To be
> honest, I don't understand the link, but I probably missed something.
> 
> A bit of context: for the MPTCP net subsystem, we are testing a new CI
> service to launch a VM and run our test suite (kselftests, kunit, etc.).
> This CI (Github Action) doesn't support KVM acceleration, and QEmu
> (v6.2.0) falls back to TCG ("-machine accel=kvm:tcg"). Before, we were
> always running the tests with QEmu and KVM support, and I don't think we
> had this issue before. Now, in two weeks, this new CI reported 5 kernel
> panic in ~30 runs.

I'm guessing that qemu doesn't do something that real hardware will do,
which is causing the bug.

> 
> I initially reported the issue on netdev [1], because the CI always got
> the panic when doing some pings between different netns, not using MPTCP
> yet. Eric Dumazet mentioned that it looks like it is an x86 issue, maybe
> with the jump labels. Since then, I tried to 'git bisect' the issue on
> my side, but it was not easy: hard to reproduce and unclear to me what
> is causing it.

It could possibly be due to jump_labels/static_branch as they use int3
as well.

> 
> After a few (long) sessions, 'git bisect' gave me this commit:
> 
>   8e791f7eba4c ("x86/kprobes: Drop removed INT3 handling code")
> 
> I thought it was another false positive, but I was not able to reproduce
> the issue when testing the parent commit, while I could still do that
> when validating this 8e791f7eba4c commit. I also went back to our MPTCP
> tree [2], reproduced the issue again, just to be sure. Then I tried to
> reproduce it after having reverted 8e791f7eba4c. I didn't have any
> panic, and I tried for a long time: ~2000 iterations, while I usually
> have the kernel panic after ~50 iterations.
> 
> So it looks like the modifications done by this patch is linked to the
> kernel panic we have:

Actually, I think it's possibly the other way around. The changes to
this patch actually allow the bug to be hit!

> 
> > diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
> > index 66299682b6b7..33390ed4dcf3 100644
> > --- a/arch/x86/kernel/kprobes/core.c
> > +++ b/arch/x86/kernel/kprobes/core.c
> > @@ -986,20 +986,6 @@ int kprobe_int3_handler(struct pt_regs *regs)
> >  			kprobe_post_process(p, regs, kcb);
> >  			return 1;
> >  		}
> > -	}
> > -
> > -	if (*addr != INT3_INSN_OPCODE) {

So if *addr != INT3_INST_OPCODE, then this returns 1

 addr = (kprobe_opcode_t *)(regs->ip - sizeof(kprobe_opcode_t));

Hmm, so the int3 is removed when this is called, and in that case, this
returns 1.


> > -		/*
> > -		 * The breakpoint instruction was removed right
> > -		 * after we hit it.  Another cpu has removed
> > -		 * either a probepoint or a debugger breakpoint
> > -		 * at this address.  In either case, no further
> > -		 * handling of this interrupt is appropriate.
> > -		 * Back up over the (now missing) int3 and run
> > -		 * the original instruction.
> > -		 */
> > -		regs->ip = (unsigned long)addr;
> > -		return 1;
> >  	} /* else: not a kprobe fault; let the kernel handle it */
> >  
> >  	return 0;
> > 
> >   
> 
> 
> As far as I know, our 'mptcp_connect.c' selftest that is being used to
> reproduce the issue, is not using KProbes, tracing, dynamic debug, bpf,
> etc. It simply creates netns with IP and routes, then do some pings
> between the different IPs. Any idea why this commit could be linked to a
> kernel panic?

The above function is called in traps.c:

static bool do_int3(struct pt_regs *regs)
{
        int res;

#ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
        if (kgdb_ll_trap(DIE_INT3, "int3", regs, 0, X86_TRAP_BP,
                         SIGTRAP) == NOTIFY_STOP)
                return true;
#endif /* CONFIG_KGDB_LOW_LEVEL_TRAP */

#ifdef CONFIG_KPROBES
        if (kprobe_int3_handler(regs))
                return true;
#endif
        res = notify_die(DIE_INT3, "int3", regs, 0, X86_TRAP_BP, SIGTRAP);

        return res == NOTIFY_STOP;
}

If the kprobe_int3_handler() returns 1 it shortcuts the do_int3()
function and it never gets to the notify_die() call.

If something called into this and that notify_die() causes the panic,
the kprobe patch prevented that by sheer luck.

That is, the patch prevented whatever is not working from calling the
notify_die() as it should have happened, and the panic was avoided. By
adding this commit, it allowed the bug to trigger the panic. The crash
has nothing to do with kporbes, it was just that the kprobe handler was
incorrectly short-cutting the do_in3() function and preventing the
panic.

> 
> 
> Here is the last call trace I had:
> 
> > # INFO: validating network environment with pings
> > [ 1985.073189] int3: 0000 [#1] PREEMPT SMP NOPTI
> > [ 1985.073246] CPU: 0 PID: 3203 Comm: ping Not tainted 6.7.0-113761-g5e006770879c-dirty #250
> > [ 1985.073246] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> > [ 1985.073246] RIP: 0010:netif_rx_internal (arch/x86/include/asm/jump_label.h:27)

Yep, definitely a jump_label issue.

> > [ 1985.073246] Code: 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f 44 00 00 55 48 89 fd 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 44 24 18 31 c0 e9 <c9> 00 00 00 66 90 66 90 48 8d 54 24 10 48 89 ef 65 8b 35 67 48 d0
> > All code
> > ========
> >    0:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
> >    7:   00
> >    8:   0f 1f 40 00             nopl   0x0(%rax)
> >    c:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> >   11:   55                      push   %rbp
> >   12:   48 89 fd                mov    %rdi,%rbp
> >   15:   48 83 ec 20             sub    $0x20,%rsp
> >   19:   65 48 8b 04 25 28 00    mov    %gs:0x28,%rax
> >   20:   00 00
> >   22:   48 89 44 24 18          mov    %rax,0x18(%rsp)
> >   27:   31 c0                   xor    %eax,%eax
> >   29:*  e9 c9 00 00 00          jmp    0xf7             <-- trapping instruction
> >   2e:   66 90                   xchg   %ax,%ax
> >   30:   66 90                   xchg   %ax,%ax
> >   32:   48 8d 54 24 10          lea    0x10(%rsp),%rdx
> >   37:   48 89 ef                mov    %rbp,%rdi
> >   3a:   65                      gs
> >   3b:   8b                      .byte 0x8b
> >   3c:   35                      .byte 0x35
> >   3d:   67                      addr32
> >   3e:   48                      rex.W
> >   3f:   d0                      .byte 0xd0
> > 

[..]

> 
> I reported the tests I did, userspace commands before triggering the
> bug, call traces, kernel config, vmlinux files, etc. on a ticket that is
> publicly accessible [4].
> 
> 
> Do you mind having a look at this issue, and tell me what you think
> about it, please?

Just did ;-)

So for some reason, jump labels is causing a crash. I can try to look
more into this on Monday.

-- Steve


> 
> I can run more tests and debug patches if it can help. Please note that
> it looks like it is even harder to reproduce the kernel panic when added
> more debug mechanisms, but I can always try.
> 
> 
> [1]
> https://lore.kernel.org/netdev/98724dcd-ddf3-4f78-a386-f966ffbc9528@kernel.org/T/
> [2] on top of Linus' tree from 2 days ago: 736b5545d39c ("Merge tag
> 'net-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
> [3] https://github.com/multipath-tcp/mptcp_net-next/files/13998625/config.gz
> [4] https://github.com/multipath-tcp/mptcp_net-next/issues/471
> 
> 
> Cheers,
> Matt

Masami Hiramatsu (Google) Jan. 21, 2024, 2:28 a.m. UTC | #3

On Sat, 20 Jan 2024 17:05:17 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Sat, 20 Jan 2024 18:44:38 +0100
> Matthieu Baerts <matttbe@kernel.org> wrote:
> 
> > 
> > I'm sorry to reply on a patch that is more than one year old, but in
> 
> No problem, I've done the same.

Yeah, thanks for reporting! I realized the problem.

> 
> > short, it looks like it is causing a kernel panic on our side. To be
> > honest, I don't understand the link, but I probably missed something.
> > 
> > A bit of context: for the MPTCP net subsystem, we are testing a new CI
> > service to launch a VM and run our test suite (kselftests, kunit, etc.).
> > This CI (Github Action) doesn't support KVM acceleration, and QEmu
> > (v6.2.0) falls back to TCG ("-machine accel=kvm:tcg"). Before, we were
> > always running the tests with QEmu and KVM support, and I don't think we
> > had this issue before. Now, in two weeks, this new CI reported 5 kernel
> > panic in ~30 runs.
> 
> I'm guessing that qemu doesn't do something that real hardware will do,
> which is causing the bug.

If this is the timing bug, it is not qemu's issue, but ours.

> > I initially reported the issue on netdev [1], because the CI always got
> > the panic when doing some pings between different netns, not using MPTCP
> > yet. Eric Dumazet mentioned that it looks like it is an x86 issue, maybe
> > with the jump labels. Since then, I tried to 'git bisect' the issue on
> > my side, but it was not easy: hard to reproduce and unclear to me what
> > is causing it.
> 
> It could possibly be due to jump_labels/static_branch as they use int3
> as well.

Yeah, it seems like. Does jump_labels/static_branch wait until all interrupts
exit before removing their object from the "active list"?

kprobes does this but I found it might be *not enough*.
When removing a kprobe, we do

 1. disarm int3
 2. remove kprobe from hlist ("active list") by hlist_del_rcu()
 3. wait for rcu
 4. free kprobe's trampoline

The possible scenario is

          CPU0                      CPU1
                                 0. hit int3
 1. disarm int3
 2. remove kprobe from hlist
                                2.1 run do_int3()
                                2.2 kprobe_int3_handler() failed (*)
                                2.3 notify_die() failed
                                2.4 kernel panic
 3. wait for rcu
 4. free kprobe's trampoline

(*) because the corresponding kprobe is already removed from the
    active list.

Thus exc_int3() needs a check whether the int3 is already removed or not
before it decides that int3 is a stray int3 (== returning false).

Or, another possible solution is to add another synchronize_rcu()
or sync_core() right after disarming int3 so that we ensure all
int3 handler at that point are finished.

> 
> > 
> > After a few (long) sessions, 'git bisect' gave me this commit:
> > 
> >   8e791f7eba4c ("x86/kprobes: Drop removed INT3 handling code")
> > 
> > I thought it was another false positive, but I was not able to reproduce
> > the issue when testing the parent commit, while I could still do that
> > when validating this 8e791f7eba4c commit. I also went back to our MPTCP
> > tree [2], reproduced the issue again, just to be sure. Then I tried to
> > reproduce it after having reverted 8e791f7eba4c. I didn't have any
> > panic, and I tried for a long time: ~2000 iterations, while I usually
> > have the kernel panic after ~50 iterations.
> > 
> > So it looks like the modifications done by this patch is linked to the
> > kernel panic we have:
> 
> Actually, I think it's possibly the other way around. The changes to
> this patch actually allow the bug to be hit!
> 
> > 
> > > diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
> > > index 66299682b6b7..33390ed4dcf3 100644
> > > --- a/arch/x86/kernel/kprobes/core.c
> > > +++ b/arch/x86/kernel/kprobes/core.c
> > > @@ -986,20 +986,6 @@ int kprobe_int3_handler(struct pt_regs *regs)
> > >  			kprobe_post_process(p, regs, kcb);
> > >  			return 1;
> > >  		}
> > > -	}
> > > -
> > > -	if (*addr != INT3_INSN_OPCODE) {
> 
> So if *addr != INT3_INST_OPCODE, then this returns 1
> 
>  addr = (kprobe_opcode_t *)(regs->ip - sizeof(kprobe_opcode_t));
> 
> Hmm, so the int3 is removed when this is called, and in that case, this
> returns 1.

Yes, this is a kind of "fixup" path.

> 
> 
> > > -		/*
> > > -		 * The breakpoint instruction was removed right
> > > -		 * after we hit it.  Another cpu has removed
> > > -		 * either a probepoint or a debugger breakpoint
> > > -		 * at this address.  In either case, no further
> > > -		 * handling of this interrupt is appropriate.
> > > -		 * Back up over the (now missing) int3 and run
> > > -		 * the original instruction.
> > > -		 */
> > > -		regs->ip = (unsigned long)addr;
> > > -		return 1;
> > >  	} /* else: not a kprobe fault; let the kernel handle it */
> > >  
> > >  	return 0;
> > > 
> > >   
> > 
> > 
> > As far as I know, our 'mptcp_connect.c' selftest that is being used to
> > reproduce the issue, is not using KProbes, tracing, dynamic debug, bpf,
> > etc. It simply creates netns with IP and routes, then do some pings
> > between the different IPs. Any idea why this commit could be linked to a
> > kernel panic?
> 
> The above function is called in traps.c:
> 
> static bool do_int3(struct pt_regs *regs)
> {
>         int res;
> 
> #ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
>         if (kgdb_ll_trap(DIE_INT3, "int3", regs, 0, X86_TRAP_BP,
>                          SIGTRAP) == NOTIFY_STOP)
>                 return true;
> #endif /* CONFIG_KGDB_LOW_LEVEL_TRAP */
> 
> #ifdef CONFIG_KPROBES
>         if (kprobe_int3_handler(regs))
>                 return true;
> #endif
>         res = notify_die(DIE_INT3, "int3", regs, 0, X86_TRAP_BP, SIGTRAP);
> 
>         return res == NOTIFY_STOP;
> }
> 
> If the kprobe_int3_handler() returns 1 it shortcuts the do_int3()
> function and it never gets to the notify_die() call.
> 
> If something called into this and that notify_die() causes the panic,
> the kprobe patch prevented that by sheer luck.

Yes, it prevents the panic unexpectedly :)

> 
> That is, the patch prevented whatever is not working from calling the
> notify_die() as it should have happened, and the panic was avoided. By
> adding this commit, it allowed the bug to trigger the panic. The crash
> has nothing to do with kporbes, it was just that the kprobe handler was
> incorrectly short-cutting the do_in3() function and preventing the
> panic.

Yeah, but this should not be done in the kprobe_int3_handler() but
exc_int3() should do.

Thank you,

> 
> > 
> > 
> > Here is the last call trace I had:
> > 
> > > # INFO: validating network environment with pings
> > > [ 1985.073189] int3: 0000 [#1] PREEMPT SMP NOPTI
> > > [ 1985.073246] CPU: 0 PID: 3203 Comm: ping Not tainted 6.7.0-113761-g5e006770879c-dirty #250
> > > [ 1985.073246] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> > > [ 1985.073246] RIP: 0010:netif_rx_internal (arch/x86/include/asm/jump_label.h:27)
> 
> Yep, definitely a jump_label issue.
> 
> > > [ 1985.073246] Code: 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f 44 00 00 55 48 89 fd 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 44 24 18 31 c0 e9 <c9> 00 00 00 66 90 66 90 48 8d 54 24 10 48 89 ef 65 8b 35 67 48 d0
> > > All code
> > > ========
> > >    0:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
> > >    7:   00
> > >    8:   0f 1f 40 00             nopl   0x0(%rax)
> > >    c:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> > >   11:   55                      push   %rbp
> > >   12:   48 89 fd                mov    %rdi,%rbp
> > >   15:   48 83 ec 20             sub    $0x20,%rsp
> > >   19:   65 48 8b 04 25 28 00    mov    %gs:0x28,%rax
> > >   20:   00 00
> > >   22:   48 89 44 24 18          mov    %rax,0x18(%rsp)
> > >   27:   31 c0                   xor    %eax,%eax
> > >   29:*  e9 c9 00 00 00          jmp    0xf7             <-- trapping instruction
> > >   2e:   66 90                   xchg   %ax,%ax
> > >   30:   66 90                   xchg   %ax,%ax
> > >   32:   48 8d 54 24 10          lea    0x10(%rsp),%rdx
> > >   37:   48 89 ef                mov    %rbp,%rdi
> > >   3a:   65                      gs
> > >   3b:   8b                      .byte 0x8b
> > >   3c:   35                      .byte 0x35
> > >   3d:   67                      addr32
> > >   3e:   48                      rex.W
> > >   3f:   d0                      .byte 0xd0
> > > 
> 
> [..]
> 
> > 
> > I reported the tests I did, userspace commands before triggering the
> > bug, call traces, kernel config, vmlinux files, etc. on a ticket that is
> > publicly accessible [4].
> > 
> > 
> > Do you mind having a look at this issue, and tell me what you think
> > about it, please?
> 
> Just did ;-)
> 
> So for some reason, jump labels is causing a crash. I can try to look
> more into this on Monday.
> 
> -- Steve
> 
> 
> > 
> > I can run more tests and debug patches if it can help. Please note that
> > it looks like it is even harder to reproduce the kernel panic when added
> > more debug mechanisms, but I can always try.
> > 
> > 
> > [1]
> > https://lore.kernel.org/netdev/98724dcd-ddf3-4f78-a386-f966ffbc9528@kernel.org/T/
> > [2] on top of Linus' tree from 2 days ago: 736b5545d39c ("Merge tag
> > 'net-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
> > [3] https://github.com/multipath-tcp/mptcp_net-next/files/13998625/config.gz
> > [4] https://github.com/multipath-tcp/mptcp_net-next/issues/471
> > 
> > 
> > Cheers,
> > Matt
> 
>

Masami Hiramatsu (Google) Jan. 21, 2024, 9:05 a.m. UTC | #4

On Sun, 21 Jan 2024 11:28:52 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> On Sat, 20 Jan 2024 17:05:17 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Sat, 20 Jan 2024 18:44:38 +0100
> > Matthieu Baerts <matttbe@kernel.org> wrote:
> > 
> > > 
> > > I'm sorry to reply on a patch that is more than one year old, but in
> > 
> > No problem, I've done the same.
> 
> Yeah, thanks for reporting! I realized the problem.
> 
> > 
> > > short, it looks like it is causing a kernel panic on our side. To be
> > > honest, I don't understand the link, but I probably missed something.
> > > 
> > > A bit of context: for the MPTCP net subsystem, we are testing a new CI
> > > service to launch a VM and run our test suite (kselftests, kunit, etc.).
> > > This CI (Github Action) doesn't support KVM acceleration, and QEmu
> > > (v6.2.0) falls back to TCG ("-machine accel=kvm:tcg"). Before, we were
> > > always running the tests with QEmu and KVM support, and I don't think we
> > > had this issue before. Now, in two weeks, this new CI reported 5 kernel
> > > panic in ~30 runs.
> > 
> > I'm guessing that qemu doesn't do something that real hardware will do,
> > which is causing the bug.
> 
> If this is the timing bug, it is not qemu's issue, but ours.

Hmm, as far as I can see, the jump_label uses text_poke_bp(_batch) which
send IPI for sync_core() on each core, after replacing INT3 with other opcode.

void text_poke_sync(void)
{
        on_each_cpu(do_sync_core, NULL, 1);
}

static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
{
[...]
        /*
         * Third step: replace the first byte (int3) by the first byte of
         * replacing opcode.
         */
        for (do_sync = 0, i = 0; i < nr_entries; i++) {
                u8 byte = tp[i].text[0];

                if (tp[i].len == 6)
                        byte = 0x0f;

                if (byte == INT3_INSN_OPCODE)
                        continue;

                text_poke(text_poke_addr(&tp[i]), &byte, INT3_INSN_SIZE);
                do_sync++;
        }

        if (do_sync)
                text_poke_sync();
[...]

This must ensure those processor should finish running INT3 exception
handlers because the IPI is handled outside of the INT3 exception.

However, if the I-cache entry servives text_poke() and sync_core(), this
problem may happen.
The text_poke() flushes TLB but for the local (!global) PTE, and sync_core()
just serialize (!= cache flushing?). Thus the other CPUs can still see the
INT3 after text_poke_sync()? If so, on such CPU, removed INT3 is still
alive on the I-cache and hit it after text_poke_sync().
This will be a ghost INT3...


> 
> > > I initially reported the issue on netdev [1], because the CI always got
> > > the panic when doing some pings between different netns, not using MPTCP
> > > yet. Eric Dumazet mentioned that it looks like it is an x86 issue, maybe
> > > with the jump labels. Since then, I tried to 'git bisect' the issue on
> > > my side, but it was not easy: hard to reproduce and unclear to me what
> > > is causing it.
> > 
> > It could possibly be due to jump_labels/static_branch as they use int3
> > as well.
> 
> Yeah, it seems like. Does jump_labels/static_branch wait until all interrupts
> exit before removing their object from the "active list"?
> 
> kprobes does this but I found it might be *not enough*.
> When removing a kprobe, we do
> 
>  1. disarm int3
>  2. remove kprobe from hlist ("active list") by hlist_del_rcu()
>  3. wait for rcu
>  4. free kprobe's trampoline
> 
> The possible scenario is
> 
>           CPU0                      CPU1
>                                  0. hit int3
>  1. disarm int3
>  2. remove kprobe from hlist
>                                 2.1 run do_int3()
>                                 2.2 kprobe_int3_handler() failed (*)
>                                 2.3 notify_die() failed
>                                 2.4 kernel panic
>  3. wait for rcu
>  4. free kprobe's trampoline
> 
> (*) because the corresponding kprobe is already removed from the
>     active list.
> 
> Thus exc_int3() needs a check whether the int3 is already removed or not
> before it decides that int3 is a stray int3 (== returning false).
> 
> Or, another possible solution is to add another synchronize_rcu()
> or sync_core() right after disarming int3 so that we ensure all
> int3 handler at that point are finished.

So this another solution is already done. I think we need to add the
ghost INT3 check in the exc_int3() as follows;

Thank you,

From add8cf7da99cdb096a0d6765b3dc5de9a3ea3019 Mon Sep 17 00:00:00 2001
From: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Date: Sun, 21 Jan 2024 17:16:50 +0900
Subject: [PATCH] x86: Fixup from the removed INT3 if it is unhandled

INT3 is used not only for software breakpoint, but also self modifying
code on x86 in the kernel. For example, jump_label, function tracer etc.
Those may not handle INT3 after removing it but not waiting for
synchronizing CPUs enough. Since such 'ghost' INT3 is not handled by
anyone because they think it has been removed already.
Recheck there is INT3 on the exception address and if not, ignore it.

Note that previously kprobes does the same thing by itself, but that is
not a good location to do that because INT3 is commonly used. Do it at
the common place so that it can handle all 'ghost' INT3.

Reported-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fixes: 8e791f7eba4c ("x86/kprobes: Drop removed INT3 handling code")
---
 arch/x86/kernel/traps.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index c876f1d36a81..f3e7a99c21fe 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -720,6 +720,25 @@ static bool do_int3(struct pt_regs *regs)
 }
 NOKPROBE_SYMBOL(do_int3);
 
+static bool fixup_int3(struct pt_regs *regs)
+{
+	unsigned long addr = instruction_pointer(regs) - INT3_INSN_SIZE;
+
+	if (*(u8 *)addr != INT3_INSN_OPCODE) {
+		/*
+		 * The breakpoint instruction was removed right
+		 * after we hit it.  Another cpu has removed it
+		 * from this address.  In this case, no further
+		 * handling of this interrupt is appropriate.
+		 * Back up over the (now missing) int3 and run
+		 * the original instruction.
+		 */
+		instruction_pointer_set(regs, (unsigned long)addr);
+		return true;
+	}
+	return false;
+}
+
 static void do_int3_user(struct pt_regs *regs)
 {
 	if (do_int3(regs))
@@ -757,7 +776,7 @@ DEFINE_IDTENTRY_RAW(exc_int3)
 		irqentry_state_t irq_state = irqentry_nmi_enter(regs);
 
 		instrumentation_begin();
-		if (!do_int3(regs))
+		if (!do_int3(regs) && !fixup_int3(regs))
 			die("int3", regs, 0);
 		instrumentation_end();
 		irqentry_nmi_exit(regs, irq_state);

Matthieu Baerts (NGI0) Jan. 21, 2024, 3:23 p.m. UTC | #5

Hi Masami, Steven,

On 21/01/2024 10:05, Masami Hiramatsu (Google) wrote:
> On Sun, 21 Jan 2024 11:28:52 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> 
>> On Sat, 20 Jan 2024 17:05:17 -0500
>> Steven Rostedt <rostedt@goodmis.org> wrote:
>>
>>> On Sat, 20 Jan 2024 18:44:38 +0100
>>> Matthieu Baerts <matttbe@kernel.org> wrote:
>>>
>>>>
>>>> I'm sorry to reply on a patch that is more than one year old, but in
>>>
>>> No problem, I've done the same.
>>
>> Yeah, thanks for reporting! I realized the problem.

Thank you both for your quick reply, very useful explanations, analysis
and patch!

(...)

> So this another solution is already done. I think we need to add the
> ghost INT3 check in the exc_int3() as follows;
> 
> Thank you,
> 
> From add8cf7da99cdb096a0d6765b3dc5de9a3ea3019 Mon Sep 17 00:00:00 2001
> From: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
> Date: Sun, 21 Jan 2024 17:16:50 +0900
> Subject: [PATCH] x86: Fixup from the removed INT3 if it is unhandled
> 
> INT3 is used not only for software breakpoint, but also self modifying
> code on x86 in the kernel. For example, jump_label, function tracer etc.
> Those may not handle INT3 after removing it but not waiting for
> synchronizing CPUs enough. Since such 'ghost' INT3 is not handled by
> anyone because they think it has been removed already.
> Recheck there is INT3 on the exception address and if not, ignore it.
> 
> Note that previously kprobes does the same thing by itself, but that is
> not a good location to do that because INT3 is commonly used. Do it at
> the common place so that it can handle all 'ghost' INT3.

I just tested it, and I was able to run pings for 3h without any issues!

While at it, and just to be on the safe side, I also re-run the tests
after having added a "pr_warn()" -- I know, using printk(), especially
when talking to you... but I was not sure what was safe to use at this
place in the code :) -- before returning "true" in the new function you
added, and we can see that the crash is avoided thanks to the new code:

[   27.422518] traps: crash avoided, addr=18446744071882050317
[   27.426182] traps: crash avoided, addr=18446744071882050317

[  370.483208] traps: crash avoided, addr=18446744071882075656
[  370.485066] traps: crash avoided, addr=18446744071882075656
[  370.485084] traps: crash avoided, addr=18446744071882075656

[  592.866416] traps: crash avoided, addr=18446744071882075656
[  592.867937] traps: crash avoided, addr=18446744071882075656

[  980.988342] traps: crash avoided, addr=18446744071882050317
[  980.989866] traps: crash avoided, addr=18446744071882050317

(from my VM running with 2 CPU cores)


Again, thank you for the fix!

(Just in case you need it:)

Tested-by: Matthieu Baerts <matttbe@kernel.org>

Cheers,
Matt

Steven Rostedt Jan. 21, 2024, 3:31 p.m. UTC | #6

On Sun, 21 Jan 2024 16:23:35 +0100
Matthieu Baerts <matttbe@kernel.org> wrote:

> Hi Masami, Steven,
> 
> On 21/01/2024 10:05, Masami Hiramatsu (Google) wrote:
> > On Sun, 21 Jan 2024 11:28:52 +0900
> > Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> >   
> >> On Sat, 20 Jan 2024 17:05:17 -0500
> >> Steven Rostedt <rostedt@goodmis.org> wrote:
> >>  
> >>> On Sat, 20 Jan 2024 18:44:38 +0100
> >>> Matthieu Baerts <matttbe@kernel.org> wrote:
> >>>  
> >>>>
> >>>> I'm sorry to reply on a patch that is more than one year old, but in  
> >>>
> >>> No problem, I've done the same.  
> >>
> >> Yeah, thanks for reporting! I realized the problem.  
> 
> Thank you both for your quick reply, very useful explanations, analysis
> and patch!
> 
> (...)
> 
> > So this another solution is already done. I think we need to add the
> > ghost INT3 check in the exc_int3() as follows;
> > 
> > Thank you,
> > 
> > From add8cf7da99cdb096a0d6765b3dc5de9a3ea3019 Mon Sep 17 00:00:00 2001
> > From: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
> > Date: Sun, 21 Jan 2024 17:16:50 +0900
> > Subject: [PATCH] x86: Fixup from the removed INT3 if it is unhandled
> > 
> > INT3 is used not only for software breakpoint, but also self modifying
> > code on x86 in the kernel. For example, jump_label, function tracer etc.
> > Those may not handle INT3 after removing it but not waiting for
> > synchronizing CPUs enough. Since such 'ghost' INT3 is not handled by
> > anyone because they think it has been removed already.
> > Recheck there is INT3 on the exception address and if not, ignore it.
> > 
> > Note that previously kprobes does the same thing by itself, but that is
> > not a good location to do that because INT3 is commonly used. Do it at
> > the common place so that it can handle all 'ghost' INT3.  
> 
> I just tested it, and I was able to run pings for 3h without any issues!
> 
> While at it, and just to be on the safe side, I also re-run the tests
> after having added a "pr_warn()" -- I know, using printk(), especially
> when talking to you... but I was not sure what was safe to use at this
> place in the code :) -- before returning "true" in the new function you
> added, and we can see that the crash is avoided thanks to the new code:
> 
> [   27.422518] traps: crash avoided, addr=18446744071882050317
> [   27.426182] traps: crash avoided, addr=18446744071882050317
> 
> [  370.483208] traps: crash avoided, addr=18446744071882075656
> [  370.485066] traps: crash avoided, addr=18446744071882075656
> [  370.485084] traps: crash avoided, addr=18446744071882075656
> 
> [  592.866416] traps: crash avoided, addr=18446744071882075656
> [  592.867937] traps: crash avoided, addr=18446744071882075656
> 
> [  980.988342] traps: crash avoided, addr=18446744071882050317
> [  980.989866] traps: crash avoided, addr=18446744071882050317
> 
> (from my VM running with 2 CPU cores)
> 
> 
> Again, thank you for the fix!
> 
> (Just in case you need it:)
> 
> Tested-by: Matthieu Baerts <matttbe@kernel.org>

The thing is, the bug is with qemu and *not* the kernel. Masami's patch
just paper's over the real bug, and worse, if the kernel has a bug
that's not doing proper synchronization, the patch will keep it from
being detected. So no, I do not think this is the proper solution.

The real problem is that qemu does not seem to be honoring the memory
barriers of an interrupt. The reason the code does the ipi's is to
force a full memory barrier across all CPUs so that they all see the
same memory before going forward to the next step.

My guess is that qemu does not treat the IPI being sent as a memory
barrier, and then the CPUs do not see a consistent memory view after
the IPIs are sent. That's a bug in qemu!

This should be reported to the qemu community and should be fixed
there. In the mean time, feel free to use Masami's patch in your local
repo until qemu is fixed, but it should not be added to Linux mainline.

-- Steve

Steven Rostedt Jan. 21, 2024, 3:33 p.m. UTC | #7

On Sun, 21 Jan 2024 10:31:44 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> My guess is that qemu does not treat the IPI being sent as a memory
> barrier, and then the CPUs do not see a consistent memory view after
> the IPIs are sent. That's a bug in qemu!

More specifically, I bet qemu may be doing a dcache barrier, but not an
icache barrier in the interrupt. If the code is already in qemu's
pipeline, it may not be flushing it like real hardware would do.

-- Steve

Steven Rostedt Jan. 21, 2024, 3:42 p.m. UTC | #8

On Sun, 21 Jan 2024 18:05:44 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> However, if the I-cache entry servives text_poke() and sync_core(), this
> problem may happen.
> The text_poke() flushes TLB but for the local (!global) PTE, and sync_core()
> just serialize (!= cache flushing?). Thus the other CPUs can still see the

Yes, the purpose of the IPIs are for cache flushing, including icache.

> INT3 after text_poke_sync()? If so, on such CPU, removed INT3 is still
> alive on the I-cache and hit it after text_poke_sync().
> This will be a ghost INT3...

An interrupt is a full memory barrier and it looks like qemu is not
honoring that. Thus the bug is in qemu and not the kernel.

-- Steve

Matthieu Baerts (NGI0) Jan. 21, 2024, 4:20 p.m. UTC | #9

Hi Steven,

On 21/01/2024 16:31, Steven Rostedt wrote:
> On Sun, 21 Jan 2024 16:23:35 +0100
> Matthieu Baerts <matttbe@kernel.org> wrote:
> 
>> Hi Masami, Steven,
>>
>> On 21/01/2024 10:05, Masami Hiramatsu (Google) wrote:
>>> On Sun, 21 Jan 2024 11:28:52 +0900
>>> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
>>>   
>>>> On Sat, 20 Jan 2024 17:05:17 -0500
>>>> Steven Rostedt <rostedt@goodmis.org> wrote:
>>>>  
>>>>> On Sat, 20 Jan 2024 18:44:38 +0100
>>>>> Matthieu Baerts <matttbe@kernel.org> wrote:
>>>>>  
>>>>>>
>>>>>> I'm sorry to reply on a patch that is more than one year old, but in  
>>>>>
>>>>> No problem, I've done the same.  
>>>>
>>>> Yeah, thanks for reporting! I realized the problem.  
>>
>> Thank you both for your quick reply, very useful explanations, analysis
>> and patch!
>>
>> (...)
>>
>>> So this another solution is already done. I think we need to add the
>>> ghost INT3 check in the exc_int3() as follows;
>>>
>>> Thank you,
>>>
>>> From add8cf7da99cdb096a0d6765b3dc5de9a3ea3019 Mon Sep 17 00:00:00 2001
>>> From: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
>>> Date: Sun, 21 Jan 2024 17:16:50 +0900
>>> Subject: [PATCH] x86: Fixup from the removed INT3 if it is unhandled
>>>
>>> INT3 is used not only for software breakpoint, but also self modifying
>>> code on x86 in the kernel. For example, jump_label, function tracer etc.
>>> Those may not handle INT3 after removing it but not waiting for
>>> synchronizing CPUs enough. Since such 'ghost' INT3 is not handled by
>>> anyone because they think it has been removed already.
>>> Recheck there is INT3 on the exception address and if not, ignore it.
>>>
>>> Note that previously kprobes does the same thing by itself, but that is
>>> not a good location to do that because INT3 is commonly used. Do it at
>>> the common place so that it can handle all 'ghost' INT3.  
>>
>> I just tested it, and I was able to run pings for 3h without any issues!
>>
>> While at it, and just to be on the safe side, I also re-run the tests
>> after having added a "pr_warn()" -- I know, using printk(), especially
>> when talking to you... but I was not sure what was safe to use at this
>> place in the code :) -- before returning "true" in the new function you
>> added, and we can see that the crash is avoided thanks to the new code:
>>
>> [   27.422518] traps: crash avoided, addr=18446744071882050317
>> [   27.426182] traps: crash avoided, addr=18446744071882050317
>>
>> [  370.483208] traps: crash avoided, addr=18446744071882075656
>> [  370.485066] traps: crash avoided, addr=18446744071882075656
>> [  370.485084] traps: crash avoided, addr=18446744071882075656
>>
>> [  592.866416] traps: crash avoided, addr=18446744071882075656
>> [  592.867937] traps: crash avoided, addr=18446744071882075656
>>
>> [  980.988342] traps: crash avoided, addr=18446744071882050317
>> [  980.989866] traps: crash avoided, addr=18446744071882050317
>>
>> (from my VM running with 2 CPU cores)
>>
>>
>> Again, thank you for the fix!
>>
>> (Just in case you need it:)
>>
>> Tested-by: Matthieu Baerts <matttbe@kernel.org>
> 
> The thing is, the bug is with qemu and *not* the kernel. Masami's patch
> just paper's over the real bug, and worse, if the kernel has a bug
> that's not doing proper synchronization, the patch will keep it from
> being detected. So no, I do not think this is the proper solution.
> 
> The real problem is that qemu does not seem to be honoring the memory
> barriers of an interrupt. The reason the code does the ipi's is to
> force a full memory barrier across all CPUs so that they all see the
> same memory before going forward to the next step.
> 
> My guess is that qemu does not treat the IPI being sent as a memory
> barrier, and then the CPUs do not see a consistent memory view after
> the IPIs are sent. That's a bug in qemu!
> 
> This should be reported to the qemu community and should be fixed
> there. In the mean time, feel free to use Masami's patch in your local
> repo until qemu is fixed, but it should not be added to Linux mainline.

Thank you for the explanation!

For QEmu, I'm currently not using a recent version: v6.2.0, while the
latest one is v8.2.0. I was already suspecting that QEmu could be
responsible for this issue -- we don't have the issue with KVM, only TCG
-- but it looks like it is not that easy to upgrade it: for the CI, we
use virtme, which doesn't support newer versions. We will switch to
virtme-ng, upgrade QEmu to a version that is still supported, try to
reproduce the issue without Masami's patch, and report that to QEmu
community.

For the time being, the CI will run the tests with Masami's patch.

Cheers,
Matt

Matthieu Baerts (NGI0) Jan. 21, 2024, 9:59 p.m. UTC | #10

Hi Steven,

On 21/01/2024 17:20, Matthieu Baerts wrote:
> On 21/01/2024 16:31, Steven Rostedt wrote:

(...)

>> The thing is, the bug is with qemu and *not* the kernel. Masami's patch
>> just paper's over the real bug, and worse, if the kernel has a bug
>> that's not doing proper synchronization, the patch will keep it from
>> being detected. So no, I do not think this is the proper solution.
>>
>> The real problem is that qemu does not seem to be honoring the memory
>> barriers of an interrupt. The reason the code does the ipi's is to
>> force a full memory barrier across all CPUs so that they all see the
>> same memory before going forward to the next step.
>>
>> My guess is that qemu does not treat the IPI being sent as a memory
>> barrier, and then the CPUs do not see a consistent memory view after
>> the IPIs are sent. That's a bug in qemu!
>>
>> This should be reported to the qemu community and should be fixed
>> there. In the mean time, feel free to use Masami's patch in your local
>> repo until qemu is fixed, but it should not be added to Linux mainline.
> 
> Thank you for the explanation!
> 
> For QEmu, I'm currently not using a recent version: v6.2.0, while the
> latest one is v8.2.0. I was already suspecting that QEmu could be
> responsible for this issue -- we don't have the issue with KVM, only TCG
> -- but it looks like it is not that easy to upgrade it: for the CI, we
> use virtme, which doesn't support newer versions. We will switch to
> virtme-ng, upgrade QEmu to a version that is still supported, try to
> reproduce the issue without Masami's patch, and report that to QEmu
> community.

FYI, I managed to upgrade QEmu to 8.2.0, and launch the tests: I was not
able to reproduce this issue.

I guess the bug has already been fixed, I'm sorry for the noise!

Cheers,
Matt

Masami Hiramatsu (Google) Jan. 22, 2024, 1:17 a.m. UTC | #11

On Sun, 21 Jan 2024 17:20:26 +0100
Matthieu Baerts <matttbe@kernel.org> wrote:

> Hi Steven,
> 
> On 21/01/2024 16:31, Steven Rostedt wrote:
> > On Sun, 21 Jan 2024 16:23:35 +0100
> > Matthieu Baerts <matttbe@kernel.org> wrote:
> > 
> >> Hi Masami, Steven,
> >>
> >> On 21/01/2024 10:05, Masami Hiramatsu (Google) wrote:
> >>> On Sun, 21 Jan 2024 11:28:52 +0900
> >>> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> >>>   
> >>>> On Sat, 20 Jan 2024 17:05:17 -0500
> >>>> Steven Rostedt <rostedt@goodmis.org> wrote:
> >>>>  
> >>>>> On Sat, 20 Jan 2024 18:44:38 +0100
> >>>>> Matthieu Baerts <matttbe@kernel.org> wrote:
> >>>>>  
> >>>>>>
> >>>>>> I'm sorry to reply on a patch that is more than one year old, but in  
> >>>>>
> >>>>> No problem, I've done the same.  
> >>>>
> >>>> Yeah, thanks for reporting! I realized the problem.  
> >>
> >> Thank you both for your quick reply, very useful explanations, analysis
> >> and patch!
> >>
> >> (...)
> >>
> >>> So this another solution is already done. I think we need to add the
> >>> ghost INT3 check in the exc_int3() as follows;
> >>>
> >>> Thank you,
> >>>
> >>> From add8cf7da99cdb096a0d6765b3dc5de9a3ea3019 Mon Sep 17 00:00:00 2001
> >>> From: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
> >>> Date: Sun, 21 Jan 2024 17:16:50 +0900
> >>> Subject: [PATCH] x86: Fixup from the removed INT3 if it is unhandled
> >>>
> >>> INT3 is used not only for software breakpoint, but also self modifying
> >>> code on x86 in the kernel. For example, jump_label, function tracer etc.
> >>> Those may not handle INT3 after removing it but not waiting for
> >>> synchronizing CPUs enough. Since such 'ghost' INT3 is not handled by
> >>> anyone because they think it has been removed already.
> >>> Recheck there is INT3 on the exception address and if not, ignore it.
> >>>
> >>> Note that previously kprobes does the same thing by itself, but that is
> >>> not a good location to do that because INT3 is commonly used. Do it at
> >>> the common place so that it can handle all 'ghost' INT3.  
> >>
> >> I just tested it, and I was able to run pings for 3h without any issues!
> >>
> >> While at it, and just to be on the safe side, I also re-run the tests
> >> after having added a "pr_warn()" -- I know, using printk(), especially
> >> when talking to you... but I was not sure what was safe to use at this
> >> place in the code :) -- before returning "true" in the new function you
> >> added, and we can see that the crash is avoided thanks to the new code:
> >>
> >> [   27.422518] traps: crash avoided, addr=18446744071882050317
> >> [   27.426182] traps: crash avoided, addr=18446744071882050317
> >>
> >> [  370.483208] traps: crash avoided, addr=18446744071882075656
> >> [  370.485066] traps: crash avoided, addr=18446744071882075656
> >> [  370.485084] traps: crash avoided, addr=18446744071882075656
> >>
> >> [  592.866416] traps: crash avoided, addr=18446744071882075656
> >> [  592.867937] traps: crash avoided, addr=18446744071882075656
> >>
> >> [  980.988342] traps: crash avoided, addr=18446744071882050317
> >> [  980.989866] traps: crash avoided, addr=18446744071882050317
> >>
> >> (from my VM running with 2 CPU cores)
> >>
> >>
> >> Again, thank you for the fix!
> >>
> >> (Just in case you need it:)
> >>
> >> Tested-by: Matthieu Baerts <matttbe@kernel.org>
> > 
> > The thing is, the bug is with qemu and *not* the kernel. Masami's patch
> > just paper's over the real bug, and worse, if the kernel has a bug
> > that's not doing proper synchronization, the patch will keep it from
> > being detected. So no, I do not think this is the proper solution.
> > 
> > The real problem is that qemu does not seem to be honoring the memory
> > barriers of an interrupt. The reason the code does the ipi's is to
> > force a full memory barrier across all CPUs so that they all see the
> > same memory before going forward to the next step.
> > 
> > My guess is that qemu does not treat the IPI being sent as a memory
> > barrier, and then the CPUs do not see a consistent memory view after
> > the IPIs are sent. That's a bug in qemu!
> > 
> > This should be reported to the qemu community and should be fixed
> > there. In the mean time, feel free to use Masami's patch in your local
> > repo until qemu is fixed, but it should not be added to Linux mainline.
> 
> Thank you for the explanation!
> 
> For QEmu, I'm currently not using a recent version: v6.2.0, while the
> latest one is v8.2.0. I was already suspecting that QEmu could be
> responsible for this issue -- we don't have the issue with KVM, only TCG

Ah, I missed this point. If KVM works well, I agree that this is a qemu
TCG's bug. I guess TCG implementation forgets to serialize CPU when the
IPI comes.

> -- but it looks like it is not that easy to upgrade it: for the CI, we
> use virtme, which doesn't support newer versions. We will switch to
> virtme-ng, upgrade QEmu to a version that is still supported, try to
> reproduce the issue without Masami's patch, and report that to QEmu
> community.

Thanks, and if you need a reference, "Intel SDM Vol3A, 9.1.3 Handling
Self- and Cross-Modifying Code" said that what the other CPU needs to
do is "Execute serializing instruction; (* For example, CPUID
instruction *)" for cross-modifying code. that has been done in
do_sync_core(). Thus this bug should not happen.

Thank you!

> 
> For the time being, the CI will run the tests with Masami's patch.
> 
> Cheers,
> Matt
> -- 
> Sponsored by the NGI0 Core fund.

diff mbox series

Patch

diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index 66299682b6b7..33390ed4dcf3 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -986,20 +986,6 @@  int kprobe_int3_handler(struct pt_regs *regs)
 			kprobe_post_process(p, regs, kcb);
 			return 1;
 		}
-	}
-
-	if (*addr != INT3_INSN_OPCODE) {
-		/*
-		 * The breakpoint instruction was removed right
-		 * after we hit it.  Another cpu has removed
-		 * either a probepoint or a debugger breakpoint
-		 * at this address.  In either case, no further
-		 * handling of this interrupt is appropriate.
-		 * Back up over the (now missing) int3 and run
-		 * the original instruction.
-		 */
-		regs->ip = (unsigned long)addr;
-		return 1;
 	} /* else: not a kprobe fault; let the kernel handle it */
 
 	return 0;