Message ID | 20221024114536.44686c83@gandalf.local.home |
---|---|
State | New |
Headers | From: Steven Rostedt <rostedt@goodmis.org>; To: LKML <linux-kernel@vger.kernel.org>; Cc: Linus Torvalds <torvalds@linux-foundation.org>, Peter Zijlstra <peterz@infradead.org>, Kees Cook <keescook@chromium.org>, Dave Hansen <dave.hansen@intel.com>, Sean Christopherson <seanjc@google.com>; Date: Mon, 24 Oct 2022 11:45:36 -0400; Subject: [PATCH] x86/mm: Do not verify W^X at boot up |
Series | x86/mm: Do not verify W^X at boot up |
Commit Message
Steven Rostedt
Oct. 24, 2022, 3:45 p.m. UTC
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

Adding on the kernel command line "ftrace=function" triggered:

 ------------[ cut here ]------------
 CPA detected W^X violation: 8000000000000063 -> 0000000000000063 range: 0xffffffffc0013000 - 0xffffffffc0013fff PFN 10031b
 WARNING: CPU: 0 PID: 0 at arch/x86/mm/pat/set_memory.c:609 verify_rwx+0x61/0x6d
 Modules linked in:
 CPU: 0 PID: 0 Comm: swapper Not tainted 6.1.0-rc1-test+ #3
 Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
 RIP: 0010:verify_rwx+0x61/0x6d
 Code: e5 01 00 75 27 49 c1 e0 0c 48 89 d1 48 89 fe 48 c7 c7 5b b3 92 84 4e 8d 44 02 ff 48 89 da c6 05 71 29 e5 01 01 e8 35 90 e2 00 <0f> 0b 48 89 d8 5b 5d e9 6f 95 1a 01 0f 1f 44 00 00 55 48 89 e5 53
 RSP: 0000:ffffffff84c03b08 EFLAGS: 00010086
 RAX: 0000000000000000 RBX: 0000000000000063 RCX: 0000000000000003
 RDX: 0000000000000003 RSI: ffffffff84c039b0 RDI: 0000000000000001
 RBP: ffffffff84c03b10 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000025 R12: ffff8e730031c098
 R13: 000000000010031b R14: 800000010031b063 R15: 8000000000000063
 FS:  0000000000000000(0000) GS:ffff8e7416a00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: ffff8e73fd801000 CR3: 00000001fcc22001 CR4: 00000000000606f0
 Call Trace:
  <TASK>
  __change_page_attr_set_clr+0x146/0x8a6
  ? __mutex_unlock_slowpath+0x41/0x213
  ? mutex_unlock+0x12/0x18
  ? _vm_unmap_aliases+0x126/0x136
  change_page_attr_set_clr+0x135/0x268
  ? find_vmap_area+0x32/0x3e
  ? __fentry__+0x10/0x10
  change_page_attr_clear.constprop.0+0x16/0x1c
  set_memory_x+0x2c/0x32
  arch_ftrace_update_trampoline+0x218/0x2db
  ? ftrace_caller_op_ptr+0x17/0x17
  ftrace_update_trampoline+0x16/0xa1
  ? tracing_gen_ctx+0x1c/0x1c
  __register_ftrace_function+0x93/0xb2
  ftrace_startup+0x21/0xf0
  ? tracing_gen_ctx+0x1c/0x1c
  register_ftrace_function_nolock+0x26/0x40
  register_ftrace_function+0x4e/0x143
  ? mutex_unlock+0x12/0x18
  ? tracing_gen_ctx+0x1c/0x1c
  function_trace_init+0x7d/0xc3
  tracer_init+0x23/0x2c
  tracing_set_tracer+0x1d5/0x206
  register_tracer+0x1c0/0x1e4
  init_function_trace+0x90/0x96
  early_trace_init+0x25c/0x352
  start_kernel+0x424/0x6e4
  x86_64_start_reservations+0x24/0x2a
  x86_64_start_kernel+0x8c/0x95
  secondary_startup_64_no_verify+0xe0/0xeb
  </TASK>
 ---[ end trace 0000000000000000 ]---

This is because at boot up, kernel text is writable, and there's no reason
to do tricks to update it. But the verifier does not distinguish updates at
boot up from those at run time, and so triggers a warning at boot.

Add a check for system_state == SYSTEM_BOOTING and allow the update if that
is the case.

Link: https://lore.kernel.org/r/20221024112730.180916b3@gandalf.local.home
Fixes: 652c5bf380ad0 ("x86/mm: Refuse W^X violations")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 arch/x86/mm/pat/set_memory.c | 4 ++++
 1 file changed, 4 insertions(+)
Comments
On 10/24/22 08:45, Steven Rostedt wrote:
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -587,6 +587,10 @@ static inline pgprot_t verify_rwx(pgprot_t old, pgprot_t new, unsigned long star
>  {
>  	unsigned long end;
>
> +	/* Kernel text is rw at boot up */
> +	if (system_state == SYSTEM_BOOTING)
> +		return new;

Hi Steven,

Thanks for the report and the patch. That seems reasonable, but I'm a
bit worried that it opens up a big hole (boot time) when a W+X mapping
could be created *anywhere*.

Could we restrict this bypass to *only* kernel text addresses during
boot? Maybe something like this:

	if ((system_state == SYSTEM_BOOTING) &&
	    __kernel_text_address(start))
		return new;

That would be safe because we know that kernel_text_address() addresses
will be made read-only by the time userspace shows up and that
is_kernel_inittext() addresses will be freed.

Long-term, I wonder if we could teach the early patching code that it
can't just use memcpy().
On Mon, 24 Oct 2022 09:14:45 -0700
Dave Hansen <dave.hansen@intel.com> wrote:

> On 10/24/22 08:45, Steven Rostedt wrote:
> > --- a/arch/x86/mm/pat/set_memory.c
> > +++ b/arch/x86/mm/pat/set_memory.c
> > @@ -587,6 +587,10 @@ static inline pgprot_t verify_rwx(pgprot_t old, pgprot_t new, unsigned long star
> >  {
> >  	unsigned long end;
> >
> > +	/* Kernel text is rw at boot up */
> > +	if (system_state == SYSTEM_BOOTING)
> > +		return new;
>
> Hi Steven,
>
> Thanks for the report and the patch. That seems reasonable, but I'm a
> bit worried that it opens up a big hole (boot time) when a W+X mapping
> could be created *anywhere*.
>
> Could we restrict this bypass to *only* kernel text addresses during
> boot? Maybe something like this:
>
> 	if ((system_state == SYSTEM_BOOTING) &&
> 	    __kernel_text_address(start))
> 		return new;

Actually, that brings back the warning, as ftrace creates a trampoline,
but text_poke() will still use memcpy on it at early boot up.

The trampolines are set to ro at the end of boot up by:

  59566b0b622e3 ("x86/ftrace: Have ftrace trampolines turn read-only at
  the end of system boot up")

Which was added because of text_poke() doing the memcpy().

> That would be safe because we know that kernel_text_address() addresses
> will be made read-only by the time userspace shows up and that
> is_kernel_inittext() addresses will be freed.
>
> Long-term, I wonder if we could teach the early patching code that it
> can't just use memcpy().

Maybe.

-- Steve
On Mon, Oct 24, 2022 at 8:45 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
>
> Adding on the kernel command line "ftrace=function" triggered:
>
> CPA detected W^X violation: 8000000000000063 -> 0000000000000063 range:

Hmm. The cause of this actually seems to be this

        if (likely(system_state != SYSTEM_BOOTING))
                set_memory_ro((unsigned long)trampoline, npages);
        set_memory_x((unsigned long)trampoline, npages);

        return (unsigned long)trampoline;

in create_trampoline().

And that in turn is because of commit 59566b0b622e ("x86/ftrace: Have
ftrace trampolines turn read-only at the end of system boot up"), which in
turn is because of

        if (unlikely(system_state == SYSTEM_BOOTING)) {
                text_poke_early(addr, opcode, len);
                return;
        }

in text_poke_bp().

And that, in turn, is because PeterZ ended up special-casing this all in
commit 768ae4406a5c ("x86/ftrace: Use text_poke()")

Maybe we should just strive to get rid of all these SYSTEM_BOOTING special
cases, instead of adding yet another new one.

There's presumably an "it slows down boot" reason to avoid the full
text_poke_bp() dance, but do we really care for the "ftrace=function"
case?

               Linus
On Mon, 24 Oct 2022 11:19:31 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> in text_poke_bp(). And that, in turn, is because PeterZ ended up
> special-casing this all in commit 768ae4406a5c ("x86/ftrace: Use
> text_poke()")
>
> Maybe we should just strive to get rid of all these SYSTEM_BOOTING
> special cases, instead of adding yet another a new one.
>
> There's presumably "it slows down boot" reason to avoid the full
> text_poke_bp() dance, but do we really care for the "ftrace=function"
> case?

It's not just speed up at boot up. It's because text_poke doesn't work at
early boot up when function tracing is enabled. If I remove the
SYSTEM_BOOTING checks in text_poke() this happens:

[    0.963966] ftrace: allocating 47021 entries in 184 pages
[    0.969376] ftrace section at ffffffff89ef54c0 sorted properly
[    0.982126] ftrace: allocated 184 pages with 4 groups
[    1.009779] Starting tracer 'function'
[    1.013753] BUG: kernel NULL pointer dereference, address: 0000000000000048
[    1.020516] #PF: supervisor read access in kernel mode
[    1.025616] #PF: error_code(0x0000) - not-present page
[    1.030718] PGD 0 P4D 0
[    1.033224] Oops: 0000 [#1] PREEMPT SMP PTI
[    1.037373] CPU: 0 PID: 0 Comm: swapper Not tainted 6.1.0-rc1-test+ #496
[    1.044031] Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
[    1.052934] RIP: 0010:walk_to_pmd+0x17/0x130
[    1.057172] Code: 1f ff ff ff 0f 0b e9 ed fe ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f0 41 55 48 c1 e8 24 41 54 25 f8 0f 00 00 55 53 <48> 03 47 48 0f 84 c4 00 00 00 49 89 fc 48 8b 38 48 89 f3 48 89 c5
[    1.075846] RSP: 0000:ffffffff89603cd8 EFLAGS: 00010046
[    1.081033] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffd0f980000000
[    1.088121] RDX: ffffffff89603d68 RSI: 0000000000000000 RDI: 0000000000000000
[    1.095213] RBP: 0000000000000000 R08: ffffffff8a017cf0 R09: ffffffff8a017cf1
[    1.102302] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff89603d68
[    1.109389] R13: 000000000000031e R14: 000000000000031f R15: 8000000000000063
[    1.116479] FS:  0000000000000000(0000) GS:ffffa068daa00000(0000) knlGS:0000000000000000
[    1.124516] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.130220] CR2: 0000000000000048 CR3: 000000000760a001 CR4: 00000000000606f0
[    1.137309] Call Trace:
[    1.139732]  <TASK>
[    1.141805]  __get_locked_pte+0x1b/0x120
[    1.145694]  ? ftrace_caller_op_ptr+0x17/0x17
[    1.150016]  __text_poke+0xf8/0x540
[    1.153474]  ? patch_retpoline+0x170/0x170
[    1.157536]  ? text_poke_loc_init+0x5b/0x170
[    1.161773]  text_poke_bp_batch+0x86/0x290
[    1.165835]  ? cache_mod+0x280/0x280
[    1.169378]  text_poke_bp+0x3a/0x60
[    1.172836]  ftrace_update_ftrace_func+0x3e/0x80
[    1.177416]  ftrace_modify_all_code+0x79/0x160
[    1.181827]  ftrace_startup+0xbd/0x150
[    1.185544]  ? ftrace_stacktrace_count+0xc0/0xc0
[    1.190128]  register_ftrace_function_nolock+0x20/0x60
[    1.195227]  register_ftrace_function+0xc7/0x130
[    1.199807]  ? ftrace_stacktrace_count+0xc0/0xc0
[    1.204390]  function_trace_init+0x71/0xd0
[    1.208454]  tracing_set_tracer+0x172/0x280
[    1.212603]  register_tracer+0x1e0/0x205
[    1.216495]  tracer_alloc_buffers.isra.0+0x1f8/0x2fe
[    1.221421]  start_kernel+0x21f/0x4e8
[    1.225051]  ? x86_64_start_kernel+0x60/0x78
[    1.229288]  secondary_startup_64_no_verify+0xd3/0xdb
[    1.234301]  </TASK>
[    1.236459] Modules linked in:
[    1.239484] CR2: 0000000000000048
[    1.242769] ---[ end trace 0000000000000000 ]---

Interrupts haven't been enabled yet, so things are still rather fragile at
this point of start up.

-- Steve
On Mon, Oct 24, 2022 at 11:52 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> It's not just speed up at boot up. It's because text_poke doesn't work at
> early boot up when function tracing is enabled. If I remove the
> SYSTEM_BOOTING checks in text_poke() this happens:

Let's just fix that up.

> [    1.013753] BUG: kernel NULL pointer dereference, address: 0000000000000048

This is due to __get_locked_pte:

   0:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
   5:	48 89 f0             	mov    %rsi,%rax
   8:	41 55                	push   %r13
   a:	48 c1 e8 24          	shr    $0x24,%rax
   e:	41 54                	push   %r12
  10:	25 f8 0f 00 00       	and    $0xff8,%eax
  15:	55                   	push   %rbp
  16:	53                   	push   %rbx
  17:*	48 03 47 48          	add    0x48(%rdi),%rax		<-- trapping instruction
  1b:	0f 84 c4 00 00 00    	je     0xe5
  21:	49 89 fc             	mov    %rdi,%r12
  24:	48 8b 38             	mov    (%rax),%rdi
  27:	48 89 f3             	mov    %rsi,%rbx
  2a:	48 89 c5             	mov    %rax,%rbp

and that 'addq' seems to be walk_to_pmd() doing

        pgd = pgd_offset(mm, addr);

with a zero mm (that's 'mm->pgd').

And that, in turn, seems to be due to the absolutely disgusting
'poking_mm' hack.

> Interrupts haven't been enabled yet, so things are still rather fragile at
> this point of start up.

I don't think this has anything to do with interrupts. We do need the page
structures etc to be workable, but all the tracing setup needs that
*anyway*.

I suspect it would be fixed by just moving 'poking_init()' earlier. In many
ways I suspect it would make most sense as part of 'mm_init()', not as a
random call fairly late in start_kernel().

In other words, this all smells like "people added special cases because
they didn't want to hunt down the underlying problem".

And then all these special cases beget other special cases.

               Linus
On Mon, 24 Oct 2022 09:14:45 -0700
Dave Hansen <dave.hansen@intel.com> wrote:

> On 10/24/22 08:45, Steven Rostedt wrote:
> > --- a/arch/x86/mm/pat/set_memory.c
> > +++ b/arch/x86/mm/pat/set_memory.c
> > @@ -587,6 +587,10 @@ static inline pgprot_t verify_rwx(pgprot_t old, pgprot_t new, unsigned long star
> >  {
> >  	unsigned long end;
> >
> > +	/* Kernel text is rw at boot up */
> > +	if (system_state == SYSTEM_BOOTING)
> > +		return new;
>
> Hi Steven,
>
> Thanks for the report and the patch. That seems reasonable, but I'm a
> bit worried that it opens up a big hole (boot time) when a W+X mapping
> could be created *anywhere*.
>
> Could we restrict this bypass to *only* kernel text addresses during
> boot? Maybe something like this:
>
> 	if ((system_state == SYSTEM_BOOTING) &&
> 	    __kernel_text_address(start))
> 		return new;
>
> That would be safe because we know that kernel_text_address() addresses
> will be made read-only by the time userspace shows up and that
> is_kernel_inittext() addresses will be freed.
>
> Long-term, I wonder if we could teach the early patching code that it
> can't just use memcpy().

This is hacky, and I agree with Linus that, ideally, we can get
text_poke() working better and remove all these "special cases", but in
case we struggle to do so, the below patch also works. It only looks at
the one case where ftrace is setting up its trampoline at early boot up.

-- Steve

diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
index 908d99b127d3..41b3ecd23a08 100644
--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -25,6 +25,8 @@
 #ifndef __ASSEMBLY__
 extern void __fentry__(void);
 
+extern long ftrace_updated_trampoline;
+
 static inline unsigned long ftrace_call_adjust(unsigned long addr)
 {
 	/*
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index bd165004776d..e2a1fc7bbe7a 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -417,7 +417,11 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 
 	if (likely(system_state != SYSTEM_BOOTING))
 		set_memory_ro((unsigned long)trampoline, npages);
+	else
+		ftrace_updated_trampoline = trampoline;
 	set_memory_x((unsigned long)trampoline, npages);
+	ftrace_updated_trampoline = 0;
+
 	return (unsigned long)trampoline;
 fail:
 	tramp_free(trampoline);
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 97342c42dda8..3fd3a45cafe8 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -32,6 +32,7 @@
 #include <asm/memtype.h>
 #include <asm/hyperv-tlfs.h>
 #include <asm/mshyperv.h>
+#include <asm/ftrace.h>
 
 #include "../mm_internal.h"
 
@@ -579,6 +580,8 @@ static inline pgprot_t static_protections(pgprot_t prot, unsigned long start,
 	return __pgprot(pgprot_val(prot) & ~forbidden);
 }
 
+long ftrace_updated_trampoline;
+
 /*
  * Validate strict W^X semantics.
  */
@@ -587,6 +590,10 @@ static inline pgprot_t verify_rwx(pgprot_t old, pgprot_t new, unsigned long star
 {
 	unsigned long end;
 
+	/* Kernel text is rw at boot up */
+	if (system_state == SYSTEM_BOOTING && ftrace_updated_trampoline == start)
+		return new;
+
 	/*
 	 * 32-bit has some unfixable W+X issues, like EFI code
 	 * and writeable data being in the same page. Disable
On Mon, 24 Oct 2022 12:08:49 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> And then all these special cases beget other special cases.
Well, I was able to get it to work without these special cases, but it
took a few special cases of another kind to get poking_init() into
mm_init().
To get poking_init() working in mm_init() I had to pull in:
proc_caches_init(), as poking_init() uses some fork code that requires its
caches to be initialized.
Then dup_mm() is called, which uses maple tree code, which required
maple_tree_init() to be there too. (I pulled in radix_tree_init() just to
be consistent). But maple tree code calls kmem_cache_alloc_bulk() which
specifically states:
/* Note that interrupts must be enabled when calling this function. */
and lockdep confirmed it.
So I did some hacking in the maple_tree.c to make that work.
And finally, dup_mm() calls dup_mmap() that calls flush_tlb_mm() for the
old mm, but since this is early boot up, there's really no need for that. I
added some hacks to avoid that.
Thus, I guess you get to choose your poison. Either we have special ftrace
cases in x86 that beget other special cases to keep it working, or we make
text_poke() work early by moving poking_init() into mm_init() and then
creating more generic special cases that beget other special cases (and I
have no idea if this works on other architectures, which could beget more
special cases).
Your call.
-- Steve
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 5cadcea035e0..e240351e0bc1 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1681,11 +1681,6 @@ void __ref text_poke_queue(void *addr, const void *opcode, size_t len, const voi
 {
 	struct text_poke_loc *tp;
 
-	if (unlikely(system_state == SYSTEM_BOOTING)) {
-		text_poke_early(addr, opcode, len);
-		return;
-	}
-
 	text_poke_flush(addr);
 
 	tp = &tp_vec[tp_vec_nr++];
@@ -1707,11 +1702,6 @@ void __ref text_poke_bp(void *addr, const void *opcode, size_t len, const void *
 {
 	struct text_poke_loc tp;
 
-	if (unlikely(system_state == SYSTEM_BOOTING)) {
-		text_poke_early(addr, opcode, len);
-		return;
-	}
-
 	text_poke_loc_init(&tp, addr, opcode, len, emulate);
 	text_poke_bp_batch(&tp, 1);
 }
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index bd165004776d..43628b8480fa 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -415,8 +415,7 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 
 	set_vm_flush_reset_perms(trampoline);
 
-	if (likely(system_state != SYSTEM_BOOTING))
-		set_memory_ro((unsigned long)trampoline, npages);
+	set_memory_ro((unsigned long)trampoline, npages);
 	set_memory_x((unsigned long)trampoline, npages);
 	return (unsigned long)trampoline;
 fail:
diff --git a/init/main.c b/init/main.c
index aa21add5f7c5..e5f4ae2d4cca 100644
--- a/init/main.c
+++ b/init/main.c
@@ -860,6 +860,10 @@ static void __init mm_init(void)
 	/* Should be run after espfix64 is set up. */
 	pti_init();
 	kmsan_init_runtime();
+	proc_caches_init();
+	radix_tree_init();
+	maple_tree_init();
+	poking_init();
 }
 
 #ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
@@ -1011,8 +1015,6 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
 	if (WARN(!irqs_disabled(),
 		 "Interrupts were enabled *very* early, fixing it\n"))
 		local_irq_disable();
-	radix_tree_init();
-	maple_tree_init();
 
 	/*
 	 * Set up housekeeping before setting up workqueues to allow the unbound
@@ -1117,7 +1119,6 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
 	thread_stack_cache_init();
 	cred_init();
 	fork_init();
-	proc_caches_init();
 	uts_ns_init();
 	key_init();
 	security_init();
@@ -1134,7 +1135,6 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
 	taskstats_init_early();
 	delayacct_init();
 
-	poking_init();
 	check_bugs();
 
 	acpi_subsystem_init();
diff --git a/kernel/fork.c b/kernel/fork.c
index 08969f5aa38d..e24fb3ddcf9f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -702,7 +702,8 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	mas_destroy(&mas);
 out:
 	mmap_write_unlock(mm);
-	flush_tlb_mm(oldmm);
+	if (likely(!early_boot_irqs_disabled))
+		flush_tlb_mm(oldmm);
 	mmap_write_unlock(oldmm);
 	dup_userfaultfd_complete(&uf);
 fail_uprobe_end:
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index e1743803c851..6fc72ca62c7d 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -1253,7 +1253,12 @@ static inline void mas_alloc_nodes(struct ma_state *mas, gfp_t gfp)
 	}
 
 	max_req = min(requested, max_req);
-	count = mt_alloc_bulk(gfp, max_req, slots);
+	if (unlikely(early_boot_irqs_disabled)) {
+		slots[0] = mt_alloc_one(gfp | GFP_ATOMIC);
+		count = slots[0] ? 1 : 0;
+	} else {
+		count = mt_alloc_bulk(gfp, max_req, slots);
+	}
 	if (!count)
 		goto nomem_bulk;
On Mon, Oct 24, 2022 at 12:08:49PM -0700, Linus Torvalds wrote:

> I suspect it would be fixed by just moving 'poking_init()' earlier. In
> many ways I suspect it would make most sense as part of 'mm_init()',
> not as a random call fairly late in start_kernel().

dup_mm() doesn't work until after proc_caches_init() at the very least.

Let me see if I can untangle some of this..
On Tue, Oct 25, 2022 at 11:39:39AM +0200, Peter Zijlstra wrote:
> On Mon, Oct 24, 2022 at 12:08:49PM -0700, Linus Torvalds wrote:
> > I suspect it would be fixed by just moving 'poking_init()' earlier. In
> > many ways I suspect it would make most sense as part of 'mm_init()',
> > not as a random call fairly late in start_kernel().
>
> dup_mm() doesn't work until after proc_caches_init() at the very least.
>
> Let me see if I can untangle some of this..

This seems to boot...

---
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 19221d77dc27..ac341df0e22c 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1756,11 +1756,6 @@ void __ref text_poke_queue(void *addr, const void *opcode, size_t len, const voi
 {
 	struct text_poke_loc *tp;
 
-	if (unlikely(system_state == SYSTEM_BOOTING)) {
-		text_poke_early(addr, opcode, len);
-		return;
-	}
-
 	text_poke_flush(addr);
 
 	tp = &tp_vec[tp_vec_nr++];
@@ -1782,11 +1777,6 @@ void __ref text_poke_bp(void *addr, const void *opcode, size_t len, const void *
 {
 	struct text_poke_loc tp;
 
-	if (unlikely(system_state == SYSTEM_BOOTING)) {
-		text_poke_early(addr, opcode, len);
-		return;
-	}
-
 	text_poke_loc_init(&tp, addr, opcode, len, emulate);
 	text_poke_bp_batch(&tp, 1);
 }
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index cf15ef5aecff..7ea412f7b9da 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -421,8 +421,7 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 	/* ALLOC_TRAMP flags lets us know we created it */
 	ops->flags |= FTRACE_OPS_FL_ALLOC_TRAMP;
 
-	if (likely(system_state != SYSTEM_BOOTING))
-		set_memory_ro((unsigned long)trampoline, npages);
+	set_memory_ro((unsigned long)trampoline, npages);
 	set_memory_x((unsigned long)trampoline, npages);
 	return (unsigned long)trampoline;
 fail:
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 9121bc1b9453..d18c45e5d6d7 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -792,6 +792,8 @@ void __init init_mem_mapping(void)
 	early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
 }
 
+static struct mm_struct __poking_mm;
+
 /*
  * Initialize an mm_struct to be used during poking and a pointer to be used
  * during patching.
@@ -801,8 +803,9 @@ void __init poking_init(void)
 	spinlock_t *ptl;
 	pte_t *ptep;
 
-	poking_mm = copy_init_mm();
-	BUG_ON(!poking_mm);
+	__poking_mm = init_mm;
+	mm_init(&__poking_mm, NULL, __poking_mm.user_ns);
+	poking_mm = &__poking_mm;
 
 	/*
 	 * Randomize the poking address, but make sure that the following page
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index d6c48163c6de..8b099a70f291 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -90,7 +90,7 @@ extern void exit_itimers(struct task_struct *);
 extern pid_t kernel_clone(struct kernel_clone_args *kargs);
 struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
 struct task_struct *fork_idle(int);
-struct mm_struct *copy_init_mm(void);
+struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, struct user_namespace *user_ns);
 extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
 extern pid_t user_mode_thread(int (*fn)(void *), void *arg, unsigned long flags);
 extern long kernel_wait4(pid_t, int __user *, int, struct rusage *);
diff --git a/init/main.c b/init/main.c
index aa21add5f7c5..da5f1c1afc12 100644
--- a/init/main.c
+++ b/init/main.c
@@ -995,6 +995,7 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
 	sort_main_extable();
 	trap_init();
 	mm_init();
+	poking_init();
 
 	ftrace_init();
 
@@ -1134,7 +1135,6 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
 	taskstats_init_early();
 	delayacct_init();
 
-	poking_init();
 	check_bugs();
 
 	acpi_subsystem_init();
diff --git a/kernel/fork.c b/kernel/fork.c
index 08969f5aa38d..7a3e8819d95a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1104,7 +1104,7 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
 #endif
 }
 
-static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
+struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	struct user_namespace *user_ns)
 {
 	mt_init_flags(&mm->mm_mt, MM_MT_FLAGS);
@@ -2592,11 +2592,6 @@ struct task_struct * __init fork_idle(int cpu)
 	return task;
 }
 
-struct mm_struct *copy_init_mm(void)
-{
-	return dup_mm(NULL, &init_mm);
-}
-
 /*
  * This is like kernel_clone(), but shaved down and tailored to just
  * creating io_uring workers. It returns a created task, or an error pointer.
On Tue, Oct 25, 2022 at 3:16 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> This seems to boot...

This looks much better, thanks.

But this:

> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> @@ -801,8 +803,9 @@ void __init poking_init(void)
>  	spinlock_t *ptl;
>  	pte_t *ptep;
>
> -	poking_mm = copy_init_mm();
> -	BUG_ON(!poking_mm);
> +	__poking_mm = init_mm;
> +	mm_init(&__poking_mm, NULL, __poking_mm.user_ns);
> +	poking_mm = &__poking_mm;

Should probably be just

        poking_mm = mm_alloc();

because we shouldn't be messing with 'mm_init()' in places like this, and
we shouldn't be exporting it either:

> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1104,7 +1104,7 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
>  #endif
>  }
>
> -static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> +struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,

since "mm_alloc()" would seem to be exactly what we need, and it's what
execve() already uses. It creates a new VM with nothing attached, ready to
be populated.

Or is there some reason why mm_alloc() doesn't work? It does require
"current" and "current->user_ns" to be set up, but that should be true
even during _very_ early boot.

               Linus
On Tue, Oct 25, 2022 at 09:53:27AM -0700, Linus Torvalds wrote:
> On Tue, Oct 25, 2022 at 3:16 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > This seems to boot...
>
> This looks much better, thanks.
>
> But this:
>
> > diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> > @@ -801,8 +803,9 @@ void __init poking_init(void)
> >  	spinlock_t *ptl;
> >  	pte_t *ptep;
> >
> > -	poking_mm = copy_init_mm();
> > -	BUG_ON(!poking_mm);
> > +	__poking_mm = init_mm;
> > +	mm_init(&__poking_mm, NULL, __poking_mm.user_ns);
> > +	poking_mm = &__poking_mm;
>
> Should probably be just
>
>         poking_mm = mm_alloc();
>
> because we shouldn't be messing with 'mm_init()' in places like this,
> and we shouldn't be exporting it either:

mm_alloc() uses allocate_mm() which requires a kmem_cache to be set-up.
Using the static storage and instead calling mm_init() on it avoids that.

So I think we can have:

	static struct mm_struct __poking_mm;

	mm_init(&__poking_mm, NULL, init_mm.user_ns);

and leave out the assignment from init_mm.
On Tue, Oct 25, 2022 at 10:48 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> mm_alloc() uses allocate_mm() which requires a kmem_cache to be set-up.

Well, that seems to be just a strange effect of mm_cachep being set up by
the oddly named "proc_caches_init" (I say oddly named because these days I
associate "proc" with proc-fs, but I think it actually comes from
"process").

That would actually probably make more sense if it was part of mm_init(),
much earlier (where we do "kmem_cache_init()").

So this is another oddity in how we do "mm_init()", but we haven't
actually initialized _that_ part of the mm setup.

Extra bonus points for another strange thing: we have "fork_init()", but
that too doesn't actually initialize the mm_cachep that fork() actually
uses. It does initialize the process one (task_struct_cachep). So that
kind of makes sense, but yeah, the mm_alloc() cachep should have been set
up by mm_init().

I think this is all "we just ended up randomly initializing things due to
hysterical raisins".

               Linus
On Tue, Oct 25, 2022 at 11:14:07AM -0700, Linus Torvalds wrote:
> On Tue, Oct 25, 2022 at 10:48 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > mm_alloc() uses allocate_mm() which requires a kmem_cache to be set-up.
>
> Well, that seems to be just a strange effect of mm_cachep being set up
> by the oddly named "proc_caches_init" (I say oddly named because these
> days I associate "proc" with proc-fs, but I think it actually comes
> from "process").
>
> That would actually probably make more sense if it was part of
> mm_init(), much earlier (where we do "kmem_cache_init()")
>
> So this is another oddity in how we do "mm_init()", but we haven't
> actually initialized _that_ part of the mm setup.
>
> Extra bonus points for another strange thing: we have "fork_init()",
> but that too doesn't actually initialize the mm_cachep that fork()
> actually uses. It does initialize the process one
> (task_struct_cachep). So that kind of makes sense, but yeah, the
> mm_alloc() cachep should have been set up by mm_init.
>
> I think this is all "we just ended up randomly initializing things due
> to hysterical raisins"

OK, I'll go make all that happen.
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 97342c42dda8..2e5a045731de 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -587,6 +587,10 @@ static inline pgprot_t verify_rwx(pgprot_t old, pgprot_t new, unsigned long star
 {
 	unsigned long end;
 
+	/* Kernel text is rw at boot up */
+	if (system_state == SYSTEM_BOOTING)
+		return new;
+
 	/*
 	 * 32-bit has some unfixable W+X issues, like EFI code
 	 * and writeable data being in the same page. Disable