x86/mm: Do not verify W^X at boot up

Message ID 20221024114536.44686c83@gandalf.local.home
State New
Series x86/mm: Do not verify W^X at boot up

Commit Message

Steven Rostedt Oct. 24, 2022, 3:45 p.m. UTC
  From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

Adding "ftrace=function" to the kernel command line triggered:

------------[ cut here ]------------
CPA detected W^X violation: 8000000000000063 -> 0000000000000063 range:
0xffffffffc0013000 - 0xffffffffc0013fff PFN 10031b
WARNING: CPU: 0 PID: 0 at arch/x86/mm/pat/set_memory.c:609
verify_rwx+0x61/0x6d
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 6.1.0-rc1-test+ #3
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
RIP: 0010:verify_rwx+0x61/0x6d
Code: e5 01 00 75 27 49 c1 e0 0c 48 89 d1 48 89 fe 48 c7 c7 5b b3 92 84 4e
8d 44 02 ff 48 89 da c6 05 71 29 e5 01 01 e8 35 90 e2 00 <0f> 0b 48 89 d8
5b 5d e9 6f 95 1a 01 0f 1f 44 00 00 55 48 89 e5 53
RSP: 0000:ffffffff84c03b08 EFLAGS: 00010086
RAX: 0000000000000000 RBX: 0000000000000063 RCX: 0000000000000003
RDX: 0000000000000003 RSI: ffffffff84c039b0 RDI: 0000000000000001
RBP: ffffffff84c03b10 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000025 R12: ffff8e730031c098
R13: 000000000010031b R14: 800000010031b063 R15: 8000000000000063
FS:  0000000000000000(0000) GS:ffff8e7416a00000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff8e73fd801000 CR3: 00000001fcc22001 CR4: 00000000000606f0
Call Trace:
 <TASK>
 __change_page_attr_set_clr+0x146/0x8a6
 ? __mutex_unlock_slowpath+0x41/0x213
 ? mutex_unlock+0x12/0x18
 ? _vm_unmap_aliases+0x126/0x136
 change_page_attr_set_clr+0x135/0x268
 ? find_vmap_area+0x32/0x3e
 ? __fentry__+0x10/0x10
 change_page_attr_clear.constprop.0+0x16/0x1c
 set_memory_x+0x2c/0x32
 arch_ftrace_update_trampoline+0x218/0x2db
 ? ftrace_caller_op_ptr+0x17/0x17
 ftrace_update_trampoline+0x16/0xa1
 ? tracing_gen_ctx+0x1c/0x1c
 __register_ftrace_function+0x93/0xb2
 ftrace_startup+0x21/0xf0
 ? tracing_gen_ctx+0x1c/0x1c
 register_ftrace_function_nolock+0x26/0x40
 register_ftrace_function+0x4e/0x143
 ? mutex_unlock+0x12/0x18
 ? tracing_gen_ctx+0x1c/0x1c
 function_trace_init+0x7d/0xc3
 tracer_init+0x23/0x2c
 tracing_set_tracer+0x1d5/0x206
 register_tracer+0x1c0/0x1e4
 init_function_trace+0x90/0x96
 early_trace_init+0x25c/0x352
 start_kernel+0x424/0x6e4
 x86_64_start_reservations+0x24/0x2a
 x86_64_start_kernel+0x8c/0x95
 secondary_startup_64_no_verify+0xe0/0xeb
 </TASK>
---[ end trace 0000000000000000 ]---

This is because at boot up, kernel text is writable, and there's no reason
to do tricks to update it. But the verifier does not distinguish between
updates at boot up and at run time, and triggers a warning at boot.

Add a check for system_state == SYSTEM_BOOTING and allow the update if that
is the case.

Link: https://lore.kernel.org/r/20221024112730.180916b3@gandalf.local.home

Fixes: 652c5bf380ad0 ("x86/mm: Refuse W^X violations")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 arch/x86/mm/pat/set_memory.c | 4 ++++
 1 file changed, 4 insertions(+)
  

Comments

Dave Hansen Oct. 24, 2022, 4:14 p.m. UTC | #1
On 10/24/22 08:45, Steven Rostedt wrote:
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -587,6 +587,10 @@ static inline pgprot_t verify_rwx(pgprot_t old, pgprot_t new, unsigned long star
>  {
>  	unsigned long end;
>  
> +	/* Kernel text is rw at boot up */
> +	if (system_state == SYSTEM_BOOTING)
> +		return new;

Hi Steven,

Thanks for the report and the patch.  That seems reasonable, but I'm a
bit worried that it opens up a big hole (boot time) when a W+X mapping
could be created *anywhere*.

Could we restrict this bypass to *only* kernel text addresses during
boot?  Maybe something like this:

	if ((system_state == SYSTEM_BOOTING) &&
	    __kernel_text_address(start))
		return new;

That would be safe because we know that kernel_text_address() addresses
will be made read-only by the time userspace shows up and that
is_kernel_inittext() addresses will be freed.

Long-term, I wonder if we could teach the early patching code that it
can't just use memcpy().
  
Steven Rostedt Oct. 24, 2022, 6:13 p.m. UTC | #2
On Mon, 24 Oct 2022 09:14:45 -0700
Dave Hansen <dave.hansen@intel.com> wrote:

> On 10/24/22 08:45, Steven Rostedt wrote:
> > --- a/arch/x86/mm/pat/set_memory.c
> > +++ b/arch/x86/mm/pat/set_memory.c
> > @@ -587,6 +587,10 @@ static inline pgprot_t verify_rwx(pgprot_t old, pgprot_t new, unsigned long star
> >  {
> >  	unsigned long end;
> >  
> > +	/* Kernel text is rw at boot up */
> > +	if (system_state == SYSTEM_BOOTING)
> > +		return new;  
> 
> Hi Steven,
> 
> Thanks for the report and the patch.  That seems reasonable, but I'm a
> bit worried that it opens up a big hole (boot time) when a W+X mapping
> could be created *anywhere*.
> 
> Could we restrict this bypass to *only* kernel text addresses during
> boot?  Maybe something like this:
> 
> 	if ((system_state == SYSTEM_BOOTING) &&
> 	    __kernel_text_address(start))
> 		return new;

Actually, that brings back the warning, as ftrace creates a trampoline, but
text_poke() will still use memcpy on it at early boot up.

The trampolines are set to ro at the end of boot up by:

59566b0b622e3 ("x86/ftrace: Have ftrace trampolines turn read-only at the end of system boot up")

Which was added because of text_poke() doing the memcpy().

> 
> That would be safe because we know that kernel_text_address() addresses
> will be made read-only by the time userspace shows up and that
> is_kernel_inittext() addresses will be freed.
> 
> Long-term, I wonder if we could teach the early patching code that it
> can't just use memcpy().
> 

Maybe.

-- Steve
  
Linus Torvalds Oct. 24, 2022, 6:19 p.m. UTC | #3
On Mon, Oct 24, 2022 at 8:45 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
>
> Adding on the kernel command line "ftrace=function" triggered:
>
> CPA detected W^X violation: 8000000000000063 -> 0000000000000063 range:

Hmm.

The cause of this actually seems to be this

        if (likely(system_state != SYSTEM_BOOTING))
                set_memory_ro((unsigned long)trampoline, npages);
        set_memory_x((unsigned long)trampoline, npages);
        return (unsigned long)trampoline;

in create_trampoline().

And that in turn is because of commit 59566b0b622e ("x86/ftrace: Have
ftrace trampolines turn read-only at the end of system boot up"),
which in turn is because of


        if (unlikely(system_state == SYSTEM_BOOTING)) {
                text_poke_early(addr, opcode, len);
                return;
        }

in text_poke_bp(). And that, in turn, is because PeterZ ended up
special-casing this all in commit 768ae4406a5c ("x86/ftrace: Use
text_poke()")

Maybe we should just strive to get rid of all these SYSTEM_BOOTING
special cases, instead of adding yet another a new one.

There's presumably "it slows down boot" reason to avoid the full
text_poke_bp() dance, but do we really care for the "ftrace=function"
case?

                  Linus
  
Steven Rostedt Oct. 24, 2022, 6:52 p.m. UTC | #4
On Mon, 24 Oct 2022 11:19:31 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:


> in text_poke_bp(). And that, in turn, is because PeterZ ended up
> special-casing this all in commit 768ae4406a5c ("x86/ftrace: Use
> text_poke()")
> 
> Maybe we should just strive to get rid of all these SYSTEM_BOOTING
> special cases, instead of adding yet another a new one.
> 
> There's presumably "it slows down boot" reason to avoid the full
> text_poke_bp() dance, but do we really care for the "ftrace=function"
> case?
> 

It's not just a speed-up at boot up. It's because text_poke() doesn't work
at early boot up when function tracing is enabled. If I remove the
SYSTEM_BOOTING checks in text_poke(), this happens:

[    0.963966] ftrace: allocating 47021 entries in 184 pages
[    0.969376] ftrace section at ffffffff89ef54c0 sorted properly
[    0.982126] ftrace: allocated 184 pages with 4 groups
[    1.009779] Starting tracer 'function'
[    1.013753] BUG: kernel NULL pointer dereference, address: 0000000000000048
[    1.020516] #PF: supervisor read access in kernel mode
[    1.025616] #PF: error_code(0x0000) - not-present page
[    1.030718] PGD 0 P4D 0 
[    1.033224] Oops: 0000 [#1] PREEMPT SMP PTI
[    1.037373] CPU: 0 PID: 0 Comm: swapper Not tainted 6.1.0-rc1-test+ #496
[    1.044031] Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
[    1.052934] RIP: 0010:walk_to_pmd+0x17/0x130
[    1.057172] Code: 1f ff ff ff 0f 0b e9 ed fe ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f0 41 55 48 c1 e8 24 41 54 25 f8 0f 00 00 55 53 <48> 03 47 48 0f 84 c4 00 00 00 49 89 fc 48 8b 38 48 89 f3 48 89 c5
[    1.075846] RSP: 0000:ffffffff89603cd8 EFLAGS: 00010046
[    1.081033] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffd0f980000000
[    1.088121] RDX: ffffffff89603d68 RSI: 0000000000000000 RDI: 0000000000000000
[    1.095213] RBP: 0000000000000000 R08: ffffffff8a017cf0 R09: ffffffff8a017cf1
[    1.102302] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff89603d68
[    1.109389] R13: 000000000000031e R14: 000000000000031f R15: 8000000000000063
[    1.116479] FS:  0000000000000000(0000) GS:ffffa068daa00000(0000) knlGS:0000000000000000
[    1.124516] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.130220] CR2: 0000000000000048 CR3: 000000000760a001 CR4: 00000000000606f0
[    1.137309] Call Trace:
[    1.139732]  <TASK>
[    1.141805]  __get_locked_pte+0x1b/0x120
[    1.145694]  ? ftrace_caller_op_ptr+0x17/0x17
[    1.150016]  __text_poke+0xf8/0x540
[    1.153474]  ? patch_retpoline+0x170/0x170
[    1.157536]  ? text_poke_loc_init+0x5b/0x170
[    1.161773]  text_poke_bp_batch+0x86/0x290
[    1.165835]  ? cache_mod+0x280/0x280
[    1.169378]  text_poke_bp+0x3a/0x60
[    1.172836]  ftrace_update_ftrace_func+0x3e/0x80
[    1.177416]  ftrace_modify_all_code+0x79/0x160
[    1.181827]  ftrace_startup+0xbd/0x150
[    1.185544]  ? ftrace_stacktrace_count+0xc0/0xc0
[    1.190128]  register_ftrace_function_nolock+0x20/0x60
[    1.195227]  register_ftrace_function+0xc7/0x130
[    1.199807]  ? ftrace_stacktrace_count+0xc0/0xc0
[    1.204390]  function_trace_init+0x71/0xd0
[    1.208454]  tracing_set_tracer+0x172/0x280
[    1.212603]  register_tracer+0x1e0/0x205
[    1.216495]  tracer_alloc_buffers.isra.0+0x1f8/0x2fe
[    1.221421]  start_kernel+0x21f/0x4e8
[    1.225051]  ? x86_64_start_kernel+0x60/0x78
[    1.229288]  secondary_startup_64_no_verify+0xd3/0xdb
[    1.234301]  </TASK>
[    1.236459] Modules linked in:
[    1.239484] CR2: 0000000000000048
[    1.242769] ---[ end trace 0000000000000000 ]---

Interrupts haven't been enabled yet, so things are still rather fragile at
this point of start up.

-- Steve
  
Linus Torvalds Oct. 24, 2022, 7:08 p.m. UTC | #5
On Mon, Oct 24, 2022 at 11:52 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> It's not just speed up at boot up. It's because text_poke doesn't work at
> early boot up when function tracing is enabled. If I remove the
> SYSTEM_BOOTING checks in text_poke() this happens:

Let's just fix that up.

>
> [    1.013753] BUG: kernel NULL pointer dereference, address: 0000000000000048

This is due to

__get_locked_pte:
   0: 0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
   5: 48 89 f0              mov    %rsi,%rax
   8: 41 55                push   %r13
   a: 48 c1 e8 24          shr    $0x24,%rax
   e: 41 54                push   %r12
  10: 25 f8 0f 00 00        and    $0xff8,%eax
  15: 55                    push   %rbp
  16: 53                    push   %rbx
  17:* 48 03 47 48          add    0x48(%rdi),%rax <-- trapping instruction
  1b: 0f 84 c4 00 00 00    je     0xe5
  21: 49 89 fc              mov    %rdi,%r12
  24: 48 8b 38              mov    (%rax),%rdi
  27: 48 89 f3              mov    %rsi,%rbx
  2a: 48 89 c5              mov    %rax,%rbp

and that 'addq' seems to be walk_to_pmd() doing

        pgd = pgd_offset(mm, addr);

with a zero mm (that's 'mm->pgd').

And that, in turn, seems to be due to the absolutely disgusting 'poking_mm' hack.

> Interrupts haven't been enabled yet, so things are still rather fragile at
> this point of start up.

I don't think this has anything to do with interrupts. We do need the
page structures etc to be workable, but all the tracing setup needs
that *anyway*.

I suspect it would be fixed by just moving 'poking_init()' earlier. In
many ways I suspect it would make most sense as part of 'mm_init()',
not as a random call fairly late in start_kernel().

In other words, this all smells like "people added special cases
because they didn't want to hunt down the underlying problem".

And then all these special cases beget other special cases.

                 Linus
  
Steven Rostedt Oct. 24, 2022, 7:26 p.m. UTC | #6
On Mon, 24 Oct 2022 09:14:45 -0700
Dave Hansen <dave.hansen@intel.com> wrote:

> On 10/24/22 08:45, Steven Rostedt wrote:
> > --- a/arch/x86/mm/pat/set_memory.c
> > +++ b/arch/x86/mm/pat/set_memory.c
> > @@ -587,6 +587,10 @@ static inline pgprot_t verify_rwx(pgprot_t old, pgprot_t new, unsigned long star
> >  {
> >  	unsigned long end;
> >  
> > +	/* Kernel text is rw at boot up */
> > +	if (system_state == SYSTEM_BOOTING)
> > +		return new;  
> 
> Hi Steven,
> 
> Thanks for the report and the patch.  That seems reasonable, but I'm a
> bit worried that it opens up a big hole (boot time) when a W+X mapping
> could be created *anywhere*.
> 
> Could we restrict this bypass to *only* kernel text addresses during
> boot?  Maybe something like this:
> 
> 	if ((system_state == SYSTEM_BOOTING) &&
> 	    __kernel_text_address(start))
> 		return new;
> 
> That would be safe because we know that kernel_text_address() addresses
> will be made read-only by the time userspace shows up and that
> is_kernel_inittext() addresses will be freed.
> 
> Long-term, I wonder if we could teach the early patching code that it
> can't just use memcpy().
> 

This is hacky, and I agree with Linus that, ideally, we should get
text_poke() working better and remove all these "special cases"; but in
case we struggle to do so, the below patch also works.

It only looks at the one case that ftrace is setting up its trampoline at
early boot up.

-- Steve


diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
index 908d99b127d3..41b3ecd23a08 100644
--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -25,6 +25,8 @@
 #ifndef __ASSEMBLY__
 extern void __fentry__(void);
 
+extern long ftrace_updated_trampoline;
+
 static inline unsigned long ftrace_call_adjust(unsigned long addr)
 {
 	/*
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index bd165004776d..e2a1fc7bbe7a 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -417,7 +417,11 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 
 	if (likely(system_state != SYSTEM_BOOTING))
 		set_memory_ro((unsigned long)trampoline, npages);
+	else
+		ftrace_updated_trampoline = trampoline;
 	set_memory_x((unsigned long)trampoline, npages);
+	ftrace_updated_trampoline = 0;
+
 	return (unsigned long)trampoline;
 fail:
 	tramp_free(trampoline);
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 97342c42dda8..3fd3a45cafe8 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -32,6 +32,7 @@
 #include <asm/memtype.h>
 #include <asm/hyperv-tlfs.h>
 #include <asm/mshyperv.h>
+#include <asm/ftrace.h>
 
 #include "../mm_internal.h"
 
@@ -579,6 +580,8 @@ static inline pgprot_t static_protections(pgprot_t prot, unsigned long start,
 	return __pgprot(pgprot_val(prot) & ~forbidden);
 }
 
+long ftrace_updated_trampoline;
+
 /*
  * Validate strict W^X semantics.
  */
@@ -587,6 +590,10 @@ static inline pgprot_t verify_rwx(pgprot_t old, pgprot_t new, unsigned long star
 {
 	unsigned long end;
 
+	/* Kernel text is rw at boot up */
+	if (system_state == SYSTEM_BOOTING && ftrace_updated_trampoline == start)
+		return new;
+
 	/*
 	 * 32-bit has some unfixable W+X issues, like EFI code
 	 * and writeable data being in the same page.  Disable
  
Steven Rostedt Oct. 24, 2022, 10:04 p.m. UTC | #7
On Mon, 24 Oct 2022 12:08:49 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> And then all these special cases beget other special cases.

Well, I was able to get it to work without these special cases, but it
took another kind of special case to get poking_init() into mm_init().

To get poking_init() working in mm_init() I had to pull in:

proc_caches_init(), as poking_init() uses some fork code that requires its
caches to be initialized.

Then dup_mm() is called, which uses maple tree code, which required
maple_tree_init() to be there too. (I pulled in radix_tree_init() just to
be consistent). But maple tree code calls kmem_cache_alloc_bulk() which
specifically states:

  /* Note that interrupts must be enabled when calling this function. */

and lockdep confirmed it.

So I did some hacking in the maple_tree.c to make that work.

And finally, dup_mm() calls dup_mmap(), which calls flush_tlb_mm() for the
old mm; but since this is early boot up, there's really no need for that,
so I added some hacks to avoid it.

Thus, I guess you get to choose your poison. Either we have special ftrace
cases in x86 that beget other special cases to keep it working, or we make
text_poke() work early by moving poking_init() into mm_init() and then
creating more generic special cases that beget other special cases (and I
have no idea if this works on other architectures, which could beget more
special cases).

Your call.

-- Steve

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 5cadcea035e0..e240351e0bc1 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1681,11 +1681,6 @@ void __ref text_poke_queue(void *addr, const void *opcode, size_t len, const voi
 {
 	struct text_poke_loc *tp;
 
-	if (unlikely(system_state == SYSTEM_BOOTING)) {
-		text_poke_early(addr, opcode, len);
-		return;
-	}
-
 	text_poke_flush(addr);
 
 	tp = &tp_vec[tp_vec_nr++];
@@ -1707,11 +1702,6 @@ void __ref text_poke_bp(void *addr, const void *opcode, size_t len, const void *
 {
 	struct text_poke_loc tp;
 
-	if (unlikely(system_state == SYSTEM_BOOTING)) {
-		text_poke_early(addr, opcode, len);
-		return;
-	}
-
 	text_poke_loc_init(&tp, addr, opcode, len, emulate);
 	text_poke_bp_batch(&tp, 1);
 }
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index bd165004776d..43628b8480fa 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -415,8 +415,7 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 
 	set_vm_flush_reset_perms(trampoline);
 
-	if (likely(system_state != SYSTEM_BOOTING))
-		set_memory_ro((unsigned long)trampoline, npages);
+	set_memory_ro((unsigned long)trampoline, npages);
 	set_memory_x((unsigned long)trampoline, npages);
 	return (unsigned long)trampoline;
 fail:
diff --git a/init/main.c b/init/main.c
index aa21add5f7c5..e5f4ae2d4cca 100644
--- a/init/main.c
+++ b/init/main.c
@@ -860,6 +860,10 @@ static void __init mm_init(void)
 	/* Should be run after espfix64 is set up. */
 	pti_init();
 	kmsan_init_runtime();
+	proc_caches_init();
+	radix_tree_init();
+	maple_tree_init();
+	poking_init();
 }
 
 #ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
@@ -1011,8 +1015,6 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
 	if (WARN(!irqs_disabled(),
 		 "Interrupts were enabled *very* early, fixing it\n"))
 		local_irq_disable();
-	radix_tree_init();
-	maple_tree_init();
 
 	/*
 	 * Set up housekeeping before setting up workqueues to allow the unbound
@@ -1117,7 +1119,6 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
 	thread_stack_cache_init();
 	cred_init();
 	fork_init();
-	proc_caches_init();
 	uts_ns_init();
 	key_init();
 	security_init();
@@ -1134,7 +1135,6 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
 	taskstats_init_early();
 	delayacct_init();
 
-	poking_init();
 	check_bugs();
 
 	acpi_subsystem_init();
diff --git a/kernel/fork.c b/kernel/fork.c
index 08969f5aa38d..e24fb3ddcf9f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -702,7 +702,8 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	mas_destroy(&mas);
 out:
 	mmap_write_unlock(mm);
-	flush_tlb_mm(oldmm);
+	if (likely(!early_boot_irqs_disabled))
+		flush_tlb_mm(oldmm);
 	mmap_write_unlock(oldmm);
 	dup_userfaultfd_complete(&uf);
 fail_uprobe_end:
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index e1743803c851..6fc72ca62c7d 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -1253,7 +1253,12 @@ static inline void mas_alloc_nodes(struct ma_state *mas, gfp_t gfp)
 		}
 
 		max_req = min(requested, max_req);
-		count = mt_alloc_bulk(gfp, max_req, slots);
+		if (unlikely(early_boot_irqs_disabled)) {
+			slots[0] = mt_alloc_one(gfp | GFP_ATOMIC);
+			count = slots[0] ? 1 : 0;
+		} else {
+			count = mt_alloc_bulk(gfp, max_req, slots);
+		}
 		if (!count)
 			goto nomem_bulk;
  
Peter Zijlstra Oct. 25, 2022, 9:39 a.m. UTC | #8
On Mon, Oct 24, 2022 at 12:08:49PM -0700, Linus Torvalds wrote:
> I suspect it would be fixed by just moving 'poking_init()' earlier. In
> many ways I suspect it would make most sense as part of 'mm_init()',
> not as a random call fairly late in start_kernel().

dup_mm() doesn't work until after proc_caches_init() at the very least.

Let me see if I can untangle some of this..
  
Peter Zijlstra Oct. 25, 2022, 10:16 a.m. UTC | #9
On Tue, Oct 25, 2022 at 11:39:39AM +0200, Peter Zijlstra wrote:
> On Mon, Oct 24, 2022 at 12:08:49PM -0700, Linus Torvalds wrote:
> > I suspect it would be fixed by just moving 'poking_init()' earlier. In
> > many ways I suspect it would make most sense as part of 'mm_init()',
> > not as a random call fairly late in start_kernel().
> 
> dup_mm() doesn't work until after proc_caches_init() at the very least.
> 
> Let me see if I can untangle some of this..

This seems to boot...

---
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 19221d77dc27..ac341df0e22c 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1756,11 +1756,6 @@ void __ref text_poke_queue(void *addr, const void *opcode, size_t len, const voi
 {
 	struct text_poke_loc *tp;
 
-	if (unlikely(system_state == SYSTEM_BOOTING)) {
-		text_poke_early(addr, opcode, len);
-		return;
-	}
-
 	text_poke_flush(addr);
 
 	tp = &tp_vec[tp_vec_nr++];
@@ -1782,11 +1777,6 @@ void __ref text_poke_bp(void *addr, const void *opcode, size_t len, const void *
 {
 	struct text_poke_loc tp;
 
-	if (unlikely(system_state == SYSTEM_BOOTING)) {
-		text_poke_early(addr, opcode, len);
-		return;
-	}
-
 	text_poke_loc_init(&tp, addr, opcode, len, emulate);
 	text_poke_bp_batch(&tp, 1);
 }
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index cf15ef5aecff..7ea412f7b9da 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -421,8 +421,7 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 	/* ALLOC_TRAMP flags lets us know we created it */
 	ops->flags |= FTRACE_OPS_FL_ALLOC_TRAMP;
 
-	if (likely(system_state != SYSTEM_BOOTING))
-		set_memory_ro((unsigned long)trampoline, npages);
+	set_memory_ro((unsigned long)trampoline, npages);
 	set_memory_x((unsigned long)trampoline, npages);
 	return (unsigned long)trampoline;
 fail:
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 9121bc1b9453..d18c45e5d6d7 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -792,6 +792,8 @@ void __init init_mem_mapping(void)
 	early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
 }
 
+static struct mm_struct __poking_mm;
+
 /*
  * Initialize an mm_struct to be used during poking and a pointer to be used
  * during patching.
@@ -801,8 +803,9 @@ void __init poking_init(void)
 	spinlock_t *ptl;
 	pte_t *ptep;
 
-	poking_mm = copy_init_mm();
-	BUG_ON(!poking_mm);
+	__poking_mm = init_mm;
+	mm_init(&__poking_mm, NULL, __poking_mm.user_ns);
+	poking_mm = &__poking_mm;
 
 	/*
 	 * Randomize the poking address, but make sure that the following page
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index d6c48163c6de..8b099a70f291 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -90,7 +90,7 @@ extern void exit_itimers(struct task_struct *);
 extern pid_t kernel_clone(struct kernel_clone_args *kargs);
 struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
 struct task_struct *fork_idle(int);
-struct mm_struct *copy_init_mm(void);
+struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, struct user_namespace *user_ns);
 extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
 extern pid_t user_mode_thread(int (*fn)(void *), void *arg, unsigned long flags);
 extern long kernel_wait4(pid_t, int __user *, int, struct rusage *);
diff --git a/init/main.c b/init/main.c
index aa21add5f7c5..da5f1c1afc12 100644
--- a/init/main.c
+++ b/init/main.c
@@ -995,6 +995,7 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
 	sort_main_extable();
 	trap_init();
 	mm_init();
+	poking_init();
 
 	ftrace_init();
 
@@ -1134,7 +1135,6 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
 	taskstats_init_early();
 	delayacct_init();
 
-	poking_init();
 	check_bugs();
 
 	acpi_subsystem_init();
diff --git a/kernel/fork.c b/kernel/fork.c
index 08969f5aa38d..7a3e8819d95a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1104,7 +1104,7 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
 #endif
 }
 
-static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
+struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	struct user_namespace *user_ns)
 {
 	mt_init_flags(&mm->mm_mt, MM_MT_FLAGS);
@@ -2592,11 +2592,6 @@ struct task_struct * __init fork_idle(int cpu)
 	return task;
 }
 
-struct mm_struct *copy_init_mm(void)
-{
-	return dup_mm(NULL, &init_mm);
-}
-
 /*
  * This is like kernel_clone(), but shaved down and tailored to just
  * creating io_uring workers. It returns a created task, or an error pointer.
  
Linus Torvalds Oct. 25, 2022, 4:53 p.m. UTC | #10
On Tue, Oct 25, 2022 at 3:16 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> This seems to boot...

This looks much better, thanks.

But this:

> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> @@ -801,8 +803,9 @@ void __init poking_init(void)
>         spinlock_t *ptl;
>         pte_t *ptep;
>
> -       poking_mm = copy_init_mm();
> -       BUG_ON(!poking_mm);
> +       __poking_mm = init_mm;
> +       mm_init(&__poking_mm, NULL, __poking_mm.user_ns);
> +       poking_mm = &__poking_mm;

Should probably be just

        poking_mm = mm_alloc();

because we shouldn't be messing with 'mm_init()' in places like this,
and we shouldn't be exporting it either:

> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1104,7 +1104,7 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
>  #endif
>  }
>
> -static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> +struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,

since "mm_alloc()" would seem to be exactly what we need, and it's
what execve() already uses. It creates a new VM with nothing attached,
ready to be populated.

Or is there some reason why mm_alloc() doesn't work? It does require
"current" and "current->user_ns" to be set up, but that should be true
even during _very_ early boot.

                Linus
  
Peter Zijlstra Oct. 25, 2022, 5:47 p.m. UTC | #11
On Tue, Oct 25, 2022 at 09:53:27AM -0700, Linus Torvalds wrote:
> On Tue, Oct 25, 2022 at 3:16 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > This seems to boot...
> 
> This looks much better, thanks.
> 
> But this:
> 
> > diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> > @@ -801,8 +803,9 @@ void __init poking_init(void)
> >         spinlock_t *ptl;
> >         pte_t *ptep;
> >
> > -       poking_mm = copy_init_mm();
> > -       BUG_ON(!poking_mm);
> > +       __poking_mm = init_mm;
> > +       mm_init(&__poking_mm, NULL, __poking_mm.user_ns);
> > +       poking_mm = &__poking_mm;
> 
> Should probably be just
> 
>         poking_mm = mm_alloc();
> 
> because we shouldn't be messing with 'mm_init()' in places like this,
> and we shouldn't be exporting it either:

mm_alloc() uses allocate_mm(), which requires a kmem_cache to be set up.
Using static storage and calling mm_init() on it instead avoids that.

So I think we can have:

static struct mm_struct __poking_mm;

	mm_init(&__poking_mm, NULL, init_mm.user_ns);

and leave out the assignment from init_mm.
  
Linus Torvalds Oct. 25, 2022, 6:14 p.m. UTC | #12
On Tue, Oct 25, 2022 at 10:48 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> mm_alloc() uses allocate_mm() which requires a kmem_cache to be set-up.

Well, that seems to be just a strange effect of mm_cachep being set up
by the oddly named "proc_caches_init" (I say oddly named because these
days I associate "proc" with proc-fs, but I think it actually comes
from "process").

That would actually probably make more sense if it was part of
mm_init(), much earlier (where we do "kmem_cache_init()")

So this is another oddity in how we do "mm_init()", but we haven't
actually initialized _that_ part of the mm setup.

Extra bonus points for another strange thing: we have "fork_init()",
but that too doesn't actually initialize the mm_cachep that fork()
actually uses. It does initialize the process one
(task_struct_cachep). So that kind of makes sense, but yeah, the
mm_alloc() cachep should have been set up by mm_init.

I think this is all "we just ended up randomly initializing things due
to hysterical raisins"

              Linus
  
Peter Zijlstra Oct. 25, 2022, 6:46 p.m. UTC | #13
On Tue, Oct 25, 2022 at 11:14:07AM -0700, Linus Torvalds wrote:
> On Tue, Oct 25, 2022 at 10:48 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > mm_alloc() uses allocate_mm() which requires a kmem_cache to be set-up.
> 
> Well, that seems to be just a strange effect of mm_cachep being set up
> by the oddly named "proc_caches_init" (I say oddly named because these
> days I associate "proc" with proc-fs, but I think it actually comes
> from "process").
> 
> That would actually probably make more sense if it was part of
> mm_init(), much earlier (where we do "kmem_cache_init()")
> 
> So this is another oddity in how we do "mm_init()", but we haven't
> actually initialized _that_ part of the mm setup.
> 
> Extra bonus points for another strange thing: we have "fork_init()",
> but that too doesn't actually initialize the mm_cachep that fork()
> actually uses. It does initialize the process one
> (task_struct_cachep). So that kind of makes sense, but yeah, the
> mm_alloc() cachep should have been set up by mm_init.
> 
> I think this is all "we just ended up randomly initializing things due
> to hysterical raisins"

OK, I'll go make all that happen.
  

Patch

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 97342c42dda8..2e5a045731de 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -587,6 +587,10 @@ static inline pgprot_t verify_rwx(pgprot_t old, pgprot_t new, unsigned long star
 {
 	unsigned long end;
 
+	/* Kernel text is rw at boot up */
+	if (system_state == SYSTEM_BOOTING)
+		return new;
+
 	/*
 	 * 32-bit has some unfixable W+X issues, like EFI code
 	 * and writeable data being in the same page.  Disable