[0/5] x86/ftrace: Cure boot time W+X mapping

Message ID 20221025200656.951281799@infradead.org
Series x86/ftrace: Cure boot time W+X mapping

Message

Peter Zijlstra Oct. 25, 2022, 8:06 p.m. UTC
  Hi,

These few patches re-work and re-order boot things enough to avoid ftrace
creating boot time W+X maps.

The patches compile and boot for the one config I tested things on (with
ftrace=function enabled; *slooooow*).

I've pushed them out for the robots to have a go at here:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git x86/mm.poke_me
  

Comments

Linus Torvalds Oct. 25, 2022, 11:07 p.m. UTC | #1
On Tue, Oct 25, 2022 at 1:11 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> These few patches re-work and re-order boot things enough to avoid ftrace
> creating boot time W+X maps.

Thanks, looks fine.

> The patches compile and boot for the one config I tested things on (with
> ftrace=function enabled; *slooooow*).

So this might be just tracing overhead, but it might also be that you
slowed down text_poke() at bootup a _lot_.

The only part that the NX^W checking cared about was that

-       if (likely(system_state != SYSTEM_BOOTING))
-               set_memory_ro((unsigned long)trampoline, npages);
+       set_memory_ro((unsigned long)trampoline, npages);
        set_memory_x((unsigned long)trampoline, npages);

for the create_trampoline(), because without the 'set_memory_ro()',
the 'set_memory_x()' will complain.
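
For context, the complaint comes from the W^X sanity check in the x86
change_page_attr() path. A minimal sketch of the kind of check involved
(illustrative only; the verify_wx() name is assumed here and this is
not the exact kernel implementation):

	/*
	 * Sketch of a CPA-style W^X check: if the new protections would
	 * leave a range both writable and executable, warn and strip the
	 * executable bit rather than create a W+X mapping.
	 */
	static pgprotval_t verify_wx(pgprot_t old, pgprot_t new, unsigned long start)
	{
		/* Writable and executable at the same time is the bad case. */
		if ((pgprot_val(new) & (_PAGE_RW | _PAGE_NX)) != _PAGE_RW)
			return pgprot_val(new);

		WARN_ONCE(1, "CPA detected W^X violation: %016llx -> %016llx at 0x%016lx\n",
			  (u64)pgprot_val(old), (u64)pgprot_val(new), start);

		/* Keep the page writable but non-executable. */
		return pgprot_val(new) | _PAGE_NX;
	}

So the ordering matters: calling set_memory_x() on a still-writable page
trips exactly this kind of warning, while doing set_memory_ro() first makes
the set_memory_x() transition clean.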

It does strike me that it's stupid to make those be two calls that do
exactly the same thing, and we should have a combined "set it
read-only and executable" function, but that's a separate issue.

The slowness is probably not the trampolines, but just the regular
"text_poke of kernel text" that we probably want to keep special just
because otherwise it's _so_ slow to do for every alternative etc.

                Linus
  
Steven Rostedt Oct. 25, 2022, 11:17 p.m. UTC | #2
On Tue, 25 Oct 2022 16:07:25 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> The slowness is probably not the trampolines, but just the regular
> "text_poke of kernel text" that we probably want to keep special just
> because otherwise it's _so_ slow to do for every alternative etc.

Yes. That's why I recommended changing patch 4 to:

	if (unlikely(system_state == SYSTEM_BOOTING) &&
	    core_kernel_text((unsigned long)addr)) {
		text_poke_early(addr, opcode, len);
		return;
	}

The only special case we had was the ftrace trampoline that was dynamically
allocated, and there are paths that require updating it (when you add two
function callbacks to the same location; see the sketch after this message).

-- Steve
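
To illustrate the case Steve describes, here is a hypothetical module
sketch (the function names, the module, and the "schedule" filter target
are made up for illustration): registering a second ftrace_ops on a
location that already has a callback forces ftrace to rewrite its
dynamically allocated trampoline at runtime, long after SYSTEM_BOOTING,
which is why core_kernel_text() correctly excludes that path from the
boot-time text_poke_early() fast path.

	#include <linux/ftrace.h>
	#include <linux/module.h>
	#include <linux/string.h>

	static void cb_one(unsigned long ip, unsigned long parent_ip,
			   struct ftrace_ops *ops, struct ftrace_regs *fregs)
	{
		/* first callback */
	}

	static void cb_two(unsigned long ip, unsigned long parent_ip,
			   struct ftrace_ops *ops, struct ftrace_regs *fregs)
	{
		/* second callback on the same location */
	}

	static struct ftrace_ops ops_one = { .func = cb_one };
	static struct ftrace_ops ops_two = { .func = cb_two };

	static int __init wx_demo_init(void)
	{
		/* Point both ops at the same function. */
		ftrace_set_filter(&ops_one, "schedule", strlen("schedule"), 0);
		ftrace_set_filter(&ops_two, "schedule", strlen("schedule"), 0);

		register_ftrace_function(&ops_one);
		/*
		 * The second registration updates the already-live
		 * trampoline: a runtime poke of dynamically allocated,
		 * non-core-kernel text.
		 */
		register_ftrace_function(&ops_two);
		return 0;
	}
	module_init(wx_demo_init);
	MODULE_LICENSE("GPL");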
  
Peter Zijlstra Oct. 26, 2022, 7:15 a.m. UTC | #3
On Tue, Oct 25, 2022 at 04:07:25PM -0700, Linus Torvalds wrote:

> It does strike me that it's stupid to make those be two calls that do
> exactly the same thing, and we should have a combined "set it
> read-only and executable" function, but that's a separate issue.

Right, and we have it all over the place. Something like the below
perhaps? I'll feed it to the robots, see if something breaks.

> The slowness is probably not the trampolines, but just the regular
> "text_poke of kernel text" that we probably want to keep special just
> because otherwise it's _so_ slow to do for every alternative etc.

I tried with and without the patches, it's dead slow either way and I
couldn't spot a noticeable difference between the two -- so I'm assuming
it's simply the trace overhead, not the trace-enable time.


---
--- a/arch/arm/mach-omap1/sram-init.c
+++ b/arch/arm/mach-omap1/sram-init.c
@@ -74,8 +74,7 @@ void *omap_sram_push(void *funcp, unsign
 
 	dst = fncpy(sram, funcp, size);
 
-	set_memory_ro(base, pages);
-	set_memory_x(base, pages);
+	set_memory_rox(base, pages);
 
 	return dst;
 }
@@ -126,8 +125,7 @@ static void __init omap_detect_and_map_s
 	base = (unsigned long)omap_sram_base;
 	pages = PAGE_ALIGN(omap_sram_size) / PAGE_SIZE;
 
-	set_memory_ro(base, pages);
-	set_memory_x(base, pages);
+	set_memory_rox(base, pages);
 }
 
 static void (*_omap_sram_reprogram_clock)(u32 dpllctl, u32 ckctl);
--- a/arch/arm/mach-omap2/sram.c
+++ b/arch/arm/mach-omap2/sram.c
@@ -96,8 +96,7 @@ void *omap_sram_push(void *funcp, unsign
 
 	dst = fncpy(sram, funcp, size);
 
-	set_memory_ro(base, pages);
-	set_memory_x(base, pages);
+	set_memory_rox(base, pages);
 
 	return dst;
 }
@@ -217,8 +216,7 @@ static void __init omap2_map_sram(void)
 	base = (unsigned long)omap_sram_base;
 	pages = PAGE_ALIGN(omap_sram_size) / PAGE_SIZE;
 
-	set_memory_ro(base, pages);
-	set_memory_x(base, pages);
+	set_memory_rox(base, pages);
 }
 
 static void (*_omap2_sram_ddr_init)(u32 *slow_dll_ctrl, u32 fast_dll_ctrl,
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -134,10 +134,9 @@ void *alloc_insn_page(void)
 	if (!page)
 		return NULL;
 
-	if (strict_module_rwx_enabled()) {
-		set_memory_ro((unsigned long)page, 1);
-		set_memory_x((unsigned long)page, 1);
-	}
+	if (strict_module_rwx_enabled())
+		set_memory_rox((unsigned long)page, 1);
+
 	return page;
 }
 
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -415,8 +415,7 @@ create_trampoline(struct ftrace_ops *ops
 
 	set_vm_flush_reset_perms(trampoline);
 
-	set_memory_ro((unsigned long)trampoline, npages);
-	set_memory_x((unsigned long)trampoline, npages);
+	set_memory_rox((unsigned long)trampoline, npages);
 	return (unsigned long)trampoline;
 fail:
 	tramp_free(trampoline);
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -415,17 +415,12 @@ void *alloc_insn_page(void)
 		return NULL;
 
 	set_vm_flush_reset_perms(page);
-	/*
-	 * First make the page read-only, and only then make it executable to
-	 * prevent it from being W+X in between.
-	 */
-	set_memory_ro((unsigned long)page, 1);
 
 	/*
 	 * TODO: Once additional kernel code protection mechanisms are set, ensure
 	 * that the page was not maliciously altered and it is still zeroed.
 	 */
-	set_memory_x((unsigned long)page, 1);
+	set_memory_rox((unsigned long)page, 1);
 
 	return page;
 }
--- a/drivers/misc/sram-exec.c
+++ b/drivers/misc/sram-exec.c
@@ -106,10 +106,7 @@ void *sram_exec_copy(struct gen_pool *po
 
 	dst_cpy = fncpy(dst, src, size);
 
-	ret = set_memory_ro((unsigned long)base, pages);
-	if (ret)
-		goto error_out;
-	ret = set_memory_x((unsigned long)base, pages);
+	ret = set_memory_rox((unsigned long)base, pages);
 	if (ret)
 		goto error_out;
 
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -860,8 +860,7 @@ static inline void bpf_prog_lock_ro(stru
 static inline void bpf_jit_binary_lock_ro(struct bpf_binary_header *hdr)
 {
 	set_vm_flush_reset_perms(hdr);
-	set_memory_ro((unsigned long)hdr, hdr->size >> PAGE_SHIFT);
-	set_memory_x((unsigned long)hdr, hdr->size >> PAGE_SHIFT);
+	set_memory_rox((unsigned long)hdr, hdr->size >> PAGE_SHIFT);
 }
 
 int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap);
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -14,6 +14,14 @@ static inline int set_memory_x(unsigned
 static inline int set_memory_nx(unsigned long addr, int numpages) { return 0; }
 #endif
 
+static inline int set_memory_rox(unsigned long addr, int numpages)
+{
+	int ret = set_memory_ro(addr, numpages);
+	if (ret)
+		return ret;
+	return set_memory_x(addr, numpages);
+}
+
 #ifndef CONFIG_ARCH_HAS_SET_DIRECT_MAP
 static inline int set_direct_map_invalid_noflush(struct page *page)
 {
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -494,8 +494,7 @@ static int bpf_struct_ops_map_update_ele
 	refcount_set(&kvalue->refcnt, 1);
 	bpf_map_inc(map);
 
-	set_memory_ro((long)st_map->image, 1);
-	set_memory_x((long)st_map->image, 1);
+	set_memory_rox((long)st_map->image, 1);
 	err = st_ops->reg(kdata);
 	if (likely(!err)) {
 		/* Pair with smp_load_acquire() during lookup_elem().
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -864,8 +864,7 @@ static struct bpf_prog_pack *alloc_new_p
 	list_add_tail(&pack->list, &pack_list);
 
 	set_vm_flush_reset_perms(pack->ptr);
-	set_memory_ro((unsigned long)pack->ptr, BPF_PROG_PACK_SIZE / PAGE_SIZE);
-	set_memory_x((unsigned long)pack->ptr, BPF_PROG_PACK_SIZE / PAGE_SIZE);
+	set_memory_rox((unsigned long)pack->ptr, BPF_PROG_PACK_SIZE / PAGE_SIZE);
 	return pack;
 }
 
@@ -883,8 +882,7 @@ void *bpf_prog_pack_alloc(u32 size, bpf_
 		if (ptr) {
 			bpf_fill_ill_insns(ptr, size);
 			set_vm_flush_reset_perms(ptr);
-			set_memory_ro((unsigned long)ptr, size / PAGE_SIZE);
-			set_memory_x((unsigned long)ptr, size / PAGE_SIZE);
+			set_memory_rox((unsigned long)ptr, size / PAGE_SIZE);
 		}
 		goto out;
 	}
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -468,8 +468,7 @@ static int bpf_trampoline_update(struct
 	if (err < 0)
 		goto out;
 
-	set_memory_ro((long)im->image, 1);
-	set_memory_x((long)im->image, 1);
+	set_memory_rox((long)im->image, 1);
 
 	WARN_ON(tr->cur_image && tr->selector == 0);
 	WARN_ON(!tr->cur_image && tr->selector);
--- a/net/bpf/bpf_dummy_struct_ops.c
+++ b/net/bpf/bpf_dummy_struct_ops.c
@@ -124,8 +124,7 @@ int bpf_struct_ops_test_run(struct bpf_p
 	if (err < 0)
 		goto out;
 
-	set_memory_ro((long)image, 1);
-	set_memory_x((long)image, 1);
+	set_memory_rox((long)image, 1);
 	prog_ret = dummy_ops_call_op(image, args);
 
 	err = dummy_ops_copy_args(args);
  
Linus Torvalds Oct. 26, 2022, 5:59 p.m. UTC | #4
On Wed, Oct 26, 2022 at 12:15 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Right, and we have it all over the place. Something like the below
> perhaps? I'll feed it to the robots, see if something breaks.

I was nodding along with the patches like this:

> -       set_memory_ro(base, pages);
> -       set_memory_x(base, pages);
> +       set_memory_rox(base, pages);

but then I got to this part:

> +static inline int set_memory_rox(unsigned long addr, int numpages)
> +{
> +       int ret = set_memory_ro(addr, numpages);
> +       if (ret)
> +               return ret;
> +       return set_memory_x(addr, numpages);
> +}

and that's when I went "no, I really meant make it one single call".

set_memory_ro() and set_memory_x() basically end up doing the exact
same thing, just with different bits.  So it's not only silly to have
the callers do two different calls, it's silly to have the
*implementation* do two different scans of the page tables and page
merging/splitting.

I think in the case of x86, the set_memory_rox() function would
basically just be

    int set_memory_rox(unsigned long addr, int numpages)
    {
        pgprot_t clr = __pgprot(_PAGE_RW);
        pgprot_t set = { 0 };

        if (__supported_pte_mask & _PAGE_NX)
                set.pgprot |= _PAGE_NX;

        return change_page_attr_set_clr(&addr, numpages, set, clr, 0, NULL);
    }

or something close to that. (NOTE! The above was cobbled together in
the MUA, hasn't seen a compiler much less been booted, and might be
completely broken for some reason - it's meant to be the concept, not
some kind of final word).

IOW, the whole "set_rox" is really just a _single_
change_page_attr_set_clr() call.

Maybe you meant to do that, and this patch was just prep-work for the
arch code being the second stage?

                  Linus
  
Peter Zijlstra Oct. 27, 2022, 6:59 a.m. UTC | #5
On Wed, Oct 26, 2022 at 10:59:29AM -0700, Linus Torvalds wrote:

> Maybe you meant to do that, and this patch was just prep-work for the
> arch code being the second stage?

Yeah; also, since this is cross arch, we need a fallback. Anyway;
robots hated on me for missing a few includes. I'll go prod at this
more.
  
Peter Zijlstra Oct. 29, 2022, 11:30 a.m. UTC | #6
On Thu, Oct 27, 2022 at 08:59:45AM +0200, Peter Zijlstra wrote:
> On Wed, Oct 26, 2022 at 10:59:29AM -0700, Linus Torvalds wrote:
> 
> > Maybe you meant to do that, and this patch was just prep-work for the
> > arch code being the second stage?
> 
> Yeah; also, since this is cross arch, we need a fallback. Anyway;
> robots hated on me for missing a few includes. I'll go prod at this
> more.

Got around to it; I added the below patch on top and things seem to
still boot so it must be good :-)

---
Subject: x86/mm: Implement native set_memory_rox()
From: Peter Zijlstra <peterz@infradead.org>
Date: Sat Oct 29 13:19:31 CEST 2022

Provide a native implementation of set_memory_rox(), avoiding the
double set_memory_ro();set_memory_x(); calls.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/set_memory.h |    3 +++
 arch/x86/mm/pat/set_memory.c      |   10 ++++++++++
 include/linux/set_memory.h        |    2 ++
 3 files changed, 15 insertions(+)

--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -6,6 +6,9 @@
 #include <asm/page.h>
 #include <asm-generic/set_memory.h>
 
+#define set_memory_rox set_memory_rox
+int set_memory_rox(unsigned long addr, int numpages);
+
 /*
  * The set_memory_* API can be used to change various attributes of a virtual
  * address range. The attributes include:
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2025,6 +2025,16 @@ int set_memory_ro(unsigned long addr, in
 	return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW), 0);
 }
 
+int set_memory_rox(unsigned long addr, int numpages)
+{
+	pgprot_t clr = __pgprot(_PAGE_RW);
+
+	if (__supported_pte_mask & _PAGE_NX)
+		clr.pgprot |= _PAGE_NX;
+
+	return change_page_attr_clear(&addr, numpages, clr, 0);
+}
+
 int set_memory_rw(unsigned long addr, int numpages)
 {
 	return change_page_attr_set(&addr, numpages, __pgprot(_PAGE_RW), 0);
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -14,6 +14,7 @@ static inline int set_memory_x(unsigned
 static inline int set_memory_nx(unsigned long addr, int numpages) { return 0; }
 #endif
 
+#ifndef set_memory_rox
 static inline int set_memory_rox(unsigned long addr, int numpages)
 {
 	int ret = set_memory_ro(addr, numpages);
@@ -21,6 +22,7 @@ static inline int set_memory_rox(unsigne
 		return ret;
 	return set_memory_x(addr, numpages);
 }
+#endif
 
 #ifndef CONFIG_ARCH_HAS_SET_DIRECT_MAP
 static inline int set_direct_map_invalid_noflush(struct page *page)
  
Linus Torvalds Oct. 29, 2022, 5:35 p.m. UTC | #7
On Sat, Oct 29, 2022 at 4:30 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Got around to it; I added the below patch on top and things seem to
> still boot so it must be good :-)

Thanks, looks good to me, and as I see your simpler version, I
realized that my broken MUA version wasn't "rox", it was "ronx".

                Linus