[RFC] text_poke/ftrace/x86: Allow text_poke() to be called in early boot
Commit Message
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
Currently text_poke() just does a simple memcpy() on early boot because
the kernel code is read writable at that time. But ftrace uses text_poke
on the ftrace trampoline, which is not part of kernel text, and having non
kernel text around that can be writable and executable causes several
special cases where checks for system_state == SYSTEM_BOOTING needs to be
done to ignore this special case. This is tricky and can lead to memory
that can be kernel writable and executable after boot (due to bugs).
By moving poking_init() to mm_init() which is called before ftrace_init(),
this will allow ftrace to create its trampoline as read only, and the
text_poke() will do its normal thing.
This required some updates to fork and the maple_tree code to allow it to
be called with enabling interrupts in the time when interrupts must remain
disabled.
text_poke() will still use memcpy() on kernel core text during boot up as
it keeps things fast for all static_branch()es and such as well as
modifying the ftrace locations at boot up too.
This removes the special code added around ftrace trampolines in x86 to be
writable and executable during boot up.
Link: https://lore.kernel.org/r/20221024112730.180916b3@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
** Note this may break other architectures. **
arch/x86/include/asm/ftrace.h | 6 ------
arch/x86/kernel/alternative.c | 6 ++++--
arch/x86/kernel/ftrace.c | 29 +----------------------------
arch/x86/mm/init_64.c | 2 --
init/main.c | 8 ++++----
kernel/fork.c | 8 +++++++-
lib/maple_tree.c | 16 +++++++++++++++-
7 files changed, 31 insertions(+), 44 deletions(-)
Comments
On Mon, Oct 24, 2022 at 4:03 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> This required some updates to fork and the maple_tree code to allow it to
> be called with enabling interrupts in the time when interrupts must remain
> disabled.
Yeah, moving special cases from one place to another doesn't really
help. Particularly to something as core as dup_mm().
All of this comes from "poking_init()" being a steaming pile of bovine
excrement, doing random odd things, and having that special
"copy_init_mm()" helper that just makes things even worse. Nothing
else uses that, and it shouldn't have called "dup_mm()" in the first
place.
At this point, there is no actual user VM to even copy, so 99% of
everything that duip_mm() does is not just pointless, but actively
wrong, like the mmap_write_lock_nested() when we're in early boot.
I'm not even sure why "poking_mm" exists at all, and why it has
created a whole new copy of "init_mm", and why this code isn't just
using '&init_mm' like everything else that wants to just walk the
kernel page tables.
Yes, I see that commit 4fc19708b165 ("x86/alternatives: Initialize
temporary mm for patching"), and no, none of that makes any sense to
me. It seems just (mis-)designed to fail.
Linus
On Mon, 24 Oct 2022 17:11:13 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Mon, Oct 24, 2022 at 4:03 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > This required some updates to fork and the maple_tree code to allow it to
> > be called with enabling interrupts in the time when interrupts must remain
> > disabled.
>
> Yeah, moving special cases from one place to another doesn't really
> help. Particularly to something as core as dup_mm().
>
> All of this comes from "poking_init()" being a steaming pile of bovine
> excrement, doing random odd things, and having that special
> "copy_init_mm()" helper that just makes things even worse. Nothing
> else uses that, and it shouldn't have called "dup_mm()" in the first
> place.
>
> At this point, there is no actual user VM to even copy, so 99% of
> everything that duip_mm() does is not just pointless, but actively
> wrong, like the mmap_write_lock_nested() when we're in early boot.
>
> I'm not even sure why "poking_mm" exists at all, and why it has
> created a whole new copy of "init_mm", and why this code isn't just
> using '&init_mm' like everything else that wants to just walk the
> kernel page tables.
It's not just walking the page tables, it's creating one that nobody else
is using. Since we want to keep all executable code read only, the way
text_poke() works is to create a new memory mapping where the pages it has
isn't visible by anyone else (which is why it doesn't use init_mm). And
then makes a mapping to the executable address as non executable and
writable. Makes the update, and then removes the mapping.
>
> Yes, I see that commit 4fc19708b165 ("x86/alternatives: Initialize
> temporary mm for patching"), and no, none of that makes any sense to
> me. It seems just (mis-)designed to fail.
>
It's all about updating read only pages that are executable with a shadow mm.
-- Steve
On Mon, Oct 24, 2022 at 5:21 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> It's all about updating read only pages that are executable with a shadow mm.
Right. And it doesn't actually need the mm at all, all it wants is the
kernel page tables. Which is why all the "dup_mmap()" stuff seems so
wrong.
I suspect mm_alloc() does everything that VM actually needs.
IOW, it shouldn't have used the fork() helper, it should have used the
execve() helper that actually starts out from a clean slate. Because a
clean slate is exactly what that code wants.
No?
Linus
On Mon, 24 Oct 2022 18:02:32 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Mon, Oct 24, 2022 at 5:21 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > It's all about updating read only pages that are executable with a shadow mm.
>
> Right. And it doesn't actually need the mm at all, all it wants is the
> kernel page tables. Which is why all the "dup_mmap()" stuff seems so
> wrong.
>
> I suspect mm_alloc() does everything that VM actually needs.
>
> IOW, it shouldn't have used the fork() helper, it should have used the
> execve() helper that actually starts out from a clean slate. Because a
> clean slate is exactly what that code wants.
>
> No?
>
Something to look into. But I'm guessing that's best for the next merge
window, and not for the -rc releases?
-- Steve
On Mon, Oct 24, 2022 at 6:05 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Something to look into. But I'm guessing that's best for the next merge
> window, and not for the -rc releases?
Yes., I just applied your v1 patch.
Linus
On Mon, Oct 24, 2022 at 05:11:13PM -0700, Linus Torvalds wrote:
> All of this comes from "poking_init()" being a steaming pile of bovine
> excrement, doing random odd things, and having that special
> "copy_init_mm()" helper that just makes things even worse. Nothing
> else uses that, and it shouldn't have called "dup_mm()" in the first
> place.
Agreed; dup_mm() makes no sense and it is easily removed, see my earlier
patch. Perhaps it can be simplified further to:
__poking_mm = init_mm
omitting the mm_init() I retained, but I need to stare harder at all
that.
> I'm not even sure why "poking_mm" exists at all, and why it has
> created a whole new copy of "init_mm", and why this code isn't just
> using '&init_mm' like everything else that wants to just walk the
> kernel page tables.
Because it instantiates user-space page-tables in it, you really don't
want those in init_mm.
The whole (and sole) purpose of poking_mm is to contain the writable
aliases. Only the CPU that has the poking_mm active has access to them.
@@ -85,12 +85,6 @@ struct dyn_arch_ftrace {
#ifndef __ASSEMBLY__
-#if defined(CONFIG_FUNCTION_TRACER) && defined(CONFIG_DYNAMIC_FTRACE)
-extern void set_ftrace_ops_ro(void);
-#else
-static inline void set_ftrace_ops_ro(void) { }
-#endif
-
#define ARCH_HAS_SYSCALL_MATCH_SYM_NAME
static inline bool arch_syscall_match_sym_name(const char *sym, const char *name)
{
@@ -1681,7 +1681,8 @@ void __ref text_poke_queue(void *addr, const void *opcode, size_t len, const voi
{
struct text_poke_loc *tp;
- if (unlikely(system_state == SYSTEM_BOOTING)) {
+ if (unlikely(system_state == SYSTEM_BOOTING &&
+ core_kernel_text((unsigned long)addr))) {
text_poke_early(addr, opcode, len);
return;
}
@@ -1707,7 +1708,8 @@ void __ref text_poke_bp(void *addr, const void *opcode, size_t len, const void *
{
struct text_poke_loc tp;
- if (unlikely(system_state == SYSTEM_BOOTING)) {
+ if (unlikely(system_state == SYSTEM_BOOTING &&
+ core_kernel_text((unsigned long)addr))) {
text_poke_early(addr, opcode, len);
return;
}
@@ -415,8 +415,7 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
set_vm_flush_reset_perms(trampoline);
- if (likely(system_state != SYSTEM_BOOTING))
- set_memory_ro((unsigned long)trampoline, npages);
+ set_memory_ro((unsigned long)trampoline, npages);
set_memory_x((unsigned long)trampoline, npages);
return (unsigned long)trampoline;
fail:
@@ -424,32 +423,6 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
return 0;
}
-void set_ftrace_ops_ro(void)
-{
- struct ftrace_ops *ops;
- unsigned long start_offset;
- unsigned long end_offset;
- unsigned long npages;
- unsigned long size;
-
- do_for_each_ftrace_op(ops, ftrace_ops_list) {
- if (!(ops->flags & FTRACE_OPS_FL_ALLOC_TRAMP))
- continue;
-
- if (ops->flags & FTRACE_OPS_FL_SAVE_REGS) {
- start_offset = (unsigned long)ftrace_regs_caller;
- end_offset = (unsigned long)ftrace_regs_caller_end;
- } else {
- start_offset = (unsigned long)ftrace_caller;
- end_offset = (unsigned long)ftrace_caller_end;
- }
- size = end_offset - start_offset;
- size = size + RET_SIZE + sizeof(void *);
- npages = DIV_ROUND_UP(size, PAGE_SIZE);
- set_memory_ro((unsigned long)ops->trampoline, npages);
- } while_for_each_ftrace_op(ops);
-}
-
static unsigned long calc_trampoline_call_offset(bool save_regs)
{
unsigned long start_offset;
@@ -1398,8 +1398,6 @@ void mark_rodata_ro(void)
all_end = roundup((unsigned long)_brk_end, PMD_SIZE);
set_memory_nx(text_end, (all_end - text_end) >> PAGE_SHIFT);
- set_ftrace_ops_ro();
-
#ifdef CONFIG_CPA_DEBUG
printk(KERN_INFO "Testing CPA: undo %lx-%lx\n", start, end);
set_memory_rw(start, (end-start) >> PAGE_SHIFT);
@@ -860,6 +860,10 @@ static void __init mm_init(void)
/* Should be run after espfix64 is set up. */
pti_init();
kmsan_init_runtime();
+ proc_caches_init();
+ radix_tree_init();
+ maple_tree_init();
+ poking_init();
}
#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
@@ -1011,8 +1015,6 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
if (WARN(!irqs_disabled(),
"Interrupts were enabled *very* early, fixing it\n"))
local_irq_disable();
- radix_tree_init();
- maple_tree_init();
/*
* Set up housekeeping before setting up workqueues to allow the unbound
@@ -1117,7 +1119,6 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
thread_stack_cache_init();
cred_init();
fork_init();
- proc_caches_init();
uts_ns_init();
key_init();
security_init();
@@ -1134,7 +1135,6 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
taskstats_init_early();
delayacct_init();
- poking_init();
check_bugs();
acpi_subsystem_init();
@@ -702,7 +702,13 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
mas_destroy(&mas);
out:
mmap_write_unlock(mm);
- flush_tlb_mm(oldmm);
+ /*
+ * poking_init() calls into here at early boot up.
+ * At that time, there's no need to flush the tlb.
+ * If we do, it will enable interrupts and cause a bug.
+ */
+ if (likely(!early_boot_irqs_disabled))
+ flush_tlb_mm(oldmm);
mmap_write_unlock(oldmm);
dup_userfaultfd_complete(&uf);
fail_uprobe_end:
@@ -1253,7 +1253,21 @@ static inline void mas_alloc_nodes(struct ma_state *mas, gfp_t gfp)
}
max_req = min(requested, max_req);
- count = mt_alloc_bulk(gfp, max_req, slots);
+
+ /*
+ * text_poke() can be called very early, and it
+ * calls dup_mm() which eventually leads down to here.
+ * In that case, mt_alloc_bulk() will call kmem_cache_alloc_bulk()
+ * which must be called with interrupts enabled. To avoid
+ * doing that in early bootup, where interrupts must remain
+ * disabled, just allocate a single slot.
+ */
+ if (unlikely(early_boot_irqs_disabled)) {
+ slots[0] = mt_alloc_one(gfp | GFP_ATOMIC);
+ count = slots[0] ? 1 : 0;
+ } else {
+ count = mt_alloc_bulk(gfp, max_req, slots);
+ }
if (!count)
goto nomem_bulk;