[RFC,v2,18/20] context_tracking,x86: Defer kernel text patching IPIs
Commit Message
text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
them vs the newly patched instruction. CPUs that are executing in userspace
do not need this synchronization to happen immediately, and this is
actually harmful interference for NOHZ_FULL CPUs.
As the synchronization IPIs are sent using a blocking call, returning from
text_poke_bp_batch() implies all CPUs will observe the patched
instruction(s), and this should be preserved even if the IPI is deferred.
In other words, to safely defer this synchronization, any kernel
instruction leading to the execution of the deferred instruction
sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.
This means we must pay attention to mutable instructions in the early entry
code:
- alternatives
- static keys
- all sorts of probes (kprobes/ftrace/bpf/???)
The early entry code leading to ct_work_flush() is noinstr, which gets rid
of the probes.
Alternatives are safe, because it's boot-time patching (before SMP is
even brought up) which is before any IPI deferral can happen.
This leaves us with static keys. Any static key used in early entry code
should be only forever-enabled at boot time, IOW __ro_after_init (pretty
much like alternatives). Objtool is now able to point at static keys that
don't respect this, and all static keys used in early entry code have now
been verified as behaving like so.
Leverage the new context_tracking infrastructure to defer sync_core() IPIs
to a target CPU's next kernel entry.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
arch/x86/include/asm/context_tracking_work.h | 6 +++--
arch/x86/include/asm/text-patching.h | 1 +
arch/x86/kernel/alternative.c | 24 ++++++++++++++++----
arch/x86/kernel/kprobes/core.c | 4 ++--
arch/x86/kernel/kprobes/opt.c | 4 ++--
arch/x86/kernel/module.c | 2 +-
include/linux/context_tracking_work.h | 4 ++--
7 files changed, 32 insertions(+), 13 deletions(-)
Comments
Interesting series Valentin. Some high-level question/comments on this one:
> On Jul 20, 2023, at 12:34 PM, Valentin Schneider <vschneid@redhat.com> wrote:
>
> text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
> them vs the newly patched instruction. CPUs that are executing in userspace
> do not need this synchronization to happen immediately, and this is
> actually harmful interference for NOHZ_FULL CPUs.
Does the amount of harm not correspond to practical frequency of text_poke?
How often does instruction patching really happen? If it is very infrequent
then I am not sure if it is that harmful.
>
> As the synchronization IPIs are sent using a blocking call, returning from
> text_poke_bp_batch() implies all CPUs will observe the patched
> instruction(s), and this should be preserved even if the IPI is deferred.
> In other words, to safely defer this synchronization, any kernel
> instruction leading to the execution of the deferred instruction
> sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.
If it is not infrequent, then are you handling the case where userland
spends multiple seconds before entering the kernel, and all this while
the blocking call waits? Perhaps in such situation you want the real IPI
to be sent out instead of the deferred one?
thanks,
- Joel
>
> This means we must pay attention to mutable instructions in the early entry
> code:
> - alternatives
> - static keys
> - all sorts of probes (kprobes/ftrace/bpf/???)
>
> The early entry code leading to ct_work_flush() is noinstr, which gets rid
> of the probes.
>
> Alternatives are safe, because it's boot-time patching (before SMP is
> even brought up) which is before any IPI deferral can happen.
>
> This leaves us with static keys. Any static key used in early entry code
> should be only forever-enabled at boot time, IOW __ro_after_init (pretty
> much like alternatives). Objtool is now able to point at static keys that
> don't respect this, and all static keys used in early entry code have now
> been verified as behaving like so.
>
> Leverage the new context_tracking infrastructure to defer sync_core() IPIs
> to a target CPU's next kernel entry.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
> arch/x86/include/asm/context_tracking_work.h | 6 +++--
> arch/x86/include/asm/text-patching.h | 1 +
> arch/x86/kernel/alternative.c | 24 ++++++++++++++++----
> arch/x86/kernel/kprobes/core.c | 4 ++--
> arch/x86/kernel/kprobes/opt.c | 4 ++--
> arch/x86/kernel/module.c | 2 +-
> include/linux/context_tracking_work.h | 4 ++--
> 7 files changed, 32 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
> index 5bc29e6b2ed38..2c66687ce00e2 100644
> --- a/arch/x86/include/asm/context_tracking_work.h
> +++ b/arch/x86/include/asm/context_tracking_work.h
> @@ -2,11 +2,13 @@
> #ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H
> #define _ASM_X86_CONTEXT_TRACKING_WORK_H
>
> +#include <asm/sync_core.h>
> +
> static __always_inline void arch_context_tracking_work(int work)
> {
> switch (work) {
> - case CONTEXT_WORK_n:
> - // Do work...
> + case CONTEXT_WORK_SYNC:
> + sync_core();
> break;
> }
> }
> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
> index 29832c338cdc5..b6939e965e69d 100644
> --- a/arch/x86/include/asm/text-patching.h
> +++ b/arch/x86/include/asm/text-patching.h
> @@ -43,6 +43,7 @@ extern void text_poke_early(void *addr, const void *opcode, size_t len);
> */
> extern void *text_poke(void *addr, const void *opcode, size_t len);
> extern void text_poke_sync(void);
> +extern void text_poke_sync_deferrable(void);
> extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
> extern void *text_poke_copy(void *addr, const void *opcode, size_t len);
> extern void *text_poke_copy_locked(void *addr, const void *opcode, size_t len, bool core_ok);
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 72646d75b6ffe..fcce480e1919e 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -18,6 +18,7 @@
> #include <linux/mmu_context.h>
> #include <linux/bsearch.h>
> #include <linux/sync_core.h>
> +#include <linux/context_tracking.h>
> #include <asm/text-patching.h>
> #include <asm/alternative.h>
> #include <asm/sections.h>
> @@ -1933,9 +1934,24 @@ static void do_sync_core(void *info)
> sync_core();
> }
>
> +static bool do_sync_core_defer_cond(int cpu, void *info)
> +{
> + return !ct_set_cpu_work(cpu, CONTEXT_WORK_SYNC);
> +}
> +
> +static void __text_poke_sync(smp_cond_func_t cond_func)
> +{
> + on_each_cpu_cond(cond_func, do_sync_core, NULL, 1);
> +}
> +
> void text_poke_sync(void)
> {
> - on_each_cpu(do_sync_core, NULL, 1);
> + __text_poke_sync(NULL);
> +}
> +
> +void text_poke_sync_deferrable(void)
> +{
> + __text_poke_sync(do_sync_core_defer_cond);
> }
>
> /*
> @@ -2145,7 +2161,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
> text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE);
> }
>
> - text_poke_sync();
> + text_poke_sync_deferrable();
>
> /*
> * Second step: update all but the first byte of the patched range.
> @@ -2207,7 +2223,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
> * not necessary and we'd be safe even without it. But
> * better safe than sorry (plus there's not only Intel).
> */
> - text_poke_sync();
> + text_poke_sync_deferrable();
> }
>
> /*
> @@ -2228,7 +2244,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
> }
>
> if (do_sync)
> - text_poke_sync();
> + text_poke_sync_deferrable();
>
> /*
> * Remove and wait for refs to be zero.
> diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
> index f7f6042eb7e6c..a38c914753397 100644
> --- a/arch/x86/kernel/kprobes/core.c
> +++ b/arch/x86/kernel/kprobes/core.c
> @@ -735,7 +735,7 @@ void arch_arm_kprobe(struct kprobe *p)
> u8 int3 = INT3_INSN_OPCODE;
>
> text_poke(p->addr, &int3, 1);
> - text_poke_sync();
> + text_poke_sync_deferrable();
> perf_event_text_poke(p->addr, &p->opcode, 1, &int3, 1);
> }
>
> @@ -745,7 +745,7 @@ void arch_disarm_kprobe(struct kprobe *p)
>
> perf_event_text_poke(p->addr, &int3, 1, &p->opcode, 1);
> text_poke(p->addr, &p->opcode, 1);
> - text_poke_sync();
> + text_poke_sync_deferrable();
> }
>
> void arch_remove_kprobe(struct kprobe *p)
> diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c
> index 57b0037d0a996..88451a744ceda 100644
> --- a/arch/x86/kernel/kprobes/opt.c
> +++ b/arch/x86/kernel/kprobes/opt.c
> @@ -521,11 +521,11 @@ void arch_unoptimize_kprobe(struct optimized_kprobe *op)
> JMP32_INSN_SIZE - INT3_INSN_SIZE);
>
> text_poke(addr, new, INT3_INSN_SIZE);
> - text_poke_sync();
> + text_poke_sync_deferrable();
> text_poke(addr + INT3_INSN_SIZE,
> new + INT3_INSN_SIZE,
> JMP32_INSN_SIZE - INT3_INSN_SIZE);
> - text_poke_sync();
> + text_poke_sync_deferrable();
>
> perf_event_text_poke(op->kp.addr, old, JMP32_INSN_SIZE, new, JMP32_INSN_SIZE);
> }
> diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
> index b05f62ee2344b..8b4542dc51b6d 100644
> --- a/arch/x86/kernel/module.c
> +++ b/arch/x86/kernel/module.c
> @@ -242,7 +242,7 @@ static int write_relocate_add(Elf64_Shdr *sechdrs,
> write, apply);
>
> if (!early) {
> - text_poke_sync();
> + text_poke_sync_deferrable();
> mutex_unlock(&text_mutex);
> }
>
> diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
> index fb74db8876dd2..13fc97b395030 100644
> --- a/include/linux/context_tracking_work.h
> +++ b/include/linux/context_tracking_work.h
> @@ -5,12 +5,12 @@
> #include <linux/bitops.h>
>
> enum {
> - CONTEXT_WORK_n_OFFSET,
> + CONTEXT_WORK_SYNC_OFFSET,
> CONTEXT_WORK_MAX_OFFSET
> };
>
> enum ct_work {
> - CONTEXT_WORK_n = BIT(CONTEXT_WORK_n_OFFSET),
> + CONTEXT_WORK_SYNC = BIT(CONTEXT_WORK_SYNC_OFFSET),
> CONTEXT_WORK_MAX = BIT(CONTEXT_WORK_MAX_OFFSET)
> };
>
> --
> 2.31.1
>
On 25/07/23 06:49, Joel Fernandes wrote:
> Interesting series Valentin. Some high-level question/comments on this one:
>
>> On Jul 20, 2023, at 12:34 PM, Valentin Schneider <vschneid@redhat.com> wrote:
>>
>> text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
>> them vs the newly patched instruction. CPUs that are executing in userspace
>> do not need this synchronization to happen immediately, and this is
>> actually harmful interference for NOHZ_FULL CPUs.
>
> Does the amount of harm not correspond to practical frequency of text_poke?
> How often does instruction patching really happen? If it is very infrequent
> then I am not sure if it is that harmful.
>
Being pushed over a latency threshold *once* is enough to impact the
latency evaluation of your given system/application.
It's mainly about shielding the isolated, NOHZ_FULL CPUs from whatever the
housekeeping CPUs may be up to (flipping static keys, loading kprobes,
using ftrace...) - frequency of the interference isn't such a big part of
the reasoning.
>>
>> As the synchronization IPIs are sent using a blocking call, returning from
>> text_poke_bp_batch() implies all CPUs will observe the patched
>> instruction(s), and this should be preserved even if the IPI is deferred.
>> In other words, to safely defer this synchronization, any kernel
>> instruction leading to the execution of the deferred instruction
>> sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.
>
> If it is not infrequent, then are you handling the case where userland
> spends multiple seconds before entering the kernel, and all this while
> the blocking call waits? Perhaps in such situation you want the real IPI
> to be sent out instead of the deferred one?
>
The blocking call only waits for CPUs for which it queued a CSD. Deferred
calls do not queue a CSD thus do not impact the waiting at all. See
smp_call_function_many_cond().
On Tue, Jul 25, 2023 at 06:49:45AM -0400, Joel Fernandes wrote:
> Interesting series Valentin. Some high-level question/comments on this one:
>
> > On Jul 20, 2023, at 12:34 PM, Valentin Schneider <vschneid@redhat.com> wrote:
> >
> > text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
> > them vs the newly patched instruction. CPUs that are executing in userspace
> > do not need this synchronization to happen immediately, and this is
> > actually harmful interference for NOHZ_FULL CPUs.
>
> Does the amount of harm not correspond to practical frequency of text_poke?
> How often does instruction patching really happen? If it is very infrequent
> then I am not sure if it is that harmful.
Well, it can happen quite a bit, also from things people would not
typically 'expect' it.
For instance, the moment you create the first per-task perf event we
frob some jump-labels (and again some second after the last one goes
away).
The same for a bunch of runtime network configurations.
> > As the synchronization IPIs are sent using a blocking call, returning from
> > text_poke_bp_batch() implies all CPUs will observe the patched
> > instruction(s), and this should be preserved even if the IPI is deferred.
> > In other words, to safely defer this synchronization, any kernel
> > instruction leading to the execution of the deferred instruction
> > sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.
>
> If it is not infrequent, then are you handling the case where userland
> spends multiple seconds before entering the kernel, and all this while
> the blocking call waits? Perhaps in such situation you want the real IPI
> to be sent out instead of the deferred one?
Please re-read what Valentin wrote -- nobody is waiting on anything.
On 7/25/23 09:36, Valentin Schneider wrote:
> On 25/07/23 06:49, Joel Fernandes wrote:
>> Interesting series Valentin. Some high-level question/comments on this one:
>>
>>> On Jul 20, 2023, at 12:34 PM, Valentin Schneider <vschneid@redhat.com> wrote:
>>>
>>> text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
>>> them vs the newly patched instruction. CPUs that are executing in userspace
>>> do not need this synchronization to happen immediately, and this is
>>> actually harmful interference for NOHZ_FULL CPUs.
>>
>> Does the amount of harm not correspond to practical frequency of text_poke?
>> How often does instruction patching really happen? If it is very infrequent
>> then I am not sure if it is that harmful.
>>
>
> Being pushed over a latency threshold *once* is enough to impact the
> latency evaluation of your given system/application.
>
> It's mainly about shielding the isolated, NOHZ_FULL CPUs from whatever the
> housekeeping CPUs may be up to (flipping static keys, loading kprobes,
> using ftrace...) - frequency of the interference isn't such a big part of
> the reasoning.
Makes sense.
>>> As the synchronization IPIs are sent using a blocking call, returning from
>>> text_poke_bp_batch() implies all CPUs will observe the patched
>>> instruction(s), and this should be preserved even if the IPI is deferred.
>>> In other words, to safely defer this synchronization, any kernel
>>> instruction leading to the execution of the deferred instruction
>>> sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.
>>
>> If it is not infrequent, then are you handling the case where userland
>> spends multiple seconds before entering the kernel, and all this while
>> the blocking call waits? Perhaps in such situation you want the real IPI
>> to be sent out instead of the deferred one?
>>
>
> The blocking call only waits for CPUs for which it queued a CSD. Deferred
> calls do not queue a CSD thus do not impact the waiting at all. See
> smp_call_function_many_cond().
Ah I see you are using on_each_cpu_cond(). I should have gone through
the other patch before making noise.
thanks,
- Joel
On 7/25/23 09:39, Peter Zijlstra wrote:
> On Tue, Jul 25, 2023 at 06:49:45AM -0400, Joel Fernandes wrote:
>> Interesting series Valentin. Some high-level question/comments on this one:
>>
>>> On Jul 20, 2023, at 12:34 PM, Valentin Schneider <vschneid@redhat.com> wrote:
>>>
>>> text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
>>> them vs the newly patched instruction. CPUs that are executing in userspace
>>> do not need this synchronization to happen immediately, and this is
>>> actually harmful interference for NOHZ_FULL CPUs.
>>
>> Does the amount of harm not correspond to practical frequency of text_poke?
>> How often does instruction patching really happen? If it is very infrequent
>> then I am not sure if it is that harmful.
>
> Well, it can happen quite a bit, also from things people would not
> typically 'expect' it.
>
> For instance, the moment you create the first per-task perf event we
> frob some jump-labels (and again some second after the last one goes
> away).
>
> The same for a bunch of runtime network configurations.
Ok cool. I guess I still have memories of that old ARM device I had
where modifications to kernel text was forbidden by hardware (was a
security feature). That was making kprobes unusable...
>>> As the synchronization IPIs are sent using a blocking call, returning from
>>> text_poke_bp_batch() implies all CPUs will observe the patched
>>> instruction(s), and this should be preserved even if the IPI is deferred.
>>> In other words, to safely defer this synchronization, any kernel
>>> instruction leading to the execution of the deferred instruction
>>> sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.
>>
>> If it is not infrequent, then are you handling the case where userland
>> spends multiple seconds before entering the kernel, and all this while
>> the blocking call waits? Perhaps in such situation you want the real IPI
>> to be sent out instead of the deferred one?
>
> Please re-read what Valentin wrote -- nobody is waiting on anything.
Makes sense. To be fair I received his email 3 minutes before yours ;-).
But thank you both for clarifying!
- Joel
@@ -2,11 +2,13 @@
#ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H
#define _ASM_X86_CONTEXT_TRACKING_WORK_H
+#include <asm/sync_core.h>
+
static __always_inline void arch_context_tracking_work(int work)
{
switch (work) {
- case CONTEXT_WORK_n:
- // Do work...
+ case CONTEXT_WORK_SYNC:
+ sync_core();
break;
}
}
@@ -43,6 +43,7 @@ extern void text_poke_early(void *addr, const void *opcode, size_t len);
*/
extern void *text_poke(void *addr, const void *opcode, size_t len);
extern void text_poke_sync(void);
+extern void text_poke_sync_deferrable(void);
extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
extern void *text_poke_copy(void *addr, const void *opcode, size_t len);
extern void *text_poke_copy_locked(void *addr, const void *opcode, size_t len, bool core_ok);
@@ -18,6 +18,7 @@
#include <linux/mmu_context.h>
#include <linux/bsearch.h>
#include <linux/sync_core.h>
+#include <linux/context_tracking.h>
#include <asm/text-patching.h>
#include <asm/alternative.h>
#include <asm/sections.h>
@@ -1933,9 +1934,24 @@ static void do_sync_core(void *info)
sync_core();
}
+static bool do_sync_core_defer_cond(int cpu, void *info)
+{
+ return !ct_set_cpu_work(cpu, CONTEXT_WORK_SYNC);
+}
+
+static void __text_poke_sync(smp_cond_func_t cond_func)
+{
+ on_each_cpu_cond(cond_func, do_sync_core, NULL, 1);
+}
+
void text_poke_sync(void)
{
- on_each_cpu(do_sync_core, NULL, 1);
+ __text_poke_sync(NULL);
+}
+
+void text_poke_sync_deferrable(void)
+{
+ __text_poke_sync(do_sync_core_defer_cond);
}
/*
@@ -2145,7 +2161,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE);
}
- text_poke_sync();
+ text_poke_sync_deferrable();
/*
* Second step: update all but the first byte of the patched range.
@@ -2207,7 +2223,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
* not necessary and we'd be safe even without it. But
* better safe than sorry (plus there's not only Intel).
*/
- text_poke_sync();
+ text_poke_sync_deferrable();
}
/*
@@ -2228,7 +2244,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
}
if (do_sync)
- text_poke_sync();
+ text_poke_sync_deferrable();
/*
* Remove and wait for refs to be zero.
@@ -735,7 +735,7 @@ void arch_arm_kprobe(struct kprobe *p)
u8 int3 = INT3_INSN_OPCODE;
text_poke(p->addr, &int3, 1);
- text_poke_sync();
+ text_poke_sync_deferrable();
perf_event_text_poke(p->addr, &p->opcode, 1, &int3, 1);
}
@@ -745,7 +745,7 @@ void arch_disarm_kprobe(struct kprobe *p)
perf_event_text_poke(p->addr, &int3, 1, &p->opcode, 1);
text_poke(p->addr, &p->opcode, 1);
- text_poke_sync();
+ text_poke_sync_deferrable();
}
void arch_remove_kprobe(struct kprobe *p)
@@ -521,11 +521,11 @@ void arch_unoptimize_kprobe(struct optimized_kprobe *op)
JMP32_INSN_SIZE - INT3_INSN_SIZE);
text_poke(addr, new, INT3_INSN_SIZE);
- text_poke_sync();
+ text_poke_sync_deferrable();
text_poke(addr + INT3_INSN_SIZE,
new + INT3_INSN_SIZE,
JMP32_INSN_SIZE - INT3_INSN_SIZE);
- text_poke_sync();
+ text_poke_sync_deferrable();
perf_event_text_poke(op->kp.addr, old, JMP32_INSN_SIZE, new, JMP32_INSN_SIZE);
}
@@ -242,7 +242,7 @@ static int write_relocate_add(Elf64_Shdr *sechdrs,
write, apply);
if (!early) {
- text_poke_sync();
+ text_poke_sync_deferrable();
mutex_unlock(&text_mutex);
}
@@ -5,12 +5,12 @@
#include <linux/bitops.h>
enum {
- CONTEXT_WORK_n_OFFSET,
+ CONTEXT_WORK_SYNC_OFFSET,
CONTEXT_WORK_MAX_OFFSET
};
enum ct_work {
- CONTEXT_WORK_n = BIT(CONTEXT_WORK_n_OFFSET),
+ CONTEXT_WORK_SYNC = BIT(CONTEXT_WORK_SYNC_OFFSET),
CONTEXT_WORK_MAX = BIT(CONTEXT_WORK_MAX_OFFSET)
};