[RFC,0/1] BPF tracing for arm64 using fprobe

Message ID: 20221108220651.24492-1-revest@chromium.org
Series: BPF tracing for arm64 using fprobe

Message

Florent Revest Nov. 8, 2022, 10:06 p.m. UTC
  Hi!

With this RFC, I'd like to revive the conversation between BPF, ARM and tracing
folks on what BPF tracing (fentry/fexit/fmod_ret) could/should look like on
arm64.

Current status of BPF tracing
=============================

On currently supported architectures (like x86), BPF tracing programs are
called from a JITted BPF trampoline, itself called from the ftrace patch site
thanks to the ftrace "direct call" API (or from the end of the ftrace
trampoline if an ftrace ops is also tracing that function, but this is
transparent to BPF).
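
For reference, attaching a BPF trampoline through the direct call API boils
down to something like this (a simplified sketch; the real logic lives in
kernel/bpf/trampoline.c and also handles modification and unregistration):

#include <linux/ftrace.h>

/* Sketch: point a function's ftrace patch site straight at a JITted BPF
 * trampoline. "ip" is the patchable entry of the traced function and
 * "tramp" the address of the JITted trampoline. */
static int attach_bpf_trampoline(unsigned long ip, unsigned long tramp)
{
	return register_ftrace_direct(ip, tramp);
}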

Thanks to Xu's work [1], we now have BPF trampolines on arm64 (these can
already be used for struct ops programs), but Xu's attempts at getting ftrace
direct call support [2][3] on arm64 have been unsuccessful so far, so we still
do not support BPF tracing programs. This prompted me to try a different
approach, which I'd like to collect feedback on here.

Why not direct calls ?
======================

Mark and Steven have not been too keen on getting direct calls on arm64 because:
- working around the BL instruction's limited range introduces complexity [4]
- it's difficult to get reliable stacktraces right with direct calls [5]
- direct calls are complex to maintain on the arch/ftrace side [5]

In the absence of ftrace direct call support, BPF tracing programs would need
to be called from an ftrace ops instead. Note that the BPF callback signature
would have to be different, so we can't reuse the existing trampolines:
direct-called callbacks receive their arguments in registers whereas ftrace
ops callbacks receive a struct ftrace_regs pointer, as the sketch below
illustrates.
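
For illustration, here is the shape of both callback types (a sketch; the
function names are made up and kfree_skb() is just an arbitrary example of a
traced function):

/* Direct call: the trampoline is entered with the traced function's own
 * arguments still live in registers, i.e. with its exact signature: */
void bpf_trampoline_for_kfree_skb(struct sk_buff *skb);

/* Ftrace ops: one generic prototype (ftrace_func_t) for every traced
 * function; the arguments have to be fetched from ftrace_regs: */
void bpf_ftrace_ops_callback(unsigned long ip, unsigned long parent_ip,
			     struct ftrace_ops *ops,
			     struct ftrace_regs *fregs);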

Why fprobe ?
============

Ftrace ops per se only expose an API to hook a function's entry. There are two
systems built on top of ftrace ops that also allow hooking the function exit:
fprobe (using rethook) and the function graph tracer. There are plans from
Masami and Steven to unify these two systems but, as they stand, only fprobe
gives enough flexibility to implement BPF tracing.

In order not to reinvent the wheel, if direct calls aren't available on the
arch, BPF could leverage fprobe to hook before and after the traced function
(see the sketch below). Note that return hooking is implemented a bit
differently than it is in BPF trampolines: instead of keeping arguments on a
stack frame and calling the traced function, rethook saves the arguments in a
memory pool and returns to the traced function with a hijacked return address,
so that its ret jumps back to the rethook trampoline.
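
For reference, hooking both ends of a function with fprobe looks roughly like
this today (a sketch against the current API, where handlers still receive
pt_regs; the prototypes would change with the ftrace_regs conversion discussed
here):

#include <linux/fprobe.h>

static void entry_handler(struct fprobe *fp, unsigned long entry_ip,
			  struct pt_regs *regs)
{
	/* runs at function entry, called from the ftrace ops */
}

static void exit_handler(struct fprobe *fp, unsigned long entry_ip,
			 struct pt_regs *regs)
{
	/* runs at function return, called from the rethook trampoline */
}

static struct fprobe my_fprobe = {
	.entry_handler	= entry_handler,
	.exit_handler	= exit_handler,
};

static int fprobe_example_init(void)
{
	/* hook the entry and exit of every function matching the filter */
	return register_fprobe(&my_fprobe, "kfree_skb", NULL);
}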

What about performance ?
========================

In its current state, a fprobe callback on arm64 is very expensive because:
1- the ftrace trampoline saves all registers (including many unnecessary ones)
2- it calls ftrace_ops_list_func which iterates over all ops and is very slow
3- the fprobe ops unconditionally hooks a rethook
4- rethook grabs memory from a freelist which is slow under high contention

However, all the above points are currently being addressed:
1- by Mark's series to save argument registers only [6]
2- by Mark's series to call single ops directly [7]
3- by Masami's patch to skip rethooks if not needed [8]
4- Masami said the rethook freelist would be replaced by a per-task stack as
   part of its unification with the function graph tracer [9]

I measured the cost of BPF tracing with the different approaches on my RPi4 here: [10]
tl;dr: the BPF "bench" takes a performance hit of:
- 28.6% w/ BPF tracing on direct calls (best case scenario for reference) [11]
- 66.8% w/ BPF on kprobe (just for reference)
- 62.6% w/ BPF tracing on fprobe without any optimizations (current state) [12]
- 34.1% w/ BPF tracing on fprobe with all optimizations (near-future state) [13]

On top of Mark's and Masami's existing and planned work, BPF tracing programs
called from a fprobe callback get much closer to the performance of direct
calls, but there's still a hit. If we wanted to optimize further, we could go
one step further and JIT the fprobe callbacks. This would let us unroll the
argument loop and use direct calls instead of indirect calls. However, this
would introduce quite some complexity and I expect the performance benefit
would be fairly minimal. (arm64 doesn't have retpolines so it doesn't suffer
too much from indirect calls anyway)

Two other performance discrepancies I can think of, and that would remain
anyway, are:
1- the ftrace trampoline saves all argument registers while BPF trampolines can
   get away with only saving the arguments that are actually used by that
   function (thanks to BTF info); consequently, we need to convert the
   ftrace_regs structure into a BPF "ctx" array which inevitably takes some
   cycles (see the sketch after this list)
2- When a fexit program is attached, going through the rethook trampoline means
   that we need to save argument registers a second time while BPF trampolines
   can just rely on arguments kept on the stack
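
As an illustration of point 1, the conversion would look roughly like this (a
sketch assuming the ftrace_regs argument accessor from Mark's series [6]; the
names and details of the actual patch may differ):

/* Build the BPF "ctx" array (one u64 slot per argument) out of the
 * ftrace_regs that the fprobe callback receives. nr_args comes from the
 * BTF description of the traced function. */
static void ftrace_regs_to_bpf_ctx(struct ftrace_regs *fregs,
				   u64 *ctx, int nr_args)
{
	int i;

	for (i = 0; i < nr_args; i++)
		ctx[i] = ftrace_regs_get_argument(fregs, i);
}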

How to apply this RFC ?
=======================

This RFC only brings up for discussion the eventual patch that would
touch the BPF subsystem, because the ftrace and fprobe optimizations it
is built on are less controversial and already on their way. However,
this patch is meant to be applied on top of Mark's and Masami's work. If
you'd like to run this patch, you can use my branch. [13]

Links
=====

1: https://lore.kernel.org/bpf/20220711144722.2100039-1-xukuohai@huawei.com/
2: https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@huawei.com/
3: https://lore.kernel.org/bpf/20220913162732.163631-1-xukuohai@huaweicloud.com/
4: https://lore.kernel.org/bpf/Yo4xb2w+FHhUtJNw@FVFF77S0Q05N/
5: https://lore.kernel.org/bpf/YzR5WSLux4mmFIXg@FVFF77S0Q05N/
6: https://lore.kernel.org/all/20221103170520.931305-1-mark.rutland@arm.com/
7: https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops
8: https://lore.kernel.org/all/166792255429.919356.14116090269057513181.stgit@devnote3/T/#m9d43fbdc91f48b03d644be77ac18017963a08df5
9: https://lore.kernel.org/bpf/20221024220008.48780b0f58903afed2dc8d4a@kernel.org/
10: https://paste.debian.net/1260011/
11: https://github.com/FlorentRevest/linux/commits/bpf-direct-calls
12: https://github.com/FlorentRevest/linux/commits/bpf-fprobe-slow
13: https://github.com/FlorentRevest/linux/commits/bpf-fprobe-rfc

Florent Revest (1):
  bpf: Invoke tracing progs using fprobe on archs without direct call

 include/linux/bpf.h     |   5 ++
 kernel/bpf/trampoline.c | 120 ++++++++++++++++++++++++++++++++++++++--
 2 files changed, 121 insertions(+), 4 deletions(-)
  

Comments

Alexei Starovoitov Nov. 17, 2022, 2:41 a.m. UTC | #1
On Tue, Nov 8, 2022 at 2:07 PM Florent Revest <revest@chromium.org> wrote:
>
> Hi!
>
> With this RFC, I'd like to revive the conversation between BPF, ARM and tracing
> folks on what BPF tracing (fentry/fexit/fmod_ret) could/should look like on
> arm64.
>
> Current status of BPF tracing
> =============================
>
> On currently supported architectures (like x86), BPF tracing programs are
> called from a JITted BPF trampoline, itself called from the ftrace patch site
> thanks to the ftrace "direct call" API. (or from the end of the ftrace
> trampoline if a ftrace ops is also tracing that function, but this is
> transparent to BPF)
>
> Thanks to Xu's work [1], we now have BPF trampolines on arm64 (these can be
> used for struct ops programs already), but Xu's attempts at getting ftrace
> direct calls support [2][3] on arm64 have been unsucessful so far so we still
> do not support BPF tracing programs. This prompted me to try a different
> approach. I'd like to collect feedback on it here.
>
> Why not direct calls ?
> ======================
>
> Mark and Steven have not been too keen on getting direct calls on arm64 because:
> - working around BL instruction's limited range introduces complexity [4]
> - it's difficult to get reliable stacktraces right with direct calls [5]
> - direct calls are complex to maintain on the arch/ftrace side [5]
>
> In the absence of ftrace direct calls support, BPF tracing programs would need
> to be called from an ftrace ops instead. Note that the BPF callback signature
> would have to be different, so we can't re-use trampolines (direct called
> callbacks receive arguments in registers whereas ftrace ops callbacks receive
> arguments in a struct ftrace_regs pointer)
>
> Why fprobe ?
> ============
>
> Ftrace ops per-se only expose an API to hook before a function. There are two
> systems built on top of ftrace ops that also allow hooking the function exit:
> fprobe (using rethook) and the function graph tracer. There are plans from
> Masami and Steven to unify these two systems but, as they stand, only fprobe
> gives enough flexibility to implement BPF tracing.
>
> In order not to reinvent the wheel, if direct calls aren't available on the
> arch, BPF could leverage fprobe to hook before and after the traced function.
> Note that return hooking is implemented a bit differently than it is in BPF
> trampolines. Instead of keeping arguments on a stack frame and calling the
> traced function, rethook saves arguments in a memory pool and returns to the
> traced function with a hijacked return pointer that will have its ret jump back
> to the rethook trampoline.
>
> What about performances ?
> =========================
>
> In its current state, a fprobe callback on arm64 is very expensive because:
> 1- the ftrace trampoline saves all registers (including many unnecessary ones)
> 2- it calls ftrace_ops_list_func which iterates over all ops and is very slow
> 3- the fprobe ops unconditionally hooks a rethook
> 4- rethook grabs memory from a freelist which is slow under high contention
>
> However, all the above points are currently being addressed:
> 1- by Mark's series to save argument registers only [6]
> 2- by Mark's series to call single ops directly [7]
> 3- by Masami's patch to skip rethooks if not needed [8]
> 4- Masami said the rethook freelist would be replaced by a per-task stack as
>    part of its unification with the function graph tracer [9]
>
> I measured the costs of BPF on different approaches on my RPi4 here: [10]
> tl;dr: the BPF "bench" takes a performance hit of:
> - 28.6% w/ BPF tracing on direct calls (best case scenario for reference) [11]
> - 66.8% w/ BPF on kprobe (just for reference)
> - 62.6% w/ BPF tracing on fprobe without any optimizations (current state) [12]
> - 34.1% w/ BPF tracing on fprobe with all optimizations (near-future state) [13]

Even with all optimization the performance overhead is not acceptable.
It feels to me that folks are still thinking about bpf trampoline
as a tracing facility.
It's a lot more than that. It needs to run 24/7 with zero overhead.
It needs to replace the kernel functions and be invoked
millions of times a second until the system is rebooted.
In this environment every nanosecond counts.

Even if the fprobe side was completely free, patch 1 has so much
overhead in copying the bpf_cookie, regs, etc. that it's a non-starter
for these use cases.

There are several other fundamental issues in this approach
because of fprobe/ftrace.
It has ftrace_test_recursion_trylock and disables preemption.
Both are deal breakers.

bpf trampoline has to allow recursion in some cases.
See __bpf_prog_enter*() flavors.

bpf trampoline also has to use migrate_disable instead of preemption
and rcu_read_lock() in some cases and rcu_read_lock_trace() in others.

bpf trampoline must never allocate memory or grab locks.
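
For reference, the entry path a BPF trampoline runs before each program
illustrates these constraints (simplified from __bpf_prog_enter() in
kernel/bpf/trampoline.c; details vary between kernel versions):

u64 notrace __bpf_prog_enter(struct bpf_prog *prog,
			     struct bpf_tramp_run_ctx *run_ctx)
{
	rcu_read_lock();
	migrate_disable();		/* not preempt_disable() */

	run_ctx->saved_run_ctx = bpf_set_run_ctx(&run_ctx->run_ctx);

	/* recursion is detected and counted, not forbidden outright */
	if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
		bpf_prog_inc_misses_counter(prog);
		return 0;
	}
	return bpf_prog_start_time();
}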

All of these mandatory features exclude fprobe, ftrace, rethook
from possible options.

Let's figure out how to address concerns with direct calls:

> - working around BL instruction's limited range introduces complexity [4]
> - it's difficult to get reliable stacktraces right with direct calls [5]
> - direct calls are complex to maintain on the arch/ftrace side [5]
  
Masami Hiramatsu (Google) Nov. 17, 2022, 1:16 p.m. UTC | #2
Hi Florent,

On Tue,  8 Nov 2022 23:06:50 +0100
Florent Revest <revest@chromium.org> wrote:

> Hi!
> 
> With this RFC, I'd like to revive the conversation between BPF, ARM and tracing
> folks on what BPF tracing (fentry/fexit/fmod_ret) could/should look like on
> arm64.
> 
> Current status of BPF tracing
> =============================
> 
> On currently supported architectures (like x86), BPF tracing programs are
> called from a JITted BPF trampoline, itself called from the ftrace patch site
> thanks to the ftrace "direct call" API. (or from the end of the ftrace
> trampoline if a ftrace ops is also tracing that function, but this is
> transparent to BPF)
> 
> Thanks to Xu's work [1], we now have BPF trampolines on arm64 (these can be
> used for struct ops programs already), but Xu's attempts at getting ftrace
> direct calls support [2][3] on arm64 have been unsucessful so far so we still
> do not support BPF tracing programs. This prompted me to try a different
> approach. I'd like to collect feedback on it here.
> 
> Why not direct calls ?
> ======================
> 
> Mark and Steven have not been too keen on getting direct calls on arm64 because:
> - working around BL instruction's limited range introduces complexity [4]
> - it's difficult to get reliable stacktraces right with direct calls [5]
> - direct calls are complex to maintain on the arch/ftrace side [5]
> 
> In the absence of ftrace direct calls support, BPF tracing programs would need
> to be called from an ftrace ops instead. Note that the BPF callback signature
> would have to be different, so we can't re-use trampolines (direct called
> callbacks receive arguments in registers whereas ftrace ops callbacks receive
> arguments in a struct ftrace_regs pointer)
> 
> Why fprobe ?
> ============
> 
> Ftrace ops per-se only expose an API to hook before a function. There are two
> systems built on top of ftrace ops that also allow hooking the function exit:
> fprobe (using rethook) and the function graph tracer. There are plans from
> Masami and Steven to unify these two systems but, as they stand, only fprobe
> gives enough flexibility to implement BPF tracing.
> 
> In order not to reinvent the wheel, if direct calls aren't available on the
> arch, BPF could leverage fprobe to hook before and after the traced function.
> Note that return hooking is implemented a bit differently than it is in BPF
> trampolines. Instead of keeping arguments on a stack frame and calling the
> traced function, rethook saves arguments in a memory pool and returns to the
> traced function with a hijacked return pointer that will have its ret jump back
> to the rethook trampoline.

Yeah, that is designed not to change the task's stack, but it makes another
list-based stack. But it will eventually be replaced by the per-task shadow
stack.

> 
> What about performances ?
> =========================
> 
> In its current state, a fprobe callback on arm64 is very expensive because:
> 1- the ftrace trampoline saves all registers (including many unnecessary ones)
> 2- it calls ftrace_ops_list_func which iterates over all ops and is very slow
> 3- the fprobe ops unconditionally hooks a rethook
> 4- rethook grabs memory from a freelist which is slow under high contention
> 
> However, all the above points are currently being addressed:
> 1- by Mark's series to save argument registers only [6]
> 2- by Mark's series to call single ops directly [7]
> 3- by Masami's patch to skip rethooks if not needed [8]
> 4- Masami said the rethook freelist would be replaced by a per-task stack as
>    part of its unification with the function graph tracer [9]
> 
> I measured the costs of BPF on different approaches on my RPi4 here: [10]
> tl;dr: the BPF "bench" takes a performance hit of:
> - 28.6% w/ BPF tracing on direct calls (best case scenario for reference) [11]
> - 66.8% w/ BPF on kprobe (just for reference)
> - 62.6% w/ BPF tracing on fprobe without any optimizations (current state) [12]
> - 34.1% w/ BPF tracing on fprobe with all optimizations (near-future state) [13]

Great! thanks for measuring that.

I've checked your tree [13] and basically it is good for me.
 - rethook: fprobe: Use ftrace_regs instead of pt_regs
   This patch may break other arches which are trying to implement rethook,
   so can you break it down into separate patches for fprobe, rethook and the
   arch-specific parts?

> 
> On top of Mark's and Masami's existing and planned work, BPF tracing programs
> called from a fprobe callback become much closer to the performances of direct
> calls but there's still a hit. If we want to try optimizing even further, we
> could potentially go even one step further by JITting the fprobe callbacks.

Would this mean JITting fprobe itself or the callbacks?

> This would let us unroll the argument loop and use direct calls instead of
> indirect calls. However this would introduce quite some complexity and I expect
> the performance impact should be fairly minimal. (arm64 doesn't have repotlines
> so it doesn't suffer too much from indirect calls anyway)
> 
> Two other performance discrepancies I can think of, that would stay anyway are:
> 1- the ftrace trampoline saves all argument registers while BPF trampolines can
>    get away with only saving the number of arguments that are actually used by
>    that function (thanks to BTF info), consequentially, we need to convert the
>    ftrace_regs structure into a BPF "ctx" array which inevitably takes some
>    cycles

Have you checked the root cause with perf? I'm not sure that makes so much of
a difference.

> 2- When a fexit program is attached, going through the rethook trampoline means
>    that we need to save argument registers a second time while BPF trampolines
>    can just rely on arguments kept on the stack

I think we can eventually make a new trampoline that just copies the required
arguments onto the shadow stack.

Thanks!

> 
> How to apply this RFC ?
> =======================
> 
> This RFC only brings up to discussion the eventual patch that would
> touch the BPF subsystem because the ftrace and fprobe optimizations it
> is built on are not as controversial and already on the way. However,
> this patch is meant to apply on top of Mark's and Masami's work. If
> you'd like to run this patch you can use my branch. [13]
> 
> Links
> =====
> 
> 1: https://lore.kernel.org/bpf/20220711144722.2100039-1-xukuohai@huawei.com/
> 2: https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@huawei.com/
> 3: https://lore.kernel.org/bpf/20220913162732.163631-1-xukuohai@huaweicloud.com/
> 4: https://lore.kernel.org/bpf/Yo4xb2w+FHhUtJNw@FVFF77S0Q05N/
> 5: https://lore.kernel.org/bpf/YzR5WSLux4mmFIXg@FVFF77S0Q05N/
> 6: https://lore.kernel.org/all/20221103170520.931305-1-mark.rutland@arm.com/
> 7: https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops
> 8: https://lore.kernel.org/all/166792255429.919356.14116090269057513181.stgit@devnote3/T/#m9d43fbdc91f48b03d644be77ac18017963a08df5
> 9: https://lore.kernel.org/bpf/20221024220008.48780b0f58903afed2dc8d4a@kernel.org/
> 10: https://paste.debian.net/1260011/
> 11: https://github.com/FlorentRevest/linux/commits/bpf-direct-calls
> 12: https://github.com/FlorentRevest/linux/commits/bpf-fprobe-slow
> 13: https://github.com/FlorentRevest/linux/commits/bpf-fprobe-rfc
> 
> Florent Revest (1):
>   bpf: Invoke tracing progs using fprobe on archs without direct call
> 
>  include/linux/bpf.h     |   5 ++
>  kernel/bpf/trampoline.c | 120 ++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 121 insertions(+), 4 deletions(-)
> 
> -- 
> 2.38.1.431.g37b22c650d-goog
>
  
Masami Hiramatsu (Google) Nov. 17, 2022, 1:33 p.m. UTC | #3
On Wed, 16 Nov 2022 18:41:26 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Tue, Nov 8, 2022 at 2:07 PM Florent Revest <revest@chromium.org> wrote:
> >
> > Hi!
> >
> > With this RFC, I'd like to revive the conversation between BPF, ARM and tracing
> > folks on what BPF tracing (fentry/fexit/fmod_ret) could/should look like on
> > arm64.
> >
> > Current status of BPF tracing
> > =============================
> >
> > On currently supported architectures (like x86), BPF tracing programs are
> > called from a JITted BPF trampoline, itself called from the ftrace patch site
> > thanks to the ftrace "direct call" API. (or from the end of the ftrace
> > trampoline if a ftrace ops is also tracing that function, but this is
> > transparent to BPF)
> >
> > Thanks to Xu's work [1], we now have BPF trampolines on arm64 (these can be
> > used for struct ops programs already), but Xu's attempts at getting ftrace
> > direct calls support [2][3] on arm64 have been unsucessful so far so we still
> > do not support BPF tracing programs. This prompted me to try a different
> > approach. I'd like to collect feedback on it here.
> >
> > Why not direct calls ?
> > ======================
> >
> > Mark and Steven have not been too keen on getting direct calls on arm64 because:
> > - working around BL instruction's limited range introduces complexity [4]
> > - it's difficult to get reliable stacktraces right with direct calls [5]
> > - direct calls are complex to maintain on the arch/ftrace side [5]
> >
> > In the absence of ftrace direct calls support, BPF tracing programs would need
> > to be called from an ftrace ops instead. Note that the BPF callback signature
> > would have to be different, so we can't re-use trampolines (direct called
> > callbacks receive arguments in registers whereas ftrace ops callbacks receive
> > arguments in a struct ftrace_regs pointer)
> >
> > Why fprobe ?
> > ============
> >
> > Ftrace ops per-se only expose an API to hook before a function. There are two
> > systems built on top of ftrace ops that also allow hooking the function exit:
> > fprobe (using rethook) and the function graph tracer. There are plans from
> > Masami and Steven to unify these two systems but, as they stand, only fprobe
> > gives enough flexibility to implement BPF tracing.
> >
> > In order not to reinvent the wheel, if direct calls aren't available on the
> > arch, BPF could leverage fprobe to hook before and after the traced function.
> > Note that return hooking is implemented a bit differently than it is in BPF
> > trampolines. Instead of keeping arguments on a stack frame and calling the
> > traced function, rethook saves arguments in a memory pool and returns to the
> > traced function with a hijacked return pointer that will have its ret jump back
> > to the rethook trampoline.
> >
> > What about performances ?
> > =========================
> >
> > In its current state, a fprobe callback on arm64 is very expensive because:
> > 1- the ftrace trampoline saves all registers (including many unnecessary ones)
> > 2- it calls ftrace_ops_list_func which iterates over all ops and is very slow
> > 3- the fprobe ops unconditionally hooks a rethook
> > 4- rethook grabs memory from a freelist which is slow under high contention
> >
> > However, all the above points are currently being addressed:
> > 1- by Mark's series to save argument registers only [6]
> > 2- by Mark's series to call single ops directly [7]
> > 3- by Masami's patch to skip rethooks if not needed [8]
> > 4- Masami said the rethook freelist would be replaced by a per-task stack as
> >    part of its unification with the function graph tracer [9]
> >
> > I measured the costs of BPF on different approaches on my RPi4 here: [10]
> > tl;dr: the BPF "bench" takes a performance hit of:
> > - 28.6% w/ BPF tracing on direct calls (best case scenario for reference) [11]
> > - 66.8% w/ BPF on kprobe (just for reference)
> > - 62.6% w/ BPF tracing on fprobe without any optimizations (current state) [12]
> > - 34.1% w/ BPF tracing on fprobe with all optimizations (near-future state) [13]
> 
> Even with all optimization the performance overhead is not acceptable.
> It feels to me that folks are still thinking about bpf trampoline
> as a tracing facility.
> It's a lot more than that. It needs to run 24/7 with zero overhead.
> It needs to replace the kernel functions and be invoked
> millions times a second until the system is rebooted.
> In this environment every nanosecond counts.
> 
> Even if the fprobe side was completely free the patch 1 has so much
> overhead in copy of bpf_cookie, regs, etc that it's a non-starter
> for these use cases.
> 
> There are several other fundamental issues in this approach
> because of fprobe/ftrace.
> It has ftrace_test_recursion_trylock and disables preemption.
> Both are deal breakers.

I talked with Florent about this offline.
ftrace_test_recursion_trylock() is required for generic ftrace
use because a user callback can call a function which is itself
traced by ftrace, which can cause an infinite loop.
However, if the user can ensure that check by themselves, I can add a
flag to avoid that trylock. (Of course, you can shoot yourself in the foot.)

I thought the preemption disabling was for accessing per-cpu data,
but it is needed for rethook to get an object from an RCU
protected list.
Thus, when we move to the per-task shadow stack, it can be
removed too.

> 
> bpf trampoline has to allow recursion in some cases.
> See __bpf_prog_enter*() flavors.
> 
> bpf trampoline also has to use migrate_disable instead of preemption
> and rcu_read_lock() in some cases and rcu_read_lock_trace() in others.

Is rcu_read_lock() better than preempt_disable()? 

> 
> bpf trampoline must never allocate memory or grab locks.

Note that ftrace_test_recursion_trylock() is just a per-task bit operation;
it does not take a lock (nor is it atomic).
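
Conceptually, it boils down to something like this (a simplified sketch of the
logic in include/linux/trace_recursion.h; the real code also handles context
transitions):

/* sketch of the core of the trylock */
static __always_inline int recursion_trylock_sketch(void)
{
	unsigned long val = READ_ONCE(current->trace_recursion);
	int bit = trace_get_context_bit();	/* normal/softirq/irq/NMI */

	if (val & (1UL << bit))
		return -1;	/* already tracing in this context: bail out */

	current->trace_recursion = val | (1UL << bit);
	return bit;		/* handed back to the unlock side */
}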

Thank you,

> 
> All of these mandatory features exclude fprobe, ftrace, rethook
> from possible options.
> 
> Let's figure out how to address concerns with direct calls:
> 
> > - working around BL instruction's limited range introduces complexity [4]
> > - it's difficult to get reliable stacktraces right with direct calls [5]
> > - direct calls are complex to maintain on the arch/ftrace side [5]
  
Alexei Starovoitov Nov. 17, 2022, 4:50 p.m. UTC | #4
On Thu, Nov 17, 2022 at 5:34 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Wed, 16 Nov 2022 18:41:26 -0800
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> > On Tue, Nov 8, 2022 at 2:07 PM Florent Revest <revest@chromium.org> wrote:
> > >
> > > Hi!
> > >
> > > With this RFC, I'd like to revive the conversation between BPF, ARM and tracing
> > > folks on what BPF tracing (fentry/fexit/fmod_ret) could/should look like on
> > > arm64.
> > >
> > > Current status of BPF tracing
> > > =============================
> > >
> > > On currently supported architectures (like x86), BPF tracing programs are
> > > called from a JITted BPF trampoline, itself called from the ftrace patch site
> > > thanks to the ftrace "direct call" API. (or from the end of the ftrace
> > > trampoline if a ftrace ops is also tracing that function, but this is
> > > transparent to BPF)
> > >
> > > Thanks to Xu's work [1], we now have BPF trampolines on arm64 (these can be
> > > used for struct ops programs already), but Xu's attempts at getting ftrace
> > > direct calls support [2][3] on arm64 have been unsucessful so far so we still
> > > do not support BPF tracing programs. This prompted me to try a different
> > > approach. I'd like to collect feedback on it here.
> > >
> > > Why not direct calls ?
> > > ======================
> > >
> > > Mark and Steven have not been too keen on getting direct calls on arm64 because:
> > > - working around BL instruction's limited range introduces complexity [4]
> > > - it's difficult to get reliable stacktraces right with direct calls [5]
> > > - direct calls are complex to maintain on the arch/ftrace side [5]
> > >
> > > In the absence of ftrace direct calls support, BPF tracing programs would need
> > > to be called from an ftrace ops instead. Note that the BPF callback signature
> > > would have to be different, so we can't re-use trampolines (direct called
> > > callbacks receive arguments in registers whereas ftrace ops callbacks receive
> > > arguments in a struct ftrace_regs pointer)
> > >
> > > Why fprobe ?
> > > ============
> > >
> > > Ftrace ops per-se only expose an API to hook before a function. There are two
> > > systems built on top of ftrace ops that also allow hooking the function exit:
> > > fprobe (using rethook) and the function graph tracer. There are plans from
> > > Masami and Steven to unify these two systems but, as they stand, only fprobe
> > > gives enough flexibility to implement BPF tracing.
> > >
> > > In order not to reinvent the wheel, if direct calls aren't available on the
> > > arch, BPF could leverage fprobe to hook before and after the traced function.
> > > Note that return hooking is implemented a bit differently than it is in BPF
> > > trampolines. Instead of keeping arguments on a stack frame and calling the
> > > traced function, rethook saves arguments in a memory pool and returns to the
> > > traced function with a hijacked return pointer that will have its ret jump back
> > > to the rethook trampoline.
> > >
> > > What about performances ?
> > > =========================
> > >
> > > In its current state, a fprobe callback on arm64 is very expensive because:
> > > 1- the ftrace trampoline saves all registers (including many unnecessary ones)
> > > 2- it calls ftrace_ops_list_func which iterates over all ops and is very slow
> > > 3- the fprobe ops unconditionally hooks a rethook
> > > 4- rethook grabs memory from a freelist which is slow under high contention
> > >
> > > However, all the above points are currently being addressed:
> > > 1- by Mark's series to save argument registers only [6]
> > > 2- by Mark's series to call single ops directly [7]
> > > 3- by Masami's patch to skip rethooks if not needed [8]
> > > 4- Masami said the rethook freelist would be replaced by a per-task stack as
> > >    part of its unification with the function graph tracer [9]
> > >
> > > I measured the costs of BPF on different approaches on my RPi4 here: [10]
> > > tl;dr: the BPF "bench" takes a performance hit of:
> > > - 28.6% w/ BPF tracing on direct calls (best case scenario for reference) [11]
> > > - 66.8% w/ BPF on kprobe (just for reference)
> > > - 62.6% w/ BPF tracing on fprobe without any optimizations (current state) [12]
> > > - 34.1% w/ BPF tracing on fprobe with all optimizations (near-future state) [13]
> >
> > Even with all optimization the performance overhead is not acceptable.
> > It feels to me that folks are still thinking about bpf trampoline
> > as a tracing facility.
> > It's a lot more than that. It needs to run 24/7 with zero overhead.
> > It needs to replace the kernel functions and be invoked
> > millions times a second until the system is rebooted.
> > In this environment every nanosecond counts.
> >
> > Even if the fprobe side was completely free the patch 1 has so much
> > overhead in copy of bpf_cookie, regs, etc that it's a non-starter
> > for these use cases.
> >
> > There are several other fundamental issues in this approach
> > because of fprobe/ftrace.
> > It has ftrace_test_recursion_trylock and disables preemption.
> > Both are deal breakers.
>
> I talked with Florent about this offline.
> ftrace_test_recursion_trylock() is required for generic ftrace
> use because user callback can call a function which can be
> traced by ftrace. This means it can cause an infinite loop.
> However, if user can ensure to check it by itself, I can add a
> flag to avoid that trylock. (Of course, you can shoot your foot.)
>
> I thought the preemption disabling was for accessing per-cpu,
> but it is needed for rethook to get an object from an RCU
> protected list.
> Thus when we move on the per-task shadow stack, it can be
> removed too.

There might not be a task available where the bpf trampoline is running.
rcu protection might not be there either.
Really, we're just scratching the surface of all the issues that make
fprobe unusable.
  
Steven Rostedt Nov. 17, 2022, 5:16 p.m. UTC | #5
On Wed, 16 Nov 2022 18:41:26 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> Even with all optimization the performance overhead is not acceptable.
> It feels to me that folks are still thinking about bpf trampoline
> as a tracing facility.
> It's a lot more than that. It needs to run 24/7 with zero overhead.

It obviously doesn't have zero overhead.

And correctness and maintainability trump micro-optimizations.

> It needs to replace the kernel functions and be invoked

What do you mean by "replace the kernel functions"? You mean an existing
kernel function can be replaced by a bpf program? Like live patching?

This seems rather dangerous, and how does one know that their system has
integrity? Is there a feature to sign bpf programs before they can be added?

Also, it may be time to bring in the lawyers. If a bpf program can replace
an existing kernel function, then it has definitely passed the "user space"
exception to the GPL, where user space must use the system call interface.
By injecting executable code into the kernel, especially something that
replaces kernel functionality, it becomes arguably derived from the kernel
itself. And the BPF program must be GPL.

Allowing proprietary BPF programs to replace kernel functionality looks
like a clear violation and circumvention of the GPL. But I could be
mistaken. As I said, it's time to bring in the lawyers on this one.

> millions times a second until the system is rebooted.
> In this environment every nanosecond counts.
> 
> Even if the fprobe side was completely free the patch 1 has so much
> overhead in copy of bpf_cookie, regs, etc that it's a non-starter
> for these use cases.
> 
> There are several other fundamental issues in this approach
> because of fprobe/ftrace.
> It has ftrace_test_recursion_trylock and disables preemption.
> Both are deal breakers.

Please explain why? The recursion protection lock is simply a bit operation
on the task struct which is used to protect against recursion in the same
context. Without it, recursion will likely happen, and the only hint of
it is that the system triple faults and reboots. If you are only hooking
one function, then it is easy to figure this out. But with the multi work
being done, that is no longer the case.

Hooking functions is *extremely* intrusive. And protection against
errors is a must-have, not an option.

> 
> bpf trampoline has to allow recursion in some cases.
> See __bpf_prog_enter*() flavors.

The recursion lock allows recursion, but not within the same context. That is,
an interrupt against normal context is fine. But really, you should not have
it within the same context. How do you verify that you do not run out of
stack?

> 
> bpf trampoline also has to use migrate_disable instead of preemption
> and rcu_read_lock() in some cases and rcu_read_lock_trace() in others.
> 
> bpf trampoline must never allocate memory or grab locks.

Neither should fprobes / ftrace.

-- Steve

> 
> All of these mandatory features exclude fprobe, ftrace, rethook
> from possible options.
> 
> Let's figure out how to address concerns with direct calls:
  
Chris Mason Nov. 17, 2022, 9:55 p.m. UTC | #6
On 11/17/22 12:16 PM, Steven Rostedt wrote:
> On Wed, 16 Nov 2022 18:41:26 -0800
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
>> Even with all optimization the performance overhead is not acceptable.
>> It feels to me that folks are still thinking about bpf trampoline
>> as a tracing facility.
>> It's a lot more than that. It needs to run 24/7 with zero overhead.
> 
> It obviously doesn't have zero overhead.
> 
> And correctness and maintainability trumps micro-optimizations.

During the bpf office hours today Mark Rutland and Florent had some
great ideas about how to wire things up.  I'm sure Mark will need some
time to write it all down but it was a fun call.

> 
>> It needs to replace the kernel functions and be invoked
> 
> What do you mean by "replace the kernel functions"? You mean an existing
> kernel function can be replaced by a bpf program? Like live patching?
> 
> This seems rather dangerous, and how does one know that their system has
> integrity? Is there a feature to sign bpf programs before they can be added?
> 
> Also, it may be time to bring in the lawyers. If a bpf program can replace
> an existing kernel function, then it has definitely passed the "user space"
> exception to the GPL, where user space must use the system call interface.
> By injecting executable code into the kernel, especially something that
> replaces kernel functionality, it becomes arguably derived from the kernel
> itself. And the BPF program must be GPL.
> 
> Allowing proprietary BPF programs to replace kernel functionality looks
> like a clear violation and circumvention of the GPL. But I could be
> mistaken. As I said, it's time to bring in the lawyers on this one.

https://docs.kernel.org/bpf/bpf_licensing.html answers most of your
questions.  It was reviewed by lawyers and also discussed pretty
extensively on the lists.

The short answer to your concerns is that you can't replace kernel
functions from proprietary BPF programs.  The LSM and TCP congestion
control features intentionally have GPL only support functions in the
way.  bpf_probe_read_kernel() is also GPL only and massively limits the
things that can be done from proprietary code.

This list of helpers is pretty current and details which ones are GPL only:

https://github.com/iovisor/bcc/blob/master/docs/kernel-versions.md#helpers

I know there's a long and glorious history of collaboration around these
parts of bpf and ftrace.  I really hope this time around we all come
away feeling like the technical discussion made both projects better.
Mark and Florent today certainly made me think that was the direction we
were headed.

Along these lines, I'm also hoping to avoid diving into old debates and
alarmist conclusions about GPL compliance and signed bpf programs. Or,
if some part of those old debates is no longer valid, can we split
it off into a well researched separate thread and focus on technical 
bits here?

-chris
  
Steven Rostedt Nov. 17, 2022, 10:40 p.m. UTC | #7
On Thu, 17 Nov 2022 16:55:12 -0500
Chris Mason <clm@meta.com> wrote:

> On 11/17/22 12:16 PM, Steven Rostedt wrote:
> > On Wed, 16 Nov 2022 18:41:26 -0800
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >   
> >> Even with all optimization the performance overhead is not acceptable.
> >> It feels to me that folks are still thinking about bpf trampoline
> >> as a tracing facility.
> >> It's a lot more than that. It needs to run 24/7 with zero overhead.  
> > 
> > It obviously doesn't have zero overhead.
> > 
> > And correctness and maintainability trumps micro-optimizations.  
> 
> During the bpf office hours today Mark Rutland and Florent had some
> great ideas about how to wire things up.  I'm sure Mark will need some
> time to write it all down but it was a fun call.

That's good to hear.

> 
> >   
> >> It needs to replace the kernel functions and be invoked  
> > 
> > What do you mean by "replace the kernel functions"? You mean an existing
> > kernel function can be replaced by a bpf program? Like live patching?
> > 
> > This seems rather dangerous, and how does one know that their system has
> > integrity? Is there a feature to sign bpf programs before they can be added?
> > 
> > Also, it may be time to bring in the lawyers. If a bpf program can replace
> > an existing kernel function, then it has definitely passed the "user space"
> > exception to the GPL, where user space must use the system call interface.
> > By injecting executable code into the kernel, especially something that
> > replaces kernel functionality, it becomes arguably derived from the kernel
> > itself. And the BPF program must be GPL.
> > 
> > Allowing proprietary BPF programs to replace kernel functionality looks
> > like a clear violation and circumvention of the GPL. But I could be
> > mistaken. As I said, it's time to bring in the lawyers on this one.  
> 
> https://docs.kernel.org/bpf/bpf_licensing.html answers most of your
> questions.  It was reviewed by lawyers and also discussed pretty
> extensively on the lists.
> 



> The short answer to your concerns is that you can't replace kernel
> functions from proprietary BPF programs.  The LSM and TCP congestion
> control features intentionally have GPL only support functions in the
> way.  bpf_probe_read_kernel() is also GPL only and massively limits the
> things that can be done from proprietary code.

^^^^^^^^^^^^^^^^^

That's the part I wanted to hear. But just the fact of replacing a kernel
function with BPF code seems a bit concerning.

> 
> This list of helpers is pretty current and details which ones are GPL only:
> 
> https://github.com/iovisor/bcc/blob/master/docs/kernel-versions.md#helpers
> 
> I know there's a long and glorious history of collaboration around these
> parts of bpf and ftrace.  I really hope this time around we all come
> away feeling like the technical discussion made both projects better.
> Mark and Florent today certainly made me think that was the direction we
> were headed.
> 
> Along these lines, I'm also hoping to avoid diving into old debates and
> alarmist conclusions about GPL compliance and signed bpf programs. Or,

Not alarmist, but concerned, as being able to modify what a kernel function
does is not something I take lightly.

-- Steve

> if some part of those old debates is no longer valid, can we split
> it off into a well researched separate thread and focus on technical 
> bits here?
  
Mark Rutland Nov. 18, 2022, 4:18 p.m. UTC | #8
On Thu, Nov 17, 2022 at 04:55:12PM -0500, Chris Mason wrote:
> On 11/17/22 12:16 PM, Steven Rostedt wrote:
> > On Wed, 16 Nov 2022 18:41:26 -0800
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > 
> > > Even with all optimization the performance overhead is not acceptable.
> > > It feels to me that folks are still thinking about bpf trampoline
> > > as a tracing facility.
> > > It's a lot more than that. It needs to run 24/7 with zero overhead.
> > 
> > It obviously doesn't have zero overhead.
> > 
> > And correctness and maintainability trumps micro-optimizations.
> 
> During the bpf office hours today Mark Rutland and Florent had some
> great ideas about how to wire things up.  I'm sure Mark will need some
> time to write it all down but it was a fun call.

I'd hoped to write that up today, but I haven't had enough time yet, so I'll
try to write up that proposal next week.

The rough idea was to *somehow* rejig the per-callsite ftrace_ops code I've
been working on to permit (but not require) the use of custom trampolines. As
mentioned during the call I want to ensure that this doesn't adversely affect
regular ftrace usage, and I'd also like to ensure that the regular ftrace code
is able to gain from those changes (without the need for trampolines). AFAICT,
that's likely to require some rework to the way direct functions are managed.

The WIP code for per-callsite ftrace_ops is at:

 https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops

To be clear, my comments were purely about the *mechanism* we end up
implementing. I do have concerns w.r.t. overriding arbitrary parts of the
kernel.

Thanks,
Mark.
  
Mark Rutland Nov. 18, 2022, 4:26 p.m. UTC | #9
Hi Alexei,

On Thu, Nov 17, 2022 at 08:50:12AM -0800, Alexei Starovoitov wrote:
> There might not be a task available where bpf trampoline is running.

I'm not sure what you mean by "there might not be a task available"; do you
mean that there might not be space in the per-task shadow stack, or that the
BPF program can be invoked in an IRQ context?

> rcu protection might not be there either.

We've spent a lot of time reworking entry/idle sequences with noinstr, so any
time BPF can be invoked, we should have a regular kernel environment available,
with RCU watching (but not necessarily in an RCU read-side critical section).
If BPF is being invoked without RCU watching, that is a bug that needs to be
fixed.

Do you have a specific example in mind?

Thanks,
Mark.
  
Mark Rutland Nov. 18, 2022, 4:34 p.m. UTC | #10
On Thu, Nov 17, 2022 at 05:40:30PM -0500, Steven Rostedt wrote:
> On Thu, 17 Nov 2022 16:55:12 -0500
> Chris Mason <clm@meta.com> wrote:
> 
> > On 11/17/22 12:16 PM, Steven Rostedt wrote:

> > The short answer to your concerns is that you can't replace kernel
> > functions from proprietary BPF programs.  The LSM and TCP congestion
> > control features intentionally have GPL only support functions in the
> > way.  bpf_probe_read_kernel() is also GPL only and massively limits the
> > things that can be done from proprietary code.
> 
> ^^^^^^^^^^^^^^^^^
> 
> That's the part I wanted to hear. But just the fact of replacing a kernel
> function with BPF code seems a bit concerning.

> > This list of helpers is pretty current and details which ones are GPL only:
> > 
> > https://github.com/iovisor/bcc/blob/master/docs/kernel-versions.md#helpers
> > 
> > I know there's a long and glorious history of collaboration around these
> > parts of bpf and ftrace.  I really hope this time around we all come
> > away feeling like the technical discussion made both projects better.
> > Mark and Florent today certainly made me think that was the direction we
> > were headed.
> > 
> > Along these lines, I'm also hoping to avoid diving into old debates and
> > alarmist conclusions about GPL compliance and signed bpf programs. Or,
> 
> Not alarmist, but concern as being able to modify what a kernel function can
> do is not something I take lightly.

FWIW, given that the aim here seems to be to expose all kernel internals to be
overridden arbitrarily, I'm also concerned that there's a huge surface area for
issues with maintainability, robustness/correctness, and security.

I really don't want to be stuck in a position where someone argues that all
kernel internal functions are ABI and need to stay around as-is to be hooked by
eBPF, and I hope that we all agree that there are no guarantees on that front.

Thanks,
Mark.
  
Steven Rostedt Nov. 18, 2022, 4:45 p.m. UTC | #11
On Fri, 18 Nov 2022 16:34:50 +0000
Mark Rutland <mark.rutland@arm.com> wrote:

> > Not alarmist, but concern as being able to modify what a kernel function can
> > do is not something I take lightly.  
> 
> FWIW, given that the aim here seems to be to expose all kernel internals to be
> overridden arbitrarily, I'm also concerned that there's a huge surface area for
> issues with maintainability, robustness/correctness, and security.
> 
> I really don't want to be stuck in a position where someone argues that all
> kernel internal functions are ABI and need to stay around as-is to be hooked by
> eBPF, and I hope that we all agree that there are no guarantees on that front.

My biggest concern is BPF changing the functionality of arbitrary functions.
I would much rather limit which functions BPF could change with some
annotation:

int __bpf_modify foo()
{
	...
}


That way, if something's not working, you can see directly in the code that
the function could be modified by a BPF program, instead of getting some
random bug report because a function returned an unexpected result that the
code of that function could never produce.

-- Steve
  
Chris Mason Nov. 18, 2022, 5:44 p.m. UTC | #12
On 11/18/22 11:45 AM, Steven Rostedt wrote:
> On Fri, 18 Nov 2022 16:34:50 +0000
> Mark Rutland <mark.rutland@arm.com> wrote:
> 
>>> Not alarmist, but concern as being able to modify what a kernel function can
>>> do is not something I take lightly.
>>
>> FWIW, given that the aim here seems to be to expose all kernel internals to be
>> overridden arbitrarily, I'm also concerned that there's a huge surface area for
>> issues with maintainability, robustness/correctness, and security.
>>
>> I really don't want to be stuck in a position where someone argues that all
>> kernel internal functions are ABI and need to stay around as-is to be hooked by
>> eBPF, and I hope that we all agree that there are no guarantees on that front.
> 

For ABI concerns, I don't think we're doing anything fundamentally
different from x86 here.  So unless our ARM users are substantially more
exciting than the x86 crowd, it should all fall under the discussion
from maintainer's summit:

https://lwn.net/Articles/908464/

> My biggest concern is changing functionality of arbitrary functions by BPF.
> I would much rather limit what functions BPF could change with some
> annotation.
> 
> int __bpf_modify foo()
> {
> 	...
> }
> 
> 
> That way if somethings not working, you can see directly in the code that
> the function could be modified by a BPF program, instead of getting some
> random bug report because a function returned an unexpected result that the
> code of that function could never produce.
>

The good news is that BPF generally confines the function replacement
through struct ops interfaces.  There are also explicit allow lists to
limit the functions where you can do return value overrides, etc., so I think
it's fair to say these concerns are already baked in.  I'm sure they can
be improved over the long term, but I don't think that's related to this
set of functionality on ARM.

-chris
  
Steven Rostedt Nov. 18, 2022, 6:06 p.m. UTC | #13
On Fri, 18 Nov 2022 12:44:00 -0500
Chris Mason <clm@meta.com> wrote:

> > My biggest concern is changing functionality of arbitrary functions by BPF.
> > I would much rather limit what functions BPF could change with some
> > annotation.
> > 
> > int __bpf_modify foo()
> > {
> > 	...
> > }
> > 
> > 
> > That way if somethings not working, you can see directly in the code that
> > the function could be modified by a BPF program, instead of getting some
> > random bug report because a function returned an unexpected result that the
> > code of that function could never produce.
> >  
> 
> The good news is that BPF generally confines the function replacement
> through struct ops interfaces.

What struct ops interfaces?

>  There are also explicit allow lists to
> limit functions where you can do return value overrides etc, so I think

Where are these lists?

> it's fair to say these concerns are already baked in.  I'm sure they can

How do I know that a function return was modified by BPF? If I'm debugging
something, is it obvious to the developer who is debugging an issue
(perhaps unaware of what BPF programs are loaded on the user's machine)
that the return of a function was tweaked by BPF and that could be the
source of the bug?

> be improved over the long term, but I don't think that's related to this
> set of functionality on ARM.

I disagree. These issues may have been added to x86, but perhaps we should
take a deeper look at them again before extending them to other
architectures.

-- Steve
  
Chris Mason Nov. 18, 2022, 6:52 p.m. UTC | #14
On 11/18/22 1:06 PM, Steven Rostedt wrote:
> On Fri, 18 Nov 2022 12:44:00 -0500
> Chris Mason <clm@meta.com> wrote:
> 
>>> My biggest concern is changing functionality of arbitrary functions by BPF.
>>> I would much rather limit what functions BPF could change with some
>>> annotation.
>>>
>>> int __bpf_modify foo()
>>> {
>>> 	...
>>> }
>>>
>>>
>>> That way if somethings not working, you can see directly in the code that
>>> the function could be modified by a BPF program, instead of getting some
>>> random bug report because a function returned an unexpected result that the
>>> code of that function could never produce.
>>>   
>>
>> The good news is that BPF generally confines the function replacement
>> through struct ops interfaces.
> 
> What struct ops interfaces?

https://lwn.net/Articles/811631/

> 
>>   There are also explicit allow lists to
>> limit functions where you can do return value overrides etc, so I think
> 
> Where are these lists.

Some of the original features:

https://lwn.net/Articles/811631/

It has changed and expanded since then, but hopefully you get the idea.

> 
>> it's fair to say these concerns are already baked in.  I'm sure they can
> 
> How do I know that a function return was modified by BPF? If I'm debugging
> something, is it obvious to the developer that is debugging an issue
> (perhaps unaware of what BPF programs are loaded on the users machine),
> that the return of a function was tweaked by BPF and that could be the
> source of the bug?
> 
>> be improved over the long term, but I don't think that's related to this
>> set of functionality on ARM.
> 
> I disagree. These issues may have been added to x86, but perhaps we should
> take a deeper look at them again before extending them to other
> architectures.

Honestly, I think a large scale architecture review of every BPF feature
and decision over the last 10 years is just the wrong bar for this patch
series.

 From my understanding, Mark and Florent have some changes planned
that'll improve ftrace, livepatching, and bpf.  Let's talk about those,
and tackle any long term improvements you'd like to make to BPF in other
patch series.

-chris
  
Peter Zijlstra Nov. 21, 2022, 10:09 a.m. UTC | #15
On Fri, Nov 18, 2022 at 01:06:08PM -0500, Steven Rostedt wrote:
> How do I know that a function return was modified by BPF? If I'm debugging
> something, is it obvious to the developer that is debugging an issue
> (perhaps unaware of what BPF programs are loaded on the users machine),
> that the return of a function was tweaked by BPF and that could be the
> source of the bug?

Have it taint the kernel if something is overridden ;-) Then we can all
ignore the report until it comes back without taint.
  
KP Singh Nov. 21, 2022, 1:47 p.m. UTC | #16
On Fri, Nov 18, 2022 at 7:52 PM Chris Mason <clm@meta.com> wrote:
>
> On 11/18/22 1:06 PM, Steven Rostedt wrote:
> > On Fri, 18 Nov 2022 12:44:00 -0500
> > Chris Mason <clm@meta.com> wrote:
> >

(adding this back here)

> >>>> On Fri, 18 Nov 2022 16:34:50 +0000
> >>>> Mark Rutland <mark.rutland@arm.com> wrote:
> >>>> FWIW, given that the aim here seems to be to expose all kernel internals to be
> >>>> overridden arbitrarily, I'm also concerned that there's a huge surface area for
> >>>> issues with maintainability, robustness/correctness, and security.
> >>>>

This is not all kernel internals, just the functions allowed for error
injection.

> >>>> I really don't want to be stuck in a position where someone argues that all
> >>>> kernel internal functions are ABI and need to stay around as-is to be hooked by
> >>>> eBPF, and I hope that we all agree that there are no guarantees on that front.
> >>>>

Yes, BPF provides no guarantee that kernel functions will remain
stable (similar to tracepoints and kprobes).

> >>>> Thanks,
> >>>> Mark.
> >>>>
> >>> My biggest concern is changing functionality of arbitrary functions by BPF.
> >>> I would much rather limit what functions BPF could change with some
> >>> annotation.
> >>>
> >>> int __bpf_modify foo()
> >>> {
> >>>     ...
> >>> }

This annotation already exists, i.e. ALLOW_ERROR_INJECTION

Users, with CONFIG_FUNCTION_ERROR_INJECTION, can already modify return
values of kernel functions using kprobes and the failure injection
framework [1] for functions annotated with ALLOW_ERROR_INJECTION.

BPF just provides another way to do the same thing with "modify
return" programs. This also respects the error injection list [2]:
users can *only* attach these programs to the functions annotated
with ALLOW_ERROR_INJECTION.
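
For example, roughly (a sketch; should_failslab() is one of the functions
already annotated in the tree, and the BPF side uses the usual libbpf
SEC()/BPF_PROG() conventions from <bpf/bpf_tracing.h>):

/* kernel side: opt the function in for error injection / fmod_ret */
ALLOW_ERROR_INJECTION(should_failslab, ERRNO);

/* BPF side: a "modify return" program, attachable *only* to functions
 * carrying the annotation above */
SEC("fmod_ret/should_failslab")
int BPF_PROG(fail_slab_alloc, struct kmem_cache *s, gfp_t gfpflags)
{
	return -ENOMEM;		/* make the allocation fail */
}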

[1] https://www.kernel.org/doc/Documentation/fault-injection/fault-injection.txt
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/bpf/verifier.c?id=f4c4ca70dedc1bce8e7b1648e652aa9be1d3fcd7#n14948

> >>>
> >>>
> >>> That way if somethings not working, you can see directly in the code that
> >>> the function could be modified by a BPF program, instead of getting some
> >>> random bug report because a function returned an unexpected result that the
> >>> code of that function could never produce.
> >>>
> >>
> >> The good news is that BPF generally confines the function replacement
> >> through struct ops interfaces.
> >
> > What struct ops interfaces?
>
> https://lwn.net/Articles/811631/
>
> >
> >>   There are also explicit allow lists to
> >> limit functions where you can do return value overrides etc, so I think
> >
> > Where are these lists.
>
> Some of the original features:
>
> https://lwn.net/Articles/811631/

I think you meant: https://lwn.net/Articles/740146/ ?

>
> It has changed and expanded since then, but hopefully you get the idea.
>
> >
> >> it's fair to say these concerns are already baked in.  I'm sure they can
> >
> > How do I know that a function return was modified by BPF? If I'm debugging

You can list the BPF programs that are loaded in the kernel with

# bpftool prog list

Also, the BPF programs show up in call stacks when you are debugging.

> > something, is it obvious to the developer that is debugging an issue
> > (perhaps unaware of what BPF programs are loaded on the users machine),
> > that the return of a function was tweaked by BPF and that could be the
> > source of the bug?
> >
> >> be improved over the long term, but I don't think that's related to this
> >> set of functionality on ARM.

There are workloads and applications (e.g. https://kubearmor.io/) that
already use BPF tracing and LSM programs and are currently blocked on
their ARM server deployments.

This may be obvious, but I want to reiterate that while the attachment
points are not UAPI and users have to tolerate kernel function
changes, they do expect the core loading and attachment mechanisms to
exist (i.e. the ability to use LSM and tracing programs).

> >
> > I disagree. These issues may have been added to x86, but perhaps we should
> > take a deeper look at them again before extending them to other
> > architectures.
>
> Honestly, I think a large scale architecture review of every BPF feature
> and decision over the last 10 years is just the wrong bar for this patch
> series.

+1

>
>  From my understanding, Mark and Florent have some changes planned
> that'll improve ftrace, livepatching, and bpf.  Lets talk about those,
> and tackle any long term improvements you'd like to make to BPF in other
> patch series.

+1

 - KP

>
> -chris
>
  
Peter Zijlstra Nov. 21, 2022, 2:16 p.m. UTC | #17
On Mon, Nov 21, 2022 at 02:47:10PM +0100, KP Singh wrote:

> > > How do I know that a function return was modified by BPF? If I'm debugging
> 
> You can list the BPF programs that are loaded in the kernel with
> 
> # bpftool prog list

Only when you have access to the machine; in most cases it's people sending
random splats by email.

> Also, the BPF programs show up in call stacks when you are debugging.

Only when it splats inside the BPF part, not when it splats after
because BPF changed semantics of a function.
  
KP Singh Nov. 21, 2022, 2:23 p.m. UTC | #18
On Mon, Nov 21, 2022 at 3:17 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Nov 21, 2022 at 02:47:10PM +0100, KP Singh wrote:
>
> > > > How do I know that a function return was modified by BPF? If I'm debugging
> >
> > You can list the BPF programs that are loaded in the kernel with
> >
> > # bpftool prog list
>
> Only when you have access to the machine; in most cases it's people sending
> random splats by email.

Good point. What about having information about loaded BPF programs in the
kernel stack traces and sharing their bytecode somehow, like in crash dumps?

>
> > Also, the BPF programs show up in call stacks when you are debugging.
>
> Only when it splats inside the BPF part, not when it splats after
> because BPF changed semantics of a function.
>
>
  
Masami Hiramatsu (Google) Nov. 21, 2022, 2:40 p.m. UTC | #19
On Mon, 21 Nov 2022 11:09:21 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Nov 18, 2022 at 01:06:08PM -0500, Steven Rostedt wrote:
> > How do I know that a function return was modified by BPF? If I'm debugging
> > something, is it obvious to the developer that is debugging an issue
> > (perhaps unaware of what BPF programs are loaded on the users machine),
> > that the return of a function was tweaked by BPF and that could be the
> > source of the bug?
> 
> Have it taint the kernel if something is overridden ;-) Then we can all
> ignore the report until it comes back without taint.

Hmm, indeed. BTW, error injection should set that too.

Thanks,
  
Steven Rostedt Nov. 21, 2022, 3:15 p.m. UTC | #20
On Mon, 21 Nov 2022 14:47:10 +0100
KP Singh <kpsingh@kernel.org> wrote:

> This annotation already exists, i.e. ALLOW_ERROR_INJECTION
> 
> Users, with CONFIG_FUNCTION_ERROR_INJECTION, can already modify return
> values of kernel functions using kprobes and the failure injection
> framework [1] for functions annotated with ALLOW_ERROR_INJECTION.
> 
> BPF just provides another way to do the same thing with "modify
> return" programs and this also respects the error injection list [2]
> and users can *only* attach these programs to the functions annotated
> with ALLOW_ERROR_INJECTION.

WAIT!

Looking at the Kconfigs, I see

CONFIG_FUNCTION_ERROR_INJECTION is set when
CONFIG_HAVE_FUNCTION_ERROR_INJECTION is set, and when CONFIG_KPROBES is set.

And ALLOW_ERROR_INJECTION() is set when CONFIG_FUNCTION_ERROR_INJECTION is.

There's no way to turn it off on x86 except by disabling kprobes!

WTF!

I don't want a kernel that can add error injection just because kprobes is
enabled. There are two kinds of kprobes: one that is for visibility only (for
tracing) and one that can be used for functional changes. I want the
visibility without the ability to change the kernel. The visibility portion
is very useful for security, whereas the modifying one can be used to
circumvent security.

As kprobes are set in most production environments, so is error injection.
Do we really want error injection enabled in production environments?
I don't.

I think we need this patch ASAP!

-- Steve

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index c3c0b077ade3..9ee72d8860c3 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1874,8 +1874,14 @@ config NETDEV_NOTIFIER_ERROR_INJECT
 	  If unsure, say N.
 
 config FUNCTION_ERROR_INJECTION
-	def_bool y
+	bool "Fault-injections of functions"
 	depends on HAVE_FUNCTION_ERROR_INJECTION && KPROBES
+	help
+	  Add fault injections into various functions that are annotated with
+	  ALLOW_ERROR_INJECTION() in the kernel. BPF may also modify the return
+	  value of these functions. This is useful to test error paths of code.
+
+	  If unsure, say N
 
 config FAULT_INJECTION
 	bool "Fault-injection framework"
  
KP Singh Nov. 21, 2022, 3:29 p.m. UTC | #21
On Mon, Nov 21, 2022 at 4:15 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 21 Nov 2022 14:47:10 +0100
> KP Singh <kpsingh@kernel.org> wrote:
>
> > This annotation already exists, i.e. ALLOW_ERROR_INJECTION
> >
> > Users, with CONFIG_FUNCTION_ERROR_INJECTION, can already modify return
> > values of kernel functions using kprobes and the failure injection
> > framework [1] for functions annotated with ALLOW_ERROR_INJECTION.
> >
> > BPF just provides another way to do the same thing with "modify
> > return" programs and this also respects the error injection list [2]
> > and users can *only* attach these programs to the functions annotated
> > with ALLOW_ERROR_INJECTION.
>
> WAIT!
>
> Looking at the Kconfigs, I see
>
> CONFIG_FUNCTION_ERROR_INJECTION is set when
> CONFIG_HAVE_FUNCTION_ERROR_INJECTION is set, and when CONFIG_KPROBES is set.
>
> And ALLOW_ERROR_INJECTION() is set when CONFIG_FUNCTION_ERROR_INJECTION is.
>
> There's no way to turn it off on x86 except by disabling kprobes!
>
> WTF!
>
> I don't want a kernel that can add error injection just because kprobes is
> enabled. There are two kinds of kprobes: one that is for visibility only (for
> tracing) and one that can be used for functional changes. I want the
> visibility without the ability to change the kernel. The visibility portion
> is very useful for security, whereas the modifying one can be used to
> circumvent security.

I am not sure how they can circumvent security since this needs root /
root equivalent permissions. Fault injection is actually a very useful
debugging tool.

>
> As kprobes are set in most production environments, so is error injection.
> Do we really want error injection enabled in production environments?
> I don't.
>
> I think we need this patch ASAP!
>
> -- Steve
>
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index c3c0b077ade3..9ee72d8860c3 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1874,8 +1874,14 @@ config NETDEV_NOTIFIER_ERROR_INJECT
>           If unsure, say N.
>
>  config FUNCTION_ERROR_INJECTION
> -       def_bool y
> +       bool "Fault-injections of functions"
>         depends on HAVE_FUNCTION_ERROR_INJECTION && KPROBES
> +       help
> +         Add fault injections into various functions that are annotated with
> +         ALLOW_ERROR_INJECTION() in the kernel. BPF may also modify the return
> +         value of these functions. This is useful to test error paths of code.
> +
> +         If unsure, say N
>
>  config FAULT_INJECTION
>         bool "Fault-injection framework"
  
Steven Rostedt Nov. 21, 2022, 3:39 p.m. UTC | #22
On Mon, 21 Nov 2022 16:29:54 +0100
KP Singh <kpsingh@kernel.org> wrote:

> I am not sure how they can circumvent security since this needs root /
> root equivalent permissions. Fault injection is actually a very useful
> debugging tool.

On ChromeOS, we even consider root untrusted and lock down pretty
much all privileged activities (like loading modules and such).

As you said, it's a good debugging tool. Not something to allow in
production environments. Or at the very least, allow admins to disable it.

-- Steve
  
Alexei Starovoitov Nov. 21, 2022, 3:40 p.m. UTC | #23
On Mon, Nov 21, 2022 at 7:15 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 21 Nov 2022 14:47:10 +0100
> KP Singh <kpsingh@kernel.org> wrote:
>
> > This annotation already exists, i.e. ALLOW_ERROR_INJECTION
> >
> > Users, with CONFIG_FUNCTION_ERROR_INJECTION, can already modify return
> > values of kernel functions using kprobes and the failure injection
> > framework [1] for functions annotated with ALLOW_ERROR_INJECTION.
> >
> > BPF just provides another way to do the same thing with "modify
> > return" programs and this also respects the error injection list [2]
> > and users can *only* attach these programs to the functions annotated
> > with ALLOW_ERROR_INJECTION.
>
> WAIT!
>
> Looking at the Kconfigs, I see
>
> CONFIG_FUNCTION_ERROR_INJECTION is set when
> CONFIG_HAVE_FUNCTION_ERROR_INJECTION is set, and when CONFIG_KPROBES is set.
>
> And ALLOW_ERROR_INJECTION() is set when CONFIG_FUNCTION_ERROR_INJECTION is.
>
> There's no way to turn it off on x86 except by disabling kprobes!
>
> WTF!
>
> I don't want a kernel that can add error injection just because kprobes is
> enabled. There are two kinds of kprobes: one that is for visibility only (for
> tracing) and one that can be used for functional changes. I want the
> visibility without the ability to change the kernel. The visibility portion
> is very useful for security, whereas the modifying one can be used to
> circumvent security.
>
> As kprobes are set in most production environments, so is error injection.
> Do we really want error injection enabled in production environments?

We absolutely want it enabled in production.

> I don't.

Speak for yourself, because your employer thinks otherwise.
  
Steven Rostedt Nov. 21, 2022, 3:45 p.m. UTC | #24
On Mon, 21 Nov 2022 07:40:11 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> >
> > As kprobes are set in most production environments, so is error injection.
> > Do we really want error injection enabled in production environments?
> 
> We absolutely want it enabled in production.
> 
> > I don't.  
> 
> Speak for yourself, because your employer thinks otherwise.

And speak for yourself! Just because you want it in production does not
mean that everyone else is forced to do so too.

It's a DEBUG feature. It's in Kconfig.debug. Why is it enabled by default???

-- Steve
  
Borislav Petkov Nov. 21, 2022, 3:55 p.m. UTC | #25
On Mon, Nov 21, 2022 at 10:45:48AM -0500, Steven Rostedt wrote:
> And speak for yourself! Just because you want it in production, does not
> mean that everyone else is forced to do so too.

Yes, I don't want it enabled in my enterprise kernels either because I
don't want bug reports for stuff which people should clearly not do in
production.

It is as simple as that.
  
Jiri Kosina Nov. 21, 2022, 4:16 p.m. UTC | #26
On Mon, 21 Nov 2022, KP Singh wrote:

> > Looking at the Kconfigs, I see
> >
> > CONFIG_FUNCTION_ERROR_INJECTION is set when
> > CONFIG_HAVE_FUNCTION_ERROR_INJECTION is set, and when CONFIG_KPROBES is set.
> >
> > And ALLOW_ERROR_INJECTION() is set when CONFIG_FUNCTION_ERROR_INJECTION is.
> >
> > There's no way to turn it off on x86 except by disabling kprobes!
> >
> > WTF!
> >
> > I don't want a kernel that can add error injection just because kprobes is
> > enabled. There are two kinds of kprobes: one that is for visibility only (for
> > tracing) and one that can be used for functional changes. I want the
> > visibility without the ability to change the kernel. The visibility portion
> > is very useful for security, whereas the modifying one can be used to
> > circumvent security.
> 
> I am not sure how they can circumvent security since this needs root /
> root equivalent permissions. Fault injection is actually a very useful
> debugging tool.

There are environments where root is untrusted (e.g. secure boot), and 
there is a whole mechanism in the kernel for dealing with that (all the 
CONFIG_LOCKDOWN_LSM handling). It seems like error injection should be wired 
up into lockdown handling, at a minimum.
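
A minimal sketch of what that wiring could look like (security_locked_down()
is the existing hook; the LOCKDOWN_ERROR_INJECTION reason does not exist
today and is shown only as an assumption of what would need to be added):

#include <linux/security.h>

/* Hypothetical gate: refuse return value overrides under lockdown. */
static int error_injection_allowed(void)
{
	/* Returns 0 when allowed, -EPERM when the kernel is locked down. */
	return security_locked_down(LOCKDOWN_ERROR_INJECTION);
}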