[bpf-next,v7,0/3] Support storing struct task_struct objects as kptrs

Message ID 20221117032402.2356776-1-void@manifault.com
Series: Support storing struct task_struct objects as kptrs

Message

David Vernet Nov. 17, 2022, 3:23 a.m. UTC
  Now that BPF supports adding new kernel functions with kfuncs, and
storing kernel objects in maps with kptrs, we can add a set of kfuncs
which allow struct task_struct objects to be stored in maps as
referenced kptrs.

The possible use cases for doing this are plentiful. During tracing,
for example, it would be useful to be able to collect some tasks that
performed a certain operation, and then periodically summarize who they
are, which cgroup they're in, how much CPU time they've utilized, etc.
Doing this today requires storing the tasks' pids along with some
relevant data to be exported to user space, and then later re-associating
those pids with tasks in the other event handlers where the data is
recorded. Another useful by-product is that it allows a BPF program to
pin a task, and by proxy therefore also e.g. pin its task local
storage.

In order to support this, we'll need to expand KF_TRUSTED_ARGS to
support receiving trusted, non-refcounted pointers. It currently only
supports either PTR_TO_CTX pointers, or refcounted pointers. In terms
of the implementation, this means that btf_check_func_arg_match() has
to add another condition to its logic for checking whether a ptr needs
a refcount: also requiring that the pointer has at least one type
modifier, such as the new modifier we're adding called PTR_TRUSTED
(described below). Note that PTR_UNTRUSTED is insufficient for this
purpose, as it does not cover all of the possible pointers we need to
watch out for. For example, a pointer obtained from walking a struct is
considered "trusted" (or at least, not PTR_UNTRUSTED). To account for
this, and to enable us to expand KF_TRUSTED_ARGS to include
allow-listed arguments such as those passed by the kernel to
tracepoints and struct_ops callbacks, this patch set also introduces a
new PTR_TRUSTED type flag modifier, which records that a pointer was
passed from the kernel in a trusted context.
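
To make the distinction concrete, here is a small illustrative sketch
(not a verbatim selftest from this series) of what the verifier will
and will not accept once KF_TRUSTED_ARGS is expanded:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct task_struct *bpf_task_acquire(struct task_struct *p) __ksym;
void bpf_task_release(struct task_struct *p) __ksym;

SEC("tp_btf/task_newtask")
int BPF_PROG(trusted_arg_example, struct task_struct *task, u64 clone_flags)
{
	struct task_struct *acquired;

	/* "task" was passed by the kernel to this tracepoint, so its
	 * register type is PTR_TO_BTF_ID | PTR_TRUSTED, and it may be
	 * passed to KF_TRUSTED_ARGS kfuncs such as bpf_task_acquire().
	 */
	acquired = bpf_task_acquire(task);
	bpf_task_release(acquired);

	/* By contrast, a pointer obtained by walking a struct, e.g.
	 * task->group_leader, is not PTR_TRUSTED, so the verifier will
	 * reject passing it to bpf_task_acquire().
	 */
	return 0;
}

char _license[] SEC("license") = "GPL";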

In summary, this patch set:

1. Adds the new PTR_TRUSTED register type modifier flag, and updates the
   verifier and existing selftests accordingly.
2. Expands KF_TRUSTED_ARGS to also include trusted pointers that were
   not obtained from walking structs.
3. Adds a new set of kfuncs that allow struct task_struct* objects to be
   used as kptrs (signatures sketched below).
4. Adds a new selftest suite to validate these new task kfuncs.
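
For reference, the three kfuncs added in patch 2 have roughly the
following signatures (sketched here for convenience; the patch itself
has the authoritative prototypes and doc comments):

/* Acquire a reference on a trusted task, e.g. a tracepoint or
 * struct_ops callback argument. As of v7, never returns NULL.
 */
struct task_struct *bpf_task_acquire(struct task_struct *p);

/* Acquire a reference on a task kptr already stored in a map. May
 * return NULL if the kptr has been removed or the task is dying.
 */
struct task_struct *bpf_task_kptr_get(struct task_struct **pp);

/* Release a reference acquired by one of the kfuncs above. */
void bpf_task_release(struct task_struct *p);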

--
Changelog:
v6 -> v7:
- Removed the PTR_WALKED type modifier, and instead define a new
  PTR_TRUSTED type modifier which is set on registers containing
  pointers passed from trusted contexts (i.e. as tracepoint or
  struct_ops callback args) (Alexei)
- Remove the new KF_OWNED_ARGS kfunc flag. The same behavior can
  instead be achieved by defining a new type that wraps an existing
  type, as is done with struct nf_conn___init (Alexei)
- Add a test_task_current_acquire_release testcase which verifies we can
  acquire a task struct returned from bpf_get_current_task_btf().
- Make bpf_task_acquire() no longer return NULL, as it can no longer be
  called with a NULL task.
- Removed unnecessary is_test_kfunc_task() checks from failure
  testcases.

v5 -> v6:
- Add a new KF_OWNED_ARGS kfunc flag which may be used by kfuncs to
  express that they require trusted, refcounted args (Kumar)
- Rename PTR_NESTED -> PTR_WALKED in the verifier (Kumar)
- Convert reg_type_str() prefixes to use snprintf() instead of strncpy()
  (Kumar)
- Add PTR_TO_BTF_ID | PTR_WALKED to the struct bpf_reg_types instances
  where it was missing -- specifically btf_id_sock_common_types and
  percpu_btf_ptr_types.
- Add a missing PTR_TO_BTF_ID | PTR_WALKED switch case entry in
  check_func_arg_reg_off(), which is required when validating helper
  calls (Kumar)
- Update reg_type_mismatch_ok() to check base types for the registers
  (i.e. to accommodate type modifiers). Additionally, add a lengthy
  comment that explains why this is being done (Kumar)
- Update convert_ctx_accesses() to also issue probe reads for
  PTR_TO_BTF_ID | PTR_WALKED (Kumar)
- Update selftests to expect new prefix reg type strings.
- Rename task_kfunc_acquire_trusted_nested testcase to
  task_kfunc_acquire_trusted_walked, and fix a comment (Kumar)
- Remove KF_TRUSTED_ARGS from bpf_task_release(), which already includes
  KF_RELEASE (Kumar)
- Add bpf-next in patch subject lines (Kumar)

v4 -> v5:
- Fix an improperly formatted patch title.

v3 -> v4:
- Remove an unnecessary check from my repository that I forgot to remove
  after debugging something.

v2 -> v3:
- Make bpf_task_acquire() check for NULL, and include KF_RET_NULL
  (Martin)
- Include a new PTR_NESTED register type modifier flag which specifies
  whether a pointer was obtained from walking a struct. Use this to
  expand the meaning of KF_TRUSTED_ARGS to include trusted pointers that
  were passed from the kernel (Kumar)
- Add more selftests to the task_kfunc selftest suite which verify that
  you cannot pass a walked pointer to bpf_task_acquire().
- Update bpf_task_acquire() to also specify KF_TRUSTED_ARGS.

v1 -> v2:
- Rename tracing_btf_ids to generic_kfunc_btf_ids, and add the new
  kfuncs to that list instead of making a separate btf id list (Alexei).
- Don't run the new selftest suite on s390x, which doesn't appear to
  support invoking kfuncs.
- Add a missing __diag_ignore block for -Wmissing-prototypes
  (lkp@intel.com).
- Fix formatting on some of the SPDX-License-Identifier tags.
- Clarified the function header comment a bit on bpf_task_kptr_get().

David Vernet (3):
  bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs
  bpf: Add kfuncs for storing struct task_struct * as a kptr
  bpf/selftests: Add selftests for new task kfuncs

 Documentation/bpf/kfuncs.rst                  |  28 +-
 include/linux/bpf.h                           |  25 ++
 include/linux/btf.h                           |  66 ++--
 kernel/bpf/btf.c                              |  44 ++-
 kernel/bpf/helpers.c                          |  83 ++++-
 kernel/bpf/verifier.c                         |  45 ++-
 kernel/trace/bpf_trace.c                      |   2 +-
 net/ipv4/bpf_tcp_ca.c                         |   4 +-
 tools/testing/selftests/bpf/DENYLIST.s390x    |   1 +
 .../selftests/bpf/prog_tests/task_kfunc.c     | 160 +++++++++
 .../selftests/bpf/progs/task_kfunc_common.h   |  81 +++++
 .../selftests/bpf/progs/task_kfunc_failure.c  | 304 ++++++++++++++++++
 .../selftests/bpf/progs/task_kfunc_success.c  | 127 ++++++++
 tools/testing/selftests/bpf/verifier/calls.c  |   4 +-
 .../selftests/bpf/verifier/ref_tracking.c     |   4 +-
 15 files changed, 906 insertions(+), 72 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/task_kfunc.c
 create mode 100644 tools/testing/selftests/bpf/progs/task_kfunc_common.h
 create mode 100644 tools/testing/selftests/bpf/progs/task_kfunc_failure.c
 create mode 100644 tools/testing/selftests/bpf/progs/task_kfunc_success.c
  

Comments

John Fastabend Nov. 17, 2022, 9:03 p.m. UTC | #1
David Vernet wrote:
> Now that BPF supports adding new kernel functions with kfuncs, and
> storing kernel objects in maps with kptrs, we can add a set of kfuncs
> which allow struct task_struct objects to be stored in maps as
> referenced kptrs.
> 
> The possible use cases for doing this are plentiful.  During tracing,
> for example, it would be useful to be able to collect some tasks that
> performed a certain operation, and then periodically summarize who they
> are, which cgroup they're in, how much CPU time they've utilized, etc.
> Doing this now would require storing the tasks' pids along with some
> relevant data to be exported to user space, and later associating the
> pids to tasks in other event handlers where the data is recorded.
> Another useful by-product of this is that it allows a program to pin a
> task in a BPF program, and by proxy therefore also e.g. pin its task
> local storage.

Sorry, this wasn't obvious to me (I'm late to the party, so apologies
if it was described in some other revision). Can we say something about
the life cycle of these acquired task_structs? Because they increment
the refcount on the task struct, they have the potential to impact the
system. I know we've had a few bugs in our own task struct tracking
that led to leaked references. In our case we didn't pin the kernel
object, so the leak was just BPF memory and user space memory -- still
sort of bad, because we would hit memory limits and get OOM'd. Leaking
kernel task structs is worse though.

Quick question: if you put an acquired task struct in a map, what
happens when the user side deletes the entry? Presumably this causes
the release to happen and the task_struct is good to go. Did I miss the
logic? I was expecting something in bpf_map_free_kptrs and a type
callback to release() the refcnt?

> [...]
  
David Vernet Nov. 17, 2022, 9:54 p.m. UTC | #2
On Thu, Nov 17, 2022 at 01:03:45PM -0800, John Fastabend wrote:
> David Vernet wrote:
> > [...]
> 
> Sorry wasn't obvious to me (late to the party so if it was in some
> other v* described apologies). Can we say something about the life
> cycle of this acquired task_structs because they are incrementing
> the ref cnt on the task struct they have potential to impact system.

We should probably add an entire docs page which describes how kptrs
work, and I am happy to do that (ideally in a follow-on patch set if
that's OK with you). In general I think it would be useful to include
docs for any general-purpose kfuncs like the ones proposed in this set.

In regards to your specific question about the task lifecycle, nothing
being proposed in this patch set differs from how kptr lifecycles work
in general. The idea is that the BPF program:

1. Gets a "kptr_ref" kptr from an "acquire" kfunc.
2. Stores it in a map with bpf_kptr_xchg().

The program can then either later manually extract it from the map
(again with bpf_kptr_xchg()) and release it, or if the reference is
never removed from the map, let it be automatically released when the
map is destroyed. See [0] and [1] for a bit more information.

[0]: https://docs.kernel.org/bpf/kfuncs.html?highlight=kptr#kf-acquire-flag
[1]: https://lwn.net/Articles/900749/
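
To make that concrete, here's a minimal sketch of the pattern (the map
layout is modeled on the selftests in this series; the program and map
names are illustrative):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Referenced kptr field annotation, as used in the selftests. */
#ifndef __kptr_ref
#define __kptr_ref __attribute__((btf_type_tag("kptr_ref")))
#endif

struct task_map_value {
	struct task_struct __kptr_ref *task;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16);
	__type(key, int);
	__type(value, struct task_map_value);
} task_map SEC(".maps");

struct task_struct *bpf_task_acquire(struct task_struct *p) __ksym;
void bpf_task_release(struct task_struct *p) __ksym;

SEC("tp_btf/task_newtask")
int BPF_PROG(store_task, struct task_struct *task, u64 clone_flags)
{
	struct task_struct *acquired, *old;
	struct task_map_value *v;
	int key = 0;

	v = bpf_map_lookup_elem(&task_map, &key);
	if (!v)
		return 0;

	/* 1. Get a referenced kptr from the "acquire" kfunc. */
	acquired = bpf_task_acquire(task);

	/* 2. Store it in the map. bpf_kptr_xchg() hands back whatever
	 * kptr was previously stored there; we now own that reference
	 * and must release it.
	 */
	old = bpf_kptr_xchg(&v->task, acquired);
	if (old)
		bpf_task_release(old);

	return 0;
}

char _license[] SEC("license") = "GPL";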

> I know at least we've had a few bugs in our task struct tracking
> that has led to various bugs where we leak references. In our case
> we didn't pin the kernel object so the leak is just BPF memory and
> user space memory, still sort of bad because we would hit memory
> limits and get OOM'd. Leaking kernel task structs is worse though.

I don't think we need to worry about leaks. The verifier should fail to
load any program that doesn't properly release a kptr, and if it fails
to identify programs that improperly hold refcounts, that's a bug that
needs to be fixed. Similarly, if any map implementation (described
below) fails to properly free references at the appropriate time (when
an element or the map itself is destroyed), those are just bugs that
need to be fixed.

I think the relevant tradeoff here is really the possible side effects
of keeping a task pinned and preventing it from being reaped. I agree that's an
important consideration, but I think that would arguably apply to any
kptr (modulo the size of the object being pinned, which is certainly
relevant as well). We already have kptrs for e.g. struct nf_conn [2].
Granted, struct task_struct is significantly larger, but bpf_kptr_xchg()
is only enabled for privileged programs, so it seems like a reasonable
operation to allow.

[2]: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/net/netfilter/nf_conntrack_bpf.c#n253

> quick question. If you put acquired task struct in a map what
> happens if user side deletes the entry? Presumably this causes the
> release to happen and the task_struct is good to go. Did I miss
> the logic? I was thinking you would have something in bpf_map_free_kptrs
> and a type callback to release() the refcnt?

Someone else can chime in here to correct me if I'm wrong, but AFAIU
this is handled by the map implementations calling out to
bpf_obj_free_fields() to invoke the kptr destructor when the element is
destroyed. See [3] and [4] for examples of where they're called from the
arraymap and hashmap logic respectively. This is how the destructors are
similarly invoked when the maps are destroyed.

[3]: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/kernel/bpf/arraymap.c#n431
[4]: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/kernel/bpf/hashtab.c#n764
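
To model that logic, here's a simplified standalone sketch (the types
below are stand-ins for the kernel's btf_record/btf_field, and the dtor
stands in for e.g. put_task_struct_rcu_user(); don't read the exact
field names as authoritative):

#include <stddef.h>

typedef void (*btf_dtor_kfunc_t)(void *);

enum btf_field_type { BPF_KPTR_UNREF, BPF_KPTR_REF };

struct btf_field {
	size_t offset;			/* offset of the kptr in the map value */
	enum btf_field_type type;
	btf_dtor_kfunc_t dtor;		/* destructor for referenced kptrs */
};

struct btf_record {
	int cnt;
	struct btf_field fields[4];
};

/* For each referenced kptr recorded for this map value, take exclusive
 * ownership of the pointer (the kernel uses an atomic xchg here, so
 * only one path ever frees a given reference) and invoke its dtor.
 */
static void obj_free_kptr_fields(const struct btf_record *rec, char *obj)
{
	for (int i = 0; i < rec->cnt; i++) {
		const struct btf_field *field = &rec->fields[i];
		void **slot, *p;

		if (field->type != BPF_KPTR_REF)
			continue;
		slot = (void **)(obj + field->offset);
		p = __atomic_exchange_n(slot, NULL, __ATOMIC_SEQ_CST);
		if (p)
			field->dtor(p);
	}
}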

[...]
  
John Fastabend Nov. 17, 2022, 10:36 p.m. UTC | #3
David Vernet wrote:
> On Thu, Nov 17, 2022 at 01:03:45PM -0800, John Fastabend wrote:
> > David Vernet wrote:
> > > [...]
> > 
> > Sorry wasn't obvious to me (late to the party so if it was in some
> > other v* described apologies). Can we say something about the life
> > cycle of this acquired task_structs because they are incrementing
> > the ref cnt on the task struct they have potential to impact system.
> 
> We should probably add an entire docs page which describes how kptrs
> work, and I am happy to do that (ideally in a follow-on patch set if
> that's OK with you). In general I think it would be useful to include
> docs for any general-purpose kfuncs like the ones proposed in this set.

Sure, I wouldn't require that for your series though fwiw.

> 
> In regards to your specific question about the task lifecycle, nothing
> being proposed in this patch set differs from how kptr lifecycles work
> in general. The idea is that the BPF program:
> 
> 1. Gets a "kptr_ref" kptr from an "acquire" kfunc.
> 2. Stores it in a map with bpf_kptr_xchg().
> 
> The program can then either later manually extract it from the map
> (again with bpf_kptr_xchg()) and release it, or if the reference is
> never removed from the map, let it be automatically released when the
> map is destroyed. See [0] and [1] for a bit more information.

Yep as long as the ref is decremented on map destroy and elem delete
all good.

> 
> [0]: https://docs.kernel.org/bpf/kfuncs.html?highlight=kptr#kf-acquire-flag
> [1]: https://lwn.net/Articles/900749/
> 
> > I know at least we've had a few bugs in our task struct tracking
> > that has led to various bugs where we leak references. In our case
> > we didn't pin the kernel object so the leak is just BPF memory and
> > user space memory, still sort of bad because we would hit memory
> > limits and get OOM'd. Leaking kernel task structs is worse though.
> 
> I don't think we need to worry about leaks. The verifier should fail to
> load any program that doesn't properly release a kptr, and if it fails
> to identify programs that improperly hold refcounts, that's a bug that
> needs to be fixed. Similarly, if any map implementation (described
> below) fails to properly free references at the appropriate time (when
> an element or the map itself is destroyed), those are just bugs that
> need to be fixed.
> 
> I think the relevant tradeoff here is really the possible side effects
> of keeping a task pinned and avoiding it being reaped. I agree that's an
> important consideration, but I think that would arguably apply to any
> kptr (modulo the size of the object being pinned, which is certainly
> relevant as well). We already have kptrs for e.g. struct nf_conn [2].
> Granted, struct task_struct is significantly larger, but bpf_kptr_xchg()
> is only enabled for privileged programs, so it seems like a reasonable
> operation to allow.

No, not arguing it shouldn't be possible; I just didn't see the release
hook.

> 
> [2]: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/net/netfilter/nf_conntrack_bpf.c#n253
> 
> > quick question. If you put acquired task struct in a map what
> > happens if user side deletes the entry? Presumably this causes the
> > release to happen and the task_struct is good to go. Did I miss
> > the logic? I was thinking you would have something in bpf_map_free_kptrs
> > and a type callback to release() the refcnt?
> 
> Someone else can chime in here to correct me if I'm wrong, but AFAIU
> this is handled by the map implementations calling out to
> bpf_obj_free_fields() to invoke the kptr destructor when the element is
> destroyed. See [3] and [4] for examples of where they're called from the
> arraymap and hashmap logic respectively. This is how the destructors are
> similarly invoked when the maps are destroyed.

Yep, I found that the dtor() gets populated in btf.c; apparently I
needed to re-pull my local tree because I missed it. Thanks for the
detailed response.

And the last thing I was checking: because KF_SLEEPABLE is not set,
these should be blocked from running on sleepable progs, which would
otherwise break the call_rcu in the destructor. Maybe a small nit, not
sure it's worth it, but it might be nice to annotate the helper
description with a note, "will not work on sleepable progs" or
something to that effect.

Thanks.

> 
> [3]: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/kernel/bpf/arraymap.c#n431
> [4]: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/kernel/bpf/hashtab.c#n764
> 
> [...]
  
David Vernet Nov. 18, 2022, 1:41 a.m. UTC | #4
On Thu, Nov 17, 2022 at 02:36:50PM -0800, John Fastabend wrote:
> David Vernet wrote:
> > On Thu, Nov 17, 2022 at 01:03:45PM -0800, John Fastabend wrote:
> > > David Vernet wrote:
> > > > [...]
> > > 
> > > Sorry wasn't obvious to me (late to the party so if it was in some
> > > other v* described apologies). Can we say something about the life
> > > cycle of this acquired task_structs because they are incrementing
> > > the ref cnt on the task struct they have potential to impact system.
> > 
> > We should probably add an entire docs page which describes how kptrs
> > work, and I am happy to do that (ideally in a follow-on patch set if
> > that's OK with you). In general I think it would be useful to include
> > docs for any general-purpose kfuncs like the ones proposed in this set.
> 
> Sure, I wouldn't require that for your series though fwiw.

Sounds good to me

[...]

> > > quick question. If you put acquired task struct in a map what
> > > happens if user side deletes the entry? Presumably this causes the
> > > release to happen and the task_struct is good to go. Did I miss
> > > the logic? I was thinking you would have something in bpf_map_free_kptrs
> > > and a type callback to release() the refcnt?
> > 
> > Someone else can chime in here to correct me if I'm wrong, but AFAIU
> > this is handled by the map implementations calling out to
> > bpf_obj_free_fields() to invoke the kptr destructor when the element is
> > destroyed. See [3] and [4] for examples of where they're called from the
> > arraymap and hashmap logic respectively. This is how the destructors are
> > similarly invoked when the maps are destroyed.
> 
> Yep I found the dtor() gets populated in btf.c and apparently needed
> to repull my local tree because I missed it. Thanks for the detailed
> response.
> 
> And last thing I was checking is because KF_SLEEPABLE is not set
> this should be blocked from running on sleepable progs which would
> break the call_rcu in the destructor. Maybe small nit, not sure
> its worth it but might be nice to annotate the helper description
> with a note, "will not work on sleepable progs" or something to
> that effect.

KF_SLEEPABLE is used to indicate whether the kfunc _itself_ may sleep,
not whether the calling program can be sleepable. call_rcu() doesn't
block, so no need to mark the kfunc as KF_SLEEPABLE. The key is that if
a kfunc is sleepable, non-sleepable programs are not able to call it
(and this is enforced in the verifier).
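
For reference, these flags are declared when the kfuncs are registered.
A rough sketch modeled on this series (the set name follows the cover
letter; the exact flags may differ from the final patches):

BTF_SET8_START(generic_kfunc_btf_ids)
BTF_ID_FLAGS(func, bpf_task_acquire, KF_ACQUIRE | KF_TRUSTED_ARGS)
BTF_ID_FLAGS(func, bpf_task_kptr_get, KF_ACQUIRE | KF_KPTR_GET | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_task_release, KF_RELEASE)
/* A kfunc that could itself sleep would additionally be marked
 * KF_SLEEPABLE, which the verifier uses to reject calls to it from
 * non-sleepable programs, e.g.:
 *
 * BTF_ID_FLAGS(func, bpf_some_sleepable_kfunc, KF_SLEEPABLE)
 */
BTF_SET8_END(generic_kfunc_btf_ids)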
  
John Fastabend Nov. 18, 2022, 6:04 a.m. UTC | #5
David Vernet wrote:
> On Thu, Nov 17, 2022 at 02:36:50PM -0800, John Fastabend wrote:
> > David Vernet wrote:
> > > On Thu, Nov 17, 2022 at 01:03:45PM -0800, John Fastabend wrote:
> > > > David Vernet wrote:
> > > > > [...]
> > > > 
> > > > Sorry wasn't obvious to me (late to the party so if it was in some
> > > > other v* described apologies). Can we say something about the life
> > > > cycle of this acquired task_structs because they are incrementing
> > > > the ref cnt on the task struct they have potential to impact system.
> > > 
> > > We should probably add an entire docs page which describes how kptrs
> > > work, and I am happy to do that (ideally in a follow-on patch set if
> > > that's OK with you). In general I think it would be useful to include
> > > docs for any general-purpose kfuncs like the ones proposed in this set.
> > 
> > Sure, I wouldn't require that for your series though fwiw.
> 
> Sounds good to me
> 
> [...]
> 
> > > > quick question. If you put acquired task struct in a map what
> > > > happens if user side deletes the entry? Presumably this causes the
> > > > release to happen and the task_struct is good to go. Did I miss
> > > > the logic? I was thinking you would have something in bpf_map_free_kptrs
> > > > and a type callback to release() the refcnt?
> > > 
> > > Someone else can chime in here to correct me if I'm wrong, but AFAIU
> > > this is handled by the map implementations calling out to
> > > bpf_obj_free_fields() to invoke the kptr destructor when the element is
> > > destroyed. See [3] and [4] for examples of where they're called from the
> > > arraymap and hashmap logic respectively. This is how the destructors are
> > > similarly invoked when the maps are destroyed.
> > 
> > Yep I found the dtor() gets populated in btf.c and apparently needed
> > to repull my local tree because I missed it. Thanks for the detailed
> > response.
> > 
> > And last thing I was checking is because KF_SLEEPABLE is not set
> > this should be blocked from running on sleepable progs which would
> > break the call_rcu in the destructor. Maybe small nit, not sure
> > its worth it but might be nice to annotate the helper description
> > with a note, "will not work on sleepable progs" or something to
> > that effect.
> 
> KF_SLEEPABLE is used to indicate whether the kfunc _itself_ may sleep,
> not whether the calling program can be sleepable. call_rcu() doesn't
> block, so no need to mark the kfunc as KF_SLEEPABLE. The key is that if
> a kfunc is sleepable, non-sleepable programs are not able to call it
> (and this is enforced in the verifier).

OK, but should these helpers be allowed in sleepable progs? I think
not. What stops this (using your helpers):

  cpu0                                       cpu1
  ----
  v = insert_lookup_task(task)
  kptr = bpf_kptr_xchg(&v->task, NULL);
  if (!kptr)
    return 0;
                                            map_delete_elem()
                                               put_task()
                                                 rcu_call
  do_something_might_sleep()
                                                    put_task_struct
                                                      ... free  
  kptr->[free'd memory]
 
The insert_lookup_task will bump the refcnt via the acquire on map
insert, but the lookup doesn't do anything to the refcnt, and the
map_delete_elem will delete it. We have a check for spin_lock types to
stop them from being used in sleepable progs. Did I miss a similar
check for these?

Thanks again
  
David Vernet Nov. 18, 2022, 3:08 p.m. UTC | #6
On Thu, Nov 17, 2022 at 10:04:27PM -0800, John Fastabend wrote:

[...]

> > > And last thing I was checking is because KF_SLEEPABLE is not set
> > > this should be blocked from running on sleepable progs which would
> > > break the call_rcu in the destructor. Maybe small nit, not sure
> > > its worth it but might be nice to annotate the helper description
> > > with a note, "will not work on sleepable progs" or something to
> > > that effect.
> > 
> > KF_SLEEPABLE is used to indicate whether the kfunc _itself_ may sleep,
> > not whether the calling program can be sleepable. call_rcu() doesn't
> > block, so no need to mark the kfunc as KF_SLEEPABLE. The key is that if
> > a kfunc is sleepable, non-sleepable programs are not able to call it
> > (and this is enforced in the verifier).
> 
> OK but should these helpers be allowed in sleepable progs? I think
> not. What stops this, (using your helpers):
> 
>   cpu0                                       cpu1
>   ----
>   v = insert_lookup_task(task)
>   kptr = bpf_kptr_xchg(&v->task, NULL);
>   if (!kptr)
>     return 0;
>                                             map_delete_elem()
>                                                put_task()
>                                                  rcu_call
>   do_something_might_sleep()
>                                                     put_task_struct
>                                                       ... free  
>   kptr->[free'd memory]
>  
> the insert_lookup_task will bump the refcnt on the acquire on map
> insert. But the lookup doesn't do anything to the refcnt and the
> map_delete_elem will delete it. We have a check for spin_lock
> types to stop them from being in sleepable progs. Did I miss a
> similar check for these?

So, in your example above, bpf_kptr_xchg(&v->task, NULL) will atomically
xchg the kptr from the map, and so the map_delete_elem() call would fail
with (something like) -ENOENT. In general, the semantics are similar to
std::unique_ptr::swap() in C++.

FWIW, I think KF_KPTR_GET kfuncs are the more complex / racy kfuncs to
reason about. The reason is that we pass a pointer to the map value
containing the kptr directly to the kfunc (which attempts to acquire an
additional reference if a kptr was already present in the map), rather
than doing an xchg which atomically gets us the unique pointer if
nobody else xchgs it in first. So with KF_KPTR_GET, someone else could
come along and delete the kptr from the map while the kfunc is trying
to acquire that additional reference. The race looks something like
this:

   cpu0                                       cpu1
   ----
   v = insert_lookup_task(task)
   kptr = bpf_task_kptr_get(&v->task);
                                             map_delete_elem()
                                                put_task()
                                                  rcu_call
                                                     put_task_struct
                                                       ... free  
   if (!kptr)
     /* In this race example, this path will be taken. */
     return 0;

The difference is that here, we're not doing an atomic xchg of the kptr
out of the map. Instead, we're passing a pointer to the map value
containing the kptr directly to bpf_task_kptr_get(), which itself tries
to acquire an additional reference on the task to return to the program
as a kptr. This is still safe, however, as bpf_task_kptr_get() uses RCU
and refcount_inc_not_zero() in the bpf_task_kptr_get() kfunc to ensure
that it can't hit a UAF, and that it won't return a dying task to the
caller:

/**
 * bpf_task_kptr_get - Acquire a reference on a struct task_struct kptr. A task
 * kptr acquired by this kfunc which is not subsequently stored in a map, must
 * be released by calling bpf_task_release().
 * @pp: A pointer to a task kptr on which a reference is being acquired.
 */
__used noinline
struct task_struct *bpf_task_kptr_get(struct task_struct **pp)
{
        struct task_struct *p;

        rcu_read_lock();
        p = READ_ONCE(*pp);

	/* <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
	 * cpu1 could remove the element from the map here, and invoke
	 * put_task_struct_rcu_user(). We're in an RCU read region
	 * though, so the task won't be freed until at the very
	 * earliest, the rcu_read_unlock() below.
	 * >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
	 */

        if (p && !refcount_inc_not_zero(&p->rcu_users))
		/* <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
		 * refcount_inc_not_zero() will return false, as cpu1
		 * deleted the element from the map and dropped its last
		 * refcount. So we just return NULL as the task will be
		 * deleted once an RCU gp has elapsed.
		 * >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
		 */
                p = NULL;
        rcu_read_unlock();

        return p;
}

Let me know if that makes sense. This stuff is tricky, and I plan to
clearly / thoroughly add it to that kptr docs page once this patch set
lands.
  
Alexei Starovoitov Nov. 18, 2022, 6:31 p.m. UTC | #7
On Fri, Nov 18, 2022 at 09:08:12AM -0600, David Vernet wrote:
> On Thu, Nov 17, 2022 at 10:04:27PM -0800, John Fastabend wrote:
> 
> [...]
> 
> > > > And last thing I was checking is because KF_SLEEPABLE is not set
> > > > this should be blocked from running on sleepable progs which would
> > > > break the call_rcu in the destructor. Maybe small nit, not sure
> > > > its worth it but might be nice to annotate the helper description
> > > > with a note, "will not work on sleepable progs" or something to
> > > > that effect.
> > > 
> > > KF_SLEEPABLE is used to indicate whether the kfunc _itself_ may sleep,
> > > not whether the calling program can be sleepable. call_rcu() doesn't
> > > block, so no need to mark the kfunc as KF_SLEEPABLE. The key is that if
> > > a kfunc is sleepable, non-sleepable programs are not able to call it
> > > (and this is enforced in the verifier).
> > 
> > OK but should these helpers be allowed in sleepable progs? I think
> > not. What stops this, (using your helpers):
> > 
> >   cpu0                                       cpu1
> >   ----
> >   v = insert_lookup_task(task)
> >   kptr = bpf_kptr_xchg(&v->task, NULL);
> >   if (!kptr)
> >     return 0;
> >                                             map_delete_elem()
> >                                                put_task()
> >                                                  rcu_call
> >   do_something_might_sleep()
> >                                                     put_task_struct
> >                                                       ... free  

the free won't happen here, because the kptr on cpu0 holds the refcnt.
bpf side never does direct free of kptr. It only inc/dec refcnt via kfuncs.

> >   kptr->[free'd memory]
> >  
> > the insert_lookup_task will bump the refcnt on the acquire on map
> > insert. But the lookup doesn't do anything to the refcnt and the

lookup from map doesn't touch kptrs in the value.
just reading v->kptr becomes PTR_UNTRUSTED with probe_mem protection.

> > map_delete_elem will delete it. We have a check for spin_lock
> > types to stop them from being in sleepable progs. Did I miss a
> > similar check for these?
> 
> So, in your example above, bpf_kptr_xchg(&v->task, NULL) will atomically
> xchg the kptr from the map, and so the map_delete_elem() call would fail
> with (something like) -ENOENT. In general, the semantics are similar to
> std::unique_ptr::swap() in C++.
> 
> FWIW, I think KF_KPTR_GET kfuncs are the more complex / racy kfuncs to
> reason about. The reason is that we're passing a pointer to the map
> value containing a kptr directly to the kfunc (with the attempt of
> acquiring an additional reference if a kptr was already present in the
> map) rather than doing an xchg which atomically gets us the unique
> pointer if nobody else xchgs it in first. So with KF_KPTR_GET, someone
> else could come along and delete the kptr from the map while the kfunc
> is trying to acquire that additional reference. The race looks something
> like this:
> 
>    cpu0                                       cpu1
>    ----
>    v = insert_lookup_task(task)
>    kptr = bpf_task_kptr_get(&v->task);
>                                              map_delete_elem()
>                                                 put_task()
>                                                   rcu_call
>                                                      put_task_struct
>                                                        ... free  
>    if (!kptr)
>      /* In this race example, this path will be taken. */
>      return 0;
> 
> The difference is that here, we're not doing an atomic xchg of the kptr
> out of the map. Instead, we're passing a pointer to the map value
> containing the kptr directly to bpf_task_kptr_get(), which itself tries
> to acquire an additional reference on the task to return to the program
> as a kptr. This is still safe, however, as bpf_task_kptr_get() uses RCU
> and refcount_inc_not_zero() in the bpf_task_kptr_get() kfunc to ensure
> that it can't hit a UAF, and that it won't return a dying task to the
> caller:
> 
> /**
>  * bpf_task_kptr_get - Acquire a reference on a struct task_struct kptr. A task
>  * kptr acquired by this kfunc which is not subsequently stored in a map, must
>  * be released by calling bpf_task_release().
>  * @pp: A pointer to a task kptr on which a reference is being acquired.
>  */
> __used noinline
> struct task_struct *bpf_task_kptr_get(struct task_struct **pp)
> {
>         struct task_struct *p;
> 
>         rcu_read_lock();
>         p = READ_ONCE(*pp);
> 
> 	/* <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> 	 * cpu1 could remove the element from the map here, and invoke
> 	 * put_task_struct_rcu_user(). We're in an RCU read region
> 	 * though, so the task won't be freed until at the very
> 	 * earliest, the rcu_read_unlock() below.
> 	 * >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> 	 */
> 
>         if (p && !refcount_inc_not_zero(&p->rcu_users))
> 		/* <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> 		 * refcount_inc_not_zero() will return false, as cpu1
> 		 * deleted the element from the map and dropped its last
> 		 * refcount. So we just return NULL as the task will be
> 		 * deleted once an RCU gp has elapsed.
> 		 * >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> 		 */
>                 p = NULL;
>         rcu_read_unlock();
> 
>         return p;
> }
> 
> Let me know if that makes sense. This stuff is tricky, and I plan to
> clearly / thoroughly add it to that kptr docs page once this patch set
> lands.

All correct. Probably worth adding this comment directly in bpf_task_kptr_get.
  
John Fastabend Nov. 19, 2022, 6:09 a.m. UTC | #8
Alexei Starovoitov wrote:
> [...]
> 
> All correct. Probably worth adding this comment directly in bpf_task_kptr_get.

Yes, also agree, thanks for the details. Spent some time trying to
break it this evening, but didn't find anything.

Thanks.