[bpf-next,v9,3/4] bpf: Add kfuncs for storing struct task_struct * as a kptr

Message ID 20221120051004.3605026-4-void@manifault.com
State New
Headers
Series Support storing struct task_struct objects as kptrs |

Commit Message

David Vernet Nov. 20, 2022, 5:10 a.m. UTC
  Now that BPF supports adding new kernel functions with kfuncs, and
storing kernel objects in maps with kptrs, we can add a set of kfuncs
which allow struct task_struct objects to be stored in maps as
referenced kptrs. The possible use cases for doing this are plentiful.
During tracing, for example, it would be useful to be able to collect
some tasks that performed a certain operation, and then periodically
summarize who they are, which cgroup they're in, how much CPU time
they've utilized, etc.

In order to enable this, this patch adds three new kfuncs:

struct task_struct *bpf_task_acquire(struct task_struct *p);
struct task_struct *bpf_task_kptr_get(struct task_struct **pp);
void bpf_task_release(struct task_struct *p);

A follow-on patch will add selftests validating these kfuncs.

Signed-off-by: David Vernet <void@manifault.com>
---
 kernel/bpf/helpers.c | 78 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 75 insertions(+), 3 deletions(-)
  

Comments

Matus Jokay Dec. 5, 2022, 10:11 a.m. UTC | #1
Hello David,

Your idea behind this patch is cool, but I'm afraid that the
implementation is incorrect.

As you can see, the task_struct:rcu_users member shares the same memory
area with the task_struct:rcu (the head of an RCU CB).
Consequence: *violated invariant* that the reference counter will
remain zero after reaching zero!!!
After reaching zero the task_struct:rcu head is set, so further attempts
to access the task_struct:rcu_users may lead to a non-zero value.
For more information see
https://lore.kernel.org/lkml/CAHk-=wjT6LG6sDaZtfeT80B9RaMP-y7RNRM4F5CX2v2Z=o8e=A@mail.gmail.com/
In my opinion, the decision about task_struct:rcu and
task_struct:rcu_users union is very bad, but you should probably consult
the memory separation with authors of the actual implementation.

For now, in our project, we use the following approach:

1) get a reference to a valid task within RCU read-side with
   get_task_struct()
2) in the release function:
    2.1) enter RCU read-side
    2.2) if the task state is not TASK_DEAD: use put_task_struct()
         Note: In the case of a race with an exiting task it's OK to
         call put_task_struct(), because task_struct will be freed
         *after* the current RCU GP thanks to the task_struct:rcu_users
         mechanism.
    2.3) otherwise if test_and_set(my_cb_flag): call_rcu(my_cb)
         Note1: With respect to the RCU CB API you should guarantee that
         your CB will be installed only once within a given RCU GP. For
         that purpose we use my_cb_flag.
         Note2: This code will race with the task_struct:rcu_users
         mechanism [delayed_put_task_struct()], but it's OK. Either the
         delayed_put_task_struct() or my_cb() can be the last to call
         final put_task_struct() after the current RCU GP.
    2.4) otherwise: call put_task_struct()
         Note: The my_cb() is already installed, so within the current
         RCU GP we can invoke put_task_struct() and the ref counter of
         the task_struct will not reach zero.
    2.5) release the RCU read-side
3) The RCU CB my_cb() should set the my_cb_flag to False and call
put_task_struct().

If the release function is called within RCU read-side, the task_struct
is guaranteed to remain valid until the end of the current RCU GP.

Good luck,
mY
  
David Vernet Dec. 5, 2022, 3:02 p.m. UTC | #2
On Mon, Dec 05, 2022 at 11:11:47AM +0100, Matus Jokay wrote:
> Hello David,

Hi Matus,

> 
> Your idea behind this patch is cool, but I'm afraid that the
> implementation is incorrect.
>
> As you can see, the task_struct:rcu_users member shares the same memory
> area with the task_struct:rcu (the head of an RCU CB).
> Consequence: *violated invariant* that the reference counter will
> remain zero after reaching zero!!!
> After reaching zero the task_struct:rcu head is set, so further attempts
> to access the task_struct:rcu_users may lead to a non-zero value.

Yes, you're right. Thanks for explaining this and pointing out the
oversight.

> For more information see
> https://lore.kernel.org/lkml/CAHk-=wjT6LG6sDaZtfeT80B9RaMP-y7RNRM4F5CX2v2Z=o8e=A@mail.gmail.com/
> In my opinion, the decision about task_struct:rcu and
> task_struct:rcu_users union is very bad, but you should probably consult
> the memory separation with authors of the actual implementation.

I expect the reason it's like that is because prior to this change, as
Linus pointed out, nothing ever increments the refcount (other than as
of commit 912616f142bf: ("exit: Guarantee make_task_dead leaks the tsk
when calling do_task_exit"), which similarly increments before the
reference could have ever gone to 0, so I think is fine), so we had the
ability to save a few bytes of memory in struct task_struct. Eric
mentioned this explicitly in the commit summary for commit 3fbd7ee285b2
("tasks: Add a count of task RCU users").

Now that the refcount is actually being used as a proper refcount with
this commit, that space saving is no longer an option (unless we rip out
my changes of course). +cc Eric and Oleg -- would you guys be OK with
separating them out from that union? I guess the alternative would be to
check for p->flags & PF_EXITING in the helper, but using p->rcu_users
feels more natural.

> For now, in our project, we use the following approach:
> 
> 1) get a reference to a valid task within RCU read-side with
>    get_task_struct()
> 2) in the release function:
>     2.1) enter RCU read-side
>     2.2) if the task state is not TASK_DEAD: use put_task_struct()
>          Note: In the case of a race with an exiting task it's OK to
>          call put_task_struct(), because task_struct will be freed
>          *after* the current RCU GP thanks to the task_struct:rcu_users
>          mechanism.
>     2.3) otherwise if test_and_set(my_cb_flag): call_rcu(my_cb)
>          Note1: With respect to the RCU CB API you should guarantee that
>          your CB will be installed only once within a given RCU GP. For
>          that purpose we use my_cb_flag.
>          Note2: This code will race with the task_struct:rcu_users
>          mechanism [delayed_put_task_struct()], but it's OK. Either the
>          delayed_put_task_struct() or my_cb() can be the last to call
>          final put_task_struct() after the current RCU GP.

I think this idea would work, but in order for us to do this, I believe
we'd have to add _another_ struct rcu_head to struct task_struct. If we
did that, I don't think there's any reason to not just separate them out
of the union where they live today as it's only like that for
space-saving reasons.

>     2.4) otherwise: call put_task_struct()
>          Note: The my_cb() is already installed, so within the current
>          RCU GP we can invoke put_task_struct() and the ref counter of
>          the task_struct will not reach zero.
>     2.5) release the RCU read-side
> 3) The RCU CB my_cb() should set the my_cb_flag to False and call
> put_task_struct().
> 
> If the release function is called within RCU read-side, the task_struct
> is guaranteed to remain valid until the end of the current RCU GP.
> 
> Good luck,
> mY
  

Patch

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 212e791d7452..89a95f3d854c 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1824,6 +1824,63 @@  struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head)
 	return __bpf_list_del(head, true);
 }
 
+/**
+ * bpf_task_acquire - Acquire a reference to a task. A task acquired by this
+ * kfunc which is not stored in a map as a kptr, must be released by calling
+ * bpf_task_release().
+ * @p: The task on which a reference is being acquired.
+ */
+struct task_struct *bpf_task_acquire(struct task_struct *p)
+{
+	refcount_inc(&p->rcu_users);
+	return p;
+}
+
+/**
+ * bpf_task_kptr_get - Acquire a reference on a struct task_struct kptr. A task
+ * kptr acquired by this kfunc which is not subsequently stored in a map, must
+ * be released by calling bpf_task_release().
+ * @pp: A pointer to a task kptr on which a reference is being acquired.
+ */
+struct task_struct *bpf_task_kptr_get(struct task_struct **pp)
+{
+	struct task_struct *p;
+
+	rcu_read_lock();
+	p = READ_ONCE(*pp);
+
+	/* Another context could remove the task from the map and release it at
+	 * any time, including after we've done the lookup above. This is safe
+	 * because we're in an RCU read region, so the task is guaranteed to
+	 * remain valid until at least the rcu_read_unlock() below.
+	 */
+	if (p && !refcount_inc_not_zero(&p->rcu_users))
+		/* If the task had been removed from the map and freed as
+		 * described above, refcount_inc_not_zero() will return false.
+		 * The task will be freed at some point after the current RCU
+		 * gp has ended, so just return NULL to the user.
+		 */
+		p = NULL;
+	rcu_read_unlock();
+
+	return p;
+}
+
+/**
+ * bpf_task_release - Release the reference acquired on a struct task_struct *.
+ * If this kfunc is invoked in an RCU read region, the task_struct is
+ * guaranteed to not be freed until the current grace period has ended, even if
+ * its refcount drops to 0.
+ * @p: The task on which a reference is being released.
+ */
+void bpf_task_release(struct task_struct *p)
+{
+	if (!p)
+		return;
+
+	put_task_struct_rcu_user(p);
+}
+
 __diag_pop();
 
 BTF_SET8_START(generic_btf_ids)
@@ -1836,6 +1893,9 @@  BTF_ID_FLAGS(func, bpf_list_push_front)
 BTF_ID_FLAGS(func, bpf_list_push_back)
 BTF_ID_FLAGS(func, bpf_list_pop_front, KF_ACQUIRE | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_list_pop_back, KF_ACQUIRE | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_task_acquire, KF_ACQUIRE | KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_task_kptr_get, KF_ACQUIRE | KF_KPTR_GET | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_task_release, KF_RELEASE)
 BTF_SET8_END(generic_btf_ids)
 
 static const struct btf_kfunc_id_set generic_kfunc_set = {
@@ -1843,14 +1903,26 @@  static const struct btf_kfunc_id_set generic_kfunc_set = {
 	.set   = &generic_btf_ids,
 };
 
+BTF_ID_LIST(generic_dtor_ids)
+BTF_ID(struct, task_struct)
+BTF_ID(func, bpf_task_release)
+
 static int __init kfunc_init(void)
 {
 	int ret;
+	const struct btf_id_dtor_kfunc generic_dtors[] = {
+		{
+			.btf_id       = generic_dtor_ids[0],
+			.kfunc_btf_id = generic_dtor_ids[1]
+		},
+	};
 
 	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &generic_kfunc_set);
-	if (ret)
-		return ret;
-	return register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &generic_kfunc_set);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &generic_kfunc_set);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &generic_kfunc_set);
+	return ret ?: register_btf_id_dtor_kfuncs(generic_dtors,
+						  ARRAY_SIZE(generic_dtors),
+						  THIS_MODULE);
 }
 
 late_initcall(kfunc_init);