tasks: Extract rcu_users out of union

Message ID 20230215233033.889644-1-void@manifault.com
State New
Headers
Series tasks: Extract rcu_users out of union |

Commit Message

David Vernet Feb. 15, 2023, 11:30 p.m. UTC
  In commit 3fbd7ee285b2b ("tasks: Add a count of task RCU users"), a
count on the number of RCU users was added to struct task_struct. This
was done so as to enable the removal of task_rcu_dereference(), and
allow tasks to be protected by RCU even after exiting and being removed
from the runqueue. In this commit, the 'refcount_t rcu_users' field that
keeps track of this refcount was put into a union co-located with
'struct rcu_head rcu', so as to avoid taking up any extra space in
task_struct. This was possible to do safely, because the field was only
ever decremented by a static set of specific callers, and then never
incremented again.

While this restriction of there only being a small, static set of users
of this field has worked fine, it prevents us from leveraging the field
to use RCU to protect tasks in other contexts.

During tracing, for example, it would be useful to be able to collect
some tasks that performed a certain operation, put them in a map, and
then periodically summarize who they are, which cgroup they're in, how
much CPU time they've utilized, etc. While this can currently be done
with 'usage', it becomes tricky when a task is already in a map, or if a
reference should only be taken if a task is valid and will not soon be
reaped. Ideally, we could do something like pass a reference to a map
value, and then try to acquire a reference to the task in an RCU read
region by using refcount_inc_not_zero().

Similarly, in sched_ext, schedulers are using integer pids to remember
tasks, and then looking them up with find_task_by_pid_ns(). This is
slow, error prone, and adds complexity. It would be more convenient and
performant if BPF schedulers could instead store tasks directly in maps,
and then leverage RCU to ensure they can be safely accessed with low
overhead.

Finally, overloading fields like this is error prone. Someone that wants
to use 'rcu_users' could easily overlook the fact that once the rcu
callback is scheduled, the refcount will go back to being nonzero, thus
precluding the use of refcount_inc_not_zero(). Furthermore, as described
below, it's possible to extract the fields of the union without changing
the size of task_struct.

There are several possible ways to enable this:

1. The lightest touch approach is likely the one proposed in this patch,
   which is to simply extract 'rcu_users' and 'rcu' from the union, so
   that scheduling the 'rcu' callback doesn't overwrite the 'rcu_users'
   refcount. If we have a trusted task pointer, this would allow us to
   use refcnt_inc_not_zero() inside of an RCU region to determine if we
   can safely acquire a reference to the task and store it in a map. As
   mentioned below, this can be done without changing the size of
   task_struct, by moving the location of the union to another location
   that has padding gaps we can fill in.

2. Removing 'refcount_t rcu_users', and instead having the entire task
   be freed in an rcu callback. This is likely the most sound overall
   design, though it changes the behavioral semantics exposed to
   callers, who currently expect that a task that's successfully looked
   up in e.g. the pid_list with find_task_by_pid_ns(), can always have a
   'usage' reference acquired on them, as it's guaranteed to be >
   0 until after the next gp. In order for this approach to work, we'd
   have to audit all callers. This approach also slightly changes
   behavior observed by user space by not invoking
   trace_sched_process_free() until the whole task_struct is actually being
   freed, rather than just after it's exited. It also may change
   timings, as memory will be freed in an RCU callback rather than
   immediately when the final 'usage' refcount drops to 0. This also is
   arguably a benefit, as it provides more predictable performance to
   callers who are refcounting tasks.

3. There may be other solutions as well that don't require changing the
   layout of task_struct. For example, we could possibly do something
   complex from the BPF side, such as listen for task exit and remove a
   task from a map when the task is exiting. This would likely require
   significant custom handling for task_struct in the verifier, so a
   more generalizable solution is likely warranted.

As mentioned above, this patch proposes the lightest-touch approach
which allows callers elsewhere in the kernel to use 'rcu_users' to
ensure the lifetime of a task, by extracting 'rcu_users' and 'rcu' from
the union. There is no size change in task_struct with this patch.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: David Vernet <void@manifault.com>
---
 include/linux/sched.h | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)
  

Comments

Oleg Nesterov Feb. 16, 2023, 8:04 a.m. UTC | #1
I won't argue with this patch, but I can't understand the changelog...

On 02/15, David Vernet wrote:
>
> Similarly, in sched_ext, schedulers are using integer pids to remember
> tasks, and then looking them up with find_task_by_pid_ns(). This is
> slow, error prone, and adds complexity. It would be more convenient and
> performant if BPF schedulers could instead store tasks directly in maps,
> and then leverage RCU to ensure they can be safely accessed with low
> overhead.

To simplify, suppose we have

	int global_pid;

	void func(void)
	{
		rcu_read_lock();
		task = find_task_by_pid(global_pid);
		do_something(task);
		rcu_read_unlock();
	}

Could you explain how exactly can this patch help to turn global_pid into
"task_struct *" ? Why do you need to increment task->rcu_users ?

>    a task that's successfully looked
>    up in e.g. the pid_list with find_task_by_pid_ns(), can always have a
>    'usage' reference acquired on them, as it's guaranteed to be >
>    0 until after the next gp.

Yes. So it seems you need another key-to-task_struct map with rcu-safe
lookup/get and thus the add() method needs inc_not_zero(task->rcu_users) ?

I am just curious,

Oleg.
  
David Vernet Feb. 16, 2023, 6:39 p.m. UTC | #2
On Thu, Feb 16, 2023 at 09:04:59AM +0100, Oleg Nesterov wrote:
> I won't argue with this patch, but I can't understand the changelog...
> 
> On 02/15, David Vernet wrote:
> >
> > Similarly, in sched_ext, schedulers are using integer pids to remember
> > tasks, and then looking them up with find_task_by_pid_ns(). This is
> > slow, error prone, and adds complexity. It would be more convenient and
> > performant if BPF schedulers could instead store tasks directly in maps,
> > and then leverage RCU to ensure they can be safely accessed with low
> > overhead.
> 
> To simplify, suppose we have
> 
> 	int global_pid;
> 
> 	void func(void)
> 	{
> 		rcu_read_lock();
> 		task = find_task_by_pid(global_pid);
> 		do_something(task);
> 		rcu_read_unlock();
> 	}
> 
> Could you explain how exactly can this patch help to turn global_pid into
> "task_struct *" ? Why do you need to increment task->rcu_users ?

If you're not persisting the task in a map / data structure, then I
agree that find_task_by_pid_ns() is likely sufficient. What we want to
be able to do is something like this:

void func(void)
{
	rcu_read_lock();
	task = peek_next_task();
	if (task)
		do_something(task);
	rcu_read_unlock();
}

In such an example, we could be peeking into a statically allocated
circular queue, and want to be able to ensure that a task we look at
from the top is protected with rcu. The general mechanics would be that
a task is inserted with a refcount_inc_not_zero(), and when it's
removed, we do a put_task_struct_rcu_user().

Does that make sense?

> 
> >    a task that's successfully looked
> >    up in e.g. the pid_list with find_task_by_pid_ns(), can always have a
> >    'usage' reference acquired on them, as it's guaranteed to be >
> >    0 until after the next gp.
> 
> Yes. So it seems you need another key-to-task_struct map with rcu-safe
> lookup/get and thus the add() method needs inc_not_zero(task->rcu_users) ?

Yes, exactly.

Thanks for taking a look at the patch.

- David
  
Oleg Nesterov Feb. 17, 2023, 10:25 a.m. UTC | #3
On 02/16, David Vernet wrote:
>
> On Thu, Feb 16, 2023 at 09:04:59AM +0100, Oleg Nesterov wrote:
> >
> > >    a task that's successfully looked
> > >    up in e.g. the pid_list with find_task_by_pid_ns(), can always have a
> > >    'usage' reference acquired on them, as it's guaranteed to be >
> > >    0 until after the next gp.
> >
> > Yes. So it seems you need another key-to-task_struct map with rcu-safe
> > lookup/get and thus the add() method needs inc_not_zero(task->rcu_users) ?
>
> Yes, exactly.

OK, in this case I agree, inc_not_zero(rcu_users) makes sense and thus we need
this patch.

Just I was confused by the previous part of the changelog due to my bad English.

Thanks,

Oleg.
  
David Vernet Feb. 27, 2023, 4:11 p.m. UTC | #4
On Fri, Feb 17, 2023 at 11:25:21AM +0100, Oleg Nesterov wrote:
> On 02/16, David Vernet wrote:
> >
> > On Thu, Feb 16, 2023 at 09:04:59AM +0100, Oleg Nesterov wrote:
> > >
> > > >    a task that's successfully looked
> > > >    up in e.g. the pid_list with find_task_by_pid_ns(), can always have a
> > > >    'usage' reference acquired on them, as it's guaranteed to be >
> > > >    0 until after the next gp.
> > >
> > > Yes. So it seems you need another key-to-task_struct map with rcu-safe
> > > lookup/get and thus the add() method needs inc_not_zero(task->rcu_users) ?
> >
> > Yes, exactly.
> 
> OK, in this case I agree, inc_not_zero(rcu_users) makes sense and thus we need
> this patch.

Thanks again for taking a look.

> 
> Just I was confused by the previous part of the changelog due to my bad English.

No worries at all -- the commit summary definitely could have been more
clear.

Does anyone else have any thoughts? Is there anything else we need to do
for this patch to land?

Thanks,
David
  
David Vernet March 7, 2023, 5:19 p.m. UTC | #5
On Mon, Feb 27, 2023 at 10:11:42AM -0600, David Vernet wrote:
> On Fri, Feb 17, 2023 at 11:25:21AM +0100, Oleg Nesterov wrote:
> > On 02/16, David Vernet wrote:
> > >
> > > On Thu, Feb 16, 2023 at 09:04:59AM +0100, Oleg Nesterov wrote:
> > > >
> > > > >    a task that's successfully looked
> > > > >    up in e.g. the pid_list with find_task_by_pid_ns(), can always have a
> > > > >    'usage' reference acquired on them, as it's guaranteed to be >
> > > > >    0 until after the next gp.
> > > >
> > > > Yes. So it seems you need another key-to-task_struct map with rcu-safe
> > > > lookup/get and thus the add() method needs inc_not_zero(task->rcu_users) ?
> > >
> > > Yes, exactly.
> > 
> > OK, in this case I agree, inc_not_zero(rcu_users) makes sense and thus we need
> > this patch.
> 
> Thanks again for taking a look.
> 
> > 
> > Just I was confused by the previous part of the changelog due to my bad English.
> 
> No worries at all -- the commit summary definitely could have been more
> clear.
> 
> Does anyone else have any thoughts? Is there anything else we need to do
> for this patch to land?

Sending a friendly ping on this now that the merge window has closed and
things have settled down a bit.

Thanks,
David
  
Oleg Nesterov March 8, 2023, 6:29 p.m. UTC | #6
On 03/07, David Vernet wrote:
>
> Sending a friendly ping on this now that the merge window has closed and
> things have settled down a bit.

Well, I can't help to merge this change, I lost my account on
kernel.org years ago.

Perhaps you need to resend this patch? FWIW, feel free to add

Acked-by: Oleg Nesterov <oleg@redhat.com>
  
Alexei Starovoitov March 9, 2023, 12:55 a.m. UTC | #7
On 3/8/23 10:29 AM, Oleg Nesterov wrote:
> On 03/07, David Vernet wrote:
>>
>> Sending a friendly ping on this now that the merge window has closed and
>> things have settled down a bit.
> 
> Well, I can't help to merge this change, I lost my account on
> kernel.org years ago.
> 
> Perhaps you need to resend this patch? FWIW, feel free to add
> 
> Acked-by: Oleg Nesterov <oleg@redhat.com>

Thanks for the review.
If everyone is ok with it, we will take this patch into bpf tree.
  
Alexei Starovoitov March 13, 2023, 7:59 p.m. UTC | #8
On 3/8/23 4:55 PM, Alexei Starovoitov wrote:
> On 3/8/23 10:29 AM, Oleg Nesterov wrote:
>> On 03/07, David Vernet wrote:
>>>
>>> Sending a friendly ping on this now that the merge window has closed and
>>> things have settled down a bit.
>>
>> Well, I can't help to merge this change, I lost my account on
>> kernel.org years ago.
>>
>> Perhaps you need to resend this patch? FWIW, feel free to add
>>
>> Acked-by: Oleg Nesterov <oleg@redhat.com>
> 
> Thanks for the review.
> If everyone is ok with it, we will take this patch into bpf tree.

Applied to bpf-next.
  

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4df2b3e76b30..1ab082c9d562 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1318,11 +1318,6 @@  struct task_struct {
 
 	struct tlbflush_unmap_batch	tlb_ubc;
 
-	union {
-		refcount_t		rcu_users;
-		struct rcu_head		rcu;
-	};
-
 	/* Cache last used pipe for splice(): */
 	struct pipe_inode_info		*splice_pipe;
 
@@ -1459,6 +1454,8 @@  struct task_struct {
 	unsigned long			saved_state_change;
 # endif
 #endif
+	struct rcu_head			rcu;
+	refcount_t			rcu_users;
 	int				pagefault_disabled;
 #ifdef CONFIG_MMU
 	struct task_struct		*oom_reaper_list;