[v3,1/3] pidfd: allow pidfd_open() on non-thread-group leaders

Message ID 20240123153452.170866-2-tycho@tycho.pizza
State New
Headers
Series pidfds for non thread group leaders |

Commit Message

Tycho Andersen Jan. 23, 2024, 3:34 p.m. UTC
  From: Tycho Andersen <tandersen@netflix.com>

We are using the pidfd family of syscalls with the seccomp userspace
notifier. When some thread triggers a seccomp notification, we want to do
some things to its context (munge fd tables via pidfd_getfd(), maybe write
to its memory, etc.). However, threads created with ~CLONE_FILES or
~CLONE_VM mean that we can't use the pidfd family of syscalls for this
purpose, since their fd table or mm are distinct from the thread group
leader's. In this patch, we relax this restriction for pidfd_open().

In order to avoid dangling poll() users we need to notify pidfd waiters
when individual threads die, but once we do that all the other machinery
seems to work ok viz. the tests. But I suppose there are more cases than
just this one.

Signed-off-by: Tycho Andersen <tandersen@netflix.com>
--
v2: unify pidfd notification to all go through do_notify_pidfd() inside of
    __exit_signals() suggested by Oleg.
    Link to v1: https://lore.kernel.org/all/20231130163946.277502-1-tycho@tycho.pizza/
v3: go back to two separate call sites, the exiting one in
    do_notify_parent(), and a new one in release_task(), when a thread is
    not the thread group leader.
---
 include/linux/sched/signal.h |  1 +
 kernel/exit.c                | 11 +++++++++++
 kernel/fork.c                |  4 +---
 kernel/pid.c                 | 11 +----------
 kernel/signal.c              |  2 +-
 5 files changed, 15 insertions(+), 14 deletions(-)
  

Comments

Oleg Nesterov Jan. 23, 2024, 7:56 p.m. UTC | #1
Too late for me, but I don't understand this patch after a quick glance.
perhaps I missed something...

On 01/23, Tycho Andersen wrote:
>
> @@ -256,6 +256,17 @@ void release_task(struct task_struct *p)
>  	write_lock_irq(&tasklist_lock);
>  	ptrace_release_task(p);
>  	thread_pid = get_pid(p->thread_pid);
> +
> +	/*
> +	 * If we're not the leader, notify any waiters on our pidfds. Note that
> +	 * we don't want to notify the leader until /everyone/ in the thread
> +	 * group is dead, viz. the condition below.
> +	 *
> +	 * We have to do this here, since __exit_signal() will
> +	 * __unhash_processes(), and break do_notify_pidfd()'s lookup.
> +	 */
> +	if (!thread_group_leader(p))
> +		do_notify_pidfd(p);

This doesn't look consistent.

If the task is a group leader do_notify_pidfd() is called by exit_notify()
when it becomes a zombie (if no other threads), before it is reaped by its
parent (unless autoreap).

If it is a sub-thread, it is called by release_task() above. Note that a
sub-thread can become a zombie too if it is traced.

>  	__exit_signal(p);

and,  do_notify_pidfd() is called before __exit_signal() which does
__unhash_process() -> detach_pid(PIDTYPE_PID).

Doesn't this mean that pidfd_poll() can hang? thread_group_exited()
won't return true after do_notify_pidfd() above, not to mention that
thread_group_empty() is not possible if !thread_group_leader().

So. When do we want to do do_notify_pidfd() ? Whe the task (leader or not)
becomes a zombie (passes exit_notify) or when it is reaped by release_task?

Either way pidfd_poll() needs more changes with this patch and it can't
use thread_group_exited(). If do_notify_pidfd() is called by release_task()
after __exit_signal(), it can just check pid_has_task(PIDTYPE_PID).

Oleg.
  
Tycho Andersen Jan. 23, 2024, 9:10 p.m. UTC | #2
On Tue, Jan 23, 2024 at 08:56:08PM +0100, Oleg Nesterov wrote:
> Too late for me, but I don't understand this patch after a quick glance.
> perhaps I missed something...

Thanks for taking a look.

> On 01/23, Tycho Andersen wrote:
> >
> > @@ -256,6 +256,17 @@ void release_task(struct task_struct *p)
> >  	write_lock_irq(&tasklist_lock);
> >  	ptrace_release_task(p);
> >  	thread_pid = get_pid(p->thread_pid);
> > +
> > +	/*
> > +	 * If we're not the leader, notify any waiters on our pidfds. Note that
> > +	 * we don't want to notify the leader until /everyone/ in the thread
> > +	 * group is dead, viz. the condition below.
> > +	 *
> > +	 * We have to do this here, since __exit_signal() will
> > +	 * __unhash_processes(), and break do_notify_pidfd()'s lookup.
> > +	 */
> > +	if (!thread_group_leader(p))
> > +		do_notify_pidfd(p);
> 
> This doesn't look consistent.
> 
> If the task is a group leader do_notify_pidfd() is called by exit_notify()
> when it becomes a zombie (if no other threads), before it is reaped by its
> parent (unless autoreap).

There is another path, also in release_task(), that I was trying to
mirror since it deals explicitly with sub-threads but,

> If it is a sub-thread, it is called by release_task() above. Note that a
> sub-thread can become a zombie too if it is traced.

I didn't know about this.

> >  	__exit_signal(p);
> 
> and,  do_notify_pidfd() is called before __exit_signal() which does
> __unhash_process() -> detach_pid(PIDTYPE_PID).
> 
> Doesn't this mean that pidfd_poll() can hang? thread_group_exited()
> won't return true after do_notify_pidfd() above, not to mention that
> thread_group_empty() is not possible if !thread_group_leader().

I was wondering about this too, but the test_non_tgl_poll_exit test in
the next patch tests exactly this and works as expected.

> So. When do we want to do do_notify_pidfd() ? Whe the task (leader or not)
> becomes a zombie (passes exit_notify) or when it is reaped by release_task?

It seems like we'd want it when exit_notify() is called in principle,
since that's when the pid actually dies. When it is reaped is "mostly
unrelated". Something like,

1. in the "normal" exit_notify() paths via do_notify_parent()
2. if none of those cases are true (aka the final else in
   exit_notify()) and the thread is not ptraced
3. via release_task() finally if this was the thread group leader and
   it died before some sub-thread

then in pidfd_poll(), we can do:

    if (!tsk || (tsk->exit_state >= 0) || thread_group_exited())
        do_notify_pidfd();

?

> Either way pidfd_poll() needs more changes with this patch and it can't
> use thread_group_exited(). If do_notify_pidfd() is called by release_task()
> after __exit_signal(), it can just check pid_has_task(PIDTYPE_PID).

I suppose this is why my test works, since pid_task(PIDTYPE_PID) is null
after release_task(). But if we want it to happen earlier, we'll have
to do something like the above.

Tycho
  
Oleg Nesterov Jan. 23, 2024, 10:22 p.m. UTC | #3
I am already sleeping. I'll try to reply to other parts of your email
tomorrow but I am not sure, I will be very busy with family duties.

On 01/23, Tycho Andersen wrote:
>
> > >  	__exit_signal(p);
> >
> > and,  do_notify_pidfd() is called before __exit_signal() which does
> > __unhash_process() -> detach_pid(PIDTYPE_PID).
> >
> > Doesn't this mean that pidfd_poll() can hang? thread_group_exited()
> > won't return true after do_notify_pidfd() above, not to mention that
> > thread_group_empty() is not possible if !thread_group_leader().
>
> I was wondering about this too, but the test_non_tgl_poll_exit test in
> the next patch tests exactly this and works as expected.

Well, if release_task() completes __exit_signal() before the woken task
does thread_group_exited(), pid_task(PIDTYPE_PID) will return 0 and
pidfd_poll() won't hang.

But to be honest I can't understand test_non_tgl_poll_exit() at all. I don't
even understand why the process/thread created by fork_task_with_thread()
should ever exit. And why it creates the "writer" child... Never mind, too
late for me to read the code.

Oleg.
  
Oleg Nesterov Jan. 24, 2024, 1:25 a.m. UTC | #4
On 01/23, Oleg Nesterov wrote:
>
> But to be honest I can't understand test_non_tgl_poll_exit() at all. I don't
> even understand why the process/thread created by fork_task_with_thread()
> should ever exit. And why it creates the "writer" child... Never mind, too
> late for me to read the code.

Ah, OK, it passes thread_wait_exit to fork_task_with_thread(), and this "fn"
reads sk_pair and does exit() which is actually exit_group().

Oleg.
  
Oleg Nesterov Jan. 25, 2024, 2:08 p.m. UTC | #5
Add Eric.

On 01/23, Tycho Andersen wrote:
>
> On Tue, Jan 23, 2024 at 08:56:08PM +0100, Oleg Nesterov wrote:
>
> > So. When do we want to do do_notify_pidfd() ? Whe the task (leader or not)
> > becomes a zombie (passes exit_notify) or when it is reaped by release_task?
>
> It seems like we'd want it when exit_notify() is called in principle,
> since that's when the pid actually dies.

No the pid "dies" after this task is reaped, until then its nr is still
in use and pid_task(PIDTYPE_PID) returns the exiting/exited task.

> When it is reaped is "mostly unrelated".

Then why pidfd_poll() can't simply check !task || task->exit_state ?

Nevermind. So, currently pidfd_poll() succeeds when the leader can be
reaped, iow the whole thread group has exited. But even if you are the
parent, you can't expect that wait(WNOHANG) must succeed, the leader
can be traced. I guess it is too late to change this behaviour.

What if we add the new PIDFD_THREAD flag? With this flag

	- sys_pidfd_open() doesn't require the must be a group leader

	- pidfd_poll() succeeds when the task passes exit_notify() and
	  becomes a zombie, even if it is a leader and has other threads.


Please the the incomplete/untested patch below.

	- The change in exit_notify() is sub-optimal, we can do better
	  to avoid 2 do_notify_pidfd() calls from exit_notify(). But
	  so far this is only for discussion, lets keep it simple.

	- __pidfd_prepare() needs some minor cleanups regardless of
	  this change, I'll send the patch...

What do you think?

And why is thread_group_exited() exported?

Oleg.

diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
index 5406fbc13074..2e6461459877 100644
--- a/include/uapi/linux/pidfd.h
+++ b/include/uapi/linux/pidfd.h
@@ -7,6 +7,7 @@
 #include <linux/fcntl.h>
 
 /* Flags for pidfd_open().  */
-#define PIDFD_NONBLOCK O_NONBLOCK
+#define PIDFD_NONBLOCK	O_NONBLOCK
+#define PIDFD_THREAD	O_EXCL	// or anything else not used by anon_inode's
 
 #endif /* _UAPI_LINUX_PIDFD_H */
diff --git a/kernel/exit.c b/kernel/exit.c
index dfb963d2f862..9f8526b7d717 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -752,6 +752,10 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 		autoreap = true;
 	}
 
+	/* unnecessary if do_notify_parent() was already called,
+	   we can do better */
+	do_notify_pidfd(tsk);
+
 	if (autoreap) {
 		tsk->exit_state = EXIT_DEAD;
 		list_add(&tsk->ptrace_entry, &dead);
diff --git a/kernel/fork.c b/kernel/fork.c
index c981fa6171c1..38f2c7423fb4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -101,6 +101,7 @@
 #include <linux/user_events.h>
 #include <linux/iommu.h>
 #include <linux/rseq.h>
+#include <uapi/linux/pidfd.h>
 
 #include <asm/pgalloc.h>
 #include <linux/uaccess.h>
@@ -2068,12 +2069,27 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
 }
 #endif
 
+static bool xxx_exited(struct pid *pid, int excl)
+{
+	struct task_struct *task;
+	bool exited;
+
+	rcu_read_lock();
+	task = pid_task(pid, PIDTYPE_PID);
+	exited = !task ||
+		(READ_ONCE(task->exit_state) && (excl || thread_group_empty(task)));
+	rcu_read_unlock();
+
+	return exited;
+}
+
 /*
  * Poll support for process exit notification.
  */
 static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
 {
 	struct pid *pid = file->private_data;
+	int excl = file->f_flags & PIDFD_THREAD;
 	__poll_t poll_flags = 0;
 
 	poll_wait(file, &pid->wait_pidfd, pts);
@@ -2083,7 +2099,7 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
 	 * If the thread group leader exits before all other threads in the
 	 * group, then poll(2) should block, similar to the wait(2) family.
 	 */
-	if (thread_group_exited(pid))
+	if (xxx_exited(pid, excl))
 		poll_flags = EPOLLIN | EPOLLRDNORM;
 
 	return poll_flags;
@@ -2129,7 +2145,9 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
 {
 	int pidfd;
 	struct file *pidfd_file;
+	unsigned excl = flags & PIDFD_THREAD;
 
+	flags &= ~PIDFD_THREAD;
 	if (flags & ~(O_NONBLOCK | O_RDWR | O_CLOEXEC))
 		return -EINVAL;
 
@@ -2144,6 +2162,7 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
 		return PTR_ERR(pidfd_file);
 	}
 	get_pid(pid); /* held by pidfd_file now */
+	pidfd_file->f_flags |= excl;
 	*ret = pidfd_file;
 	return pidfd;
 }
@@ -2176,7 +2195,9 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
  */
 int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
 {
-	if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
+	unsigned excl = flags & PIDFD_THREAD;
+
+	if (!pid || !pid_has_task(pid, excl ? PIDTYPE_PID : PIDTYPE_TGID))
 		return -EINVAL;
 
 	return __pidfd_prepare(pid, flags, ret);
diff --git a/kernel/pid.c b/kernel/pid.c
index b52b10865454..5257197f9493 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -629,7 +629,7 @@ SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
 	int fd;
 	struct pid *p;
 
-	if (flags & ~PIDFD_NONBLOCK)
+	if (flags & ~(PIDFD_NONBLOCK | PIDFD_THREAD))
 		return -EINVAL;
 
 	if (pid <= 0)
  
Christian Brauner Jan. 25, 2024, 5:17 p.m. UTC | #6
> > When it is reaped is "mostly unrelated".
> 
> Then why pidfd_poll() can't simply check !task || task->exit_state ?
> 
> Nevermind. So, currently pidfd_poll() succeeds when the leader can be

Hm, the comment right above mentions:

        /*
         * Inform pollers only when the whole thread group exits.
         * If the thread group leader exits before all other threads in the
         * group, then poll(2) should block, similar to the wait(2) family.
         */

> reaped, iow the whole thread group has exited. But even if you are the
> parent, you can't expect that wait(WNOHANG) must succeed, the leader
> can be traced. I guess it is too late to change this behaviour.

Hm, why is that an issue though? And if it is an issue why shouldn't we
be able to change it? Because a program would rely on WNOHANG to hang on
a ptraced leader? That seems esoteric imho. I might just misunderstand.

> 
> What if we add the new PIDFD_THREAD flag? With this flag
> 
> 	- sys_pidfd_open() doesn't require the must be a group leader

Yes.

> 
> 	- pidfd_poll() succeeds when the task passes exit_notify() and
> 	  becomes a zombie, even if it is a leader and has other threads.

Iiuc, if an existing user creates a pidfd for a thread-group leader and
then polls that pidfd they would currently only get notified if the
thread-group is empty and the leader has exited.

If we now start notifying when the thread-group leader exits but the
thread-group isn't empty then this would be a fairly big api change and
I would expect this to cause regressions as that surely is something
that userspace relies on. Am I understand this right?
  
Oleg Nesterov Jan. 25, 2024, 5:51 p.m. UTC | #7
On 01/25, Christian Brauner wrote:
>
> > > When it is reaped is "mostly unrelated".
> >
> > Then why pidfd_poll() can't simply check !task || task->exit_state ?
> >
> > Nevermind. So, currently pidfd_poll() succeeds when the leader can be
>
> Hm, the comment right above mentions:
>
>         /*
>          * Inform pollers only when the whole thread group exits.
>          * If the thread group leader exits before all other threads in the
>          * group, then poll(2) should block, similar to the wait(2) family.
>          */
> > reaped, iow the whole thread group has exited.

Yes, but the comment doesn't contradict with what I have said?


> > But even if you are the
> > parent, you can't expect that wait(WNOHANG) must succeed, the leader
> > can be traced. I guess it is too late to change this behaviour.
>
> Hm, why is that an issue though?

Well, I didn't say this is a problem. I simply do not know how/why people
use pidfd_poll().

I mostly tried to explain why do I think that do_notify_pidfd() should
be always called from exit_notify() path, not by release_task(), even
if the task is not a leader.

> Because a program would rely on WNOHANG to hang on
> a ptraced leader? That seems esoteric imho.

To me it would be usefule, but lets not discuss this now. The "patch"
I sent doesn't change the current behaviour.

> > What if we add the new PIDFD_THREAD flag? With this flag
> >
> > 	- sys_pidfd_open() doesn't require the must be a group leader
>
> Yes.
>
> >
> > 	- pidfd_poll() succeeds when the task passes exit_notify() and
> > 	  becomes a zombie, even if it is a leader and has other threads.
>
> Iiuc, if an existing user creates a pidfd for a thread-group leader and
> then polls that pidfd they would currently only get notified if the
> thread-group is empty and the leader has exited.
>
> If we now start notifying when the thread-group leader exits but the
> thread-group isn't empty then this would be a fairly big api change

Hmm... again, this patch doesn't (shouldn't) change the current behavior.

Please note "with this flag" above. If sys_pidfd_open() was called
without PIDFD_THREAD, then sys_pidfd_open() still requires that the
target task must be a group leader, and pidfd_poll() won't succeed
until the leader exits and thread_group_empty() is true.

Oleg.
  
Tycho Andersen Jan. 25, 2024, 6:03 p.m. UTC | #8
On Thu, Jan 25, 2024 at 06:51:14PM +0100, Oleg Nesterov wrote:
> > > What if we add the new PIDFD_THREAD flag? With this flag
> > >
> > > 	- sys_pidfd_open() doesn't require the must be a group leader
> >
> > Yes.
> >
> > >
> > > 	- pidfd_poll() succeeds when the task passes exit_notify() and
> > > 	  becomes a zombie, even if it is a leader and has other threads.
> >
> > Iiuc, if an existing user creates a pidfd for a thread-group leader and
> > then polls that pidfd they would currently only get notified if the
> > thread-group is empty and the leader has exited.
> >
> > If we now start notifying when the thread-group leader exits but the
> > thread-group isn't empty then this would be a fairly big api change
> 
> Hmm... again, this patch doesn't (shouldn't) change the current behavior.
> 
> Please note "with this flag" above. If sys_pidfd_open() was called
> without PIDFD_THREAD, then sys_pidfd_open() still requires that the
> target task must be a group leader, and pidfd_poll() won't succeed
> until the leader exits and thread_group_empty() is true.

Thanks for sending your patch, I'll take a look at it (probably
tomorrow at this rate).

One of the things I don't like about PIDFD_THREAD is that it's hard to
tell whether an arbitrary thread is a leader or not. Right now we do
it by parsing /proc/pid/status, which shows all the stuff from
do_task_stat() that we don't care about but which is quite expensive
to compute. (Maybe there's a better way?)

With PIDFD_THREAD we could could do it twice, once with the flag, get
EINVAL, and then do it again. But ideally we wouldn't have to.

Still, if that's the only way that makes sense, that's fine.

Tycho
  
Oleg Nesterov Jan. 25, 2024, 6:25 p.m. UTC | #9
On 01/25, Tycho Andersen wrote:
>
> One of the things I don't like about PIDFD_THREAD is that it's hard to
> tell whether an arbitrary thread is a leader or not. Right now we do
> it by parsing /proc/pid/status, which shows all the stuff from
> do_task_stat() that we don't care about but which is quite expensive
> to compute. (Maybe there's a better way?)
>
> With PIDFD_THREAD we could could do it twice, once with the flag, get
> EINVAL, and then do it again. But ideally we wouldn't have to.

Too late for me, most probably I misunderstood.

If you want the PIDFD_THREAD behaviour, you can always use this flag
without any check...

Could you spell?

Oleg.
  
Oleg Nesterov Jan. 25, 2024, 6:30 p.m. UTC | #10
On 01/25, Oleg Nesterov wrote:
>
> On 01/25, Tycho Andersen wrote:
> >
> > One of the things I don't like about PIDFD_THREAD is that it's hard to
> > tell whether an arbitrary thread is a leader or not. Right now we do
> > it by parsing /proc/pid/status, which shows all the stuff from
> > do_task_stat() that we don't care about but which is quite expensive
> > to compute. (Maybe there's a better way?)
> >
> > With PIDFD_THREAD we could could do it twice, once with the flag, get
> > EINVAL, and then do it again. But ideally we wouldn't have to.
>
> Too late for me, most probably I misunderstood.
>
> If you want the PIDFD_THREAD behaviour, you can always use this flag
> without any check...
>
> Could you spell?

Just in case, we can even add PIDFD_AUTO (modulo naming) which acts as
PIDFD_THREAD if the target task is not a leader or 0 (current behaviour)
otherwise. Trivial.

Oleg.
  
Tycho Andersen Jan. 25, 2024, 6:36 p.m. UTC | #11
On Thu, Jan 25, 2024 at 07:30:46PM +0100, Oleg Nesterov wrote:
> On 01/25, Oleg Nesterov wrote:
> >
> > On 01/25, Tycho Andersen wrote:
> > >
> > > One of the things I don't like about PIDFD_THREAD is that it's hard to
> > > tell whether an arbitrary thread is a leader or not. Right now we do
> > > it by parsing /proc/pid/status, which shows all the stuff from
> > > do_task_stat() that we don't care about but which is quite expensive
> > > to compute. (Maybe there's a better way?)
> > >
> > > With PIDFD_THREAD we could could do it twice, once with the flag, get
> > > EINVAL, and then do it again. But ideally we wouldn't have to.
> >
> > Too late for me, most probably I misunderstood.
> >
> > If you want the PIDFD_THREAD behaviour, you can always use this flag
> > without any check...

Sorry, I hadn't read the patch. If it's ok to use PIDFD_THREAD on a
leader, then we can just always specify it. (We don't care about the
behavior of pidfd_poll().)

> > Could you spell?
> 
> Just in case, we can even add PIDFD_AUTO (modulo naming) which acts as
> PIDFD_THREAD if the target task is not a leader or 0 (current behaviour)
> otherwise. Trivial.

Yep, or given the above, maybe it'll work as-is, thank you.

Tycho
  
Christian Brauner Jan. 26, 2024, 9:42 a.m. UTC | #12
On Thu, Jan 25, 2024 at 06:51:14PM +0100, Oleg Nesterov wrote:
> On 01/25, Christian Brauner wrote:
> >
> > > > When it is reaped is "mostly unrelated".
> > >
> > > Then why pidfd_poll() can't simply check !task || task->exit_state ?
> > >
> > > Nevermind. So, currently pidfd_poll() succeeds when the leader can be
> >
> > Hm, the comment right above mentions:
> >
> >         /*
> >          * Inform pollers only when the whole thread group exits.
> >          * If the thread group leader exits before all other threads in the
> >          * group, then poll(2) should block, similar to the wait(2) family.
> >          */
> > > reaped, iow the whole thread group has exited.
> 
> Yes, but the comment doesn't contradict with what I have said?

No, it doesn't. I'm trying to understand what you are suggesting though.
Are you saying !task || tas->exit_state is enough and we shouldn't use
the helper that was added in commit 38fd525a4c61 ("exit: Factor
thread_group_exited out of pidfd_poll"). If so what does that buy us
open-coding the check instead of using that helper? Is there an actual
bug here?

> > > But even if you are the
> > > parent, you can't expect that wait(WNOHANG) must succeed, the leader
> > > can be traced. I guess it is too late to change this behaviour.
> >
> > Hm, why is that an issue though?
> 
> Well, I didn't say this is a problem. I simply do not know how/why people
> use pidfd_poll().

Sorry, I just have a hard time understanding what you wanted then. :)

"I guess it is too late to change this behavior." made it sound like a)
there's a problem and b) that you would prefer to change behavior. Thus,
it seems that wait(WNOHANG) hanging when a traced leader of an empty
thread-group has exited is a problem in your eyes.

> 
> I mostly tried to explain why do I think that do_notify_pidfd() should
> be always called from exit_notify() path, not by release_task(), even
> if the task is not a leader.
> 
> > Because a program would rely on WNOHANG to hang on
> > a ptraced leader? That seems esoteric imho.
> 
> To me it would be usefule, but lets not discuss this now. The "patch"

Ok, that's good then. I would expect that at least stuff like rr makes
use of pidfd and they might rely on this behavior - although I haven't
checked their code.

> I sent doesn't change the current behaviour.

Yeah, I got that but it would still be useful to understand the wider
context you were adressing.

> 
> > > What if we add the new PIDFD_THREAD flag? With this flag
> > >
> > > 	- sys_pidfd_open() doesn't require the must be a group leader
> >
> > Yes.
> >
> > >
> > > 	- pidfd_poll() succeeds when the task passes exit_notify() and
> > > 	  becomes a zombie, even if it is a leader and has other threads.
> >
> > Iiuc, if an existing user creates a pidfd for a thread-group leader and
> > then polls that pidfd they would currently only get notified if the
> > thread-group is empty and the leader has exited.
> >
> > If we now start notifying when the thread-group leader exits but the
> > thread-group isn't empty then this would be a fairly big api change
> 
> Hmm... again, this patch doesn't (shouldn't) change the current behavior.
> 
> Please note "with this flag" above. If sys_pidfd_open() was called
> without PIDFD_THREAD, then sys_pidfd_open() still requires that the
> target task must be a group leader, and pidfd_poll() won't succeed
> until the leader exits and thread_group_empty() is true.

Yeah, I missed the PIDFD_THREAD flag suggestion. Sorry about that. Btw,
I'm not sure whether you remember that but when we originally did the
pidfd work you and I discussed thread support and already decided back
then that having a flag like PIDFD_THREAD would likely be the way to go.

The PIDFD_THREAD flag would be would be interesting because we could
make pidfd_send_signal() support this flag as well to allow sending a
signal to a specific thread. That's something that I had also wanted to
support. And I've been asked for this a few times already. What do you
think?
  
Christian Brauner Jan. 26, 2024, 9:47 a.m. UTC | #13
> Please the the incomplete/untested patch below.
> 
> 	- The change in exit_notify() is sub-optimal, we can do better
> 	  to avoid 2 do_notify_pidfd() calls from exit_notify(). But
> 	  so far this is only for discussion, lets keep it simple.
> 
> 	- __pidfd_prepare() needs some minor cleanups regardless of
> 	  this change, I'll send the patch...
> 
> What do you think?
> 
> And why is thread_group_exited() exported?
> 
> Oleg.
> 
> diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
> index 5406fbc13074..2e6461459877 100644
> --- a/include/uapi/linux/pidfd.h
> +++ b/include/uapi/linux/pidfd.h
> @@ -7,6 +7,7 @@
>  #include <linux/fcntl.h>
>  
>  /* Flags for pidfd_open().  */
> -#define PIDFD_NONBLOCK O_NONBLOCK
> +#define PIDFD_NONBLOCK	O_NONBLOCK
> +#define PIDFD_THREAD	O_EXCL	// or anything else not used by anon_inode's

I like it!

The only request I would have is to not alias O_EXCL and PIDFD_THREAD.
Because it doesn't map as clearly as NONBLOCK did.

>  
>  #endif /* _UAPI_LINUX_PIDFD_H */
> diff --git a/kernel/exit.c b/kernel/exit.c
> index dfb963d2f862..9f8526b7d717 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -752,6 +752,10 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
>  		autoreap = true;
>  	}
>  
> +	/* unnecessary if do_notify_parent() was already called,
> +	   we can do better */
> +	do_notify_pidfd(tsk);
> +
>  	if (autoreap) {
>  		tsk->exit_state = EXIT_DEAD;
>  		list_add(&tsk->ptrace_entry, &dead);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index c981fa6171c1..38f2c7423fb4 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -101,6 +101,7 @@
>  #include <linux/user_events.h>
>  #include <linux/iommu.h>
>  #include <linux/rseq.h>
> +#include <uapi/linux/pidfd.h>
>  
>  #include <asm/pgalloc.h>
>  #include <linux/uaccess.h>
> @@ -2068,12 +2069,27 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
>  }
>  #endif
>  
> +static bool xxx_exited(struct pid *pid, int excl)
> +{
> +	struct task_struct *task;
> +	bool exited;
> +
> +	rcu_read_lock();
> +	task = pid_task(pid, PIDTYPE_PID);
> +	exited = !task ||
> +		(READ_ONCE(task->exit_state) && (excl || thread_group_empty(task)));
> +	rcu_read_unlock();
> +
> +	return exited;
> +}
> +
>  /*
>   * Poll support for process exit notification.
>   */
>  static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
>  {
>  	struct pid *pid = file->private_data;
> +	int excl = file->f_flags & PIDFD_THREAD;
>  	__poll_t poll_flags = 0;
>  
>  	poll_wait(file, &pid->wait_pidfd, pts);
> @@ -2083,7 +2099,7 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
>  	 * If the thread group leader exits before all other threads in the
>  	 * group, then poll(2) should block, similar to the wait(2) family.
>  	 */
> -	if (thread_group_exited(pid))
> +	if (xxx_exited(pid, excl))
>  		poll_flags = EPOLLIN | EPOLLRDNORM;
>  
>  	return poll_flags;
> @@ -2129,7 +2145,9 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
>  {
>  	int pidfd;
>  	struct file *pidfd_file;
> +	unsigned excl = flags & PIDFD_THREAD;
>  
> +	flags &= ~PIDFD_THREAD;
>  	if (flags & ~(O_NONBLOCK | O_RDWR | O_CLOEXEC))
>  		return -EINVAL;
>  
> @@ -2144,6 +2162,7 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
>  		return PTR_ERR(pidfd_file);
>  	}
>  	get_pid(pid); /* held by pidfd_file now */
> +	pidfd_file->f_flags |= excl;
>  	*ret = pidfd_file;
>  	return pidfd;
>  }
> @@ -2176,7 +2195,9 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
>   */
>  int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
>  {
> -	if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
> +	unsigned excl = flags & PIDFD_THREAD;
> +
> +	if (!pid || !pid_has_task(pid, excl ? PIDTYPE_PID : PIDTYPE_TGID))
>  		return -EINVAL;
>  
>  	return __pidfd_prepare(pid, flags, ret);
> diff --git a/kernel/pid.c b/kernel/pid.c
> index b52b10865454..5257197f9493 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -629,7 +629,7 @@ SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
>  	int fd;
>  	struct pid *p;
>  
> -	if (flags & ~PIDFD_NONBLOCK)
> +	if (flags & ~(PIDFD_NONBLOCK | PIDFD_THREAD))
>  		return -EINVAL;
>  
>  	if (pid <= 0)
>
  
Christian Brauner Jan. 26, 2024, 9:49 a.m. UTC | #14
On Thu, Jan 25, 2024 at 11:36:50AM -0700, Tycho Andersen wrote:
> On Thu, Jan 25, 2024 at 07:30:46PM +0100, Oleg Nesterov wrote:
> > On 01/25, Oleg Nesterov wrote:
> > >
> > > On 01/25, Tycho Andersen wrote:
> > > >
> > > > One of the things I don't like about PIDFD_THREAD is that it's hard to
> > > > tell whether an arbitrary thread is a leader or not. Right now we do
> > > > it by parsing /proc/pid/status, which shows all the stuff from
> > > > do_task_stat() that we don't care about but which is quite expensive
> > > > to compute. (Maybe there's a better way?)
> > > >
> > > > With PIDFD_THREAD we could could do it twice, once with the flag, get
> > > > EINVAL, and then do it again. But ideally we wouldn't have to.
> > >
> > > Too late for me, most probably I misunderstood.
> > >
> > > If you want the PIDFD_THREAD behaviour, you can always use this flag
> > > without any check...
> 
> Sorry, I hadn't read the patch. If it's ok to use PIDFD_THREAD on a
> leader, then we can just always specify it. (We don't care about the
> behavior of pidfd_poll().)
> 
> > > Could you spell?
> > 
> > Just in case, we can even add PIDFD_AUTO (modulo naming) which acts as
> > PIDFD_THREAD if the target task is not a leader or 0 (current behaviour)
> > otherwise. Trivial.
> 
> Yep, or given the above, maybe it'll work as-is, thank you.

Yes, let's rather do the explicit PIDFD_THREAD.
  
Oleg Nesterov Jan. 26, 2024, 2:33 p.m. UTC | #15
On 01/26, Christian Brauner wrote:
>
> No, it doesn't. I'm trying to understand what you are suggesting though.
> Are you saying !task || tas->exit_state is enough

If PIDFD_THREAD then I think it is enough. Otherwise we still need
!task || (exit_state && thread_group_empty)

> and we shouldn't use
> the helper that was added in commit 38fd525a4c61 ("exit: Factor
> thread_group_exited out of pidfd_poll"). If so what does that buy us
> open-coding the check instead of using that helper? Is there an actual
> bug here?

The patch adds the new xxx_exited(task, excl) helper which checks

	!task || (exit_state && (excl || thread_group_empty))

yes, the naming is not good.

> > Well, I didn't say this is a problem. I simply do not know how/why people
> > use pidfd_poll().
>
> Sorry, I just have a hard time understanding what you wanted then. :)
>
> "I guess it is too late to change this behavior." made it sound like a)
> there's a problem and b) that you would prefer to change behavior. Thus,
> it seems that wait(WNOHANG) hanging when a traced leader of an empty
> thread-group has exited is a problem in your eyes.

Again, I mostly tried to argue with do_notify_pidfd() called by realese_task().

I think that with PIDFD_THREAD set pidfd_poll() should succeed right
after the exiting thread becomes a zombie (passes exit_notify), whether
it is a leader or not.

Let me quote part of my reply to Tycho's patch

	> +	/*
	> +	 * If we're not the leader, notify any waiters on our pidfds. Note that
	> +	 * we don't want to notify the leader until /everyone/ in the thread
	> +	 * group is dead, viz. the condition below.
	> +	 *
	> +	 * We have to do this here, since __exit_signal() will
	> +	 * __unhash_processes(), and break do_notify_pidfd()'s lookup.
	> +	 */
	> +	if (!thread_group_leader(p))
	> +		do_notify_pidfd(p);

	This doesn't look consistent.

	If the task is a group leader do_notify_pidfd() is called by exit_notify()
	when it becomes a zombie (if no other threads), before it is reaped by its
	parent (unless autoreap).

	If it is a sub-thread, it is called by release_task() above. Note that a
	sub-thread can become a zombie too if it is traced.

Not to mention that this is racy.

I would not mind if we simply move do_notify_pidfd() from exit_notify() to
release_task() and do it regardless of thread_group_leader(). And in some
sense this looks more logical to me.

But as I said:

	- I do not know how/why people actually use poll(pidfd)

	- it is too late to change the current behaviour

Sorry for confusion.

> I'm not sure whether you remember that but when we originally did the
> pidfd work you and I discussed thread support and already decided back
> then that having a flag like PIDFD_THREAD would likely be the way to go.

All I can recall is that, yes, we had some discussions about pidfd in
the past ;)

> The PIDFD_THREAD flag would be would be interesting because we could
> make pidfd_send_signal() support this flag

Agreed,

Oleg.
  
Oleg Nesterov Jan. 26, 2024, 2:33 p.m. UTC | #16
On 01/26, Christian Brauner wrote:
>
> > --- a/include/uapi/linux/pidfd.h
> > +++ b/include/uapi/linux/pidfd.h
> > @@ -7,6 +7,7 @@
> >  #include <linux/fcntl.h>
> >
> >  /* Flags for pidfd_open().  */
> > -#define PIDFD_NONBLOCK O_NONBLOCK
> > +#define PIDFD_NONBLOCK	O_NONBLOCK
> > +#define PIDFD_THREAD	O_EXCL	// or anything else not used by anon_inode's
>
> I like it!
>
> The only request I would have is to not alias O_EXCL and PIDFD_THREAD.
> Because it doesn't map as clearly as NONBLOCK did.

But it would be nice to have PIDFD_THREAD in file->f_flags. Where else
can we keep it?

I chose O_EXCL because it can only be used at open time, it can never
be used or changed after anon_inode_getfile(), so we can safely do

	pidfd_file->f_flags |= PIDFD_THREAD;

in __pidfd_prepare() and then check in pidfd_poll/pidfd_send_signal.

What do you suggest instead?

Oleg.
  
Tycho Andersen Jan. 26, 2024, 9:50 p.m. UTC | #17
Hi Oleg,

On Thu, Jan 25, 2024 at 03:08:31PM +0100, Oleg Nesterov wrote:
> What do you think?

Thank you, it passes all my tests.

> +	/* unnecessary if do_notify_parent() was already called,
> +	   we can do better */
> +	do_notify_pidfd(tsk);

"do better" here could be something like,

diff --git a/kernel/exit.c b/kernel/exit.c
index efe8f1d3a6af..7e545393f2f5 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -742,6 +742,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 	bool autoreap;
 	struct task_struct *p, *n;
 	LIST_HEAD(dead);
+	bool needs_notify = true;
 
 	write_lock_irq(&tasklist_lock);
 	forget_original_parent(tsk, &dead);
@@ -756,16 +757,21 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 				!ptrace_reparented(tsk) ?
 			tsk->exit_signal : SIGCHLD;
 		autoreap = do_notify_parent(tsk, sig);
+		needs_notify = false;
 	} else if (thread_group_leader(tsk)) {
-		autoreap = thread_group_empty(tsk) &&
-			do_notify_parent(tsk, tsk->exit_signal);
+		autoreap = false;
+		if (thread_group_empty(tsk)) {
+			autoreap = do_notify_parent(tsk, tsk->exit_signal);
+			needs_notify = false;
+		}
 	} else {
 		autoreap = true;
 	}
 
 	/* unnecessary if do_notify_parent() was already called,
 	   we can do better */
-	do_notify_pidfd(tsk);
+	if (needs_notify)
+		do_notify_pidfd(tsk);
 
 	if (autoreap) {
 		tsk->exit_state = EXIT_DEAD;


but even with that, there's other calls in the tree to
do_notify_parent() that might double notify.

This brings up another interesting behavior that I noticed while
testing this, if you do a poll() on pidfd, followed quickly by a
pidfd_getfd() on the same thread you just got an event on, you can
sometimes get an EBADF from __pidfd_fget() instead of the more
expected ESRCH higher up the stack.

I wonder if it makes sense to abuse ->f_flags to add a PIDFD_NOTIFIED?
Then we can refuse further pidfd syscall operations in a sane way, and
also "do better" above by checking this flag from do_pidfd_notify()
before doing it again?

Tycho
  
Oleg Nesterov Jan. 27, 2024, 10:54 a.m. UTC | #18
Hi Tycho,

On 01/26, Tycho Andersen wrote:
>
> On Thu, Jan 25, 2024 at 03:08:31PM +0100, Oleg Nesterov wrote:
> > What do you think?
>
> Thank you, it passes all my tests.

Great, thanks!

OK, I'll make v2 on top of the recent
"pidfd: cleanup the usage of __pidfd_prepare's flags"

but we need to finish our discussion with Christian about the
usage of O_EXCL.

As for clone(CLONE_PIDFD | CLONE_THREAD), this is trivial but
I think this needs another discussion too, lets do this later.

> > +	/* unnecessary if do_notify_parent() was already called,
> > +	   we can do better */
> > +	do_notify_pidfd(tsk);
>
> "do better" here could be something like,
>
> [...snip...]

No, no, please see below.

For the moment, please forget about PIDFD_THREAD, lets discuss
the current behaviour.

> but even with that, there's other calls in the tree to
> do_notify_parent() that might double notify.

Yes, and we can't avoid this. Well, perhaps do_notify_parent()
can do something like

	if (ptrace_reparented())
		do_notify_pidfd();

so that only the "final" do_notify_parent() does do_notify_pidfd()
but this needs another discussion and in fact I don't think this
would be right or make much sense. Lets forget this for now.

Now. Even without PIDFD_THREAD, I think it makes sense to change
do_notify_parent() to do

	if (thread_group_empty(tsk))
		do_notify_pidfd(tsk);

thread_group_empty(tsk) can only be true if tsk is a group leader
and it is the last thread. And this is exactly what pidfd_poll()
currently needs.

In fact I'd even prefer to do this in a separate patch for the
documentation purposes.

Now, PIDFD_THREAD can just add

	if (!thread_group_empty(tsk))
		do_notify_pidfd(tsk);

right after "tsk->exit_state = EXIT_ZOMBIE", that is all.

This also preserves the do_notify_pidfd/__wake_up_parent ordering.
Not that I think this is important, just for consistency.

> This brings up another interesting behavior that I noticed while
> testing this, if you do a poll() on pidfd, followed quickly by a
> pidfd_getfd() on the same thread you just got an event on, you can
> sometimes get an EBADF from __pidfd_fget() instead of the more
> expected ESRCH higher up the stack.

exit_notify() is called after exit_files(). pidfd_getfd() returns
ESRCH if the exiting thread completes release_task(), otherwise it
returns EBADF because ->files == NULL. This too doesn't really
depend on PIDFD_THREAD.

> I wonder if it makes sense to abuse ->f_flags to add a PIDFD_NOTIFIED?
> Then we can refuse further pidfd syscall operations in a sane way, and

But how? We only have "struct pid *", how can we find all files
"attached" to this pid?

> also "do better" above by checking this flag from do_pidfd_notify()
> before doing it again?

and even it was possible, I don't think it makes a lot of sense, see
also above.

but perhaps I understood you...

Oleg.
  
Christian Brauner Jan. 27, 2024, 2:26 p.m. UTC | #19
On Fri, Jan 26, 2024 at 03:33:50PM +0100, Oleg Nesterov wrote:
> On 01/26, Christian Brauner wrote:
> >
> > > --- a/include/uapi/linux/pidfd.h
> > > +++ b/include/uapi/linux/pidfd.h
> > > @@ -7,6 +7,7 @@
> > >  #include <linux/fcntl.h>
> > >
> > >  /* Flags for pidfd_open().  */
> > > -#define PIDFD_NONBLOCK O_NONBLOCK
> > > +#define PIDFD_NONBLOCK	O_NONBLOCK
> > > +#define PIDFD_THREAD	O_EXCL	// or anything else not used by anon_inode's
> >
> > I like it!
> >
> > The only request I would have is to not alias O_EXCL and PIDFD_THREAD.
> > Because it doesn't map as clearly as NONBLOCK did.
> 
> But it would be nice to have PIDFD_THREAD in file->f_flags. Where else
> can we keep it?

No, I did just mean that the uapi value doesn't necessarily have to
reflect what we do internally. IOW, we can still raise O_EXCL internally
in ->f_flags but there's no need to expose it as O_EXCL to userspace. We
often have internal and external flag spaces. If you prefer it your way
I'm not going argue.

> I chose O_EXCL because it can only be used at open time, it can never
> be used or changed after anon_inode_getfile(), so we can safely do
> 
> 	pidfd_file->f_flags |= PIDFD_THREAD;
> 
> in __pidfd_prepare() and then check in pidfd_poll/pidfd_send_signal.
> 
> What do you suggest instead?

(Long-term and unrelated to this here, I think we will need to consider
not just stashing struct pid in pidfd_file->private_data but instead
struct pid with additional data because there's various functionality
that users would like that requires additional state to be stored and we
can't or don't want to do that in struct pid directly.)
  
Christian Brauner Jan. 27, 2024, 2:33 p.m. UTC | #20
On Sat, Jan 27, 2024 at 11:54:32AM +0100, Oleg Nesterov wrote:
> Hi Tycho,
> 
> On 01/26, Tycho Andersen wrote:
> >
> > On Thu, Jan 25, 2024 at 03:08:31PM +0100, Oleg Nesterov wrote:
> > > What do you think?
> >
> > Thank you, it passes all my tests.
> 
> Great, thanks!
> 
> OK, I'll make v2 on top of the recent
> "pidfd: cleanup the usage of __pidfd_prepare's flags"
> 
> but we need to finish our discussion with Christian about the
> usage of O_EXCL.

Just write the patch exactly like you want. If it's as a little a detail
as the uapi flag value we can just always change that when applying.
There's no reason to introduce artificial delays because of my
preference here..

> 
> As for clone(CLONE_PIDFD | CLONE_THREAD), this is trivial but
> I think this needs another discussion too, lets do this later.
> 
> > > +	/* unnecessary if do_notify_parent() was already called,
> > > +	   we can do better */
> > > +	do_notify_pidfd(tsk);
> >
> > "do better" here could be something like,
> >
> > [...snip...]
> 
> No, no, please see below.
> 
> For the moment, please forget about PIDFD_THREAD, lets discuss
> the current behaviour.
> 
> > but even with that, there's other calls in the tree to
> > do_notify_parent() that might double notify.
> 
> Yes, and we can't avoid this. Well, perhaps do_notify_parent()
> can do something like
> 
> 	if (ptrace_reparented())
> 		do_notify_pidfd();
> 
> so that only the "final" do_notify_parent() does do_notify_pidfd()
> but this needs another discussion and in fact I don't think this
> would be right or make much sense. Lets forget this for now.
> 
> Now. Even without PIDFD_THREAD, I think it makes sense to change
> do_notify_parent() to do
> 
> 	if (thread_group_empty(tsk))
> 		do_notify_pidfd(tsk);
> 
> thread_group_empty(tsk) can only be true if tsk is a group leader
> and it is the last thread. And this is exactly what pidfd_poll()
> currently needs.
> 
> In fact I'd even prefer to do this in a separate patch for the
> documentation purposes.
> 
> Now, PIDFD_THREAD can just add
> 
> 	if (!thread_group_empty(tsk))
> 		do_notify_pidfd(tsk);
> 
> right after "tsk->exit_state = EXIT_ZOMBIE", that is all.

Sounds good.

> 
> This also preserves the do_notify_pidfd/__wake_up_parent ordering.
> Not that I think this is important, just for consistency.
> 
> > This brings up another interesting behavior that I noticed while
> > testing this, if you do a poll() on pidfd, followed quickly by a
> > pidfd_getfd() on the same thread you just got an event on, you can
> > sometimes get an EBADF from __pidfd_fget() instead of the more
> > expected ESRCH higher up the stack.
> 
> exit_notify() is called after exit_files(). pidfd_getfd() returns
> ESRCH if the exiting thread completes release_task(), otherwise it
> returns EBADF because ->files == NULL. This too doesn't really
> depend on PIDFD_THREAD.
> 
> > I wonder if it makes sense to abuse ->f_flags to add a PIDFD_NOTIFIED?
> > Then we can refuse further pidfd syscall operations in a sane way, and
> 
> But how? We only have "struct pid *", how can we find all files
> "attached" to this pid?

We can't. There's some use-cases that would make this desirable but that
would mean we would need another data structure to track this. I once
toyed with a patch for this but never got so excited about it to
actually finish it.
  
Tycho Andersen Jan. 27, 2024, 3:55 p.m. UTC | #21
On Sat, Jan 27, 2024 at 11:54:32AM +0100, Oleg Nesterov wrote:
> Hi Tycho,
> 
> On 01/26, Tycho Andersen wrote:
> >
> > On Thu, Jan 25, 2024 at 03:08:31PM +0100, Oleg Nesterov wrote:
> > > What do you think?
> >
> > Thank you, it passes all my tests.
> 
> Great, thanks!
> 
> OK, I'll make v2 on top of the recent
> "pidfd: cleanup the usage of __pidfd_prepare's flags"
> 
> but we need to finish our discussion with Christian about the
> usage of O_EXCL.
> 
> As for clone(CLONE_PIDFD | CLONE_THREAD), this is trivial but
> I think this needs another discussion too, lets do this later.
> 
> > > +	/* unnecessary if do_notify_parent() was already called,
> > > +	   we can do better */
> > > +	do_notify_pidfd(tsk);
> >
> > "do better" here could be something like,
> >
> > [...snip...]
> 
> No, no, please see below.
> 
> For the moment, please forget about PIDFD_THREAD, lets discuss
> the current behaviour.
> 
> > but even with that, there's other calls in the tree to
> > do_notify_parent() that might double notify.
> 
> Yes, and we can't avoid this. Well, perhaps do_notify_parent()
> can do something like
> 
> 	if (ptrace_reparented())
> 		do_notify_pidfd();
> 
> so that only the "final" do_notify_parent() does do_notify_pidfd()
> but this needs another discussion and in fact I don't think this
> would be right or make much sense. Lets forget this for now.

It seems like (and the current pidfd_test enforces for some cases) we
want exactly one notification for a task dying. I don't understand
how we guarantee this now, with all of these calls.

> > This brings up another interesting behavior that I noticed while
> > testing this, if you do a poll() on pidfd, followed quickly by a
> > pidfd_getfd() on the same thread you just got an event on, you can
> > sometimes get an EBADF from __pidfd_fget() instead of the more
> > expected ESRCH higher up the stack.
> 
> exit_notify() is called after exit_files(). pidfd_getfd() returns
> ESRCH if the exiting thread completes release_task(), otherwise it
> returns EBADF because ->files == NULL. This too doesn't really
> depend on PIDFD_THREAD.

Yup, understood. It just seems like an inconsistency we might want to
fix.

> > I wonder if it makes sense to abuse ->f_flags to add a PIDFD_NOTIFIED?
> > Then we can refuse further pidfd syscall operations in a sane way, and
> 
> But how? We only have "struct pid *", how can we find all files
> "attached" to this pid?

Yeah, we'd need some other linkage as Christian points out. But if
there is a predicate we can write that says whether this task has been
notified or not, it's not necessary. I just don't understand what that
is. But maybe your patch will make it clearer.

Tycho
  
Oleg Nesterov Jan. 27, 2024, 4:31 p.m. UTC | #22
On 01/27, Tycho Andersen wrote:
>
> It seems like (and the current pidfd_test enforces for some cases)

Which pidfd_test ?

> we
> want exactly one notification for a task dying.

This can't be right. EVERY user of poll_wait() or wait_event/etc
must handle/tolerate the false wakeups.

> I don't understand
> how we guarantee this now, with all of these calls.

I don't understand why do we need or even want to guarantee this.

The extra wakeup must be always fine correctness-wise. Sure, it
would be nice to avoid the unnecessary wakeups, and perhaps we
can change wake_up_all() to pass a key to, say, only wake_up the
PIDFD_THREAD waiters from exit_notify(). But certainly this is
outside the scope of PIDFD_THREAD change we discuss.

The changes in do_notify_parent() (I have already sent the patch) and
in exit_notify() (proposed in my previous email) just ensure that,
with the minimal changes, we avoid 2 do_notify_pidfd's from the same
exit_notify() path.

> > exit_notify() is called after exit_files(). pidfd_getfd() returns
> > ESRCH if the exiting thread completes release_task(), otherwise it
> > returns EBADF because ->files == NULL. This too doesn't really
> > depend on PIDFD_THREAD.
>
> Yup, understood. It just seems like an inconsistency we might want to
> fix.

Not sure this worth "fixing"...

Oleg.
  
Tycho Andersen Jan. 27, 2024, 5:20 p.m. UTC | #23
On Sat, Jan 27, 2024 at 05:31:39PM +0100, Oleg Nesterov wrote:
> On 01/27, Tycho Andersen wrote:
> >
> > It seems like (and the current pidfd_test enforces for some cases)
> 
> Which pidfd_test ?

I was thinking of poll_pidfd() in pidfd_test.c, but,

> > we
> > want exactly one notification for a task dying.
> 
> This can't be right. EVERY user of poll_wait() or wait_event/etc
> must handle/tolerate the false wakeups.

you're right, it doesn't enforce exactly once.

> > I don't understand
> > how we guarantee this now, with all of these calls.
> 
> I don't understand why do we need or even want to guarantee this.
> 
> The extra wakeup must be always fine correctness-wise. Sure, it
> would be nice to avoid the unnecessary wakeups, and perhaps we
> can change wake_up_all() to pass a key to, say, only wake_up the
> PIDFD_THREAD waiters from exit_notify(). But certainly this is
> outside the scope of PIDFD_THREAD change we discuss.
> 
> The changes in do_notify_parent() (I have already sent the patch) and
> in exit_notify() (proposed in my previous email) just ensure that,
> with the minimal changes, we avoid 2 do_notify_pidfd's from the same
> exit_notify() path.

Sounds good.

> > > exit_notify() is called after exit_files(). pidfd_getfd() returns
> > > ESRCH if the exiting thread completes release_task(), otherwise it
> > > returns EBADF because ->files == NULL. This too doesn't really
> > > depend on PIDFD_THREAD.
> >
> > Yup, understood. It just seems like an inconsistency we might want to
> > fix.
> 
> Not sure this worth "fixing"...

Yep, maybe not. Just wanted to point it out.

Tycho
  
Oleg Nesterov Jan. 27, 2024, 7:31 p.m. UTC | #24
On 01/27, Tycho Andersen wrote:
>
> > > > exit_notify() is called after exit_files(). pidfd_getfd() returns
> > > > ESRCH if the exiting thread completes release_task(), otherwise it
> > > > returns EBADF because ->files == NULL. This too doesn't really
> > > > depend on PIDFD_THREAD.
> > >
> > > Yup, understood. It just seems like an inconsistency we might want to
> > > fix.
> >
> > Not sure this worth "fixing"...
>
> Yep, maybe not. Just wanted to point it out.

On the second thought I am starting to understand your concern...

Indeed, in this case -EBADF is technically correct but it can confuse
the user which doesn't or can't know that this task/thread is exiting,
because EBADF looks as if the "int fd" argument was wrong.

Sorry I missed your point before.

Oleg.
  
Tycho Andersen Jan. 27, 2024, 8:44 p.m. UTC | #25
On Sat, Jan 27, 2024 at 08:31:27PM +0100, Oleg Nesterov wrote:
> On 01/27, Tycho Andersen wrote:
> >
> > > > > exit_notify() is called after exit_files(). pidfd_getfd() returns
> > > > > ESRCH if the exiting thread completes release_task(), otherwise it
> > > > > returns EBADF because ->files == NULL. This too doesn't really
> > > > > depend on PIDFD_THREAD.
> > > >
> > > > Yup, understood. It just seems like an inconsistency we might want to
> > > > fix.
> > >
> > > Not sure this worth "fixing"...
> >
> > Yep, maybe not. Just wanted to point it out.
> 
> On the second thought I am starting to understand your concern...
> 
> Indeed, in this case -EBADF is technically correct but it can confuse
> the user which doesn't or can't know that this task/thread is exiting,
> because EBADF looks as if the "int fd" argument was wrong.
> 
> Sorry I missed your point before.

No worries. I realized it's not so hard to fix with your new
xxx_exited() helper from the PIDFD_THREAD patch, so maybe worth
cleaning up after all?

Tycho
  
Oleg Nesterov Jan. 27, 2024, 9:10 p.m. UTC | #26
On 01/27, Tycho Andersen wrote:
>
> On Sat, Jan 27, 2024 at 08:31:27PM +0100, Oleg Nesterov wrote:
> >
> > On the second thought I am starting to understand your concern...
> >
> > Indeed, in this case -EBADF is technically correct but it can confuse
> > the user which doesn't or can't know that this task/thread is exiting,
> > because EBADF looks as if the "int fd" argument was wrong.
> >
> > Sorry I missed your point before.
>
> No worries. I realized it's not so hard to fix with your new
> xxx_exited() helper from the PIDFD_THREAD patch, so maybe worth
> cleaning up after all?

OK, lets discuss this later.

I'll (hopefully) send v2 on top of

	pidfd: cleanup the usage of __pidfd_prepare's flags
	pidfd: don't do_notify_pidfd() if !thread_group_empty()

on Monday, will be busy tomorrow (family duties ;)

Oleg.
  

Patch

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 4b7664c56208..d752f003a69a 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -333,6 +333,7 @@  extern int kill_pid_usb_asyncio(int sig, int errno, sigval_t addr, struct pid *,
 extern int kill_pgrp(struct pid *pid, int sig, int priv);
 extern int kill_pid(struct pid *pid, int sig, int priv);
 extern __must_check bool do_notify_parent(struct task_struct *, int);
+extern void do_notify_pidfd(struct task_struct *task);
 extern void __wake_up_parent(struct task_struct *p, struct task_struct *parent);
 extern void force_sig(int);
 extern void force_fatal_sig(int);
diff --git a/kernel/exit.c b/kernel/exit.c
index 3988a02efaef..90117d7861f4 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -256,6 +256,17 @@  void release_task(struct task_struct *p)
 	write_lock_irq(&tasklist_lock);
 	ptrace_release_task(p);
 	thread_pid = get_pid(p->thread_pid);
+
+	/*
+	 * If we're not the leader, notify any waiters on our pidfds. Note that
+	 * we don't want to notify the leader until /everyone/ in the thread
+	 * group is dead, viz. the condition below.
+	 *
+	 * We have to do this here, since __exit_signal() will
+	 * __unhash_processes(), and break do_notify_pidfd()'s lookup.
+	 */
+	if (!thread_group_leader(p))
+		do_notify_pidfd(p);
 	__exit_signal(p);
 
 	/*
diff --git a/kernel/fork.c b/kernel/fork.c
index 47ff3b35352e..44969cd472f0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2157,8 +2157,6 @@  static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
  * Allocate a new file that stashes @pid and reserve a new pidfd number in the
  * caller's file descriptor table. The pidfd is reserved but not installed yet.
  *
- * The helper verifies that @pid is used as a thread group leader.
- *
  * If this function returns successfully the caller is responsible to either
  * call fd_install() passing the returned pidfd and pidfd file as arguments in
  * order to install the pidfd into its file descriptor table or they must use
@@ -2176,7 +2174,7 @@  static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
  */
 int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
 {
-	if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
+	if (!pid)
 		return -EINVAL;
 
 	return __pidfd_prepare(pid, flags, ret);
diff --git a/kernel/pid.c b/kernel/pid.c
index b52b10865454..b55c0adf457b 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -552,11 +552,6 @@  struct pid *pidfd_get_pid(unsigned int fd, unsigned int *flags)
  * Return the task associated with @pidfd. The function takes a reference on
  * the returned task. The caller is responsible for releasing that reference.
  *
- * Currently, the process identified by @pidfd is always a thread-group leader.
- * This restriction currently exists for all aspects of pidfds including pidfd
- * creation (CLONE_PIDFD cannot be used with CLONE_THREAD) and pidfd polling
- * (only supports thread group leaders).
- *
  * Return: On success, the task_struct associated with the pidfd.
  *	   On error, a negative errno number will be returned.
  */
@@ -615,11 +610,7 @@  int pidfd_create(struct pid *pid, unsigned int flags)
  * @flags: flags to pass
  *
  * This creates a new pid file descriptor with the O_CLOEXEC flag set for
- * the process identified by @pid. Currently, the process identified by
- * @pid must be a thread-group leader. This restriction currently exists
- * for all aspects of pidfds including pidfd creation (CLONE_PIDFD cannot
- * be used with CLONE_THREAD) and pidfd polling (only supports thread group
- * leaders).
+ * the process identified by @pid.
  *
  * Return: On success, a cloexec pidfd is returned.
  *         On error, a negative errno number will be returned.
diff --git a/kernel/signal.c b/kernel/signal.c
index c9c57d053ce4..3e3c9d0fa3a5 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2019,7 +2019,7 @@  int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
 	return ret;
 }
 
-static void do_notify_pidfd(struct task_struct *task)
+void do_notify_pidfd(struct task_struct *task)
 {
 	struct pid *pid;