pid: allow pidfds for reaped tasks

Message ID 20230807085203.819772-1-david@readahead.eu
State New
Headers
Series pid: allow pidfds for reaped tasks |

Commit Message

David Rheinsberg Aug. 7, 2023, 8:52 a.m. UTC
  A pidfd can currently only be created for tasks that are thread-group
leaders and not reaped. This patch changes the pidfd-core to allow for
pidfds on reapead thread-group leaders as well.

A pidfd can outlive the task it refers to, and thus user-space must
already be prepared that the task underlying a pidfd is gone at the time
they get their hands on the pidfd. For instance, resolving the pidfd to
a PID via the fdinfo must be prepared to read `-1`.

Despite user-space knowing that a pidfd might be stale, several kernel
APIs currently add another layer that checks for this. In particular,
SO_PEERPIDFD returns `EINVAL` if the peer-task was already reaped,
but returns a stale pidfd if the task is reaped immediately after the
respective alive-check.

This has the unfortunate effect that user-space now has two ways to
check for the exact same scenario: A syscall might return
EINVAL/ESRCH/... *or* the pidfd might be stale, even though there is no
particular reason to distinguish both cases. This also propagates
through user-space APIs, which pass on pidfds. They must be prepared to
pass on `-1` *or* the pidfd, because there is no guaranteed way to get a
stale pidfd from the kernel.

This patch changes the core pidfd helpers to allow creation of pidfds
even if the PID is no longer linked to any task. This only affects one
of the three pidfd users that currently exist:

 1) fanotify already tests for a linked TGID-task manually before
    creating the PIDFD, thus it is not directly affected by this change.
    However, note that the current fanotify code fails with an error if
    the target process is reaped exactly between the TGID-check in
    fanotify and the test in pidfd_prepare(). With this patch, this
    will no longer be the case.

 2) pidfd_open(2) calls find_get_pid() before creating the pidfd, thus
    it is also not directly affected by this change.
    Again, similar to fanotify, there is a race between the
    find_get_pid() call and pidfd_prepare(), which currently causes
    pidfd_open(2) to return EINVAL rather than ESRCH if the process is
    reaped just between those two checks. With this patch, this will no
    longer be the case.

 3) SO_PEERPIDFD will be affected by this change and from now on return
    stale pidfds rather than EINVAL if the respective peer task is
    reaped already.

Given that users of SO_PEERPIDFD must already deal with stale pidfds,
this change hopefully simplifies the API of SO_PEERPIDFD, and all
dependent user-space APIs (e.g., GetConnectionCredentials() on D-Bus
driver APIs). Also note that SO_PEERPIDFD is still pending to be
released with linux-6.5.

Signed-off-by: David Rheinsberg <david@readahead.eu>
---
 kernel/fork.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)
  

Comments

Alexander Mikhalitsyn Aug. 7, 2023, 9:01 a.m. UTC | #1
On Mon, Aug 7, 2023 at 10:52 AM David Rheinsberg <david@readahead.eu> wrote:
>
> A pidfd can currently only be created for tasks that are thread-group
> leaders and not reaped. This patch changes the pidfd-core to allow for
> pidfds on reapead thread-group leaders as well.
>
> A pidfd can outlive the task it refers to, and thus user-space must
> already be prepared that the task underlying a pidfd is gone at the time
> they get their hands on the pidfd. For instance, resolving the pidfd to
> a PID via the fdinfo must be prepared to read `-1`.
>
> Despite user-space knowing that a pidfd might be stale, several kernel
> APIs currently add another layer that checks for this. In particular,
> SO_PEERPIDFD returns `EINVAL` if the peer-task was already reaped,
> but returns a stale pidfd if the task is reaped immediately after the
> respective alive-check.
>
> This has the unfortunate effect that user-space now has two ways to
> check for the exact same scenario: A syscall might return
> EINVAL/ESRCH/... *or* the pidfd might be stale, even though there is no
> particular reason to distinguish both cases. This also propagates
> through user-space APIs, which pass on pidfds. They must be prepared to
> pass on `-1` *or* the pidfd, because there is no guaranteed way to get a
> stale pidfd from the kernel.
>
> This patch changes the core pidfd helpers to allow creation of pidfds
> even if the PID is no longer linked to any task. This only affects one
> of the three pidfd users that currently exist:
>
>  1) fanotify already tests for a linked TGID-task manually before
>     creating the PIDFD, thus it is not directly affected by this change.
>     However, note that the current fanotify code fails with an error if
>     the target process is reaped exactly between the TGID-check in
>     fanotify and the test in pidfd_prepare(). With this patch, this
>     will no longer be the case.
>
>  2) pidfd_open(2) calls find_get_pid() before creating the pidfd, thus
>     it is also not directly affected by this change.
>     Again, similar to fanotify, there is a race between the
>     find_get_pid() call and pidfd_prepare(), which currently causes
>     pidfd_open(2) to return EINVAL rather than ESRCH if the process is
>     reaped just between those two checks. With this patch, this will no
>     longer be the case.
>
>  3) SO_PEERPIDFD will be affected by this change and from now on return
>     stale pidfds rather than EINVAL if the respective peer task is
>     reaped already.
>
> Given that users of SO_PEERPIDFD must already deal with stale pidfds,
> this change hopefully simplifies the API of SO_PEERPIDFD, and all
> dependent user-space APIs (e.g., GetConnectionCredentials() on D-Bus
> driver APIs). Also note that SO_PEERPIDFD is still pending to be
> released with linux-6.5.
>
> Signed-off-by: David Rheinsberg <david@readahead.eu>
> ---
>  kernel/fork.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index d2e12b6d2b18..4dde19a8c264 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2161,7 +2161,7 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
>   * Allocate a new file that stashes @pid and reserve a new pidfd number in the
>   * caller's file descriptor table. The pidfd is reserved but not installed yet.
>   *
> - * The helper verifies that @pid is used as a thread group leader.
> + * The helper verifies that @pid is/was used as a thread group leader.
>   *
>   * If this function returns successfully the caller is responsible to either
>   * call fd_install() passing the returned pidfd and pidfd file as arguments in
> @@ -2180,7 +2180,14 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
>   */
>  int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
>  {
> -       if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
> +       if (!pid)
> +               return -EINVAL;
> +
> +       /*
> +        * Non thread-group leaders cannot have pidfds, but we allow them for
> +        * reaped thread-group leaders.
> +        */
> +       if (pid_has_task(pid, PIDTYPE_PID) && !pid_has_task(pid, PIDTYPE_TGID))
>                 return -EINVAL;

Hi David!

As far as I understand, __unhash_process is always called with a
tasklist_lock held for writing.
Don't we need to take tasklist_lock for reading here to guarantee
consistency between
pid_has_task(pid, PIDTYPE_PID) and pid_has_task(pid, PIDTYPE_TGID)
return values?

Kind regards,
Alex

>
>         return __pidfd_prepare(pid, flags, ret);
> --
> 2.41.0
>
  
David Rheinsberg Aug. 7, 2023, 9:12 a.m. UTC | #2
Hi

On Mon, Aug 7, 2023, at 11:01 AM, Alexander Mikhalitsyn wrote:
> On Mon, Aug 7, 2023 at 10:52 AM David Rheinsberg <david@readahead.eu> wrote:
[...]
>>  int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
>>  {
>> -       if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
>> +       if (!pid)
>> +               return -EINVAL;
>> +
>> +       /*
>> +        * Non thread-group leaders cannot have pidfds, but we allow them for
>> +        * reaped thread-group leaders.
>> +        */
>> +       if (pid_has_task(pid, PIDTYPE_PID) && !pid_has_task(pid, PIDTYPE_TGID))
>>                 return -EINVAL;
>
> Hi David!
>
> As far as I understand, __unhash_process is always called with a
> tasklist_lock held for writing.
> Don't we need to take tasklist_lock for reading here to guarantee
> consistency between
> pid_has_task(pid, PIDTYPE_PID) and pid_has_task(pid, PIDTYPE_TGID)
> return values?

You mean PIDTYPE_TGID being cleared before PIDTYPE_PID (at least from the perspective of the unlocked reader)? I don't think it is a compatibility issue, because the same issue existed before the patch. But it might indeed be required to avoid spurious EINVAL _while_ the target process is reaped.

It would be unfortunate if we need that. Because it is really not required for AF_UNIX or fanotify (they guarantee that they always deal with TGIDs). So maybe the correct call is to just drop pidfd_prepare() and always use __pidfd_prepare()? So far the safety-measures of pidfd_prepare() introduced two races I already mentioned in the commit-message. So maybe it is just better to document that the caller of __pidfd_prepare() needs to ensure the source is/was a TGID?

Thanks
David
  
Alexander Mikhalitsyn Aug. 7, 2023, 9:31 a.m. UTC | #3
On Mon, Aug 7, 2023 at 11:12 AM David Rheinsberg <david@readahead.eu> wrote:
>
> Hi
>
> On Mon, Aug 7, 2023, at 11:01 AM, Alexander Mikhalitsyn wrote:
> > On Mon, Aug 7, 2023 at 10:52 AM David Rheinsberg <david@readahead.eu> wrote:
> [...]
> >>  int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
> >>  {
> >> -       if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
> >> +       if (!pid)
> >> +               return -EINVAL;
> >> +
> >> +       /*
> >> +        * Non thread-group leaders cannot have pidfds, but we allow them for
> >> +        * reaped thread-group leaders.
> >> +        */
> >> +       if (pid_has_task(pid, PIDTYPE_PID) && !pid_has_task(pid, PIDTYPE_TGID))
> >>                 return -EINVAL;
> >
> > Hi David!
> >
> > As far as I understand, __unhash_process is always called with a
> > tasklist_lock held for writing.
> > Don't we need to take tasklist_lock for reading here to guarantee
> > consistency between
> > pid_has_task(pid, PIDTYPE_PID) and pid_has_task(pid, PIDTYPE_TGID)
> > return values?
>
> You mean PIDTYPE_TGID being cleared before PIDTYPE_PID (at least from the perspective of the unlocked reader)? I don't think it is a compatibility issue, because the same issue existed before the patch. But it might indeed be required to avoid spurious EINVAL _while_ the target process is reaped.

Yes, that was my thought. At the same time we can see that
__unhash_process() function at first detaches PIDTYPE_PID and then
PIDTYPE_TGID.
But without having any kind of memory barrier (and locks are also
implicit memory barriers) we can't be sure that inconsistency won't
happen here.

>
> It would be unfortunate if we need that. Because it is really not required for AF_UNIX or fanotify (they guarantee that they always deal with TGIDs). So maybe the correct call is to just drop pidfd_prepare() and always use __pidfd_prepare()? So far the safety-measures of pidfd_prepare() introduced two races I already mentioned in the commit-message. So maybe it is just better to document that the caller of __pidfd_prepare() needs to ensure the source is/was a TGID?

Do you think that taking read_lock(&tasklist_lock) can cause any
issues with contention on it?
IMHO, read_lock should be safe as we are taking it for a short period of time.

But anyways, I'm not insisting on that. I've just wanted to point this
out to discuss with you and folks.

Kind regards,
Alex

>
> Thanks
> David
  
Christian Brauner Aug. 7, 2023, 10:07 a.m. UTC | #4
Hey Oleg,

A question for you below.

On Mon, Aug 07, 2023 at 10:52:03AM +0200, David Rheinsberg wrote:
> A pidfd can currently only be created for tasks that are thread-group
> leaders and not reaped. This patch changes the pidfd-core to allow for
> pidfds on reapead thread-group leaders as well.
> 
> A pidfd can outlive the task it refers to, and thus user-space must
> already be prepared that the task underlying a pidfd is gone at the time
> they get their hands on the pidfd. For instance, resolving the pidfd to
> a PID via the fdinfo must be prepared to read `-1`.
> 
> Despite user-space knowing that a pidfd might be stale, several kernel
> APIs currently add another layer that checks for this. In particular,
> SO_PEERPIDFD returns `EINVAL` if the peer-task was already reaped,
> but returns a stale pidfd if the task is reaped immediately after the
> respective alive-check.
> 
> This has the unfortunate effect that user-space now has two ways to
> check for the exact same scenario: A syscall might return
> EINVAL/ESRCH/... *or* the pidfd might be stale, even though there is no
> particular reason to distinguish both cases. This also propagates
> through user-space APIs, which pass on pidfds. They must be prepared to
> pass on `-1` *or* the pidfd, because there is no guaranteed way to get a
> stale pidfd from the kernel.
> 
> This patch changes the core pidfd helpers to allow creation of pidfds
> even if the PID is no longer linked to any task. This only affects one
> of the three pidfd users that currently exist:
> 
>  1) fanotify already tests for a linked TGID-task manually before
>     creating the PIDFD, thus it is not directly affected by this change.
>     However, note that the current fanotify code fails with an error if
>     the target process is reaped exactly between the TGID-check in
>     fanotify and the test in pidfd_prepare(). With this patch, this
>     will no longer be the case.
> 
>  2) pidfd_open(2) calls find_get_pid() before creating the pidfd, thus
>     it is also not directly affected by this change.
>     Again, similar to fanotify, there is a race between the
>     find_get_pid() call and pidfd_prepare(), which currently causes
>     pidfd_open(2) to return EINVAL rather than ESRCH if the process is
>     reaped just between those two checks. With this patch, this will no
>     longer be the case.
> 
>  3) SO_PEERPIDFD will be affected by this change and from now on return
>     stale pidfds rather than EINVAL if the respective peer task is
>     reaped already.
> 
> Given that users of SO_PEERPIDFD must already deal with stale pidfds,
> this change hopefully simplifies the API of SO_PEERPIDFD, and all
> dependent user-space APIs (e.g., GetConnectionCredentials() on D-Bus
> driver APIs). Also note that SO_PEERPIDFD is still pending to be
> released with linux-6.5.
> 
> Signed-off-by: David Rheinsberg <david@readahead.eu>
> ---
>  kernel/fork.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index d2e12b6d2b18..4dde19a8c264 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2161,7 +2161,7 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
>   * Allocate a new file that stashes @pid and reserve a new pidfd number in the
>   * caller's file descriptor table. The pidfd is reserved but not installed yet.
>   *
> - * The helper verifies that @pid is used as a thread group leader.
> + * The helper verifies that @pid is/was used as a thread group leader.
>   *
>   * If this function returns successfully the caller is responsible to either
>   * call fd_install() passing the returned pidfd and pidfd file as arguments in
> @@ -2180,7 +2180,14 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
>   */
>  int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
>  {
> -	if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
> +	if (!pid)
> +		return -EINVAL;
> +
> +	/*
> +	 * Non thread-group leaders cannot have pidfds, but we allow them for
> +	 * reaped thread-group leaders.
> +	 */
> +	if (pid_has_task(pid, PIDTYPE_PID) && !pid_has_task(pid, PIDTYPE_TGID))
>  		return -EINVAL;

TL;DR userspace wants to be able to get a pidfd to an already reaped
thread-group leader. I don't see any issues with this.

But I'm not entirely clear how to make it safe so that we can
distinguish between @pid not being used as a thread-group leader and
PIDTYPE_TGID having already been detached from @pid. IOW, we need a
snapshot of PIDTYPE_PID and PIDTYPE_TGID so that we can compare the
returned tasks (Or another way to achieve a similar result.).

Any thoughts?
  
Oleg Nesterov Aug. 11, 2023, 11:29 a.m. UTC | #5
Hi Christian,

Sorry for delay, I've just returned from vacation and I am slowly
crawling my email backlog.



On 08/07, Christian Brauner wrote:
>
> >  int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
> >  {
> > -	if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
> > +	if (!pid)
> > +		return -EINVAL;
> > +
> > +	/*
> > +	 * Non thread-group leaders cannot have pidfds, but we allow them for
> > +	 * reaped thread-group leaders.
> > +	 */
> > +	if (pid_has_task(pid, PIDTYPE_PID) && !pid_has_task(pid, PIDTYPE_TGID))
> >  		return -EINVAL;
>
> TL;DR userspace wants to be able to get a pidfd to an already reaped
> thread-group leader. I don't see any issues with this.

I guess I need to read the whole thread carefully, but right now
I don't understand this patch and the problem...

OK, suppose we have a group leader L with pid 100 and its sub-thread
T with pid 101.

With this patch pidfd_open(101) can succeed if T exits right after
find_get_pid(101) because pid_has_task(pid, PIDTYPE_PID) above will
fail, right?

This looks wrong, 101 was never a leader pid...

Oleg.
  
Christian Brauner Aug. 11, 2023, 11:40 a.m. UTC | #6
On Fri, Aug 11, 2023 at 01:29:11PM +0200, Oleg Nesterov wrote:
> Hi Christian,
> 
> Sorry for delay, I've just returned from vacation and I am slowly

Absolutely no problem! Thanks for getting back to us.

> crawling my email backlog.
> 
> 
> 
> On 08/07, Christian Brauner wrote:
> >
> > >  int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
> > >  {
> > > -	if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
> > > +	if (!pid)
> > > +		return -EINVAL;
> > > +
> > > +	/*
> > > +	 * Non thread-group leaders cannot have pidfds, but we allow them for
> > > +	 * reaped thread-group leaders.
> > > +	 */
> > > +	if (pid_has_task(pid, PIDTYPE_PID) && !pid_has_task(pid, PIDTYPE_TGID))
> > >  		return -EINVAL;
> >
> > TL;DR userspace wants to be able to get a pidfd to an already reaped
> > thread-group leader. I don't see any issues with this.
> 
> I guess I need to read the whole thread carefully, but right now
> I don't understand this patch and the problem...
> 
> OK, suppose we have a group leader L with pid 100 and its sub-thread
> T with pid 101.
> 
> With this patch pidfd_open(101) can succeed if T exits right after
> find_get_pid(101) because pid_has_task(pid, PIDTYPE_PID) above will
> fail, right?
> 
> This looks wrong, 101 was never a leader pid...

Well, let me simplify the question:

What code do we need to allow userspace to open a pidfd to a leader pid
even if it has already been exited and reaped (without also accidently
allowing to open non-lead pid pidfds)?

I hope that clarifies?
  
Oleg Nesterov Aug. 11, 2023, 11:47 a.m. UTC | #7
As I said, I am not sure I understand the problem. And I know nothing
about net/ but...

On 08/07, Christian Brauner wrote:
>
> > SO_PEERPIDFD returns `EINVAL` if the peer-task was already reaped,
> > but returns a stale pidfd if the task is reaped immediately after the
> > respective alive-check.

after the quick grep it seems that SO_PEERPIDFD can simply use
__pidfd_prepare() ? We know that sk->sk_peer_pid was initialized with
task_tgid(current) and thus we know it is (was) a valid TGID pid.

The same is probably true for scm->pid and scm_pidfd_recv()...

But again, I am not familiar with this code, I can be wrong.

Oleg.
  
Oleg Nesterov Aug. 11, 2023, 11:57 a.m. UTC | #8
On 08/11, Christian Brauner wrote:
>
> > > >  int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
> > > >  {
> > > > -	if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
> > > > +	if (!pid)
> > > > +		return -EINVAL;
> > > > +
> > > > +	/*
> > > > +	 * Non thread-group leaders cannot have pidfds, but we allow them for
> > > > +	 * reaped thread-group leaders.
> > > > +	 */
> > > > +	if (pid_has_task(pid, PIDTYPE_PID) && !pid_has_task(pid, PIDTYPE_TGID))
> > > >  		return -EINVAL;
> > >
> > > TL;DR userspace wants to be able to get a pidfd to an already reaped
> > > thread-group leader. I don't see any issues with this.
> >
> > I guess I need to read the whole thread carefully, but right now
> > I don't understand this patch and the problem...
> >
> > OK, suppose we have a group leader L with pid 100 and its sub-thread
> > T with pid 101.
> >
> > With this patch pidfd_open(101) can succeed if T exits right after
> > find_get_pid(101) because pid_has_task(pid, PIDTYPE_PID) above will
> > fail, right?
> >
> > This looks wrong, 101 was never a leader pid...
>
> Well, let me simplify the question:

Thanks,

> What code do we need to allow userspace to open a pidfd to a leader pid
> even if it has already been exited and reaped (without also accidently
> allowing to open non-lead pid pidfds)?

I'll try to think more, but can you also explain why do we need this?

See my another email. Can't we simply shift the pid_has_task(PIDTYPE_TGID)
check from pidfd_prepare() to pidfd_create() ? (and then we can kill
pidfd_prepare and rename __pidfd_prepare to pidfd_prepare).

Oleg.
  
David Rheinsberg Aug. 14, 2023, 5:13 a.m. UTC | #9
Hi Oleg,

On Fri, Aug 11, 2023, at 1:57 PM, Oleg Nesterov wrote:
>> What code do we need to allow userspace to open a pidfd to a leader pid
>> even if it has already been exited and reaped (without also accidently
>> allowing to open non-lead pid pidfds)?
>
> I'll try to think more, but can you also explain why do we need this?
>
> See my another email. Can't we simply shift the pid_has_task(PIDTYPE_TGID)
> check from pidfd_prepare() to pidfd_create() ? (and then we can kill
> pidfd_prepare and rename __pidfd_prepare to pidfd_prepare).

Yes, the easiest solution would be to use `__pidfd_prepare()` and ensure that the caller only ever calls this on tg-leaders. This would work just fine, imo. And this was my initial approach.

I think Christian preferred an explicit assertion that ensures we do not accidentally hand out pidfds for non-tg-leaders. The question is thus whether there is an easy way to assert this even for reaped tasks? Or whether there is a simple way to flag a pid that was used as tg-leader? Or, ultimately, whether this has limited use and we should just use `__pidfd_prepare()`?

Thanks a lot!
David
  
Oleg Nesterov Aug. 14, 2023, 1:20 p.m. UTC | #10
On 08/14, David Rheinsberg wrote:
>
> Hi Oleg,
>
> On Fri, Aug 11, 2023, at 1:57 PM, Oleg Nesterov wrote:
> >> What code do we need to allow userspace to open a pidfd to a leader pid
> >> even if it has already been exited and reaped (without also accidently
> >> allowing to open non-lead pid pidfds)?
> >
> > I'll try to think more, but can you also explain why do we need this?
> >
> > See my another email. Can't we simply shift the pid_has_task(PIDTYPE_TGID)
> > check from pidfd_prepare() to pidfd_create() ? (and then we can kill
> > pidfd_prepare and rename __pidfd_prepare to pidfd_prepare).
>
> Yes, the easiest solution would be to use `__pidfd_prepare()` and ensure
> that the caller only ever calls this on tg-leaders. This would work just
> fine, imo. And this was my initial approach.

Great,

> I think Christian preferred an explicit assertion that ensures we do not
> accidentally hand out pidfds for non-tg-leaders. The question is thus whether
> there is an easy way to assert this even for reaped tasks?
> Or whether there is a simple way to flag a pid that was used as tg-leader?

I do not see how can we check if a detached pid was a leader pid, and I don't
think it makes sense to add a new member into struct pid...

> Or, ultimately, whether this has limited use and we should just use
> `__pidfd_prepare()`?

Well, if you confirm that sk->sk_peer_pid and scm->pid are always initialized with
task_tgid(current), I'd certainly prefer this approach unless Christian objects.

Oleg.
  
Alexander Mikhalitsyn Aug. 14, 2023, 1:34 p.m. UTC | #11
On Mon, Aug 14, 2023 at 3:21 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 08/14, David Rheinsberg wrote:
> >
> > Hi Oleg,
> >
> > On Fri, Aug 11, 2023, at 1:57 PM, Oleg Nesterov wrote:
> > >> What code do we need to allow userspace to open a pidfd to a leader pid
> > >> even if it has already been exited and reaped (without also accidently
> > >> allowing to open non-lead pid pidfds)?
> > >
> > > I'll try to think more, but can you also explain why do we need this?
> > >
> > > See my another email. Can't we simply shift the pid_has_task(PIDTYPE_TGID)
> > > check from pidfd_prepare() to pidfd_create() ? (and then we can kill
> > > pidfd_prepare and rename __pidfd_prepare to pidfd_prepare).
> >
> > Yes, the easiest solution would be to use `__pidfd_prepare()` and ensure
> > that the caller only ever calls this on tg-leaders. This would work just
> > fine, imo. And this was my initial approach.
>
> Great,
>
> > I think Christian preferred an explicit assertion that ensures we do not
> > accidentally hand out pidfds for non-tg-leaders. The question is thus whether
> > there is an easy way to assert this even for reaped tasks?
> > Or whether there is a simple way to flag a pid that was used as tg-leader?
>
> I do not see how can we check if a detached pid was a leader pid, and I don't
> think it makes sense to add a new member into struct pid...
>
> > Or, ultimately, whether this has limited use and we should just use
> > `__pidfd_prepare()`?
>
> Well, if you confirm that sk->sk_peer_pid and scm->pid are always initialized with
> task_tgid(current), I'd certainly prefer this approach unless Christian objects.

Dear colleagues,

I can confirm that sk->sk_peer_pid and scm->pid are always thread-group leaders.

Kind regards,
Alex

>
> Oleg.
>
  
Christian Brauner Aug. 14, 2023, 3:11 p.m. UTC | #12
On Mon, Aug 14, 2023 at 03:20:39PM +0200, Oleg Nesterov wrote:
> On 08/14, David Rheinsberg wrote:
> >
> > Hi Oleg,
> >
> > On Fri, Aug 11, 2023, at 1:57 PM, Oleg Nesterov wrote:
> > >> What code do we need to allow userspace to open a pidfd to a leader pid
> > >> even if it has already been exited and reaped (without also accidently
> > >> allowing to open non-lead pid pidfds)?
> > >
> > > I'll try to think more, but can you also explain why do we need this?
> > >
> > > See my another email. Can't we simply shift the pid_has_task(PIDTYPE_TGID)
> > > check from pidfd_prepare() to pidfd_create() ? (and then we can kill
> > > pidfd_prepare and rename __pidfd_prepare to pidfd_prepare).
> >
> > Yes, the easiest solution would be to use `__pidfd_prepare()` and ensure
> > that the caller only ever calls this on tg-leaders. This would work just
> > fine, imo. And this was my initial approach.
> 
> Great,
> 
> > I think Christian preferred an explicit assertion that ensures we do not
> > accidentally hand out pidfds for non-tg-leaders. The question is thus whether
> > there is an easy way to assert this even for reaped tasks?
> > Or whether there is a simple way to flag a pid that was used as tg-leader?
> 
> I do not see how can we check if a detached pid was a leader pid, and I don't
> think it makes sense to add a new member into struct pid...
> 
> > Or, ultimately, whether this has limited use and we should just use
> > `__pidfd_prepare()`?
> 
> Well, if you confirm that sk->sk_peer_pid and scm->pid are always initialized with
> task_tgid(current), I'd certainly prefer this approach unless Christian objects.

No no, I'm absolutely not objecting. I specifically want you to take the
opinionated lead here. :) Thanks for chiming in!
  

Patch

diff --git a/kernel/fork.c b/kernel/fork.c
index d2e12b6d2b18..4dde19a8c264 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2161,7 +2161,7 @@  static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
  * Allocate a new file that stashes @pid and reserve a new pidfd number in the
  * caller's file descriptor table. The pidfd is reserved but not installed yet.
  *
- * The helper verifies that @pid is used as a thread group leader.
+ * The helper verifies that @pid is/was used as a thread group leader.
  *
  * If this function returns successfully the caller is responsible to either
  * call fd_install() passing the returned pidfd and pidfd file as arguments in
@@ -2180,7 +2180,14 @@  static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
  */
 int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
 {
-	if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
+	if (!pid)
+		return -EINVAL;
+
+	/*
+	 * Non thread-group leaders cannot have pidfds, but we allow them for
+	 * reaped thread-group leaders.
+	 */
+	if (pid_has_task(pid, PIDTYPE_PID) && !pid_has_task(pid, PIDTYPE_TGID))
 		return -EINVAL;
 
 	return __pidfd_prepare(pid, flags, ret);