[v2,1/3] pidfd: allow pidfd_open() on non-thread-group leaders

Message ID 20231207170946.130823-1-tycho@tycho.pizza
State New
Headers
Series [v2,1/3] pidfd: allow pidfd_open() on non-thread-group leaders |

Commit Message

Tycho Andersen Dec. 7, 2023, 5:09 p.m. UTC
  From: Tycho Andersen <tandersen@netflix.com>

We are using the pidfd family of syscalls with the seccomp userspace
notifier. When some thread triggers a seccomp notification, we want to do
some things to its context (munge fd tables via pidfd_getfd(), maybe write
to its memory, etc.). However, threads created with ~CLONE_FILES or
~CLONE_VM mean that we can't use the pidfd family of syscalls for this
purpose, since their fd table or mm are distinct from the thread group
leader's. In this patch, we relax this restriction for pidfd_open().

In order to avoid dangling poll() users we need to notify pidfd waiters
when individual threads die, but once we do that all the other machinery
seems to work ok viz. the tests. But I suppose there are more cases than
just this one.

Signed-off-by: Tycho Andersen <tandersen@netflix.com>
--
v2: unify pidfd notification to all go through do_notify_pidfd() inside of
    __exit_signals() suggested by Oleg.
    Link to v1: https://lore.kernel.org/all/20231130163946.277502-1-tycho@tycho.pizza/
---
 include/linux/sched/signal.h |  1 +
 kernel/exit.c                |  3 +++
 kernel/fork.c                |  4 +---
 kernel/pid.c                 | 11 +----------
 kernel/signal.c              |  5 +----
 5 files changed, 7 insertions(+), 17 deletions(-)


base-commit: bee0e7762ad2c6025b9f5245c040fcc36ef2bde8
  

Comments

kernel test robot Dec. 11, 2023, 7:28 a.m. UTC | #1
Hello,

kernel test robot noticed "kernel-selftests.pidfd.pidfd_test.fail" on:

commit: e6d9be676d2c1fa8332c63c4382b8d3227fca991 ("[PATCH v2 1/3] pidfd: allow pidfd_open() on non-thread-group leaders")
url: https://github.com/intel-lab-lkp/linux/commits/Tycho-Andersen/selftests-pidfd-add-non-thread-group-leader-tests/20231208-011135
patch link: https://lore.kernel.org/all/20231207170946.130823-1-tycho@tycho.pizza/
patch subject: [PATCH v2 1/3] pidfd: allow pidfd_open() on non-thread-group leaders

in testcase: kernel-selftests
version: kernel-selftests-x86_64-60acb023-1_20230329
with following parameters:

	group: pidfd



compiler: gcc-12
test machine: 36 threads 1 sockets Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz (Cascade Lake) with 32G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202312111516.26dc3fd5-oliver.sang@intel.com


besides, we also observed kernel-selftests.pidfd.pidfd_poll_test.fail on this
commit, but clean on parent:

bee0e7762ad2c602 e6d9be676d2c1fa8332c63c4382
---------------- ---------------------------
       fail:runs  %reproduction    fail:runs
           |             |             |
           :6          100%           6:6     kernel-selftests.pidfd.pidfd_poll_test.fail
           :6          100%           6:6     kernel-selftests.pidfd.pidfd_test.fail



TAP version 13
1..7
# timeout set to 300
# selftests: pidfd: pidfd_test
# TAP version 13
# 1..8
# # Parent: pid: 2191
# # Parent: Waiting for Child (2192) to complete.
# # Child (pidfd): starting. pid 2192 tid 2192
# # Child Thread: starting. pid 2192 tid 2193 ; and sleeping
# # Child Thread: doing exec of sleep
# Bail out! pidfd_poll check for premature notification on child thread exec test: Unexpected epoll_wait result (c=0, events=0) (errno 0)
# # Planned tests != run tests (8 != 0)
# # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0
not ok 1 selftests: pidfd: pidfd_test # exit=1

...

# timeout set to 300
# selftests: pidfd: pidfd_poll_test
# # running pidfd poll test for 10000 iterations
# Bail out! death notification wait timeout
# # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0
not ok 4 selftests: pidfd: pidfd_poll_test # exit=1




The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231211/202312111516.26dc3fd5-oliver.sang@intel.com
  
Christian Brauner Dec. 13, 2023, 12:18 p.m. UTC | #2
On Mon, Dec 11, 2023 at 03:28:09PM +0800, kernel test robot wrote:
> 
> 
> Hello,
> 
> kernel test robot noticed "kernel-selftests.pidfd.pidfd_test.fail" on:
> 
> commit: e6d9be676d2c1fa8332c63c4382b8d3227fca991 ("[PATCH v2 1/3] pidfd: allow pidfd_open() on non-thread-group leaders")
> url: https://github.com/intel-lab-lkp/linux/commits/Tycho-Andersen/selftests-pidfd-add-non-thread-group-leader-tests/20231208-011135
> patch link: https://lore.kernel.org/all/20231207170946.130823-1-tycho@tycho.pizza/
> patch subject: [PATCH v2 1/3] pidfd: allow pidfd_open() on non-thread-group leaders
> 
> in testcase: kernel-selftests
> version: kernel-selftests-x86_64-60acb023-1_20230329
> with following parameters:
> 
> 	group: pidfd
> 
> 
> 
> compiler: gcc-12
> test machine: 36 threads 1 sockets Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz (Cascade Lake) with 32G memory
> 
> (please refer to attached dmesg/kmsg for entire log/backtrace)
> 
> 
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202312111516.26dc3fd5-oliver.sang@intel.com
> 
> 
> besides, we also observed kernel-selftests.pidfd.pidfd_poll_test.fail on this
> commit, but clean on parent:
> 
> bee0e7762ad2c602 e6d9be676d2c1fa8332c63c4382
> ---------------- ---------------------------
>        fail:runs  %reproduction    fail:runs
>            |             |             |
>            :6          100%           6:6     kernel-selftests.pidfd.pidfd_poll_test.fail
>            :6          100%           6:6     kernel-selftests.pidfd.pidfd_test.fail
> 
> 
> 
> TAP version 13
> 1..7
> # timeout set to 300
> # selftests: pidfd: pidfd_test
> # TAP version 13
> # 1..8
> # # Parent: pid: 2191
> # # Parent: Waiting for Child (2192) to complete.
> # # Child (pidfd): starting. pid 2192 tid 2192
> # # Child Thread: starting. pid 2192 tid 2193 ; and sleeping
> # # Child Thread: doing exec of sleep
> # Bail out! pidfd_poll check for premature notification on child thread exec test: Unexpected epoll_wait result (c=0, events=0) (errno 0)

So it seems that this broke multi-threaded exit notifications.
  
Tycho Andersen Dec. 13, 2023, 7:18 p.m. UTC | #3
On Wed, Dec 13, 2023 at 01:18:03PM +0100, Christian Brauner wrote:
> On Mon, Dec 11, 2023 at 03:28:09PM +0800, kernel test robot wrote:
> > 
> > 
> > Hello,
> > 
> > kernel test robot noticed "kernel-selftests.pidfd.pidfd_test.fail" on:
> > 
> > commit: e6d9be676d2c1fa8332c63c4382b8d3227fca991 ("[PATCH v2 1/3] pidfd: allow pidfd_open() on non-thread-group leaders")
> > url: https://github.com/intel-lab-lkp/linux/commits/Tycho-Andersen/selftests-pidfd-add-non-thread-group-leader-tests/20231208-011135
> > patch link: https://lore.kernel.org/all/20231207170946.130823-1-tycho@tycho.pizza/
> > patch subject: [PATCH v2 1/3] pidfd: allow pidfd_open() on non-thread-group leaders
> > 
> > in testcase: kernel-selftests
> > version: kernel-selftests-x86_64-60acb023-1_20230329
> > with following parameters:
> > 
> > 	group: pidfd
> > 
> > 
> > 
> > compiler: gcc-12
> > test machine: 36 threads 1 sockets Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz (Cascade Lake) with 32G memory
> > 
> > (please refer to attached dmesg/kmsg for entire log/backtrace)
> > 
> > 
> > 
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <oliver.sang@intel.com>
> > | Closes: https://lore.kernel.org/oe-lkp/202312111516.26dc3fd5-oliver.sang@intel.com
> > 
> > 
> > besides, we also observed kernel-selftests.pidfd.pidfd_poll_test.fail on this
> > commit, but clean on parent:
> > 
> > bee0e7762ad2c602 e6d9be676d2c1fa8332c63c4382
> > ---------------- ---------------------------
> >        fail:runs  %reproduction    fail:runs
> >            |             |             |
> >            :6          100%           6:6     kernel-selftests.pidfd.pidfd_poll_test.fail
> >            :6          100%           6:6     kernel-selftests.pidfd.pidfd_test.fail
> > 
> > 
> > 
> > TAP version 13
> > 1..7
> > # timeout set to 300
> > # selftests: pidfd: pidfd_test
> > # TAP version 13
> > # 1..8
> > # # Parent: pid: 2191
> > # # Parent: Waiting for Child (2192) to complete.
> > # # Child (pidfd): starting. pid 2192 tid 2192
> > # # Child Thread: starting. pid 2192 tid 2193 ; and sleeping
> > # # Child Thread: doing exec of sleep
> > # Bail out! pidfd_poll check for premature notification on child thread exec test: Unexpected epoll_wait result (c=0, events=0) (errno 0)
> 
> So it seems that this broke multi-threaded exit notifications.

Yeah... I've been trying to figure out how to fix it.

de_thread() calls release_task() for the original leader, which I
didn't realize.

Tycho
  

Patch

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 3499c1a8b929..37d6b4e4ab70 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -332,6 +332,7 @@  extern int kill_pid_usb_asyncio(int sig, int errno, sigval_t addr, struct pid *,
 extern int kill_pgrp(struct pid *pid, int sig, int priv);
 extern int kill_pid(struct pid *pid, int sig, int priv);
 extern __must_check bool do_notify_parent(struct task_struct *, int);
+extern void do_notify_pidfd(struct task_struct *task);
 extern void __wake_up_parent(struct task_struct *p, struct task_struct *parent);
 extern void force_sig(int);
 extern void force_fatal_sig(int);
diff --git a/kernel/exit.c b/kernel/exit.c
index ee9f43bed49a..7bb6488ebd79 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -149,6 +149,9 @@  static void __exit_signal(struct task_struct *tsk)
 	struct tty_struct *tty;
 	u64 utime, stime;
 
+	/* Wake up all pidfd waiters */
+	do_notify_pidfd(tsk);
+
 	sighand = rcu_dereference_check(tsk->sighand,
 					lockdep_tasklist_lock_is_held());
 	spin_lock(&sighand->siglock);
diff --git a/kernel/fork.c b/kernel/fork.c
index 10917c3e1f03..eef15c93f6cf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2163,8 +2163,6 @@  static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
  * Allocate a new file that stashes @pid and reserve a new pidfd number in the
  * caller's file descriptor table. The pidfd is reserved but not installed yet.
  *
- * The helper verifies that @pid is used as a thread group leader.
- *
  * If this function returns successfully the caller is responsible to either
  * call fd_install() passing the returned pidfd and pidfd file as arguments in
  * order to install the pidfd into its file descriptor table or they must use
@@ -2182,7 +2180,7 @@  static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
  */
 int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
 {
-	if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
+	if (!pid)
 		return -EINVAL;
 
 	return __pidfd_prepare(pid, flags, ret);
diff --git a/kernel/pid.c b/kernel/pid.c
index 6500ef956f2f..4806798022d9 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -552,11 +552,6 @@  struct pid *pidfd_get_pid(unsigned int fd, unsigned int *flags)
  * Return the task associated with @pidfd. The function takes a reference on
  * the returned task. The caller is responsible for releasing that reference.
  *
- * Currently, the process identified by @pidfd is always a thread-group leader.
- * This restriction currently exists for all aspects of pidfds including pidfd
- * creation (CLONE_PIDFD cannot be used with CLONE_THREAD) and pidfd polling
- * (only supports thread group leaders).
- *
  * Return: On success, the task_struct associated with the pidfd.
  *	   On error, a negative errno number will be returned.
  */
@@ -615,11 +610,7 @@  int pidfd_create(struct pid *pid, unsigned int flags)
  * @flags: flags to pass
  *
  * This creates a new pid file descriptor with the O_CLOEXEC flag set for
- * the process identified by @pid. Currently, the process identified by
- * @pid must be a thread-group leader. This restriction currently exists
- * for all aspects of pidfds including pidfd creation (CLONE_PIDFD cannot
- * be used with CLONE_THREAD) and pidfd polling (only supports thread group
- * leaders).
+ * the process identified by @pid.
  *
  * Return: On success, a cloexec pidfd is returned.
  *         On error, a negative errno number will be returned.
diff --git a/kernel/signal.c b/kernel/signal.c
index 47a7602dfe8d..7b3a1e147225 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2028,7 +2028,7 @@  int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
 	return ret;
 }
 
-static void do_notify_pidfd(struct task_struct *task)
+void do_notify_pidfd(struct task_struct *task)
 {
 	struct pid *pid;
 
@@ -2060,9 +2060,6 @@  bool do_notify_parent(struct task_struct *tsk, int sig)
 	WARN_ON_ONCE(!tsk->ptrace &&
 	       (tsk->group_leader != tsk || !thread_group_empty(tsk)));
 
-	/* Wake up all pidfd waiters */
-	do_notify_pidfd(tsk);
-
 	if (sig != SIGCHLD) {
 		/*
 		 * This is only possible if parent == real_parent.