Message ID | 20240123153452.170866-2-tycho@tycho.pizza |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel+bounces-35560-ouuuleilei=gmail.com@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7300:2553:b0:103:945f:af90 with SMTP id p19csp420721dyi; Tue, 23 Jan 2024 07:42:29 -0800 (PST) X-Google-Smtp-Source: AGHT+IEiOduPBY1Tr4T1YVZNkElb3CBgJWr/4Xft4QlrJUs1Ml/2DjjNpkVnqLDiJNTy+99Mjr8i X-Received: by 2002:a17:906:4316:b0:a30:de02:84a1 with SMTP id j22-20020a170906431600b00a30de0284a1mr43204ejm.15.1706024549529; Tue, 23 Jan 2024 07:42:29 -0800 (PST) ARC-Seal: i=2; a=rsa-sha256; t=1706024549; cv=pass; d=google.com; s=arc-20160816; b=MyD4BJYiBeIBj6QAWMSRM+gBOk4XSyOap2e6FCYgOMIHLfcDXL9Z1x3L61Is/ky24U P5qgAH8t6o4du9rpr/wYN0RPH+P6Dx1BHKmrFKVYCt8aT6U0ujZuxhxNuGH7Lww5UdGZ 3cKXC1b9hheZy79Rj+gIGg+n8JZN6qPzFniDZt+U1Vd1uq+nxH/O8Bct6hj4ZUp6HIWt 2ttUErrebzgdxpGBejZpAMmKFD8HvUlLP5goEB0LV1qPpJ2HBLQNQGNggOsSwVKmFcwL izq7u99XNwKhgX+L4qjpvfRKYHYf0vCb49Bp1JNX9Y1MlRFFyvQJKHtDkgp3KPybk2cE 2ejQ== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=content-transfer-encoding:mime-version:list-unsubscribe :list-subscribe:list-id:precedence:references:in-reply-to:message-id :date:subject:cc:to:from:feedback-id:dkim-signature:dkim-signature; bh=M9C9oXaW03dr+ruWy17gy2FxHRKy1LqG/xA89jIm6Jg=; fh=QcQZKq0sgyRIdA06Hpl2aXUYNOInwZzwoPRM2roth5s=; b=oe2MmvZR4rfj4yiBZEAJOPEeBN8AtGlH3oM9hBkEegRYJ3fMgASsGlnYhBlPfqoAYg CFSBbOAu4u3/p3DCheudEbmhzfQTYDwGSJeqliwfzWEou6duJrkPexluo0ORaMwgSx4Q 0BOSfTzlHm3b3DZDlB+XzXYG98tnKjOwLIJhi7EsqYoqgLRqzFjZKTjUXRwnnCDxJJ5l zcZv3oy024NRCJff/cAY+vsCm7Uhn2GNKNhrJe2WJrABenj7EK4TbdI2xvyLu26wCoN9 ZB7EEsO0OjE+nX+zH236llYc0iR3racRTc4slziwacr4xldAFFAb8d13EOEASm7ptQrx rlMA== ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@tycho.pizza header.s=fm2 header.b=RtwdzzVy; dkim=pass header.i=@messagingengine.com header.s=fm3 header.b=WRbCi4+z; arc=pass (i=1 spf=pass spfdomain=tycho.pizza dkim=pass dkdomain=tycho.pizza dkim=pass dkdomain=messagingengine.com); spf=pass (google.com: domain of linux-kernel+bounces-35560-ouuuleilei=gmail.com@vger.kernel.org designates 147.75.80.249 as permitted sender) smtp.mailfrom="linux-kernel+bounces-35560-ouuuleilei=gmail.com@vger.kernel.org" Received: from am.mirrors.kernel.org (am.mirrors.kernel.org. [147.75.80.249]) by mx.google.com with ESMTPS id r16-20020a170906365000b00a2ec486b444si6967036ejb.111.2024.01.23.07.42.29 for <ouuuleilei@gmail.com> (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 23 Jan 2024 07:42:29 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel+bounces-35560-ouuuleilei=gmail.com@vger.kernel.org designates 147.75.80.249 as permitted sender) client-ip=147.75.80.249; Authentication-Results: mx.google.com; dkim=pass header.i=@tycho.pizza header.s=fm2 header.b=RtwdzzVy; dkim=pass header.i=@messagingengine.com header.s=fm3 header.b=WRbCi4+z; arc=pass (i=1 spf=pass spfdomain=tycho.pizza dkim=pass dkdomain=tycho.pizza dkim=pass dkdomain=messagingengine.com); spf=pass (google.com: domain of linux-kernel+bounces-35560-ouuuleilei=gmail.com@vger.kernel.org designates 147.75.80.249 as permitted sender) smtp.mailfrom="linux-kernel+bounces-35560-ouuuleilei=gmail.com@vger.kernel.org" Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by am.mirrors.kernel.org (Postfix) with ESMTPS id 02E541F25002 for <ouuuleilei@gmail.com>; Tue, 23 Jan 2024 15:42:29 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 33272604D1; Tue, 23 Jan 2024 15:35:14 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=tycho.pizza header.i=@tycho.pizza header.b="RtwdzzVy"; dkim=pass (2048-bit key) header.d=messagingengine.com header.i=@messagingengine.com header.b="WRbCi4+z" Received: from out4-smtp.messagingengine.com (out4-smtp.messagingengine.com [66.111.4.28]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C59C960273; Tue, 23 Jan 2024 15:35:07 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=66.111.4.28 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706024109; cv=none; b=LeRyhAEjPZbzzlF/5rpNnaRHjPfchjxAvxsS8onGgZ/4AI0lvuc3cvCOoVChjzgoKLtqhFzdXPMTqEefDakRLgMgo39M+dSPKb9SqBTpWqghQ6uUQR3gdg6B6FcL9kda1RKkIq1m83P7H0rWrxbNaYQBEk9maVESJiylzUwRp0U= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706024109; c=relaxed/simple; bh=GNoSOzPorlXaJiPSQNqXcwNADrooIycd6sfZwndbFVo=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=u0r//uKZEh/cmIn8EnnqeHxBfT4LcrhK8ehiOMKOL7pn/90jJzYH14xWqfnez0kkIg42Ps+pFEZt44x5czQqTKglsAoq0++BCpIOAx2GlnC7mJ4Ayprss7U3ryVJ6dc9BbQ3USTueKePyh76aoRIghVU9fxXaKShFVUFOlAhEBo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=tycho.pizza; spf=pass smtp.mailfrom=tycho.pizza; dkim=pass (2048-bit key) header.d=tycho.pizza header.i=@tycho.pizza header.b=RtwdzzVy; dkim=pass (2048-bit key) header.d=messagingengine.com header.i=@messagingengine.com header.b=WRbCi4+z; arc=none smtp.client-ip=66.111.4.28 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=tycho.pizza Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=tycho.pizza Received: from compute2.internal (compute2.nyi.internal [10.202.2.46]) by mailout.nyi.internal (Postfix) with ESMTP id C1BFA5C01D5; Tue, 23 Jan 2024 10:35:06 -0500 (EST) Received: from mailfrontend1 ([10.202.2.162]) by compute2.internal (MEProxy); Tue, 23 Jan 2024 10:35:06 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=tycho.pizza; h= cc:cc:content-transfer-encoding:content-type:date:date:from:from :in-reply-to:in-reply-to:message-id:mime-version:references :reply-to:subject:subject:to:to; s=fm2; t=1706024106; x= 1706110506; bh=M9C9oXaW03dr+ruWy17gy2FxHRKy1LqG/xA89jIm6Jg=; b=R twdzzVyrV/XPizH2kgzfstAvqCjHQrdbkDlmXTV08IV1djbRnMUdcTjWKHpG0vuP Dapaiy/1NkYu6m9Xzr+8sW+PnN1z8Zw7dSHbCxY4vKG6J5nkchsG96+RMTjXUX+v o0KD0uIW5pVRgc3/T9yoiR8wVL5lj6a/npJTQqXTus5ejG0VRVm5Y+EflMMjgbFs ZkffURe5kNgxtHFmpV0JEmcqCtOGyjz18ackzpo+v17RvRTkic3rDxu3E0R6389T zHZ34hIlWu+7C/cs9kqpihLR4q+K2CFTFXeYdE4eyRakyWojx8bBFQb38WCSSj9M V7islNSCDD18c+imK7GJA== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-transfer-encoding :content-type:date:date:feedback-id:feedback-id:from:from :in-reply-to:in-reply-to:message-id:mime-version:references :reply-to:subject:subject:to:to:x-me-proxy:x-me-proxy :x-me-sender:x-me-sender:x-sasl-enc; s=fm3; t=1706024106; x= 1706110506; bh=M9C9oXaW03dr+ruWy17gy2FxHRKy1LqG/xA89jIm6Jg=; b=W RbCi4+zYrIaBZg6/tDRZk7pM/C7EYtcMUgWtI7PWu22wxaYeQmN5aMCLklJG8HOf F9PK08aG3jX+bo+XXRg1N/hKLFAXvK6ZdseeBnEWN1JAnOGYhBkdeJF6fBUF3Ytq cmKFkkq0/XFeuRAQmjdo7ZA0r3Zyumlw0Jg6S25gMwdfsVidiA4VNQzfSqzEs8bt XUOYJfW+dVRat9lHihGMU2NQ+ZgMbuxpBl6cWMtjIVBd5FmpCBLUmdyf3xL3/5F1 mNpzpwNRn8VuDSs+AUGQqcStXOO+FTUIR8I/IGF5ZFqF5f8+usW/QKA6JQRPaxtM jCgGmYe8Vn9YA0Ax8PWYA== X-ME-Sender: <xms:qtyvZUdjCeKpCZAf4N72Y4hStfS1FdktPIvIsE8pekBGbduV8rUKQA> <xme:qtyvZWMrmwzKeGIAIug2GLWDVHYAmcFqegJa-zB6mr_nw4ANok3VVyyn93kE8A651 npM5Csdxr4GSRw0SRk> X-ME-Received: <xmr:qtyvZVjDsJd-vsG-eWeaGRE5RfD3yssgM2JE-tVDfcsOWJFd3_JwjShCXNKmTe5QGRaVeA> X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgedvkedrvdekkedgjeejucetufdoteggodetrfdotf fvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdfqfgfvpdfurfetoffkrfgpnffqhgen uceurghilhhouhhtmecufedttdenucesvcftvggtihhpihgvnhhtshculddquddttddmne cujfgurhephffvvefufffkofgjfhgggfestdekredtredttdenucfhrhhomhepvfihtghh ohcutehnuggvrhhsvghnuceothihtghhohesthihtghhohdrphhiiiiirgeqnecuggftrf grthhtvghrnhephfevhfdvvdffueffgfffjeehlefffeefffeuheegvedvgefgueefkedu ieekgfelnecuffhomhgrihhnpehkvghrnhgvlhdrohhrghenucevlhhushhtvghrufhiii gvpedtnecurfgrrhgrmhepmhgrihhlfhhrohhmpehthigthhhosehthigthhhordhpihii iigr X-ME-Proxy: <xmx:qtyvZZ-oxw8spuJs6iMruPU6KqTBmnphQdajNFy8XAUk1hR4cOYUyg> <xmx:qtyvZQtqWartjP9mGRbjfFb6j_PNE-tv2RyKBLp5nkf2jagkAv4FiQ> <xmx:qtyvZQGlYqq3tfzCLOLj-s6abTWeG9s2P4V4O9QKj8HfSJIv6C2svA> <xmx:qtyvZaXnl-JbqYgaeIIktv1HfuQqtBUKJXW_B-kei706mJFKXxH9WA> Feedback-ID: i21f147d5:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Tue, 23 Jan 2024 10:35:05 -0500 (EST) From: Tycho Andersen <tycho@tycho.pizza> To: Christian Brauner <brauner@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com>, linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, Tycho Andersen <tycho@tycho.pizza>, Tycho Andersen <tandersen@netflix.com> Subject: [PATCH v3 1/3] pidfd: allow pidfd_open() on non-thread-group leaders Date: Tue, 23 Jan 2024 08:34:50 -0700 Message-Id: <20240123153452.170866-2-tycho@tycho.pizza> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20240123153452.170866-1-tycho@tycho.pizza> References: <20240123153452.170866-1-tycho@tycho.pizza> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: <linux-kernel.vger.kernel.org> List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org> List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1788896398018051641 X-GMAIL-MSGID: 1788896398018051641 |
Series |
pidfds for non thread group leaders
|
|
Commit Message
Tycho Andersen
Jan. 23, 2024, 3:34 p.m. UTC
From: Tycho Andersen <tandersen@netflix.com> We are using the pidfd family of syscalls with the seccomp userspace notifier. When some thread triggers a seccomp notification, we want to do some things to its context (munge fd tables via pidfd_getfd(), maybe write to its memory, etc.). However, threads created with ~CLONE_FILES or ~CLONE_VM mean that we can't use the pidfd family of syscalls for this purpose, since their fd table or mm are distinct from the thread group leader's. In this patch, we relax this restriction for pidfd_open(). In order to avoid dangling poll() users we need to notify pidfd waiters when individual threads die, but once we do that all the other machinery seems to work ok viz. the tests. But I suppose there are more cases than just this one. Signed-off-by: Tycho Andersen <tandersen@netflix.com> -- v2: unify pidfd notification to all go through do_notify_pidfd() inside of __exit_signals() suggested by Oleg. Link to v1: https://lore.kernel.org/all/20231130163946.277502-1-tycho@tycho.pizza/ v3: go back to two separate call sites, the exiting one in do_notify_parent(), and a new one in release_task(), when a thread is not the thread group leader. --- include/linux/sched/signal.h | 1 + kernel/exit.c | 11 +++++++++++ kernel/fork.c | 4 +--- kernel/pid.c | 11 +---------- kernel/signal.c | 2 +- 5 files changed, 15 insertions(+), 14 deletions(-)
Comments
Too late for me, but I don't understand this patch after a quick glance. perhaps I missed something... On 01/23, Tycho Andersen wrote: > > @@ -256,6 +256,17 @@ void release_task(struct task_struct *p) > write_lock_irq(&tasklist_lock); > ptrace_release_task(p); > thread_pid = get_pid(p->thread_pid); > + > + /* > + * If we're not the leader, notify any waiters on our pidfds. Note that > + * we don't want to notify the leader until /everyone/ in the thread > + * group is dead, viz. the condition below. > + * > + * We have to do this here, since __exit_signal() will > + * __unhash_processes(), and break do_notify_pidfd()'s lookup. > + */ > + if (!thread_group_leader(p)) > + do_notify_pidfd(p); This doesn't look consistent. If the task is a group leader do_notify_pidfd() is called by exit_notify() when it becomes a zombie (if no other threads), before it is reaped by its parent (unless autoreap). If it is a sub-thread, it is called by release_task() above. Note that a sub-thread can become a zombie too if it is traced. > __exit_signal(p); and, do_notify_pidfd() is called before __exit_signal() which does __unhash_process() -> detach_pid(PIDTYPE_PID). Doesn't this mean that pidfd_poll() can hang? thread_group_exited() won't return true after do_notify_pidfd() above, not to mention that thread_group_empty() is not possible if !thread_group_leader(). So. When do we want to do do_notify_pidfd() ? Whe the task (leader or not) becomes a zombie (passes exit_notify) or when it is reaped by release_task? Either way pidfd_poll() needs more changes with this patch and it can't use thread_group_exited(). If do_notify_pidfd() is called by release_task() after __exit_signal(), it can just check pid_has_task(PIDTYPE_PID). Oleg.
On Tue, Jan 23, 2024 at 08:56:08PM +0100, Oleg Nesterov wrote: > Too late for me, but I don't understand this patch after a quick glance. > perhaps I missed something... Thanks for taking a look. > On 01/23, Tycho Andersen wrote: > > > > @@ -256,6 +256,17 @@ void release_task(struct task_struct *p) > > write_lock_irq(&tasklist_lock); > > ptrace_release_task(p); > > thread_pid = get_pid(p->thread_pid); > > + > > + /* > > + * If we're not the leader, notify any waiters on our pidfds. Note that > > + * we don't want to notify the leader until /everyone/ in the thread > > + * group is dead, viz. the condition below. > > + * > > + * We have to do this here, since __exit_signal() will > > + * __unhash_processes(), and break do_notify_pidfd()'s lookup. > > + */ > > + if (!thread_group_leader(p)) > > + do_notify_pidfd(p); > > This doesn't look consistent. > > If the task is a group leader do_notify_pidfd() is called by exit_notify() > when it becomes a zombie (if no other threads), before it is reaped by its > parent (unless autoreap). There is another path, also in release_task(), that I was trying to mirror since it deals explicitly with sub-threads but, > If it is a sub-thread, it is called by release_task() above. Note that a > sub-thread can become a zombie too if it is traced. I didn't know about this. > > __exit_signal(p); > > and, do_notify_pidfd() is called before __exit_signal() which does > __unhash_process() -> detach_pid(PIDTYPE_PID). > > Doesn't this mean that pidfd_poll() can hang? thread_group_exited() > won't return true after do_notify_pidfd() above, not to mention that > thread_group_empty() is not possible if !thread_group_leader(). I was wondering about this too, but the test_non_tgl_poll_exit test in the next patch tests exactly this and works as expected. > So. When do we want to do do_notify_pidfd() ? Whe the task (leader or not) > becomes a zombie (passes exit_notify) or when it is reaped by release_task? It seems like we'd want it when exit_notify() is called in principle, since that's when the pid actually dies. When it is reaped is "mostly unrelated". Something like, 1. in the "normal" exit_notify() paths via do_notify_parent() 2. if none of those cases are true (aka the final else in exit_notify()) and the thread is not ptraced 3. via release_task() finally if this was the thread group leader and it died before some sub-thread then in pidfd_poll(), we can do: if (!tsk || (tsk->exit_state >= 0) || thread_group_exited()) do_notify_pidfd(); ? > Either way pidfd_poll() needs more changes with this patch and it can't > use thread_group_exited(). If do_notify_pidfd() is called by release_task() > after __exit_signal(), it can just check pid_has_task(PIDTYPE_PID). I suppose this is why my test works, since pid_task(PIDTYPE_PID) is null after release_task(). But if we want it to happen earlier, we'll have to do something like the above. Tycho
I am already sleeping. I'll try to reply to other parts of your email tomorrow but I am not sure, I will be very busy with family duties. On 01/23, Tycho Andersen wrote: > > > > __exit_signal(p); > > > > and, do_notify_pidfd() is called before __exit_signal() which does > > __unhash_process() -> detach_pid(PIDTYPE_PID). > > > > Doesn't this mean that pidfd_poll() can hang? thread_group_exited() > > won't return true after do_notify_pidfd() above, not to mention that > > thread_group_empty() is not possible if !thread_group_leader(). > > I was wondering about this too, but the test_non_tgl_poll_exit test in > the next patch tests exactly this and works as expected. Well, if release_task() completes __exit_signal() before the woken task does thread_group_exited(), pid_task(PIDTYPE_PID) will return 0 and pidfd_poll() won't hang. But to be honest I can't understand test_non_tgl_poll_exit() at all. I don't even understand why the process/thread created by fork_task_with_thread() should ever exit. And why it creates the "writer" child... Never mind, too late for me to read the code. Oleg.
On 01/23, Oleg Nesterov wrote: > > But to be honest I can't understand test_non_tgl_poll_exit() at all. I don't > even understand why the process/thread created by fork_task_with_thread() > should ever exit. And why it creates the "writer" child... Never mind, too > late for me to read the code. Ah, OK, it passes thread_wait_exit to fork_task_with_thread(), and this "fn" reads sk_pair and does exit() which is actually exit_group(). Oleg.
Add Eric. On 01/23, Tycho Andersen wrote: > > On Tue, Jan 23, 2024 at 08:56:08PM +0100, Oleg Nesterov wrote: > > > So. When do we want to do do_notify_pidfd() ? Whe the task (leader or not) > > becomes a zombie (passes exit_notify) or when it is reaped by release_task? > > It seems like we'd want it when exit_notify() is called in principle, > since that's when the pid actually dies. No the pid "dies" after this task is reaped, until then its nr is still in use and pid_task(PIDTYPE_PID) returns the exiting/exited task. > When it is reaped is "mostly unrelated". Then why pidfd_poll() can't simply check !task || task->exit_state ? Nevermind. So, currently pidfd_poll() succeeds when the leader can be reaped, iow the whole thread group has exited. But even if you are the parent, you can't expect that wait(WNOHANG) must succeed, the leader can be traced. I guess it is too late to change this behaviour. What if we add the new PIDFD_THREAD flag? With this flag - sys_pidfd_open() doesn't require the must be a group leader - pidfd_poll() succeeds when the task passes exit_notify() and becomes a zombie, even if it is a leader and has other threads. Please the the incomplete/untested patch below. - The change in exit_notify() is sub-optimal, we can do better to avoid 2 do_notify_pidfd() calls from exit_notify(). But so far this is only for discussion, lets keep it simple. - __pidfd_prepare() needs some minor cleanups regardless of this change, I'll send the patch... What do you think? And why is thread_group_exited() exported? Oleg. diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h index 5406fbc13074..2e6461459877 100644 --- a/include/uapi/linux/pidfd.h +++ b/include/uapi/linux/pidfd.h @@ -7,6 +7,7 @@ #include <linux/fcntl.h> /* Flags for pidfd_open(). */ -#define PIDFD_NONBLOCK O_NONBLOCK +#define PIDFD_NONBLOCK O_NONBLOCK +#define PIDFD_THREAD O_EXCL // or anything else not used by anon_inode's #endif /* _UAPI_LINUX_PIDFD_H */ diff --git a/kernel/exit.c b/kernel/exit.c index dfb963d2f862..9f8526b7d717 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -752,6 +752,10 @@ static void exit_notify(struct task_struct *tsk, int group_dead) autoreap = true; } + /* unnecessary if do_notify_parent() was already called, + we can do better */ + do_notify_pidfd(tsk); + if (autoreap) { tsk->exit_state = EXIT_DEAD; list_add(&tsk->ptrace_entry, &dead); diff --git a/kernel/fork.c b/kernel/fork.c index c981fa6171c1..38f2c7423fb4 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -101,6 +101,7 @@ #include <linux/user_events.h> #include <linux/iommu.h> #include <linux/rseq.h> +#include <uapi/linux/pidfd.h> #include <asm/pgalloc.h> #include <linux/uaccess.h> @@ -2068,12 +2069,27 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f) } #endif +static bool xxx_exited(struct pid *pid, int excl) +{ + struct task_struct *task; + bool exited; + + rcu_read_lock(); + task = pid_task(pid, PIDTYPE_PID); + exited = !task || + (READ_ONCE(task->exit_state) && (excl || thread_group_empty(task))); + rcu_read_unlock(); + + return exited; +} + /* * Poll support for process exit notification. */ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts) { struct pid *pid = file->private_data; + int excl = file->f_flags & PIDFD_THREAD; __poll_t poll_flags = 0; poll_wait(file, &pid->wait_pidfd, pts); @@ -2083,7 +2099,7 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts) * If the thread group leader exits before all other threads in the * group, then poll(2) should block, similar to the wait(2) family. */ - if (thread_group_exited(pid)) + if (xxx_exited(pid, excl)) poll_flags = EPOLLIN | EPOLLRDNORM; return poll_flags; @@ -2129,7 +2145,9 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re { int pidfd; struct file *pidfd_file; + unsigned excl = flags & PIDFD_THREAD; + flags &= ~PIDFD_THREAD; if (flags & ~(O_NONBLOCK | O_RDWR | O_CLOEXEC)) return -EINVAL; @@ -2144,6 +2162,7 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re return PTR_ERR(pidfd_file); } get_pid(pid); /* held by pidfd_file now */ + pidfd_file->f_flags |= excl; *ret = pidfd_file; return pidfd; } @@ -2176,7 +2195,9 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re */ int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret) { - if (!pid || !pid_has_task(pid, PIDTYPE_TGID)) + unsigned excl = flags & PIDFD_THREAD; + + if (!pid || !pid_has_task(pid, excl ? PIDTYPE_PID : PIDTYPE_TGID)) return -EINVAL; return __pidfd_prepare(pid, flags, ret); diff --git a/kernel/pid.c b/kernel/pid.c index b52b10865454..5257197f9493 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -629,7 +629,7 @@ SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags) int fd; struct pid *p; - if (flags & ~PIDFD_NONBLOCK) + if (flags & ~(PIDFD_NONBLOCK | PIDFD_THREAD)) return -EINVAL; if (pid <= 0)
> > When it is reaped is "mostly unrelated". > > Then why pidfd_poll() can't simply check !task || task->exit_state ? > > Nevermind. So, currently pidfd_poll() succeeds when the leader can be Hm, the comment right above mentions: /* * Inform pollers only when the whole thread group exits. * If the thread group leader exits before all other threads in the * group, then poll(2) should block, similar to the wait(2) family. */ > reaped, iow the whole thread group has exited. But even if you are the > parent, you can't expect that wait(WNOHANG) must succeed, the leader > can be traced. I guess it is too late to change this behaviour. Hm, why is that an issue though? And if it is an issue why shouldn't we be able to change it? Because a program would rely on WNOHANG to hang on a ptraced leader? That seems esoteric imho. I might just misunderstand. > > What if we add the new PIDFD_THREAD flag? With this flag > > - sys_pidfd_open() doesn't require the must be a group leader Yes. > > - pidfd_poll() succeeds when the task passes exit_notify() and > becomes a zombie, even if it is a leader and has other threads. Iiuc, if an existing user creates a pidfd for a thread-group leader and then polls that pidfd they would currently only get notified if the thread-group is empty and the leader has exited. If we now start notifying when the thread-group leader exits but the thread-group isn't empty then this would be a fairly big api change and I would expect this to cause regressions as that surely is something that userspace relies on. Am I understand this right?
On 01/25, Christian Brauner wrote: > > > > When it is reaped is "mostly unrelated". > > > > Then why pidfd_poll() can't simply check !task || task->exit_state ? > > > > Nevermind. So, currently pidfd_poll() succeeds when the leader can be > > Hm, the comment right above mentions: > > /* > * Inform pollers only when the whole thread group exits. > * If the thread group leader exits before all other threads in the > * group, then poll(2) should block, similar to the wait(2) family. > */ > > reaped, iow the whole thread group has exited. Yes, but the comment doesn't contradict with what I have said? > > But even if you are the > > parent, you can't expect that wait(WNOHANG) must succeed, the leader > > can be traced. I guess it is too late to change this behaviour. > > Hm, why is that an issue though? Well, I didn't say this is a problem. I simply do not know how/why people use pidfd_poll(). I mostly tried to explain why do I think that do_notify_pidfd() should be always called from exit_notify() path, not by release_task(), even if the task is not a leader. > Because a program would rely on WNOHANG to hang on > a ptraced leader? That seems esoteric imho. To me it would be usefule, but lets not discuss this now. The "patch" I sent doesn't change the current behaviour. > > What if we add the new PIDFD_THREAD flag? With this flag > > > > - sys_pidfd_open() doesn't require the must be a group leader > > Yes. > > > > > - pidfd_poll() succeeds when the task passes exit_notify() and > > becomes a zombie, even if it is a leader and has other threads. > > Iiuc, if an existing user creates a pidfd for a thread-group leader and > then polls that pidfd they would currently only get notified if the > thread-group is empty and the leader has exited. > > If we now start notifying when the thread-group leader exits but the > thread-group isn't empty then this would be a fairly big api change Hmm... again, this patch doesn't (shouldn't) change the current behavior. Please note "with this flag" above. If sys_pidfd_open() was called without PIDFD_THREAD, then sys_pidfd_open() still requires that the target task must be a group leader, and pidfd_poll() won't succeed until the leader exits and thread_group_empty() is true. Oleg.
On Thu, Jan 25, 2024 at 06:51:14PM +0100, Oleg Nesterov wrote: > > > What if we add the new PIDFD_THREAD flag? With this flag > > > > > > - sys_pidfd_open() doesn't require the must be a group leader > > > > Yes. > > > > > > > > - pidfd_poll() succeeds when the task passes exit_notify() and > > > becomes a zombie, even if it is a leader and has other threads. > > > > Iiuc, if an existing user creates a pidfd for a thread-group leader and > > then polls that pidfd they would currently only get notified if the > > thread-group is empty and the leader has exited. > > > > If we now start notifying when the thread-group leader exits but the > > thread-group isn't empty then this would be a fairly big api change > > Hmm... again, this patch doesn't (shouldn't) change the current behavior. > > Please note "with this flag" above. If sys_pidfd_open() was called > without PIDFD_THREAD, then sys_pidfd_open() still requires that the > target task must be a group leader, and pidfd_poll() won't succeed > until the leader exits and thread_group_empty() is true. Thanks for sending your patch, I'll take a look at it (probably tomorrow at this rate). One of the things I don't like about PIDFD_THREAD is that it's hard to tell whether an arbitrary thread is a leader or not. Right now we do it by parsing /proc/pid/status, which shows all the stuff from do_task_stat() that we don't care about but which is quite expensive to compute. (Maybe there's a better way?) With PIDFD_THREAD we could could do it twice, once with the flag, get EINVAL, and then do it again. But ideally we wouldn't have to. Still, if that's the only way that makes sense, that's fine. Tycho
On 01/25, Tycho Andersen wrote: > > One of the things I don't like about PIDFD_THREAD is that it's hard to > tell whether an arbitrary thread is a leader or not. Right now we do > it by parsing /proc/pid/status, which shows all the stuff from > do_task_stat() that we don't care about but which is quite expensive > to compute. (Maybe there's a better way?) > > With PIDFD_THREAD we could could do it twice, once with the flag, get > EINVAL, and then do it again. But ideally we wouldn't have to. Too late for me, most probably I misunderstood. If you want the PIDFD_THREAD behaviour, you can always use this flag without any check... Could you spell? Oleg.
On 01/25, Oleg Nesterov wrote: > > On 01/25, Tycho Andersen wrote: > > > > One of the things I don't like about PIDFD_THREAD is that it's hard to > > tell whether an arbitrary thread is a leader or not. Right now we do > > it by parsing /proc/pid/status, which shows all the stuff from > > do_task_stat() that we don't care about but which is quite expensive > > to compute. (Maybe there's a better way?) > > > > With PIDFD_THREAD we could could do it twice, once with the flag, get > > EINVAL, and then do it again. But ideally we wouldn't have to. > > Too late for me, most probably I misunderstood. > > If you want the PIDFD_THREAD behaviour, you can always use this flag > without any check... > > Could you spell? Just in case, we can even add PIDFD_AUTO (modulo naming) which acts as PIDFD_THREAD if the target task is not a leader or 0 (current behaviour) otherwise. Trivial. Oleg.
On Thu, Jan 25, 2024 at 07:30:46PM +0100, Oleg Nesterov wrote: > On 01/25, Oleg Nesterov wrote: > > > > On 01/25, Tycho Andersen wrote: > > > > > > One of the things I don't like about PIDFD_THREAD is that it's hard to > > > tell whether an arbitrary thread is a leader or not. Right now we do > > > it by parsing /proc/pid/status, which shows all the stuff from > > > do_task_stat() that we don't care about but which is quite expensive > > > to compute. (Maybe there's a better way?) > > > > > > With PIDFD_THREAD we could could do it twice, once with the flag, get > > > EINVAL, and then do it again. But ideally we wouldn't have to. > > > > Too late for me, most probably I misunderstood. > > > > If you want the PIDFD_THREAD behaviour, you can always use this flag > > without any check... Sorry, I hadn't read the patch. If it's ok to use PIDFD_THREAD on a leader, then we can just always specify it. (We don't care about the behavior of pidfd_poll().) > > Could you spell? > > Just in case, we can even add PIDFD_AUTO (modulo naming) which acts as > PIDFD_THREAD if the target task is not a leader or 0 (current behaviour) > otherwise. Trivial. Yep, or given the above, maybe it'll work as-is, thank you. Tycho
On Thu, Jan 25, 2024 at 06:51:14PM +0100, Oleg Nesterov wrote: > On 01/25, Christian Brauner wrote: > > > > > > When it is reaped is "mostly unrelated". > > > > > > Then why pidfd_poll() can't simply check !task || task->exit_state ? > > > > > > Nevermind. So, currently pidfd_poll() succeeds when the leader can be > > > > Hm, the comment right above mentions: > > > > /* > > * Inform pollers only when the whole thread group exits. > > * If the thread group leader exits before all other threads in the > > * group, then poll(2) should block, similar to the wait(2) family. > > */ > > > reaped, iow the whole thread group has exited. > > Yes, but the comment doesn't contradict with what I have said? No, it doesn't. I'm trying to understand what you are suggesting though. Are you saying !task || tas->exit_state is enough and we shouldn't use the helper that was added in commit 38fd525a4c61 ("exit: Factor thread_group_exited out of pidfd_poll"). If so what does that buy us open-coding the check instead of using that helper? Is there an actual bug here? > > > But even if you are the > > > parent, you can't expect that wait(WNOHANG) must succeed, the leader > > > can be traced. I guess it is too late to change this behaviour. > > > > Hm, why is that an issue though? > > Well, I didn't say this is a problem. I simply do not know how/why people > use pidfd_poll(). Sorry, I just have a hard time understanding what you wanted then. :) "I guess it is too late to change this behavior." made it sound like a) there's a problem and b) that you would prefer to change behavior. Thus, it seems that wait(WNOHANG) hanging when a traced leader of an empty thread-group has exited is a problem in your eyes. > > I mostly tried to explain why do I think that do_notify_pidfd() should > be always called from exit_notify() path, not by release_task(), even > if the task is not a leader. > > > Because a program would rely on WNOHANG to hang on > > a ptraced leader? That seems esoteric imho. > > To me it would be usefule, but lets not discuss this now. The "patch" Ok, that's good then. I would expect that at least stuff like rr makes use of pidfd and they might rely on this behavior - although I haven't checked their code. > I sent doesn't change the current behaviour. Yeah, I got that but it would still be useful to understand the wider context you were adressing. > > > > What if we add the new PIDFD_THREAD flag? With this flag > > > > > > - sys_pidfd_open() doesn't require the must be a group leader > > > > Yes. > > > > > > > > - pidfd_poll() succeeds when the task passes exit_notify() and > > > becomes a zombie, even if it is a leader and has other threads. > > > > Iiuc, if an existing user creates a pidfd for a thread-group leader and > > then polls that pidfd they would currently only get notified if the > > thread-group is empty and the leader has exited. > > > > If we now start notifying when the thread-group leader exits but the > > thread-group isn't empty then this would be a fairly big api change > > Hmm... again, this patch doesn't (shouldn't) change the current behavior. > > Please note "with this flag" above. If sys_pidfd_open() was called > without PIDFD_THREAD, then sys_pidfd_open() still requires that the > target task must be a group leader, and pidfd_poll() won't succeed > until the leader exits and thread_group_empty() is true. Yeah, I missed the PIDFD_THREAD flag suggestion. Sorry about that. Btw, I'm not sure whether you remember that but when we originally did the pidfd work you and I discussed thread support and already decided back then that having a flag like PIDFD_THREAD would likely be the way to go. The PIDFD_THREAD flag would be would be interesting because we could make pidfd_send_signal() support this flag as well to allow sending a signal to a specific thread. That's something that I had also wanted to support. And I've been asked for this a few times already. What do you think?
> Please the the incomplete/untested patch below. > > - The change in exit_notify() is sub-optimal, we can do better > to avoid 2 do_notify_pidfd() calls from exit_notify(). But > so far this is only for discussion, lets keep it simple. > > - __pidfd_prepare() needs some minor cleanups regardless of > this change, I'll send the patch... > > What do you think? > > And why is thread_group_exited() exported? > > Oleg. > > diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h > index 5406fbc13074..2e6461459877 100644 > --- a/include/uapi/linux/pidfd.h > +++ b/include/uapi/linux/pidfd.h > @@ -7,6 +7,7 @@ > #include <linux/fcntl.h> > > /* Flags for pidfd_open(). */ > -#define PIDFD_NONBLOCK O_NONBLOCK > +#define PIDFD_NONBLOCK O_NONBLOCK > +#define PIDFD_THREAD O_EXCL // or anything else not used by anon_inode's I like it! The only request I would have is to not alias O_EXCL and PIDFD_THREAD. Because it doesn't map as clearly as NONBLOCK did. > > #endif /* _UAPI_LINUX_PIDFD_H */ > diff --git a/kernel/exit.c b/kernel/exit.c > index dfb963d2f862..9f8526b7d717 100644 > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -752,6 +752,10 @@ static void exit_notify(struct task_struct *tsk, int group_dead) > autoreap = true; > } > > + /* unnecessary if do_notify_parent() was already called, > + we can do better */ > + do_notify_pidfd(tsk); > + > if (autoreap) { > tsk->exit_state = EXIT_DEAD; > list_add(&tsk->ptrace_entry, &dead); > diff --git a/kernel/fork.c b/kernel/fork.c > index c981fa6171c1..38f2c7423fb4 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -101,6 +101,7 @@ > #include <linux/user_events.h> > #include <linux/iommu.h> > #include <linux/rseq.h> > +#include <uapi/linux/pidfd.h> > > #include <asm/pgalloc.h> > #include <linux/uaccess.h> > @@ -2068,12 +2069,27 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f) > } > #endif > > +static bool xxx_exited(struct pid *pid, int excl) > +{ > + struct task_struct *task; > + bool exited; > + > + rcu_read_lock(); > + task = pid_task(pid, PIDTYPE_PID); > + exited = !task || > + (READ_ONCE(task->exit_state) && (excl || thread_group_empty(task))); > + rcu_read_unlock(); > + > + return exited; > +} > + > /* > * Poll support for process exit notification. > */ > static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts) > { > struct pid *pid = file->private_data; > + int excl = file->f_flags & PIDFD_THREAD; > __poll_t poll_flags = 0; > > poll_wait(file, &pid->wait_pidfd, pts); > @@ -2083,7 +2099,7 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts) > * If the thread group leader exits before all other threads in the > * group, then poll(2) should block, similar to the wait(2) family. > */ > - if (thread_group_exited(pid)) > + if (xxx_exited(pid, excl)) > poll_flags = EPOLLIN | EPOLLRDNORM; > > return poll_flags; > @@ -2129,7 +2145,9 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re > { > int pidfd; > struct file *pidfd_file; > + unsigned excl = flags & PIDFD_THREAD; > > + flags &= ~PIDFD_THREAD; > if (flags & ~(O_NONBLOCK | O_RDWR | O_CLOEXEC)) > return -EINVAL; > > @@ -2144,6 +2162,7 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re > return PTR_ERR(pidfd_file); > } > get_pid(pid); /* held by pidfd_file now */ > + pidfd_file->f_flags |= excl; > *ret = pidfd_file; > return pidfd; > } > @@ -2176,7 +2195,9 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re > */ > int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret) > { > - if (!pid || !pid_has_task(pid, PIDTYPE_TGID)) > + unsigned excl = flags & PIDFD_THREAD; > + > + if (!pid || !pid_has_task(pid, excl ? PIDTYPE_PID : PIDTYPE_TGID)) > return -EINVAL; > > return __pidfd_prepare(pid, flags, ret); > diff --git a/kernel/pid.c b/kernel/pid.c > index b52b10865454..5257197f9493 100644 > --- a/kernel/pid.c > +++ b/kernel/pid.c > @@ -629,7 +629,7 @@ SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags) > int fd; > struct pid *p; > > - if (flags & ~PIDFD_NONBLOCK) > + if (flags & ~(PIDFD_NONBLOCK | PIDFD_THREAD)) > return -EINVAL; > > if (pid <= 0) >
On Thu, Jan 25, 2024 at 11:36:50AM -0700, Tycho Andersen wrote: > On Thu, Jan 25, 2024 at 07:30:46PM +0100, Oleg Nesterov wrote: > > On 01/25, Oleg Nesterov wrote: > > > > > > On 01/25, Tycho Andersen wrote: > > > > > > > > One of the things I don't like about PIDFD_THREAD is that it's hard to > > > > tell whether an arbitrary thread is a leader or not. Right now we do > > > > it by parsing /proc/pid/status, which shows all the stuff from > > > > do_task_stat() that we don't care about but which is quite expensive > > > > to compute. (Maybe there's a better way?) > > > > > > > > With PIDFD_THREAD we could could do it twice, once with the flag, get > > > > EINVAL, and then do it again. But ideally we wouldn't have to. > > > > > > Too late for me, most probably I misunderstood. > > > > > > If you want the PIDFD_THREAD behaviour, you can always use this flag > > > without any check... > > Sorry, I hadn't read the patch. If it's ok to use PIDFD_THREAD on a > leader, then we can just always specify it. (We don't care about the > behavior of pidfd_poll().) > > > > Could you spell? > > > > Just in case, we can even add PIDFD_AUTO (modulo naming) which acts as > > PIDFD_THREAD if the target task is not a leader or 0 (current behaviour) > > otherwise. Trivial. > > Yep, or given the above, maybe it'll work as-is, thank you. Yes, let's rather do the explicit PIDFD_THREAD.
On 01/26, Christian Brauner wrote: > > No, it doesn't. I'm trying to understand what you are suggesting though. > Are you saying !task || tas->exit_state is enough If PIDFD_THREAD then I think it is enough. Otherwise we still need !task || (exit_state && thread_group_empty) > and we shouldn't use > the helper that was added in commit 38fd525a4c61 ("exit: Factor > thread_group_exited out of pidfd_poll"). If so what does that buy us > open-coding the check instead of using that helper? Is there an actual > bug here? The patch adds the new xxx_exited(task, excl) helper which checks !task || (exit_state && (excl || thread_group_empty)) yes, the naming is not good. > > Well, I didn't say this is a problem. I simply do not know how/why people > > use pidfd_poll(). > > Sorry, I just have a hard time understanding what you wanted then. :) > > "I guess it is too late to change this behavior." made it sound like a) > there's a problem and b) that you would prefer to change behavior. Thus, > it seems that wait(WNOHANG) hanging when a traced leader of an empty > thread-group has exited is a problem in your eyes. Again, I mostly tried to argue with do_notify_pidfd() called by realese_task(). I think that with PIDFD_THREAD set pidfd_poll() should succeed right after the exiting thread becomes a zombie (passes exit_notify), whether it is a leader or not. Let me quote part of my reply to Tycho's patch > + /* > + * If we're not the leader, notify any waiters on our pidfds. Note that > + * we don't want to notify the leader until /everyone/ in the thread > + * group is dead, viz. the condition below. > + * > + * We have to do this here, since __exit_signal() will > + * __unhash_processes(), and break do_notify_pidfd()'s lookup. > + */ > + if (!thread_group_leader(p)) > + do_notify_pidfd(p); This doesn't look consistent. If the task is a group leader do_notify_pidfd() is called by exit_notify() when it becomes a zombie (if no other threads), before it is reaped by its parent (unless autoreap). If it is a sub-thread, it is called by release_task() above. Note that a sub-thread can become a zombie too if it is traced. Not to mention that this is racy. I would not mind if we simply move do_notify_pidfd() from exit_notify() to release_task() and do it regardless of thread_group_leader(). And in some sense this looks more logical to me. But as I said: - I do not know how/why people actually use poll(pidfd) - it is too late to change the current behaviour Sorry for confusion. > I'm not sure whether you remember that but when we originally did the > pidfd work you and I discussed thread support and already decided back > then that having a flag like PIDFD_THREAD would likely be the way to go. All I can recall is that, yes, we had some discussions about pidfd in the past ;) > The PIDFD_THREAD flag would be would be interesting because we could > make pidfd_send_signal() support this flag Agreed, Oleg.
On 01/26, Christian Brauner wrote: > > > --- a/include/uapi/linux/pidfd.h > > +++ b/include/uapi/linux/pidfd.h > > @@ -7,6 +7,7 @@ > > #include <linux/fcntl.h> > > > > /* Flags for pidfd_open(). */ > > -#define PIDFD_NONBLOCK O_NONBLOCK > > +#define PIDFD_NONBLOCK O_NONBLOCK > > +#define PIDFD_THREAD O_EXCL // or anything else not used by anon_inode's > > I like it! > > The only request I would have is to not alias O_EXCL and PIDFD_THREAD. > Because it doesn't map as clearly as NONBLOCK did. But it would be nice to have PIDFD_THREAD in file->f_flags. Where else can we keep it? I chose O_EXCL because it can only be used at open time, it can never be used or changed after anon_inode_getfile(), so we can safely do pidfd_file->f_flags |= PIDFD_THREAD; in __pidfd_prepare() and then check in pidfd_poll/pidfd_send_signal. What do you suggest instead? Oleg.
Hi Oleg, On Thu, Jan 25, 2024 at 03:08:31PM +0100, Oleg Nesterov wrote: > What do you think? Thank you, it passes all my tests. > + /* unnecessary if do_notify_parent() was already called, > + we can do better */ > + do_notify_pidfd(tsk); "do better" here could be something like, diff --git a/kernel/exit.c b/kernel/exit.c index efe8f1d3a6af..7e545393f2f5 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -742,6 +742,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead) bool autoreap; struct task_struct *p, *n; LIST_HEAD(dead); + bool needs_notify = true; write_lock_irq(&tasklist_lock); forget_original_parent(tsk, &dead); @@ -756,16 +757,21 @@ static void exit_notify(struct task_struct *tsk, int group_dead) !ptrace_reparented(tsk) ? tsk->exit_signal : SIGCHLD; autoreap = do_notify_parent(tsk, sig); + needs_notify = false; } else if (thread_group_leader(tsk)) { - autoreap = thread_group_empty(tsk) && - do_notify_parent(tsk, tsk->exit_signal); + autoreap = false; + if (thread_group_empty(tsk)) { + autoreap = do_notify_parent(tsk, tsk->exit_signal); + needs_notify = false; + } } else { autoreap = true; } /* unnecessary if do_notify_parent() was already called, we can do better */ - do_notify_pidfd(tsk); + if (needs_notify) + do_notify_pidfd(tsk); if (autoreap) { tsk->exit_state = EXIT_DEAD; but even with that, there's other calls in the tree to do_notify_parent() that might double notify. This brings up another interesting behavior that I noticed while testing this, if you do a poll() on pidfd, followed quickly by a pidfd_getfd() on the same thread you just got an event on, you can sometimes get an EBADF from __pidfd_fget() instead of the more expected ESRCH higher up the stack. I wonder if it makes sense to abuse ->f_flags to add a PIDFD_NOTIFIED? Then we can refuse further pidfd syscall operations in a sane way, and also "do better" above by checking this flag from do_pidfd_notify() before doing it again? Tycho
Hi Tycho, On 01/26, Tycho Andersen wrote: > > On Thu, Jan 25, 2024 at 03:08:31PM +0100, Oleg Nesterov wrote: > > What do you think? > > Thank you, it passes all my tests. Great, thanks! OK, I'll make v2 on top of the recent "pidfd: cleanup the usage of __pidfd_prepare's flags" but we need to finish our discussion with Christian about the usage of O_EXCL. As for clone(CLONE_PIDFD | CLONE_THREAD), this is trivial but I think this needs another discussion too, lets do this later. > > + /* unnecessary if do_notify_parent() was already called, > > + we can do better */ > > + do_notify_pidfd(tsk); > > "do better" here could be something like, > > [...snip...] No, no, please see below. For the moment, please forget about PIDFD_THREAD, lets discuss the current behaviour. > but even with that, there's other calls in the tree to > do_notify_parent() that might double notify. Yes, and we can't avoid this. Well, perhaps do_notify_parent() can do something like if (ptrace_reparented()) do_notify_pidfd(); so that only the "final" do_notify_parent() does do_notify_pidfd() but this needs another discussion and in fact I don't think this would be right or make much sense. Lets forget this for now. Now. Even without PIDFD_THREAD, I think it makes sense to change do_notify_parent() to do if (thread_group_empty(tsk)) do_notify_pidfd(tsk); thread_group_empty(tsk) can only be true if tsk is a group leader and it is the last thread. And this is exactly what pidfd_poll() currently needs. In fact I'd even prefer to do this in a separate patch for the documentation purposes. Now, PIDFD_THREAD can just add if (!thread_group_empty(tsk)) do_notify_pidfd(tsk); right after "tsk->exit_state = EXIT_ZOMBIE", that is all. This also preserves the do_notify_pidfd/__wake_up_parent ordering. Not that I think this is important, just for consistency. > This brings up another interesting behavior that I noticed while > testing this, if you do a poll() on pidfd, followed quickly by a > pidfd_getfd() on the same thread you just got an event on, you can > sometimes get an EBADF from __pidfd_fget() instead of the more > expected ESRCH higher up the stack. exit_notify() is called after exit_files(). pidfd_getfd() returns ESRCH if the exiting thread completes release_task(), otherwise it returns EBADF because ->files == NULL. This too doesn't really depend on PIDFD_THREAD. > I wonder if it makes sense to abuse ->f_flags to add a PIDFD_NOTIFIED? > Then we can refuse further pidfd syscall operations in a sane way, and But how? We only have "struct pid *", how can we find all files "attached" to this pid? > also "do better" above by checking this flag from do_pidfd_notify() > before doing it again? and even it was possible, I don't think it makes a lot of sense, see also above. but perhaps I understood you... Oleg.
On Fri, Jan 26, 2024 at 03:33:50PM +0100, Oleg Nesterov wrote: > On 01/26, Christian Brauner wrote: > > > > > --- a/include/uapi/linux/pidfd.h > > > +++ b/include/uapi/linux/pidfd.h > > > @@ -7,6 +7,7 @@ > > > #include <linux/fcntl.h> > > > > > > /* Flags for pidfd_open(). */ > > > -#define PIDFD_NONBLOCK O_NONBLOCK > > > +#define PIDFD_NONBLOCK O_NONBLOCK > > > +#define PIDFD_THREAD O_EXCL // or anything else not used by anon_inode's > > > > I like it! > > > > The only request I would have is to not alias O_EXCL and PIDFD_THREAD. > > Because it doesn't map as clearly as NONBLOCK did. > > But it would be nice to have PIDFD_THREAD in file->f_flags. Where else > can we keep it? No, I did just mean that the uapi value doesn't necessarily have to reflect what we do internally. IOW, we can still raise O_EXCL internally in ->f_flags but there's no need to expose it as O_EXCL to userspace. We often have internal and external flag spaces. If you prefer it your way I'm not going argue. > I chose O_EXCL because it can only be used at open time, it can never > be used or changed after anon_inode_getfile(), so we can safely do > > pidfd_file->f_flags |= PIDFD_THREAD; > > in __pidfd_prepare() and then check in pidfd_poll/pidfd_send_signal. > > What do you suggest instead? (Long-term and unrelated to this here, I think we will need to consider not just stashing struct pid in pidfd_file->private_data but instead struct pid with additional data because there's various functionality that users would like that requires additional state to be stored and we can't or don't want to do that in struct pid directly.)
On Sat, Jan 27, 2024 at 11:54:32AM +0100, Oleg Nesterov wrote: > Hi Tycho, > > On 01/26, Tycho Andersen wrote: > > > > On Thu, Jan 25, 2024 at 03:08:31PM +0100, Oleg Nesterov wrote: > > > What do you think? > > > > Thank you, it passes all my tests. > > Great, thanks! > > OK, I'll make v2 on top of the recent > "pidfd: cleanup the usage of __pidfd_prepare's flags" > > but we need to finish our discussion with Christian about the > usage of O_EXCL. Just write the patch exactly like you want. If it's as a little a detail as the uapi flag value we can just always change that when applying. There's no reason to introduce artificial delays because of my preference here.. > > As for clone(CLONE_PIDFD | CLONE_THREAD), this is trivial but > I think this needs another discussion too, lets do this later. > > > > + /* unnecessary if do_notify_parent() was already called, > > > + we can do better */ > > > + do_notify_pidfd(tsk); > > > > "do better" here could be something like, > > > > [...snip...] > > No, no, please see below. > > For the moment, please forget about PIDFD_THREAD, lets discuss > the current behaviour. > > > but even with that, there's other calls in the tree to > > do_notify_parent() that might double notify. > > Yes, and we can't avoid this. Well, perhaps do_notify_parent() > can do something like > > if (ptrace_reparented()) > do_notify_pidfd(); > > so that only the "final" do_notify_parent() does do_notify_pidfd() > but this needs another discussion and in fact I don't think this > would be right or make much sense. Lets forget this for now. > > Now. Even without PIDFD_THREAD, I think it makes sense to change > do_notify_parent() to do > > if (thread_group_empty(tsk)) > do_notify_pidfd(tsk); > > thread_group_empty(tsk) can only be true if tsk is a group leader > and it is the last thread. And this is exactly what pidfd_poll() > currently needs. > > In fact I'd even prefer to do this in a separate patch for the > documentation purposes. > > Now, PIDFD_THREAD can just add > > if (!thread_group_empty(tsk)) > do_notify_pidfd(tsk); > > right after "tsk->exit_state = EXIT_ZOMBIE", that is all. Sounds good. > > This also preserves the do_notify_pidfd/__wake_up_parent ordering. > Not that I think this is important, just for consistency. > > > This brings up another interesting behavior that I noticed while > > testing this, if you do a poll() on pidfd, followed quickly by a > > pidfd_getfd() on the same thread you just got an event on, you can > > sometimes get an EBADF from __pidfd_fget() instead of the more > > expected ESRCH higher up the stack. > > exit_notify() is called after exit_files(). pidfd_getfd() returns > ESRCH if the exiting thread completes release_task(), otherwise it > returns EBADF because ->files == NULL. This too doesn't really > depend on PIDFD_THREAD. > > > I wonder if it makes sense to abuse ->f_flags to add a PIDFD_NOTIFIED? > > Then we can refuse further pidfd syscall operations in a sane way, and > > But how? We only have "struct pid *", how can we find all files > "attached" to this pid? We can't. There's some use-cases that would make this desirable but that would mean we would need another data structure to track this. I once toyed with a patch for this but never got so excited about it to actually finish it.
On Sat, Jan 27, 2024 at 11:54:32AM +0100, Oleg Nesterov wrote: > Hi Tycho, > > On 01/26, Tycho Andersen wrote: > > > > On Thu, Jan 25, 2024 at 03:08:31PM +0100, Oleg Nesterov wrote: > > > What do you think? > > > > Thank you, it passes all my tests. > > Great, thanks! > > OK, I'll make v2 on top of the recent > "pidfd: cleanup the usage of __pidfd_prepare's flags" > > but we need to finish our discussion with Christian about the > usage of O_EXCL. > > As for clone(CLONE_PIDFD | CLONE_THREAD), this is trivial but > I think this needs another discussion too, lets do this later. > > > > + /* unnecessary if do_notify_parent() was already called, > > > + we can do better */ > > > + do_notify_pidfd(tsk); > > > > "do better" here could be something like, > > > > [...snip...] > > No, no, please see below. > > For the moment, please forget about PIDFD_THREAD, lets discuss > the current behaviour. > > > but even with that, there's other calls in the tree to > > do_notify_parent() that might double notify. > > Yes, and we can't avoid this. Well, perhaps do_notify_parent() > can do something like > > if (ptrace_reparented()) > do_notify_pidfd(); > > so that only the "final" do_notify_parent() does do_notify_pidfd() > but this needs another discussion and in fact I don't think this > would be right or make much sense. Lets forget this for now. It seems like (and the current pidfd_test enforces for some cases) we want exactly one notification for a task dying. I don't understand how we guarantee this now, with all of these calls. > > This brings up another interesting behavior that I noticed while > > testing this, if you do a poll() on pidfd, followed quickly by a > > pidfd_getfd() on the same thread you just got an event on, you can > > sometimes get an EBADF from __pidfd_fget() instead of the more > > expected ESRCH higher up the stack. > > exit_notify() is called after exit_files(). pidfd_getfd() returns > ESRCH if the exiting thread completes release_task(), otherwise it > returns EBADF because ->files == NULL. This too doesn't really > depend on PIDFD_THREAD. Yup, understood. It just seems like an inconsistency we might want to fix. > > I wonder if it makes sense to abuse ->f_flags to add a PIDFD_NOTIFIED? > > Then we can refuse further pidfd syscall operations in a sane way, and > > But how? We only have "struct pid *", how can we find all files > "attached" to this pid? Yeah, we'd need some other linkage as Christian points out. But if there is a predicate we can write that says whether this task has been notified or not, it's not necessary. I just don't understand what that is. But maybe your patch will make it clearer. Tycho
On 01/27, Tycho Andersen wrote: > > It seems like (and the current pidfd_test enforces for some cases) Which pidfd_test ? > we > want exactly one notification for a task dying. This can't be right. EVERY user of poll_wait() or wait_event/etc must handle/tolerate the false wakeups. > I don't understand > how we guarantee this now, with all of these calls. I don't understand why do we need or even want to guarantee this. The extra wakeup must be always fine correctness-wise. Sure, it would be nice to avoid the unnecessary wakeups, and perhaps we can change wake_up_all() to pass a key to, say, only wake_up the PIDFD_THREAD waiters from exit_notify(). But certainly this is outside the scope of PIDFD_THREAD change we discuss. The changes in do_notify_parent() (I have already sent the patch) and in exit_notify() (proposed in my previous email) just ensure that, with the minimal changes, we avoid 2 do_notify_pidfd's from the same exit_notify() path. > > exit_notify() is called after exit_files(). pidfd_getfd() returns > > ESRCH if the exiting thread completes release_task(), otherwise it > > returns EBADF because ->files == NULL. This too doesn't really > > depend on PIDFD_THREAD. > > Yup, understood. It just seems like an inconsistency we might want to > fix. Not sure this worth "fixing"... Oleg.
On Sat, Jan 27, 2024 at 05:31:39PM +0100, Oleg Nesterov wrote: > On 01/27, Tycho Andersen wrote: > > > > It seems like (and the current pidfd_test enforces for some cases) > > Which pidfd_test ? I was thinking of poll_pidfd() in pidfd_test.c, but, > > we > > want exactly one notification for a task dying. > > This can't be right. EVERY user of poll_wait() or wait_event/etc > must handle/tolerate the false wakeups. you're right, it doesn't enforce exactly once. > > I don't understand > > how we guarantee this now, with all of these calls. > > I don't understand why do we need or even want to guarantee this. > > The extra wakeup must be always fine correctness-wise. Sure, it > would be nice to avoid the unnecessary wakeups, and perhaps we > can change wake_up_all() to pass a key to, say, only wake_up the > PIDFD_THREAD waiters from exit_notify(). But certainly this is > outside the scope of PIDFD_THREAD change we discuss. > > The changes in do_notify_parent() (I have already sent the patch) and > in exit_notify() (proposed in my previous email) just ensure that, > with the minimal changes, we avoid 2 do_notify_pidfd's from the same > exit_notify() path. Sounds good. > > > exit_notify() is called after exit_files(). pidfd_getfd() returns > > > ESRCH if the exiting thread completes release_task(), otherwise it > > > returns EBADF because ->files == NULL. This too doesn't really > > > depend on PIDFD_THREAD. > > > > Yup, understood. It just seems like an inconsistency we might want to > > fix. > > Not sure this worth "fixing"... Yep, maybe not. Just wanted to point it out. Tycho
On 01/27, Tycho Andersen wrote: > > > > > exit_notify() is called after exit_files(). pidfd_getfd() returns > > > > ESRCH if the exiting thread completes release_task(), otherwise it > > > > returns EBADF because ->files == NULL. This too doesn't really > > > > depend on PIDFD_THREAD. > > > > > > Yup, understood. It just seems like an inconsistency we might want to > > > fix. > > > > Not sure this worth "fixing"... > > Yep, maybe not. Just wanted to point it out. On the second thought I am starting to understand your concern... Indeed, in this case -EBADF is technically correct but it can confuse the user which doesn't or can't know that this task/thread is exiting, because EBADF looks as if the "int fd" argument was wrong. Sorry I missed your point before. Oleg.
On Sat, Jan 27, 2024 at 08:31:27PM +0100, Oleg Nesterov wrote: > On 01/27, Tycho Andersen wrote: > > > > > > > exit_notify() is called after exit_files(). pidfd_getfd() returns > > > > > ESRCH if the exiting thread completes release_task(), otherwise it > > > > > returns EBADF because ->files == NULL. This too doesn't really > > > > > depend on PIDFD_THREAD. > > > > > > > > Yup, understood. It just seems like an inconsistency we might want to > > > > fix. > > > > > > Not sure this worth "fixing"... > > > > Yep, maybe not. Just wanted to point it out. > > On the second thought I am starting to understand your concern... > > Indeed, in this case -EBADF is technically correct but it can confuse > the user which doesn't or can't know that this task/thread is exiting, > because EBADF looks as if the "int fd" argument was wrong. > > Sorry I missed your point before. No worries. I realized it's not so hard to fix with your new xxx_exited() helper from the PIDFD_THREAD patch, so maybe worth cleaning up after all? Tycho
On 01/27, Tycho Andersen wrote: > > On Sat, Jan 27, 2024 at 08:31:27PM +0100, Oleg Nesterov wrote: > > > > On the second thought I am starting to understand your concern... > > > > Indeed, in this case -EBADF is technically correct but it can confuse > > the user which doesn't or can't know that this task/thread is exiting, > > because EBADF looks as if the "int fd" argument was wrong. > > > > Sorry I missed your point before. > > No worries. I realized it's not so hard to fix with your new > xxx_exited() helper from the PIDFD_THREAD patch, so maybe worth > cleaning up after all? OK, lets discuss this later. I'll (hopefully) send v2 on top of pidfd: cleanup the usage of __pidfd_prepare's flags pidfd: don't do_notify_pidfd() if !thread_group_empty() on Monday, will be busy tomorrow (family duties ;) Oleg.
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h index 4b7664c56208..d752f003a69a 100644 --- a/include/linux/sched/signal.h +++ b/include/linux/sched/signal.h @@ -333,6 +333,7 @@ extern int kill_pid_usb_asyncio(int sig, int errno, sigval_t addr, struct pid *, extern int kill_pgrp(struct pid *pid, int sig, int priv); extern int kill_pid(struct pid *pid, int sig, int priv); extern __must_check bool do_notify_parent(struct task_struct *, int); +extern void do_notify_pidfd(struct task_struct *task); extern void __wake_up_parent(struct task_struct *p, struct task_struct *parent); extern void force_sig(int); extern void force_fatal_sig(int); diff --git a/kernel/exit.c b/kernel/exit.c index 3988a02efaef..90117d7861f4 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -256,6 +256,17 @@ void release_task(struct task_struct *p) write_lock_irq(&tasklist_lock); ptrace_release_task(p); thread_pid = get_pid(p->thread_pid); + + /* + * If we're not the leader, notify any waiters on our pidfds. Note that + * we don't want to notify the leader until /everyone/ in the thread + * group is dead, viz. the condition below. + * + * We have to do this here, since __exit_signal() will + * __unhash_processes(), and break do_notify_pidfd()'s lookup. + */ + if (!thread_group_leader(p)) + do_notify_pidfd(p); __exit_signal(p); /* diff --git a/kernel/fork.c b/kernel/fork.c index 47ff3b35352e..44969cd472f0 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2157,8 +2157,6 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re * Allocate a new file that stashes @pid and reserve a new pidfd number in the * caller's file descriptor table. The pidfd is reserved but not installed yet. * - * The helper verifies that @pid is used as a thread group leader. - * * If this function returns successfully the caller is responsible to either * call fd_install() passing the returned pidfd and pidfd file as arguments in * order to install the pidfd into its file descriptor table or they must use @@ -2176,7 +2174,7 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re */ int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret) { - if (!pid || !pid_has_task(pid, PIDTYPE_TGID)) + if (!pid) return -EINVAL; return __pidfd_prepare(pid, flags, ret); diff --git a/kernel/pid.c b/kernel/pid.c index b52b10865454..b55c0adf457b 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -552,11 +552,6 @@ struct pid *pidfd_get_pid(unsigned int fd, unsigned int *flags) * Return the task associated with @pidfd. The function takes a reference on * the returned task. The caller is responsible for releasing that reference. * - * Currently, the process identified by @pidfd is always a thread-group leader. - * This restriction currently exists for all aspects of pidfds including pidfd - * creation (CLONE_PIDFD cannot be used with CLONE_THREAD) and pidfd polling - * (only supports thread group leaders). - * * Return: On success, the task_struct associated with the pidfd. * On error, a negative errno number will be returned. */ @@ -615,11 +610,7 @@ int pidfd_create(struct pid *pid, unsigned int flags) * @flags: flags to pass * * This creates a new pid file descriptor with the O_CLOEXEC flag set for - * the process identified by @pid. Currently, the process identified by - * @pid must be a thread-group leader. This restriction currently exists - * for all aspects of pidfds including pidfd creation (CLONE_PIDFD cannot - * be used with CLONE_THREAD) and pidfd polling (only supports thread group - * leaders). + * the process identified by @pid. * * Return: On success, a cloexec pidfd is returned. * On error, a negative errno number will be returned. diff --git a/kernel/signal.c b/kernel/signal.c index c9c57d053ce4..3e3c9d0fa3a5 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2019,7 +2019,7 @@ int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type) return ret; } -static void do_notify_pidfd(struct task_struct *task) +void do_notify_pidfd(struct task_struct *task) { struct pid *pid;