Message ID | tencent_5CD40341EC9384E9B7CC127EA5CF2655B408@qq.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel+bounces-69227-ouuuleilei=gmail.com@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7300:c619:b0:108:e6aa:91d0 with SMTP id hn25csp726224dyb; Fri, 16 Feb 2024 11:06:15 -0800 (PST) X-Forwarded-Encrypted: i=3; AJvYcCWhN7eeIMe9kiPfisYKKMA9Y2ox/QzXaoa3CQkpc0LifekAPB2BE/EwsQwo/NTw4hPU6T3yIbZUoK7vPhiY3BqVBnX1Dw== X-Google-Smtp-Source: AGHT+IE+3XnLybFSAIBJT1w4n6PTyvii/rmN4/zkybqggMNL5qRCrFIsAsQTgD77jTUwoErlp1aI X-Received: by 2002:a17:906:b898:b0:a3c:b7d5:22dd with SMTP id hb24-20020a170906b89800b00a3cb7d522ddmr4425475ejb.33.1708110374821; Fri, 16 Feb 2024 11:06:14 -0800 (PST) ARC-Seal: i=2; a=rsa-sha256; t=1708110374; cv=pass; d=google.com; s=arc-20160816; b=TpPvZFQYCqmbTC3uwsKnZN71pQ3R8q0nuNr1BRye7Bfa0Mf6hj19v3FphlJTErl7QL K3WUokm7RM6XeurjRolzQz7HezINiKA5zOc0ep+q99LkoUDbuzWqX9NV+2lxs6tbO97n GKpxLLXL7XlooGkiB9Zx6WJbZSiJ6Ag/K8j4HRt2CEO8LQ04m87Y5hXcosz34PBogER1 5ZmIeyGZavlGGv4ENJ9O1wzAHNqZ5uximF4Pp4cCg6UgwGhdIpiS6z+4V4ey91OeXG1T aHSRUMl5G2aYRwYcpGzHvzmoY9hyEOrIQY3DEd/5URHGZrM6sRz64iY/mlDX47jdcwW8 6QTA== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=content-transfer-encoding:mime-version:list-unsubscribe :list-subscribe:list-id:precedence:date:subject:cc:to:from :message-id:dkim-signature; bh=tnXXBc9Dk9zk7+Jx65RRNfhiBjugPoxHOwEKCCZqCk4=; fh=E+fjSkszu/3wtcymzZVrQck5XVLw8YkYnkF0SUB03B0=; b=FlzxppMOog97/mXaOwirX+7ulSTuV6WvYIsSK9kpTVitIBkIAhEsyhr/mHpYBx1Unm lMuvjs6Niog5+wGUao/UyrD53UMItatTG+HqcDkDkDYr/Aeq6TKUd2y9xpBajBl5BZVV XtrFjuVPXq2PLn4sPajLdr/L7WKbweepXOGF9xoJ6ziONGDi8YU+sRcW4v+Qtiy8TYXg Ln5Jz7VXKEN0ZirzM7gyw9R8L1uWaENnTr3JA41PoYdXYOQ5snN4J0UxEj1ubaiPSwo8 TrptTrC35OUrN+A50bobYpt9AECGSOIvuFmgWL2dbCyVni2SJzMf0KtisrUa0avSvq5K EZKw==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@foxmail.com header.s=s201512 header.b=X8qngnjZ; arc=pass (i=1 spf=pass spfdomain=foxmail.com dkim=pass dkdomain=foxmail.com dmarc=pass fromdomain=foxmail.com); spf=pass (google.com: domain of linux-kernel+bounces-69227-ouuuleilei=gmail.com@vger.kernel.org designates 147.75.80.249 as permitted sender) smtp.mailfrom="linux-kernel+bounces-69227-ouuuleilei=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=foxmail.com Received: from am.mirrors.kernel.org (am.mirrors.kernel.org. [147.75.80.249]) by mx.google.com with ESMTPS id o12-20020a17090608cc00b00a3d47b6aa71si169739eje.309.2024.02.16.11.06.14 for <ouuuleilei@gmail.com> (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 16 Feb 2024 11:06:14 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel+bounces-69227-ouuuleilei=gmail.com@vger.kernel.org designates 147.75.80.249 as permitted sender) client-ip=147.75.80.249; Authentication-Results: mx.google.com; dkim=pass header.i=@foxmail.com header.s=s201512 header.b=X8qngnjZ; arc=pass (i=1 spf=pass spfdomain=foxmail.com dkim=pass dkdomain=foxmail.com dmarc=pass fromdomain=foxmail.com); spf=pass (google.com: domain of linux-kernel+bounces-69227-ouuuleilei=gmail.com@vger.kernel.org designates 147.75.80.249 as permitted sender) smtp.mailfrom="linux-kernel+bounces-69227-ouuuleilei=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=foxmail.com Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by am.mirrors.kernel.org (Postfix) with ESMTPS id 722C11F24827 for <ouuuleilei@gmail.com>; Fri, 16 Feb 2024 19:06:14 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 7F779135A74; Fri, 16 Feb 2024 19:05:18 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=foxmail.com header.i=@foxmail.com header.b="X8qngnjZ" Received: from out203-205-221-236.mail.qq.com (out203-205-221-236.mail.qq.com [203.205.221.236]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D6D0577F05 for <linux-kernel@vger.kernel.org>; Fri, 16 Feb 2024 19:05:12 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=203.205.221.236 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1708110315; cv=none; b=EBKHK52G09MplzFka5wvgpy+a19hrT+UjMGtUcb3xce3HSAgaDmksU9CkILI//wP3GkS0SEWnjecmr6kOWU/LENkdFTB2hADtGCueUS/VKGnOPvVCJzmrWtAyO/VF1hBuqcqh92/qrQESnkbpZjhLMwyy4KNyPNuqQZ5uKV5R30= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1708110315; c=relaxed/simple; bh=/CdKXp/M0IoZ35hcQEsRX21/nqp6k3E+d9PhAK6qvWM=; h=Message-ID:From:To:Cc:Subject:Date:MIME-Version; b=ubflAlgVbqw95H5lio7v4wIls8AJQ75EuW854R/izcHd9pXS+ehJ6ljXKUCYA1N8rvE9RDssU7tpOpHsJiEovDHaj3DMg3ohsh3UZCjcq3dXhG6AljppfOoNRo/xwOkS0gEM9M25fdUj+vfPB0fL8j+Y24A7uKPCMUDn4La8OXo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=foxmail.com; spf=pass smtp.mailfrom=foxmail.com; dkim=pass (1024-bit key) header.d=foxmail.com header.i=@foxmail.com header.b=X8qngnjZ; arc=none smtp.client-ip=203.205.221.236 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=foxmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=foxmail.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=foxmail.com; s=s201512; t=1708110004; bh=tnXXBc9Dk9zk7+Jx65RRNfhiBjugPoxHOwEKCCZqCk4=; h=From:To:Cc:Subject:Date; b=X8qngnjZoBF5u5759OHcd3yi6Fg1ZPbr3Bx9B7zpGGg2hSjBrWK0GrqeKFmBHfowu 3NQACBS2DcLK705NkmOl8H+G3KdWF9LRnS9YvdgixEhMcNG/qQrBpd0lxOeCAXznyO ABSVucae6HftGVnh01ZOf7xz8NVOYVwNHx71P7+E= Received: from localhost.localdomain ([2409:8a60:2a63:47a0:34af:c8ac:3595:9c10]) by newxmesmtplogicsvrszb9-0.qq.com (NewEsmtp) with SMTP id 1BA45; Sat, 17 Feb 2024 03:00:00 +0800 X-QQ-mid: xmsmtpt1708110000txpy65wqo Message-ID: <tencent_5CD40341EC9384E9B7CC127EA5CF2655B408@qq.com> X-QQ-XMAILINFO: MdIvn1KXX/GIZaBqqhfY5ZNopXF6lFNATki9PxNHb1s0aqdSaXmHESGehNahBx fvLc2ueJ32JX/nmq2Cg78kEyZZS3UfagYnaTbOc22pjVzu7S+6wJ4FzisZYqIpM2+G5aZ9w+v7ce vfdUWGq1NQwWpwvdgvzA40qfEKJI0wy0HXt6XN/83B8O5YUCVcfNlGuEsW8Myt1nEwp6AHn/8aNm YyLI3xjbMZp9d+hpmAYz1SPJoQIYt3tDrAbJ5u74+dSHfFIekue6PFa5xugw7S+a2Y/+U9OvuF3w Zkdatxzvx6gtPTUvKAxMoll23mEw0oFcl4ZkiDvBV5VBa7XzqZ6E3xUmZloGFRu3Iq8Fsbxd1nSV 7zDaJtimT9jNeT3s8av6gUxOyGYlLN7wU7VxPYI71uzLTHfj4XimKeidRtYdl5zC6Z0n/ylOcJ4P 4zir20b1vpBFhVkrZk6Ssdn7AIE4X7We6gXHYJHAO4TIsxplPGgjoHpHNJ73lohIVpOG9czEAKdr iDtRF+P3/+sbW3ZureR3map+rifPyanpXeQwJJ7UAzizzPMuxlMqeyz8vnenZgUEYpQy/yjOlSpO qJyU12xSV43XBoUeS7aYpbp8yPtUWrnCHhMO4BmM8oqVIsNHCnrgIzp0rt3K0YbJKGNLKwSBP62e bgxMAvHsMgu0ztRcby1LqqRV5G9Yw2jmKANEuem4dvvsod7xLptaOQpbdFPcIiHvicMpQYf6HiiE 1n+BNShPiqnqgxYvMGdK0P9BTEBn3o+PKLm3+3A73OGmZzw/JZ+m3wgmycXBjSn326D1Ex1xStDK BU/VUbaoaAXDAWij2hE0naFkYgdVqISocd4aQRirlV+Oh7kgpEX7ENhZrmrblkI4t/hfPSZ5PT55 +n60dep3xxW7tPzv4PC37yxVdzmy7v5Yr2QuXGcZwMW5grd9Uuj+m58WKvutNTxwe7CjEZMQEhrQ tAgxi3U7ELfAXgpCEQbQEYJgtljn6P9Eb6ghQQTmCI429KRoP5Z7UKtwYho1/A0Ie17mgwcY54KR 3PQazk/g== X-QQ-XMRINFO: M/715EihBoGSf6IYSX1iLFg= From: wenyang.linux@foxmail.com To: Steven Rostedt <rostedt@goodmis.org>, Masami Hiramatsu <mhiramat@kernel.org>, Ingo Molnar <mingo@kernel.org> Cc: Wen Yang <wenyang.linux@foxmail.com>, Oleg Nesterov <oleg@redhat.com>, Mathieu Desnoyers <mathieu.desnoyers@efficios.com>, Mel Gorman <mgorman@techsingularity.net>, Peter Zijlstra <peterz@infradead.org>, linux-kernel@vger.kernel.org Subject: [PATCH] coredump debugging: add a tracepoint to report the coredumping Date: Sat, 17 Feb 2024 02:59:47 +0800 X-OQ-MSGID: <20240216185947.352535-1-wenyang.linux@foxmail.com> X-Mailer: git-send-email 2.25.1 Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: <linux-kernel.vger.kernel.org> List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org> List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1791083544326754860 X-GMAIL-MSGID: 1791083544326754860 |
Series |
coredump debugging: add a tracepoint to report the coredumping
|
|
Commit Message
Wen Yang
Feb. 16, 2024, 6:59 p.m. UTC
From: Wen Yang <wenyang.linux@foxmail.com> Currently coredump_task_exit() takes some time to wait for the generation of the dump file. But if the user-space wants to receive a notification as soon as possible it maybe inconvenient. Add the new trace_sched_process_coredump() into coredump_task_exit(), this way a user-space monitor could easily wait for the exits and potentially make some preparations in advance. Signed-off-by: Wen Yang <wenyang.linux@foxmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org --- include/trace/events/sched.h | 7 +++++++ kernel/exit.c | 1 + 2 files changed, 8 insertions(+)
Comments
On 02/17, wenyang.linux@foxmail.com wrote: > > From: Wen Yang <wenyang.linux@foxmail.com> > > Currently coredump_task_exit() takes some time to wait for the generation > of the dump file. But if the user-space wants to receive a notification > as soon as possible it maybe inconvenient. > > Add the new trace_sched_process_coredump() into coredump_task_exit(), > this way a user-space monitor could easily wait for the exits and > potentially make some preparations in advance. Can't comment, I never know when the new tracepoint will make sense. Stupid question. Can we simply shift trace_sched_process_exit() up before coredump_task_exit() ? Oleg. > Signed-off-by: Wen Yang <wenyang.linux@foxmail.com> > Cc: Oleg Nesterov <oleg@redhat.com> > Cc: Steven Rostedt <rostedt@goodmis.org> > Cc: Masami Hiramatsu <mhiramat@kernel.org> > Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > Cc: Ingo Molnar <mingo@kernel.org> > Cc: Mel Gorman <mgorman@techsingularity.net> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: linux-kernel@vger.kernel.org > --- > include/trace/events/sched.h | 7 +++++++ > kernel/exit.c | 1 + > 2 files changed, 8 insertions(+) > > diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h > index dbb01b4b7451..ce7448065986 100644 > --- a/include/trace/events/sched.h > +++ b/include/trace/events/sched.h > @@ -334,6 +334,13 @@ DEFINE_EVENT(sched_process_template, sched_process_exit, > TP_PROTO(struct task_struct *p), > TP_ARGS(p)); > > +/* > + * Tracepoint for a task coredumping: > + */ > +DEFINE_EVENT(sched_process_template, sched_process_coredump, > + TP_PROTO(struct task_struct *p), > + TP_ARGS(p)); > + > /* > * Tracepoint for waiting on task to unschedule: > */ > diff --git a/kernel/exit.c b/kernel/exit.c > index 493647fd7c07..c11e12d73f4e 100644 > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -425,6 +425,7 @@ static void coredump_task_exit(struct task_struct *tsk) > self.next = xchg(&core_state->dumper.next, &self); > else > self.task = NULL; > + trace_sched_process_coredump(tsk); > /* > * Implies mb(), the result of xchg() must be visible > * to core_state->dumper. > -- > 2.25.1 >
On 2024/2/17 18:49, Oleg Nesterov wrote: > On 02/17, wenyang.linux@foxmail.com wrote: >> From: Wen Yang <wenyang.linux@foxmail.com> >> >> Currently coredump_task_exit() takes some time to wait for the generation >> of the dump file. But if the user-space wants to receive a notification >> as soon as possible it maybe inconvenient. >> >> Add the new trace_sched_process_coredump() into coredump_task_exit(), >> this way a user-space monitor could easily wait for the exits and >> potentially make some preparations in advance. > Can't comment, I never know when the new tracepoint will make sense. > > Stupid question. > Oleg. Thanks for your help. trace_sched_process_exit() is located after the PF_EXITING flag is set, so it could not be moved to there. Could we make the following modifications? diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index dbb01b4b7451..53e9420540dc 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -334,6 +334,13 @@ DEFINE_EVENT(sched_process_template, sched_process_exit, TP_PROTO(struct task_struct *p), TP_ARGS(p)); +/* + * Tracepoint for killing a task by a signal: + */ +DEFINE_EVENT(sched_process_template, sched_process_kill, + TP_PROTO(struct task_struct *p), + TP_ARGS(p)); + /* * Tracepoint for waiting on task to unschedule: */ diff --git a/kernel/signal.c b/kernel/signal.c index 9b40109f0c56..571342799824 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2866,6 +2866,7 @@ bool get_signal(struct ksignal *ksig) * Anything else is fatal, maybe with a core dump. */ current->flags |= PF_SIGNALED; + trace_sched_process_kill(current); if (sig_kernel_coredump(signr)) { if (print_fatal_signals) -- Best wishes, Wen > >> Signed-off-by: Wen Yang <wenyang.linux@foxmail.com> >> Cc: Oleg Nesterov <oleg@redhat.com> >> Cc: Steven Rostedt <rostedt@goodmis.org> >> Cc: Masami Hiramatsu <mhiramat@kernel.org> >> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> >> Cc: Ingo Molnar <mingo@kernel.org> >> Cc: Mel Gorman <mgorman@techsingularity.net> >> Cc: Peter Zijlstra <peterz@infradead.org> >> Cc: linux-kernel@vger.kernel.org >> --- >> include/trace/events/sched.h | 7 +++++++ >> kernel/exit.c | 1 + >> 2 files changed, 8 insertions(+) >> >> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h >> index dbb01b4b7451..ce7448065986 100644 >> --- a/include/trace/events/sched.h >> +++ b/include/trace/events/sched.h >> @@ -334,6 +334,13 @@ DEFINE_EVENT(sched_process_template, sched_process_exit, >> TP_PROTO(struct task_struct *p), >> TP_ARGS(p)); >> >> +/* >> + * Tracepoint for a task coredumping: >> + */ >> +DEFINE_EVENT(sched_process_template, sched_process_coredump, >> + TP_PROTO(struct task_struct *p), >> + TP_ARGS(p)); >> + >> /* >> * Tracepoint for waiting on task to unschedule: >> */ >> diff --git a/kernel/exit.c b/kernel/exit.c >> index 493647fd7c07..c11e12d73f4e 100644 >> --- a/kernel/exit.c >> +++ b/kernel/exit.c >> @@ -425,6 +425,7 @@ static void coredump_task_exit(struct task_struct *tsk) >> self.next = xchg(&core_state->dumper.next, &self); >> else >> self.task = NULL; >> + trace_sched_process_coredump(tsk); >> /* >> * Implies mb(), the result of xchg() must be visible >> * to core_state->dumper. >> -- >> 2.25.1 >>
On 02/18, Wen Yang wrote: > > On 2024/2/17 18:49, Oleg Nesterov wrote: > >On 02/17, wenyang.linux@foxmail.com wrote: > >>From: Wen Yang <wenyang.linux@foxmail.com> > >> > >>Currently coredump_task_exit() takes some time to wait for the generation > >>of the dump file. But if the user-space wants to receive a notification > >>as soon as possible it maybe inconvenient. > >> > >>Add the new trace_sched_process_coredump() into coredump_task_exit(), > >>this way a user-space monitor could easily wait for the exits and > >>potentially make some preparations in advance. > >Can't comment, I never know when the new tracepoint will make sense. > > > >Stupid question. > >Oleg. > > Thanks for your help. Well thanks, but no, I can't help. As I said I can't really comment this patch. > trace_sched_process_exit() is located after the PF_EXITING flag is set Yes, > so it could not be moved to there. Why? DECLARE_EVENT_CLASS(sched_process_template) doesn't report task->flags. Again, again, I am not arguing. But I think that the changelog should explain why we can't move trace_sched_process_exit() in more details. > Could we make the following modifications? .. > > @@ -2866,6 +2866,7 @@ bool get_signal(struct ksignal *ksig) > * Anything else is fatal, maybe with a core dump. > */ > current->flags |= PF_SIGNALED; > + trace_sched_process_kill(current); Another case when I can't comment the intent. We already have trace_signal_deliver() in get_signal(). I'm afraid you need to explain why do you think userspace needs yet another tracepoint. Oleg.
On Sat, 17 Feb 2024 11:49:24 +0100 Oleg Nesterov <oleg@redhat.com> wrote: > On 02/17, wenyang.linux@foxmail.com wrote: > > > > From: Wen Yang <wenyang.linux@foxmail.com> > > > > Currently coredump_task_exit() takes some time to wait for the generation > > of the dump file. But if the user-space wants to receive a notification > > as soon as possible it maybe inconvenient. > > > > Add the new trace_sched_process_coredump() into coredump_task_exit(), > > this way a user-space monitor could easily wait for the exits and > > potentially make some preparations in advance. > > Can't comment, I never know when the new tracepoint will make sense. > > Stupid question. Can we simply shift trace_sched_process_exit() up > before coredump_task_exit() ? Reading the rest of the thread and looking at the code, we do have this: void __noreturn do_exit(long code) { struct task_struct *tsk = current; int group_dead; [...] acct_collect(code, group_dead); if (group_dead) tty_audit_exit(); audit_free(tsk); tsk->exit_code = code; taskstats_exit(tsk, group_dead); exit_mm(); if (group_dead) acct_process(); trace_sched_process_exit(tsk); There's a lot that happens before we trigger the above event. I could imagine that there are users expecting those actions to have taken place by the time the event triggered. Like the "exit_mm()" call, as well as many others. I would be leery of moving that tracepoint. -- Steve
On 02/19, Steven Rostedt wrote: > > On Sat, 17 Feb 2024 11:49:24 +0100 > Oleg Nesterov <oleg@redhat.com> wrote: > > > On 02/17, wenyang.linux@foxmail.com wrote: > > > > > > From: Wen Yang <wenyang.linux@foxmail.com> > > > > > > Currently coredump_task_exit() takes some time to wait for the generation > > > of the dump file. But if the user-space wants to receive a notification > > > as soon as possible it maybe inconvenient. > > > > > > Add the new trace_sched_process_coredump() into coredump_task_exit(), > > > this way a user-space monitor could easily wait for the exits and > > > potentially make some preparations in advance. > > > > Can't comment, I never know when the new tracepoint will make sense. > > > > Stupid question. Can we simply shift trace_sched_process_exit() up > > before coredump_task_exit() ? > > Reading the rest of the thread and looking at the code, we do have this: > > void __noreturn do_exit(long code) > { > struct task_struct *tsk = current; > int group_dead; > > [...] > acct_collect(code, group_dead); > if (group_dead) > tty_audit_exit(); > audit_free(tsk); > > tsk->exit_code = code; > taskstats_exit(tsk, group_dead); > > exit_mm(); > > if (group_dead) > acct_process(); > trace_sched_process_exit(tsk); > > There's a lot that happens before we trigger the above event. and a lot after. To me the current placement of trace_sched_process_exit() looks absolutely random. > I could > imagine that there are users expecting those actions to have taken place by > the time the event triggered. Like the "exit_mm()" call, as well as many > others. > > I would be leery of moving that tracepoint. And I agree. I am always scared of every user-visible change, simply because it is user-visbible. If it was not clear, I didn't try to nack this patch. I simply do not know how people use the tracepoints and for what. Apart from debugging. But if we add the new one into coredump_task_exit(), then we probably want another one in ptrace_event(PTRACE_EVENT_EXIT) ? It too can "take some time" before the exiting task actually exits. So I think this needs some discussion, and the changelog should probably say more. In short: I am glad you are here, I leave this to you and Wen ;) Oleg.
On Mon, 19 Feb 2024 18:00:38 +0100 Oleg Nesterov <oleg@redhat.com> wrote: > > void __noreturn do_exit(long code) > > { > > struct task_struct *tsk = current; > > int group_dead; > > > > [...] > > acct_collect(code, group_dead); > > if (group_dead) > > tty_audit_exit(); > > audit_free(tsk); > > > > tsk->exit_code = code; > > taskstats_exit(tsk, group_dead); > > > > exit_mm(); > > > > if (group_dead) > > acct_process(); > > trace_sched_process_exit(tsk); > > > > There's a lot that happens before we trigger the above event. > > and a lot after. True. There really isn't a meaningful location here is there? I actually use this tracepoint in my pid tracing. The set_ftrace_pid and set_event_pid from /sys/kernel/tracing will add and remove PIDs if the options function-fork or event-fork are set respectively. I hook to the sched_process_fork tracepoint to add new PIDs if the parent pid is already in one of the files, and remove a PID via the sched_process_exit function. Honestly, if anything, it should probably be moved down right next to perf_event_exit_task() (I never understood why perf needed its own hooks and not just use tracepoints). > > To me the current placement of trace_sched_process_exit() looks absolutely > random. Agreed. > > > I could > > imagine that there are users expecting those actions to have taken place by > > the time the event triggered. Like the "exit_mm()" call, as well as many > > others. > > > > I would be leery of moving that tracepoint. > > And I agree. I am always scared of every user-visible change, simply > because it is user-visbible. > > If it was not clear, I didn't try to nack this patch. I simply do not know > how people use the tracepoints and for what. Apart from debugging. > > But if we add the new one into coredump_task_exit(), then we probably want > another one in ptrace_event(PTRACE_EVENT_EXIT) ? It too can "take some time" > before the exiting task actually exits. > > So I think this needs some discussion, and the changelog should probably say > more. > > In short: I am glad you are here, I leave this to you and Wen ;) I still would like to have your input too ;-) -- Steve
On 2024-02-19 12:28, Steven Rostedt wrote: > On Mon, 19 Feb 2024 18:00:38 +0100 > Oleg Nesterov <oleg@redhat.com> wrote: > >>> void __noreturn do_exit(long code) >>> { >>> struct task_struct *tsk = current; >>> int group_dead; >>> >>> [...] >>> acct_collect(code, group_dead); >>> if (group_dead) >>> tty_audit_exit(); >>> audit_free(tsk); >>> >>> tsk->exit_code = code; >>> taskstats_exit(tsk, group_dead); >>> >>> exit_mm(); >>> >>> if (group_dead) >>> acct_process(); >>> trace_sched_process_exit(tsk); >>> >>> There's a lot that happens before we trigger the above event. >> >> and a lot after. > > True. There really isn't a meaningful location here is there? > > I actually use this tracepoint in my pid tracing. > > The set_ftrace_pid and set_event_pid from /sys/kernel/tracing will add and > remove PIDs if the options function-fork or event-fork are set respectively. > > I hook to the sched_process_fork tracepoint to add new PIDs if the parent > pid is already in one of the files, and remove a PID via the > sched_process_exit function. No ? Those hook on sched_process_free, which is the actual point where the task is freed (AFAIR after it's been a zombie and then waited for by another task). kernel/trace/trace_events.c: void trace_event_follow_fork(struct trace_array *tr, bool enable) { if (enable) { register_trace_prio_sched_process_fork(event_filter_pid_sched_process_fork, tr, INT_MIN); register_trace_prio_sched_process_free(event_filter_pid_sched_process_exit, tr, INT_MAX); } else { unregister_trace_sched_process_fork(event_filter_pid_sched_process_fork, tr); unregister_trace_sched_process_free(event_filter_pid_sched_process_exit, tr); } } kernel/trace/ftrace.c: void ftrace_pid_follow_fork(struct trace_array *tr, bool enable) { if (enable) { register_trace_sched_process_fork(ftrace_pid_follow_sched_process_fork, tr); register_trace_sched_process_free(ftrace_pid_follow_sched_process_exit, tr); } else { unregister_trace_sched_process_fork(ftrace_pid_follow_sched_process_fork, tr); unregister_trace_sched_process_free(ftrace_pid_follow_sched_process_exit, tr); } } AFAIU, "sched_process_exit" is issued close to the point where the task exits (it should not go back to userspace after that). "sched_process_free" is done when the task is really being removed. Between "sched_process_exit" and "sched_process_free", the task can still be observed by a trace analysis looking at sched and signal events: it's a zombie at that stage. Thanks, Mathieu
On Mon, 19 Feb 2024 13:01:16 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > No ? Those hook on sched_process_free, which is the actual point where the > task is freed (AFAIR after it's been a zombie and then waited for by another > task). Bah, you're correct. It *used to* be attached to sched_process_exit, and the function callback still has that name. It was commit afcab63665742 ("tracing: Use trace_sched_process_free() instead of exit() for pid tracing") that changed it. > > AFAIU, "sched_process_exit" is issued close to the point where the task exits > (it should not go back to userspace after that). "sched_process_free" is done > when the task is really being removed. > > Between "sched_process_exit" and "sched_process_free", the task can still be > observed by a trace analysis looking at sched and signal events: it's a zombie at > that stage. Right, thanks for reminding me what I did ;-) I guess I'm starting to get to "that age". -- Steve
On 2024/2/20 01:00, Oleg Nesterov wrote: > On 02/19, Steven Rostedt wrote: >> >> On Sat, 17 Feb 2024 11:49:24 +0100 >> Oleg Nesterov <oleg@redhat.com> wrote: >> >>> On 02/17, wenyang.linux@foxmail.com wrote: >>>> >>>> From: Wen Yang <wenyang.linux@foxmail.com> >>>> >>>> Currently coredump_task_exit() takes some time to wait for the generation >>>> of the dump file. But if the user-space wants to receive a notification >>>> as soon as possible it maybe inconvenient. >>>> >>>> Add the new trace_sched_process_coredump() into coredump_task_exit(), >>>> this way a user-space monitor could easily wait for the exits and >>>> potentially make some preparations in advance. >>> >>> Can't comment, I never know when the new tracepoint will make sense. >>> >>> Stupid question. Can we simply shift trace_sched_process_exit() up >>> before coredump_task_exit() ? >> >> Reading the rest of the thread and looking at the code, we do have this: >> >> void __noreturn do_exit(long code) >> { >> struct task_struct *tsk = current; >> int group_dead; >> >> [...] >> acct_collect(code, group_dead); >> if (group_dead) >> tty_audit_exit(); >> audit_free(tsk); >> >> tsk->exit_code = code; >> taskstats_exit(tsk, group_dead); >> >> exit_mm(); >> >> if (group_dead) >> acct_process(); >> trace_sched_process_exit(tsk); >> >> There's a lot that happens before we trigger the above event. > > and a lot after. > > To me the current placement of looks absolutely > random. > >> I could >> imagine that there are users expecting those actions to have taken place by >> the time the event triggered. Like the "exit_mm()" call, as well as many >> others. >> >> I would be leery of moving that tracepoint. > > And I agree. I am always scared of every user-visible change, simply > because it is user-visbible. > > If it was not clear, I didn't try to nack this patch. I simply do not know > how people use the tracepoints and for what. Apart from debugging. > > But if we add the new one into coredump_task_exit(), then we probably want > another one in ptrace_event(PTRACE_EVENT_EXIT) ? It too can "take some time" > before the exiting task actually exits. > > So I think this needs some discussion, and the changelog should probably say > more. > > In short: I am glad you are here, I leave this to you and Wen ;) > Thank you Oleg, thank you Steven, We could first put trace_sched_process_exit() aside and take a look at these three patches: 2d4bcf886e42f0f4846a3d9bdc3a90d278903a2e ("exit: Remove profile_task_exit & profile_munmap") 586b58cac8b4683eb58a1446fbc399de18974e40 (“exit: Move preemption fixup up, move blocking operations down”) And the original: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 Could we add a new TP similar to profile_task_exit()? -- Best wishes, Wen
On 2024/2/20 01:28, Steven Rostedt wrote: > On Mon, 19 Feb 2024 18:00:38 +0100 > Oleg Nesterov <oleg@redhat.com> wrote: > >>> void __noreturn do_exit(long code) >>> { >>> struct task_struct *tsk = current; >>> int group_dead; >>> >>> [...] >>> acct_collect(code, group_dead); >>> if (group_dead) >>> tty_audit_exit(); >>> audit_free(tsk); >>> >>> tsk->exit_code = code; >>> taskstats_exit(tsk, group_dead); >>> >>> exit_mm(); >>> >>> if (group_dead) >>> acct_process(); >>> trace_sched_process_exit(tsk); >>> >>> There's a lot that happens before we trigger the above event. >> >> and a lot after. > > True. There really isn't a meaningful location here is there? > > I actually use this tracepoint in my pid tracing. > > The set_ftrace_pid and set_event_pid from /sys/kernel/tracing will add and > remove PIDs if the options function-fork or event-fork are set respectively. > > I hook to the sched_process_fork tracepoint to add new PIDs if the parent > pid is already in one of the files, and remove a PID via the > sched_process_exit function. > > Honestly, if anything, it should probably be moved down right next to > () (I never understood why needed its own hooks > and not just use tracepoints). > Perhaps it's just because perf appeared earlier, and it doesn't rely on TRACEPOINTS. It is indeed reasonable to replace perf_event_exit_task() with TRACEPOINT, and we are willing to try to modify it. It will require some work and time. -- Best wishes, Wen >> >> To me the current placement of trace_sched_process_exit() looks absolutely >> random. > > Agreed. > >> >>> I could >>> imagine that there are users expecting those actions to have taken place by >>> the time the event triggered. Like the "exit_mm()" call, as well as many >>> others. >>> >>> I would be leery of moving that tracepoint. >> >> And I agree. I am always scared of every user-visible change, simply >> because it is user-visbible. >> >> If it was not clear, I didn't try to nack this patch. I simply do not know >> how people use the tracepoints and for what. Apart from debugging. >> >> But if we add the new one into coredump_task_exit(), then we probably want >> another one in ptrace_event(PTRACE_EVENT_EXIT) ? It too can "take some time" >> before the exiting task actually exits. >> >> So I think this needs some discussion, and the changelog should probably say >> more. >> >> In short: I am glad you are here, I leave this to you and Wen ;) > > I still would like to have your input too ;-) > > -- Steve
On Wed, 21 Feb 2024 23:45:58 +0800 Wen Yang <wenyang.linux@foxmail.com> wrote: > Thank you Oleg, thank you Steven, > > We could first put trace_sched_process_exit() aside and take a look at > these three patches: > > 2d4bcf886e42f0f4846a3d9bdc3a90d278903a2e ("exit: Remove > profile_task_exit & profile_munmap") > > 586b58cac8b4683eb58a1446fbc399de18974e40 (“exit: Move preemption fixup > up, move blocking operations down”) > > And the original: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 > > > Could we add a new TP similar to profile_task_exit()? I have no problem with adding that. But others may have other opinions on the subject matter. -- Steve
On Thu, 22 Feb 2024 00:00:55 +0800 Wen Yang <wenyang.linux@foxmail.com> wrote: > > Honestly, if anything, it should probably be moved down right next to > > () (I never understood why needed its own hooks > > and not just use tracepoints). > > > > Perhaps it's just because perf appeared earlier, and it doesn't rely on > TRACEPOINTS. tracepoints existed in 2008 and perf was added in 2009, so time frame was not a factor. > It is indeed reasonable to replace perf_event_exit_task() with > TRACEPOINT, and we are willing to try to modify it. It will require some > work and time. I think that would be worth while, but I guess Peter will need to approve that. -- Steve
On Mon, 19 Feb 2024 13:01:16 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > Between "sched_process_exit" and "sched_process_free", the task can still be > observed by a trace analysis looking at sched and signal events: it's a zombie at > that stage. Looking at the history of this tracepoint, it was added in 2008 by commit 0a16b60758433 ("tracing, sched: LTTng instrumentation - scheduler"). Hmm, LLTng? I wonder who the author was? Author: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> :-D Mathieu, I would say it's your call on where the tracepoint can be located. You added it, you own it! -- Steve
On 2024-02-23 09:26, Steven Rostedt wrote: > On Mon, 19 Feb 2024 13:01:16 -0500 > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > >> Between "sched_process_exit" and "sched_process_free", the task can still be >> observed by a trace analysis looking at sched and signal events: it's a zombie at >> that stage. > > Looking at the history of this tracepoint, it was added in 2008 by commit > 0a16b60758433 ("tracing, sched: LTTng instrumentation - scheduler"). > Hmm, LLTng? I wonder who the author was? [ common typo: LLTng -> LTTng ;-) ] > > Author: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> > > :-D > > Mathieu, I would say it's your call on where the tracepoint can be located. > You added it, you own it! Wow! that's now 16 years ago :) I've checked with Matthew Khouzam (maintainer of Trace Compass) which care about this tracepoint, and we have not identified any significant impact of moving it on its model of the scheduler, other than slightly changing its timing. I've also checked quickly in lttng-analyses and have not found any code that care about its specific placement. So I would say go ahead and move it earlier in do_exit(), it's fine by me. If you are interested in a bit of archeology, "sched_process_free" originated from my ltt-experimental 0.1.99.13 kernel patch against 2.6.12-rc4-mm2 back in September 2005 (that's 19 years ago). It was a precursor to the LTTng 0.x kernel patchset. https://lttng.org/files/ltt-experimental/patch-2.6.12-rc4-mm2-ltt-exp-0.1.99.13.gz Index: kernel/exit.c =================================================================== --- a/kernel/exit.c (.../trunk/kernel/linux-2.6.12-rc4-mm2) (revision 41) +++ b/kernel/exit.c (.../branches/mathieu/linux-2.6.12-rc4-mm2) (revision 41) @@ -4,6 +4,7 @@ * Copyright (C) 1991, 1992 Linus Torvalds */ +#include <linux/ltt/ltt-facility-process.h> #include <linux/config.h> #include <linux/mm.h> #include <linux/slab.h> @@ -55,6 +56,7 @@ static void __unhash_process(struct task } REMOVE_LINKS(p); + trace_process_free(p->pid); } void release_task(struct task_struct * p) @@ -832,6 +834,8 @@ fastcall NORET_TYPE void do_exit(long co } exit_mm(tsk); + trace_process_exit(tsk->pid); + exit_sem(tsk); __exit_files(tsk); __exit_fs(tsk); This was a significant improvement over the prior LTT which only had the equivalent of "sched_process_exit", which caused issues with the Linux scheduler model in LTTV due to zombie processes. Here is where it appeared in LTT back in 1999: http://www.opersys.com/ftp/pub/LTT/TracePackage-0.9.0.tgz patch-ltt-2.2.13-991118 diff -urN linux/kernel/exit.c linux-2.2.13/kernel/exit.c --- linux/kernel/exit.c Tue Oct 19 20:14:02 1999 +++ linux-2.2.13/kernel/exit.c Sun Nov 7 23:49:17 1999 @@ -14,6 +14,8 @@ #include <linux/acct.h> #endif +#include <linux/trace.h> + #include <asm/uaccess.h> #include <asm/pgtable.h> #include <asm/mmu_context.h> @@ -386,6 +388,8 @@ del_timer(&tsk->real_timer); end_bh_atomic(); + TRACE_PROCESS(TRACE_EV_PROCESS_EXIT, 0, 0); + lock_kernel(); fake_volatile: #ifdef CONFIG_BSD_PROCESS_ACCT And it was moved to its current location (after exit_mm()) a bit later (2001): http://www.opersys.com/ftp/pub/LTT/TraceToolkit-0.9.5pre2.tgz Patches/patch-ltt-linux-2.4.5-vanilla-010909-1.10 diff -urN linux/kernel/exit.c /ext2/home/karym/kernel/linux-2.4.5/kernel/exit.c --- linux/kernel/exit.c Fri May 4 17:44:06 2001 +++ /ext2/home/karym/kernel/linux-2.4.5/kernel/exit.c Wed Jun 20 12:39:24 2001 @@ -14,6 +14,8 @@ #include <linux/acct.h> #endif +#include <linux/trace.h> + #include <asm/uaccess.h> #include <asm/pgtable.h> #include <asm/mmu_context.h> @@ -439,6 +441,8 @@ #endif __exit_mm(tsk); + TRACE_PROCESS(TRACE_EV_PROCESS_EXIT, 0, 0); + lock_kernel(); sem_exit(); __exit_files(tsk); So this sched_process_exit placement was actually decided by Karim Yaghmour back in the LTT days (2001). I don't think he will mind us moving it around some 23 years later. ;) Thanks, Mathieu
On Fri, 23 Feb 2024 11:54:30 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > So this sched_process_exit placement was actually decided > by Karim Yaghmour back in the LTT days (2001). I don't think > he will mind us moving it around some 23 years later. ;) And I wonder how many people are involved in this that are younger than that change :-p I'm starting to feel really old. -- Steve
On 2/23/24 11:54, Mathieu Desnoyers wrote: .. > So this sched_process_exit placement was actually decided > by Karim Yaghmour back in the LTT days (2001). I don't think > he will mind us moving it around some 23 years later. ;) Gee ... the shadows are longer than I thought :) It's all yours. You guys have been doing a fantastic job and I'm happy to be on the consumer side of it all these days :D Cheers,
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index dbb01b4b7451..ce7448065986 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -334,6 +334,13 @@ DEFINE_EVENT(sched_process_template, sched_process_exit, TP_PROTO(struct task_struct *p), TP_ARGS(p)); +/* + * Tracepoint for a task coredumping: + */ +DEFINE_EVENT(sched_process_template, sched_process_coredump, + TP_PROTO(struct task_struct *p), + TP_ARGS(p)); + /* * Tracepoint for waiting on task to unschedule: */ diff --git a/kernel/exit.c b/kernel/exit.c index 493647fd7c07..c11e12d73f4e 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -425,6 +425,7 @@ static void coredump_task_exit(struct task_struct *tsk) self.next = xchg(&core_state->dumper.next, &self); else self.task = NULL; + trace_sched_process_coredump(tsk); /* * Implies mb(), the result of xchg() must be visible * to core_state->dumper.