coredump debugging: add a tracepoint to report the coredumping

Message ID tencent_5CD40341EC9384E9B7CC127EA5CF2655B408@qq.com
State New
Headers
Series coredump debugging: add a tracepoint to report the coredumping |

Commit Message

Wen Yang Feb. 16, 2024, 6:59 p.m. UTC
  From: Wen Yang <wenyang.linux@foxmail.com>

Currently coredump_task_exit() takes some time to wait for the generation
of the dump file. But if the user-space wants to receive a notification
as soon as possible it maybe inconvenient.

Add the new trace_sched_process_coredump() into coredump_task_exit(),
this way a user-space monitor could easily wait for the exits and
potentially make some preparations in advance.

Signed-off-by: Wen Yang <wenyang.linux@foxmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
---
 include/trace/events/sched.h | 7 +++++++
 kernel/exit.c                | 1 +
 2 files changed, 8 insertions(+)
  

Comments

Oleg Nesterov Feb. 17, 2024, 10:49 a.m. UTC | #1
On 02/17, wenyang.linux@foxmail.com wrote:
>
> From: Wen Yang <wenyang.linux@foxmail.com>
>
> Currently coredump_task_exit() takes some time to wait for the generation
> of the dump file. But if the user-space wants to receive a notification
> as soon as possible it maybe inconvenient.
>
> Add the new trace_sched_process_coredump() into coredump_task_exit(),
> this way a user-space monitor could easily wait for the exits and
> potentially make some preparations in advance.

Can't comment, I never know when the new tracepoint will make sense.

Stupid question. Can we simply shift trace_sched_process_exit() up
before coredump_task_exit() ?

Oleg.


> Signed-off-by: Wen Yang <wenyang.linux@foxmail.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: linux-kernel@vger.kernel.org
> ---
>  include/trace/events/sched.h | 7 +++++++
>  kernel/exit.c                | 1 +
>  2 files changed, 8 insertions(+)
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index dbb01b4b7451..ce7448065986 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -334,6 +334,13 @@ DEFINE_EVENT(sched_process_template, sched_process_exit,
>  	     TP_PROTO(struct task_struct *p),
>  	     TP_ARGS(p));
>
> +/*
> + * Tracepoint for a task coredumping:
> + */
> +DEFINE_EVENT(sched_process_template, sched_process_coredump,
> +	     TP_PROTO(struct task_struct *p),
> +	     TP_ARGS(p));
> +
>  /*
>   * Tracepoint for waiting on task to unschedule:
>   */
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 493647fd7c07..c11e12d73f4e 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -425,6 +425,7 @@ static void coredump_task_exit(struct task_struct *tsk)
>  			self.next = xchg(&core_state->dumper.next, &self);
>  		else
>  			self.task = NULL;
> +		trace_sched_process_coredump(tsk);
>  		/*
>  		 * Implies mb(), the result of xchg() must be visible
>  		 * to core_state->dumper.
> --
> 2.25.1
>
  
Wen Yang Feb. 18, 2024, 3:16 p.m. UTC | #2
On 2024/2/17 18:49, Oleg Nesterov wrote:
> On 02/17, wenyang.linux@foxmail.com wrote:
>> From: Wen Yang <wenyang.linux@foxmail.com>
>>
>> Currently coredump_task_exit() takes some time to wait for the generation
>> of the dump file. But if the user-space wants to receive a notification
>> as soon as possible it maybe inconvenient.
>>
>> Add the new trace_sched_process_coredump() into coredump_task_exit(),
>> this way a user-space monitor could easily wait for the exits and
>> potentially make some preparations in advance.
> Can't comment, I never know when the new tracepoint will make sense.
>
> Stupid question.
> Oleg.

Thanks for your help.

trace_sched_process_exit() is located after the PF_EXITING flag is set,
so it could not be moved to there.
Could we make the following modifications?

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index dbb01b4b7451..53e9420540dc 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -334,6 +334,13 @@ DEFINE_EVENT(sched_process_template, 
sched_process_exit,
              TP_PROTO(struct task_struct *p),
              TP_ARGS(p));

+/*
+ * Tracepoint for killing a task by a signal:
+ */
+DEFINE_EVENT(sched_process_template, sched_process_kill,
+            TP_PROTO(struct task_struct *p),
+            TP_ARGS(p));
+
  /*
   * Tracepoint for waiting on task to unschedule:
   */
diff --git a/kernel/signal.c b/kernel/signal.c
index 9b40109f0c56..571342799824 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2866,6 +2866,7 @@ bool get_signal(struct ksignal *ksig)
                  * Anything else is fatal, maybe with a core dump.
                  */
                 current->flags |= PF_SIGNALED;
+               trace_sched_process_kill(current);

                 if (sig_kernel_coredump(signr)) {
                         if (print_fatal_signals)

--

Best wishes,

Wen


>
>> Signed-off-by: Wen Yang <wenyang.linux@foxmail.com>
>> Cc: Oleg Nesterov <oleg@redhat.com>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Cc: Masami Hiramatsu <mhiramat@kernel.org>
>> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> Cc: Ingo Molnar <mingo@kernel.org>
>> Cc: Mel Gorman <mgorman@techsingularity.net>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: linux-kernel@vger.kernel.org
>> ---
>>   include/trace/events/sched.h | 7 +++++++
>>   kernel/exit.c                | 1 +
>>   2 files changed, 8 insertions(+)
>>
>> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
>> index dbb01b4b7451..ce7448065986 100644
>> --- a/include/trace/events/sched.h
>> +++ b/include/trace/events/sched.h
>> @@ -334,6 +334,13 @@ DEFINE_EVENT(sched_process_template, sched_process_exit,
>>   	     TP_PROTO(struct task_struct *p),
>>   	     TP_ARGS(p));
>>
>> +/*
>> + * Tracepoint for a task coredumping:
>> + */
>> +DEFINE_EVENT(sched_process_template, sched_process_coredump,
>> +	     TP_PROTO(struct task_struct *p),
>> +	     TP_ARGS(p));
>> +
>>   /*
>>    * Tracepoint for waiting on task to unschedule:
>>    */
>> diff --git a/kernel/exit.c b/kernel/exit.c
>> index 493647fd7c07..c11e12d73f4e 100644
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -425,6 +425,7 @@ static void coredump_task_exit(struct task_struct *tsk)
>>   			self.next = xchg(&core_state->dumper.next, &self);
>>   		else
>>   			self.task = NULL;
>> +		trace_sched_process_coredump(tsk);
>>   		/*
>>   		 * Implies mb(), the result of xchg() must be visible
>>   		 * to core_state->dumper.
>> --
>> 2.25.1
>>
  
Oleg Nesterov Feb. 18, 2024, 5:52 p.m. UTC | #3
On 02/18, Wen Yang wrote:
>
> On 2024/2/17 18:49, Oleg Nesterov wrote:
> >On 02/17, wenyang.linux@foxmail.com wrote:
> >>From: Wen Yang <wenyang.linux@foxmail.com>
> >>
> >>Currently coredump_task_exit() takes some time to wait for the generation
> >>of the dump file. But if the user-space wants to receive a notification
> >>as soon as possible it maybe inconvenient.
> >>
> >>Add the new trace_sched_process_coredump() into coredump_task_exit(),
> >>this way a user-space monitor could easily wait for the exits and
> >>potentially make some preparations in advance.
> >Can't comment, I never know when the new tracepoint will make sense.
> >
> >Stupid question.
> >Oleg.
>
> Thanks for your help.

Well thanks, but no, I can't help. As I said I can't really comment this
patch.

> trace_sched_process_exit() is located after the PF_EXITING flag is set

Yes,

> so it could not be moved to there.

Why? DECLARE_EVENT_CLASS(sched_process_template) doesn't report task->flags.

Again, again, I am not arguing. But I think that the changelog should
explain why we can't move trace_sched_process_exit() in more details.

> Could we make the following modifications?
..
>
> @@ -2866,6 +2866,7 @@ bool get_signal(struct ksignal *ksig)
>                  * Anything else is fatal, maybe with a core dump.
>                  */
>                 current->flags |= PF_SIGNALED;
> +               trace_sched_process_kill(current);

Another case when I can't comment the intent.

We already have trace_signal_deliver() in get_signal(). I'm afraid you
need to explain why do you think userspace needs yet another tracepoint.

Oleg.
  
Steven Rostedt Feb. 19, 2024, 4:29 p.m. UTC | #4
On Sat, 17 Feb 2024 11:49:24 +0100
Oleg Nesterov <oleg@redhat.com> wrote:

> On 02/17, wenyang.linux@foxmail.com wrote:
> >
> > From: Wen Yang <wenyang.linux@foxmail.com>
> >
> > Currently coredump_task_exit() takes some time to wait for the generation
> > of the dump file. But if the user-space wants to receive a notification
> > as soon as possible it maybe inconvenient.
> >
> > Add the new trace_sched_process_coredump() into coredump_task_exit(),
> > this way a user-space monitor could easily wait for the exits and
> > potentially make some preparations in advance.  
> 
> Can't comment, I never know when the new tracepoint will make sense.
> 
> Stupid question. Can we simply shift trace_sched_process_exit() up
> before coredump_task_exit() ?

Reading the rest of the thread and looking at the code, we do have this:

void __noreturn do_exit(long code)
{
	struct task_struct *tsk = current;
	int group_dead;

[...]
	acct_collect(code, group_dead);
	if (group_dead)
		tty_audit_exit();
	audit_free(tsk);

	tsk->exit_code = code;
	taskstats_exit(tsk, group_dead);

	exit_mm();

	if (group_dead)
		acct_process();
	trace_sched_process_exit(tsk);

There's a lot that happens before we trigger the above event. I could
imagine that there are users expecting those actions to have taken place by
the time the event triggered. Like the "exit_mm()" call, as well as many
others.

I would be leery of moving that tracepoint.

-- Steve
  
Oleg Nesterov Feb. 19, 2024, 5 p.m. UTC | #5
On 02/19, Steven Rostedt wrote:
>
> On Sat, 17 Feb 2024 11:49:24 +0100
> Oleg Nesterov <oleg@redhat.com> wrote:
>
> > On 02/17, wenyang.linux@foxmail.com wrote:
> > >
> > > From: Wen Yang <wenyang.linux@foxmail.com>
> > >
> > > Currently coredump_task_exit() takes some time to wait for the generation
> > > of the dump file. But if the user-space wants to receive a notification
> > > as soon as possible it maybe inconvenient.
> > >
> > > Add the new trace_sched_process_coredump() into coredump_task_exit(),
> > > this way a user-space monitor could easily wait for the exits and
> > > potentially make some preparations in advance.
> >
> > Can't comment, I never know when the new tracepoint will make sense.
> >
> > Stupid question. Can we simply shift trace_sched_process_exit() up
> > before coredump_task_exit() ?
>
> Reading the rest of the thread and looking at the code, we do have this:
>
> void __noreturn do_exit(long code)
> {
> 	struct task_struct *tsk = current;
> 	int group_dead;
>
> [...]
> 	acct_collect(code, group_dead);
> 	if (group_dead)
> 		tty_audit_exit();
> 	audit_free(tsk);
>
> 	tsk->exit_code = code;
> 	taskstats_exit(tsk, group_dead);
>
> 	exit_mm();
>
> 	if (group_dead)
> 		acct_process();
> 	trace_sched_process_exit(tsk);
>
> There's a lot that happens before we trigger the above event.

and a lot after.

To me the current placement of trace_sched_process_exit() looks absolutely
random.

> I could
> imagine that there are users expecting those actions to have taken place by
> the time the event triggered. Like the "exit_mm()" call, as well as many
> others.
>
> I would be leery of moving that tracepoint.

And I agree. I am always scared of every user-visible change, simply
because it is user-visbible.

If it was not clear, I didn't try to nack this patch. I simply do not know
how people use the tracepoints and for what. Apart from debugging.

But if we add the new one into coredump_task_exit(), then we probably want
another one in ptrace_event(PTRACE_EVENT_EXIT) ? It too can "take some time"
before the exiting task actually exits.

So I think this needs some discussion, and the changelog should probably say
more.

In short: I am glad you are here, I leave this to you and Wen ;)

Oleg.
  
Steven Rostedt Feb. 19, 2024, 5:28 p.m. UTC | #6
On Mon, 19 Feb 2024 18:00:38 +0100
Oleg Nesterov <oleg@redhat.com> wrote:

> > void __noreturn do_exit(long code)
> > {
> > 	struct task_struct *tsk = current;
> > 	int group_dead;
> >
> > [...]
> > 	acct_collect(code, group_dead);
> > 	if (group_dead)
> > 		tty_audit_exit();
> > 	audit_free(tsk);
> >
> > 	tsk->exit_code = code;
> > 	taskstats_exit(tsk, group_dead);
> >
> > 	exit_mm();
> >
> > 	if (group_dead)
> > 		acct_process();
> > 	trace_sched_process_exit(tsk);
> >
> > There's a lot that happens before we trigger the above event.  
> 
> and a lot after.

True. There really isn't a meaningful location here is there?

I actually use this tracepoint in my pid tracing.

The set_ftrace_pid and set_event_pid from /sys/kernel/tracing will add and
remove PIDs if the options function-fork or event-fork are set respectively.

I hook to the sched_process_fork tracepoint to add new PIDs if the parent
pid is already in one of the files, and remove a PID via the
sched_process_exit function.

Honestly, if anything, it should probably be moved down right next to
perf_event_exit_task() (I never understood why perf needed its own hooks
and not just use tracepoints).

> 
> To me the current placement of trace_sched_process_exit() looks absolutely
> random.

Agreed.

> 
> > I could
> > imagine that there are users expecting those actions to have taken place by
> > the time the event triggered. Like the "exit_mm()" call, as well as many
> > others.
> >
> > I would be leery of moving that tracepoint.  
> 
> And I agree. I am always scared of every user-visible change, simply
> because it is user-visbible.
> 
> If it was not clear, I didn't try to nack this patch. I simply do not know
> how people use the tracepoints and for what. Apart from debugging.
> 
> But if we add the new one into coredump_task_exit(), then we probably want
> another one in ptrace_event(PTRACE_EVENT_EXIT) ? It too can "take some time"
> before the exiting task actually exits.
> 
> So I think this needs some discussion, and the changelog should probably say
> more.
> 
> In short: I am glad you are here, I leave this to you and Wen ;)

I still would like to have your input too ;-)

-- Steve
  
Mathieu Desnoyers Feb. 19, 2024, 6:01 p.m. UTC | #7
On 2024-02-19 12:28, Steven Rostedt wrote:
> On Mon, 19 Feb 2024 18:00:38 +0100
> Oleg Nesterov <oleg@redhat.com> wrote:
> 
>>> void __noreturn do_exit(long code)
>>> {
>>> 	struct task_struct *tsk = current;
>>> 	int group_dead;
>>>
>>> [...]
>>> 	acct_collect(code, group_dead);
>>> 	if (group_dead)
>>> 		tty_audit_exit();
>>> 	audit_free(tsk);
>>>
>>> 	tsk->exit_code = code;
>>> 	taskstats_exit(tsk, group_dead);
>>>
>>> 	exit_mm();
>>>
>>> 	if (group_dead)
>>> 		acct_process();
>>> 	trace_sched_process_exit(tsk);
>>>
>>> There's a lot that happens before we trigger the above event.
>>
>> and a lot after.
> 
> True. There really isn't a meaningful location here is there?
> 
> I actually use this tracepoint in my pid tracing.
> 
> The set_ftrace_pid and set_event_pid from /sys/kernel/tracing will add and
> remove PIDs if the options function-fork or event-fork are set respectively.
> 
> I hook to the sched_process_fork tracepoint to add new PIDs if the parent
> pid is already in one of the files, and remove a PID via the
> sched_process_exit function.

No ? Those hook on sched_process_free, which is the actual point where the
task is freed (AFAIR after it's been a zombie and then waited for by another
task).

kernel/trace/trace_events.c:

void trace_event_follow_fork(struct trace_array *tr, bool enable)
{
         if (enable) {
                 register_trace_prio_sched_process_fork(event_filter_pid_sched_process_fork,
                                                        tr, INT_MIN);
                 register_trace_prio_sched_process_free(event_filter_pid_sched_process_exit,
                                                        tr, INT_MAX);
         } else {
                 unregister_trace_sched_process_fork(event_filter_pid_sched_process_fork,
                                                     tr);
                 unregister_trace_sched_process_free(event_filter_pid_sched_process_exit,
                                                     tr);
         }
}

kernel/trace/ftrace.c:

void ftrace_pid_follow_fork(struct trace_array *tr, bool enable)
{
         if (enable) {
                 register_trace_sched_process_fork(ftrace_pid_follow_sched_process_fork,
                                                   tr);
                 register_trace_sched_process_free(ftrace_pid_follow_sched_process_exit,
                                                   tr);
         } else {
                 unregister_trace_sched_process_fork(ftrace_pid_follow_sched_process_fork,
                                                     tr);
                 unregister_trace_sched_process_free(ftrace_pid_follow_sched_process_exit,
                                                     tr);
         }
}

AFAIU, "sched_process_exit" is issued close to the point where the task exits
(it should not go back to userspace after that). "sched_process_free" is done
when the task is really being removed.

Between "sched_process_exit" and "sched_process_free", the task can still be
observed by a trace analysis looking at sched and signal events: it's a zombie at
that stage.

Thanks,

Mathieu
  
Steven Rostedt Feb. 20, 2024, 3:08 p.m. UTC | #8
On Mon, 19 Feb 2024 13:01:16 -0500
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> No ? Those hook on sched_process_free, which is the actual point where the
> task is freed (AFAIR after it's been a zombie and then waited for by another
> task).

Bah, you're correct. It *used to* be attached to sched_process_exit, and
the function callback still has that name. It was commit afcab63665742
("tracing: Use trace_sched_process_free() instead of exit() for pid
tracing") that changed it.
> 
> AFAIU, "sched_process_exit" is issued close to the point where the task exits
> (it should not go back to userspace after that). "sched_process_free" is done
> when the task is really being removed.
> 
> Between "sched_process_exit" and "sched_process_free", the task can still be
> observed by a trace analysis looking at sched and signal events: it's a zombie at
> that stage.

Right, thanks for reminding me what I did ;-) I guess I'm starting to get to "that age".

-- Steve
  
Wen Yang Feb. 21, 2024, 3:45 p.m. UTC | #9
On 2024/2/20 01:00, Oleg Nesterov wrote:
> On 02/19, Steven Rostedt wrote:
>>
>> On Sat, 17 Feb 2024 11:49:24 +0100
>> Oleg Nesterov <oleg@redhat.com> wrote:
>>
>>> On 02/17, wenyang.linux@foxmail.com wrote:
>>>>
>>>> From: Wen Yang <wenyang.linux@foxmail.com>
>>>>
>>>> Currently coredump_task_exit() takes some time to wait for the generation
>>>> of the dump file. But if the user-space wants to receive a notification
>>>> as soon as possible it maybe inconvenient.
>>>>
>>>> Add the new trace_sched_process_coredump() into coredump_task_exit(),
>>>> this way a user-space monitor could easily wait for the exits and
>>>> potentially make some preparations in advance.
>>>
>>> Can't comment, I never know when the new tracepoint will make sense.
>>>
>>> Stupid question. Can we simply shift trace_sched_process_exit() up
>>> before coredump_task_exit() ?
>>
>> Reading the rest of the thread and looking at the code, we do have this:
>>
>> void __noreturn do_exit(long code)
>> {
>> 	struct task_struct *tsk = current;
>> 	int group_dead;
>>
>> [...]
>> 	acct_collect(code, group_dead);
>> 	if (group_dead)
>> 		tty_audit_exit();
>> 	audit_free(tsk);
>>
>> 	tsk->exit_code = code;
>> 	taskstats_exit(tsk, group_dead);
>>
>> 	exit_mm();
>>
>> 	if (group_dead)
>> 		acct_process();
>> 	trace_sched_process_exit(tsk);
>>
>> There's a lot that happens before we trigger the above event.
> 
> and a lot after.
> 
> To me the current placement of  looks absolutely
> random.
> 
>> I could
>> imagine that there are users expecting those actions to have taken place by
>> the time the event triggered. Like the "exit_mm()" call, as well as many
>> others.
>>
>> I would be leery of moving that tracepoint.
> 
> And I agree. I am always scared of every user-visible change, simply
> because it is user-visbible.
> 
> If it was not clear, I didn't try to nack this patch. I simply do not know
> how people use the tracepoints and for what. Apart from debugging.
> 
> But if we add the new one into coredump_task_exit(), then we probably want
> another one in ptrace_event(PTRACE_EVENT_EXIT) ? It too can "take some time"
> before the exiting task actually exits.
> 
> So I think this needs some discussion, and the changelog should probably say
> more.
> 
> In short: I am glad you are here, I leave this to you and Wen ;)
> 

Thank you Oleg, thank you Steven,

We could first put trace_sched_process_exit() aside and take a look at 
these three patches:

2d4bcf886e42f0f4846a3d9bdc3a90d278903a2e ("exit: Remove 
profile_task_exit & profile_munmap")

586b58cac8b4683eb58a1446fbc399de18974e40 (“exit: Move preemption fixup 
up, move blocking operations down”)

And the original: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2


Could we add a new TP similar to profile_task_exit()?

--
Best wishes,
Wen
  
Wen Yang Feb. 21, 2024, 4 p.m. UTC | #10
On 2024/2/20 01:28, Steven Rostedt wrote:
> On Mon, 19 Feb 2024 18:00:38 +0100
> Oleg Nesterov <oleg@redhat.com> wrote:
> 
>>> void __noreturn do_exit(long code)
>>> {
>>> 	struct task_struct *tsk = current;
>>> 	int group_dead;
>>>
>>> [...]
>>> 	acct_collect(code, group_dead);
>>> 	if (group_dead)
>>> 		tty_audit_exit();
>>> 	audit_free(tsk);
>>>
>>> 	tsk->exit_code = code;
>>> 	taskstats_exit(tsk, group_dead);
>>>
>>> 	exit_mm();
>>>
>>> 	if (group_dead)
>>> 		acct_process();
>>> 	trace_sched_process_exit(tsk);
>>>
>>> There's a lot that happens before we trigger the above event.
>>
>> and a lot after.
> 
> True. There really isn't a meaningful location here is there?
> 
> I actually use this tracepoint in my pid tracing.
> 
> The set_ftrace_pid and set_event_pid from /sys/kernel/tracing will add and
> remove PIDs if the options function-fork or event-fork are set respectively.
> 
> I hook to the sched_process_fork tracepoint to add new PIDs if the parent
> pid is already in one of the files, and remove a PID via the
> sched_process_exit function.
> 
> Honestly, if anything, it should probably be moved down right next to
> () (I never understood why  needed its own hooks
> and not just use tracepoints).
> 

Perhaps it's just because perf appeared earlier, and it doesn't rely on 
TRACEPOINTS.
It is indeed reasonable to replace perf_event_exit_task() with 
TRACEPOINT, and we are willing to try to modify it. It will require some 
work and time.

--
Best wishes,
Wen

>>
>> To me the current placement of trace_sched_process_exit() looks absolutely
>> random.
> 
> Agreed.
> 
>>
>>> I could
>>> imagine that there are users expecting those actions to have taken place by
>>> the time the event triggered. Like the "exit_mm()" call, as well as many
>>> others.
>>>
>>> I would be leery of moving that tracepoint.
>>
>> And I agree. I am always scared of every user-visible change, simply
>> because it is user-visbible.
>>
>> If it was not clear, I didn't try to nack this patch. I simply do not know
>> how people use the tracepoints and for what. Apart from debugging.
>>
>> But if we add the new one into coredump_task_exit(), then we probably want
>> another one in ptrace_event(PTRACE_EVENT_EXIT) ? It too can "take some time"
>> before the exiting task actually exits.
>>
>> So I think this needs some discussion, and the changelog should probably say
>> more.
>>
>> In short: I am glad you are here, I leave this to you and Wen ;)
> 
> I still would like to have your input too ;-)
> 
> -- Steve
  
Steven Rostedt Feb. 21, 2024, 5:48 p.m. UTC | #11
On Wed, 21 Feb 2024 23:45:58 +0800
Wen Yang <wenyang.linux@foxmail.com> wrote:

> Thank you Oleg, thank you Steven,
> 
> We could first put trace_sched_process_exit() aside and take a look at 
> these three patches:
> 
> 2d4bcf886e42f0f4846a3d9bdc3a90d278903a2e ("exit: Remove 
> profile_task_exit & profile_munmap")
> 
> 586b58cac8b4683eb58a1446fbc399de18974e40 (“exit: Move preemption fixup 
> up, move blocking operations down”)
> 
> And the original: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2
> 
> 
> Could we add a new TP similar to profile_task_exit()?

I have no problem with adding that. But others may have other opinions on
the subject matter.

-- Steve
  
Steven Rostedt Feb. 21, 2024, 5:54 p.m. UTC | #12
On Thu, 22 Feb 2024 00:00:55 +0800
Wen Yang <wenyang.linux@foxmail.com> wrote:

> > Honestly, if anything, it should probably be moved down right next to
> > () (I never understood why  needed its own hooks
> > and not just use tracepoints).
> >   
> 
> Perhaps it's just because perf appeared earlier, and it doesn't rely on 
> TRACEPOINTS.

tracepoints existed in 2008 and perf was added in 2009, so time frame was
not a factor.

> It is indeed reasonable to replace perf_event_exit_task() with 
> TRACEPOINT, and we are willing to try to modify it. It will require some 
> work and time.

I think that would be worth while, but I guess Peter will need to approve that.

-- Steve
  
Steven Rostedt Feb. 23, 2024, 2:26 p.m. UTC | #13
On Mon, 19 Feb 2024 13:01:16 -0500
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> Between "sched_process_exit" and "sched_process_free", the task can still be
> observed by a trace analysis looking at sched and signal events: it's a zombie at
> that stage.

Looking at the history of this tracepoint, it was added in 2008 by commit
0a16b60758433 ("tracing, sched: LTTng instrumentation - scheduler").
Hmm, LLTng? I wonder who the author was?

  Author: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>

 :-D

Mathieu, I would say it's your call on where the tracepoint can be located.
You added it, you own it!

-- Steve
  
Mathieu Desnoyers Feb. 23, 2024, 4:54 p.m. UTC | #14
On 2024-02-23 09:26, Steven Rostedt wrote:
> On Mon, 19 Feb 2024 13:01:16 -0500
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>> Between "sched_process_exit" and "sched_process_free", the task can still be
>> observed by a trace analysis looking at sched and signal events: it's a zombie at
>> that stage.
> 
> Looking at the history of this tracepoint, it was added in 2008 by commit
> 0a16b60758433 ("tracing, sched: LTTng instrumentation - scheduler").
> Hmm, LLTng? I wonder who the author was?

[ common typo: LLTng -> LTTng ;-) ]

> 
>    Author: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> 
>   :-D
> 
> Mathieu, I would say it's your call on where the tracepoint can be located.
> You added it, you own it!

Wow! that's now 16 years ago :)

I've checked with Matthew Khouzam (maintainer of Trace Compass)
which care about this tracepoint, and we have not identified any
significant impact of moving it on its model of the scheduler, other
than slightly changing its timing.

I've also checked quickly in lttng-analyses and have not found
any code that care about its specific placement.

So I would say go ahead and move it earlier in do_exit(), it's
fine by me.

If you are interested in a bit of archeology, "sched_process_free"
originated from my ltt-experimental 0.1.99.13 kernel patch against
2.6.12-rc4-mm2 back in September 2005 (that's 19 years ago). It was
a precursor to the LTTng 0.x kernel patchset.

https://lttng.org/files/ltt-experimental/patch-2.6.12-rc4-mm2-ltt-exp-0.1.99.13.gz

Index: kernel/exit.c
===================================================================
--- a/kernel/exit.c	(.../trunk/kernel/linux-2.6.12-rc4-mm2)	(revision 41)
+++ b/kernel/exit.c	(.../branches/mathieu/linux-2.6.12-rc4-mm2)	(revision 41)
@@ -4,6 +4,7 @@
   *  Copyright (C) 1991, 1992  Linus Torvalds
   */
  
+#include <linux/ltt/ltt-facility-process.h>
  #include <linux/config.h>
  #include <linux/mm.h>
  #include <linux/slab.h>
@@ -55,6 +56,7 @@ static void __unhash_process(struct task
  	}
  
  	REMOVE_LINKS(p);
+  trace_process_free(p->pid);
  }
  
  void release_task(struct task_struct * p)
@@ -832,6 +834,8 @@ fastcall NORET_TYPE void do_exit(long co
  	}
  	exit_mm(tsk);
  
+	trace_process_exit(tsk->pid);
+
  	exit_sem(tsk);
  	__exit_files(tsk);
  	__exit_fs(tsk);

This was a significant improvement over the prior LTT which only
had the equivalent of "sched_process_exit", which caused issues
with the Linux scheduler model in LTTV due to zombie processes.

Here is where it appeared in LTT back in 1999:

http://www.opersys.com/ftp/pub/LTT/TracePackage-0.9.0.tgz

patch-ltt-2.2.13-991118

diff -urN linux/kernel/exit.c linux-2.2.13/kernel/exit.c
--- linux/kernel/exit.c	Tue Oct 19 20:14:02 1999
+++ linux-2.2.13/kernel/exit.c	Sun Nov  7 23:49:17 1999
@@ -14,6 +14,8 @@
  #include <linux/acct.h>
  #endif
  
+#include <linux/trace.h>
+
  #include <asm/uaccess.h>
  #include <asm/pgtable.h>
  #include <asm/mmu_context.h>
@@ -386,6 +388,8 @@
  	del_timer(&tsk->real_timer);
  	end_bh_atomic();
  
+	TRACE_PROCESS(TRACE_EV_PROCESS_EXIT, 0, 0);
+
  	lock_kernel();
  fake_volatile:
  #ifdef CONFIG_BSD_PROCESS_ACCT

And it was moved to its current location (after exit_mm()) a bit
later (2001):

http://www.opersys.com/ftp/pub/LTT/TraceToolkit-0.9.5pre2.tgz

Patches/patch-ltt-linux-2.4.5-vanilla-010909-1.10

diff -urN linux/kernel/exit.c /ext2/home/karym/kernel/linux-2.4.5/kernel/exit.c
--- linux/kernel/exit.c	Fri May  4 17:44:06 2001
+++ /ext2/home/karym/kernel/linux-2.4.5/kernel/exit.c	Wed Jun 20 12:39:24 2001
@@ -14,6 +14,8 @@
  #include <linux/acct.h>
  #endif
  
+#include <linux/trace.h>
+
  #include <asm/uaccess.h>
  #include <asm/pgtable.h>
  #include <asm/mmu_context.h>
@@ -439,6 +441,8 @@
  #endif
  	__exit_mm(tsk);
  
+	TRACE_PROCESS(TRACE_EV_PROCESS_EXIT, 0, 0);
+
  	lock_kernel();
  	sem_exit();
  	__exit_files(tsk);

So this sched_process_exit placement was actually decided
by Karim Yaghmour back in the LTT days (2001). I don't think
he will mind us moving it around some 23 years later. ;)

Thanks,

Mathieu
  
Steven Rostedt Feb. 23, 2024, 5:03 p.m. UTC | #15
On Fri, 23 Feb 2024 11:54:30 -0500
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> So this sched_process_exit placement was actually decided
> by Karim Yaghmour back in the LTT days (2001). I don't think
> he will mind us moving it around some 23 years later. ;)

And I wonder how many people are involved in this that are younger than
that change :-p

I'm starting to feel really old.

-- Steve
  
Karim Yaghmour Feb. 23, 2024, 5:12 p.m. UTC | #16
On 2/23/24 11:54, Mathieu Desnoyers wrote:
..
> So this sched_process_exit placement was actually decided
> by Karim Yaghmour back in the LTT days (2001). I don't think
> he will mind us moving it around some 23 years later. ;)

Gee ... the shadows are longer than I thought :)

It's all yours. You guys have been doing a fantastic job and I'm happy 
to be on the consumer side of it all these days :D

Cheers,
  

Patch

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index dbb01b4b7451..ce7448065986 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -334,6 +334,13 @@  DEFINE_EVENT(sched_process_template, sched_process_exit,
 	     TP_PROTO(struct task_struct *p),
 	     TP_ARGS(p));
 
+/*
+ * Tracepoint for a task coredumping:
+ */
+DEFINE_EVENT(sched_process_template, sched_process_coredump,
+	     TP_PROTO(struct task_struct *p),
+	     TP_ARGS(p));
+
 /*
  * Tracepoint for waiting on task to unschedule:
  */
diff --git a/kernel/exit.c b/kernel/exit.c
index 493647fd7c07..c11e12d73f4e 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -425,6 +425,7 @@  static void coredump_task_exit(struct task_struct *tsk)
 			self.next = xchg(&core_state->dumper.next, &self);
 		else
 			self.task = NULL;
+		trace_sched_process_coredump(tsk);
 		/*
 		 * Implies mb(), the result of xchg() must be visible
 		 * to core_state->dumper.