[RFC,v3,4/6] sched, tracing: reorganize fields of switch event struct

Message ID 20230801090124.8050-5-zegao@tencent.com
State New
Headers
Series add to report task state in symbolic chars from sched tracepoint |

Commit Message

Ze Gao Aug. 1, 2023, 9:01 a.m. UTC
  Report priorities in 'short' and prev_state in 'int' to save
some buffer space. And also reorder the fields so that we take
struct alignment into consideration to make the record compact.

Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Ze Gao <zegao@tencent.com>
---
 include/trace/events/sched.h | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)
  

Comments

Peter Zijlstra Aug. 1, 2023, 11:46 a.m. UTC | #1
On Tue, Aug 01, 2023 at 05:01:22PM +0800, Ze Gao wrote:
> Report priorities in 'short' and prev_state in 'int' to save
> some buffer space. And also reorder the fields so that we take
> struct alignment into consideration to make the record compact.
> 
> Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>

I don't see a single line describing the effort you've done to audit
consumers of this tracepoint.

*IF* you're wanting to break this tracepoint ABI, because seriously
that's what it is, then you get to invest the time and effort to audit
the users.
  
Ze Gao Aug. 1, 2023, 1:16 p.m. UTC | #2
Oops, I thought sending this series for RFC is the "effort" you mean
to audit the users :/

Correct me if I'm making stupid moves here and enlighten me what
I should do furthermore to audit the users.

Thanks,
Ze

On Tue, Aug 1, 2023 at 7:47 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Aug 01, 2023 at 05:01:22PM +0800, Ze Gao wrote:
> > Report priorities in 'short' and prev_state in 'int' to save
> > some buffer space. And also reorder the fields so that we take
> > struct alignment into consideration to make the record compact.
> >
> > Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
>
> I don't see a single line describing the effort you've done to audit
> consumers of this tracepoint.
>
> *IF* you're wanting to break this tracepoint ABI, because seriously
> that's what it is, then you get to invest the time and effort to audit
> the users.
  
Steven Rostedt Aug. 1, 2023, 2:16 p.m. UTC | #3
On Tue,  1 Aug 2023 17:01:22 +0800
Ze Gao <zegao2021@gmail.com> wrote:

> Report priorities in 'short' and prev_state in 'int' to save
> some buffer space. And also reorder the fields so that we take
> struct alignment into consideration to make the record compact.
> 
> Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> Signed-off-by: Ze Gao <zegao@tencent.com>

I'd swap this patch with patch 3. That is, make the field changes first.
I'd like this to get in regardless of if the state_char is accepted. We may
want to get this in first to see if there's any regressions before we add a
state_char.

-- Steve


> ---
>  include/trace/events/sched.h | 22 +++++++++++-----------
>  1 file changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index e507901bcab8..36863ffb00c6 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -188,7 +188,7 @@ DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new,
>  	     TP_ARGS(p));
>  
>  #ifdef CREATE_TRACE_POINTS
> -static inline long __trace_sched_switch_state(bool preempt,
> +static inline int __trace_sched_switch_state(bool preempt,
>  					      unsigned int prev_state,
>  					      struct task_struct *p)
>  {
> @@ -251,25 +251,25 @@ TRACE_EVENT(sched_switch,
>  	TP_ARGS(preempt, prev, next, prev_state),
>  
>  	TP_STRUCT__entry(
> -		__array(	char,	prev_comm,	TASK_COMM_LEN	)
>  		__field(	pid_t,	prev_pid			)
> -		__field(	int,	prev_prio			)
> -		__field(	long,	prev_state			)
> -		__field(	char,	prev_state_char			)
> -		__array(	char,	next_comm,	TASK_COMM_LEN	)
>  		__field(	pid_t,	next_pid			)
> -		__field(	int,	next_prio			)
> +		__field(	short,	prev_prio			)
> +		__field(	short,	next_prio			)
> +		__field(	int,	prev_state			)
> +		__array(	char,	prev_comm,	TASK_COMM_LEN	)
> +		__array(	char,	next_comm,	TASK_COMM_LEN	)
> +		__field(	char,	prev_state_char			)
>  	),
>  
>  	TP_fast_assign(
> -		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
>  		__entry->prev_pid		= prev->pid;
> -		__entry->prev_prio		= prev->prio;
> +		__entry->next_pid		= next->pid;
> +		__entry->prev_prio		= (short) prev->prio;
> +		__entry->next_prio		= (short) next->prio;
>  		__entry->prev_state		= __trace_sched_switch_state(preempt, prev_state, prev);
>  		__entry->prev_state_char	= __trace_sched_switch_state_char(preempt, prev_state, prev);
>  		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
> -		__entry->next_pid		= next->pid;
> -		__entry->next_prio		= next->prio;
> +		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
>  		/* XXX SCHED_DEADLINE */
>  	),
>
  
Steven Rostedt Aug. 1, 2023, 2:33 p.m. UTC | #4
On Tue, 1 Aug 2023 13:46:50 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Aug 01, 2023 at 05:01:22PM +0800, Ze Gao wrote:
> > Report priorities in 'short' and prev_state in 'int' to save
> > some buffer space. And also reorder the fields so that we take
> > struct alignment into consideration to make the record compact.
> > 
> > Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>  
> 
> I don't see a single line describing the effort you've done to audit
> consumers of this tracepoint.
> 
> *IF* you're wanting to break this tracepoint ABI, because seriously
> that's what it is, then you get to invest the time and effort to audit
> the users.

The known major users that I am aware of is raesdaemon,
powertop/latencytop, perf, trace-cmd and some bpf tools. The bpf tooling is
known to update per kernel. The others all use libtraceevent that can
handle this change.

What other tools are there? There's Perfetto, but it also looks at tracefs
to examine where the values are. There's LTTng, but I believe it uses the
raw tracepoint directly and doesn't look at the layout of the ftrace/perf
buffers.

All other tooling I am slightly aware of uses libtracefs and libtraceveent,
as I've been giving many talks on how to use those libraries.

-- Steve
  
Ze Gao Aug. 2, 2023, 3:06 a.m. UTC | #5
Thanks for clarifying this ! Steven. This is really helpful.

Regards,
Ze

On Tue, Aug 1, 2023 at 10:33 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 1 Aug 2023 13:46:50 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Tue, Aug 01, 2023 at 05:01:22PM +0800, Ze Gao wrote:
> > > Report priorities in 'short' and prev_state in 'int' to save
> > > some buffer space. And also reorder the fields so that we take
> > > struct alignment into consideration to make the record compact.
> > >
> > > Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> >
> > I don't see a single line describing the effort you've done to audit
> > consumers of this tracepoint.
> >
> > *IF* you're wanting to break this tracepoint ABI, because seriously
> > that's what it is, then you get to invest the time and effort to audit
> > the users.
>
> The known major users that I am aware of is raesdaemon,
> powertop/latencytop, perf, trace-cmd and some bpf tools. The bpf tooling is
> known to update per kernel. The others all use libtraceevent that can
> handle this change.
>
> What other tools are there? There's Perfetto, but it also looks at tracefs
> to examine where the values are. There's LTTng, but I believe it uses the
> raw tracepoint directly and doesn't look at the layout of the ftrace/perf
> buffers.
>
> All other tooling I am slightly aware of uses libtracefs and libtraceveent,
> as I've been giving many talks on how to use those libraries.
>
> -- Steve
  
Ze Gao Aug. 2, 2023, 3:07 a.m. UTC | #6
Fair point, will do it in v4 as well.

Thanks,
Ze

On Tue, Aug 1, 2023 at 10:16 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue,  1 Aug 2023 17:01:22 +0800
> Ze Gao <zegao2021@gmail.com> wrote:
>
> > Report priorities in 'short' and prev_state in 'int' to save
> > some buffer space. And also reorder the fields so that we take
> > struct alignment into consideration to make the record compact.
> >
> > Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> > Signed-off-by: Ze Gao <zegao@tencent.com>
>
> I'd swap this patch with patch 3. That is, make the field changes first.
> I'd like this to get in regardless of if the state_char is accepted. We may
> want to get this in first to see if there's any regressions before we add a
> state_char.
>
> -- Steve
>
>
> > ---
> >  include/trace/events/sched.h | 22 +++++++++++-----------
> >  1 file changed, 11 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> > index e507901bcab8..36863ffb00c6 100644
> > --- a/include/trace/events/sched.h
> > +++ b/include/trace/events/sched.h
> > @@ -188,7 +188,7 @@ DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new,
> >            TP_ARGS(p));
> >
> >  #ifdef CREATE_TRACE_POINTS
> > -static inline long __trace_sched_switch_state(bool preempt,
> > +static inline int __trace_sched_switch_state(bool preempt,
> >                                             unsigned int prev_state,
> >                                             struct task_struct *p)
> >  {
> > @@ -251,25 +251,25 @@ TRACE_EVENT(sched_switch,
> >       TP_ARGS(preempt, prev, next, prev_state),
> >
> >       TP_STRUCT__entry(
> > -             __array(        char,   prev_comm,      TASK_COMM_LEN   )
> >               __field(        pid_t,  prev_pid                        )
> > -             __field(        int,    prev_prio                       )
> > -             __field(        long,   prev_state                      )
> > -             __field(        char,   prev_state_char                 )
> > -             __array(        char,   next_comm,      TASK_COMM_LEN   )
> >               __field(        pid_t,  next_pid                        )
> > -             __field(        int,    next_prio                       )
> > +             __field(        short,  prev_prio                       )
> > +             __field(        short,  next_prio                       )
> > +             __field(        int,    prev_state                      )
> > +             __array(        char,   prev_comm,      TASK_COMM_LEN   )
> > +             __array(        char,   next_comm,      TASK_COMM_LEN   )
> > +             __field(        char,   prev_state_char                 )
> >       ),
> >
> >       TP_fast_assign(
> > -             memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
> >               __entry->prev_pid               = prev->pid;
> > -             __entry->prev_prio              = prev->prio;
> > +             __entry->next_pid               = next->pid;
> > +             __entry->prev_prio              = (short) prev->prio;
> > +             __entry->next_prio              = (short) next->prio;
> >               __entry->prev_state             = __trace_sched_switch_state(preempt, prev_state, prev);
> >               __entry->prev_state_char        = __trace_sched_switch_state_char(preempt, prev_state, prev);
> >               memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
> > -             __entry->next_pid               = next->pid;
> > -             __entry->next_prio              = next->prio;
> > +             memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
> >               /* XXX SCHED_DEADLINE */
> >       ),
> >
>
  

Patch

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index e507901bcab8..36863ffb00c6 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -188,7 +188,7 @@  DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new,
 	     TP_ARGS(p));
 
 #ifdef CREATE_TRACE_POINTS
-static inline long __trace_sched_switch_state(bool preempt,
+static inline int __trace_sched_switch_state(bool preempt,
 					      unsigned int prev_state,
 					      struct task_struct *p)
 {
@@ -251,25 +251,25 @@  TRACE_EVENT(sched_switch,
 	TP_ARGS(preempt, prev, next, prev_state),
 
 	TP_STRUCT__entry(
-		__array(	char,	prev_comm,	TASK_COMM_LEN	)
 		__field(	pid_t,	prev_pid			)
-		__field(	int,	prev_prio			)
-		__field(	long,	prev_state			)
-		__field(	char,	prev_state_char			)
-		__array(	char,	next_comm,	TASK_COMM_LEN	)
 		__field(	pid_t,	next_pid			)
-		__field(	int,	next_prio			)
+		__field(	short,	prev_prio			)
+		__field(	short,	next_prio			)
+		__field(	int,	prev_state			)
+		__array(	char,	prev_comm,	TASK_COMM_LEN	)
+		__array(	char,	next_comm,	TASK_COMM_LEN	)
+		__field(	char,	prev_state_char			)
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
 		__entry->prev_pid		= prev->pid;
-		__entry->prev_prio		= prev->prio;
+		__entry->next_pid		= next->pid;
+		__entry->prev_prio		= (short) prev->prio;
+		__entry->next_prio		= (short) next->prio;
 		__entry->prev_state		= __trace_sched_switch_state(preempt, prev_state, prev);
 		__entry->prev_state_char	= __trace_sched_switch_state_char(preempt, prev_state, prev);
 		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
-		__entry->next_pid		= next->pid;
-		__entry->next_prio		= next->prio;
+		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
 		/* XXX SCHED_DEADLINE */
 	),