[RFC,v2,0/3] report task state in symbolic chars from sched tracepoint

Message ID 20230726121618.19198-1-zegao@tencent.com
Headers
Series report task state in symbolic chars from sched tracepoint |

Message

Ze Gao July 26, 2023, 12:16 p.m. UTC
  This is the 2nd attempt to fix the report task state issue in sched
tracepint, here is the first version:

https://lore.kernel.org/linux-trace-kernel/20230725072254.32045-1-zegao@tencent.com

Against v1, add a new var to report task state in symbolic char instead
of replacing the old one and to not to break anything.

--

In the status quo, we should see three different outcomes of the reported
sched-out task state from perf-script, perf-sched-timehist, and Tp_printk
of tracepoint sched_switch.  And it's not hard to figure out that the
former two are built upon the third one, and the reason why we see this
inconsistency is that the former two does not catch up with the internal
change of reported task state definitions as the kernel evolves.

IMHO, exporting internal representations of task state in the tracepoint
sched_switch is not a good practice and not encouraged at all, which can
easily break userspace tools that relies on it. Especially when tracepoints
are massively used in many observability tools nowadays due to its stable
nature, which makes them no longer used for debug only purpose and we
should be careful to decide what ought to be reported to userspace and what
ought not.

Therefore, to fix the issues mentioned above for good, instead of choosing
I proposed to add a new variable to report task state in sched_switch with
a symbolic character along with the old hardcoded value, and save the
further processing of userspace tools and spare them from knowing
implementation details in the kernel.

After this patch seires, we report 'RSDTtXZPI' the same as in procfs, plus
a 'p' which denotes PREEMP_ACTIVE and is used for sched_switch tracepoint only.

Reviews welcome!

Regards,

Ze

Ze Gao (2):
  sched, tracing: add to report task state in symbolic chars
  perf sched: use the new prev_state_char instead in tracepoint
    sched_switch

 include/trace/events/sched.h | 60 +++++++++++++++++++++---------------
 tools/perf/builtin-sched.c   | 57 ++++++----------------------------
 2 files changed, 45 insertions(+), 72 deletions(-)

Ze Gao (1):
  libtraceevent: use the new prev_state_char instead in tracepoint
    sched_switch

 plugins/plugin_sched_switch.c | 29 ++++-------------------------
 1 file changed, 4 insertions(+), 25 deletions(-)
  

Comments

Ian Rogers July 28, 2023, 5:29 p.m. UTC | #1
On Wed, Jul 26, 2023 at 5:16 AM Ze Gao <zegao2021@gmail.com> wrote:
>
>
> This is the 2nd attempt to fix the report task state issue in sched
> tracepint, here is the first version:
>
> https://lore.kernel.org/linux-trace-kernel/20230725072254.32045-1-zegao@tencent.com
>
> Against v1, add a new var to report task state in symbolic char instead
> of replacing the old one and to not to break anything.
>
> --
>
> In the status quo, we should see three different outcomes of the reported
> sched-out task state from perf-script, perf-sched-timehist, and Tp_printk
> of tracepoint sched_switch.  And it's not hard to figure out that the
> former two are built upon the third one, and the reason why we see this
> inconsistency is that the former two does not catch up with the internal
> change of reported task state definitions as the kernel evolves.
>
> IMHO, exporting internal representations of task state in the tracepoint
> sched_switch is not a good practice and not encouraged at all, which can
> easily break userspace tools that relies on it. Especially when tracepoints
> are massively used in many observability tools nowadays due to its stable
> nature, which makes them no longer used for debug only purpose and we
> should be careful to decide what ought to be reported to userspace and what
> ought not.
>
> Therefore, to fix the issues mentioned above for good, instead of choosing
> I proposed to add a new variable to report task state in sched_switch with
> a symbolic character along with the old hardcoded value, and save the
> further processing of userspace tools and spare them from knowing
> implementation details in the kernel.
>
> After this patch seires, we report 'RSDTtXZPI' the same as in procfs, plus
> a 'p' which denotes PREEMP_ACTIVE and is used for sched_switch tracepoint only.
>
> Reviews welcome!

Thanks Ze,

I think this is worthwhile cleanup and makes the code overall simpler.
I don't know if others have strong opinions, I don't often work in
this code, but I think the patches are worth landing this.

Acked-by: Ian Rogers <irogers@google.com>

Thanks,
Ian

> Regards,
>
> Ze
>
> Ze Gao (2):
>   sched, tracing: add to report task state in symbolic chars
>   perf sched: use the new prev_state_char instead in tracepoint
>     sched_switch
>
>  include/trace/events/sched.h | 60 +++++++++++++++++++++---------------
>  tools/perf/builtin-sched.c   | 57 ++++++----------------------------
>  2 files changed, 45 insertions(+), 72 deletions(-)
>
> Ze Gao (1):
>   libtraceevent: use the new prev_state_char instead in tracepoint
>     sched_switch
>
>  plugins/plugin_sched_switch.c | 29 ++++-------------------------
>  1 file changed, 4 insertions(+), 25 deletions(-)
>
> --
> 2.40.1
>
  
Ze Gao July 31, 2023, 2:09 a.m. UTC | #2
Thanks Ian,

In regard to ABI, symbolic chars are much more stable and I think
we can benefit from this in the long run.

Regards,
Ze

On Sat, Jul 29, 2023 at 1:29 AM Ian Rogers <irogers@google.com> wrote:
>
> On Wed, Jul 26, 2023 at 5:16 AM Ze Gao <zegao2021@gmail.com> wrote:
> >
> >
> > This is the 2nd attempt to fix the report task state issue in sched
> > tracepint, here is the first version:
> >
> > https://lore.kernel.org/linux-trace-kernel/20230725072254.32045-1-zegao@tencent.com
> >
> > Against v1, add a new var to report task state in symbolic char instead
> > of replacing the old one and to not to break anything.
> >
> > --
> >
> > In the status quo, we should see three different outcomes of the reported
> > sched-out task state from perf-script, perf-sched-timehist, and Tp_printk
> > of tracepoint sched_switch.  And it's not hard to figure out that the
> > former two are built upon the third one, and the reason why we see this
> > inconsistency is that the former two does not catch up with the internal
> > change of reported task state definitions as the kernel evolves.
> >
> > IMHO, exporting internal representations of task state in the tracepoint
> > sched_switch is not a good practice and not encouraged at all, which can
> > easily break userspace tools that relies on it. Especially when tracepoints
> > are massively used in many observability tools nowadays due to its stable
> > nature, which makes them no longer used for debug only purpose and we
> > should be careful to decide what ought to be reported to userspace and what
> > ought not.
> >
> > Therefore, to fix the issues mentioned above for good, instead of choosing
> > I proposed to add a new variable to report task state in sched_switch with
> > a symbolic character along with the old hardcoded value, and save the
> > further processing of userspace tools and spare them from knowing
> > implementation details in the kernel.
> >
> > After this patch seires, we report 'RSDTtXZPI' the same as in procfs, plus
> > a 'p' which denotes PREEMP_ACTIVE and is used for sched_switch tracepoint only.
> >
> > Reviews welcome!
>
> Thanks Ze,
>
> I think this is worthwhile cleanup and makes the code overall simpler.
> I don't know if others have strong opinions, I don't often work in
> this code, but I think the patches are worth landing this.
>
> Acked-by: Ian Rogers <irogers@google.com>
>
> Thanks,
> Ian
>
> > Regards,
> >
> > Ze
> >
> > Ze Gao (2):
> >   sched, tracing: add to report task state in symbolic chars
> >   perf sched: use the new prev_state_char instead in tracepoint
> >     sched_switch
> >
> >  include/trace/events/sched.h | 60 +++++++++++++++++++++---------------
> >  tools/perf/builtin-sched.c   | 57 ++++++----------------------------
> >  2 files changed, 45 insertions(+), 72 deletions(-)
> >
> > Ze Gao (1):
> >   libtraceevent: use the new prev_state_char instead in tracepoint
> >     sched_switch
> >
> >  plugins/plugin_sched_switch.c | 29 ++++-------------------------
> >  1 file changed, 4 insertions(+), 25 deletions(-)
> >
> > --
> > 2.40.1
> >
  
Masami Hiramatsu (Google) July 31, 2023, 3:52 a.m. UTC | #3
On Wed, 26 Jul 2023 20:16:15 +0800
Ze Gao <zegao2021@gmail.com> wrote:

> 
> This is the 2nd attempt to fix the report task state issue in sched
> tracepint, here is the first version:
> 
> https://lore.kernel.org/linux-trace-kernel/20230725072254.32045-1-zegao@tencent.com
> 
> Against v1, add a new var to report task state in symbolic char instead
> of replacing the old one and to not to break anything.
> 
> --
> 
> In the status quo, we should see three different outcomes of the reported
> sched-out task state from perf-script, perf-sched-timehist, and Tp_printk
> of tracepoint sched_switch.  And it's not hard to figure out that the
> former two are built upon the third one, and the reason why we see this
> inconsistency is that the former two does not catch up with the internal
> change of reported task state definitions as the kernel evolves.
> 
> IMHO, exporting internal representations of task state in the tracepoint
> sched_switch is not a good practice and not encouraged at all, which can
> easily break userspace tools that relies on it. Especially when tracepoints
> are massively used in many observability tools nowadays due to its stable
> nature, which makes them no longer used for debug only purpose and we
> should be careful to decide what ought to be reported to userspace and what
> ought not.
> 
> Therefore, to fix the issues mentioned above for good, instead of choosing
> I proposed to add a new variable to report task state in sched_switch with
> a symbolic character along with the old hardcoded value, and save the
> further processing of userspace tools and spare them from knowing
> implementation details in the kernel.
> 
> After this patch seires, we report 'RSDTtXZPI' the same as in procfs, plus
> a 'p' which denotes PREEMP_ACTIVE and is used for sched_switch tracepoint only.

This series looks good to me. Putting the flag in the trace record is
a good idea :)

Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

for this series.

Thank you,

> 
> Reviews welcome!
> 
> Regards,
> 
> Ze
> 
> Ze Gao (2):
>   sched, tracing: add to report task state in symbolic chars
>   perf sched: use the new prev_state_char instead in tracepoint
>     sched_switch
> 
>  include/trace/events/sched.h | 60 +++++++++++++++++++++---------------
>  tools/perf/builtin-sched.c   | 57 ++++++----------------------------
>  2 files changed, 45 insertions(+), 72 deletions(-)
> 
> Ze Gao (1):
>   libtraceevent: use the new prev_state_char instead in tracepoint
>     sched_switch
> 
>  plugins/plugin_sched_switch.c | 29 ++++-------------------------
>  1 file changed, 4 insertions(+), 25 deletions(-)
> 
> -- 
> 2.40.1
>
  
Ze Gao July 31, 2023, 6:19 a.m. UTC | #4
Hi Masami,

Thanks for your review!

Sincerely,
Ze

On Mon, Jul 31, 2023 at 11:53 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Wed, 26 Jul 2023 20:16:15 +0800
> Ze Gao <zegao2021@gmail.com> wrote:
>
> >
> > This is the 2nd attempt to fix the report task state issue in sched
> > tracepint, here is the first version:
> >
> > https://lore.kernel.org/linux-trace-kernel/20230725072254.32045-1-zegao@tencent.com
> >
> > Against v1, add a new var to report task state in symbolic char instead
> > of replacing the old one and to not to break anything.
> >
> > --
> >
> > In the status quo, we should see three different outcomes of the reported
> > sched-out task state from perf-script, perf-sched-timehist, and Tp_printk
> > of tracepoint sched_switch.  And it's not hard to figure out that the
> > former two are built upon the third one, and the reason why we see this
> > inconsistency is that the former two does not catch up with the internal
> > change of reported task state definitions as the kernel evolves.
> >
> > IMHO, exporting internal representations of task state in the tracepoint
> > sched_switch is not a good practice and not encouraged at all, which can
> > easily break userspace tools that relies on it. Especially when tracepoints
> > are massively used in many observability tools nowadays due to its stable
> > nature, which makes them no longer used for debug only purpose and we
> > should be careful to decide what ought to be reported to userspace and what
> > ought not.
> >
> > Therefore, to fix the issues mentioned above for good, instead of choosing
> > I proposed to add a new variable to report task state in sched_switch with
> > a symbolic character along with the old hardcoded value, and save the
> > further processing of userspace tools and spare them from knowing
> > implementation details in the kernel.
> >
> > After this patch seires, we report 'RSDTtXZPI' the same as in procfs, plus
> > a 'p' which denotes PREEMP_ACTIVE and is used for sched_switch tracepoint only.
>
> This series looks good to me. Putting the flag in the trace record is
> a good idea :)
>
> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
>
> for this series.
>
> Thank you,
>
> >
> > Reviews welcome!
> >
> > Regards,
> >
> > Ze
> >
> > Ze Gao (2):
> >   sched, tracing: add to report task state in symbolic chars
> >   perf sched: use the new prev_state_char instead in tracepoint
> >     sched_switch
> >
> >  include/trace/events/sched.h | 60 +++++++++++++++++++++---------------
> >  tools/perf/builtin-sched.c   | 57 ++++++----------------------------
> >  2 files changed, 45 insertions(+), 72 deletions(-)
> >
> > Ze Gao (1):
> >   libtraceevent: use the new prev_state_char instead in tracepoint
> >     sched_switch
> >
> >  plugins/plugin_sched_switch.c | 29 ++++-------------------------
> >  1 file changed, 4 insertions(+), 25 deletions(-)
> >
> > --
> > 2.40.1
> >
>
>
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>