Message ID | 20230801090124.8050-4-zegao@tencent.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:918b:0:b0:3e4:2afc:c1 with SMTP id s11csp2545974vqg; Tue, 1 Aug 2023 02:25:28 -0700 (PDT) X-Google-Smtp-Source: APBJJlFi2eAFpPiGsMb/u/oSzUurZaRH87QJXq6oDJyjtc27MeJ/6ba3A2nCP13quJefBrZHjHI9 X-Received: by 2002:a05:6402:1299:b0:522:2af1:1ffe with SMTP id w25-20020a056402129900b005222af11ffemr1761934edv.9.1690881928294; Tue, 01 Aug 2023 02:25:28 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1690881928; cv=none; d=google.com; s=arc-20160816; b=NsEfs5d+u8/cdggXt6NiW9Ert1ip5aWuOzbOZyKfVjrV6ZKAjevDToxiMcqnr7TvNV EaqFW4YFDrHZlih4N0jkbGqoiSZ5TYzcZHbiNNzQePfvbTEo16jzzHfbS8IIL8WGRpC/ 360/Xscy4ztiQzPlrV/2o/+VfoEOSa8OWX+xmv+gsYnoNsrWoPpPcRMqIQ9pPKPIhY8Q 5U6AsB/XmB4GftY1mQ/iuy2t36NDWF2+nbUCETgGsQg1voPeNC50QMvlUDVjat0amDcg JBdSVJ4BEbNl1pVGJHMPRJsNu7zcyMR6jyV6/93WAThTrnICDvQ+hfCcV8es2sIoB1Ii Ny8Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=PBxCbacsLX6m7o32e5QWwMlB/3bI1PwM+xCI9ISuc6o=; fh=4S4dukURCUlSsjzSa0Ofy3qHSrrdu/uT1HnGBPj5BK4=; b=gViXXG5PvkybomDu6nVHqc/z3CkgWRGuhX70p01HHvNV3F5YBc6UGu1h1rTZoOcGOk 1dxvGZNgkbgHEvhM+QJdJfuGBRhFtqN27aMhR69uW2gclMqu698qOnVEBz+7UN4z6Wdv fwTHJjGDsO91MvlN/irzfQV+wyELhbIx6kJms7NxdPuFGisDcNnOP+8piB/7I3oWsCv3 TAKcNk4qMyxtG9K5vzvMjybBY8pGPJkPuLKhb8i2osVLoECDKkgTWRnmBBO/yqUNFikt JGzmC+vGeAMNOu2BJg6eDI5cYgfq8yN2O9AfRirRff0m1nec5NOAgxtrRo8XSSPTR4Yc 4ukA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20221208 header.b=eZn7HUGd; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id n17-20020a056402061100b0052229b7a96asi7839209edv.684.2023.08.01.02.25.04; Tue, 01 Aug 2023 02:25:28 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20221208 header.b=eZn7HUGd; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232272AbjHAJB6 (ORCPT <rfc822;maxi.paulin@gmail.com> + 99 others); Tue, 1 Aug 2023 05:01:58 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59728 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230368AbjHAJBt (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Tue, 1 Aug 2023 05:01:49 -0400 Received: from mail-pl1-x634.google.com (mail-pl1-x634.google.com [IPv6:2607:f8b0:4864:20::634]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1425E1996; Tue, 1 Aug 2023 02:01:45 -0700 (PDT) Received: by mail-pl1-x634.google.com with SMTP id d9443c01a7336-1bbdc05a93bso33066885ad.0; Tue, 01 Aug 2023 02:01:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1690880504; x=1691485304; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=PBxCbacsLX6m7o32e5QWwMlB/3bI1PwM+xCI9ISuc6o=; b=eZn7HUGdRY0o+oSO7rF5Pch85mgNf4Y/292AsG/gybfExewTsCGLCO3KLIeVP3cDn3 wkKZ1f3YHvWHzpdzWJXNmRqahVWImESo2wNDUDvjUVmHYvtarL2Y6b9qB5KRzOcwVZvc mpiLj/BFpjrOBNA41rHxf8YSE7ACPQzxyIPu+g5s4/jK+vX1cp1p3ZwhhUg1VfFOnPgE UzhHsqB3a1WYWpJOfq4o3HPa1tL3tmpuLHO5jkWZN+nrEOizwlk6MNVydcHPGi4ZgYAl xootA5r0isHTNLsL5E9/XNIC93uG6VArfY/ZA1/v2RCpcvFO2NWu7XCXHItMEfvWQSm7 eobg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1690880504; x=1691485304; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=PBxCbacsLX6m7o32e5QWwMlB/3bI1PwM+xCI9ISuc6o=; b=lXgYiRCrTKJieo5FvvUAhba5byX7V7wHE2IrBtkVci9nKoopYPqwt/ZYetipXweL+P 85jC690noIPOLrzjhFecxSCC4XelByuLSftdTmhoElsp68+Rt8LWhrbs1Yyw0Da19OLw 9BQRJO2+aiKFxh6pcJV9kf74A4bidCzTKtmNPzQJ4pqRdMOxWSn+j7Hb1A6M5D+3XBC8 /r1ht36ZdxanxzPn9n/+RK9r/O2B/+WMiEUvy0P1zv5cpJwcDPsKpewZTHJJXUuKSyWj cPZYoapQiL1/30Xl2KXi+1taaNB7h4MX/m3F5LXONlE+8U6BbMbPKOP3IolO5/sd/6jE S9LA== X-Gm-Message-State: ABy/qLaaT0dIIrutKmifz5yCjjxXUDgr/1GkZfGpbmzPaYc3WCxVb5p2 tySPXAJyB0cyPaDOm1BJMTQ= X-Received: by 2002:a17:90b:4b4c:b0:268:31f3:79d6 with SMTP id mi12-20020a17090b4b4c00b0026831f379d6mr9635389pjb.36.1690880504352; Tue, 01 Aug 2023 02:01:44 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.20]) by smtp.googlemail.com with ESMTPSA id x34-20020a17090a6c2500b00264044cca0fsm1592523pjj.1.2023.08.01.02.01.40 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 01 Aug 2023 02:01:44 -0700 (PDT) From: Ze Gao <zegao2021@gmail.com> X-Google-Original-From: Ze Gao <zegao@tencent.com> To: Adrian Hunter <adrian.hunter@intel.com>, Alexander Shishkin <alexander.shishkin@linux.intel.com>, Arnaldo Carvalho de Melo <acme@kernel.org>, Ian Rogers <irogers@google.com>, Ingo Molnar <mingo@redhat.com>, Jiri Olsa <jolsa@kernel.org>, Mark Rutland <mark.rutland@arm.com>, Masami Hiramatsu <mhiramat@kernel.org>, Namhyung Kim <namhyung@kernel.org>, Peter Zijlstra <peterz@infradead.org>, Steven Rostedt <rostedt@goodmis.org> Cc: linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-trace-devel@vger.kernel.org, Ze Gao <zegao@tencent.com> Subject: [RFC PATCH v3 3/6] sched, tracing: add to report task state in symbolic chars Date: Tue, 1 Aug 2023 17:01:21 +0800 Message-Id: <20230801090124.8050-4-zegao@tencent.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230801090124.8050-1-zegao@tencent.com> References: <20230801090124.8050-1-zegao@tencent.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_ENVFROM_END_DIGIT, FREEMAIL_FROM,RCVD_IN_DNSWL_BLOCKED,SPF_HELO_NONE,SPF_PASS, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1773018209054004586 X-GMAIL-MSGID: 1773018209054004586 |
Series |
add to report task state in symbolic chars from sched tracepoint
|
|
Commit Message
Ze Gao
Aug. 1, 2023, 9:01 a.m. UTC
Internal representations of task state are likely to be changed or ordered, and reporting them to userspace without exporting them as part of API is basically wrong, which can easily break a userspace observability tool as kernel evolves. For example, perf suffers from this and still reports wrong states as of this writing. OTOH, some masqueraded states like TASK_REPORT_IDLE and TASK_REPORT_MAX are also reported inadvertently, which confuses things even more and most userspace tools do not even take them into consideration. So add a new variable in company with the old raw value to report task state in symbolic chars, which are self-explaining and no further translation is needed. Of course this does not break any userspace tool. Note for PREEMPT_ACTIVE, we introduce 'p' to report it and use the old conventions for the rest. Signed-off-by: Ze Gao <zegao@tencent.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Ian Rogers <irogers@google.com> --- include/trace/events/sched.h | 54 +++++++++++++++++++++--------------- 1 file changed, 32 insertions(+), 22 deletions(-)
Comments
On Tue, Aug 01, 2023 at 05:01:21PM +0800, Ze Gao wrote: > Internal representations of task state are likely to be changed > or ordered, and reporting them to userspace without exporting > them as part of API is basically wrong, which can easily break > a userspace observability tool as kernel evolves. For example, > perf suffers from this and still reports wrong states as of this > writing. > > OTOH, some masqueraded states like TASK_REPORT_IDLE and > TASK_REPORT_MAX are also reported inadvertently, which confuses > things even more and most userspace tools do not even take them > into consideration. > > So add a new variable in company with the old raw value to > report task state in symbolic chars, which are self-explaining > and no further translation is needed. Of course this does not > break any userspace tool. > > Note for PREEMPT_ACTIVE, we introduce 'p' to report it and use > the old conventions for the rest. *sigh*... just because userspace if daft, we need to change the kernel? Why do we need this character anyway, why not just print the state in hex and leave it at that? These single character state things are a relic, please just let them die.
On Tue, Aug 01, 2023 at 05:01:21PM +0800, Ze Gao wrote: > @@ -233,6 +255,7 @@ TRACE_EVENT(sched_switch, > __field( pid_t, prev_pid ) > __field( int, prev_prio ) > __field( long, prev_state ) > + __field( char, prev_state_char ) > __array( char, next_comm, TASK_COMM_LEN ) > __field( pid_t, next_pid ) > __field( int, next_prio ) And this again will wreck everybody that consumes the raw tracepoint without looking at tracefs.
On Tue, Aug 1, 2023 at 7:34 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Tue, Aug 01, 2023 at 05:01:21PM +0800, Ze Gao wrote: > > Internal representations of task state are likely to be changed > > or ordered, and reporting them to userspace without exporting > > them as part of API is basically wrong, which can easily break > > a userspace observability tool as kernel evolves. For example, > > perf suffers from this and still reports wrong states as of this > > writing. > > > > OTOH, some masqueraded states like TASK_REPORT_IDLE and > > TASK_REPORT_MAX are also reported inadvertently, which confuses > > things even more and most userspace tools do not even take them > > into consideration. > > > > So add a new variable in company with the old raw value to > > report task state in symbolic chars, which are self-explaining > > and no further translation is needed. Of course this does not > > break any userspace tool. > > > > Note for PREEMPT_ACTIVE, we introduce 'p' to report it and use > > the old conventions for the rest. > > *sigh*... just because userspace if daft, we need to change the kernel? Hi Peter, Sorry that I don't quite agree with you on this one. It's just the design that exporting internal details is fundamentally wrong. And even worse, I did not see any userspace tool is aware of masqueraded states like TASK_REPORT_IDLE and TASK_REPORT_MAX and let alone parse it correctly. This confused me a lot when I decided to write my own bpf version of sched-latency analysis tool and only after I figured out everything underneath, I started to make things right here. Again, I mean it's not me that deliberately "breaks" ABI here and I am never meant to upset anyone. My confusion is why did people forget to update in-tree perf the very last time they decide to rearrange the task state mapping since we all agree this is important "ABI" here. I don't think it's the tool's fault. And that's my initiative to request this RFC. > Why do we need this character anyway, why not just print the state in > hex and leave it at that? These single character state things are a > relic, please just let them die. I believe hex is ok only after having the reported task state mapping appear in the uapi headers, otherwise it's still useless to userspace especially for value like TASK_REPORT_IDLE and TASK_REPORT_MAX, which need to dig into the kernel to see what the hell is going on here. Thoughts? Regards, Ze
Sorry that I don't get this one, did you mean kernel subsystems like bpf or third party modules? Honestly I don't know how it works here for userspace to consume the raw tracepoint without looking at tracefs. Regards, Ze On Tue, Aug 1, 2023 at 7:46 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Tue, Aug 01, 2023 at 05:01:21PM +0800, Ze Gao wrote: > > @@ -233,6 +255,7 @@ TRACE_EVENT(sched_switch, > > __field( pid_t, prev_pid ) > > __field( int, prev_prio ) > > __field( long, prev_state ) > > + __field( char, prev_state_char ) > > __array( char, next_comm, TASK_COMM_LEN ) > > __field( pid_t, next_pid ) > > __field( int, next_prio ) > > And this again will wreck everybody that consumes the raw tracepoint > without looking at tracefs.
On Tue, Aug 01, 2023 at 09:03:51PM +0800, Ze Gao wrote: > It's just the design that exporting internal details is fundamentally wrong. This is tracing... it wasn't supposed to be ABI (although it somehow ended up being one). But even then, things like PF_foo get exposed in procfs but even that we change. The whole point of tracing is to see internals in order to figure out wth is going wrong. > And even worse, I did not see any userspace tool is aware of masqueraded > states like TASK_REPORT_IDLE and TASK_REPORT_MAX and let alone > parse it correctly. That's probably because I never use tools, I just look at the raw trace output -- sometimes an impromptu awk script. I'm pretty sure I ran with something like the below when I did the freezer rewrite -- or perhaps I just stuck in trace_printk() and didn't even bother with the tracepoints, I can't remember. > > Why do we need this character anyway, why not just print the state in > > hex and leave it at that? These single character state things are a > > relic, please just let them die. > > I believe hex is ok only after having the reported task state mapping > appear in the uapi headers, otherwise it's still useless to userspace > especially for value like TASK_REPORT_IDLE and TASK_REPORT_MAX, which > need to dig into the kernel to see what the hell is going on here. > > Thoughts? If you're tracing the kernel, you had better know what the kernel is doing, otherwise you get to keep the pieces. Anyway, if you're doing BPF then why do you care about the trace event at all, just attach to the raw tracepoint and consume @preemt, @prev, @next and @prev_state. --- diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index fbb99a61f714..cb0c1251729e 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -186,36 +186,6 @@ DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new, TP_PROTO(struct task_struct *p), TP_ARGS(p)); -#ifdef CREATE_TRACE_POINTS -static inline long __trace_sched_switch_state(bool preempt, - unsigned int prev_state, - struct task_struct *p) -{ - unsigned int state; - -#ifdef CONFIG_SCHED_DEBUG - BUG_ON(p != current); -#endif /* CONFIG_SCHED_DEBUG */ - - /* - * Preemption ignores task state, therefore preempted tasks are always - * RUNNING (we will not have dequeued if state != RUNNING). - */ - if (preempt) - return TASK_REPORT_MAX; - - /* - * task_state_index() uses fls() and returns a value from 0-8 range. - * Decrement it by 1 (except TASK_RUNNING state i.e 0) before using - * it for left shift operation to get the correct task->state - * mapping. - */ - state = __task_state_index(prev_state, p->exit_state); - - return state ? (1 << (state - 1)) : state; -} -#endif /* CREATE_TRACE_POINTS */ - /* * Tracepoint for task switches, performed by the scheduler: */ @@ -242,29 +212,16 @@ TRACE_EVENT(sched_switch, memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN); __entry->prev_pid = prev->pid; __entry->prev_prio = prev->prio; - __entry->prev_state = __trace_sched_switch_state(preempt, prev_state, prev); + __entry->prev_state = prev_state | (preempt * TASK_STATE_MAX); memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN); __entry->next_pid = next->pid; __entry->next_prio = next->prio; /* XXX SCHED_DEADLINE */ ), - TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d", + TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%x ==> next_comm=%s next_pid=%d next_prio=%d", __entry->prev_comm, __entry->prev_pid, __entry->prev_prio, - - (__entry->prev_state & (TASK_REPORT_MAX - 1)) ? - __print_flags(__entry->prev_state & (TASK_REPORT_MAX - 1), "|", - { TASK_INTERRUPTIBLE, "S" }, - { TASK_UNINTERRUPTIBLE, "D" }, - { __TASK_STOPPED, "T" }, - { __TASK_TRACED, "t" }, - { EXIT_DEAD, "X" }, - { EXIT_ZOMBIE, "Z" }, - { TASK_PARKED, "P" }, - { TASK_DEAD, "I" }) : - "R", - - __entry->prev_state & TASK_REPORT_MAX ? "+" : "", + __entry->prev_state, __entry->next_comm, __entry->next_pid, __entry->next_prio) );
On Tue, 1 Aug 2023 13:45:45 +0200 Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, Aug 01, 2023 at 05:01:21PM +0800, Ze Gao wrote: > > @@ -233,6 +255,7 @@ TRACE_EVENT(sched_switch, > > __field( pid_t, prev_pid ) > > __field( int, prev_prio ) > > __field( long, prev_state ) > > + __field( char, prev_state_char ) > > __array( char, next_comm, TASK_COMM_LEN ) > > __field( pid_t, next_pid ) > > __field( int, next_prio ) > > And this again will wreck everybody that consumes the raw tracepoint > without looking at tracefs. Nobody does that anymore, as the events change constantly, and are different on different kernels. Powertop (the tool that caused us pain before by using raw values) had to break down and use libtraceevent, because it would break if there was a 32 bit version running on a 64 bit kernel. I've changed the offsets of raw events a few times and nobody has complained since. -- Steve
On Tue, Aug 1, 2023 at 9:42 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Tue, Aug 01, 2023 at 09:03:51PM +0800, Ze Gao wrote: > > > It's just the design that exporting internal details is fundamentally wrong. > > This is tracing... it wasn't supposed to be ABI (although it somehow Sorry, I'm confused. And it sounds contradicting here because you said this change is abi break before. > ended up being one). But even then, things like PF_foo get exposed in > procfs but even that we change. > > The whole point of tracing is to see internals in order to figure out > wth is going wrong. Fair point, but I think tracepoints are somewhat different from kprobes/raw_tracepoints due to their stable nature. And I think at least more and more observability tools use them in this way. With all due respect, If it was for kernel developers only, what's the point of leaking this since now we have raw tracepoints support? Does that mean all tracepoints are useless now? Apparently the answer is no. So I'm not convinced by this "for internal inspecting" defense to not ignore what the problem is. Honestly, I would've never thought to change this if I I got the correct meaning for values I captured like 0x80/0x100 when I tried to look it up in include/linux/sched.h the first time. But it really annoyed me to figure out what it is only after I dived into the kernel and collected all the pieces. And you know what, when I turned to the famous in-tree perf for possible help, only to find out it's horribly broken and still uses an outdated state char array to interpret this weird raw value. Anyway, we have to accept the fact that prev_state leaves a huge burden on its users to make things right. And I'm open and glad to see any solutions (possibly better than this one) or efforts or suggestions to improve this. Regards, Ze > > And even worse, I did not see any userspace tool is aware of masqueraded > > states like TASK_REPORT_IDLE and TASK_REPORT_MAX and let alone > > parse it correctly. > > That's probably because I never use tools, I just look at the raw trace > output -- sometimes an impromptu awk script. I'm pretty sure I ran with > something like the below when I did the freezer rewrite -- or perhaps I > just stuck in trace_printk() and didn't even bother with the > tracepoints, I can't remember. > > > > Why do we need this character anyway, why not just print the state in > > > hex and leave it at that? These single character state things are a > > > relic, please just let them die. > > > > I believe hex is ok only after having the reported task state mapping > > appear in the uapi headers, otherwise it's still useless to userspace > > especially for value like TASK_REPORT_IDLE and TASK_REPORT_MAX, which > > need to dig into the kernel to see what the hell is going on here. > > > > Thoughts? > > If you're tracing the kernel, you had better know what the kernel is > doing, otherwise you get to keep the pieces. > > Anyway, if you're doing BPF then why do you care about the trace event > at all, just attach to the raw tracepoint and consume @preemt, @prev, > @next and @prev_state. > > --- > diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h > index fbb99a61f714..cb0c1251729e 100644 > --- a/include/trace/events/sched.h > +++ b/include/trace/events/sched.h > @@ -186,36 +186,6 @@ DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new, > TP_PROTO(struct task_struct *p), > TP_ARGS(p)); > > -#ifdef CREATE_TRACE_POINTS > -static inline long __trace_sched_switch_state(bool preempt, > - unsigned int prev_state, > - struct task_struct *p) > -{ > - unsigned int state; > - > -#ifdef CONFIG_SCHED_DEBUG > - BUG_ON(p != current); > -#endif /* CONFIG_SCHED_DEBUG */ > - > - /* > - * Preemption ignores task state, therefore preempted tasks are always > - * RUNNING (we will not have dequeued if state != RUNNING). > - */ > - if (preempt) > - return TASK_REPORT_MAX; > - > - /* > - * task_state_index() uses fls() and returns a value from 0-8 range. > - * Decrement it by 1 (except TASK_RUNNING state i.e 0) before using > - * it for left shift operation to get the correct task->state > - * mapping. > - */ > - state = __task_state_index(prev_state, p->exit_state); > - > - return state ? (1 << (state - 1)) : state; > -} > -#endif /* CREATE_TRACE_POINTS */ > - > /* > * Tracepoint for task switches, performed by the scheduler: > */ > @@ -242,29 +212,16 @@ TRACE_EVENT(sched_switch, > memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN); > __entry->prev_pid = prev->pid; > __entry->prev_prio = prev->prio; > - __entry->prev_state = __trace_sched_switch_state(preempt, prev_state, prev); > + __entry->prev_state = prev_state | (preempt * TASK_STATE_MAX); > memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN); > __entry->next_pid = next->pid; > __entry->next_prio = next->prio; > /* XXX SCHED_DEADLINE */ > ), > > - TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d", > + TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%x ==> next_comm=%s next_pid=%d next_prio=%d", > __entry->prev_comm, __entry->prev_pid, __entry->prev_prio, > - > - (__entry->prev_state & (TASK_REPORT_MAX - 1)) ? > - __print_flags(__entry->prev_state & (TASK_REPORT_MAX - 1), "|", > - { TASK_INTERRUPTIBLE, "S" }, > - { TASK_UNINTERRUPTIBLE, "D" }, > - { __TASK_STOPPED, "T" }, > - { __TASK_TRACED, "t" }, > - { EXIT_DEAD, "X" }, > - { EXIT_ZOMBIE, "Z" }, > - { TASK_PARKED, "P" }, > - { TASK_DEAD, "I" }) : > - "R", > - > - __entry->prev_state & TASK_REPORT_MAX ? "+" : "", > + __entry->prev_state, > __entry->next_comm, __entry->next_pid, __entry->next_prio) > ); >
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index fbb99a61f714..e507901bcab8 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -6,6 +6,7 @@ #define _TRACE_SCHED_H #include <linux/kthread.h> +#include <linux/sched.h> #include <linux/sched/numa_balancing.h> #include <linux/tracepoint.h> #include <linux/binfmts.h> @@ -214,6 +215,27 @@ static inline long __trace_sched_switch_state(bool preempt, return state ? (1 << (state - 1)) : state; } + +static inline char __trace_sched_switch_state_char(bool preempt, + unsigned int prev_state, + struct task_struct *p) +{ + long state; + +#ifdef CONFIG_SCHED_DEBUG + BUG_ON(p != current); +#endif /* CONFIG_SCHED_DEBUG */ + + /* + * For PREEMPT_ACTIVE, we introduce 'p' to report it and use the old + * conventions for the rest. + */ + if (preempt) + return 'p'; + + state = __task_state_index(prev_state, p->exit_state); + return task_index_to_char(state); +} #endif /* CREATE_TRACE_POINTS */ /* @@ -233,6 +255,7 @@ TRACE_EVENT(sched_switch, __field( pid_t, prev_pid ) __field( int, prev_prio ) __field( long, prev_state ) + __field( char, prev_state_char ) __array( char, next_comm, TASK_COMM_LEN ) __field( pid_t, next_pid ) __field( int, next_prio ) @@ -240,32 +263,19 @@ TRACE_EVENT(sched_switch, TP_fast_assign( memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN); - __entry->prev_pid = prev->pid; - __entry->prev_prio = prev->prio; - __entry->prev_state = __trace_sched_switch_state(preempt, prev_state, prev); + __entry->prev_pid = prev->pid; + __entry->prev_prio = prev->prio; + __entry->prev_state = __trace_sched_switch_state(preempt, prev_state, prev); + __entry->prev_state_char = __trace_sched_switch_state_char(preempt, prev_state, prev); memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN); - __entry->next_pid = next->pid; - __entry->next_prio = next->prio; + __entry->next_pid = next->pid; + __entry->next_prio = next->prio; /* XXX SCHED_DEADLINE */ ), - TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d", - __entry->prev_comm, __entry->prev_pid, __entry->prev_prio, - - (__entry->prev_state & (TASK_REPORT_MAX - 1)) ? - __print_flags(__entry->prev_state & (TASK_REPORT_MAX - 1), "|", - { TASK_INTERRUPTIBLE, "S" }, - { TASK_UNINTERRUPTIBLE, "D" }, - { __TASK_STOPPED, "T" }, - { __TASK_TRACED, "t" }, - { EXIT_DEAD, "X" }, - { EXIT_ZOMBIE, "Z" }, - { TASK_PARKED, "P" }, - { TASK_DEAD, "I" }) : - "R", - - __entry->prev_state & TASK_REPORT_MAX ? "+" : "", - __entry->next_comm, __entry->next_pid, __entry->next_prio) + TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%c ==> next_comm=%s next_pid=%d next_prio=%d", + __entry->prev_comm, __entry->prev_pid, __entry->prev_prio, __entry->prev_state_char, __entry->next_comm, + __entry->next_pid, __entry->next_prio) ); /*