[v3,0/7] perf cs_etm: Basic support for virtual/kernel timestamps

Message ID 20230103162042.423694-1-james.clark@arm.com
Headers
Series perf cs_etm: Basic support for virtual/kernel timestamps |

Message

James Clark Jan. 3, 2023, 4:20 p.m. UTC
  Changes since v2:

  * Remove const to non-const change and copy strings where needed
    instead.
  * Use sizeof() instead of PATH_MAX
  * Append "will not be set accurately." to new error message
  * Remove unneeded stat() call
  * Rebase on perf/core
  
==========================

Changes since v1:

  * Add 3 refactor commits for sysfs reading around pmu.c as suggested
    by Arnaldo here [1]
  * The dependency on [2] has now reached mainline so is no longer
    blocking
  * Rebase on perf/core
  
[1]: https://lore.kernel.org/all/YnqVqq5QW%2Fb14oPZ@kernel.org/
[2]: https://lore.kernel.org/all/20220503123537.1003035-1-german.gomez@arm.com/

German Gomez (4):
  perf pmu: Add function to check if a pmu file exists
  perf cs_etm: Keep separate symbols for ETMv4 and ETE parameters
  perf cs_etm: Record ts_source in AUXTRACE_INFO for ETMv4 and ETE
  perf cs_etm: Set the time field in the synthetic samples

James Clark (3):
  perf: Remove duplication around EVENT_SOURCE_DEVICE_PATH
  perf: Use perf_pmu__open_file() and perf_pmu__scan_file()
  perf: Remove remaining duplication of bus/event_source/devices/...

 tools/perf/arch/arm/util/auxtrace.c |   5 +-
 tools/perf/arch/arm/util/cs-etm.c   |  91 ++++++++++++--
 tools/perf/arch/x86/util/pmu.c      |  12 +-
 tools/perf/util/cputopo.c           |   9 +-
 tools/perf/util/cs-etm-base.c       |  34 ++++--
 tools/perf/util/cs-etm.c            |  86 ++++++++++++--
 tools/perf/util/cs-etm.h            |  13 +-
 tools/perf/util/pmu-hybrid.c        |  27 +----
 tools/perf/util/pmu.c               | 177 +++++++++++-----------------
 tools/perf/util/pmu.h               |  10 +-
 10 files changed, 284 insertions(+), 180 deletions(-)


base-commit: 09e6f9f98370be9a9f8978139e0eb1be87d1125f
  

Comments

Tanmay Jagdale Jan. 10, 2023, 4:42 p.m. UTC | #1
Hi James,

> Changes since v2:
> 
>   * Remove const to non-const change and copy strings where needed
>     instead.
>   * Use sizeof() instead of PATH_MAX
>   * Append "will not be set accurately." to new error message
>   * Remove unneeded stat() call
>   * Rebase on perf/core
> 
> ==========================
> 
> Changes since v1:
> 
>   * Add 3 refactor commits for sysfs reading around pmu.c as suggested
>     by Arnaldo here [1]
>   * The dependency on [2] has now reached mainline so is no longer
>     blocking
>   * Rebase on perf/core
> 
> [1]: https://lore.kernel.org/all/YnqVqq5QW%2Fb14oPZ@kernel.org/
> [2]: https://lore.kernel.org/all/20220503123537.1003035-1-german.gomez@arm.com/
> 
> German Gomez (4):
>   perf pmu: Add function to check if a pmu file exists
>   perf cs_etm: Keep separate symbols for ETMv4 and ETE parameters
>   perf cs_etm: Record ts_source in AUXTRACE_INFO for ETMv4 and ETE
>   perf cs_etm: Set the time field in the synthetic samples
> 
> James Clark (3):
>   perf: Remove duplication around EVENT_SOURCE_DEVICE_PATH
>   perf: Use perf_pmu__open_file() and perf_pmu__scan_file()
>   perf: Remove remaining duplication of bus/event_source/devices/...
> 
>  tools/perf/arch/arm/util/auxtrace.c |   5 +-
>  tools/perf/arch/arm/util/cs-etm.c   |  91 ++++++++++++--
>  tools/perf/arch/x86/util/pmu.c      |  12 +-
>  tools/perf/util/cputopo.c           |   9 +-
>  tools/perf/util/cs-etm-base.c       |  34 ++++--
>  tools/perf/util/cs-etm.c            |  86 ++++++++++++--
>  tools/perf/util/cs-etm.h            |  13 +-
>  tools/perf/util/pmu-hybrid.c        |  27 +----
>  tools/perf/util/pmu.c               | 177 +++++++++++-----------------
>  tools/perf/util/pmu.h               |  10 +-
>  10 files changed, 284 insertions(+), 180 deletions(-)
> 
> 
> base-commit: 09e6f9f98370be9a9f8978139e0eb1be87d1125f
I have tested this patch set on our platform and able to see updated
timestamp values in perf script's output.

$ perf record -e cs_etm/cycacc,@tmc_etr0/k -C 9 taskset -c 9 sleep 2
$ perf script --itrace=i1ns --ns -Fcomm,tid,pid,time,cpu,event,ip,sym,addr,symoff,flags,callindent

At certain points noticed that hardware emits same timestamp packets
but with updated cycle count (CC) values. A small snippet of the log:

Idx:100; ID:12; I_ADDR_S_IS0 : Address, Short, IS0.; Addr=0xFFFF8000086A761C ~[0x761C]
Idx:103; ID:12; I_TIMESTAMP : Timestamp.; Updated val = 0x2f373e37e02; CC=0x3d
Idx:107; ID:12; I_ATOM_F2 : Atom format 2.; EN
Idx:108; ID:12; I_TIMESTAMP : Timestamp.; Updated val = 0x2f373e37e02; CC=0x3f
Idx:112; ID:12; I_ATOM_F1 : Atom format 1.; N
Idx:113; ID:12; I_TIMESTAMP : Timestamp.; Updated val = 0x2f373e37e02; CC=0x45
Idx:116; ID:12; I_ATOM_F1 : Atom format 1.; E
Idx:117; ID:12; I_ADDR_S_IS0 : Address, Short, IS0.; Addr=0xFFFF8000086B52D4 ~[0x152D4]

Since the source of timestamp is the Generic Timer block and the CPUs
run at higher frequencies, this behaviour could be possible on high
performance ARM cores.

Having consecutive timestamps with same value is resulting in a
slightly jumbled order (in nanosecs) in perf script's time column.
A snippet corresponding to the Coresight trace data mentioned above:
...
perf   965/965   [001]  3182.286629044:            instructions:k:   return                               0 ffff8000086a761c coresight_timeout+0xc8
perf   965/965   [001]  3182.286629044:            instructions:k:   return                               0 ffff8000086a7620 coresight_timeout+0xcc
perf   965/965   [001]  3182.286629046:            instructions:k:   jmp                                  0 ffff8000086a75c8 coresight_timeout+0x74
perf   965/965   [001]  3182.286629046:            instructions:k:   jmp                                  0 ffff8000086a75cc coresight_timeout+0x78
perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75d0 coresight_timeout+0x7c
perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75d4 coresight_timeout+0x80
perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75d8 coresight_timeout+0x84
perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75dc coresight_timeout+0x88
perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75e0 coresight_timeout+0x8c
perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75e4 coresight_timeout+0x90
perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75e8 coresight_timeout+0x94
perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75ec coresight_timeout+0x98

Perf's do_soft_timestamp() logic in cs_etm_decoder.c file is incrementing
the HW timestamp based on instruction count. Since the next timestamp
also has the same value, it could be leading to this jumbled order.

We would like to know if this has been observed on other platforms ?
And what could be a solution in SW for this ?

With Regards,
Tanmay
  
James Clark Jan. 11, 2023, 10:56 a.m. UTC | #2
On 10/01/2023 16:42, Tanmay Jagdale wrote:
[...]
>>
>> base-commit: 09e6f9f98370be9a9f8978139e0eb1be87d1125f
> I have tested this patch set on our platform and able to see updated
> timestamp values in perf script's output.
> 
> $ perf record -e cs_etm/cycacc,@tmc_etr0/k -C 9 taskset -c 9 sleep 2
> $ perf script --itrace=i1ns --ns -Fcomm,tid,pid,time,cpu,event,ip,sym,addr,symoff,flags,callindent
> 
> At certain points noticed that hardware emits same timestamp packets
> but with updated cycle count (CC) values. A small snippet of the log:
> 
> Idx:100; ID:12; I_ADDR_S_IS0 : Address, Short, IS0.; Addr=0xFFFF8000086A761C ~[0x761C]
> Idx:103; ID:12; I_TIMESTAMP : Timestamp.; Updated val = 0x2f373e37e02; CC=0x3d
> Idx:107; ID:12; I_ATOM_F2 : Atom format 2.; EN
> Idx:108; ID:12; I_TIMESTAMP : Timestamp.; Updated val = 0x2f373e37e02; CC=0x3f
> Idx:112; ID:12; I_ATOM_F1 : Atom format 1.; N
> Idx:113; ID:12; I_TIMESTAMP : Timestamp.; Updated val = 0x2f373e37e02; CC=0x45
> Idx:116; ID:12; I_ATOM_F1 : Atom format 1.; E
> Idx:117; ID:12; I_ADDR_S_IS0 : Address, Short, IS0.; Addr=0xFFFF8000086B52D4 ~[0x152D4]
> 
> Since the source of timestamp is the Generic Timer block and the CPUs
> run at higher frequencies, this behaviour could be possible on high
> performance ARM cores.
> 
> Having consecutive timestamps with same value is resulting in a
> slightly jumbled order (in nanosecs) in perf script's time column.
> A snippet corresponding to the Coresight trace data mentioned above:
> ...
> perf   965/965   [001]  3182.286629044:            instructions:k:   return                               0 ffff8000086a761c coresight_timeout+0xc8
> perf   965/965   [001]  3182.286629044:            instructions:k:   return                               0 ffff8000086a7620 coresight_timeout+0xcc
> perf   965/965   [001]  3182.286629046:            instructions:k:   jmp                                  0 ffff8000086a75c8 coresight_timeout+0x74
> perf   965/965   [001]  3182.286629046:            instructions:k:   jmp                                  0 ffff8000086a75cc coresight_timeout+0x78
> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75d0 coresight_timeout+0x7c
> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75d4 coresight_timeout+0x80
> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75d8 coresight_timeout+0x84
> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75dc coresight_timeout+0x88
> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75e0 coresight_timeout+0x8c
> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75e4 coresight_timeout+0x90
> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75e8 coresight_timeout+0x94
> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75ec coresight_timeout+0x98
> 
> Perf's do_soft_timestamp() logic in cs_etm_decoder.c file is incrementing
> the HW timestamp based on instruction count. Since the next timestamp
> also has the same value, it could be leading to this jumbled order.
> 
> We would like to know if this has been observed on other platforms ?
> And what could be a solution in SW for this ?

Nice catch. If I'm understanding this correctly it looks like the issue
is a combination of the cycle count in the packet being ignored by Perf,
and the instruction count being reset to 0 when a new timestamp is received.

It looks like it can be fixed by some combination of combining the cycle
count and instruction count and maybe not resetting instruction count if
the newly received timestamp is the same as the last one. I will look
into this.

We haven't noticed it on any other platforms, but we haven't been
looking too closely at the timestamps until now. Perhaps I can add a
test that checks if the trace in a known function goes in the correct
time order.

Thanks
James
  
James Clark Jan. 11, 2023, 11:35 a.m. UTC | #3
On 10/01/2023 16:42, Tanmay Jagdale wrote:
> Hi James,
> 
[...]

>> base-commit: 09e6f9f98370be9a9f8978139e0eb1be87d1125f
> I have tested this patch set on our platform and able to see updated
> timestamp values in perf script's output.
> 
> $ perf record -e cs_etm/cycacc,@tmc_etr0/k -C 9 taskset -c 9 sleep 2
> $ perf script --itrace=i1ns --ns -Fcomm,tid,pid,time,cpu,event,ip,sym,addr,symoff,flags,callindent
> 
> At certain points noticed that hardware emits same timestamp packets
> but with updated cycle count (CC) values. A small snippet of the log:
> 
> Idx:100; ID:12; I_ADDR_S_IS0 : Address, Short, IS0.; Addr=0xFFFF8000086A761C ~[0x761C]
> Idx:103; ID:12; I_TIMESTAMP : Timestamp.; Updated val = 0x2f373e37e02; CC=0x3d
> Idx:107; ID:12; I_ATOM_F2 : Atom format 2.; EN
> Idx:108; ID:12; I_TIMESTAMP : Timestamp.; Updated val = 0x2f373e37e02; CC=0x3f
> Idx:112; ID:12; I_ATOM_F1 : Atom format 1.; N
> Idx:113; ID:12; I_TIMESTAMP : Timestamp.; Updated val = 0x2f373e37e02; CC=0x45
> Idx:116; ID:12; I_ATOM_F1 : Atom format 1.; E
> Idx:117; ID:12; I_ADDR_S_IS0 : Address, Short, IS0.; Addr=0xFFFF8000086B52D4 ~[0x152D4]

Hi Tanmay,

Are you able to email or send a download link with the perf archive and
perf.data file of this recording please?

Thanks
James
  
James Clark Jan. 11, 2023, 5:47 p.m. UTC | #4
On 11/01/2023 10:56, James Clark wrote:
> 
> 
> On 10/01/2023 16:42, Tanmay Jagdale wrote:
> [...]
>>>
>>> base-commit: 09e6f9f98370be9a9f8978139e0eb1be87d1125f
>> I have tested this patch set on our platform and able to see updated
>> timestamp values in perf script's output.
>>
>> $ perf record -e cs_etm/cycacc,@tmc_etr0/k -C 9 taskset -c 9 sleep 2
>> $ perf script --itrace=i1ns --ns -Fcomm,tid,pid,time,cpu,event,ip,sym,addr,symoff,flags,callindent
>>
>> At certain points noticed that hardware emits same timestamp packets
>> but with updated cycle count (CC) values. A small snippet of the log:
>>
>> Idx:100; ID:12; I_ADDR_S_IS0 : Address, Short, IS0.; Addr=0xFFFF8000086A761C ~[0x761C]
>> Idx:103; ID:12; I_TIMESTAMP : Timestamp.; Updated val = 0x2f373e37e02; CC=0x3d
>> Idx:107; ID:12; I_ATOM_F2 : Atom format 2.; EN
>> Idx:108; ID:12; I_TIMESTAMP : Timestamp.; Updated val = 0x2f373e37e02; CC=0x3f
>> Idx:112; ID:12; I_ATOM_F1 : Atom format 1.; N
>> Idx:113; ID:12; I_TIMESTAMP : Timestamp.; Updated val = 0x2f373e37e02; CC=0x45
>> Idx:116; ID:12; I_ATOM_F1 : Atom format 1.; E
>> Idx:117; ID:12; I_ADDR_S_IS0 : Address, Short, IS0.; Addr=0xFFFF8000086B52D4 ~[0x152D4]
>>
>> Since the source of timestamp is the Generic Timer block and the CPUs
>> run at higher frequencies, this behaviour could be possible on high
>> performance ARM cores.
>>
>> Having consecutive timestamps with same value is resulting in a
>> slightly jumbled order (in nanosecs) in perf script's time column.
>> A snippet corresponding to the Coresight trace data mentioned above:
>> ...
>> perf   965/965   [001]  3182.286629044:            instructions:k:   return                               0 ffff8000086a761c coresight_timeout+0xc8
>> perf   965/965   [001]  3182.286629044:            instructions:k:   return                               0 ffff8000086a7620 coresight_timeout+0xcc
>> perf   965/965   [001]  3182.286629046:            instructions:k:   jmp                                  0 ffff8000086a75c8 coresight_timeout+0x74
>> perf   965/965   [001]  3182.286629046:            instructions:k:   jmp                                  0 ffff8000086a75cc coresight_timeout+0x78
>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75d0 coresight_timeout+0x7c
>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75d4 coresight_timeout+0x80
>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75d8 coresight_timeout+0x84
>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75dc coresight_timeout+0x88
>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75e0 coresight_timeout+0x8c
>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75e4 coresight_timeout+0x90
>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75e8 coresight_timeout+0x94
>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75ec coresight_timeout+0x98
>>
>> Perf's do_soft_timestamp() logic in cs_etm_decoder.c file is incrementing
>> the HW timestamp based on instruction count. Since the next timestamp
>> also has the same value, it could be leading to this jumbled order.
>>
>> We would like to know if this has been observed on other platforms ?
>> And what could be a solution in SW for this ?
> 
> Nice catch. If I'm understanding this correctly it looks like the issue
> is a combination of the cycle count in the packet being ignored by Perf,
> and the instruction count being reset to 0 when a new timestamp is received.
> 
> It looks like it can be fixed by some combination of combining the cycle
> count and instruction count and maybe not resetting instruction count if
> the newly received timestamp is the same as the last one. I will look
> into this.
> 
> We haven't noticed it on any other platforms, but we haven't been
> looking too closely at the timestamps until now. Perhaps I can add a
> test that checks if the trace in a known function goes in the correct
> time order.
> 

I'm thinking of something like the following patch to fix the ordering.
It doesn't use the cycle count, but I'm not sure if that would be
worthwhile in the end, considering that it would have sub nanosecond
resolution so wouldn't affect the Perf timestamps:


diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
index 31fa3b45134a..08a028e3e87a 100644
--- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
+++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
@@ -112,6 +112,19 @@ int cs_etm_decoder__get_packet(struct cs_etm_packet_queue *packet_queue,
 	return 1;
 }
 
+static u64 cs_etm_decoder__instr_count_to_ns(u64 instr_count)
+{
+	/*
+	 * Assume a maximum of 0.1ns elapsed per instruction. This would be the
+	 * case with a theoretical 10GHz core executing 1 instruction per cycle.
+	 * Used to estimate the sample time for synthesized instructions because
+	 * Coresight only emits a timestamp for a range of instructions rather
+	 * than per instruction.
+	 */
+	const int INSTR_PER_NS = 10;
+
+	return instr_count / INSTR_PER_NS;
+}
 static int cs_etm_decoder__gen_etmv3_config(struct cs_etm_trace_params *params,
 					    ocsd_etmv3_cfg *config)
 {
@@ -267,7 +280,7 @@ cs_etm_decoder__do_soft_timestamp(struct cs_etm_queue *etmq,
 	packet_queue->cs_timestamp = packet_queue->next_cs_timestamp;
 
 	/* Estimate the timestamp for the next range packet */
-	packet_queue->next_cs_timestamp += packet_queue->instr_count;
+	packet_queue->next_cs_timestamp += cs_etm_decoder__instr_count_to_ns(packet_queue->instr_count);
 	packet_queue->instr_count = 0;
 
 	/* Tell the front end which traceid_queue needs attention */
@@ -295,11 +308,17 @@ cs_etm_decoder__do_hard_timestamp(struct cs_etm_queue *etmq,
 	 * hence asking the decoder to keep decoding rather than stopping.
 	 */
 	if (packet_queue->cs_timestamp) {
-		packet_queue->next_cs_timestamp = elem->timestamp;
+		/*
+		 * Don't allow next_cs_timestamp to be less than the last one estimated by
+		 * cs_etm_decoder__do_soft_timestamp() otherwise new instructions would
+		 * appear to go back in time. In theory this should never happen, but if
+		 * it did, then next_cs_timestamp should eventually catch up to real time
+		 * unless every single range was predicted to be too long for some reason.
+		 */
+		packet_queue->next_cs_timestamp = max(elem->timestamp, packet_queue->next_cs_timestamp);
 		return OCSD_RESP_CONT;
 	}
 
-
 	if (!elem->timestamp) {
 		/*
 		 * Zero timestamps can be seen due to misconfiguration or hardware bugs.
@@ -312,7 +331,7 @@ cs_etm_decoder__do_hard_timestamp(struct cs_etm_queue *etmq,
 					". Decoding may be improved by prepending 'Z' to your current --itrace arguments.\n",
 					indx);
 
-	} else if (packet_queue->instr_count > elem->timestamp) {
+	} else if (cs_etm_decoder__instr_count_to_ns(packet_queue->instr_count) > elem->timestamp) {
 		/*
 		 * Sanity check that the elem->timestamp - packet_queue->instr_count would not
 		 * result in an underflow. Warn and clamp at 0 if it would.
@@ -327,7 +346,8 @@ cs_etm_decoder__do_hard_timestamp(struct cs_etm_queue *etmq,
 		 * which instructions started by subtracting the number of instructions
 		 * executed to the timestamp.
 		 */
-		packet_queue->cs_timestamp = elem->timestamp - packet_queue->instr_count;
+		packet_queue->cs_timestamp = elem->timestamp -
+			cs_etm_decoder__instr_count_to_ns(packet_queue->instr_count);
 	}
 	packet_queue->next_cs_timestamp = elem->timestamp;
 	packet_queue->instr_count = 0;
  
Tanmay Jagdale Jan. 12, 2023, 3:33 p.m. UTC | #5
Hi James

> >>> base-commit: 09e6f9f98370be9a9f8978139e0eb1be87d1125f
> >> I have tested this patch set on our platform and able to see updated
> >> timestamp values in perf script's output.
> >>
> >> $ perf record -e cs_etm/cycacc,@tmc_etr0/k -C 9 taskset -c 9 sleep 2
> >> $ perf script --itrace=i1ns --ns -Fcomm,tid,pid,time,cpu,event,ip,sym,addr,symoff,flags,callindent
> >>
> >> At certain points noticed that hardware emits same timestamp packets
> >> but with updated cycle count (CC) values. A small snippet of the log:
> >>
> >> Idx:100; ID:12; I_ADDR_S_IS0 : Address, Short, IS0.; Addr=0xFFFF8000086A761C ~[0x761C]
> >> Idx:103; ID:12; I_TIMESTAMP : Timestamp.; Updated val = 0x2f373e37e02; CC=0x3d
> >> Idx:107; ID:12; I_ATOM_F2 : Atom format 2.; EN
> >> Idx:108; ID:12; I_TIMESTAMP : Timestamp.; Updated val = 0x2f373e37e02; CC=0x3f
> >> Idx:112; ID:12; I_ATOM_F1 : Atom format 1.; N
> >> Idx:113; ID:12; I_TIMESTAMP : Timestamp.; Updated val = 0x2f373e37e02; CC=0x45
> >> Idx:116; ID:12; I_ATOM_F1 : Atom format 1.; E
> >> Idx:117; ID:12; I_ADDR_S_IS0 : Address, Short, IS0.; Addr=0xFFFF8000086B52D4 ~[0x152D4]
> >>
> >> Since the source of timestamp is the Generic Timer block and the CPUs
> >> run at higher frequencies, this behaviour could be possible on high
> >> performance ARM cores.
> >>
> >> Having consecutive timestamps with same value is resulting in a
> >> slightly jumbled order (in nanosecs) in perf script's time column.
> >> A snippet corresponding to the Coresight trace data mentioned above:
> >> ...
> >> perf   965/965   [001]  3182.286629044:            instructions:k:   return                               0 ffff8000086a761c coresight_timeout+0xc8
> >> perf   965/965   [001]  3182.286629044:            instructions:k:   return                               0 ffff8000086a7620 coresight_timeout+0xcc
> >> perf   965/965   [001]  3182.286629046:            instructions:k:   jmp                                  0 ffff8000086a75c8 coresight_timeout+0x74
> >> perf   965/965   [001]  3182.286629046:            instructions:k:   jmp                                  0 ffff8000086a75cc coresight_timeout+0x78
> >> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75d0 coresight_timeout+0x7c
> >> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75d4 coresight_timeout+0x80
> >> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75d8 coresight_timeout+0x84
> >> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75dc coresight_timeout+0x88
> >> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75e0 coresight_timeout+0x8c
> >> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75e4 coresight_timeout+0x90
> >> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75e8 coresight_timeout+0x94
> >> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75ec coresight_timeout+0x98
> >>
> >> Perf's do_soft_timestamp() logic in cs_etm_decoder.c file is incrementing
> >> the HW timestamp based on instruction count. Since the next timestamp
> >> also has the same value, it could be leading to this jumbled order.
> >>
> >> We would like to know if this has been observed on other platforms ?
> >> And what could be a solution in SW for this ?
> >
> > Nice catch. If I'm understanding this correctly it looks like the issue
> > is a combination of the cycle count in the packet being ignored by Perf,
> > and the instruction count being reset to 0 when a new timestamp is received.
> >
> > It looks like it can be fixed by some combination of combining the cycle
> > count and instruction count and maybe not resetting instruction count if
> > the newly received timestamp is the same as the last one. I will look
> > into this.
> >
> > We haven't noticed it on any other platforms, but we haven't been
> > looking too closely at the timestamps until now. Perhaps I can add a
> > test that checks if the trace in a known function goes in the correct
> > time order.
> >
> 
> I'm thinking of something like the following patch to fix the ordering.
> It doesn't use the cycle count, but I'm not sure if that would be
> worthwhile in the end, considering that it would have sub nanosecond
> resolution so wouldn't affect the Perf timestamps:
Thanks for coming up with a quick solution for this issue !

> 
> 
> diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
> index 31fa3b45134a..08a028e3e87a 100644
> --- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
> +++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
> @@ -112,6 +112,19 @@ int cs_etm_decoder__get_packet(struct cs_etm_packet_queue *packet_queue,
>  	return 1;
>  }
> 
> +static u64 cs_etm_decoder__instr_count_to_ns(u64 instr_count)
> +{
> +	/*
> +	 * Assume a maximum of 0.1ns elapsed per instruction. This would be the
> +	 * case with a theoretical 10GHz core executing 1 instruction per cycle.
> +	 * Used to estimate the sample time for synthesized instructions because
> +	 * Coresight only emits a timestamp for a range of instructions rather
> +	 * than per instruction.
> +	 */
> +	const int INSTR_PER_NS = 10;
> +
> +	return instr_count / INSTR_PER_NS;
> +}
>  static int cs_etm_decoder__gen_etmv3_config(struct cs_etm_trace_params *params,
>  					    ocsd_etmv3_cfg *config)
>  {
> @@ -267,7 +280,7 @@ cs_etm_decoder__do_soft_timestamp(struct cs_etm_queue *etmq,
>  	packet_queue->cs_timestamp = packet_queue->next_cs_timestamp;
> 
>  	/* Estimate the timestamp for the next range packet */
> -	packet_queue->next_cs_timestamp += packet_queue->instr_count;
> +	packet_queue->next_cs_timestamp += cs_etm_decoder__instr_count_to_ns(packet_queue->instr_count);
>  	packet_queue->instr_count = 0;
> 
>  	/* Tell the front end which traceid_queue needs attention */
> @@ -295,11 +308,17 @@ cs_etm_decoder__do_hard_timestamp(struct cs_etm_queue *etmq,
>  	 * hence asking the decoder to keep decoding rather than stopping.
>  	 */
>  	if (packet_queue->cs_timestamp) {
> -		packet_queue->next_cs_timestamp = elem->timestamp;
> +		/*
> +		 * Don't allow next_cs_timestamp to be less than the last one estimated by
> +		 * cs_etm_decoder__do_soft_timestamp() otherwise new instructions would
> +		 * appear to go back in time. In theory this should never happen, but if
> +		 * it did, then next_cs_timestamp should eventually catch up to real time
> +		 * unless every single range was predicted to be too long for some reason.
> +		 */
> +		packet_queue->next_cs_timestamp = max(elem->timestamp, packet_queue->next_cs_timestamp);
>  		return OCSD_RESP_CONT;
>  	}
> 
> -
>  	if (!elem->timestamp) {
>  		/*
>  		 * Zero timestamps can be seen due to misconfiguration or hardware bugs.
> @@ -312,7 +331,7 @@ cs_etm_decoder__do_hard_timestamp(struct cs_etm_queue *etmq,
>  					". Decoding may be improved by prepending 'Z' to your current --itrace
> arguments.\n",
>  					indx);
> 
> -	} else if (packet_queue->instr_count > elem->timestamp) {
> +	} else if (cs_etm_decoder__instr_count_to_ns(packet_queue->instr_count) > elem->timestamp) {
>  		/*
>  		 * Sanity check that the elem->timestamp - packet_queue->instr_count would not
>  		 * result in an underflow. Warn and clamp at 0 if it would.
> @@ -327,7 +346,8 @@ cs_etm_decoder__do_hard_timestamp(struct cs_etm_queue *etmq,
>  		 * which instructions started by subtracting the number of instructions
>  		 * executed to the timestamp.
>  		 */
> -		packet_queue->cs_timestamp = elem->timestamp - packet_queue->instr_count;
> +		packet_queue->cs_timestamp = elem->timestamp -
> +			cs_etm_decoder__instr_count_to_ns(packet_queue->instr_count);
>  	}
>  	packet_queue->next_cs_timestamp = elem->timestamp;
>  	packet_queue->instr_count = 0;
I have tested this diff along with the patchset and found that
timestamps are generated/estimated in an increasing order.

However, found few corner cases where they weren't in order.
I'm currently gathering more information on these corner cases.

Used the following steps to find inconsistencies:
$ perf record -e cs_etm/@tmc_etr0/k -C 4 taskset -c 4 sleep 1
$ perf script --itrace=i1ns --ns -Fcomm,tid,pid,time,cpu,event,ip,sym,addr,symoff,flags,callindent > itrace
$ sed 's/://g' itrace | awk -F '.' ' { print $2 } ' | awk '{ if ($1 < prev) { print "line:" NR " " $0 } {prev=$1}}'

Setting INSTR_PER_NS to higher values (14,16,20 etc) results in
lower inconsistencies but it would defeat the purpose of
estimating values in software since we could be returning 0
from __instr_count_to_ns() in many cases.

Please share your thoughts.

Thanks,
Tanmay
  
James Clark Jan. 13, 2023, 11:12 a.m. UTC | #6
On 12/01/2023 15:33, Tanmay Jagdale wrote:
> Hi James
> 
>>>>> base-commit: 09e6f9f98370be9a9f8978139e0eb1be87d1125f
>>>> I have tested this patch set on our platform and able to see updated
>>>> timestamp values in perf script's output.
>>>>
>>>> $ perf record -e cs_etm/cycacc,@tmc_etr0/k -C 9 taskset -c 9 sleep 2
>>>> $ perf script --itrace=i1ns --ns -Fcomm,tid,pid,time,cpu,event,ip,sym,addr,symoff,flags,callindent
>>>>
>>>> At certain points noticed that hardware emits same timestamp packets
>>>> but with updated cycle count (CC) values. A small snippet of the log:
>>>>
>>>> Idx:100; ID:12; I_ADDR_S_IS0 : Address, Short, IS0.; Addr=0xFFFF8000086A761C ~[0x761C]
>>>> Idx:103; ID:12; I_TIMESTAMP : Timestamp.; Updated val = 0x2f373e37e02; CC=0x3d
>>>> Idx:107; ID:12; I_ATOM_F2 : Atom format 2.; EN
>>>> Idx:108; ID:12; I_TIMESTAMP : Timestamp.; Updated val = 0x2f373e37e02; CC=0x3f
>>>> Idx:112; ID:12; I_ATOM_F1 : Atom format 1.; N
>>>> Idx:113; ID:12; I_TIMESTAMP : Timestamp.; Updated val = 0x2f373e37e02; CC=0x45
>>>> Idx:116; ID:12; I_ATOM_F1 : Atom format 1.; E
>>>> Idx:117; ID:12; I_ADDR_S_IS0 : Address, Short, IS0.; Addr=0xFFFF8000086B52D4 ~[0x152D4]
>>>>
>>>> Since the source of timestamp is the Generic Timer block and the CPUs
>>>> run at higher frequencies, this behaviour could be possible on high
>>>> performance ARM cores.
>>>>
>>>> Having consecutive timestamps with same value is resulting in a
>>>> slightly jumbled order (in nanosecs) in perf script's time column.
>>>> A snippet corresponding to the Coresight trace data mentioned above:
>>>> ...
>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   return                               0 ffff8000086a761c coresight_timeout+0xc8
>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   return                               0 ffff8000086a7620 coresight_timeout+0xcc
>>>> perf   965/965   [001]  3182.286629046:            instructions:k:   jmp                                  0 ffff8000086a75c8 coresight_timeout+0x74
>>>> perf   965/965   [001]  3182.286629046:            instructions:k:   jmp                                  0 ffff8000086a75cc coresight_timeout+0x78
>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75d0 coresight_timeout+0x7c
>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75d4 coresight_timeout+0x80
>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75d8 coresight_timeout+0x84
>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75dc coresight_timeout+0x88
>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75e0 coresight_timeout+0x8c
>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75e4 coresight_timeout+0x90
>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75e8 coresight_timeout+0x94
>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75ec coresight_timeout+0x98
>>>>
>>>> Perf's do_soft_timestamp() logic in cs_etm_decoder.c file is incrementing
>>>> the HW timestamp based on instruction count. Since the next timestamp
>>>> also has the same value, it could be leading to this jumbled order.
>>>>
>>>> We would like to know if this has been observed on other platforms ?
>>>> And what could be a solution in SW for this ?
>>>
>>> Nice catch. If I'm understanding this correctly it looks like the issue
>>> is a combination of the cycle count in the packet being ignored by Perf,
>>> and the instruction count being reset to 0 when a new timestamp is received.
>>>
>>> It looks like it can be fixed by some combination of combining the cycle
>>> count and instruction count and maybe not resetting instruction count if
>>> the newly received timestamp is the same as the last one. I will look
>>> into this.
>>>
>>> We haven't noticed it on any other platforms, but we haven't been
>>> looking too closely at the timestamps until now. Perhaps I can add a
>>> test that checks if the trace in a known function goes in the correct
>>> time order.
>>>
>>
>> I'm thinking of something like the following patch to fix the ordering.
>> It doesn't use the cycle count, but I'm not sure if that would be
>> worthwhile in the end, considering that it would have sub nanosecond
>> resolution so wouldn't affect the Perf timestamps:
> Thanks for coming up with a quick solution for this issue !
> 
>>
>>
>> diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
>> index 31fa3b45134a..08a028e3e87a 100644
>> --- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
>> +++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
>> @@ -112,6 +112,19 @@ int cs_etm_decoder__get_packet(struct cs_etm_packet_queue *packet_queue,
>>  	return 1;
>>  }
>>
>> +static u64 cs_etm_decoder__instr_count_to_ns(u64 instr_count)
>> +{
>> +	/*
>> +	 * Assume a maximum of 0.1ns elapsed per instruction. This would be the
>> +	 * case with a theoretical 10GHz core executing 1 instruction per cycle.
>> +	 * Used to estimate the sample time for synthesized instructions because
>> +	 * Coresight only emits a timestamp for a range of instructions rather
>> +	 * than per instruction.
>> +	 */
>> +	const int INSTR_PER_NS = 10;
>> +
>> +	return instr_count / INSTR_PER_NS;
>> +}
>>  static int cs_etm_decoder__gen_etmv3_config(struct cs_etm_trace_params *params,
>>  					    ocsd_etmv3_cfg *config)
>>  {
>> @@ -267,7 +280,7 @@ cs_etm_decoder__do_soft_timestamp(struct cs_etm_queue *etmq,
>>  	packet_queue->cs_timestamp = packet_queue->next_cs_timestamp;
>>
>>  	/* Estimate the timestamp for the next range packet */
>> -	packet_queue->next_cs_timestamp += packet_queue->instr_count;
>> +	packet_queue->next_cs_timestamp += cs_etm_decoder__instr_count_to_ns(packet_queue->instr_count);
>>  	packet_queue->instr_count = 0;
>>
>>  	/* Tell the front end which traceid_queue needs attention */
>> @@ -295,11 +308,17 @@ cs_etm_decoder__do_hard_timestamp(struct cs_etm_queue *etmq,
>>  	 * hence asking the decoder to keep decoding rather than stopping.
>>  	 */
>>  	if (packet_queue->cs_timestamp) {
>> -		packet_queue->next_cs_timestamp = elem->timestamp;
>> +		/*
>> +		 * Don't allow next_cs_timestamp to be less than the last one estimated by
>> +		 * cs_etm_decoder__do_soft_timestamp() otherwise new instructions would
>> +		 * appear to go back in time. In theory this should never happen, but if
>> +		 * it did, then next_cs_timestamp should eventually catch up to real time
>> +		 * unless every single range was predicted to be too long for some reason.
>> +		 */
>> +		packet_queue->next_cs_timestamp = max(elem->timestamp, packet_queue->next_cs_timestamp);
>>  		return OCSD_RESP_CONT;
>>  	}
>>
>> -
>>  	if (!elem->timestamp) {
>>  		/*
>>  		 * Zero timestamps can be seen due to misconfiguration or hardware bugs.
>> @@ -312,7 +331,7 @@ cs_etm_decoder__do_hard_timestamp(struct cs_etm_queue *etmq,
>>  					". Decoding may be improved by prepending 'Z' to your current --itrace
>> arguments.\n",
>>  					indx);
>>
>> -	} else if (packet_queue->instr_count > elem->timestamp) {
>> +	} else if (cs_etm_decoder__instr_count_to_ns(packet_queue->instr_count) > elem->timestamp) {
>>  		/*
>>  		 * Sanity check that the elem->timestamp - packet_queue->instr_count would not
>>  		 * result in an underflow. Warn and clamp at 0 if it would.
>> @@ -327,7 +346,8 @@ cs_etm_decoder__do_hard_timestamp(struct cs_etm_queue *etmq,
>>  		 * which instructions started by subtracting the number of instructions
>>  		 * executed to the timestamp.
>>  		 */
>> -		packet_queue->cs_timestamp = elem->timestamp - packet_queue->instr_count;
>> +		packet_queue->cs_timestamp = elem->timestamp -
>> +			cs_etm_decoder__instr_count_to_ns(packet_queue->instr_count);
>>  	}
>>  	packet_queue->next_cs_timestamp = elem->timestamp;
>>  	packet_queue->instr_count = 0;
> I have tested this diff along with the patchset and found that
> timestamps are generated/estimated in an increasing order.
> 
> However, found few corner cases where they weren't in order.
> I'm currently gathering more information on these corner cases.
> 
> Used the following steps to find inconsistencies:
> $ perf record -e cs_etm/@tmc_etr0/k -C 4 taskset -c 4 sleep 1
> $ perf script --itrace=i1ns --ns -Fcomm,tid,pid,time,cpu,event,ip,sym,addr,symoff,flags,callindent > itrace
> $ sed 's/://g' itrace | awk -F '.' ' { print $2 } ' | awk '{ if ($1 < prev) { print "line:" NR " " $0 } {prev=$1}}'
> 
> Setting INSTR_PER_NS to higher values (14,16,20 etc) results in
> lower inconsistencies but it would defeat the purpose of
> estimating values in software since we could be returning 0
> from __instr_count_to_ns() in many cases.
> 
> Please share your thoughts.

I believe this could be because I was adding nanosecond values
to the raw coresight counter before it was converted which was
a mistake. I have made another change to move the conversion
as early as possible to avoid mistakes like this.

Increasing the INSTR_PER_NS shouldn't be required because of
the max() call, in fact INSTR_PER_NS isn't necessarily
needed at all, I just thought it would make it more accurate.

I'll test the new version on the file that you sent before
pasting it here. If you have another recording that has more
of the edge cases then please also upload that one too.

James
  
James Clark Jan. 19, 2023, 3:53 p.m. UTC | #7
On 13/01/2023 11:12, James Clark wrote:
> 
> 
> On 12/01/2023 15:33, Tanmay Jagdale wrote:
>> Hi James
>>
>>>>>> base-commit: 09e6f9f98370be9a9f8978139e0eb1be87d1125f
>>>>> I have tested this patch set on our platform and able to see updated
>>>>> timestamp values in perf script's output.
>>>>>
>>>>> $ perf record -e cs_etm/cycacc,@tmc_etr0/k -C 9 taskset -c 9 sleep 2
>>>>> $ perf script --itrace=i1ns --ns -Fcomm,tid,pid,time,cpu,event,ip,sym,addr,symoff,flags,callindent
>>>>>
>>>>> At certain points noticed that hardware emits same timestamp packets
>>>>> but with updated cycle count (CC) values. A small snippet of the log:
>>>>>
>>>>> Idx:100; ID:12; I_ADDR_S_IS0 : Address, Short, IS0.; Addr=0xFFFF8000086A761C ~[0x761C]
>>>>> Idx:103; ID:12; I_TIMESTAMP : Timestamp.; Updated val = 0x2f373e37e02; CC=0x3d
>>>>> Idx:107; ID:12; I_ATOM_F2 : Atom format 2.; EN
>>>>> Idx:108; ID:12; I_TIMESTAMP : Timestamp.; Updated val = 0x2f373e37e02; CC=0x3f
>>>>> Idx:112; ID:12; I_ATOM_F1 : Atom format 1.; N
>>>>> Idx:113; ID:12; I_TIMESTAMP : Timestamp.; Updated val = 0x2f373e37e02; CC=0x45
>>>>> Idx:116; ID:12; I_ATOM_F1 : Atom format 1.; E
>>>>> Idx:117; ID:12; I_ADDR_S_IS0 : Address, Short, IS0.; Addr=0xFFFF8000086B52D4 ~[0x152D4]
>>>>>
>>>>> Since the source of timestamp is the Generic Timer block and the CPUs
>>>>> run at higher frequencies, this behaviour could be possible on high
>>>>> performance ARM cores.
>>>>>
>>>>> Having consecutive timestamps with same value is resulting in a
>>>>> slightly jumbled order (in nanosecs) in perf script's time column.
>>>>> A snippet corresponding to the Coresight trace data mentioned above:
>>>>> ...
>>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   return                               0 ffff8000086a761c coresight_timeout+0xc8
>>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   return                               0 ffff8000086a7620 coresight_timeout+0xcc
>>>>> perf   965/965   [001]  3182.286629046:            instructions:k:   jmp                                  0 ffff8000086a75c8 coresight_timeout+0x74
>>>>> perf   965/965   [001]  3182.286629046:            instructions:k:   jmp                                  0 ffff8000086a75cc coresight_timeout+0x78
>>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75d0 coresight_timeout+0x7c
>>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75d4 coresight_timeout+0x80
>>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75d8 coresight_timeout+0x84
>>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75dc coresight_timeout+0x88
>>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75e0 coresight_timeout+0x8c
>>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75e4 coresight_timeout+0x90
>>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75e8 coresight_timeout+0x94
>>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0 ffff8000086a75ec coresight_timeout+0x98
>>>>>
>>>>> Perf's do_soft_timestamp() logic in cs_etm_decoder.c file is incrementing
>>>>> the HW timestamp based on instruction count. Since the next timestamp
>>>>> also has the same value, it could be leading to this jumbled order.
>>>>>
>>>>> We would like to know if this has been observed on other platforms ?
>>>>> And what could be a solution in SW for this ?
>>>>
>>>> Nice catch. If I'm understanding this correctly it looks like the issue
>>>> is a combination of the cycle count in the packet being ignored by Perf,
>>>> and the instruction count being reset to 0 when a new timestamp is received.
>>>>
>>>> It looks like it can be fixed by some combination of combining the cycle
>>>> count and instruction count and maybe not resetting instruction count if
>>>> the newly received timestamp is the same as the last one. I will look
>>>> into this.
>>>>
>>>> We haven't noticed it on any other platforms, but we haven't been
>>>> looking too closely at the timestamps until now. Perhaps I can add a
>>>> test that checks if the trace in a known function goes in the correct
>>>> time order.
>>>>
>>>
>>> I'm thinking of something like the following patch to fix the ordering.
>>> It doesn't use the cycle count, but I'm not sure if that would be
>>> worthwhile in the end, considering that it would have sub nanosecond
>>> resolution so wouldn't affect the Perf timestamps:
>> Thanks for coming up with a quick solution for this issue !
>>
>>>
>>>
>>> diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
>>> index 31fa3b45134a..08a028e3e87a 100644
>>> --- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
>>> +++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
>>> @@ -112,6 +112,19 @@ int cs_etm_decoder__get_packet(struct cs_etm_packet_queue *packet_queue,
>>>  	return 1;
>>>  }
>>>
>>> +static u64 cs_etm_decoder__instr_count_to_ns(u64 instr_count)
>>> +{
>>> +	/*
>>> +	 * Assume a maximum of 0.1ns elapsed per instruction. This would be the
>>> +	 * case with a theoretical 10GHz core executing 1 instruction per cycle.
>>> +	 * Used to estimate the sample time for synthesized instructions because
>>> +	 * Coresight only emits a timestamp for a range of instructions rather
>>> +	 * than per instruction.
>>> +	 */
>>> +	const int INSTR_PER_NS = 10;
>>> +
>>> +	return instr_count / INSTR_PER_NS;
>>> +}
>>>  static int cs_etm_decoder__gen_etmv3_config(struct cs_etm_trace_params *params,
>>>  					    ocsd_etmv3_cfg *config)
>>>  {
>>> @@ -267,7 +280,7 @@ cs_etm_decoder__do_soft_timestamp(struct cs_etm_queue *etmq,
>>>  	packet_queue->cs_timestamp = packet_queue->next_cs_timestamp;
>>>
>>>  	/* Estimate the timestamp for the next range packet */
>>> -	packet_queue->next_cs_timestamp += packet_queue->instr_count;
>>> +	packet_queue->next_cs_timestamp += cs_etm_decoder__instr_count_to_ns(packet_queue->instr_count);
>>>  	packet_queue->instr_count = 0;
>>>
>>>  	/* Tell the front end which traceid_queue needs attention */
>>> @@ -295,11 +308,17 @@ cs_etm_decoder__do_hard_timestamp(struct cs_etm_queue *etmq,
>>>  	 * hence asking the decoder to keep decoding rather than stopping.
>>>  	 */
>>>  	if (packet_queue->cs_timestamp) {
>>> -		packet_queue->next_cs_timestamp = elem->timestamp;
>>> +		/*
>>> +		 * Don't allow next_cs_timestamp to be less than the last one estimated by
>>> +		 * cs_etm_decoder__do_soft_timestamp() otherwise new instructions would
>>> +		 * appear to go back in time. In theory this should never happen, but if
>>> +		 * it did, then next_cs_timestamp should eventually catch up to real time
>>> +		 * unless every single range was predicted to be too long for some reason.
>>> +		 */
>>> +		packet_queue->next_cs_timestamp = max(elem->timestamp, packet_queue->next_cs_timestamp);
>>>  		return OCSD_RESP_CONT;
>>>  	}
>>>
>>> -
>>>  	if (!elem->timestamp) {
>>>  		/*
>>>  		 * Zero timestamps can be seen due to misconfiguration or hardware bugs.
>>> @@ -312,7 +331,7 @@ cs_etm_decoder__do_hard_timestamp(struct cs_etm_queue *etmq,
>>>  					". Decoding may be improved by prepending 'Z' to your current --itrace
>>> arguments.\n",
>>>  					indx);
>>>
>>> -	} else if (packet_queue->instr_count > elem->timestamp) {
>>> +	} else if (cs_etm_decoder__instr_count_to_ns(packet_queue->instr_count) > elem->timestamp) {
>>>  		/*
>>>  		 * Sanity check that the elem->timestamp - packet_queue->instr_count would not
>>>  		 * result in an underflow. Warn and clamp at 0 if it would.
>>> @@ -327,7 +346,8 @@ cs_etm_decoder__do_hard_timestamp(struct cs_etm_queue *etmq,
>>>  		 * which instructions started by subtracting the number of instructions
>>>  		 * executed to the timestamp.
>>>  		 */
>>> -		packet_queue->cs_timestamp = elem->timestamp - packet_queue->instr_count;
>>> +		packet_queue->cs_timestamp = elem->timestamp -
>>> +			cs_etm_decoder__instr_count_to_ns(packet_queue->instr_count);
>>>  	}
>>>  	packet_queue->next_cs_timestamp = elem->timestamp;
>>>  	packet_queue->instr_count = 0;
>> I have tested this diff along with the patchset and found that
>> timestamps are generated/estimated in an increasing order.
>>
>> However, found few corner cases where they weren't in order.
>> I'm currently gathering more information on these corner cases.
>>
>> Used the following steps to find inconsistencies:
>> $ perf record -e cs_etm/@tmc_etr0/k -C 4 taskset -c 4 sleep 1
>> $ perf script --itrace=i1ns --ns -Fcomm,tid,pid,time,cpu,event,ip,sym,addr,symoff,flags,callindent > itrace
>> $ sed 's/://g' itrace | awk -F '.' ' { print $2 } ' | awk '{ if ($1 < prev) { print "line:" NR " " $0 } {prev=$1}}'
>>
>> Setting INSTR_PER_NS to higher values (14,16,20 etc) results in
>> lower inconsistencies but it would defeat the purpose of
>> estimating values in software since we could be returning 0
>> from __instr_count_to_ns() in many cases.
>>
>> Please share your thoughts.
> 
> I believe this could be because I was adding nanosecond values
> to the raw coresight counter before it was converted which was
> a mistake. I have made another change to move the conversion
> as early as possible to avoid mistakes like this.
> 
> Increasing the INSTR_PER_NS shouldn't be required because of
> the max() call, in fact INSTR_PER_NS isn't necessarily
> needed at all, I just thought it would make it more accurate.
> 
> I'll test the new version on the file that you sent before
> pasting it here. If you have another recording that has more
> of the edge cases then please also upload that one too.

Hi Tanmay,

I've sent v4 and I don't see any issues now with the files that you
sent. There were a couple of different edge cases that I listed in the
last commit message. Let me know if you find any more, otherwise if you
leave your review or tested-by tag that would be great.

Thanks
James
  
Tanmay Jagdale Jan. 20, 2023, 5:30 p.m. UTC | #8
Hi James,

> On 13/01/2023 11:12, James Clark wrote:
> > On 12/01/2023 15:33, Tanmay Jagdale wrote:
> >> Hi James
> >>
> >>>>>> base-commit: 09e6f9f98370be9a9f8978139e0eb1be87d1125f
> >>>>> I have tested this patch set on our platform and able to see updated
> >>>>> timestamp values in perf script's output.
> >>>>>
> >>>>> $ perf record -e cs_etm/cycacc,@tmc_etr0/k -C 9 taskset -c 9 sleep 2
> >>>>> $ perf script --itrace=i1ns --ns -Fcomm,tid,pid,time,cpu,event,ip,sym,addr,symoff,flags,callindent
> >>>>>
> >>>>> At certain points noticed that hardware emits same timestamp packets
> >>>>> but with updated cycle count (CC) values. A small snippet of the log:
> >>>>>
> >>>>> Idx:100; ID:12; I_ADDR_S_IS0 : Address, Short, IS0.; Addr=0xFFFF8000086A761C ~[0x761C]
> >>>>> Idx:103; ID:12; I_TIMESTAMP : Timestamp.; Updated val = 0x2f373e37e02; CC=0x3d
> >>>>> Idx:107; ID:12; I_ATOM_F2 : Atom format 2.; EN
> >>>>> Idx:108; ID:12; I_TIMESTAMP : Timestamp.; Updated val = 0x2f373e37e02; CC=0x3f
> >>>>> Idx:112; ID:12; I_ATOM_F1 : Atom format 1.; N
> >>>>> Idx:113; ID:12; I_TIMESTAMP : Timestamp.; Updated val = 0x2f373e37e02; CC=0x45
> >>>>> Idx:116; ID:12; I_ATOM_F1 : Atom format 1.; E
> >>>>> Idx:117; ID:12; I_ADDR_S_IS0 : Address, Short, IS0.; Addr=0xFFFF8000086B52D4 ~[0x152D4]
> >>>>>
> >>>>> Since the source of timestamp is the Generic Timer block and the CPUs
> >>>>> run at higher frequencies, this behaviour could be possible on high
> >>>>> performance ARM cores.
> >>>>>
> >>>>> Having consecutive timestamps with same value is resulting in a
> >>>>> slightly jumbled order (in nanosecs) in perf script's time column.
> >>>>> A snippet corresponding to the Coresight trace data mentioned above:
> >>>>> ...
> >>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   return                               0
> ffff8000086a761c coresight_timeout+0xc8
> >>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   return                               0
> ffff8000086a7620 coresight_timeout+0xcc
> >>>>> perf   965/965   [001]  3182.286629046:            instructions:k:   jmp                                  0
> ffff8000086a75c8 coresight_timeout+0x74
> >>>>> perf   965/965   [001]  3182.286629046:            instructions:k:   jmp                                  0
> ffff8000086a75cc coresight_timeout+0x78
> >>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0
> ffff8000086a75d0 coresight_timeout+0x7c
> >>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0
> ffff8000086a75d4 coresight_timeout+0x80
> >>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0
> ffff8000086a75d8 coresight_timeout+0x84
> >>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0
> ffff8000086a75dc coresight_timeout+0x88
> >>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0
> ffff8000086a75e0 coresight_timeout+0x8c
> >>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0
> ffff8000086a75e4 coresight_timeout+0x90
> >>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0
> ffff8000086a75e8 coresight_timeout+0x94
> >>>>> perf   965/965   [001]  3182.286629044:            instructions:k:   jcc                                  0
> ffff8000086a75ec coresight_timeout+0x98
> >>>>>
> >>>>> Perf's do_soft_timestamp() logic in cs_etm_decoder.c file is incrementing
> >>>>> the HW timestamp based on instruction count. Since the next timestamp
> >>>>> also has the same value, it could be leading to this jumbled order.
> >>>>>
> >>>>> We would like to know if this has been observed on other platforms ?
> >>>>> And what could be a solution in SW for this ?
> >>>>
> >>>> Nice catch. If I'm understanding this correctly it looks like the issue
> >>>> is a combination of the cycle count in the packet being ignored by Perf,
> >>>> and the instruction count being reset to 0 when a new timestamp is received.
> >>>>
> >>>> It looks like it can be fixed by some combination of combining the cycle
> >>>> count and instruction count and maybe not resetting instruction count if
> >>>> the newly received timestamp is the same as the last one. I will look
> >>>> into this.
> >>>>
> >>>> We haven't noticed it on any other platforms, but we haven't been
> >>>> looking too closely at the timestamps until now. Perhaps I can add a
> >>>> test that checks if the trace in a known function goes in the correct
> >>>> time order.
> >>>>
> >>>
> >>> I'm thinking of something like the following patch to fix the ordering.
> >>> It doesn't use the cycle count, but I'm not sure if that would be
> >>> worthwhile in the end, considering that it would have sub nanosecond
> >>> resolution so wouldn't affect the Perf timestamps:
> >> Thanks for coming up with a quick solution for this issue !
> >>
> >>>
> >>>
> >>> diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
> >>> index 31fa3b45134a..08a028e3e87a 100644
> >>> --- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
> >>> +++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
> >>> @@ -112,6 +112,19 @@ int cs_etm_decoder__get_packet(struct cs_etm_packet_queue *packet_queue,
> >>>  	return 1;
> >>>  }
> >>>
> >>> +static u64 cs_etm_decoder__instr_count_to_ns(u64 instr_count)
> >>> +{
> >>> +	/*
> >>> +	 * Assume a maximum of 0.1ns elapsed per instruction. This would be the
> >>> +	 * case with a theoretical 10GHz core executing 1 instruction per cycle.
> >>> +	 * Used to estimate the sample time for synthesized instructions because
> >>> +	 * Coresight only emits a timestamp for a range of instructions rather
> >>> +	 * than per instruction.
> >>> +	 */
> >>> +	const int INSTR_PER_NS = 10;
> >>> +
> >>> +	return instr_count / INSTR_PER_NS;
> >>> +}
> >>>  static int cs_etm_decoder__gen_etmv3_config(struct cs_etm_trace_params *params,
> >>>  					    ocsd_etmv3_cfg *config)
> >>>  {
> >>> @@ -267,7 +280,7 @@ cs_etm_decoder__do_soft_timestamp(struct cs_etm_queue *etmq,
> >>>  	packet_queue->cs_timestamp = packet_queue->next_cs_timestamp;
> >>>
> >>>  	/* Estimate the timestamp for the next range packet */
> >>> -	packet_queue->next_cs_timestamp += packet_queue->instr_count;
> >>> +	packet_queue->next_cs_timestamp += cs_etm_decoder__instr_count_to_ns(packet_queue->instr_count);
> >>>  	packet_queue->instr_count = 0;
> >>>
> >>>  	/* Tell the front end which traceid_queue needs attention */
> >>> @@ -295,11 +308,17 @@ cs_etm_decoder__do_hard_timestamp(struct cs_etm_queue *etmq,
> >>>  	 * hence asking the decoder to keep decoding rather than stopping.
> >>>  	 */
> >>>  	if (packet_queue->cs_timestamp) {
> >>> -		packet_queue->next_cs_timestamp = elem->timestamp;
> >>> +		/*
> >>> +		 * Don't allow next_cs_timestamp to be less than the last one estimated by
> >>> +		 * cs_etm_decoder__do_soft_timestamp() otherwise new instructions would
> >>> +		 * appear to go back in time. In theory this should never happen, but if
> >>> +		 * it did, then next_cs_timestamp should eventually catch up to real time
> >>> +		 * unless every single range was predicted to be too long for some reason.
> >>> +		 */
> >>> +		packet_queue->next_cs_timestamp = max(elem->timestamp, packet_queue->next_cs_timestamp);
> >>>  		return OCSD_RESP_CONT;
> >>>  	}
> >>>
> >>> -
> >>>  	if (!elem->timestamp) {
> >>>  		/*
> >>>  		 * Zero timestamps can be seen due to misconfiguration or hardware bugs.
> >>> @@ -312,7 +331,7 @@ cs_etm_decoder__do_hard_timestamp(struct cs_etm_queue *etmq,
> >>>  					". Decoding may be improved by prepending 'Z' to your current --itrace
> >>> arguments.\n",
> >>>  					indx);
> >>>
> >>> -	} else if (packet_queue->instr_count > elem->timestamp) {
> >>> +	} else if (cs_etm_decoder__instr_count_to_ns(packet_queue->instr_count) > elem->timestamp) {
> >>>  		/*
> >>>  		 * Sanity check that the elem->timestamp - packet_queue->instr_count would not
> >>>  		 * result in an underflow. Warn and clamp at 0 if it would.
> >>> @@ -327,7 +346,8 @@ cs_etm_decoder__do_hard_timestamp(struct cs_etm_queue *etmq,
> >>>  		 * which instructions started by subtracting the number of instructions
> >>>  		 * executed to the timestamp.
> >>>  		 */
> >>> -		packet_queue->cs_timestamp = elem->timestamp - packet_queue->instr_count;
> >>> +		packet_queue->cs_timestamp = elem->timestamp -
> >>> +			cs_etm_decoder__instr_count_to_ns(packet_queue->instr_count);
> >>>  	}
> >>>  	packet_queue->next_cs_timestamp = elem->timestamp;
> >>>  	packet_queue->instr_count = 0;
> >> I have tested this diff along with the patchset and found that
> >> timestamps are generated/estimated in an increasing order.
> >>
> >> However, found few corner cases where they weren't in order.
> >> I'm currently gathering more information on these corner cases.
> >>
> >> Used the following steps to find inconsistencies:
> >> $ perf record -e cs_etm/@tmc_etr0/k -C 4 taskset -c 4 sleep 1
> >> $ perf script --itrace=i1ns --ns -Fcomm,tid,pid,time,cpu,event,ip,sym,addr,symoff,flags,callindent > itrace
> >> $ sed 's/://g' itrace | awk -F '.' ' { print $2 } ' | awk '{ if ($1 < prev) { print "line:" NR " " $0 }
> {prev=$1}}'
> >>
> >> Setting INSTR_PER_NS to higher values (14,16,20 etc) results in
> >> lower inconsistencies but it would defeat the purpose of
> >> estimating values in software since we could be returning 0
> >> from __instr_count_to_ns() in many cases.
> >>
> >> Please share your thoughts.
> >
> > I believe this could be because I was adding nanosecond values
> > to the raw coresight counter before it was converted which was
> > a mistake. I have made another change to move the conversion
> > as early as possible to avoid mistakes like this.
> >
> > Increasing the INSTR_PER_NS shouldn't be required because of
> > the max() call, in fact INSTR_PER_NS isn't necessarily
> > needed at all, I just thought it would make it more accurate.
> >
> > I'll test the new version on the file that you sent before
> > pasting it here. If you have another recording that has more
> > of the edge cases then please also upload that one too.
> 
> Hi Tanmay,
> 
> I've sent v4 and I don't see any issues now with the files that you
> sent. There were a couple of different edge cases that I listed in the
> last commit message. Let me know if you find any more, otherwise if you
> leave your review or tested-by tag that would be great.
I have tested the latest v5 version on our platform with test
cases that run with and without filters for kernel/user space.
The timestamps look good on all of them.

Thanks for the effort !

Thanks and regards,
Tanmay
> 
> Thanks
> James
  
Arnaldo Carvalho de Melo Jan. 20, 2023, 5:47 p.m. UTC | #9
Em Fri, Jan 20, 2023 at 05:30:30PM +0000, Tanmay Jagdale escreveu:
> > On 13/01/2023 11:12, James Clark wrote:
> > > On 12/01/2023 15:33, Tanmay Jagdale wrote:
> > I've sent v4 and I don't see any issues now with the files that you
> > sent. There were a couple of different edge cases that I listed in the
> > last commit message. Let me know if you find any more, otherwise if you
> > leave your review or tested-by tag that would be great.
> I have tested the latest v5 version on our platform with test
> cases that run with and without filters for kernel/user space.
> The timestamps look good on all of them.
> 
> Thanks for the effort !

Can I take that as an:

Tested-by: Tanmay Jagdale <tanmay@marvell.com>

For all the patches in this v5 series?

- Arnaldo
  
Tanmay Jagdale Jan. 21, 2023, 5:13 a.m. UTC | #10
Hi Arnaldo,

> Em Fri, Jan 20, 2023 at 05:30:30PM +0000, Tanmay Jagdale escreveu:
> > > On 13/01/2023 11:12, James Clark wrote:
> > > > On 12/01/2023 15:33, Tanmay Jagdale wrote:
> > > I've sent v4 and I don't see any issues now with the files that you
> > > sent. There were a couple of different edge cases that I listed in the
> > > last commit message. Let me know if you find any more, otherwise if you
> > > leave your review or tested-by tag that would be great.
> > I have tested the latest v5 version on our platform with test
> > cases that run with and without filters for kernel/user space.
> > The timestamps look good on all of them.
> >
> > Thanks for the effort !
> 
> Can I take that as an:
> 
> Tested-by: Tanmay Jagdale <tanmay@marvell.com>
> 
> For all the patches in this v5 series?
Yes sure. I tested the patch series on our platform
and able to see the correct timestamps.

Thanks,
Tanmay
> 
> - Arnaldo