[RFC,V2,2/9] perf: Extend ABI to support post-processing monotonic raw conversion

Message ID: 20230213190754.1836051-3-kan.liang@linux.intel.com
State: New
Series: Convert TSC to monotonic raw clock for PEBS

Commit Message

Liang, Kan Feb. 13, 2023, 7:07 p.m. UTC
  From: Kan Liang <kan.liang@linux.intel.com>

The monotonic raw clock is not affected by NTP/PTP correction. The
calculation of the monotonic raw clock can be done in post-processing,
which can reduce the kernel overhead.

Add hw_time in the struct perf_event_attr to tell the kernel to dump
the raw HW time to user space. The perf tool will convert the raw HW
time in post-processing.
Currently, only the monotonic raw conversion is supported.
Only dump the raw HW time with PERF_RECORD_SAMPLE, because the accurate
HW time can only be provided in a sample by HW. For other types of
records, the user-requested clock is returned as usual. Nothing
is changed.

Add perf_event_mmap_page::cap_user_time_mono_raw ABI to dump the
conversion information. The cap_user_time_mono_raw also indicates
whether the monotonic raw conversion information is available.
If yes, the monotonic raw clock can be calculated as
mono_raw = base + (((cyc - last) * mult + nsec) >> shift)

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 include/uapi/linux/perf_event.h | 21 ++++++++++++++++++---
 kernel/events/core.c            |  7 +++++++
 2 files changed, 25 insertions(+), 3 deletions(-)
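
As an illustration of the conversion formula in the commit message, here is a minimal C sketch. It assumes the shift applies only to the scaled term and the unshifted base is added afterwards (in C, `>>` binds more loosely than `+`, so the parentheses matter). The function and parameter names are hypothetical and simply mirror the symbols in the formula; they are not the field names used by the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: parameter names follow the symbols in the
 * commit message formula, not actual fields from the patch.
 */
static uint64_t mono_raw_from_cycles(uint64_t cyc, uint64_t last,
                                     uint64_t base, uint64_t nsec,
                                     uint32_t mult, uint32_t shift)
{
    assert(shift < 64);

    /* Assumes (cyc - last) * mult + nsec fits in 64 bits. */
    uint64_t scaled = (cyc - last) * mult + nsec;

    /* Shift only the scaled term, then add the unshifted base. */
    return base + (scaled >> shift);
}
```

With mult = 2 and shift = 1 each cycle counts as one nanosecond, so 50 cycles past `last` add 50 ns to `base`.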
  

Comments

John Stultz Feb. 13, 2023, 7:37 p.m. UTC | #1
On Mon, Feb 13, 2023 at 11:08 AM <kan.liang@linux.intel.com> wrote:
>
> From: Kan Liang <kan.liang@linux.intel.com>
>
> The monotonic raw clock is not affected by NTP/PTP correction. The
> calculation of the monotonic raw clock can be done in the
> post-processing, which can reduce the kernel overhead.
>
> Add hw_time in the struct perf_event_attr to tell the kernel dump the
> raw HW time to user space. The perf tool will calculate the HW time
> in post-processing.
> Currently, only supports the monotonic raw conversion.
> Only dump the raw HW time with PERF_RECORD_SAMPLE, because the accurate
> HW time can only be provided in a sample by HW. For other type of
> records, the user requested clock should be returned as usual. Nothing
> is changed.
>
> Add perf_event_mmap_page::cap_user_time_mono_raw ABI to dump the
> conversion information. The cap_user_time_mono_raw also indicates
> whether the monotonic raw conversion information is available.
> If yes, the clock monotonic raw can be calculated as
> mono_raw = base + ((cyc - last) * mult + nsec) >> shift

Again, I appreciate you reworking and resending this series out, I
know it took some effort.

But oof, I'd really like to make sure we're not exporting timekeeping
internals to userland.

I think Thomas' suggestion of doing the timestamp conversion in
post-processing was more about interpolating collected system times
with the counter (tsc) values captured.

I get that the interpolation can be difficult, as the counter value and
system time can't currently be collected atomically, so potentially there
may be a need for a way to tie the two together (see my previous email's
thought of ktime_get_raw_monotonic_from_timestamp()), but we'd
probably want a clear understanding of the benefit (quantitative
reduction in interpolation error, and what real benefit that brings),
and would also want the driver to generate and share those pairs
rather than having userland have access.

thanks
-john
  
Liang, Kan Feb. 13, 2023, 9:40 p.m. UTC | #2
On 2023-02-13 2:37 p.m., John Stultz wrote:
> On Mon, Feb 13, 2023 at 11:08 AM <kan.liang@linux.intel.com> wrote:
>>
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> The monotonic raw clock is not affected by NTP/PTP correction. The
>> calculation of the monotonic raw clock can be done in the
>> post-processing, which can reduce the kernel overhead.
>>
>> Add hw_time in the struct perf_event_attr to tell the kernel dump the
>> raw HW time to user space. The perf tool will calculate the HW time
>> in post-processing.
>> Currently, only supports the monotonic raw conversion.
>> Only dump the raw HW time with PERF_RECORD_SAMPLE, because the accurate
>> HW time can only be provided in a sample by HW. For other type of
>> records, the user requested clock should be returned as usual. Nothing
>> is changed.
>>
>> Add perf_event_mmap_page::cap_user_time_mono_raw ABI to dump the
>> conversion information. The cap_user_time_mono_raw also indicates
>> whether the monotonic raw conversion information is available.
>> If yes, the clock monotonic raw can be calculated as
>> mono_raw = base + ((cyc - last) * mult + nsec) >> shift
> 
> Again, I appreciate you reworking and resending this series out, I
> know it took some effort.
> 
> But oof, I'd really like to make sure we're not exporting timekeeping
> internals to userland.
> 
> I think Thomas' suggestion of doing the timestamp conversion in
> post-processing was more about interpolating collected system times
> with the counter (tsc) values captured.
>

Thomas, could you please clarify your suggestion regarding "the relevant
conversion information" provided by the kernel?
https://lore.kernel.org/lkml/87ilgsgl5f.ffs@tglx/

Is it only the interpolation information or the entire conversion
information (Mult, shift etc.)?

If it's only the interpolation information, user space will lack the
information needed to handle all the cases. If I understand John's comments
correctly, it could also introduce some interpolation error which can only
be addressed by the mult/shift conversion.

If the suggestion is to dump the entire conversion information into the
user space, we have to expose the timekeeping internals.

Considering the above difficulties, could we use the kernel conversion?
(The current perf already uses the kernel conversion for monotonic raw.
It should not bring extra overhead.)

Thanks,
Kan

> I get the interpolation can be difficult as the counter value and
> system time can't currently atomically collected, so potentially there
> may be a need for a way to tie two together (see my previous email's
> thought of ktime_get_raw_monotonic_from_timestamp()), but we'd
> probably want a clear understanding of the benefit (quantitative
> reduction in interpolation error, and what real benefit that brings),
> and would also want the driver to generate and share those pairs
> rather than having userland have access.
> 
> thanks
> -john
  
John Stultz Feb. 13, 2023, 10:22 p.m. UTC | #3
On Mon, Feb 13, 2023 at 1:40 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
> On 2023-02-13 2:37 p.m., John Stultz wrote:
> > On Mon, Feb 13, 2023 at 11:08 AM <kan.liang@linux.intel.com> wrote:
> >>
> >> From: Kan Liang <kan.liang@linux.intel.com>
> >>
> >> The monotonic raw clock is not affected by NTP/PTP correction. The
> >> calculation of the monotonic raw clock can be done in the
> >> post-processing, which can reduce the kernel overhead.
> >>
> >> Add hw_time in the struct perf_event_attr to tell the kernel dump the
> >> raw HW time to user space. The perf tool will calculate the HW time
> >> in post-processing.
> >> Currently, only supports the monotonic raw conversion.
> >> Only dump the raw HW time with PERF_RECORD_SAMPLE, because the accurate
> >> HW time can only be provided in a sample by HW. For other type of
> >> records, the user requested clock should be returned as usual. Nothing
> >> is changed.
> >>
> >> Add perf_event_mmap_page::cap_user_time_mono_raw ABI to dump the
> >> conversion information. The cap_user_time_mono_raw also indicates
> >> whether the monotonic raw conversion information is available.
> >> If yes, the clock monotonic raw can be calculated as
> >> mono_raw = base + ((cyc - last) * mult + nsec) >> shift
> >
> > Again, I appreciate you reworking and resending this series out, I
> > know it took some effort.
> >
> > But oof, I'd really like to make sure we're not exporting timekeeping
> > internals to userland.
> >
> > I think Thomas' suggestion of doing the timestamp conversion in
> > post-processing was more about interpolating collected system times
> > with the counter (tsc) values captured.
> >
>
> Thomas, could you please clarify your suggestion regarding "the relevant
> conversion information" provided by the kernel?
> https://lore.kernel.org/lkml/87ilgsgl5f.ffs@tglx/
>
> Is it only the interpolation information or the entire conversion
> information (Mult, shift etc.)?
>
> If it's only the interpolation information, the user space will be lack
> of information to handle all the cases. If I understand John's comments
> correctly, it could also bring some interpolation error which can only
> be addressed by the mult/shift conversion.

"Only" is maybe too strong a word. I think having the driver use
kernel timekeeping accessors to pair CLOCK_MONOTONIC_RAW time with
counter values will minimize the error.

But again, it's not yet established that any interpolation error using
existing interfaces is great enough to be problematic here.

The interpolation is pretty easy to do:

do {
    start = readtsc();
    clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
    end = readtsc();
    delta = end - start;
} while (delta > THRESHOLD);   // make sure the reads were not preempted
mid = start + (delta + 1) / 2; // round-closest midpoint of the window

and that should get you a fairly close matching of TSC to
CLOCK_MONOTONIC_RAW value.

Once you have that mapping you can take a few samples and establish
the linear function.

But that will have some error, so quantifying that error helps
establish why being able to get an atomic mapping of TSC ->
CLOCK_MONOTONIC_RAW would help.

So I really don't think we need to expose the kernel internal values
to userland, but I'm willing to guess the atomic mapping (which the
driver will have access to, not userland) may be helpful for the fine
granularity you want in the trace.

thanks
-john
  
Peter Zijlstra Feb. 14, 2023, 10:43 a.m. UTC | #4
On Mon, Feb 13, 2023 at 02:22:39PM -0800, John Stultz wrote:
> The interpoloation is pretty easy to do:
> 
> do {
>     start= readtsc();
>     clock_gett(CLOCK_MONOTONIC_RAW, &ts);
>     end = readtsc();
>     delta = end-start;
> } while (delta  > THRESHOLD)   // make sure the reads were not preempted
> mid = start + (delta +(delta/2))/2; //round-closest
> 
> and be able to get you a fairly close matching of TSC to
> CLOCK_MONOTONIC_RAW value.
> 
> Once you have that mapping you can take a few samples and establish
> the linear function.

Right, this is how we do the TSC calibration in the first place, and if
NTP can achieve high correctness over a network, then surely we can do
better locally.

That is, this scheme should work for all CLOCKs, not only MONOTONIC_RAW.
  
Liang, Kan Feb. 14, 2023, 2:51 p.m. UTC | #5
On 2023-02-13 5:22 p.m., John Stultz wrote:
> On Mon, Feb 13, 2023 at 1:40 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
>> On 2023-02-13 2:37 p.m., John Stultz wrote:
>>> On Mon, Feb 13, 2023 at 11:08 AM <kan.liang@linux.intel.com> wrote:
>>>>
>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>
>>>> The monotonic raw clock is not affected by NTP/PTP correction. The
>>>> calculation of the monotonic raw clock can be done in the
>>>> post-processing, which can reduce the kernel overhead.
>>>>
>>>> Add hw_time in the struct perf_event_attr to tell the kernel dump the
>>>> raw HW time to user space. The perf tool will calculate the HW time
>>>> in post-processing.
>>>> Currently, only supports the monotonic raw conversion.
>>>> Only dump the raw HW time with PERF_RECORD_SAMPLE, because the accurate
>>>> HW time can only be provided in a sample by HW. For other type of
>>>> records, the user requested clock should be returned as usual. Nothing
>>>> is changed.
>>>>
>>>> Add perf_event_mmap_page::cap_user_time_mono_raw ABI to dump the
>>>> conversion information. The cap_user_time_mono_raw also indicates
>>>> whether the monotonic raw conversion information is available.
>>>> If yes, the clock monotonic raw can be calculated as
>>>> mono_raw = base + ((cyc - last) * mult + nsec) >> shift
>>>
>>> Again, I appreciate you reworking and resending this series out, I
>>> know it took some effort.
>>>
>>> But oof, I'd really like to make sure we're not exporting timekeeping
>>> internals to userland.
>>>
>>> I think Thomas' suggestion of doing the timestamp conversion in
>>> post-processing was more about interpolating collected system times
>>> with the counter (tsc) values captured.
>>>
>>
>> Thomas, could you please clarify your suggestion regarding "the relevant
>> conversion information" provided by the kernel?
>> https://lore.kernel.org/lkml/87ilgsgl5f.ffs@tglx/
>>
>> Is it only the interpolation information or the entire conversion
>> information (Mult, shift etc.)?
>>
>> If it's only the interpolation information, the user space will be lack
>> of information to handle all the cases. If I understand John's comments
>> correctly, it could also bring some interpolation error which can only
>> be addressed by the mult/shift conversion.
> 


Thanks for the details John.

> "Only" is maybe too strong a word. I think having the driver use
> kernel timekeeping accessors to CLOCK_MONONOTONIC_RAW time with
> counter values will minimize the error.
>

The key motivation for using the TSC in the PEBS record is to get an
accurate timestamp for each record. We definitely want the conversion to
have minimal error.


> But again, it's not yet established that any interpolation error using
> existing interfaces is great enough to be problematic here.
> 
> The interpoloation is pretty easy to do:
> 
> do {
>     start= readtsc();
>     clock_gett(CLOCK_MONOTONIC_RAW, &ts);
>     end = readtsc();
>     delta = end-start;
> } while (delta  > THRESHOLD)   // make sure the reads were not preempted
> mid = start + (delta +(delta/2))/2; //round-closest
>

How should THRESHOLD be chosen? It seems the THRESHOLD value also impacts
the accuracy.


> and be able to get you a fairly close matching of TSC to
> CLOCK_MONOTONIC_RAW value.
> 
> Once you have that mapping you can take a few samples and establish
> the linear function.
> 
> But that will have some error, so quantifying that error helps
> establish why being able to get an atomic mapping of TSC ->
> CLOCK_MONOTONIC_RAW would help.
> 
> So I really don't think we need to expose the kernel internal values
> to userland, but I'm willing to guess the atomic mapping (which the
> driver will have access to, not userland) may be helpful for the fine
> granularity you want in the trace.
> 

If I understand correctly, the idea is to let the user space tool run
the above interpolation algorithm several times to 'guess' the atomic
mapping, and then use that mapping information to convert the TSC from
the PEBS record. Is my understanding correct?

If so, to be honest, I doubt we can get the accuracy we want.

Thanks,
Kan
  
Liang, Kan Feb. 14, 2023, 5 p.m. UTC | #6
On 2023-02-14 9:51 a.m., Liang, Kan wrote:
> 
> 
> On 2023-02-13 5:22 p.m., John Stultz wrote:
>> On Mon, Feb 13, 2023 at 1:40 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>> On 2023-02-13 2:37 p.m., John Stultz wrote:
>>>> On Mon, Feb 13, 2023 at 11:08 AM <kan.liang@linux.intel.com> wrote:
>>>>>
>>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>>
>>>>> The monotonic raw clock is not affected by NTP/PTP correction. The
>>>>> calculation of the monotonic raw clock can be done in the
>>>>> post-processing, which can reduce the kernel overhead.
>>>>>
>>>>> Add hw_time in the struct perf_event_attr to tell the kernel dump the
>>>>> raw HW time to user space. The perf tool will calculate the HW time
>>>>> in post-processing.
>>>>> Currently, only supports the monotonic raw conversion.
>>>>> Only dump the raw HW time with PERF_RECORD_SAMPLE, because the accurate
>>>>> HW time can only be provided in a sample by HW. For other type of
>>>>> records, the user requested clock should be returned as usual. Nothing
>>>>> is changed.
>>>>>
>>>>> Add perf_event_mmap_page::cap_user_time_mono_raw ABI to dump the
>>>>> conversion information. The cap_user_time_mono_raw also indicates
>>>>> whether the monotonic raw conversion information is available.
>>>>> If yes, the clock monotonic raw can be calculated as
>>>>> mono_raw = base + ((cyc - last) * mult + nsec) >> shift
>>>>
>>>> Again, I appreciate you reworking and resending this series out, I
>>>> know it took some effort.
>>>>
>>>> But oof, I'd really like to make sure we're not exporting timekeeping
>>>> internals to userland.
>>>>
>>>> I think Thomas' suggestion of doing the timestamp conversion in
>>>> post-processing was more about interpolating collected system times
>>>> with the counter (tsc) values captured.
>>>>
>>>
>>> Thomas, could you please clarify your suggestion regarding "the relevant
>>> conversion information" provided by the kernel?
>>> https://lore.kernel.org/lkml/87ilgsgl5f.ffs@tglx/
>>>
>>> Is it only the interpolation information or the entire conversion
>>> information (Mult, shift etc.)?
>>>
>>> If it's only the interpolation information, the user space will be lack
>>> of information to handle all the cases. If I understand John's comments
>>> correctly, it could also bring some interpolation error which can only
>>> be addressed by the mult/shift conversion.
>>
> 
> 
> Thanks for the details John.
> 
>> "Only" is maybe too strong a word. I think having the driver use
>> kernel timekeeping accessors to CLOCK_MONONOTONIC_RAW time with
>> counter values will minimize the error.
>>
> 
> The key motivation of using the TSC in the PEBS record is to get an
> accurate timestamp of each record. We definitely want the conversion has
> minimized error.
> 
> 
>> But again, it's not yet established that any interpolation error using
>> existing interfaces is great enough to be problematic here.
>>
>> The interpoloation is pretty easy to do:
>>
>> do {
>>     start= readtsc();
>>     clock_gett(CLOCK_MONOTONIC_RAW, &ts);
>>     end = readtsc();
>>     delta = end-start;
>> } while (delta  > THRESHOLD)   // make sure the reads were not preempted
>> mid = start + (delta +(delta/2))/2; //round-closest
>>
> 
> How to choose the THRESHOLD? It seems the THRESHOLD value also impacts
> the accuracy.
> 
> 
>> and be able to get you a fairly close matching of TSC to
>> CLOCK_MONOTONIC_RAW value.
>>
>> Once you have that mapping you can take a few samples and establish
>> the linear function.
>>
>> But that will have some error, so quantifying that error helps
>> establish why being able to get an atomic mapping of TSC ->
>> CLOCK_MONOTONIC_RAW would help.
>>
>> So I really don't think we need to expose the kernel internal values
>> to userland, but I'm willing to guess the atomic mapping (which the
>> driver will have access to, not userland) may be helpful for the fine
>> granularity you want in the trace.
>>
> 
> If I understand correctly, the idea is to let the user space tool run
> the above interpoloation algorithm several times to 'guess' the atomic
> mapping. Using the mapping information to covert the TSC from the PEBS
> record. Is my understanding correct?
> 
> If so, to be honest, I doubt we can get the accuracy we want.
> 

I implemented a simple test to evaluate the error.

I collected the TSC -> CLOCK_MONOTONIC_RAW mapping using the above
algorithm at the start and end of the perf command.
        MONO_RAW        TSC
start   89553516545645  223619715214239
end     89562251233830  223641517000376

Here is what I get via the mult/shift conversion from this patch.
        MONO_RAW        TSC
PEBS    89555942691466  223625770878571

Then I used the time information from start and end to create a linear
function and 'guess' the MONO_RAW of PEBS from the TSC. I get
89555942692721.
That is a 1255 ns difference.
I tried several different PEBS records. The error is ~1000 ns.
I think it should be an observable error.
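
The two-point 'guess' described above can be sketched in C as follows. This is only an illustration of the linear interpolation, not the actual test code (which is not shown in this thread); the struct and function names are hypothetical. It relies on `unsigned __int128`, a GCC/Clang extension, because the intermediate product of the TSC delta and the nanosecond delta does not fit in 64 bits at these magnitudes.

```c
#include <assert.h>
#include <stdint.h>

/* One (TSC, CLOCK_MONOTONIC_RAW ns) pair; hypothetical type for illustration. */
struct tsc_ns_pair {
    uint64_t tsc;
    uint64_t ns;
};

/* Linearly interpolate the ns value for tsc between pairs a and b. */
static uint64_t interpolate_ns(struct tsc_ns_pair a, struct tsc_ns_pair b,
                               uint64_t tsc)
{
    assert(b.tsc > a.tsc && b.ns >= a.ns && tsc >= a.tsc);

    /* 128-bit intermediate: (tsc delta) * (ns delta) overflows 64 bits. */
    unsigned __int128 num =
        (unsigned __int128)(tsc - a.tsc) * (b.ns - a.ns);

    /* Integer division truncates the sub-nanosecond remainder. */
    return a.ns + (uint64_t)(num / (b.tsc - a.tsc));
}
```

Fed the start and end pairs from the table and the PEBS TSC value 223625770878571, this reproduces the interpolated 89555942692721, i.e. 1255 ns above the mult/shift result.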

Thanks,
Kan
  
Liang, Kan Feb. 14, 2023, 5:46 p.m. UTC | #7
On 2023-02-14 5:43 a.m., Peter Zijlstra wrote:
> On Mon, Feb 13, 2023 at 02:22:39PM -0800, John Stultz wrote:
>> The interpoloation is pretty easy to do:
>>
>> do {
>>     start= readtsc();
>>     clock_gett(CLOCK_MONOTONIC_RAW, &ts);
>>     end = readtsc();
>>     delta = end-start;
>> } while (delta  > THRESHOLD)   // make sure the reads were not preempted
>> mid = start + (delta +(delta/2))/2; //round-closest
>>
>> and be able to get you a fairly close matching of TSC to
>> CLOCK_MONOTONIC_RAW value.
>>
>> Once you have that mapping you can take a few samples and establish
>> the linear function.
> 
> Right, this is how we do the TSC calibration in the first place, and if
> NTP can achieve high correctness over a network, then surely we can do
> better locally.
> 
> That is, this scheme should work for all CLOCKs, not only MONOTONIC_RAW.

If I understand correctly, the TSC calibration is done in the kernel.
The kernel keeps updating the mul/shift. We dump the mul/shift into the
perf mmap page for the user tools.

But for the CLOCKs, the mul/shift are kernel-internal values which we
don't want to expose to user space.

If we only apply the scheme in user space, it brings some observable
errors based on my test mentioned in the other thread.

Thanks,
Kan
  
John Stultz Feb. 14, 2023, 7:34 p.m. UTC | #8
On Tue, Feb 14, 2023 at 7:56 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Feb 13, 2023 at 02:22:39PM -0800, John Stultz wrote:
> > The interpoloation is pretty easy to do:
> >
> > do {
> >     start= readtsc();
> >     clock_gett(CLOCK_MONOTONIC_RAW, &ts);
> >     end = readtsc();
> >     delta = end-start;
> > } while (delta  > THRESHOLD)   // make sure the reads were not preempted
> > mid = start + (delta +(delta/2))/2; //round-closest
> >
> > and be able to get you a fairly close matching of TSC to
> > CLOCK_MONOTONIC_RAW value.
> >
> > Once you have that mapping you can take a few samples and establish
> > the linear function.
>
> Right, this is how we do the TSC calibration in the first place, and if
> NTP can achieve high correctness over a network, then surely we can do
> better locally.
>
> That is, this scheme should work for all CLOCKs, not only MONOTONIC_RAW.

Well, CLOCK_MONOTONIC_RAW is at least a fixed function; we don't
change its frequency. Other clocks, by contrast, will likely be adjusted
over their lifetime, so their frequency has to be continually
re-derived, which makes them less than ideal for this sort of interpolation.

thanks
-john
  
John Stultz Feb. 14, 2023, 7:37 p.m. UTC | #9
On Tue, Feb 14, 2023 at 9:46 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> On 2023-02-14 5:43 a.m., Peter Zijlstra wrote:
> > On Mon, Feb 13, 2023 at 02:22:39PM -0800, John Stultz wrote:
> >> The interpoloation is pretty easy to do:
> >>
> >> do {
> >>     start= readtsc();
> >>     clock_gett(CLOCK_MONOTONIC_RAW, &ts);
> >>     end = readtsc();
> >>     delta = end-start;
> >> } while (delta  > THRESHOLD)   // make sure the reads were not preempted
> >> mid = start + (delta +(delta/2))/2; //round-closest
> >>
> >> and be able to get you a fairly close matching of TSC to
> >> CLOCK_MONOTONIC_RAW value.
> >>
> >> Once you have that mapping you can take a few samples and establish
> >> the linear function.
> >
> > Right, this is how we do the TSC calibration in the first place, and if
> > NTP can achieve high correctness over a network, then surely we can do
> > better locally.
> >
> > That is, this scheme should work for all CLOCKs, not only MONOTONIC_RAW.
>
> If I understand correctly, the TSC calibration is done in the kernel.
> The kernel keeps updating the mul/shift. We dump the mul/shift into the
> perf mmap page for the user tools.

Where is that done in the perf mmap? I wasn't aware.

thanks
-john
  
John Stultz Feb. 14, 2023, 7:52 p.m. UTC | #10
On Tue, Feb 14, 2023 at 6:51 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> On 2023-02-13 5:22 p.m., John Stultz wrote:
> > On Mon, Feb 13, 2023 at 1:40 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >> On 2023-02-13 2:37 p.m., John Stultz wrote:
> >>> On Mon, Feb 13, 2023 at 11:08 AM <kan.liang@linux.intel.com> wrote:
> >>>>
> >>>> From: Kan Liang <kan.liang@linux.intel.com>
> >>>>
> >>>> The monotonic raw clock is not affected by NTP/PTP correction. The
> >>>> calculation of the monotonic raw clock can be done in the
> >>>> post-processing, which can reduce the kernel overhead.
> >>>>
> >>>> Add hw_time in the struct perf_event_attr to tell the kernel dump the
> >>>> raw HW time to user space. The perf tool will calculate the HW time
> >>>> in post-processing.
> >>>> Currently, only supports the monotonic raw conversion.
> >>>> Only dump the raw HW time with PERF_RECORD_SAMPLE, because the accurate
> >>>> HW time can only be provided in a sample by HW. For other type of
> >>>> records, the user requested clock should be returned as usual. Nothing
> >>>> is changed.
> >>>>
> >>>> Add perf_event_mmap_page::cap_user_time_mono_raw ABI to dump the
> >>>> conversion information. The cap_user_time_mono_raw also indicates
> >>>> whether the monotonic raw conversion information is available.
> >>>> If yes, the clock monotonic raw can be calculated as
> >>>> mono_raw = base + ((cyc - last) * mult + nsec) >> shift
> >>>
> >>> Again, I appreciate you reworking and resending this series out, I
> >>> know it took some effort.
> >>>
> >>> But oof, I'd really like to make sure we're not exporting timekeeping
> >>> internals to userland.
> >>>
> >>> I think Thomas' suggestion of doing the timestamp conversion in
> >>> post-processing was more about interpolating collected system times
> >>> with the counter (tsc) values captured.
> >>>
> >>
> >> Thomas, could you please clarify your suggestion regarding "the relevant
> >> conversion information" provided by the kernel?
> >> https://lore.kernel.org/lkml/87ilgsgl5f.ffs@tglx/
> >>
> >> Is it only the interpolation information or the entire conversion
> >> information (Mult, shift etc.)?
> >>
> >> If it's only the interpolation information, the user space will be lack
> >> of information to handle all the cases. If I understand John's comments
> >> correctly, it could also bring some interpolation error which can only
> >> be addressed by the mult/shift conversion.
> >
>
>
> Thanks for the details John.
>
> > "Only" is maybe too strong a word. I think having the driver use
> > kernel timekeeping accessors to CLOCK_MONONOTONIC_RAW time with
> > counter values will minimize the error.
> >
>
> The key motivation of using the TSC in the PEBS record is to get an
> accurate timestamp of each record. We definitely want the conversion has
> minimized error.

Yep.

> > But again, it's not yet established that any interpolation error using
> > existing interfaces is great enough to be problematic here.
> >
> > The interpoloation is pretty easy to do:
> >
> > do {
> >     start= readtsc();
> >     clock_gett(CLOCK_MONOTONIC_RAW, &ts);
> >     end = readtsc();
> >     delta = end-start;
> > } while (delta  > THRESHOLD)   // make sure the reads were not preempted
> > mid = start + (delta +(delta/2))/2; //round-closest
> >
>
> How to choose the THRESHOLD? It seems the THRESHOLD value also impacts
> the accuracy.

Maybe by running a number of these reads and collecting the deltas,
then setting THRESHOLD to a standard deviation of the results?
(I'm sure there are sounder methods, but I'd have to do some digging
to find them)

Alternatively you could always take 10 samples and then only do the
mapping with the smallest delta value.
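
The "take several samples and keep the smallest delta" alternative could look roughly like the C sketch below. It assumes the caller has already collected the reads (a counter read, then clock_gettime(CLOCK_MONOTONIC_RAW), then another counter read); the struct and helper names are hypothetical, and only the selection step is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One bracketed read; hypothetical layout for illustration. */
struct paired_read {
    uint64_t tsc_start; /* counter read before clock_gettime() */
    uint64_t tsc_end;   /* counter read after clock_gettime() */
    uint64_t ns;        /* CLOCK_MONOTONIC_RAW value, in ns */
};

/* Return the index of the read with the narrowest counter window. */
static size_t tightest_read(const struct paired_read *r, size_t n)
{
    assert(n > 0);

    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (r[i].tsc_end - r[i].tsc_start <
            r[best].tsc_end - r[best].tsc_start)
            best = i;
    }
    return best;
}

/* Round-closest midpoint of the window, paired with r->ns. */
static uint64_t window_midpoint(const struct paired_read *r)
{
    uint64_t delta = r->tsc_end - r->tsc_start;

    return r->tsc_start + (delta + 1) / 2;
}
```

Picking the narrowest window avoids having to choose a THRESHOLD up front, at the cost of always taking a fixed number of reads.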


> > and be able to get you a fairly close matching of TSC to
> > CLOCK_MONOTONIC_RAW value.
> >
> > Once you have that mapping you can take a few samples and establish
> > the linear function.
> >
> > But that will have some error, so quantifying that error helps
> > establish why being able to get an atomic mapping of TSC ->
> > CLOCK_MONOTONIC_RAW would help.
> >
> > So I really don't think we need to expose the kernel internal values
> > to userland, but I'm willing to guess the atomic mapping (which the
> > driver will have access to, not userland) may be helpful for the fine
> > granularity you want in the trace.
> >
>
> If I understand correctly, the idea is to let the user space tool run
> the above interpoloation algorithm several times to 'guess' the atomic
> mapping. Using the mapping information to covert the TSC from the PEBS
> record. Is my understanding correct?

So I think that's what Thomas was suggesting.

The next step would probably be to provide a way for the driver to
provide atomic TSC->CLOCK_MONOTONIC_RAW samples, so userland can
calculate the function itself.

So then the problem becomes if X1 and Y1 are exactly mapped, and X2
and Y2 are exactly mapped, then given X3, find Y3.

And if that doesn't work, then we would have to see about having the
driver do all the conversions.

> If so, to be honest, I doubt we can get the accuracy we want.

Sure. I just want to make sure it's quantified that the pure userland
interpolation approach won't work before we go adding extra
in-kernel logic.

(We'd obviously rather do the logic that can be done in userland in userland)

thanks
-john
  
Liang, Kan Feb. 14, 2023, 8:09 p.m. UTC | #11
On 2023-02-14 2:37 p.m., John Stultz wrote:
> On Tue, Feb 14, 2023 at 9:46 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>> On 2023-02-14 5:43 a.m., Peter Zijlstra wrote:
>>> On Mon, Feb 13, 2023 at 02:22:39PM -0800, John Stultz wrote:
>>>> The interpoloation is pretty easy to do:
>>>>
>>>> do {
>>>>     start= readtsc();
>>>>     clock_gett(CLOCK_MONOTONIC_RAW, &ts);
>>>>     end = readtsc();
>>>>     delta = end-start;
>>>> } while (delta  > THRESHOLD)   // make sure the reads were not preempted
>>>> mid = start + (delta +(delta/2))/2; //round-closest
>>>>
>>>> and be able to get you a fairly close matching of TSC to
>>>> CLOCK_MONOTONIC_RAW value.
>>>>
>>>> Once you have that mapping you can take a few samples and establish
>>>> the linear function.
>>>
>>> Right, this is how we do the TSC calibration in the first place, and if
>>> NTP can achieve high correctness over a network, then surely we can do
>>> better locally.
>>>
>>> That is, this scheme should work for all CLOCKs, not only MONOTONIC_RAW.
>>
>> If I understand correctly, the TSC calibration is done in the kernel.
>> The kernel keeps updating the mul/shift. We dump the mul/shift into the
>> perf mmap page for the user tools.
> 
> Where is that done in the perf mmap? I wasn't aware.

The updating of the mul/shift for sched_clock should be done in
set_cyc2ns_scale() in tsc.c.

The perf user space tool mmaps a page to retrieve the enabled
time/running time from the kernel. On x86 and Arm, the conversion
information from HW time (TSC) to sched_clock/perf_time is also stored
in the page. Please see arch_perf_update_userpage(). In the perf
mmap, it only retrieves the current mul/shift information and writes it
into the page for the user space tool.
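
For reference, the existing user-space conversion from a raw counter value to perf time can be sketched as below. It follows the arithmetic documented in the comments of include/uapi/linux/perf_event.h for the time_zero, time_mult and time_shift fields of the mmap page; the function name is hypothetical, and a real reader would also have to read those fields under the page's `lock` sequence counter, which is omitted here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: converts a raw counter value to perf time using
 * values copied out of perf_event_mmap_page (time_zero, time_mult,
 * time_shift). The seqcount retry loop around the copy is not shown.
 */
static uint64_t tsc_to_perf_time(uint64_t cyc, uint64_t time_zero,
                                 uint32_t time_mult, uint16_t time_shift)
{
    assert(time_shift < 64);

    /* Split cyc so that the multiplications cannot overflow 64 bits. */
    uint64_t quot = cyc >> time_shift;
    uint64_t rem = cyc & (((uint64_t)1 << time_shift) - 1);

    return time_zero + quot * time_mult +
           ((rem * time_mult) >> time_shift);
}
```

The quotient/remainder split gives the same result as (cyc * time_mult) >> time_shift without needing a 128-bit intermediate.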

This V2 patch series tries to do the same thing for the monotonic raw
conversion. So the kernel-internal mul/shift information has to be exposed.


Thanks,
Kan
  
John Stultz Feb. 14, 2023, 8:11 p.m. UTC | #12
On Tue, Feb 14, 2023 at 9:00 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> On 2023-02-14 9:51 a.m., Liang, Kan wrote:
> > If I understand correctly, the idea is to let the user space tool run
> > the above interpoloation algorithm several times to 'guess' the atomic
> > mapping. Using the mapping information to covert the TSC from the PEBS
> > record. Is my understanding correct?
> >
> > If so, to be honest, I doubt we can get the accuracy we want.
> >
>
> I implemented a simple test to evaluate the error.

Very cool!

> I collected TSC -> CLOCK_MONOTONIC_RAW mapping using the above algorithm
> at the start and end of perf cmd.
>         MONO_RAW        TSC
> start   89553516545645  223619715214239
> end     89562251233830  223641517000376
>
> Here is what I get via mult/shift conversion from this patch.
>         MONO_RAW        TSC
> PEBS    89555942691466  223625770878571
>
> Then I use the time information from start and end to create a linear
> function and 'guess' the MONO_RAW of PEBS from the TSC. I get
> 89555942692721.
> There is a 1255 ns difference.
> I tried several different PEBS records. The error is ~1000ns.
> I think it should be an observable error.

Interesting. That's a good bit higher than I'd expect, as I'd expect a
clock_gettime() call to take double-digit nanoseconds on average, so
the error should be within that.

Can you share your logic?

thanks
-john
  
John Stultz Feb. 14, 2023, 8:21 p.m. UTC | #13
On Tue, Feb 14, 2023 at 12:09 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
> On 2023-02-14 2:37 p.m., John Stultz wrote:
> > On Tue, Feb 14, 2023 at 9:46 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >> If I understand correctly, the TSC calibration is done in the kernel.
> >> The kernel keeps updating the mul/shift. We dump the mul/shift into the
> >> perf mmap page for the user tools.
> >
> > Where is that done in the perf mmap? I wasn't aware.
>
> The updating of the mul/shift for sched_clock should be done in the
> set_cyc2ns_scale() in tsc.c

Thanks for the pointer!

> The perf user space tool mmap a page to retrieve the enabling
> time/running time from the kernel. On X86 and Arm, the conversion
> information from HW time (TSC) to sched_clock/perf_time is also stored
> in the page. Please see the arch_perf_update_userpage(). In the perf
> mmap, it only retrieve the current mul/shift information and write them
> into the page for the user space tool.
>
> This V2 patch series try to do the same thing for the monotonic raw
> conversion. So the kernel internal mul/shift information has to be exposed.

Ugh. Well, I think perf may have made a bad API choice here, so I'm
still going to push back on exposing timekeeping internals to
userland.

But I do suspect that with ways to provide paired TSC/CLOCK_MONOTONIC
values, you should be able to get the same functionality in userland
as if the underlying data was shared.

thanks
-john
  
Liang, Kan Feb. 14, 2023, 8:38 p.m. UTC | #14
On 2023-02-14 3:11 p.m., John Stultz wrote:
> On Tue, Feb 14, 2023 at 9:00 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>> On 2023-02-14 9:51 a.m., Liang, Kan wrote:
>>> If I understand correctly, the idea is to let the user space tool run
>>> the above interpoloation algorithm several times to 'guess' the atomic
>>> mapping. Using the mapping information to covert the TSC from the PEBS
>>> record. Is my understanding correct?
>>>
>>> If so, to be honest, I doubt we can get the accuracy we want.
>>>
>>
>> I implemented a simple test to evaluate the error.
> 
> Very cool!
> 
>> I collected TSC -> CLOCK_MONOTONIC_RAW mapping using the above algorithm
>> at the start and end of perf cmd.
>>         MONO_RAW        TSC
>> start   89553516545645  223619715214239
>> end     89562251233830  223641517000376
>>
>> Here is what I get via mult/shift conversion from this patch.
>>         MONO_RAW        TSC
>> PEBS    89555942691466  223625770878571
>>
>> Then I use the time information from start and end to create a linear
>> function and 'guess' the MONO_RAW of PEBS from the TSC. I get
>> 89555942692721.
>> There is a 1255 ns difference.
>> I tried several different PEBS records. The error is ~1000ns.
>> I think it should be an observable error.
> 
> Interesting. That's a good bit higher than I'd expect as I'd expect a
> clock_gettime() call to take ~ double digit nanoseconds range on
> average, so the error should be within that.
> 
> Can you share your logic?
> 

I run the algorithm right before and after the perf command as below.
(The source code of time is attached.)

$./time
$perf record -e cycles:upp --clockid monotonic_raw $some_workaround
$./time

The time program dumps both MONO_RAW and TSC. That's where "start" and
"end" come from.
The perf command prints out both TSC and converted MONO_RAW (using the
mul/shift from this patch series). That's where the "PEBS" value comes from.

Then I use the below formula to calculate the guessed MONO_RAW of PEBS TSC.
Guessed_MONO_RAW = (PEBS_TSC - start_TSC) / (end_TSC - start_TSC) *
(end_MONO_RAW - start_MONO_RAW) + start_MONO_RAW.

The guessed_MONO_RAW is 89555942692721.
The PEBS_MONO_RAW is 89555942691466.
The difference is 1255.
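
The same interpolation can be written in integer arithmetic. The sketch below
is not the code used for the numbers above; the helper name interp_mono_raw()
is hypothetical, and it assumes a compiler that supports unsigned __int128
(GCC or Clang on x86-64). It multiplies before dividing so that the
intermediate product keeps full precision.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: linearly interpolate MONO_RAW (ns) for a TSC
 * value from two (TSC, MONO_RAW) pairs taken before and after the run.
 */
static uint64_t interp_mono_raw(uint64_t tsc,
				uint64_t tsc0, uint64_t raw0,
				uint64_t tsc1, uint64_t raw1)
{
	unsigned __int128 scaled;

	assert(tsc1 > tsc0 && raw1 >= raw0 && tsc >= tsc0);

	/* Multiply first; the product needs more than 64 bits. */
	scaled = (unsigned __int128)(tsc - tsc0) * (raw1 - raw0);

	return raw0 + (uint64_t)(scaled / (tsc1 - tsc0));
}
```

Fed with the start, end and PEBS values listed earlier, it returns
89555942692721, which is 1255 ns above the PEBS_MONO_RAW of 89555942691466.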

Is the calculation correct?

Thanks,
Kan
#include <sys/time.h>
#include <time.h>
#include <stdio.h>
#include <errno.h>

/* Read the x86 time-stamp counter (RDTSC returns it in EDX:EAX). */
static inline unsigned long long rdtsc(void)
{
  unsigned int hi, lo;

  asm volatile ("rdtsc" : "=a" (lo), "=d" (hi));

  return ((unsigned long long) hi << 32) | lo;
}

typedef unsigned long long u64;

int main(void)
{
	struct timespec ts;
	u64 start, end, delta, mid;

	do {
		start = rdtsc();
		clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
		end = rdtsc();
		delta = end - start;
	} while (delta > 20000);   // make sure the reads were not preempted
	mid = start + (delta +(delta/2))/2; //round-closest
	printf("%llu %llu %llu\n", start, end, delta);
	printf("MONO_RAW: %llu TSC: %llu\n", (u64)ts.tv_sec * 1000000000 + ts.tv_nsec, mid);
	return 0;
}
  
John Stultz Feb. 17, 2023, 11:11 p.m. UTC | #15
On Tue, Feb 14, 2023 at 12:38 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
> On 2023-02-14 3:11 p.m., John Stultz wrote:
> > On Tue, Feb 14, 2023 at 9:00 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >> On 2023-02-14 9:51 a.m., Liang, Kan wrote:
> >>> If I understand correctly, the idea is to let the user space tool run
> >>> the above interpoloation algorithm several times to 'guess' the atomic
> >>> mapping. Using the mapping information to covert the TSC from the PEBS
> >>> record. Is my understanding correct?
> >>>
> >>> If so, to be honest, I doubt we can get the accuracy we want.
> >>>
> >>
> >> I implemented a simple test to evaluate the error.
> >
> > Very cool!
> >
> >> I collected TSC -> CLOCK_MONOTONIC_RAW mapping using the above algorithm
> >> at the start and end of perf cmd.
> >>         MONO_RAW        TSC
> >> start   89553516545645  223619715214239
> >> end     89562251233830  223641517000376
> >>
> >> Here is what I get via mult/shift conversion from this patch.
> >>         MONO_RAW        TSC
> >> PEBS    89555942691466  223625770878571
> >>
> >> Then I use the time information from start and end to create a linear
> >> function and 'guess' the MONO_RAW of PEBS from the TSC. I get
> >> 89555942692721.
> >> There is a 1255 ns difference.
> >> I tried several different PEBS records. The error is ~1000ns.
> >> I think it should be an observable error.
> >
> > Interesting. That's a good bit higher than I'd expect as I'd expect a
> > clock_gettime() call to take ~ double digit nanoseconds range on
> > average, so the error should be within that.
> >
> > Can you share your logic?
> >
>
> I run the algorithm right before and after the perf command as below.
> (The source code of time is attached.)
>
> $./time
> $perf record -e cycles:upp --clockid monotonic_raw $some_workaround
> $./time
>
> The time will dump both MONO_RAW and TSC. That's where "start" and "end"
> from.
> The perf command print out both TSC and converted MONO_RAW (using the
> mul/shift from this patch series). That's where "PEBS" value from.
>
> Than I use the below formula to calculate the guessed MONO_RAW of PEBS TSC.
> Guessed_MONO_RAW = (PEBS_TSC - start_TSC) / (end_TSC - start_TSC) *
> (end_MONO_RAW - start_MONO_RAW) + start_MONO_RAW.
>
> The guessed_MONO_RAW is 89555942692721.
> The PEBS_MONO_RAW is 89555942691466.
> The difference is 1255.
>
> Is the calculation correct?

Thanks for sharing it. The equation you have there looks ok at a high
level for the values you captured (there's small tweaks like doing the
mult before the div to make sure you don't hit integer precision
issues, but I didn't see that with your results).

I've got a todo to try to see how the calculation changes if we do
provide atomic TSC/RAW stamps here, but I got a little busy with other
work and haven't gotten to it.
So my apologies, but I'll try to get back to this soon.

thanks
-john
  
Liang, Kan March 8, 2023, 6:44 p.m. UTC | #16
Hi John,

On 2023-02-17 6:11 p.m., John Stultz wrote:
> On Tue, Feb 14, 2023 at 12:38 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
>> On 2023-02-14 3:11 p.m., John Stultz wrote:
>>> On Tue, Feb 14, 2023 at 9:00 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>>> On 2023-02-14 9:51 a.m., Liang, Kan wrote:
>>>>> If I understand correctly, the idea is to let the user space tool run
>>>>> the above interpoloation algorithm several times to 'guess' the atomic
>>>>> mapping. Using the mapping information to covert the TSC from the PEBS
>>>>> record. Is my understanding correct?
>>>>>
>>>>> If so, to be honest, I doubt we can get the accuracy we want.
>>>>>
>>>>
>>>> I implemented a simple test to evaluate the error.
>>>
>>> Very cool!
>>>
>>>> I collected TSC -> CLOCK_MONOTONIC_RAW mapping using the above algorithm
>>>> at the start and end of perf cmd.
>>>>         MONO_RAW        TSC
>>>> start   89553516545645  223619715214239
>>>> end     89562251233830  223641517000376
>>>>
>>>> Here is what I get via mult/shift conversion from this patch.
>>>>         MONO_RAW        TSC
>>>> PEBS    89555942691466  223625770878571
>>>>
>>>> Then I use the time information from start and end to create a linear
>>>> function and 'guess' the MONO_RAW of PEBS from the TSC. I get
>>>> 89555942692721.
>>>> There is a 1255 ns difference.
>>>> I tried several different PEBS records. The error is ~1000ns.
>>>> I think it should be an observable error.
>>>
>>> Interesting. That's a good bit higher than I'd expect as I'd expect a
>>> clock_gettime() call to take ~ double digit nanoseconds range on
>>> average, so the error should be within that.
>>>
>>> Can you share your logic?
>>>
>>
>> I run the algorithm right before and after the perf command as below.
>> (The source code of time is attached.)
>>
>> $./time
>> $perf record -e cycles:upp --clockid monotonic_raw $some_workaround
>> $./time
>>
>> The time will dump both MONO_RAW and TSC. That's where "start" and "end"
>> from.
>> The perf command print out both TSC and converted MONO_RAW (using the
>> mul/shift from this patch series). That's where "PEBS" value from.
>>
>> Than I use the below formula to calculate the guessed MONO_RAW of PEBS TSC.
>> Guessed_MONO_RAW = (PEBS_TSC - start_TSC) / (end_TSC - start_TSC) *
>> (end_MONO_RAW - start_MONO_RAW) + start_MONO_RAW.
>>
>> The guessed_MONO_RAW is 89555942692721.
>> The PEBS_MONO_RAW is 89555942691466.
>> The difference is 1255.
>>
>> Is the calculation correct?
> 
> Thanks for sharing it. The equation you have there looks ok at a high
> level for the values you captured (there's small tweaks like doing the
> mult before the div to make sure you don't hit integer precision
> issues, but I didn't see that with your results).
> 
> I've got a todo to try to see how the calculation changes if we do
> provide atomic TSC/RAW stamps, here but I got a little busy with other
> work and haven't gotten to it.
> So my apologies, but I'll try to get back to this soon.
> 

Have you got a chance to try the idea?

I just want to check whether the userspace interpolation approach works.
Should I prepare V3 and go back to the kernel solution?


Thanks,
Kan
  
John Stultz March 9, 2023, 1:17 a.m. UTC | #17
On Wed, Mar 8, 2023 at 10:44 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> On 2023-02-17 6:11 p.m., John Stultz wrote:
> > On Tue, Feb 14, 2023 at 12:38 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >> On 2023-02-14 3:11 p.m., John Stultz wrote:
> >>> On Tue, Feb 14, 2023 at 9:00 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >>>> On 2023-02-14 9:51 a.m., Liang, Kan wrote:
> >>>>> If I understand correctly, the idea is to let the user space tool run
> >>>>> the above interpoloation algorithm several times to 'guess' the atomic
> >>>>> mapping. Using the mapping information to covert the TSC from the PEBS
> >>>>> record. Is my understanding correct?
> >>>>>
> >>>>> If so, to be honest, I doubt we can get the accuracy we want.
> >>>>>
> >>>>
> >>>> I implemented a simple test to evaluate the error.
> >>>
> >>> Very cool!
> >>>
> >>>> I collected TSC -> CLOCK_MONOTONIC_RAW mapping using the above algorithm
> >>>> at the start and end of perf cmd.
> >>>>         MONO_RAW        TSC
> >>>> start   89553516545645  223619715214239
> >>>> end     89562251233830  223641517000376
> >>>>
> >>>> Here is what I get via mult/shift conversion from this patch.
> >>>>         MONO_RAW        TSC
> >>>> PEBS    89555942691466  223625770878571
> >>>>
> >>>> Then I use the time information from start and end to create a linear
> >>>> function and 'guess' the MONO_RAW of PEBS from the TSC. I get
> >>>> 89555942692721.
> >>>> There is a 1255 ns difference.
> >>>> I tried several different PEBS records. The error is ~1000ns.
> >>>> I think it should be an observable error.
> >>>
> >>> Interesting. That's a good bit higher than I'd expect as I'd expect a
> >>> clock_gettime() call to take ~ double digit nanoseconds range on
> >>> average, so the error should be within that.
> >>>
> >>> Can you share your logic?
> >>>
> >>
> >> I run the algorithm right before and after the perf command as below.
> >> (The source code of time is attached.)
> >>
> >> $./time
> >> $perf record -e cycles:upp --clockid monotonic_raw $some_workaround
> >> $./time
> >>
> >> The time will dump both MONO_RAW and TSC. That's where "start" and "end"
> >> from.
> >> The perf command print out both TSC and converted MONO_RAW (using the
> >> mul/shift from this patch series). That's where "PEBS" value from.
> >>
> >> Than I use the below formula to calculate the guessed MONO_RAW of PEBS TSC.
> >> Guessed_MONO_RAW = (PEBS_TSC - start_TSC) / (end_TSC - start_TSC) *
> >> (end_MONO_RAW - start_MONO_RAW) + start_MONO_RAW.
> >>
> >> The guessed_MONO_RAW is 89555942692721.
> >> The PEBS_MONO_RAW is 89555942691466.
> >> The difference is 1255.
> >>
> >> Is the calculation correct?
> >
> > Thanks for sharing it. The equation you have there looks ok at a high
> > level for the values you captured (there's small tweaks like doing the
> > mult before the div to make sure you don't hit integer precision
> > issues, but I didn't see that with your results).
> >
> > I've got a todo to try to see how the calculation changes if we do
> > provide atomic TSC/RAW stamps, here but I got a little busy with other
> > work and haven't gotten to it.
> > So my apologies, but I'll try to get back to this soon.
> >
>
> Have you got a chance to try the idea?
>
> I just want to check whether the userspace interpolation approach works.
> Should I prepare V3 and go back to the kernel solution?

Oh, my apologies. I had some other work come up and this fell off my plate.

So I spent a little bit of time today adding some trace_printks to the
timekeeping code so I could record the actual TSC and timestamps being
calculated from CLOCK_MONOTONIC_RAW.

I did catch one error in the test code, which unfortunately I'm to blame for:
  mid = start + (delta +(delta/2))/2; //round-closest

That should be
  mid = start + (delta +(2/2))/2; //round-closest
or more simply
  mid = start + (delta +1)/2; //round-closest

Generalized rounding should be: (value + (DIV/2))/DIV, but I'm
guessing with two as the divisor, my brain mixed it up and typed
"delta". My apologies!

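To make the difference concrete, here is a small sketch of the generalized
rounding next to the original expression. The helper names are hypothetical;
div_round_closest() is similar in spirit to the kernel's DIV_ROUND_CLOSEST()
macro for unsigned values. The buggy form adds delta/2 instead of DIV/2, so it
lands about three quarters of the way through the window rather than at the
midpoint.

```c
#include <assert.h>
#include <stdint.h>

/* Round-to-closest unsigned division: (value + DIV/2) / DIV. */
static uint64_t div_round_closest(uint64_t value, uint64_t div)
{
	assert(div != 0);
	return (value + div / 2) / div;
}

/* Original expression from the test program: roughly 0.75 * delta in. */
static uint64_t mid_buggy(uint64_t start, uint64_t end)
{
	uint64_t delta = end - start;

	return start + (delta + (delta / 2)) / 2;
}

/* Corrected expression: the rounded midpoint of the window. */
static uint64_t mid_fixed(uint64_t start, uint64_t end)
{
	return start + div_round_closest(end - start, 2);
}
```

For start = 1000 and end = 1100, mid_buggy() returns 1075 while mid_fixed()
returns 1050.
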
With that fix, I'm seeing closer to ~500ns of error in the
interpolation, just using the userland sampling.   Now, I've also
disabled vsyscalls for this (otherwise I wouldn't be able to
trace_printk), so the error likely would be higher than with
vsyscalls.

Now, part of the error is that:
  start= rdtsc();
  clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
  end = rdtsc();

Ends up looking like
  start= rdtsc();
  clock_gettime() {
     now = rdtsc();
     delta = now - last;
     ns = (delta * mult) >> shift
[~midpoint~]
     ts->nsec = base_ns + ns;
     ts->sec = base_sec;
     normalize_ts(ts)
  }
  end = rdtsc();

And so by taking the mid-point we're always a little skewed from where
the tsc was actually read.  Looking at the data for my case the tsc
read seems to be ~12% in, so you could instead try:

delta = end - start;
p12 = start + ((delta * 12) + (100/2))/100;

With that adjustment, I'm seeing error around ~40ns.

Mind giving that a try?

Now, if you had two snapshots of MONOTONIC_RAW + the TSC value used to
calculate it (maybe the driver accesses this via a special internal
timekeeping interface), in my testing interpolating will give you
sub-ns error. So I think this is workable without exposing quite so
much to userland.

thanks
-john
  
Liang, Kan March 9, 2023, 4:56 p.m. UTC | #18
On 2023-03-08 8:17 p.m., John Stultz wrote:
> On Wed, Mar 8, 2023 at 10:44 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>> On 2023-02-17 6:11 p.m., John Stultz wrote:
>>> On Tue, Feb 14, 2023 at 12:38 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>>> On 2023-02-14 3:11 p.m., John Stultz wrote:
>>>>> On Tue, Feb 14, 2023 at 9:00 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>>>>> On 2023-02-14 9:51 a.m., Liang, Kan wrote:
>>>>>>> If I understand correctly, the idea is to let the user space tool run
>>>>>>> the above interpoloation algorithm several times to 'guess' the atomic
>>>>>>> mapping. Using the mapping information to covert the TSC from the PEBS
>>>>>>> record. Is my understanding correct?
>>>>>>>
>>>>>>> If so, to be honest, I doubt we can get the accuracy we want.
>>>>>>>
>>>>>>
>>>>>> I implemented a simple test to evaluate the error.
>>>>>
>>>>> Very cool!
>>>>>
>>>>>> I collected TSC -> CLOCK_MONOTONIC_RAW mapping using the above algorithm
>>>>>> at the start and end of perf cmd.
>>>>>>         MONO_RAW        TSC
>>>>>> start   89553516545645  223619715214239
>>>>>> end     89562251233830  223641517000376
>>>>>>
>>>>>> Here is what I get via mult/shift conversion from this patch.
>>>>>>         MONO_RAW        TSC
>>>>>> PEBS    89555942691466  223625770878571
>>>>>>
>>>>>> Then I use the time information from start and end to create a linear
>>>>>> function and 'guess' the MONO_RAW of PEBS from the TSC. I get
>>>>>> 89555942692721.
>>>>>> There is a 1255 ns difference.
>>>>>> I tried several different PEBS records. The error is ~1000ns.
>>>>>> I think it should be an observable error.
>>>>>
>>>>> Interesting. That's a good bit higher than I'd expect as I'd expect a
>>>>> clock_gettime() call to take ~ double digit nanoseconds range on
>>>>> average, so the error should be within that.
>>>>>
>>>>> Can you share your logic?
>>>>>
>>>>
>>>> I run the algorithm right before and after the perf command as below.
>>>> (The source code of time is attached.)
>>>>
>>>> $./time
>>>> $perf record -e cycles:upp --clockid monotonic_raw $some_workaround
>>>> $./time
>>>>
>>>> The time will dump both MONO_RAW and TSC. That's where "start" and "end"
>>>> from.
>>>> The perf command print out both TSC and converted MONO_RAW (using the
>>>> mul/shift from this patch series). That's where "PEBS" value from.
>>>>
>>>> Than I use the below formula to calculate the guessed MONO_RAW of PEBS TSC.
>>>> Guessed_MONO_RAW = (PEBS_TSC - start_TSC) / (end_TSC - start_TSC) *
>>>> (end_MONO_RAW - start_MONO_RAW) + start_MONO_RAW.
>>>>
>>>> The guessed_MONO_RAW is 89555942692721.
>>>> The PEBS_MONO_RAW is 89555942691466.
>>>> The difference is 1255.
>>>>
>>>> Is the calculation correct?
>>>
>>> Thanks for sharing it. The equation you have there looks ok at a high
>>> level for the values you captured (there's small tweaks like doing the
>>> mult before the div to make sure you don't hit integer precision
>>> issues, but I didn't see that with your results).
>>>
>>> I've got a todo to try to see how the calculation changes if we do
>>> provide atomic TSC/RAW stamps, here but I got a little busy with other
>>> work and haven't gotten to it.
>>> So my apologies, but I'll try to get back to this soon.
>>>
>>
>> Have you got a chance to try the idea?
>>
>> I just want to check whether the userspace interpolation approach works.
>> Should I prepare V3 and go back to the kernel solution?
> 
> Oh, my apologies. I had some other work come up and this fell off my plate.
> 
> So I spent a little bit of time today adding some trace_printks to the
> timekeeping code so I could record the actual TSC and timestamps being
> calculated from CLOCK_MONOTONIC_RAW.
> 
> I did catch one error in the test code, which unfortunately I'm to blame for:
>   mid = start + (delta +(delta/2))/2; //round-closest
> 
> That should be
>   mid = start + (delta +(2/2))/2  //round-closest
> or more simply
>   mid = start + (delta +1)/2; //round-closest
> 
> Generalized rounding should be: (value + (DIV/2))/DIV), but I'm
> guessing with two as the divisor, my brain mixed it up and typed
> "delta". My apologies!
> 
> With that fix, I'm seeing closer to ~500ns of error in the
> interpolation, just using the userland sampling.   Now, I've also
> disabled vsyscalls for this (otherwise I wouldn't be able to
> trace_printk), so the error likely would be higher than with
> vsyscalls.
> 
> Now, part of the error is that:
>   start= rdtsc();
>   clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
>   end = rdtsc();
> 
> Ends up looking like
>   start= rdtsc();
>   clock_gettime() {
>      now = rdtsc();
>      delta = now - last;
>      ns = (delta * mult) >> shift
> [~midpoint~]
>      ts->nsec = base_ns + ns;
>      ts->sec = base_sec;
>      normalize_ts(ts)
>   }
>   end = rdtsc();
> 
> And so by taking the mid-point we're always a little skewed from where
> the tsc was actually read.  Looking at the data for my case the tsc
> read seems to be ~12% in, so you could instead try:
> 
> delta = end - start;
> p12 = start + ((delta * 12) + (100/2))/100;
> 
> With that adjustment, I'm seeing error around ~40ns.
> 
> Mind giving that a try?

I tried both the new mid and p12. The error becomes even larger.

With new mid (start + (delta +1)/2), the error is now ~3800ns
With p12 adjustment, the error is ~6700ns.


Here is how I run the test.
$./time
$perf record -e cycles:upp --clockid monotonic_raw $some_workaround
$./time

Here is some raw data.

For the first ./time,
start: 961886196018
end: 961886215603
MONO_RAW: 341485848531

For the second ./time,
start: 986870117783
end: 986870136152
MONO_RAW: 351495432044

Here is the time generated from one PEBS record.
TSC: 968210217271
PEBS_MONO_RAW (calculated via kernel conversion information): 344019503072

Using new mid (start + (delta +1)/2), the guessed PEBS_MONO_RAW is
344019506897. The error is 3825ns.
Using p12 adjustment, the guessed PEBS_MONO_RAW is 344019509831.
The error is 6759ns.

Thanks,
Kan
> 
> Now, if you had two snapshots of MONOTONIC_RAW + the TSC value used to
> calculate it(maybe the driver access this via a special internal
> timekeeping interface), in my testing interpolating will give you
> sub-ns error. So I think this is workable without exposing quite so
> much to userland.
> 
> thanks
> -john
  
John Stultz March 11, 2023, 5:55 a.m. UTC | #19
On Thu, Mar 9, 2023 at 8:56 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> On 2023-03-08 8:17 p.m., John Stultz wrote:
> > So I spent a little bit of time today adding some trace_printks to the
> > timekeeping code so I could record the actual TSC and timestamps being
> > calculated from CLOCK_MONOTONIC_RAW.
> >
> > I did catch one error in the test code, which unfortunately I'm to blame for:
> >   mid = start + (delta +(delta/2))/2; //round-closest
> >
> > That should be
> >   mid = start + (delta +(2/2))/2  //round-closest
> > or more simply
> >   mid = start + (delta +1)/2; //round-closest
> >
> > Generalized rounding should be: (value + (DIV/2))/DIV), but I'm
> > guessing with two as the divisor, my brain mixed it up and typed
> > "delta". My apologies!
> >
> > With that fix, I'm seeing closer to ~500ns of error in the
> > interpolation, just using the userland sampling.   Now, I've also
> > disabled vsyscalls for this (otherwise I wouldn't be able to
> > trace_printk), so the error likely would be higher than with
> > vsyscalls.
> >
> > Now, part of the error is that:
> >   start= rdtsc();
> >   clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
> >   end = rdtsc();
> >
> > Ends up looking like
> >   start= rdtsc();
> >   clock_gettime() {
> >      now = rdtsc();
> >      delta = now - last;
> >      ns = (delta * mult) >> shift
> > [~midpoint~]
> >      ts->nsec = base_ns + ns;
> >      ts->sec = base_sec;
> >      normalize_ts(ts)
> >   }
> >   end = rdtsc();
> >
> > And so by taking the mid-point we're always a little skewed from where
> > the tsc was actually read.  Looking at the data for my case the tsc
> > read seems to be ~12% in, so you could instead try:
> >
> > delta = end - start;
> > p12 = start + ((delta * 12) + (100/2))/100;
> >
> > With that adjustment, I'm seeing error around ~40ns.
> >
> > Mind giving that a try?
>
> I tried both the new mid and p12. The error becomes even larger.
>
> With new mid (start + (delta +1)/2), the error is now ~3800ns
> With p12 adjustment, the error is ~6700ns.
>
>
> Here is how I run the test.
> $./time
> $perf record -e cycles:upp --clockid monotonic_raw $some_workaround
> $./time
>
> Here are some raw data.
>
> For the first ./time,
> start: 961886196018
> end: 961886215603
> MONO_RAW: 341485848531
>
> For the second ./time,
> start: 986870117783
> end: 986870136152
> MONO_RAW: 351495432044
>
> Here is the time generated from one PEBS record.
> TSC: 968210217271
> PEBS_MONO_RAW (calculated via kernel conversion information): 344019503072
>
> Using new mid (start + (delta +1)/2), the guessed PEBS_MONO_RAW is
> 344019506897. The error is 3825ns.
> Using p12 adjustment, the guessed PEBS_MONO_RAW is 344019509831.
> The error is 6759ns

Huh. I dunno. That seems wild that the error increased.

Just in case something is going astray with the PEBS_MONO_RAW logic,
can you apply the hack patch I was using to display the MONOTONIC_RAW
values the kernel calculates?
  https://github.com/johnstultz-work/linux-dev/commit/8d7896b078965b059ea5e8cc21841580557f6df6

It uses trace_printk, so you'll have to cat /sys/kernel/tracing/trace
to get the output.

thanks
-john
  
Andi Kleen March 12, 2023, 8:50 p.m. UTC | #20
> > conversion. So the kernel internal mul/shift information has to be exposed.
> Ugh. Well, I think perf may have made a bad API choice here, so I'm
> still going to push back on exposting timekeeping internals to
> userland.

It's not about the perf ABI.

The perf mmap mult/offset is for PT, which always has raw TSCs.

Without it the PT decoder couldn't supply wall clock time.

-Andi
  
Liang, Kan March 13, 2023, 9:19 p.m. UTC | #21
On 2023-03-11 12:55 a.m., John Stultz wrote:
> On Thu, Mar 9, 2023 at 8:56 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>> On 2023-03-08 8:17 p.m., John Stultz wrote:
>>> So I spent a little bit of time today adding some trace_printks to the
>>> timekeeping code so I could record the actual TSC and timestamps being
>>> calculated from CLOCK_MONOTONIC_RAW.
>>>
>>> I did catch one error in the test code, which unfortunately I'm to blame for:
>>>   mid = start + (delta +(delta/2))/2; //round-closest
>>>
>>> That should be
>>>   mid = start + (delta +(2/2))/2  //round-closest
>>> or more simply
>>>   mid = start + (delta +1)/2; //round-closest
>>>
>>> Generalized rounding should be: (value + (DIV/2))/DIV), but I'm
>>> guessing with two as the divisor, my brain mixed it up and typed
>>> "delta". My apologies!
>>>
>>> With that fix, I'm seeing closer to ~500ns of error in the
>>> interpolation, just using the userland sampling.   Now, I've also
>>> disabled vsyscalls for this (otherwise I wouldn't be able to
>>> trace_printk), so the error likely would be higher than with
>>> vsyscalls.
>>>
>>> Now, part of the error is that:
>>>   start= rdtsc();
>>>   clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
>>>   end = rdtsc();
>>>
>>> Ends up looking like
>>>   start= rdtsc();
>>>   clock_gettime() {
>>>      now = rdtsc();
>>>      delta = now - last;
>>>      ns = (delta * mult) >> shift
>>> [~midpoint~]
>>>      ts->nsec = base_ns + ns;
>>>      ts->sec = base_sec;
>>>      normalize_ts(ts)
>>>   }
>>>   end = rdtsc();
>>>
>>> And so by taking the mid-point we're always a little skewed from where
>>> the tsc was actually read.  Looking at the data for my case the tsc
>>> read seems to be ~12% in, so you could instead try:
>>>
>>> delta = end - start;
>>> p12 = start + ((delta * 12) + (100/2))/100;
>>>
>>> With that adjustment, I'm seeing error around ~40ns.
>>>
>>> Mind giving that a try?
>>
>> I tried both the new mid and p12. The error becomes even larger.
>>
>> With new mid (start + (delta +1)/2), the error is now ~3800ns
>> With p12 adjustment, the error is ~6700ns.
>>
>>
>> Here is how I run the test.
>> $./time
>> $perf record -e cycles:upp --clockid monotonic_raw $some_workaround
>> $./time
>>
>> Here are some raw data.
>>
>> For the first ./time,
>> start: 961886196018
>> end: 961886215603
>> MONO_RAW: 341485848531
>>
>> For the second ./time,
>> start: 986870117783
>> end: 986870136152
>> MONO_RAW: 351495432044
>>
>> Here is the time generated from one PEBS record.
>> TSC: 968210217271
>> PEBS_MONO_RAW (calculated via kernel conversion information): 344019503072
>>
>> Using new mid (start + (delta +1)/2), the guessed PEBS_MONO_RAW is
>> 344019506897. The error is 3825ns.
>> Using p12 adjustment, the guessed PEBS_MONO_RAW is 344019509831.
>> The error is 6759ns
> 
> Huh. I dunno. That seems wild that the error increased.
> 
> Just in case something is going astray with the PEBS_MONO_RAW logic,
> can you apply the hack patch I was using to display the MONOTONIC_RAW
> values the kernel calculates?
>   https://github.com/johnstultz-work/linux-dev/commit/8d7896b078965b059ea5e8cc21841580557f6df6
> 
> It uses trace_printk, so you'll have to cat /sys/kernel/tracing/trace
> to get the output.
> 


$ ./time_3
start: 7358368893806 end: 7358368902944 delta: 9138
MONO_RAW: 2899739790738
MID: 7358368898375
P12: 7358368894903
$ sudo cat /sys/kernel/tracing/trace | grep time_3
          time_3-1443    [002] .....  2899.858936: ktime_get_raw_ts64:
JDB: timekeeping_get_delta cycle_now: 7358368897679
          time_3-1443    [002] .....  2899.858937: ktime_get_raw_ts64:
JDB: ktime_get_raw_ts64: 2899739790738

The difference between cycle_now and MID is -696 (in TSC cycles).
The difference between cycle_now and P12 is 2776 (in TSC cycles).
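
As a worked check on where the kernel's TSC read falls inside the
start/end window, the position of cycle_now can be expressed as a fraction of
delta. This is only a sketch for the single sample above; the helper name
tsc_position_permille() is hypothetical and the result is in tenths of a
percent.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: how far into [start, end] the inner TSC read
 * happened, in permille (1000 == at end), rounded to closest.
 */
static uint64_t tsc_position_permille(uint64_t start, uint64_t end,
				      uint64_t inner)
{
	uint64_t delta;

	assert(end > start && inner >= start && inner <= end);

	delta = end - start;
	return ((inner - start) * 1000 + delta / 2) / delta;
}
```

For start 7358368893806, end 7358368902944 and cycle_now 7358368897679 this
gives 424, so in this sample the read happened roughly 42% into the window,
closer to the midpoint than to the 12% point.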

The time_3.c is attached.

Thanks,
Kan
#include <sys/time.h>
#include <time.h>
#include <stdio.h>
#include <errno.h>

/* Read the x86 time-stamp counter (RDTSC returns it in EDX:EAX). */
static inline unsigned long long rdtsc(void)
{
  unsigned int hi, lo;

  asm volatile ("rdtsc" : "=a" (lo), "=d" (hi));

  return ((unsigned long long) hi << 32) | lo;
}

typedef unsigned long long u64;

int main(void)
{
	struct timespec ts;
	u64 start, end, delta, mid, p12;

	do {
		start = rdtsc();
		clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
		end = rdtsc();
		delta = end - start;
	} while (delta > 20000);   // make sure the reads were not preempted

	printf("start: %llu end: %llu delta: %llu\n", start, end, delta);
	printf("MONO_RAW: %llu\n", (u64)ts.tv_sec * 1000000000 + ts.tv_nsec);
	mid = start + (delta + 1)/2; //round-closest
	printf("MID: %llu\n", mid);
	p12 = start + ((delta * 12) + (100/2))/100; // ~12% into the window
	printf("P12: %llu\n", p12);
	return 0;
}
  
John Stultz March 18, 2023, 6:02 a.m. UTC | #22
On Mon, Mar 13, 2023 at 2:19 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
>
>
>
> On 2023-03-11 12:55 a.m., John Stultz wrote:
> > On Thu, Mar 9, 2023 at 8:56 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >> On 2023-03-08 8:17 p.m., John Stultz wrote:
> >>> So I spent a little bit of time today adding some trace_printks to the
> >>> timekeeping code so I could record the actual TSC and timestamps being
> >>> calculated from CLOCK_MONOTONIC_RAW.
> >>>
> >>> I did catch one error in the test code, which unfortunately I'm to blame for:
> >>>   mid = start + (delta +(delta/2))/2; //round-closest
> >>>
> >>> That should be
> >>>   mid = start + (delta +(2/2))/2  //round-closest
> >>> or more simply
> >>>   mid = start + (delta +1)/2; //round-closest
> >>>
> >>> Generalized rounding should be: (value + (DIV/2))/DIV), but I'm
> >>> guessing with two as the divisor, my brain mixed it up and typed
> >>> "delta". My apologies!
> >>>
> >>> With that fix, I'm seeing closer to ~500ns of error in the
> >>> interpolation, just using the userland sampling.   Now, I've also
> >>> disabled vsyscalls for this (otherwise I wouldn't be able to
> >>> trace_printk), so the error likely would be higher than with
> >>> vsyscalls.
> >>>
> >>> Now, part of the error is that:
> >>>   start= rdtsc();
> >>>   clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
> >>>   end = rdtsc();
> >>>
> >>> Ends up looking like
> >>>   start= rdtsc();
> >>>   clock_gettime() {
> >>>      now = rdtsc();
> >>>      delta = now - last;
> >>>      ns = (delta * mult) >> shift
> >>> [~midpoint~]
> >>>      ts->nsec = base_ns + ns;
> >>>      ts->sec = base_sec;
> >>>      normalize_ts(ts)
> >>>   }
> >>>   end = rdtsc();
> >>>
> >>> And so by taking the mid-point we're always a little skewed from where
> >>> the tsc was actually read.  Looking at the data for my case the tsc
> >>> read seems to be ~12% in, so you could instead try:
> >>>
> >>> delta = end - start;
> >>> p12 = start + ((delta * 12) + (100/2))/100;
> >>>
> >>> With that adjustment, I'm seeing error around ~40ns.
> >>>
> >>> Mind giving that a try?
> >>
> >> I tried both the new mid and p12. The error becomes even larger.
> >>
> >> With new mid (start + (delta +1)/2), the error is now ~3800ns
> >> With p12 adjustment, the error is ~6700ns.
> >>
> >>
> >> Here is how I run the test.
> >> $./time
> >> $perf record -e cycles:upp --clockid monotonic_raw $some_workaround
> >> $./time
> >>
> >> Here are some raw data.
> >>
> >> For the first ./time,
> >> start: 961886196018
> >> end: 961886215603
> >> MONO_RAW: 341485848531
> >>
> >> For the second ./time,
> >> start: 986870117783
> >> end: 986870136152
> >> MONO_RAW: 351495432044
> >>
> >> Here is the time generated from one PEBS record.
> >> TSC: 968210217271
> >> PEBS_MONO_RAW (calculated via kernel conversion information): 344019503072
> >>
> >> Using new mid (start + (delta +1)/2), the guessed PEBS_MONO_RAW is
> >> 344019506897. The error is 3825ns.
> >> Using p12 adjustment, the guessed PEBS_MONO_RAW is 344019509831.
> >> The error is 6759ns
> >
> > Huh. I dunno. That seems wild that the error increased.
> >
> > Just in case something is going astray with the PEBS_MONO_RAW logic,
> > can you apply the hack patch I was using to display the MONOTONIC_RAW
> > values the kernel calculates?
> >   https://github.com/johnstultz-work/linux-dev/commit/8d7896b078965b059ea5e8cc21841580557f6df6
> >
> > It uses trace_printk, so you'll have to cat /sys/kernel/tracing/trace
> > to get the output.
> >
>
>
> $ ./time_3
> start: 7358368893806 end: 7358368902944 delta: 9138
> MONO_RAW: 2899739790738
> MID: 7358368898375
> P12: 7358368894903
> $ sudo cat /sys/kernel/tracing/trace | grep time_3
>           time_3-1443    [002] .....  2899.858936: ktime_get_raw_ts64:
> JDB: timekeeping_get_delta cycle_now: 7358368897679
>           time_3-1443    [002] .....  2899.858937: ktime_get_raw_ts64:
> JDB: ktime_get_raw_ts64: 2899739790738
>
> The error between MID and cycle_now is -696ns
> The error between P12 and cycle_now is 2776ns

Hey Kan,
  So I'm terribly sorry, I'm a bit underwater right now and haven't
had time to look deeper at this. The MID case you have above looks
closer to what I was seeing but I can't explain why the 12% case is
worse.

Since I feel it's not really fair to object to your patch but not have
the time to work through an alternative with you, I'm going to
withdraw my objection (though others may persist!).
I'd still really prefer if we avoided exposing internal timekeeping
state directly to userland, and it would be good to see some further
exploration in other directions, but there is the existing perf mmap
precedent (even if I dislike it).   Sorry I can't be of more help to
find a better approach here. :(

thanks
-john
  
Liang, Kan March 21, 2023, 3:26 p.m. UTC | #23
Hi John,

On 2023-03-18 2:02 a.m., John Stultz wrote:
> On Mon, Mar 13, 2023 at 2:19 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>
>>
>>
>> On 2023-03-11 12:55 a.m., John Stultz wrote:
>>> On Thu, Mar 9, 2023 at 8:56 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>>> On 2023-03-08 8:17 p.m., John Stultz wrote:
>>>>> So I spent a little bit of time today adding some trace_printks to the
>>>>> timekeeping code so I could record the actual TSC and timestamps being
>>>>> calculated from CLOCK_MONOTONIC_RAW.
>>>>>
>>>>> I did catch one error in the test code, which unfortunately I'm to blame for:
>>>>>   mid = start + (delta +(delta/2))/2; //round-closest
>>>>>
>>>>> That should be
>>>>>   mid = start + (delta +(2/2))/2  //round-closest
>>>>> or more simply
>>>>>   mid = start + (delta +1)/2; //round-closest
>>>>>
>>>>> Generalized rounding should be: (value + (DIV/2))/DIV), but I'm
>>>>> guessing with two as the divisor, my brain mixed it up and typed
>>>>> "delta". My apologies!
>>>>>
>>>>> With that fix, I'm seeing closer to ~500ns of error in the
>>>>> interpolation, just using the userland sampling.   Now, I've also
>>>>> disabled vsyscalls for this (otherwise I wouldn't be able to
>>>>> trace_printk), so the error likely would be higher than with
>>>>> vsyscalls.
>>>>>
>>>>> Now, part of the error is that:
>>>>>   start= rdtsc();
>>>>>   clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
>>>>>   end = rdtsc();
>>>>>
>>>>> Ends up looking like
>>>>>   start= rdtsc();
>>>>>   clock_gettime() {
>>>>>      now = rdtsc();
>>>>>      delta = now - last;
>>>>>      ns = (delta * mult) >> shift
>>>>> [~midpoint~]
>>>>>      ts->nsec = base_ns + ns;
>>>>>      ts->sec = base_sec;
>>>>>      normalize_ts(ts)
>>>>>   }
>>>>>   end = rdtsc();
>>>>>
>>>>> And so by taking the mid-point we're always a little skewed from where
>>>>> the tsc was actually read.  Looking at the data for my case the tsc
>>>>> read seems to be ~12% in, so you could instead try:
>>>>>
>>>>> delta = end - start;
>>>>> p12 = start + ((delta * 12) + (100/2))/100;
>>>>>
>>>>> With that adjustment, I'm seeing error around ~40ns.
>>>>>
>>>>> Mind giving that a try?
>>>>
>>>> I tried both the new mid and p12. The error becomes even larger.
>>>>
>>>> With new mid (start + (delta +1)/2), the error is now ~3800ns
>>>> With p12 adjustment, the error is ~6700ns.
>>>>
>>>>
>>>> Here is how I run the test.
>>>> $./time
>>>> $perf record -e cycles:upp --clockid monotonic_raw $some_workaround
>>>> $./time
>>>>
>>>> Here are some raw data.
>>>>
>>>> For the first ./time,
>>>> start: 961886196018
>>>> end: 961886215603
>>>> MONO_RAW: 341485848531
>>>>
>>>> For the second ./time,
>>>> start: 986870117783
>>>> end: 986870136152
>>>> MONO_RAW: 351495432044
>>>>
>>>> Here is the time generated from one PEBS record.
>>>> TSC: 968210217271
>>>> PEBS_MONO_RAW (calculated via kernel conversion information): 344019503072
>>>>
>>>> Using new mid (start + (delta +1)/2), the guessed PEBS_MONO_RAW is
>>>> 344019506897. The error is 3825ns.
>>>> Using p12 adjustment, the guessed PEBS_MONO_RAW is 344019509831.
>>>> The error is 6759ns
>>>
>>> Huh. I dunno. That seems wild that the error increased.
>>>
>>> Just in case something is going astray with the PEBS_MONO_RAW logic,
>>> can you apply the hack patch I was using to display the MONOTONIC_RAW
>>> values the kernel calculates?
>>>   https://github.com/johnstultz-work/linux-dev/commit/8d7896b078965b059ea5e8cc21841580557f6df6
>>>
>>> It uses trace_printk, so you'll have to cat /sys/kernel/tracing/trace
>>> to get the output.
>>>
>>
>>
>> $ ./time_3
>> start: 7358368893806 end: 7358368902944 delta: 9138
>> MONO_RAW: 2899739790738
>> MID: 7358368898375
>> P12: 7358368894903
>> $ sudo cat /sys/kernel/tracing/trace | grep time_3
>>           time_3-1443    [002] .....  2899.858936: ktime_get_raw_ts64:
>> JDB: timekeeping_get_delta cycle_now: 7358368897679
>>           time_3-1443    [002] .....  2899.858937: ktime_get_raw_ts64:
>> JDB: ktime_get_raw_ts64: 2899739790738
>>
>> The error between MID and cycle_now is -696ns
>> The error between P12 and cycle_now is 2776ns
> 
> Hey Kan,
>   So I'm terribly sorry, I'm a bit underwater right now and haven't
> had time to look deeper at this. The MID case you have above looks
> closer to what I was seeing but I can't explain why the 12% case is
> worse.
> 
> Since I feel it's not really fair to object to your patch but not have
> the time to work through an alternative with you, I'm going to
> withdraw my objection (though others may persist!).
> I'd still really prefer if we avoided exposing internal timekeeping
> state directly to userland, and it would be good to see some further
> exploration in other directions, but there is the existing perf mmap
> precedent (even if I dislike it).   Sorry I can't be of more help to
> find a better approach here. :(
> 

Thank you all the same. I think we have learnt that the pure user space
solution needs more work. It is not a viable solution for the monotonic
raw conversion for now.
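
For reference, the pure user space approach tried in this thread
amounts to a linear interpolation between two (TSC, MONO_RAW) points
sampled around the perf run. Below is a minimal sketch under that
assumption. The struct and function names are hypothetical, and
unsigned __int128 is a GCC/Clang extension used here only to keep
the intermediate product from overflowing.

```c
#include <assert.h>
#include <stdint.h>

/*
 * One user space calibration point: an estimated TSC value paired with
 * the CLOCK_MONOTONIC_RAW reading (in ns) taken at about the same time.
 */
struct calib_point {
	uint64_t tsc;
	uint64_t mono_raw_ns;
};

/*
 * Linearly interpolate the monotonic raw time for 'tsc' from two
 * calibration points, rounding the scaled delta to the closest ns.
 */
static uint64_t interp_mono_raw(struct calib_point a, struct calib_point b,
				uint64_t tsc)
{
	assert(b.tsc > a.tsc && tsc >= a.tsc);

	uint64_t dtsc = b.tsc - a.tsc;
	unsigned __int128 num = (unsigned __int128)(tsc - a.tsc) *
				(b.mono_raw_ns - a.mono_raw_ns);

	return a.mono_raw_ns + (uint64_t)((num + dtsc / 2) / dtsc);
}
```

Using the MID points of the two ./time runs quoted above
(961886205811 / 341485848531 and 986870126968 / 351495432044) and
the PEBS TSC 968210217271, this lands roughly 3.8us after the value
the kernel conversion produced, which is in line with the error
reported earlier in the thread.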

I have no idea how to do the post-processing conversion without the
internal conversion information.
So, for now, there seem to be only two candidate solutions.
- Pure kernel solution (similar to V1).
- Expose the internal conversion information to user space and do the
  post-processing conversion there (V2).
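
To make the V2 option concrete, here is a minimal sketch of the
post-processing step, assuming the tool has already copied the
proposed time_mono_* fields out of the mmap page. The struct
mono_raw_conv and cyc_to_mono_raw() are hypothetical names, not
existing perf tool code. A real reader would also need to take a
consistent snapshot of the fields (the page's lock sequence counter),
which is left out here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user space copy of the proposed time_mono_* fields. */
struct mono_raw_conv {
	uint64_t last;	/* time_mono_last: cycle count at the last update */
	uint32_t mult;	/* time_mono_mult */
	uint32_t shift;	/* time_mono_shift */
	uint64_t nsec;	/* time_mono_nsec: added before the shift */
	uint64_t base;	/* time_mono_base: ns, added after the shift */
};

/*
 * mono_raw = base + (((cyc - last) * mult + nsec) >> shift)
 *
 * The shift applies only to the scaled delta plus nsec; base is added
 * afterwards. The 64-bit product can overflow for very large deltas.
 */
static uint64_t cyc_to_mono_raw(const struct mono_raw_conv *c, uint64_t cyc)
{
	assert(c->shift < 64);

	uint64_t delta = cyc - c->last;

	return c->base + ((delta * c->mult + c->nsec) >> c->shift);
}
```

For example, with made-up values last = 1000, mult = 512, shift = 10,
nsec = 0 and base = 5000, a sample with cyc = 3000 converts to 6000.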

I will ping Thomas in the other thread and see if he has any suggestions.

Thanks,
Kan
  

Patch

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index ccb7f5dad59b..9d56fe027f6c 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -455,7 +455,8 @@  struct perf_event_attr {
 				inherit_thread :  1, /* children only inherit if cloned with CLONE_THREAD */
 				remove_on_exec :  1, /* event is removed from task on exec */
 				sigtrap        :  1, /* send synchronous SIGTRAP on event */
-				__reserved_1   : 26;
+				hw_time        :  1, /* generate raw HW time for samples */
+				__reserved_1   : 25;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
@@ -615,7 +616,8 @@  struct perf_event_mmap_page {
 				cap_user_time		: 1, /* The time_{shift,mult,offset} fields are used */
 				cap_user_time_zero	: 1, /* The time_zero field is used */
 				cap_user_time_short	: 1, /* the time_{cycle,mask} fields are used */
-				cap_____res		: 58;
+				cap_user_time_mono_raw  : 1, /* The time_mono_* fields are used */
+				cap_____res		: 57;
 		};
 	};
 
@@ -692,11 +694,24 @@  struct perf_event_mmap_page {
 	__u64	time_cycles;
 	__u64	time_mask;
 
+	/*
+	 * If cap_user_time_mono_raw, the monotonic raw clock can be calculated
+	 * from the hardware clock (e.g. TSC) 'cyc'.
+	 *
+	 * mono_raw = base + (((cyc - last) * mult + nsec) >> shift)
+	 *
+	 */
+	__u64	time_mono_last;
+	__u32	time_mono_mult;
+	__u32	time_mono_shift;
+	__u64	time_mono_nsec;
+	__u64	time_mono_base;
+
 		/*
 		 * Hole for extension of the self monitor capabilities
 		 */
 
-	__u8	__reserved[116*8];	/* align to 1k. */
+	__u8	__reserved[112*8];	/* align to 1k. */
 
 	/*
 	 * Control data for the mmap() data buffer.
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 380476a934e8..f062cce2dafc 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12135,6 +12135,13 @@  static int perf_copy_attr(struct perf_event_attr __user *uattr,
 	if (attr->sigtrap && !attr->remove_on_exec)
 		return -EINVAL;
 
+	if (attr->use_clockid) {
+		/*
+		 * Only support post-processing for the monotonic raw clock
+		 */
+		if (attr->hw_time && (attr->clockid != CLOCK_MONOTONIC_RAW))
+			return -EINVAL;
+	}
 out:
 	return ret;