[1/3] timekeeping: NMI safe converter from a given time to monotonic

Message ID 20230123182728.825519-2-kan.liang@linux.intel.com
State New
Series Convert TSC to monotonic clock for PEBS

Commit Message

Liang, Kan Jan. 23, 2023, 6:27 p.m. UTC
  From: Kan Liang <kan.liang@linux.intel.com>

It's useful to provide an NMI safe function to convert a given time to
monotonic. For example, the perf_event subsystem wants to convert the TSC
of a PEBS record to a monotonic clock in an NMI handler.

The two existing functions below were considered, but neither fulfills
the above requirements.
- ktime_get_mono_fast_ns() is NMI safe, but it can only return the
  current monotonic time, not the monotonic time for a given timestamp.
- get_device_system_crosststamp() can calculate the system time from
  a given device time, but it is neither fast nor NMI safe.

The new function is a combination of the two existing ones. It uses the
tk_fast_mono timekeeper to be fast and NMI safe, and the get_time_fn
callback to retrieve the given timestamp and its clocksource. History is
not supported, since the perf use case doesn't need it. It can be added
later once there is a use case.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 include/linux/timekeeping.h |  9 +++++
 kernel/time/timekeeping.c   | 68 +++++++++++++++++++++++++++++++++++--
 2 files changed, 74 insertions(+), 3 deletions(-)
  

Comments

John Stultz Jan. 24, 2023, 7:01 a.m. UTC | #1
On Mon, Jan 23, 2023 at 10:27 AM <kan.liang@linux.intel.com> wrote:
> +int notrace get_mono_fast_from_given_time(int (*get_time_fn)
> +                                               (struct system_counterval_t *sys_counterval,
> +                                               void *ctx),
> +                                         void *ctx,
> +                                         u64 *mono_ns)
> +{
> +       struct system_counterval_t system_counterval;
> +       struct tk_fast *tkf = &tk_fast_mono;
> +       u64 cycles, now, interval_start;
> +       struct tk_read_base *tkr;
> +       unsigned int seq;
> +       int ret;
> +
> +       do {
> +               seq = raw_read_seqcount_latch(&tkf->seq);
> +               tkr = tkf->base + (seq & 0x01);
> +
> +               ret = get_time_fn(&system_counterval, ctx);
> +               if (ret)
> +                       return ret;
> +
> +               /*
> +                * Verify that the clocksource associated with the given
> +                * timestamp is the same as the currently installed
> +                * timekeeper clocksource
> +                */
> +               if (tkr->clock != system_counterval.cs)
> +                       return -EOPNOTSUPP;
> +               cycles = system_counterval.cycles;
> +
> +               /*
> +                * Check whether the given timestamp is on the current
> +                * timekeeping interval.
> +                */
> +               now = tk_clock_read(tkr);
> +               interval_start = tkr->cycle_last;
> +               if (!cycle_between(interval_start, cycles, now))
> +                       return -EOPNOTSUPP;

So. I've not fully thought this out, but it seems like it would be
quite likely that you'd run into the case where the cycle_last value
is updated and your earlier TSC timestamp isn't valid for the current
interval. The get_device_system_crosststamp() logic has a big chunk of
complex code to try to handle this case by interpolating the cycle
value back in time. How well does just failing in this case work out?

thanks
-john
  
Thomas Gleixner Jan. 24, 2023, 8:51 a.m. UTC | #2
On Mon, Jan 23 2023 at 10:27, kan liang wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
>
> It's useful to provide a NMI safe function to convert a given time to
> monotonic. For example, the perf_event subsystem wants to convert a TSC
> of a PEBS record to a monotonic clock in a NMI handler.

Why? That's a postprocessing problem, really.

Thanks,

        tglx
  
Stephane Eranian Jan. 24, 2023, 9:10 a.m. UTC | #3
On Tue, Jan 24, 2023 at 12:52 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Mon, Jan 23 2023 at 10:27, kan liang wrote:
> > From: Kan Liang <kan.liang@linux.intel.com>
> >
> > It's useful to provide a NMI safe function to convert a given time to
> > monotonic. For example, the perf_event subsystem wants to convert a TSC
> > of a PEBS record to a monotonic clock in a NMI handler.
>
> Why? That's a postprocessing problem, really.
>
Because you want to correlate samples captured by PEBS with samples
from applications timestamped with a user-visible clock such as
CLOCK_MONOTONIC.
When I create a perf_event event and I stipulate that
event_attr.clockid=MONOTONIC, I expect all the samples from
that event to be timestamped using the same clock source, regardless
of PEBS or IBS.



> Thanks,
>
>         tglx
  
Liang, Kan Jan. 24, 2023, 3:09 p.m. UTC | #4
On 2023-01-24 2:01 a.m., John Stultz wrote:
> On Mon, Jan 23, 2023 at 10:27 AM <kan.liang@linux.intel.com> wrote:
>> +int notrace get_mono_fast_from_given_time(int (*get_time_fn)
>> +                                               (struct system_counterval_t *sys_counterval,
>> +                                               void *ctx),
>> +                                         void *ctx,
>> +                                         u64 *mono_ns)
>> +{
>> +       struct system_counterval_t system_counterval;
>> +       struct tk_fast *tkf = &tk_fast_mono;
>> +       u64 cycles, now, interval_start;
>> +       struct tk_read_base *tkr;
>> +       unsigned int seq;
>> +       int ret;
>> +
>> +       do {
>> +               seq = raw_read_seqcount_latch(&tkf->seq);
>> +               tkr = tkf->base + (seq & 0x01);
>> +
>> +               ret = get_time_fn(&system_counterval, ctx);
>> +               if (ret)
>> +                       return ret;
>> +
>> +               /*
>> +                * Verify that the clocksource associated with the given
>> +                * timestamp is the same as the currently installed
>> +                * timekeeper clocksource
>> +                */
>> +               if (tkr->clock != system_counterval.cs)
>> +                       return -EOPNOTSUPP;
>> +               cycles = system_counterval.cycles;
>> +
>> +               /*
>> +                * Check whether the given timestamp is on the current
>> +                * timekeeping interval.
>> +                */
>> +               now = tk_clock_read(tkr);
>> +               interval_start = tkr->cycle_last;
>> +               if (!cycle_between(interval_start, cycles, now))
>> +                       return -EOPNOTSUPP;
> 
> So. I've not fully thought this out, but it seems like it would be
> quite likely that you'd run into the case where the cycle_last value
> is updated and your earlier TSC timestamp isn't valid for the current
> interval. The get_device_system_crosststamp() logic has a big chunk of
> complex code to try to handle this case by interpolating the cycle
> value back in time. How well does just failing in this case work out?
> 

In that case, perf falls back to the time captured in the NMI handler,
via ktime_get_mono_fast_ns().

The TSC in PEBS is captured by HW when the sample is generated. There
should be a small delta compared with the time captured in the NMI
handler, but I think that delta is acceptable as a backup for most
analysis cases. Also, I don't think the case (the cycle_last value being
updated during the monitoring) should occur very often either. So I
dropped the history support to simplify the function.

Thanks,
Kan
  
Liang, Kan Jan. 24, 2023, 4:06 p.m. UTC | #5
On 2023-01-24 4:10 a.m., Stephane Eranian wrote:
> On Tue, Jan 24, 2023 at 12:52 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> On Mon, Jan 23 2023 at 10:27, kan liang wrote:
>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>
>>> It's useful to provide a NMI safe function to convert a given time to
>>> monotonic. For example, the perf_event subsystem wants to convert a TSC
>>> of a PEBS record to a monotonic clock in a NMI handler.
>>
>> Why? That's a postprocessing problem, really.
>>

Besides the reason Stephane mentioned, the current perf tool assumes
that the kernel always returns the time in the clock it asked for. For
example, without this patch, the kernel uses ktime_get_mono_fast_ns()
to get the MONOTONIC time and dumps it to user space.

If we want to break that assumption, I think we have to modify the
current interfaces and dump extra data to indicate the clock source.
That makes the existing interface more complex. Preparing and dumping
the extra data also brings overhead. We would also have to force users
to update their tools.


Thanks,
Kan

> Because you want to correlate samples captured by PEBS with samples
> from applications timestamped with a user available clock such as
> CLOCK_MONOTONIC, for instance.
> When I create a perf_event event and I stipulate that
> event_attr.clockid=MONOTONIC, I expect all the samples from
> that event to be timestamped using the same clock source, regardless
> of PEBS or IBS.
> 
> 
> 
>> Thanks,
>>
>>         tglx
  
John Stultz Jan. 24, 2023, 6:43 p.m. UTC | #6
On Tue, Jan 24, 2023 at 7:09 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> On 2023-01-24 2:01 a.m., John Stultz wrote:
> > On Mon, Jan 23, 2023 at 10:27 AM <kan.liang@linux.intel.com> wrote:
> >> +               /*
> >> +                * Check whether the given timestamp is on the current
> >> +                * timekeeping interval.
> >> +                */
> >> +               now = tk_clock_read(tkr);
> >> +               interval_start = tkr->cycle_last;
> >> +               if (!cycle_between(interval_start, cycles, now))
> >> +                       return -EOPNOTSUPP;
> >
> > So. I've not fully thought this out, but it seems like it would be
> > quite likely that you'd run into the case where the cycle_last value
> > is updated and your earlier TSC timestamp isn't valid for the current
> > interval. The get_device_system_crosststamp() logic has a big chunk of
> > complex code to try to handle this case by interpolating the cycle
> > value back in time. How well does just failing in this case work out?
> >
>
> For the case, perf fallback to the time captured in the NMI handler, via
> ktime_get_mono_fast_ns().

This feels like *very* subtle behavior. Maybe I'm misunderstanding,
but the goal seems to be to have more accurate timestamps on the hw
events, and using the captured tsc timestamp avoids the measuring
latency reading the time again. But if every timekeeping update
interval (~tick) you transparently get a delayed value due to the
fallback, it makes it hard to understand which timestamps are better
or worse. The latency between two reads may be real or it may be just
bad luck. This doesn't intuitively seem like a great benefit over more
consistent latency of just using the ktime_get_mono_fast()
timestamping.

> The TSC in PEBS is captured by HW when the sample was generated. There
> should be a small delta compared with the time captured in the NMI
> handler. But I think the delta should be acceptable as a backup solution
> for the most analysis cases. Also, I don't think the case (the
> cycle_last value is updated during the monitoring) should occur very
> often either. So I drop the history support to simplify the function.

So the reads and this function are *always* used in NMI context?   Has
this been stressed with things like SMIs to see how it does if
interrupted in those cases?

My worry is that (as I bored everyone earlier), the
ktime_get_*_fast_ns() interfaces already have some sharp edges and
need a fair amount of thought as to when they should be used. This is
sort of compounding that adding an interface that has further special
cases where it can fail, making it difficult to fully understand and
easier to accidentally misuse.

My other concern is that interfaces always get stretched and used
beyond anything they were initially planned for (see the
ktime_get_*fast_ns() interfaces here as an example! :), and in this
case the logic seems to have lots of implicit dependencies on the
facts of your specific use case, so it seems a bit fragile should
folks on other architectures with other constraints try to use it.

So I just want to push a bit to think how you might be able to
extend/generalize the existing get_system_crosststamp for your
purposes, or alternatively find a way to simplify the logic's behavior
so its less tied to specific constraints ("this works most of the time
from NMI, but otherwise no promises").  Or at least some better
documentation around the code, its uses and its constraints? ( "NMI
safe" is not the same as "Only safe to use from NMI" :)

And apologies if I've misunderstood things here.

thanks
-john
  
Liang, Kan Jan. 24, 2023, 8:12 p.m. UTC | #7
On 2023-01-24 1:43 p.m., John Stultz wrote:
> On Tue, Jan 24, 2023 at 7:09 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>> On 2023-01-24 2:01 a.m., John Stultz wrote:
>>> On Mon, Jan 23, 2023 at 10:27 AM <kan.liang@linux.intel.com> wrote:
>>>> +               /*
>>>> +                * Check whether the given timestamp is on the current
>>>> +                * timekeeping interval.
>>>> +                */
>>>> +               now = tk_clock_read(tkr);
>>>> +               interval_start = tkr->cycle_last;
>>>> +               if (!cycle_between(interval_start, cycles, now))
>>>> +                       return -EOPNOTSUPP;
>>>
>>> So. I've not fully thought this out, but it seems like it would be
>>> quite likely that you'd run into the case where the cycle_last value
>>> is updated and your earlier TSC timestamp isn't valid for the current
>>> interval. The get_device_system_crosststamp() logic has a big chunk of
>>> complex code to try to handle this case by interpolating the cycle
>>> value back in time. How well does just failing in this case work out?
>>>
>>
>> For the case, perf fallback to the time captured in the NMI handler, via
>> ktime_get_mono_fast_ns().
> 
> This feels like *very* subtle behavior. Maybe I'm misunderstanding,
> but the goal seems to be to have more accurate timestamps on the hw
> events, and using the captured tsc timestamp avoids the measuring
> latency reading the time again. But if every timekeeping update
> interval (~tick) you transparently get a delayed value due to the
> fallback, it makes it hard to understand which timestamps are better
> or worse. The latency between two reads may be real or it may be just
> bad luck. This doesn't intuitively seem like a great benefit over more
> consistent latency of just using the ktime_get_mono_fast()
> timestamping.

Your understanding is correct. We want a more accurate timestamp for the
analysis work.

My understanding is that the timekeeping update should not happen very
often. If I read the code correctly, it should happen only when
adjusting NTP or suspending/resuming. If so, I think the drawback should
not impact normal analysis work. I will call out the drawbacks in the
comments where the function is used.

> 
>> The TSC in PEBS is captured by HW when the sample was generated. There
>> should be a small delta compared with the time captured in the NMI
>> handler. But I think the delta should be acceptable as a backup solution
>> for the most analysis cases. Also, I don't think the case (the
>> cycle_last value is updated during the monitoring) should occur very
>> often either. So I drop the history support to simplify the function.
> 
> So the reads and this function are *always* used in NMI context?   Has
> this been stressed with things like SMIs to see how it does if
> interrupted in those cases?

Yes, it's *always* and only used in NMI context.

> 
> My worry is that (as I bored everyone earlier), the
> ktime_get_*_fast_ns() interfaces already have some sharp edges and
> need a fair amount of thought as to when they should be used. This is
> sort of compounding that adding an interface that has further special
> cases where it can fail, making it difficult to fully understand and
> easier to accidentally misuse.
> 
> My other concern is that interfaces always get stretched and used
> beyond anything they were initially planned for (see the
> ktime_get_*fast_ns() interfaces here as an example! :), and in this
> case the logic seems to have lots of implicit dependencies on the
> facts of your specific use case, so it seems a bit fragile should
> folks on other architectures with other constraints try to use it.
> 
> So I just want to push a bit to think how you might be able to
> extend/generalize the existing get_system_crosststamp for your
> purposes, or alternatively find a way to simplify the logic's behavior
> so its less tied to specific constraints ("this works most of the time
> from NMI, but otherwise no promises").  Or at least some better
> documentation around the code, its uses and its constraints? ( "NMI
> safe" is not the same as "Only safe to use from NMI" :)

Since our usage is fixed (only in NMI), I prefer the latter. I think
extending/generalizing the existing function only makes it extremely
complex and inefficient. The new function should have the same
constraints as the existing ktime_get_mono_fast_ns(). Since perf can
live with ktime_get_mono_fast_ns(), the constraints should pose no
problem for the new function either. I will add more comments to
clarify the usage and constraints to avoid abuse of the new function.

Thanks,
Kan
  
John Stultz Jan. 24, 2023, 8:33 p.m. UTC | #8
On Tue, Jan 24, 2023 at 12:13 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
> On 2023-01-24 1:43 p.m., John Stultz wrote:
> > On Tue, Jan 24, 2023 at 7:09 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >> On 2023-01-24 2:01 a.m., John Stultz wrote:
> >>> On Mon, Jan 23, 2023 at 10:27 AM <kan.liang@linux.intel.com> wrote:
> >>>> +               /*
> >>>> +                * Check whether the given timestamp is on the current
> >>>> +                * timekeeping interval.
> >>>> +                */
> >>>> +               now = tk_clock_read(tkr);
> >>>> +               interval_start = tkr->cycle_last;
> >>>> +               if (!cycle_between(interval_start, cycles, now))
> >>>> +                       return -EOPNOTSUPP;
> >>>
> >>> So. I've not fully thought this out, but it seems like it would be
> >>> quite likely that you'd run into the case where the cycle_last value
> >>> is updated and your earlier TSC timestamp isn't valid for the current
> >>> interval. The get_device_system_crosststamp() logic has a big chunk of
> >>> complex code to try to handle this case by interpolating the cycle
> >>> value back in time. How well does just failing in this case work out?
> >>>
> >>
> >> For the case, perf fallback to the time captured in the NMI handler, via
> >> ktime_get_mono_fast_ns().
> >
> > This feels like *very* subtle behavior. Maybe I'm misunderstanding,
> > but the goal seems to be to have more accurate timestamps on the hw
> > events, and using the captured tsc timestamp avoids the measuring
> > latency reading the time again. But if every timekeeping update
> > interval (~tick) you transparently get a delayed value due to the
> > fallback, it makes it hard to understand which timestamps are better
> > or worse. The latency between two reads may be real or it may be just
> > bad luck. This doesn't intuitively seem like a great benefit over more
> > consistent latency of just using the ktime_get_mono_fast()
> > timestamping.
>
> Your understand is correct. We want a more accurate timestamp for the
> analysis work.
>
> As my understanding, the timekeeping update should not be very often. If

"Often" depends on your timescale.

> I read the code correctly, it should happen only when adjusting NTP or
> suspending/resuming. If so, I think the drawback should not impact the
> normal analysis work. I will call out the drwabacks in the comments
> where the function is used.

So the adjustments are done at tick time depending on the current NTP
"error" (basically what the kernel tracks as the delta from its sense
of what NTP has told us).

Not just at the time when ntp makes an adjustment.

So the window for it to happen is every timekeeping update (which is ~HZ).


> >> The TSC in PEBS is captured by HW when the sample was generated. There
> >> should be a small delta compared with the time captured in the NMI
> >> handler. But I think the delta should be acceptable as a backup solution
> >> for the most analysis cases. Also, I don't think the case (the
> >> cycle_last value is updated during the monitoring) should occur very
> >> often either. So I drop the history support to simplify the function.
> >
> > So the reads and this function are *always* used in NMI context?   Has
> > this been stressed with things like SMIs to see how it does if
> > interrupted in those cases?
>
> Yes, it's *always* and only used in NMI context.

Thanks, that is helpful to clarify.

> > My worry is that (as I bored everyone earlier), the
> > ktime_get_*_fast_ns() interfaces already have some sharp edges and
> > need a fair amount of thought as to when they should be used. This is
> > sort of compounding that adding an interface that has further special
> > cases where it can fail, making it difficult to fully understand and
> > easier to accidentally misuse.
> >
> > My other concern is that interfaces always get stretched and used
> > beyond anything they were initially planned for (see the
> > ktime_get_*fast_ns() interfaces here as an example! :), and in this
> > case the logic seems to have lots of implicit dependencies on the
> > facts of your specific use case, so it seems a bit fragile should
> > folks on other architectures with other constraints try to use it.
> >
> > So I just want to push a bit to think how you might be able to
> > extend/generalize the existing get_system_crosststamp for your
> > purposes, or alternatively find a way to simplify the logic's behavior
> > so its less tied to specific constraints ("this works most of the time
> > from NMI, but otherwise no promises").  Or at least some better
> > documentation around the code, its uses and its constraints? ( "NMI
> > safe" is not the same as "Only safe to use from NMI" :)
>
> Since our usage is fixed (only in NMI), I prefer the latter. I think
> extending/generalizing the existing function only makes the function
> extremely complex and low efficient. The new function should have the
> same constraints as the existing ktime_get_mono_fast_ns(). Since perf
> can live with the ktime_get_mono_fast_ns(), there should be no problem
> with the new function for the constraints. I will add more comments to
> clarify the usage and constraints to avoid the abuse of the new function.

I agree the existing function is complex, so adding more complexity
isn't ideal, but potentially breaking it up or reworking it might be
better. Having two similar but different implementations is also a
complexity. So I just want to make sure this is well considered.  But
clearer documentation as a first step will help.

Thanks
-john
  
Liang, Kan Jan. 24, 2023, 10:08 p.m. UTC | #9
On 2023-01-24 3:33 p.m., John Stultz wrote:
> On Tue, Jan 24, 2023 at 12:13 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
>> On 2023-01-24 1:43 p.m., John Stultz wrote:
>>> On Tue, Jan 24, 2023 at 7:09 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>>> On 2023-01-24 2:01 a.m., John Stultz wrote:
>>>>> On Mon, Jan 23, 2023 at 10:27 AM <kan.liang@linux.intel.com> wrote:
>>>>>> +               /*
>>>>>> +                * Check whether the given timestamp is on the current
>>>>>> +                * timekeeping interval.
>>>>>> +                */
>>>>>> +               now = tk_clock_read(tkr);
>>>>>> +               interval_start = tkr->cycle_last;
>>>>>> +               if (!cycle_between(interval_start, cycles, now))
>>>>>> +                       return -EOPNOTSUPP;
>>>>> So. I've not fully thought this out, but it seems like it would be
>>>>> quite likely that you'd run into the case where the cycle_last value
>>>>> is updated and your earlier TSC timestamp isn't valid for the current
>>>>> interval. The get_device_system_crosststamp() logic has a big chunk of
>>>>> complex code to try to handle this case by interpolating the cycle
>>>>> value back in time. How well does just failing in this case work out?
>>>>>
>>>> For the case, perf fallback to the time captured in the NMI handler, via
>>>> ktime_get_mono_fast_ns().
>>> This feels like *very* subtle behavior. Maybe I'm misunderstanding,
>>> but the goal seems to be to have more accurate timestamps on the hw
>>> events, and using the captured tsc timestamp avoids the measuring
>>> latency reading the time again. But if every timekeeping update
>>> interval (~tick) you transparently get a delayed value due to the
>>> fallback, it makes it hard to understand which timestamps are better
>>> or worse. The latency between two reads may be real or it may be just
>>> bad luck. This doesn't intuitively seem like a great benefit over more
>>> consistent latency of just using the ktime_get_mono_fast()
>>> timestamping.
>> Your understand is correct. We want a more accurate timestamp for the
>> analysis work.
>>
>> As my understanding, the timekeeping update should not be very often. If
> "Often" depends on your your timescale.
> 
>> I read the code correctly, it should happen only when adjusting NTP or
>> suspending/resuming. If so, I think the drawback should not impact the
>> normal analysis work. I will call out the drwabacks in the comments
>> where the function is used.
> So the adjustments are done at tick time depending on the current NTP
> "error" (basically what the kernel tracks as the delta from its sense
> of what NTP has told us).
> 
> Not just at the time when ntp makes an adjustment.
> 
> So the window for it to happen is every timekeeping update (which is ~HZ).

You mean the tkf->base is updated in each tick?
If so, that should be a problem.

Does the NTP "error" consistently happen on all platforms? Or just on
some platforms, which we could check for and limit the usage to?

There are two configurations for PEBS: single PEBS and large PEBS.
In single PEBS mode, HW triggers an NMI for each record. The TSC is the
time at which the record was generated, and we convert it in the same
NMI. I don't think the NTP "error" can impact that.

But in large PEBS mode, HW triggers an NMI only when a fixed number
of records has been generated. It's very likely that the base has been
updated between the records. That could bring trouble, since we can
only fall back to the NMI handler time of the last record. I'm afraid
we have to disable the large PEBS mode.

Stephane,

What do you think?
Is it still useful for you if we only support the single PEBS?


Thanks,
Kan
  
John Stultz Jan. 24, 2023, 10:40 p.m. UTC | #10
On Tue, Jan 24, 2023 at 2:08 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
> On 2023-01-24 3:33 p.m., John Stultz wrote:
> > On Tue, Jan 24, 2023 at 12:13 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >> On 2023-01-24 1:43 p.m., John Stultz wrote:
> >> I read the code correctly, it should happen only when adjusting NTP or
> >> suspending/resuming. If so, I think the drawback should not impact the
> >> normal analysis work. I will call out the drwabacks in the comments
> >> where the function is used.
> > So the adjustments are done at tick time depending on the current NTP
> > "error" (basically what the kernel tracks as the delta from its sense
> > of what NTP has told us).
> >
> > Not just at the time when ntp makes an adjustment.
> >
> > So the window for it to happen is every timekeeping update (which is ~HZ).
>
> You mean the tkf->base is updated in each tick?

Right, every call to timekeeping_update(), which is called for every
major state change in time along with regular updates via
timekeeping_advance() which is called from update_wall_time() which
gets called from the various tick handlers.

> If so, that should be a problem.
>
> Does the NTP "error" consistently happen on all the platforms? Or just
> on some platforms which we can check and limit the usage?

Basically any platform where adjtimex was ever called.  Both offset
and freq adjustments will trigger this. The trouble is that the
requested adjustments are finer grained than what the actual
clocksource's mult value can express. So the kernel tracks the
error internally and will oscillate the clocksource multiplier value
to keep the long-term accuracy close to what was requested.

> There are two configurations for PEBS, single PEBS, and large PEBS.
> For the single PEBS mode, HW triggers a NMI for each record. The TSC is
> the time which the record is generated, and we convert it in the same
> NMI. I don't think the NTP "error" can impact it.

If the clocksource is being adjusted in between reads (imagine the
freq is dropped), when using the fast accessors, time may seem to go
backwards between those reads.

In fact, if the same TSC value is used for two calls in a row, you
could see time go backwards or forwards between them if the adjustment
happened in between.

The regular (non-fast) clocksource accessors use the seqcount to
prevent reads that race with an adjustment from being used (you'll
note we always re-read the clocksource within the seqcount loop),
but this makes them not NMI safe.


> But for the large PEBS mode, HW triggers a NMI only when a fixed number
> of records are generated. It's very likely that the base has been
> updated between the records. That could bring troubles, since we can
> only fall back to the NMI handler time of the last record. I'm afraid we
> have to disable the large PEBS mode.
>

Right, if I recall, this is why the interpolation is done in the
existing get_device_system_crosststamp(). It's obviously never going
to be perfect as we don't keep the timekeeper state for every tick,
but it might be better than the bailout you have if you didn't luck
into the current interval.

My apologies, as my intention isn't to dissuade you here, but just to
make sure the complexities (and unfortunately there are a lot) are
well worked out and well tested.

thanks
-john
  
Liang, Kan Jan. 25, 2023, 4:44 p.m. UTC | #11
On 2023-01-24 5:40 p.m., John Stultz wrote:
> On Tue, Jan 24, 2023 at 2:08 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
>> On 2023-01-24 3:33 p.m., John Stultz wrote:
>>> On Tue, Jan 24, 2023 at 12:13 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>>> On 2023-01-24 1:43 p.m., John Stultz wrote:
>>>> I read the code correctly, it should happen only when adjusting NTP or
>>>> suspending/resuming. If so, I think the drawback should not impact the
>>>> normal analysis work. I will call out the drwabacks in the comments
>>>> where the function is used.
>>> So the adjustments are done at tick time depending on the current NTP
>>> "error" (basically what the kernel tracks as the delta from its sense
>>> of what NTP has told us).
>>>
>>> Not just at the time when ntp makes an adjustment.
>>>
>>> So the window for it to happen is every timekeeping update (which is ~HZ).
>>
>> You mean the tkf->base is updated in each tick?
> 
> Right, every call to timekeeping_update(), which is called for every
> major state change in time along with regular updates via
> timekeeping_advance() which is called from update_wall_time() which
> gets called from the various tick handlers.
> 
>> If so, that should be a problem.
>>
>> Does the NTP "error" consistently happen on all the platforms? Or just
>> on some platforms which we can check and limit the usage?
> 
> Basically any platform where adjtimex was ever called.  Both offset
> and freq adjustments will trigger this. The trouble is that the
> adjustments requested are finer grained than what the actual
> clocksources mult values can be adjusted for. So the kernel tracks the
> error internally and will oscillate the clocksource multiplier value
> to keep the long-term accuracy close to what was requested.
> 
>> There are two configurations for PEBS, single PEBS, and large PEBS.
>> For the single PEBS mode, HW triggers a NMI for each record. The TSC is
>> the time which the record is generated, and we convert it in the same
>> NMI. I don't think the NTP "error" can impact it.
> 
> If the clocksource is being adjusted in between reads (imagine the
> freq is dropped), when using the fast accessors, time may seem to go
> backwards between those reads.
> 
> In fact, if the same TSC value is used for two calls in a row, you
> could see time go backwards or forwards between them if the adjustment
> happened in between.
> 
> The regular (non-fast) clocksource accessors use the seqcount to
> prevent reads that occur while the adjustment is made from being
> used (you'll note we always re-read the clocksource within the
> seqcount loop) but this makes it not NMI safe.
> 
> 
>> But for the large PEBS mode, HW triggers a NMI only when a fixed number
>> of records are generated. It's very likely that the base has been
>> updated between the records. That could cause trouble, since we can
>> only fall back to the NMI handler time of the last record. I'm afraid we
>> have to disable the large PEBS mode.
>>
> 
> Right, if I recall, this is why the interpolation is done in the
> existing get_device_system_crosststamp(). It's obviously never going
> to be perfect as we don't keep the timekeeper state for every tick,
> but it might be better than the bailout you have if you didn't luck
> into the current interval.
> 
> My apologies, as my intention isn't to dissuade you here, but just to
> make sure the complexities (and unfortunately there are a lot) are
> well worked out and well tested.
> 

Thank you very much for your patience and all the details, I really
appreciate it. I think I understand the problem and the concerns now. I
will reevaluate the current design and try to find a proper solution.

Thanks,
Kan
  
Thomas Gleixner Jan. 27, 2023, 1:30 p.m. UTC | #12
On Tue, Jan 24 2023 at 01:10, Stephane Eranian wrote:
> On Tue, Jan 24, 2023 at 12:52 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> > It's useful to provide a NMI safe function to convert a given time to
>> > monotonic. For example, the perf_event subsystem wants to convert a TSC
>> > of a PEBS record to a monotonic clock in a NMI handler.
>>
>> Why? That's a postprocessing problem, really.
>>
> Because you want to correlate samples captured by PEBS with samples
> from applications timestamped with a user available clock such as
> CLOCK_MONOTONIC, for instance.

Sure. Postprocessing can do that if the kernel provides the relevant
conversion information. Absolutely no need for doing this in the kernel.

> When I create a perf_event event and I stipulate that
> event_attr.clockid=MONOTONIC, I expect all the samples from that event
> to be timestamped using the same clock source, regardless of PEBS or
> IBS.

Just because the kernel can do the conversion, there is no requirement
to actually do so. Instrumentation has to be as lightweight as possible
and a lot of instrumentation goes the extra mile of doing postprocessing
exactly for this reason.

So if we implement this conversion in the kernel then there needs to be
a compelling technical reason to do so.

So far the provided arguments are in the 'I want a pony' realm.

Thanks,

        tglx
  

Patch

diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index fe1e467ba046..234fa87a846b 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -289,6 +289,15 @@  extern int get_device_system_crosststamp(
 			struct system_time_snapshot *history,
 			struct system_device_crosststamp *xtstamp);
 
+/*
+ * Fast NMI safe way to convert a given timestamp to clock monotonic
+ */
+extern int get_mono_fast_from_given_time(int (*get_time_fn)
+					      (struct system_counterval_t *sys_counterval,
+					      void *ctx),
+					 void *ctx,
+					 u64 *mono_ns);
+
 /*
  * Simultaneously snapshot realtime and monotonic raw clocks
  */
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 5579ead449f2..5bd32b2981fd 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -431,14 +431,19 @@  static void update_fast_timekeeper(const struct tk_read_base *tkr,
 	memcpy(base + 1, base, sizeof(*base));
 }
 
-static __always_inline u64 fast_tk_get_delta_ns(struct tk_read_base *tkr)
+static __always_inline u64 fast_tk_get_delta_ns_from_cycles(struct tk_read_base *tkr,
+							    u64 cycles)
 {
-	u64 delta, cycles = tk_clock_read(tkr);
+	u64 delta = clocksource_delta(cycles, tkr->cycle_last, tkr->mask);
 
-	delta = clocksource_delta(cycles, tkr->cycle_last, tkr->mask);
 	return timekeeping_delta_to_ns(tkr, delta);
 }
 
+static __always_inline u64 fast_tk_get_delta_ns(struct tk_read_base *tkr)
+{
+	return fast_tk_get_delta_ns_from_cycles(tkr, tk_clock_read(tkr));
+}
+
 static __always_inline u64 __ktime_get_fast_ns(struct tk_fast *tkf)
 {
 	struct tk_read_base *tkr;
@@ -1303,6 +1308,63 @@  int get_device_system_crosststamp(int (*get_time_fn)
 }
 EXPORT_SYMBOL_GPL(get_device_system_crosststamp);
 
+/**
+ * get_mono_fast_from_given_time - Fast NMI safe access to convert a given
+ *				    timestamp to clock monotonic
+ * @get_time_fn:	Callback to get the given time and its clocksource
+ * @ctx:		Context passed to get_time_fn()
+ * @mono_ns:		The monotonic time of the given time
+ */
+int notrace get_mono_fast_from_given_time(int (*get_time_fn)
+						(struct system_counterval_t *sys_counterval,
+						void *ctx),
+					  void *ctx,
+					  u64 *mono_ns)
+{
+	struct system_counterval_t system_counterval;
+	struct tk_fast *tkf = &tk_fast_mono;
+	u64 cycles, now, interval_start;
+	struct tk_read_base *tkr;
+	unsigned int seq;
+	int ret;
+
+	do {
+		seq = raw_read_seqcount_latch(&tkf->seq);
+		tkr = tkf->base + (seq & 0x01);
+
+		ret = get_time_fn(&system_counterval, ctx);
+		if (ret)
+			return ret;
+
+		/*
+		 * Verify that the clocksource associated with the given
+		 * timestamp is the same as the currently installed
+		 * timekeeper clocksource
+		 */
+		if (tkr->clock != system_counterval.cs)
+			return -EOPNOTSUPP;
+		cycles = system_counterval.cycles;
+
+		/*
+		 * Check whether the given timestamp is on the current
+		 * timekeeping interval.
+		 */
+		now = tk_clock_read(tkr);
+		interval_start = tkr->cycle_last;
+		if (!cycle_between(interval_start, cycles, now))
+			return -EOPNOTSUPP;
+
+		now = ktime_to_ns(tkr->base);
+		now += fast_tk_get_delta_ns_from_cycles(tkr, cycles);
+
+	} while (read_seqcount_latch_retry(&tkf->seq, seq));
+
+	*mono_ns = now;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(get_mono_fast_from_given_time);
+
 /**
  * do_settimeofday64 - Sets the time of day.
  * @ts:     pointer to the timespec64 variable containing the new time