[RFC,V2,4/9] perf/x86: Enable post-processing monotonic raw conversion
Commit Message
From: Kan Liang <kan.liang@linux.intel.com>
The raw HW time comes from the TSC on X86. Once hw_time is set with the
monotonic raw clock by the new perf tool, preload the raw HW time for each
sample. Also, dump the conversion information into the mmap_page.
For a legacy perf tool which doesn't know about hw_time, nothing is
changed.
Move the x86_pmu_sample_preload() call before setup_pebs_time() so it can
utilize the TSC from a PEBS record.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
---
arch/x86/events/core.c | 10 ++++++++++
arch/x86/events/intel/ds.c | 14 +++++++++++---
arch/x86/events/perf_event.h | 12 ++++++++++++
3 files changed, 33 insertions(+), 3 deletions(-)
Comments
Kan!
On Mon, Feb 13 2023 at 11:07, kan liang wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> + } else if (perf_event_hw_time(event)) {
> + struct ktime_conv mono;
> +
> + userpg->cap_user_time_mono_raw = 1;
> + ktime_get_fast_mono_raw_conv(&mono);
What guarantees that the clocksource used by the timekeeping core is
actually TSC? Nothing at all. You cannot make assumptions here.
Thanks,
tglx
On 2023-02-14 3:02 p.m., Thomas Gleixner wrote:
> Kan!
>
> On Mon, Feb 13 2023 at 11:07, kan liang wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>> + } else if (perf_event_hw_time(event)) {
>> + struct ktime_conv mono;
>> +
>> + userpg->cap_user_time_mono_raw = 1;
>> + ktime_get_fast_mono_raw_conv(&mono);
>
> What guarantees that the clocksource used by the timekeeping core is
> actually TSC? Nothing at all. You cannot make assumptions here.
>
Yes, you are right.
I will add a check to make sure the clocksource is TSC when perf does
the conversion.
Could you please comment on whether the patch is in the right direction?
This V2 patch series exposes the kernel's internal conversion information
to user space. Is that OK with you?
Thanks,
Kan
On Tue, Feb 14 2023 at 15:21, Kan Liang wrote:
> On 2023-02-14 3:02 p.m., Thomas Gleixner wrote:
>>
>> What guarantees that the clocksource used by the timekeeping core is
>> actually TSC? Nothing at all. You cannot make assumptions here.
>>
>
> Yes, you are right.
> I will add a check to make sure the clocksource is TSC when perf does
> the conversion.
>
> Could you please comment on whether the patch is in the right direction?
> This V2 patch series exposes the kernel's internal conversion information
> to user space. Is that OK with you?
Making the conversion info an ABI is suboptimal at best. I'm still
trying to wrap my brain around all of this. Will reply over there once
my confusion subsides.
Thanks,
tglx
Hi Thomas,
On 2023-02-14 3:55 p.m., Thomas Gleixner wrote:
> On Tue, Feb 14 2023 at 15:21, Kan Liang wrote:
>> On 2023-02-14 3:02 p.m., Thomas Gleixner wrote:
>>>
>>> What guarantees that the clocksource used by the timekeeping core is
>>> actually TSC? Nothing at all. You cannot make assumptions here.
>>>
>>
>> Yes, you are right.
>> I will add a check to make sure the clocksource is TSC when perf does
>> the conversion.
>>
>> Could you please comment on whether the patch is in the right direction?
>> This V2 patch series exposes the kernel's internal conversion information
>> to user space. Is that OK with you?
>
> Making the conversion info an ABI is suboptimal at best. I'm still
> trying to wrap my brain around all of this. Will reply over there once
> my confusion subsides.
>
John and I have tried a pure user-space solution (avoiding exposing
internal conversion info) to convert a given TSC to monotonic raw,
but it doesn't work well.
https://lore.kernel.org/lkml/CANDhNComKRDdZJ8SJECNdoAzQhmR3vu9yKAtp7NKDmECxff=fg@mail.gmail.com/
So, for now, there seem to be only two solutions.
Solution 1: Do the conversion in the kernel (similar to V1).
https://lore.kernel.org/lkml/20230123182728.825519-1-kan.liang@linux.intel.com/
Solution 2: Expose the internal conversion information to user space
via perf mmap and do the conversion in post-processing. (Implemented in this V2.)
Personally, I'm inclined toward solution 1, because:
- The current monotonic raw time is calculated in the kernel as well.
Solution 1 just follows the existing method and doesn't introduce
extra overhead.
- It avoids exposing the internal timekeeping state directly to userspace.
What do you think? Are there other directions I can explore?
Thanks,
Kan
@@ -2740,6 +2740,16 @@ void arch_perf_update_userpage(struct perf_event *event,
if (!event->attr.use_clockid) {
userpg->cap_user_time_zero = 1;
userpg->time_zero = offset;
+ } else if (perf_event_hw_time(event)) {
+ struct ktime_conv mono;
+
+ userpg->cap_user_time_mono_raw = 1;
+ ktime_get_fast_mono_raw_conv(&mono);
+ userpg->time_mono_last = mono.cycle_last;
+ userpg->time_mono_mult = mono.mult;
+ userpg->time_mono_shift = mono.shift;
+ userpg->time_mono_nsec = mono.xtime_nsec;
+ userpg->time_mono_base = mono.base;
}
cyc2ns_read_end();
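For reference, a hypothetical user-space consumer of the fields exported
above could perform the post-processing conversion roughly as follows. This
is a minimal sketch, not part of the patch: the struct and field names
mirror the time_mono_* fields in the diff, and the arithmetic follows the
usual clocksource mult/shift cycles-to-nanoseconds scaling.

```c
#include <stdint.h>

/* Snapshot of the conversion data read from the perf mmap page
 * (hypothetical layout mirroring the time_mono_* fields above). */
struct mono_raw_conv {
	uint64_t cycle_last;   /* userpg->time_mono_last */
	uint32_t mult;         /* userpg->time_mono_mult */
	uint32_t shift;        /* userpg->time_mono_shift */
	uint64_t xtime_nsec;   /* userpg->time_mono_nsec (already shifted) */
	uint64_t base;         /* userpg->time_mono_base, in ns */
};

/* Convert a raw TSC value from a sample to CLOCK_MONOTONIC_RAW ns. */
static uint64_t tsc_to_mono_raw(const struct mono_raw_conv *c, uint64_t tsc)
{
	/* Cycles elapsed since the timekeeper last updated its state. */
	uint64_t delta = tsc - c->cycle_last;
	/* 128-bit intermediate to avoid overflow in delta * mult. */
	uint64_t nsec = (uint64_t)(((unsigned __int128)delta * c->mult
				    + c->xtime_nsec) >> c->shift);

	return c->base + nsec;
}
```

This only stays valid as long as the snapshot is current; a real tool would
re-read the mmap page fields under the usual lock/seq retry protocol.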
@@ -1574,6 +1574,12 @@ static void setup_pebs_time(struct perf_event *event,
struct perf_sample_data *data,
u64 tsc)
{
+ u64 time = tsc;
+
+ /* Perf tool does the conversion. No conversion here. */
+ if (perf_event_hw_time(event))
+ goto done;
+
/* Converting to a user-defined clock is not supported yet. */
if (event->attr.use_clockid != 0)
return;
@@ -1588,7 +1594,9 @@ static void setup_pebs_time(struct perf_event *event,
if (!using_native_sched_clock() || !sched_clock_stable())
return;
- data->time = native_sched_clock_from_tsc(tsc) + __sched_clock_offset;
+ time = native_sched_clock_from_tsc(tsc) + __sched_clock_offset;
+done:
+ data->time = time;
data->sample_flags |= PERF_SAMPLE_TIME;
}
@@ -1733,6 +1741,8 @@ static void setup_pebs_fixed_sample_data(struct perf_event *event,
}
}
+ x86_pmu_sample_preload(data, event, cpuc);
+
/*
* v3 supplies an accurate time stamp, so we use that
* for the time stamp.
@@ -1741,8 +1751,6 @@ static void setup_pebs_fixed_sample_data(struct perf_event *event,
*/
if (x86_pmu.intel_cap.pebs_format >= 3)
setup_pebs_time(event, data, pebs->tsc);
-
- x86_pmu_sample_preload(data, event, cpuc);
}
static void adaptive_pebs_save_regs(struct pt_regs *regs,
@@ -1185,12 +1185,24 @@ int x86_pmu_handle_irq(struct pt_regs *regs);
void x86_pmu_show_pmu_cap(int num_counters, int num_counters_fixed,
u64 intel_ctrl);
+static inline bool perf_event_hw_time(struct perf_event *event)
+{
+ return (event->attr.hw_time &&
+ event->attr.use_clockid &&
+ (event->attr.clockid == CLOCK_MONOTONIC_RAW));
+}
+
static inline void x86_pmu_sample_preload(struct perf_sample_data *data,
struct perf_event *event,
struct cpu_hw_events *cpuc)
{
if (has_branch_stack(event))
perf_sample_save_brstack(data, event, &cpuc->lbr_stack);
+
+ if (perf_event_hw_time(event)) {
+ data->time = rdtsc();
+ data->sample_flags |= PERF_SAMPLE_TIME;
+ }
}
extern struct event_constraint emptyconstraint;
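Since the kernel updates the conversion data in the mmap page over time, a
user-space reader would need to snapshot the fields consistently. Below is
a hypothetical sketch of the retry loop, following the seqlock-style
convention perf already documents for the other cap_user_time fields; the
struct is a local stand-in, since the time_mono_* fields are not in
mainline headers.

```c
#include <stdint.h>

/* Local stand-in for the relevant perf_event_mmap_page fields; a real
 * tool would use struct perf_event_mmap_page from <linux/perf_event.h>.
 * The time_mono_* fields are hypothetical, taken from the diff above. */
struct mmap_page_stub {
	volatile uint32_t lock;
	uint64_t time_mono_last;
	uint32_t time_mono_mult;
	uint32_t time_mono_shift;
	uint64_t time_mono_nsec;
	uint64_t time_mono_base;
};

struct mono_snapshot {
	uint64_t last, nsec, base;
	uint32_t mult, shift;
};

/* Compiler barrier so the field reads are not reordered around ->lock. */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* Retry until a consistent snapshot is read: the kernel bumps ->lock
 * around updates, as with the existing time conversion fields. */
static void read_mono_conv(const struct mmap_page_stub *pg,
			   struct mono_snapshot *s)
{
	uint32_t seq;

	do {
		seq = pg->lock;
		barrier();
		s->last  = pg->time_mono_last;
		s->mult  = pg->time_mono_mult;
		s->shift = pg->time_mono_shift;
		s->nsec  = pg->time_mono_nsec;
		s->base  = pg->time_mono_base;
		barrier();
	} while (pg->lock != seq);
}
```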