[3/3] perf/x86/intel/ds: Support monotonic clock for PEBS

Message ID 20230123182728.825519-4-kan.liang@linux.intel.com
State New
Series: Convert TSC to monotonic clock for PEBS

Commit Message

Liang, Kan Jan. 23, 2023, 6:27 p.m. UTC
  From: Kan Liang <kan.liang@linux.intel.com>

Users want to reconcile user-space samples with PEBS samples, which
requires a common clock source. However, the current PEBS code only
converts to sched_clock, which is not available from user space.

Only support converting to the monotonic clock. Having one common clock
source is good enough to fulfill the requirement.

Enable large PEBS for the monotonic clock to reduce the PEBS
overhead.

There are a few rare cases that may make the conversion fail, e.g., a
TSC overflow, or cycle_last changing between samples. The time then
falls back to the less accurate SW times, but these cases are
extremely unlikely to happen.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---

The patch has to be applied on top of the patch below:
https://lore.kernel.org/all/20230123172027.125385-1-kan.liang@linux.intel.com/

 arch/x86/events/intel/core.c |  2 +-
 arch/x86/events/intel/ds.c   | 30 ++++++++++++++++++++++++++----
 2 files changed, 27 insertions(+), 5 deletions(-)
  

Comments

John Stultz Jan. 24, 2023, 6:56 a.m. UTC | #1
On Mon, Jan 23, 2023 at 10:27 AM <kan.liang@linux.intel.com> wrote:
>
> From: Kan Liang <kan.liang@linux.intel.com>
>
> Users want to reconcile user-space samples with PEBS samples, which
> requires a common clock source. However, the current PEBS code only
> converts to sched_clock, which is not available from user space.
>
> Only support converting to the monotonic clock. Having one common clock
> source is good enough to fulfill the requirement.
>
> Enable large PEBS for the monotonic clock to reduce the PEBS
> overhead.
>
> There are a few rare cases that may make the conversion fail, e.g., a
> TSC overflow, or cycle_last changing between samples. The time then
> falls back to the less accurate SW times, but these cases are
> extremely unlikely to happen.
>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> ---

Thanks for sending this out!
A few minor style issues below and a warning.

> The patch has to be on top of the below patch
> https://lore.kernel.org/all/20230123172027.125385-1-kan.liang@linux.intel.com/
>
>  arch/x86/events/intel/core.c |  2 +-
>  arch/x86/events/intel/ds.c   | 30 ++++++++++++++++++++++++++----
>  2 files changed, 27 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 14f0a746257d..ea194556cc73 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -3777,7 +3777,7 @@ static unsigned long intel_pmu_large_pebs_flags(struct perf_event *event)
>  {
>         unsigned long flags = x86_pmu.large_pebs_flags;
>
> -       if (event->attr.use_clockid)
> +       if (event->attr.use_clockid && (event->attr.clockid != CLOCK_MONOTONIC))
>                 flags &= ~PERF_SAMPLE_TIME;
>         if (!event->attr.exclude_kernel)
>                 flags &= ~PERF_SAMPLE_REGS_USER;
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index 7980e92dec64..d7f0eaf4405c 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -1570,13 +1570,33 @@ static u64 get_data_src(struct perf_event *event, u64 aux)
>         return val;
>  }
>
> +static int pebs_get_synctime(struct system_counterval_t *system,
> +                            void *ctx)

Just because the abstract function type taken by
get_mono_fast_from_given_time is vague doesn't mean the
implementation needs to be.
ctx is really a tsc value, right? So let's call it that to make this a
bit more readable.

> +{
> +       *system = set_tsc_system_counterval(*(u64 *)ctx);
> +       return 0;
> +}
> +
> +static inline int pebs_clockid_time(clockid_t clk_id, u64 tsc, u64 *clk_id_time)

clk_id_time is maybe a bit too fuzzy. It is really a mono_ns value,
right? Let's keep that explicit here.

> +{
> +       /* Only support converting to clock monotonic */
> +       if (clk_id != CLOCK_MONOTONIC)
> +               return -EINVAL;
> +
> +       return get_mono_fast_from_given_time(pebs_get_synctime, &tsc, clk_id_time);
> +}
> +
>  static void setup_pebs_time(struct perf_event *event,
>                             struct perf_sample_data *data,
>                             u64 tsc)
>  {
> -       /* Converting to a user-defined clock is not supported yet. */
> -       if (event->attr.use_clockid != 0)
> -               return;
> +       u64 time;

Again, "time" is too generic a term without any context here.
mono_nsec or something would be more clear.

> +
> +       if (event->attr.use_clockid != 0) {
> +               if (pebs_clockid_time(event->attr.clockid, tsc, &time))
> +                       return;
> +               goto done;
> +       }

Apologies for this warning/rant:

So, I do get that the NMI safety of the "fast" time accessors (along
with the "high performance"-sounding name!) is attractive, but as
their use expands I worry the downsides of this interface aren't made
clear enough.

The fast accessors *can* see time discontinuities! Because the logic
is done without holding the tk_core.seq lock, if you are reading in
the middle of an NTP adjustment, you may find the current value to be
larger than the next time you read the time.  These discontinuities
are likely to be very small, but a negative delta will look very large
as a u64.  So part of using these "fast *and unsafe*" interfaces is
that you get to keep both pieces when it breaks. Make sure the code
here that is using these interfaces guards against this (zeroing out
negative deltas).
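To illustrate the point (a user-space sketch; the function name is hypothetical, not from the patch): when two reads of a fast monotonic clock can be momentarily inverted, a raw u64 subtraction turns the tiny negative step into an enormous interval, so the consumer has to clamp it:

```c
#include <stdint.h>

/*
 * Hypothetical guard: elapsed ns between two reads of a "fast"
 * monotonic clock that may be momentarily non-monotonic.  A raw
 * (later - earlier) on u64 would wrap a small negative step into a
 * value near 2^64, so clamp negative deltas to zero instead.
 */
static uint64_t mono_fast_delta(uint64_t earlier, uint64_t later)
{
	return later >= earlier ? later - earlier : 0;
}
```

For example, if a small NTP-adjustment inversion makes the second read 990 when the first was 1000, the raw u64 subtraction would yield 18446744073709551606, while the clamped delta is 0.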

thanks
-john
  
Liang, Kan Jan. 24, 2023, 3:17 p.m. UTC | #2
On 2023-01-24 1:56 a.m., John Stultz wrote:
> On Mon, Jan 23, 2023 at 10:27 AM <kan.liang@linux.intel.com> wrote:
>>
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> Users want to reconcile user-space samples with PEBS samples, which
>> requires a common clock source. However, the current PEBS code only
>> converts to sched_clock, which is not available from user space.
>>
>> Only support converting to the monotonic clock. Having one common clock
>> source is good enough to fulfill the requirement.
>>
>> Enable large PEBS for the monotonic clock to reduce the PEBS
>> overhead.
>>
>> There are a few rare cases that may make the conversion fail, e.g., a
>> TSC overflow, or cycle_last changing between samples. The time then
>> falls back to the less accurate SW times, but these cases are
>> extremely unlikely to happen.
>>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> ---
> 
> Thanks for sending this out!
> A few minor style issues below and a warning.

Thanks.

> 
>> The patch has to be on top of the below patch
>> https://lore.kernel.org/all/20230123172027.125385-1-kan.liang@linux.intel.com/
>>
>>  arch/x86/events/intel/core.c |  2 +-
>>  arch/x86/events/intel/ds.c   | 30 ++++++++++++++++++++++++++----
>>  2 files changed, 27 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>> index 14f0a746257d..ea194556cc73 100644
>> --- a/arch/x86/events/intel/core.c
>> +++ b/arch/x86/events/intel/core.c
>> @@ -3777,7 +3777,7 @@ static unsigned long intel_pmu_large_pebs_flags(struct perf_event *event)
>>  {
>>         unsigned long flags = x86_pmu.large_pebs_flags;
>>
>> -       if (event->attr.use_clockid)
>> +       if (event->attr.use_clockid && (event->attr.clockid != CLOCK_MONOTONIC))
>>                 flags &= ~PERF_SAMPLE_TIME;
>>         if (!event->attr.exclude_kernel)
>>                 flags &= ~PERF_SAMPLE_REGS_USER;
>> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
>> index 7980e92dec64..d7f0eaf4405c 100644
>> --- a/arch/x86/events/intel/ds.c
>> +++ b/arch/x86/events/intel/ds.c
>> @@ -1570,13 +1570,33 @@ static u64 get_data_src(struct perf_event *event, u64 aux)
>>         return val;
>>  }
>>
>> +static int pebs_get_synctime(struct system_counterval_t *system,
>> +                            void *ctx)
> 
> Just because the abstract function type taken by
> get_mono_fast_from_given_time is vague doesn't mean the
> implementation needs to be.
> ctx is really a tsc value, right? So let's call it that to make this a
> bit more readable.

Sure, I will rename ctx to tsc.

> 
>> +{
>> +       *system = set_tsc_system_counterval(*(u64 *)ctx);
>> +       return 0;
>> +}
>> +
>> +static inline int pebs_clockid_time(clockid_t clk_id, u64 tsc, u64 *clk_id_time)
> 
> clk_id_time is maybe a bit too fuzzy. It is really a mono_ns value,
> right? Let's keep that explicit here.

Yes. Will make it explicit.

> 
>> +{
>> +       /* Only support converting to clock monotonic */
>> +       if (clk_id != CLOCK_MONOTONIC)
>> +               return -EINVAL;
>> +
>> +       return get_mono_fast_from_given_time(pebs_get_synctime, &tsc, clk_id_time);
>> +}
>> +
>>  static void setup_pebs_time(struct perf_event *event,
>>                             struct perf_sample_data *data,
>>                             u64 tsc)
>>  {
>> -       /* Converting to a user-defined clock is not supported yet. */
>> -       if (event->attr.use_clockid != 0)
>> -               return;
>> +       u64 time;
> 
> Again, "time" is too generic a term without any context here.
> mono_nsec or something would be more clear.

Sure.

> 
>> +
>> +       if (event->attr.use_clockid != 0) {
>> +               if (pebs_clockid_time(event->attr.clockid, tsc, &time))
>> +                       return;
>> +               goto done;
>> +       }
> 
> Apologies for this warning/rant:
> 
> So, I do get that the NMI safety of the "fast" time accessors (along
> with the "high performance"-sounding name!) is attractive, but as
> their use expands I worry the downsides of this interface aren't made
> clear enough.
> 
> The fast accessors *can* see time discontinuities! Because the logic
> is done without holding the tk_core.seq lock, if you are reading in
> the middle of an NTP adjustment, you may find the current value to be
> larger than the next time you read the time.  These discontinuities
> are likely to be very small, but a negative delta will look very large
> as a u64.  So part of using these "fast *and unsafe*" interfaces is
> that you get to keep both pieces when it breaks. Make sure the code
> here that is using these interfaces guards against this (zeroing out
> negative deltas).
> 

Thanks for the warning.
I will add more comments and handle it specially here.

Thanks,
Kan
  

Patch

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 14f0a746257d..ea194556cc73 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3777,7 +3777,7 @@  static unsigned long intel_pmu_large_pebs_flags(struct perf_event *event)
 {
 	unsigned long flags = x86_pmu.large_pebs_flags;
 
-	if (event->attr.use_clockid)
+	if (event->attr.use_clockid && (event->attr.clockid != CLOCK_MONOTONIC))
 		flags &= ~PERF_SAMPLE_TIME;
 	if (!event->attr.exclude_kernel)
 		flags &= ~PERF_SAMPLE_REGS_USER;
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 7980e92dec64..d7f0eaf4405c 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1570,13 +1570,33 @@  static u64 get_data_src(struct perf_event *event, u64 aux)
 	return val;
 }
 
+static int pebs_get_synctime(struct system_counterval_t *system,
+			     void *ctx)
+{
+	*system = set_tsc_system_counterval(*(u64 *)ctx);
+	return 0;
+}
+
+static inline int pebs_clockid_time(clockid_t clk_id, u64 tsc, u64 *clk_id_time)
+{
+	/* Only support converting to clock monotonic */
+	if (clk_id != CLOCK_MONOTONIC)
+		return -EINVAL;
+
+	return get_mono_fast_from_given_time(pebs_get_synctime, &tsc, clk_id_time);
+}
+
 static void setup_pebs_time(struct perf_event *event,
 			    struct perf_sample_data *data,
 			    u64 tsc)
 {
-	/* Converting to a user-defined clock is not supported yet. */
-	if (event->attr.use_clockid != 0)
-		return;
+	u64 time;
+
+	if (event->attr.use_clockid != 0) {
+		if (pebs_clockid_time(event->attr.clockid, tsc, &time))
+			return;
+		goto done;
+	}
 
 	/*
 	 * Converting the TSC to perf time is only supported,
@@ -1587,8 +1607,10 @@  static void setup_pebs_time(struct perf_event *event,
 	 */
 	if (!using_native_sched_clock() || !sched_clock_stable())
 		return;
+	time = native_sched_clock_from_tsc(tsc) + __sched_clock_offset;
 
-	data->time = native_sched_clock_from_tsc(tsc) + __sched_clock_offset;
+done:
+	data->time = time;
 	data->sample_flags |= PERF_SAMPLE_TIME;
 }