perf print-events: make is_event_supported() more robust

Message ID 20240116170348.463479-1-mark.rutland@arm.com
State New
Headers
Series perf print-events: make is_event_supported() more robust |

Commit Message

Mark Rutland Jan. 16, 2024, 5:03 p.m. UTC
  Currently the perf tool doesn't deteect support for extneded event types
on Apple M1/M2 systems, and will not auto-expand plain PERF_EVENT_TYPE
hardware events into per-PMU events. This is due to the detection of
extended event types not handling mandatory filters required by the
M1/M2 PMU driver.

PMU drivers and the core perf_events code can require that
perf_event_attr::exclude_* filters are configured in a specific way and
may reject certain configurations of filters, for example:

(a) Many PMUs lack support for any event filtering, and require all
    perf_event_attr::exclude_* bits to be clear. This includes Alpha's
    CPU PMU, and ARM CPU PMUs prior to the introduction of PMUv2 in
    ARMv7,

(b) When /proc/sys/kernel/perf_event_paranoid >= 2, the perf core
    requires that perf_event_attr::exclude_kernel is set.

(c) The Apple M1/M2 PMU requires that perf_event_attr::exclude_guest is
    set as the hardware PMU does not count while a guest is running (but
    might be extended in future to do so).

In is_event_supported(), we try to account for cases (a) and (b), first
attempting to open an event without any filters, and if this fails,
retrying with perf_event_attr::exclude_kernel set. We do not account for
case (c), or any other filters that drivers could theoretically require
to be set.

Thus is_event_supported() will fail to detect support for any events
targetting an Apple M1/M2 PMU, even where events would be supported with
perf_event_attr:::exclude_guest set.

Since commit:

  82fe2e45cdb00de4 ("perf pmus: Check if we can encode the PMU number in perf_event_attr.type")

.. we use is_event_supported() to detect support for extended types,
with the PMU ID encoded into the perf_event_attr::type. As above, on an
Apple M1/M2 system this will always fail to detect that the event is
supported, and consequently we fail to detect support for extended types
even when these are supported, as they have been since commit:

  5c816728651ae425 ("arm_pmu: Add PERF_PMU_CAP_EXTENDED_HW_TYPE capability")

Due to this, the perf tool will not automatically expand plain
PERF_TYPE_HARDWARE events into per-PMU events, even when all the
necessary kernel support is present.

This patch updates is_event_supported() to additionally try opening
events with perf_event_attr::exclude_guest set, allowing support for
events to be detected on Apple M1/M2 systems. I beleive that this is
sufficient for all contemporary CPU PMU drivers, though in future it may
be necessary to check for other combinations of filter bits.

I've deliberately changed the check to not expect a specific error code
for missing filters, as today ;the kernel may return a number of
different error codes for missing filters (e.g. -EACCESS, -EINVAL, or
-EOPNOTSUPP) depending on why and where the filter configuration is
rejected, and retrying for any error is more robust.

Note that this does not remove the need for commit:

  a24d9d9dc096fc0d ("perf parse-events: Make legacy events lower priority than sysfs/JSON")

.. which is still necessary so that named-pmu/event/ events work on
kernels without extended type support, even if the event name happens to
be the same as a PERF_EVENT_TYPE_HARDWARE event (e.g. as is the case for
the M1/M2 PMU's 'cycles' and 'instructions' events).

Fixes: 82fe2e45cdb00de4 ("perf pmus: Check if we can encode the PMU number in perf_event_attr.type")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Hector Martin <marcan@marcan.st>
Cc: Ian Rogers <irogers@google.com>
Cc: James Clark <james.clark@arm.com>
Cc: John Garry <john.g.garry@oracle.com>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Thomas Richter <tmricht@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
---
 tools/perf/util/print-events.c | 27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

Hector, Marc, I'd appreciate if either of you could give this a spin on
your M1/M2 machines. I've given it local testing with the arm_pmuv3
driver modified to behave the same as the apple_m1_pmu driver (requiring
exclude_guest, having a 'cycles' event in sysfs), but that might not
perfectly replicate your setup.

The patch is based on the 'perf-tools-for-v6.8-1-2024-01-09' tag in the
perf-tools tree:

  https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools.git/

.. and I've pushed it out to the 'perf-tools/event-supported-filters'
branch in my tree:

  https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/

This patch *should* make it possible to do:

	perf stat -e cycles ./workload
	perf stat -e instructions ./workload

.. with those 'cycles' and 'instructions' events being automatically
expanded and reported as separate events per-PMU, which is a nice
quality-of-life improvement.

Comparing before and after this patch:

| # ./perf-before stat -e cycles true
| 
|  Performance counter stats for 'true':
| 
|      <not counted>      cycles                                                                  (0.00%)
| 
|        0.000990250 seconds time elapsed
| 
|        0.000934000 seconds user
|        0.000000000 seconds sys
| 
| # ./perf-after stat -e cycles true
| 
|  Performance counter stats for 'true':
| 
|             965175      armv8_pmuv3_0/cycles/                                                 
|      <not counted>      armv8_pmuv3_1/cycles/                                                   (0.00%)
|      <not counted>      armv8_pmuv3_2/cycles/                                                   (0.00%)
|      <not counted>      armv8_pmuv3_3/cycles/                                                   (0.00%)
| 
|        0.000836555 seconds time elapsed
| 
|        0.000884000 seconds user
|        0.000000000 seconds sys

This *shouldn't* change the interpetation of named-pmu events, e.g.

	perf stat -e apple_whichever_pmu/cycles/ ./workload

.. should behave the same as without this patch

Comparing before and after this patch:

| # ./perf-before stat -e armv8_pmuv3_0/cycles/ -e armv8_pmuv3_1/cycles/ -e armv8_pmuv3_2/cycles/ -e armv8_pmuv3_3/cycles/ true
| 
|  Performance counter stats for 'true':
| 
|      <not counted>      armv8_pmuv3_0/cycles/                                                   (0.00%)
|      <not counted>      armv8_pmuv3_1/cycles/                                                   (0.00%)
|      <not counted>      armv8_pmuv3_2/cycles/                                                   (0.00%)
|             901415      armv8_pmuv3_3/cycles/                                                 
| 
|        0.000756590 seconds time elapsed
| 
|        0.000811000 seconds user
|        0.000000000 seconds sys
| 
| # ./perf-after stat -e armv8_pmuv3_0/cycles/ -e armv8_pmuv3_1/cycles/ -e armv8_pmuv3_2/cycles/ -e armv8_pmuv3_3/cycles/ true
| 
|  Performance counter stats for 'true':
| 
|             923314      armv8_pmuv3_0/cycles/                                                 
|      <not counted>      armv8_pmuv3_1/cycles/                                                   (0.00%)
|      <not counted>      armv8_pmuv3_2/cycles/                                                   (0.00%)
|      <not counted>      armv8_pmuv3_3/cycles/                                                   (0.00%)
| 
|        0.000782420 seconds time elapsed
| 
|        0.000836000 seconds user
|        0.000000000 seconds sys

One thing I'm still looing into is that this doesn't seem to do anything
for a default perf stat session, e.g.

	perf stat ./workload

.. doesn't automatically expand the implicitly-created events into per-pmu
events.

Comparing before and after this patch:

| # ./perf-before stat true
| 
|  Performance counter stats for 'true':
| 
|               0.42 msec task-clock                       #    0.569 CPUs utilized             
|                  0      context-switches                 #    0.000 /sec                      
|                  0      cpu-migrations                   #    0.000 /sec                      
|                 38      page-faults                      #   89.796 K/sec                     
|      <not counted>      cycles                                                                  (0.00%)
|      <not counted>      instructions                                                            (0.00%)
|      <not counted>      branches                                                                (0.00%)
|      <not counted>      branch-misses                                                           (0.00%)
| 
|        0.000744185 seconds time elapsed
| 
|        0.000795000 seconds user
|        0.000000000 seconds sys
| 
| # ./perf-after stat true
| 
|  Performance counter stats for 'true':
| 
|               0.43 msec task-clock                       #    0.582 CPUs utilized             
|                  0      context-switches                 #    0.000 /sec                      
|                  0      cpu-migrations                   #    0.000 /sec                      
|                 38      page-faults                      #   88.960 K/sec                     
|      <not counted>      cycles                                                                  (0.00%)
|      <not counted>      instructions                                                            (0.00%)
|      <not counted>      branches                                                                (0.00%)
|      <not counted>      branch-misses                                                           (0.00%)
| 
|        0.000734120 seconds time elapsed
| 
|        0.000786000 seconds user
|        0.000000000 seconds sys

Ian, how does that behave on x86? Is that the same, or do the default
events get expanded?

Thanks,
Mark.
  

Comments

Marc Zyngier Jan. 17, 2024, 9:05 a.m. UTC | #1
Hi Mark,

On Tue, 16 Jan 2024 17:03:48 +0000,
Mark Rutland <mark.rutland@arm.com> wrote:
> 
> Currently the perf tool doesn't deteect support for extneded event types
> on Apple M1/M2 systems, and will not auto-expand plain PERF_EVENT_TYPE
> hardware events into per-PMU events. This is due to the detection of
> extended event types not handling mandatory filters required by the
> M1/M2 PMU driver.

Thanks for looking into this.

I've given your patch a go on my M1 box, and it indeed makes things
substantially better:

$ sudo ./perf stat -e cycles ~/hackbench 100 process 1000
Running with 100*40 (== 4000) tasks.
Time: 3.419

 Performance counter stats for '/home/maz/hackbench 100 process 1000':

   174,783,472,090      apple_firestorm_pmu/cycles/                                             (93.10%)
    39,134,744,813      apple_icestorm_pmu/cycles/                                              (71.86%)

       3.568145595 seconds time elapsed

      12.203084000 seconds user
      55.135271000 seconds sys

However, I'm seeing some slightly odd behaviours:

$ sudo ./perf stat -e cycles:k ~/hackbench 100 process 1000
Running with 100*40 (== 4000) tasks.
Time: 3.313

 Performance counter stats for '/home/maz/hackbench 100 process 1000':

   <not supported>      apple_firestorm_pmu/cycles:k/                                         
   <not supported>      apple_icestorm_pmu/cycles:k/                                          

       3.467568841 seconds time elapsed

      13.080111000 seconds user
      53.162099000 seconds sys

I would have expected it to count, but it didn't. For that to work, I
have to add the 'H' modifier:

$ sudo ./perf stat -e cycles:Hk ~/hackbench 100 process 1000
Running with 100*40 (== 4000) tasks.
Time: 3.335

 Performance counter stats for '/home/maz/hackbench 100 process 1000':

   183,756,134,397      apple_firestorm_pmu/cycles:Hk/                                          (85.56%)
    37,302,841,991      apple_icestorm_pmu/cycles:Hk/                                           (72.10%)

       3.490138958 seconds time elapsed

      13.376772000 seconds user
      53.326289000 seconds sys

But my perf-foo is as basic as it gets, so it is likely that I'm
missing something.

Thanks,

	M.
  
Mark Rutland Jan. 17, 2024, 12:12 p.m. UTC | #2
On Wed, Jan 17, 2024 at 09:05:25AM +0000, Marc Zyngier wrote:
> Hi Mark,
> 
> On Tue, 16 Jan 2024 17:03:48 +0000,
> Mark Rutland <mark.rutland@arm.com> wrote:
> > 
> > Currently the perf tool doesn't deteect support for extneded event types
> > on Apple M1/M2 systems, and will not auto-expand plain PERF_EVENT_TYPE
> > hardware events into per-PMU events. This is due to the detection of
> > extended event types not handling mandatory filters required by the
> > M1/M2 PMU driver.
> 
> Thanks for looking into this.
> 
> I've given your patch a go on my M1 box, and it indeed makes things
> substantially better:
> 
> $ sudo ./perf stat -e cycles ~/hackbench 100 process 1000
> Running with 100*40 (== 4000) tasks.
> Time: 3.419
> 
>  Performance counter stats for '/home/maz/hackbench 100 process 1000':
> 
>    174,783,472,090      apple_firestorm_pmu/cycles/                                             (93.10%)
>     39,134,744,813      apple_icestorm_pmu/cycles/                                              (71.86%)
> 
>        3.568145595 seconds time elapsed
> 
>       12.203084000 seconds user
>       55.135271000 seconds sys

Thanks for giving that a spin!

> However, I'm seeing some slightly odd behaviours:
> 
> $ sudo ./perf stat -e cycles:k ~/hackbench 100 process 1000
> Running with 100*40 (== 4000) tasks.
> Time: 3.313
> 
>  Performance counter stats for '/home/maz/hackbench 100 process 1000':
> 
>    <not supported>      apple_firestorm_pmu/cycles:k/                                         
>    <not supported>      apple_icestorm_pmu/cycles:k/                                          
> 
>        3.467568841 seconds time elapsed
> 
>       13.080111000 seconds user
>       53.162099000 seconds sys
> 
> I would have expected it to count, but it didn't. For that to work, I
> have to add the 'H' modifier:

Ok, so that'll have something to do with the way the tool chooses which
perf_evant_attr::exclude_* bits to set. I thought that was the same for plain
events and pmu_name/event/ events, but I could be mistaken.

Is that something you had tried prior to this patch, and did that "just work"
with the explicit pmu_name/event/ syntax prior to this patch?

e.g. did something like:

	perf stat -e apple_firestorm_pmu/cycles/k -e apple_icestorm_pmu/cycles/k ./workload

.. happen to work withiout requiring the addition of 'H'?

If so, does that behave the same before/after this patch?

.. and could you run that with '-vvv' and dump the output for comparison?

Thanks,
Mark.
  
Namhyung Kim Jan. 19, 2024, 5:57 a.m. UTC | #3
Hello,

Adding Ravi to CC.

On Wed, Jan 17, 2024 at 4:12 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Wed, Jan 17, 2024 at 09:05:25AM +0000, Marc Zyngier wrote:
> > Hi Mark,
> >
> > On Tue, 16 Jan 2024 17:03:48 +0000,
> > Mark Rutland <mark.rutland@arm.com> wrote:
> > >
> > > Currently the perf tool doesn't deteect support for extneded event types
> > > on Apple M1/M2 systems, and will not auto-expand plain PERF_EVENT_TYPE
> > > hardware events into per-PMU events. This is due to the detection of
> > > extended event types not handling mandatory filters required by the
> > > M1/M2 PMU driver.
> >
> > Thanks for looking into this.
> >
> > I've given your patch a go on my M1 box, and it indeed makes things
> > substantially better:
> >
> > $ sudo ./perf stat -e cycles ~/hackbench 100 process 1000
> > Running with 100*40 (== 4000) tasks.
> > Time: 3.419
> >
> >  Performance counter stats for '/home/maz/hackbench 100 process 1000':
> >
> >    174,783,472,090      apple_firestorm_pmu/cycles/                                             (93.10%)
> >     39,134,744,813      apple_icestorm_pmu/cycles/                                              (71.86%)
> >
> >        3.568145595 seconds time elapsed
> >
> >       12.203084000 seconds user
> >       55.135271000 seconds sys
>
> Thanks for giving that a spin!
>
> > However, I'm seeing some slightly odd behaviours:
> >
> > $ sudo ./perf stat -e cycles:k ~/hackbench 100 process 1000
> > Running with 100*40 (== 4000) tasks.
> > Time: 3.313
> >
> >  Performance counter stats for '/home/maz/hackbench 100 process 1000':
> >
> >    <not supported>      apple_firestorm_pmu/cycles:k/
> >    <not supported>      apple_icestorm_pmu/cycles:k/

Hmm.. I guess this should look like apple_firestorm_pmu/cycles/k.
IIRC there was a thread for this, right?

> >
> >        3.467568841 seconds time elapsed
> >
> >       13.080111000 seconds user
> >       53.162099000 seconds sys
> >
> > I would have expected it to count, but it didn't. For that to work, I
> > have to add the 'H' modifier:
>
> Ok, so that'll have something to do with the way the tool chooses which
> perf_evant_attr::exclude_* bits to set. I thought that was the same for plain
> events and pmu_name/event/ events, but I could be mistaken.

I think it sets the attr.exclude_guest by event_attr_init().  Maybe
it's deleted during the missing feature detection logic.  But IIUC
it should work on each PMU separately.

By the way, I really hope the kernel exports caps/exclude_bits
for PMUs so that tools can see which bits are supported.  For
example AMD IBS has CAP_NO_EXCLUDE so setting exclude_guest
will fail to open.  Then it disables the new features added after
that in the missing feature detection logic.

If we know if it doesn't support any exclude bits, then tools can
try other features after removing the bit first.

Thanks,
Namhyung

>
> Is that something you had tried prior to this patch, and did that "just work"
> with the explicit pmu_name/event/ syntax prior to this patch?
>
> e.g. did something like:
>
>         perf stat -e apple_firestorm_pmu/cycles/k -e apple_icestorm_pmu/cycles/k ./workload
>
> ... happen to work withiout requiring the addition of 'H'?
>
> If so, does that behave the same before/after this patch?
>
> ... and could you run that with '-vvv' and dump the output for comparison?
>
> Thanks,
> Mark.
  
James Clark Jan. 19, 2024, 3 p.m. UTC | #4
On 17/01/2024 09:05, Marc Zyngier wrote:
> Hi Mark,
> 
> On Tue, 16 Jan 2024 17:03:48 +0000,
> Mark Rutland <mark.rutland@arm.com> wrote:
>>
>> Currently the perf tool doesn't deteect support for extneded event types
>> on Apple M1/M2 systems, and will not auto-expand plain PERF_EVENT_TYPE
>> hardware events into per-PMU events. This is due to the detection of
>> extended event types not handling mandatory filters required by the
>> M1/M2 PMU driver.
> 
> Thanks for looking into this.
> 
> I've given your patch a go on my M1 box, and it indeed makes things
> substantially better:
> 
> $ sudo ./perf stat -e cycles ~/hackbench 100 process 1000
> Running with 100*40 (== 4000) tasks.
> Time: 3.419
> 
>  Performance counter stats for '/home/maz/hackbench 100 process 1000':
> 
>    174,783,472,090      apple_firestorm_pmu/cycles/                                             (93.10%)
>     39,134,744,813      apple_icestorm_pmu/cycles/                                              (71.86%)
> 
>        3.568145595 seconds time elapsed
> 
>       12.203084000 seconds user
>       55.135271000 seconds sys
> 
> However, I'm seeing some slightly odd behaviours:
> 
> $ sudo ./perf stat -e cycles:k ~/hackbench 100 process 1000
> Running with 100*40 (== 4000) tasks.
> Time: 3.313
> 
>  Performance counter stats for '/home/maz/hackbench 100 process 1000':
> 
>    <not supported>      apple_firestorm_pmu/cycles:k/                                         
>    <not supported>      apple_icestorm_pmu/cycles:k/                                          
> 
>        3.467568841 seconds time elapsed
> 
>       13.080111000 seconds user
>       53.162099000 seconds sys
> 
> I would have expected it to count, but it didn't. For that to work, I
> have to add the 'H' modifier:
> 
> $ sudo ./perf stat -e cycles:Hk ~/hackbench 100 process 1000
> Running with 100*40 (== 4000) tasks.
> Time: 3.335
> 
>  Performance counter stats for '/home/maz/hackbench 100 process 1000':
> 
>    183,756,134,397      apple_firestorm_pmu/cycles:Hk/                                          (85.56%)
>     37,302,841,991      apple_icestorm_pmu/cycles:Hk/                                           (72.10%)
> 
>        3.490138958 seconds time elapsed
> 
>       13.376772000 seconds user
>       53.326289000 seconds sys
> 
> But my perf-foo is as basic as it gets, so it is likely that I'm
> missing something.
> 
> Thanks,
> 
> 	M.
> 

The default for exclude_guest=1 is always set unless you use "perf kvm
record". But unfortunately if you add _any_ modifier then all the
default values get overwritten, which is why adding k stops it from working.

It seems like a lot of the pain comes from the fact that the command
line modifiers are includes, and the kernel options are excludes.
Resulting in some complicated logic in get_event_modifier() to try to
work out what you wanted, which I suppose it got it wrong in this scenario.

Another thing is that evsel__detect_missing_features() only works to
remove arguments, where as for M1/M2 you want to _add_ exclude_guest
back in again. But maybe not if you started adding modifiers...

I don't really have an answer or a solution, but I was wondering if
there was a bug that could be fixed so I went to look.

My first thought was maybe we could ignore exclude_guest=0 on M1/M2
rather than returning an error. I don't think it would be too surprising
that guest samples are missing if they are never supported. Maybe the
exclude_guest argument is only useful if it actually has an effect,
rather than having an argument that always needs to be on anyway.

Or secondly maybe we could make get_event_modifier() interact with
evsel__detect_missing_features() but I can't imagine it being very
maintainable once we start adding a few interactions. And it's still
hard to predict what the user really wants given a single letter and
many exclude_* options.

James
  
Ian Rogers Jan. 20, 2024, 6:27 p.m. UTC | #5
On Tue, Jan 16, 2024 at 9:04 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> Currently the perf tool doesn't deteect support for extneded event types

nit: s/deteect/detect/
nit: s/extneded/extended/

> on Apple M1/M2 systems, and will not auto-expand plain PERF_EVENT_TYPE
> hardware events into per-PMU events. This is due to the detection of
> extended event types not handling mandatory filters required by the
> M1/M2 PMU driver.
>
> PMU drivers and the core perf_events code can require that
> perf_event_attr::exclude_* filters are configured in a specific way and
> may reject certain configurations of filters, for example:
>
> (a) Many PMUs lack support for any event filtering, and require all
>     perf_event_attr::exclude_* bits to be clear. This includes Alpha's
>     CPU PMU, and ARM CPU PMUs prior to the introduction of PMUv2 in
>     ARMv7,
>
> (b) When /proc/sys/kernel/perf_event_paranoid >= 2, the perf core
>     requires that perf_event_attr::exclude_kernel is set.
>
> (c) The Apple M1/M2 PMU requires that perf_event_attr::exclude_guest is
>     set as the hardware PMU does not count while a guest is running (but
>     might be extended in future to do so).
>
> In is_event_supported(), we try to account for cases (a) and (b), first
> attempting to open an event without any filters, and if this fails,
> retrying with perf_event_attr::exclude_kernel set. We do not account for
> case (c), or any other filters that drivers could theoretically require
> to be set.
>
> Thus is_event_supported() will fail to detect support for any events
> targetting an Apple M1/M2 PMU, even where events would be supported with

nit: s/targetting/targeting/

> perf_event_attr:::exclude_guest set.
>
> Since commit:
>
>   82fe2e45cdb00de4 ("perf pmus: Check if we can encode the PMU number in perf_event_attr.type")
>
> ... we use is_event_supported() to detect support for extended types,
> with the PMU ID encoded into the perf_event_attr::type. As above, on an
> Apple M1/M2 system this will always fail to detect that the event is
> supported, and consequently we fail to detect support for extended types
> even when these are supported, as they have been since commit:
>
>   5c816728651ae425 ("arm_pmu: Add PERF_PMU_CAP_EXTENDED_HW_TYPE capability")
>
> Due to this, the perf tool will not automatically expand plain
> PERF_TYPE_HARDWARE events into per-PMU events, even when all the
> necessary kernel support is present.
>
> This patch updates is_event_supported() to additionally try opening
> events with perf_event_attr::exclude_guest set, allowing support for
> events to be detected on Apple M1/M2 systems. I beleive that this is

nit: s/beleive/believe/

> sufficient for all contemporary CPU PMU drivers, though in future it may
> be necessary to check for other combinations of filter bits.
>
> I've deliberately changed the check to not expect a specific error code
> for missing filters, as today ;the kernel may return a number of
> different error codes for missing filters (e.g. -EACCESS, -EINVAL, or
> -EOPNOTSUPP) depending on why and where the filter configuration is
> rejected, and retrying for any error is more robust.
>
> Note that this does not remove the need for commit:
>
>   a24d9d9dc096fc0d ("perf parse-events: Make legacy events lower priority than sysfs/JSON")
>
> ... which is still necessary so that named-pmu/event/ events work on
> kernels without extended type support, even if the event name happens to
> be the same as a PERF_EVENT_TYPE_HARDWARE event (e.g. as is the case for
> the M1/M2 PMU's 'cycles' and 'instructions' events).
>
> Fixes: 82fe2e45cdb00de4 ("perf pmus: Check if we can encode the PMU number in perf_event_attr.type")
> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> Cc: Hector Martin <marcan@marcan.st>
> Cc: Ian Rogers <irogers@google.com>
> Cc: James Clark <james.clark@arm.com>
> Cc: John Garry <john.g.garry@oracle.com>
> Cc: Leo Yan <leo.yan@linaro.org>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Mike Leach <mike.leach@linaro.org>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
> Cc: Thomas Richter <tmricht@linux.ibm.com>
> Cc: Will Deacon <will@kernel.org>
> ---
>  tools/perf/util/print-events.c | 27 +++++++++++++++++++--------
>  1 file changed, 19 insertions(+), 8 deletions(-)
>
> Hector, Marc, I'd appreciate if either of you could give this a spin on
> your M1/M2 machines. I've given it local testing with the arm_pmuv3
> driver modified to behave the same as the apple_m1_pmu driver (requiring
> exclude_guest, having a 'cycles' event in sysfs), but that might not
> perfectly replicate your setup.
>
> The patch is based on the 'perf-tools-for-v6.8-1-2024-01-09' tag in the
> perf-tools tree:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools.git/
>
> ... and I've pushed it out to the 'perf-tools/event-supported-filters'
> branch in my tree:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/
>
> This patch *should* make it possible to do:
>
>         perf stat -e cycles ./workload
>         perf stat -e instructions ./workload
>
> ... with those 'cycles' and 'instructions' events being automatically
> expanded and reported as separate events per-PMU, which is a nice
> quality-of-life improvement.
>
> Comparing before and after this patch:
>
> | # ./perf-before stat -e cycles true
> |
> |  Performance counter stats for 'true':
> |
> |      <not counted>      cycles                                                                  (0.00%)
> |
> |        0.000990250 seconds time elapsed
> |
> |        0.000934000 seconds user
> |        0.000000000 seconds sys
> |
> | # ./perf-after stat -e cycles true
> |
> |  Performance counter stats for 'true':
> |
> |             965175      armv8_pmuv3_0/cycles/
> |      <not counted>      armv8_pmuv3_1/cycles/                                                   (0.00%)
> |      <not counted>      armv8_pmuv3_2/cycles/                                                   (0.00%)
> |      <not counted>      armv8_pmuv3_3/cycles/                                                   (0.00%)
> |
> |        0.000836555 seconds time elapsed
> |
> |        0.000884000 seconds user
> |        0.000000000 seconds sys

Just to check, this is the expected expansion of cycles? I'm unclear
why 4 core PMUs.

>
> This *shouldn't* change the interpetation of named-pmu events, e.g.
>
>         perf stat -e apple_whichever_pmu/cycles/ ./workload
>
> ... should behave the same as without this patch
>
> Comparing before and after this patch:
>
> | # ./perf-before stat -e armv8_pmuv3_0/cycles/ -e armv8_pmuv3_1/cycles/ -e armv8_pmuv3_2/cycles/ -e armv8_pmuv3_3/cycles/ true
> |
> |  Performance counter stats for 'true':
> |
> |      <not counted>      armv8_pmuv3_0/cycles/                                                   (0.00%)
> |      <not counted>      armv8_pmuv3_1/cycles/                                                   (0.00%)
> |      <not counted>      armv8_pmuv3_2/cycles/                                                   (0.00%)
> |             901415      armv8_pmuv3_3/cycles/
> |
> |        0.000756590 seconds time elapsed
> |
> |        0.000811000 seconds user
> |        0.000000000 seconds sys
> |
> | # ./perf-after stat -e armv8_pmuv3_0/cycles/ -e armv8_pmuv3_1/cycles/ -e armv8_pmuv3_2/cycles/ -e armv8_pmuv3_3/cycles/ true
> |
> |  Performance counter stats for 'true':
> |
> |             923314      armv8_pmuv3_0/cycles/
> |      <not counted>      armv8_pmuv3_1/cycles/                                                   (0.00%)
> |      <not counted>      armv8_pmuv3_2/cycles/                                                   (0.00%)
> |      <not counted>      armv8_pmuv3_3/cycles/                                                   (0.00%)
> |
> |        0.000782420 seconds time elapsed
> |
> |        0.000836000 seconds user
> |        0.000000000 seconds sys
>
> One thing I'm still looing into is that this doesn't seem to do anything
> for a default perf stat session, e.g.
>
>         perf stat ./workload
>
> ... doesn't automatically expand the implicitly-created events into per-pmu
> events.

Ugh, weak symbols. x86 has overridden the default adding of attributes
to do it for hybrid:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/arch/x86/util/evlist.c?h=perf-tools-next#n36
I think we should lose the adding events via attributes and just go to
using parse events for everything. I'll see if I can do some cleanup
and that should resolve this - I also want to cleanup the default
events/metrics and the detailed ones as we can drop the unsupported
events, etc.

> Comparing before and after this patch:
>
> | # ./perf-before stat true
> |
> |  Performance counter stats for 'true':
> |
> |               0.42 msec task-clock                       #    0.569 CPUs utilized
> |                  0      context-switches                 #    0.000 /sec
> |                  0      cpu-migrations                   #    0.000 /sec
> |                 38      page-faults                      #   89.796 K/sec
> |      <not counted>      cycles                                                                  (0.00%)
> |      <not counted>      instructions                                                            (0.00%)
> |      <not counted>      branches                                                                (0.00%)
> |      <not counted>      branch-misses                                                           (0.00%)
> |
> |        0.000744185 seconds time elapsed
> |
> |        0.000795000 seconds user
> |        0.000000000 seconds sys
> |
> | # ./perf-after stat true
> |
> |  Performance counter stats for 'true':
> |
> |               0.43 msec task-clock                       #    0.582 CPUs utilized
> |                  0      context-switches                 #    0.000 /sec
> |                  0      cpu-migrations                   #    0.000 /sec
> |                 38      page-faults                      #   88.960 K/sec
> |      <not counted>      cycles                                                                  (0.00%)
> |      <not counted>      instructions                                                            (0.00%)
> |      <not counted>      branches                                                                (0.00%)
> |      <not counted>      branch-misses                                                           (0.00%)
> |
> |        0.000734120 seconds time elapsed
> |
> |        0.000786000 seconds user
> |        0.000000000 seconds sys
>
> Ian, how does that behave on x86? Is that the same, or do the default
> events get expanded?

The default events are expanded, the not counted is a feature of a
fast binary (true here). I'm trying to remove custom code paths so
that things like this don't happen, sorry that you've come across
another instance but at least I can fix it.

Thanks,
Ian

> Thanks,
> Mark.
>
> diff --git a/tools/perf/util/print-events.c b/tools/perf/util/print-events.c
> index b0fc48be623f3..4f67e8f00a4d6 100644
> --- a/tools/perf/util/print-events.c
> +++ b/tools/perf/util/print-events.c
> @@ -232,7 +232,6 @@ void print_sdt_events(const struct print_callbacks *print_cb, void *print_state)
>  bool is_event_supported(u8 type, u64 config)
>  {
>         bool ret = true;
> -       int open_return;
>         struct evsel *evsel;
>         struct perf_event_attr attr = {
>                 .type = type,
> @@ -246,20 +245,32 @@ bool is_event_supported(u8 type, u64 config)
>
>         evsel = evsel__new(&attr);
>         if (evsel) {
> -               open_return = evsel__open(evsel, NULL, tmap);
> -               ret = open_return >= 0;
> +               ret = evsel__open(evsel, NULL, tmap) >= 0;
>
> -               if (open_return == -EACCES) {
> +               if (!ret) {
>                         /*
> -                        * This happens if the paranoid value
> +                        * The event may fail to open if the paranoid value
>                          * /proc/sys/kernel/perf_event_paranoid is set to 2
> -                        * Re-run with exclude_kernel set; we don't do that
> -                        * by default as some ARM machines do not support it.
> -                        *
> +                        * Re-run with exclude_kernel set; we don't do that by
> +                        * default as some ARM machines do not support it.
>                          */
>                         evsel->core.attr.exclude_kernel = 1;
>                         ret = evsel__open(evsel, NULL, tmap) >= 0;
>                 }
> +
> +               if (!ret) {
> +                       /*
> +                        * The event may fail to open if the PMU requires
> +                        * exclude_guest to be set (e.g. as the Apple M1 PMU
> +                        * requires).
> +                        * Re-run with exclude_guest set; we don't do that by
> +                        * default as it's equally legitimate for another PMU
> +                        * driver to require that exclude_guest is clear.
> +                        */
> +                       evsel->core.attr.exclude_guest = 1;
> +                       ret = evsel__open(evsel, NULL, tmap) >= 0;
> +               }
> +
>                 evsel__delete(evsel);
>         }
>
> --
> 2.30.2
>
  
Ian Rogers Jan. 20, 2024, 6:29 p.m. UTC | #6
On Sat, Jan 20, 2024 at 10:27 AM Ian Rogers <irogers@google.com> wrote:
>
> On Tue, Jan 16, 2024 at 9:04 AM Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > Currently the perf tool doesn't deteect support for extneded event types
>
> nit: s/deteect/detect/
> nit: s/extneded/extended/
>
> > on Apple M1/M2 systems, and will not auto-expand plain PERF_EVENT_TYPE
> > hardware events into per-PMU events. This is due to the detection of
> > extended event types not handling mandatory filters required by the
> > M1/M2 PMU driver.
> >
> > PMU drivers and the core perf_events code can require that
> > perf_event_attr::exclude_* filters are configured in a specific way and
> > may reject certain configurations of filters, for example:
> >
> > (a) Many PMUs lack support for any event filtering, and require all
> >     perf_event_attr::exclude_* bits to be clear. This includes Alpha's
> >     CPU PMU, and ARM CPU PMUs prior to the introduction of PMUv2 in
> >     ARMv7,
> >
> > (b) When /proc/sys/kernel/perf_event_paranoid >= 2, the perf core
> >     requires that perf_event_attr::exclude_kernel is set.
> >
> > (c) The Apple M1/M2 PMU requires that perf_event_attr::exclude_guest is
> >     set as the hardware PMU does not count while a guest is running (but
> >     might be extended in future to do so).
> >
> > In is_event_supported(), we try to account for cases (a) and (b), first
> > attempting to open an event without any filters, and if this fails,
> > retrying with perf_event_attr::exclude_kernel set. We do not account for
> > case (c), or any other filters that drivers could theoretically require
> > to be set.
> >
> > Thus is_event_supported() will fail to detect support for any events
> > targetting an Apple M1/M2 PMU, even where events would be supported with
>
> nit: s/targetting/targeting/
>
> > perf_event_attr:::exclude_guest set.
> >
> > Since commit:
> >
> >   82fe2e45cdb00de4 ("perf pmus: Check if we can encode the PMU number in perf_event_attr.type")
> >
> > ... we use is_event_supported() to detect support for extended types,
> > with the PMU ID encoded into the perf_event_attr::type. As above, on an
> > Apple M1/M2 system this will always fail to detect that the event is
> > supported, and consequently we fail to detect support for extended types
> > even when these are supported, as they have been since commit:
> >
> >   5c816728651ae425 ("arm_pmu: Add PERF_PMU_CAP_EXTENDED_HW_TYPE capability")
> >
> > Due to this, the perf tool will not automatically expand plain
> > PERF_TYPE_HARDWARE events into per-PMU events, even when all the
> > necessary kernel support is present.
> >
> > This patch updates is_event_supported() to additionally try opening
> > events with perf_event_attr::exclude_guest set, allowing support for
> > events to be detected on Apple M1/M2 systems. I beleive that this is
>
> nit: s/beleive/believe/
>
> > sufficient for all contemporary CPU PMU drivers, though in future it may
> > be necessary to check for other combinations of filter bits.
> >
> > I've deliberately changed the check to not expect a specific error code
> > for missing filters, as today ;the kernel may return a number of
> > different error codes for missing filters (e.g. -EACCESS, -EINVAL, or
> > -EOPNOTSUPP) depending on why and where the filter configuration is
> > rejected, and retrying for any error is more robust.
> >
> > Note that this does not remove the need for commit:
> >
> >   a24d9d9dc096fc0d ("perf parse-events: Make legacy events lower priority than sysfs/JSON")
> >
> > ... which is still necessary so that named-pmu/event/ events work on
> > kernels without extended type support, even if the event name happens to
> > be the same as a PERF_EVENT_TYPE_HARDWARE event (e.g. as is the case for
> > the M1/M2 PMU's 'cycles' and 'instructions' events).
> >
> > Fixes: 82fe2e45cdb00de4 ("perf pmus: Check if we can encode the PMU number in perf_event_attr.type")
> > Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> > Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> > Cc: Hector Martin <marcan@marcan.st>
> > Cc: Ian Rogers <irogers@google.com>
> > Cc: James Clark <james.clark@arm.com>
> > Cc: John Garry <john.g.garry@oracle.com>
> > Cc: Leo Yan <leo.yan@linaro.org>
> > Cc: Marc Zyngier <maz@kernel.org>
> > Cc: Mike Leach <mike.leach@linaro.org>
> > Cc: Namhyung Kim <namhyung@kernel.org>
> > Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
> > Cc: Thomas Richter <tmricht@linux.ibm.com>
> > Cc: Will Deacon <will@kernel.org>

Also:
Tested-by: Ian Rogers <irogers@google.com>

No regressions on Alderlake except a pre-existing problem I noted here:
https://lore.kernel.org/lkml/CAP-5=fWVQ-7ijjK3-w1q+k2WYVNHbAcejb-xY0ptbjRw476VKA@mail.gmail.com/

Thanks,
Ian

> > ---
> >  tools/perf/util/print-events.c | 27 +++++++++++++++++++--------
> >  1 file changed, 19 insertions(+), 8 deletions(-)
> >
> > Hector, Marc, I'd appreciate if either of you could give this a spin on
> > your M1/M2 machines. I've given it local testing with the arm_pmuv3
> > driver modified to behave the same as the apple_m1_pmu driver (requiring
> > exclude_guest, having a 'cycles' event in sysfs), but that might not
> > perfectly replicate your setup.
> >
> > The patch is based on the 'perf-tools-for-v6.8-1-2024-01-09' tag in the
> > perf-tools tree:
> >
> >   https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools.git/
> >
> > ... and I've pushed it out to the 'perf-tools/event-supported-filters'
> > branch in my tree:
> >
> >   https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/
> >
> > This patch *should* make it possible to do:
> >
> >         perf stat -e cycles ./workload
> >         perf stat -e instructions ./workload
> >
> > ... with those 'cycles' and 'instructions' events being automatically
> > expanded and reported as separate events per-PMU, which is a nice
> > quality-of-life improvement.
> >
> > Comparing before and after this patch:
> >
> > | # ./perf-before stat -e cycles true
> > |
> > |  Performance counter stats for 'true':
> > |
> > |      <not counted>      cycles                                                                  (0.00%)
> > |
> > |        0.000990250 seconds time elapsed
> > |
> > |        0.000934000 seconds user
> > |        0.000000000 seconds sys
> > |
> > | # ./perf-after stat -e cycles true
> > |
> > |  Performance counter stats for 'true':
> > |
> > |             965175      armv8_pmuv3_0/cycles/
> > |      <not counted>      armv8_pmuv3_1/cycles/                                                   (0.00%)
> > |      <not counted>      armv8_pmuv3_2/cycles/                                                   (0.00%)
> > |      <not counted>      armv8_pmuv3_3/cycles/                                                   (0.00%)
> > |
> > |        0.000836555 seconds time elapsed
> > |
> > |        0.000884000 seconds user
> > |        0.000000000 seconds sys
>
> Just to check, this is the expected expansion of cycles? I'm unclear
> why 4 core PMUs.
>
> >
> > This *shouldn't* change the interpetation of named-pmu events, e.g.
> >
> >         perf stat -e apple_whichever_pmu/cycles/ ./workload
> >
> > ... should behave the same as without this patch
> >
> > Comparing before and after this patch:
> >
> > | # ./perf-before stat -e armv8_pmuv3_0/cycles/ -e armv8_pmuv3_1/cycles/ -e armv8_pmuv3_2/cycles/ -e armv8_pmuv3_3/cycles/ true
> > |
> > |  Performance counter stats for 'true':
> > |
> > |      <not counted>      armv8_pmuv3_0/cycles/                                                   (0.00%)
> > |      <not counted>      armv8_pmuv3_1/cycles/                                                   (0.00%)
> > |      <not counted>      armv8_pmuv3_2/cycles/                                                   (0.00%)
> > |             901415      armv8_pmuv3_3/cycles/
> > |
> > |        0.000756590 seconds time elapsed
> > |
> > |        0.000811000 seconds user
> > |        0.000000000 seconds sys
> > |
> > | # ./perf-after stat -e armv8_pmuv3_0/cycles/ -e armv8_pmuv3_1/cycles/ -e armv8_pmuv3_2/cycles/ -e armv8_pmuv3_3/cycles/ true
> > |
> > |  Performance counter stats for 'true':
> > |
> > |             923314      armv8_pmuv3_0/cycles/
> > |      <not counted>      armv8_pmuv3_1/cycles/                                                   (0.00%)
> > |      <not counted>      armv8_pmuv3_2/cycles/                                                   (0.00%)
> > |      <not counted>      armv8_pmuv3_3/cycles/                                                   (0.00%)
> > |
> > |        0.000782420 seconds time elapsed
> > |
> > |        0.000836000 seconds user
> > |        0.000000000 seconds sys
> >
> > One thing I'm still looing into is that this doesn't seem to do anything
> > for a default perf stat session, e.g.
> >
> >         perf stat ./workload
> >
> > ... doesn't automatically expand the implicitly-created events into per-pmu
> > events.
>
> Ugh, weak symbols. x86 has overridden the default adding of attributes
> to do it for hybrid:
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/arch/x86/util/evlist.c?h=perf-tools-next#n36
> I think we should lose the adding events via attributes and just go to
> using parse events for everything. I'll see if I can do some cleanup
> and that should resolve this - I also want to cleanup the default
> events/metrics and the detailed ones as we can drop the unsupported
> events, etc.
>
> > Comparing before and after this patch:
> >
> > | # ./perf-before stat true
> > |
> > |  Performance counter stats for 'true':
> > |
> > |               0.42 msec task-clock                       #    0.569 CPUs utilized
> > |                  0      context-switches                 #    0.000 /sec
> > |                  0      cpu-migrations                   #    0.000 /sec
> > |                 38      page-faults                      #   89.796 K/sec
> > |      <not counted>      cycles                                                                  (0.00%)
> > |      <not counted>      instructions                                                            (0.00%)
> > |      <not counted>      branches                                                                (0.00%)
> > |      <not counted>      branch-misses                                                           (0.00%)
> > |
> > |        0.000744185 seconds time elapsed
> > |
> > |        0.000795000 seconds user
> > |        0.000000000 seconds sys
> > |
> > | # ./perf-after stat true
> > |
> > |  Performance counter stats for 'true':
> > |
> > |               0.43 msec task-clock                       #    0.582 CPUs utilized
> > |                  0      context-switches                 #    0.000 /sec
> > |                  0      cpu-migrations                   #    0.000 /sec
> > |                 38      page-faults                      #   88.960 K/sec
> > |      <not counted>      cycles                                                                  (0.00%)
> > |      <not counted>      instructions                                                            (0.00%)
> > |      <not counted>      branches                                                                (0.00%)
> > |      <not counted>      branch-misses                                                           (0.00%)
> > |
> > |        0.000734120 seconds time elapsed
> > |
> > |        0.000786000 seconds user
> > |        0.000000000 seconds sys
> >
> > Ian, how does that behave on x86? Is that the same, or do the default
> > events get expanded?
>
> The default events are expanded, the not counted is a feature of a
> fast binary (true here). I'm trying to remove custom code paths so
> that things like this don't happen, sorry that you've come across
> another instance but at least I can fix it.
>
> Thanks,
> Ian
>
> > Thanks,
> > Mark.
> >
> > diff --git a/tools/perf/util/print-events.c b/tools/perf/util/print-events.c
> > index b0fc48be623f3..4f67e8f00a4d6 100644
> > --- a/tools/perf/util/print-events.c
> > +++ b/tools/perf/util/print-events.c
> > @@ -232,7 +232,6 @@ void print_sdt_events(const struct print_callbacks *print_cb, void *print_state)
> >  bool is_event_supported(u8 type, u64 config)
> >  {
> >         bool ret = true;
> > -       int open_return;
> >         struct evsel *evsel;
> >         struct perf_event_attr attr = {
> >                 .type = type,
> > @@ -246,20 +245,32 @@ bool is_event_supported(u8 type, u64 config)
> >
> >         evsel = evsel__new(&attr);
> >         if (evsel) {
> > -               open_return = evsel__open(evsel, NULL, tmap);
> > -               ret = open_return >= 0;
> > +               ret = evsel__open(evsel, NULL, tmap) >= 0;
> >
> > -               if (open_return == -EACCES) {
> > +               if (!ret) {
> >                         /*
> > -                        * This happens if the paranoid value
> > +                        * The event may fail to open if the paranoid value
> >                          * /proc/sys/kernel/perf_event_paranoid is set to 2
> > -                        * Re-run with exclude_kernel set; we don't do that
> > -                        * by default as some ARM machines do not support it.
> > -                        *
> > +                        * Re-run with exclude_kernel set; we don't do that by
> > +                        * default as some ARM machines do not support it.
> >                          */
> >                         evsel->core.attr.exclude_kernel = 1;
> >                         ret = evsel__open(evsel, NULL, tmap) >= 0;
> >                 }
> > +
> > +               if (!ret) {
> > +                       /*
> > +                        * The event may fail to open if the PMU requires
> > +                        * exclude_guest to be set (e.g. as the Apple M1 PMU
> > +                        * requires).
> > +                        * Re-run with exclude_guest set; we don't do that by
> > +                        * default as it's equally legitimate for another PMU
> > +                        * driver to require that exclude_guest is clear.
> > +                        */
> > +                       evsel->core.attr.exclude_guest = 1;
> > +                       ret = evsel__open(evsel, NULL, tmap) >= 0;
> > +               }
> > +
> >                 evsel__delete(evsel);
> >         }
> >
> > --
> > 2.30.2
> >
  
James Clark Jan. 22, 2024, 10:43 a.m. UTC | #7
On 16/01/2024 17:03, Mark Rutland wrote:
> Currently the perf tool doesn't deteect support for extneded event types
> on Apple M1/M2 systems, and will not auto-expand plain PERF_EVENT_TYPE
> hardware events into per-PMU events. This is due to the detection of
> extended event types not handling mandatory filters required by the
> M1/M2 PMU driver.
> 
> PMU drivers and the core perf_events code can require that
> perf_event_attr::exclude_* filters are configured in a specific way and
> may reject certain configurations of filters, for example:
> 
> (a) Many PMUs lack support for any event filtering, and require all
>     perf_event_attr::exclude_* bits to be clear. This includes Alpha's
>     CPU PMU, and ARM CPU PMUs prior to the introduction of PMUv2 in
>     ARMv7,
> 
> (b) When /proc/sys/kernel/perf_event_paranoid >= 2, the perf core
>     requires that perf_event_attr::exclude_kernel is set.
> 
> (c) The Apple M1/M2 PMU requires that perf_event_attr::exclude_guest is
>     set as the hardware PMU does not count while a guest is running (but
>     might be extended in future to do so).
> 
> In is_event_supported(), we try to account for cases (a) and (b), first
> attempting to open an event without any filters, and if this fails,
> retrying with perf_event_attr::exclude_kernel set. We do not account for
> case (c), or any other filters that drivers could theoretically require
> to be set.
> 
> Thus is_event_supported() will fail to detect support for any events
> targetting an Apple M1/M2 PMU, even where events would be supported with
> perf_event_attr:::exclude_guest set.
> 
> Since commit:
> 
>   82fe2e45cdb00de4 ("perf pmus: Check if we can encode the PMU number in perf_event_attr.type")
> 
> ... we use is_event_supported() to detect support for extended types,
> with the PMU ID encoded into the perf_event_attr::type. As above, on an
> Apple M1/M2 system this will always fail to detect that the event is
> supported, and consequently we fail to detect support for extended types
> even when these are supported, as they have been since commit:
> 
>   5c816728651ae425 ("arm_pmu: Add PERF_PMU_CAP_EXTENDED_HW_TYPE capability")
> 
> Due to this, the perf tool will not automatically expand plain
> PERF_TYPE_HARDWARE events into per-PMU events, even when all the
> necessary kernel support is present.
> 
> This patch updates is_event_supported() to additionally try opening
> events with perf_event_attr::exclude_guest set, allowing support for
> events to be detected on Apple M1/M2 systems. I beleive that this is
> sufficient for all contemporary CPU PMU drivers, though in future it may
> be necessary to check for other combinations of filter bits.
> 
> I've deliberately changed the check to not expect a specific error code
> for missing filters, as today ;the kernel may return a number of
> different error codes for missing filters (e.g. -EACCESS, -EINVAL, or
> -EOPNOTSUPP) depending on why and where the filter configuration is
> rejected, and retrying for any error is more robust.
> 
> Note that this does not remove the need for commit:
> 
>   a24d9d9dc096fc0d ("perf parse-events: Make legacy events lower priority than sysfs/JSON")
> 
> ... which is still necessary so that named-pmu/event/ events work on
> kernels without extended type support, even if the event name happens to
> be the same as a PERF_EVENT_TYPE_HARDWARE event (e.g. as is the case for
> the M1/M2 PMU's 'cycles' and 'instructions' events).
> 
> Fixes: 82fe2e45cdb00de4 ("perf pmus: Check if we can encode the PMU number in perf_event_attr.type")
> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> Cc: Hector Martin <marcan@marcan.st>
> Cc: Ian Rogers <irogers@google.com>
> Cc: James Clark <james.clark@arm.com>
> Cc: John Garry <john.g.garry@oracle.com>
> Cc: Leo Yan <leo.yan@linaro.org>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Mike Leach <mike.leach@linaro.org>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
> Cc: Thomas Richter <tmricht@linux.ibm.com>
> Cc: Will Deacon <will@kernel.org>
> ---
>  tools/perf/util/print-events.c | 27 +++++++++++++++++++--------
>  1 file changed, 19 insertions(+), 8 deletions(-)
> 

Tested-by: James Clark <james.clark@arm.com>

Tested on Juno and N1SDP, although I wouldn't have expected it to make a
difference on those platforms because they support exclude_guest=0.

Although I do see an interaction with the test "Session topology" if I
hack the driver to behave like M1. The test has been failing (on
big.LITTLE) since commit 251aa040244a ("perf parse-events: Wildcard most
"numeric" events") but the result is that the test actually starts
passing with this change. I don't think that should really block this
though, it's likely going to require a separate fix which I will look into.

James
  
Mark Rutland Jan. 24, 2024, 3:48 p.m. UTC | #8
On Sat, Jan 20, 2024 at 10:27:33AM -0800, Ian Rogers wrote:
> On Tue, Jan 16, 2024 at 9:04 AM Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > Currently the perf tool doesn't deteect support for extneded event types
> 
> nit: s/deteect/detect/
> nit: s/extneded/extended/

> > Thus is_event_supported() will fail to detect support for any events
> > targetting an Apple M1/M2 PMU, even where events would be supported with
> 
> nit: s/targetting/targeting/

> > This patch updates is_event_supported() to additionally try opening
> > events with perf_event_attr::exclude_guest set, allowing support for
> > events to be detected on Apple M1/M2 systems. I beleive that this is
> 
> nit: s/beleive/believe/

Whoops; I've fixed those in my local tree now.

[...]

> > Hector, Marc, I'd appreciate if either of you could give this a spin on
> > your M1/M2 machines. I've given it local testing with the arm_pmuv3
> > driver modified to behave the same as the apple_m1_pmu driver (requiring
> > exclude_guest, having a 'cycles' event in sysfs), but that might not
> > perfectly replicate your setup.
> >
> > The patch is based on the 'perf-tools-for-v6.8-1-2024-01-09' tag in the
> > perf-tools tree:
> >
> >   https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools.git/
> >
> > ... and I've pushed it out to the 'perf-tools/event-supported-filters'
> > branch in my tree:
> >
> >   https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/
> >
> > This patch *should* make it possible to do:
> >
> >         perf stat -e cycles ./workload
> >         perf stat -e instructions ./workload
> >
> > ... with those 'cycles' and 'instructions' events being automatically
> > expanded and reported as separate events per-PMU, which is a nice
> > quality-of-life improvement.
> >
> > Comparing before and after this patch:
> >
> > | # ./perf-before stat -e cycles true
> > |
> > |  Performance counter stats for 'true':
> > |
> > |      <not counted>      cycles                                                                  (0.00%)
> > |
> > |        0.000990250 seconds time elapsed
> > |
> > |        0.000934000 seconds user
> > |        0.000000000 seconds sys
> > |
> > | # ./perf-after stat -e cycles true
> > |
> > |  Performance counter stats for 'true':
> > |
> > |             965175      armv8_pmuv3_0/cycles/
> > |      <not counted>      armv8_pmuv3_1/cycles/                                                   (0.00%)
> > |      <not counted>      armv8_pmuv3_2/cycles/                                                   (0.00%)
> > |      <not counted>      armv8_pmuv3_3/cycles/                                                   (0.00%)
> > |
> > |        0.000836555 seconds time elapsed
> > |
> > |        0.000884000 seconds user
> > |        0.000000000 seconds sys
> 
> Just to check, this is the expected expansion of cycles? I'm unclear
> why 4 core PMUs.

Yep; I had a fake big.LITTLE setup with four distinct microarchitectures (one
per CPU), so the expansion above is expected. I had meant to explain that in
the notes along with the other driver modifications, but I forgot, sorry!

> > This *shouldn't* change the interpetation of named-pmu events, e.g.
> >
> >         perf stat -e apple_whichever_pmu/cycles/ ./workload
> >
> > ... should behave the same as without this patch
> >
> > Comparing before and after this patch:
> >
> > | # ./perf-before stat -e armv8_pmuv3_0/cycles/ -e armv8_pmuv3_1/cycles/ -e armv8_pmuv3_2/cycles/ -e armv8_pmuv3_3/cycles/ true
> > |
> > |  Performance counter stats for 'true':
> > |
> > |      <not counted>      armv8_pmuv3_0/cycles/                                                   (0.00%)
> > |      <not counted>      armv8_pmuv3_1/cycles/                                                   (0.00%)
> > |      <not counted>      armv8_pmuv3_2/cycles/                                                   (0.00%)
> > |             901415      armv8_pmuv3_3/cycles/
> > |
> > |        0.000756590 seconds time elapsed
> > |
> > |        0.000811000 seconds user
> > |        0.000000000 seconds sys
> > |
> > | # ./perf-after stat -e armv8_pmuv3_0/cycles/ -e armv8_pmuv3_1/cycles/ -e armv8_pmuv3_2/cycles/ -e armv8_pmuv3_3/cycles/ true
> > |
> > |  Performance counter stats for 'true':
> > |
> > |             923314      armv8_pmuv3_0/cycles/
> > |      <not counted>      armv8_pmuv3_1/cycles/                                                   (0.00%)
> > |      <not counted>      armv8_pmuv3_2/cycles/                                                   (0.00%)
> > |      <not counted>      armv8_pmuv3_3/cycles/                                                   (0.00%)
> > |
> > |        0.000782420 seconds time elapsed
> > |
> > |        0.000836000 seconds user
> > |        0.000000000 seconds sys
> >
> > One thing I'm still looing into is that this doesn't seem to do anything
> > for a default perf stat session, e.g.
> >
> >         perf stat ./workload
> >
> > ... doesn't automatically expand the implicitly-created events into per-pmu
> > events.
> 
> Ugh, weak symbols. x86 has overridden the default adding of attributes
> to do it for hybrid:
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/arch/x86/util/evlist.c?h=perf-tools-next#n36
> I think we should lose the adding events via attributes and just go to
> using parse events for everything. I'll see if I can do some cleanup
> and that should resolve this - I also want to cleanup the default
> events/metrics and the detailed ones as we can drop the unsupported
> events, etc.

Ok; so IIUC we can treat that as a separate problem? I'm happy to test/review
patches there.

> > Comparing before and after this patch:
> >
> > | # ./perf-before stat true
> > |
> > |  Performance counter stats for 'true':
> > |
> > |               0.42 msec task-clock                       #    0.569 CPUs utilized
> > |                  0      context-switches                 #    0.000 /sec
> > |                  0      cpu-migrations                   #    0.000 /sec
> > |                 38      page-faults                      #   89.796 K/sec
> > |      <not counted>      cycles                                                                  (0.00%)
> > |      <not counted>      instructions                                                            (0.00%)
> > |      <not counted>      branches                                                                (0.00%)
> > |      <not counted>      branch-misses                                                           (0.00%)
> > |
> > |        0.000744185 seconds time elapsed
> > |
> > |        0.000795000 seconds user
> > |        0.000000000 seconds sys
> > |
> > | # ./perf-after stat true
> > |
> > |  Performance counter stats for 'true':
> > |
> > |               0.43 msec task-clock                       #    0.582 CPUs utilized
> > |                  0      context-switches                 #    0.000 /sec
> > |                  0      cpu-migrations                   #    0.000 /sec
> > |                 38      page-faults                      #   88.960 K/sec
> > |      <not counted>      cycles                                                                  (0.00%)
> > |      <not counted>      instructions                                                            (0.00%)
> > |      <not counted>      branches                                                                (0.00%)
> > |      <not counted>      branch-misses                                                           (0.00%)
> > |
> > |        0.000734120 seconds time elapsed
> > |
> > |        0.000786000 seconds user
> > |        0.000000000 seconds sys
> >
> > Ian, how does that behave on x86? Is that the same, or do the default
> > events get expanded?
> 
> The default events are expanded, the not counted is a feature of a
> fast binary (true here). I'm trying to remove custom code paths so
> that things like this don't happen, sorry that you've come across
> another instance but at least I can fix it.

Huh; I'm surprised that works with the named-pmu events, since that's the same
'true' binary.

Is there anything I should go look at for that?

Mark.
  
Mark Rutland Jan. 24, 2024, 3:51 p.m. UTC | #9
On Sat, Jan 20, 2024 at 10:29:54AM -0800, Ian Rogers wrote:
> On Sat, Jan 20, 2024 at 10:27 AM Ian Rogers <irogers@google.com> wrote:
> >
> > On Tue, Jan 16, 2024 at 9:04 AM Mark Rutland <mark.rutland@arm.com> wrote:
> > >
> > > Currently the perf tool doesn't deteect support for extneded event types
> >
> > nit: s/deteect/detect/
> > nit: s/extneded/extended/
> >
> > > on Apple M1/M2 systems, and will not auto-expand plain PERF_EVENT_TYPE
> > > hardware events into per-PMU events. This is due to the detection of
> > > extended event types not handling mandatory filters required by the
> > > M1/M2 PMU driver.
> > >
> > > PMU drivers and the core perf_events code can require that
> > > perf_event_attr::exclude_* filters are configured in a specific way and
> > > may reject certain configurations of filters, for example:
> > >
> > > (a) Many PMUs lack support for any event filtering, and require all
> > >     perf_event_attr::exclude_* bits to be clear. This includes Alpha's
> > >     CPU PMU, and ARM CPU PMUs prior to the introduction of PMUv2 in
> > >     ARMv7,
> > >
> > > (b) When /proc/sys/kernel/perf_event_paranoid >= 2, the perf core
> > >     requires that perf_event_attr::exclude_kernel is set.
> > >
> > > (c) The Apple M1/M2 PMU requires that perf_event_attr::exclude_guest is
> > >     set as the hardware PMU does not count while a guest is running (but
> > >     might be extended in future to do so).
> > >
> > > In is_event_supported(), we try to account for cases (a) and (b), first
> > > attempting to open an event without any filters, and if this fails,
> > > retrying with perf_event_attr::exclude_kernel set. We do not account for
> > > case (c), or any other filters that drivers could theoretically require
> > > to be set.
> > >
> > > Thus is_event_supported() will fail to detect support for any events
> > > targetting an Apple M1/M2 PMU, even where events would be supported with
> >
> > nit: s/targetting/targeting/
> >
> > > perf_event_attr:::exclude_guest set.
> > >
> > > Since commit:
> > >
> > >   82fe2e45cdb00de4 ("perf pmus: Check if we can encode the PMU number in perf_event_attr.type")
> > >
> > > ... we use is_event_supported() to detect support for extended types,
> > > with the PMU ID encoded into the perf_event_attr::type. As above, on an
> > > Apple M1/M2 system this will always fail to detect that the event is
> > > supported, and consequently we fail to detect support for extended types
> > > even when these are supported, as they have been since commit:
> > >
> > >   5c816728651ae425 ("arm_pmu: Add PERF_PMU_CAP_EXTENDED_HW_TYPE capability")
> > >
> > > Due to this, the perf tool will not automatically expand plain
> > > PERF_TYPE_HARDWARE events into per-PMU events, even when all the
> > > necessary kernel support is present.
> > >
> > > This patch updates is_event_supported() to additionally try opening
> > > events with perf_event_attr::exclude_guest set, allowing support for
> > > events to be detected on Apple M1/M2 systems. I beleive that this is
> >
> > nit: s/beleive/believe/
> >
> > > sufficient for all contemporary CPU PMU drivers, though in future it may
> > > be necessary to check for other combinations of filter bits.
> > >
> > > I've deliberately changed the check to not expect a specific error code
> > > for missing filters, as today ;the kernel may return a number of
> > > different error codes for missing filters (e.g. -EACCESS, -EINVAL, or
> > > -EOPNOTSUPP) depending on why and where the filter configuration is
> > > rejected, and retrying for any error is more robust.
> > >
> > > Note that this does not remove the need for commit:
> > >
> > >   a24d9d9dc096fc0d ("perf parse-events: Make legacy events lower priority than sysfs/JSON")
> > >
> > > ... which is still necessary so that named-pmu/event/ events work on
> > > kernels without extended type support, even if the event name happens to
> > > be the same as a PERF_EVENT_TYPE_HARDWARE event (e.g. as is the case for
> > > the M1/M2 PMU's 'cycles' and 'instructions' events).
> > >
> > > Fixes: 82fe2e45cdb00de4 ("perf pmus: Check if we can encode the PMU number in perf_event_attr.type")
> > > Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> > > Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> > > Cc: Hector Martin <marcan@marcan.st>
> > > Cc: Ian Rogers <irogers@google.com>
> > > Cc: James Clark <james.clark@arm.com>
> > > Cc: John Garry <john.g.garry@oracle.com>
> > > Cc: Leo Yan <leo.yan@linaro.org>
> > > Cc: Marc Zyngier <maz@kernel.org>
> > > Cc: Mike Leach <mike.leach@linaro.org>
> > > Cc: Namhyung Kim <namhyung@kernel.org>
> > > Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
> > > Cc: Thomas Richter <tmricht@linux.ibm.com>
> > > Cc: Will Deacon <will@kernel.org>
> 
> Also:
> Tested-by: Ian Rogers <irogers@google.com>
> 
> No regressions on Alderlake except a pre-existing problem I noted here:
> https://lore.kernel.org/lkml/CAP-5=fWVQ-7ijjK3-w1q+k2WYVNHbAcejb-xY0ptbjRw476VKA@mail.gmail.com/

Cheers; I've folded that tag in along with the typo fixes.

I'll send a v2 tomorrow assuming there's nothing else that crops up.

Mark.
  
Mark Rutland Jan. 24, 2024, 3:53 p.m. UTC | #10
On Mon, Jan 22, 2024 at 10:43:27AM +0000, James Clark wrote:
> 
> 
> On 16/01/2024 17:03, Mark Rutland wrote:
> > Currently the perf tool doesn't deteect support for extneded event types
> > on Apple M1/M2 systems, and will not auto-expand plain PERF_EVENT_TYPE
> > hardware events into per-PMU events. This is due to the detection of
> > extended event types not handling mandatory filters required by the
> > M1/M2 PMU driver.
> > 
> > PMU drivers and the core perf_events code can require that
> > perf_event_attr::exclude_* filters are configured in a specific way and
> > may reject certain configurations of filters, for example:
> > 
> > (a) Many PMUs lack support for any event filtering, and require all
> >     perf_event_attr::exclude_* bits to be clear. This includes Alpha's
> >     CPU PMU, and ARM CPU PMUs prior to the introduction of PMUv2 in
> >     ARMv7,
> > 
> > (b) When /proc/sys/kernel/perf_event_paranoid >= 2, the perf core
> >     requires that perf_event_attr::exclude_kernel is set.
> > 
> > (c) The Apple M1/M2 PMU requires that perf_event_attr::exclude_guest is
> >     set as the hardware PMU does not count while a guest is running (but
> >     might be extended in future to do so).
> > 
> > In is_event_supported(), we try to account for cases (a) and (b), first
> > attempting to open an event without any filters, and if this fails,
> > retrying with perf_event_attr::exclude_kernel set. We do not account for
> > case (c), or any other filters that drivers could theoretically require
> > to be set.
> > 
> > Thus is_event_supported() will fail to detect support for any events
> > targetting an Apple M1/M2 PMU, even where events would be supported with
> > perf_event_attr:::exclude_guest set.
> > 
> > Since commit:
> > 
> >   82fe2e45cdb00de4 ("perf pmus: Check if we can encode the PMU number in perf_event_attr.type")
> > 
> > ... we use is_event_supported() to detect support for extended types,
> > with the PMU ID encoded into the perf_event_attr::type. As above, on an
> > Apple M1/M2 system this will always fail to detect that the event is
> > supported, and consequently we fail to detect support for extended types
> > even when these are supported, as they have been since commit:
> > 
> >   5c816728651ae425 ("arm_pmu: Add PERF_PMU_CAP_EXTENDED_HW_TYPE capability")
> > 
> > Due to this, the perf tool will not automatically expand plain
> > PERF_TYPE_HARDWARE events into per-PMU events, even when all the
> > necessary kernel support is present.
> > 
> > This patch updates is_event_supported() to additionally try opening
> > events with perf_event_attr::exclude_guest set, allowing support for
> > events to be detected on Apple M1/M2 systems. I beleive that this is
> > sufficient for all contemporary CPU PMU drivers, though in future it may
> > be necessary to check for other combinations of filter bits.
> > 
> > I've deliberately changed the check to not expect a specific error code
> > for missing filters, as today ;the kernel may return a number of
> > different error codes for missing filters (e.g. -EACCESS, -EINVAL, or
> > -EOPNOTSUPP) depending on why and where the filter configuration is
> > rejected, and retrying for any error is more robust.
> > 
> > Note that this does not remove the need for commit:
> > 
> >   a24d9d9dc096fc0d ("perf parse-events: Make legacy events lower priority than sysfs/JSON")
> > 
> > ... which is still necessary so that named-pmu/event/ events work on
> > kernels without extended type support, even if the event name happens to
> > be the same as a PERF_EVENT_TYPE_HARDWARE event (e.g. as is the case for
> > the M1/M2 PMU's 'cycles' and 'instructions' events).
> > 
> > Fixes: 82fe2e45cdb00de4 ("perf pmus: Check if we can encode the PMU number in perf_event_attr.type")
> > Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> > Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> > Cc: Hector Martin <marcan@marcan.st>
> > Cc: Ian Rogers <irogers@google.com>
> > Cc: James Clark <james.clark@arm.com>
> > Cc: John Garry <john.g.garry@oracle.com>
> > Cc: Leo Yan <leo.yan@linaro.org>
> > Cc: Marc Zyngier <maz@kernel.org>
> > Cc: Mike Leach <mike.leach@linaro.org>
> > Cc: Namhyung Kim <namhyung@kernel.org>
> > Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
> > Cc: Thomas Richter <tmricht@linux.ibm.com>
> > Cc: Will Deacon <will@kernel.org>
> > ---
> >  tools/perf/util/print-events.c | 27 +++++++++++++++++++--------
> >  1 file changed, 19 insertions(+), 8 deletions(-)
> > 
> 
> Tested-by: James Clark <james.clark@arm.com>
> 
> Tested on Juno and N1SDP, although I wouldn't have expected it to make a
> difference on those platforms because they support exclude_guest=0.

It's good to be certain, anyhow!

I've folded that tag in for v2.

> Although I do see an interaction with the test "Session topology" if I
> hack the driver to behave like M1. The test has been failing (on
> big.LITTLE) since commit 251aa040244a ("perf parse-events: Wildcard most
> "numeric" events") but the result is that the test actually starts
> passing with this change. I don't think that should really block this
> though, it's likely going to require a separate fix which I will look into.

IIUC that fix is:

  https://lore.kernel.org/lkml/20240124094358.489372-1-james.clark@arm.com/

.. right?

Mark.
  
Mark Rutland Jan. 24, 2024, 4:05 p.m. UTC | #11
On Thu, Jan 18, 2024 at 09:57:32PM -0800, Namhyung Kim wrote:
> Hello,
> 
> Adding Ravi to CC.
> 
> On Wed, Jan 17, 2024 at 4:12 AM Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > On Wed, Jan 17, 2024 at 09:05:25AM +0000, Marc Zyngier wrote:
> > > Hi Mark,
> > >
> > > On Tue, 16 Jan 2024 17:03:48 +0000,
> > > Mark Rutland <mark.rutland@arm.com> wrote:
> > > >
> > > > Currently the perf tool doesn't deteect support for extneded event types
> > > > on Apple M1/M2 systems, and will not auto-expand plain PERF_EVENT_TYPE
> > > > hardware events into per-PMU events. This is due to the detection of
> > > > extended event types not handling mandatory filters required by the
> > > > M1/M2 PMU driver.
> > >
> > > Thanks for looking into this.
> > >
> > > I've given your patch a go on my M1 box, and it indeed makes things
> > > substantially better:
> > >
> > > $ sudo ./perf stat -e cycles ~/hackbench 100 process 1000
> > > Running with 100*40 (== 4000) tasks.
> > > Time: 3.419
> > >
> > >  Performance counter stats for '/home/maz/hackbench 100 process 1000':
> > >
> > >    174,783,472,090      apple_firestorm_pmu/cycles/                                             (93.10%)
> > >     39,134,744,813      apple_icestorm_pmu/cycles/                                              (71.86%)
> > >
> > >        3.568145595 seconds time elapsed
> > >
> > >       12.203084000 seconds user
> > >       55.135271000 seconds sys
> >
> > Thanks for giving that a spin!
> >
> > > However, I'm seeing some slightly odd behaviours:
> > >
> > > $ sudo ./perf stat -e cycles:k ~/hackbench 100 process 1000
> > > Running with 100*40 (== 4000) tasks.
> > > Time: 3.313
> > >
> > >  Performance counter stats for '/home/maz/hackbench 100 process 1000':
> > >
> > >    <not supported>      apple_firestorm_pmu/cycles:k/
> > >    <not supported>      apple_icestorm_pmu/cycles:k/
> 
> Hmm.. I guess this should look like apple_firestorm_pmu/cycles/k.

Yeah, the ":k" within the slashes isn't right, and I suspect that's due to the
way the event gets expanded.

Ian, is that something you're aware of already?

> IIRC there was a thread for this, right?

I'm not aware of a thread for the way filters get applied to expanded events.

> > >
> > >        3.467568841 seconds time elapsed
> > >
> > >       13.080111000 seconds user
> > >       53.162099000 seconds sys
> > >
> > > I would have expected it to count, but it didn't. For that to work, I
> > > have to add the 'H' modifier:
> >
> > Ok, so that'll have something to do with the way the tool chooses which
> > perf_evant_attr::exclude_* bits to set. I thought that was the same for plain
> > events and pmu_name/event/ events, but I could be mistaken.
> 
> I think it sets the attr.exclude_guest by event_attr_init().  Maybe
> it's deleted during the missing feature detection logic.  But IIUC
> it should work on each PMU separately.

I'll try to look into this a bit more.

> By the way, I really hope the kernel exports caps/exclude_bits
> for PMUs so that tools can see which bits are supported.  For
> example AMD IBS has CAP_NO_EXCLUDE so setting exclude_guest
> will fail to open.  Then it disables the new features added after
> that in the missing feature detection logic.

I'm ok in principle with exposing some info on the supported exclude_*
configuration, but it's worth noting that event where a PMU supports specific
filters, it might not support all combinations of filters for all events. For
example, s390 doesn't support exclude_* filters on RAW events, and doesn't
support kernel-only counters. Given that, I'm not sure how we can expose that
in a useful way.

Mark.

> If we know if it doesn't support any exclude bits, then tools can try other
> features after removing the bit first.
> 
> Thanks,
> Namhyung
> 
> >
> > Is that something you had tried prior to this patch, and did that "just work"
> > with the explicit pmu_name/event/ syntax prior to this patch?
> >
> > e.g. did something like:
> >
> >         perf stat -e apple_firestorm_pmu/cycles/k -e apple_icestorm_pmu/cycles/k ./workload
> >
> > ... happen to work withiout requiring the addition of 'H'?
> >
> > If so, does that behave the same before/after this patch?
> >
> > ... and could you run that with '-vvv' and dump the output for comparison?
> >
> > Thanks,
> > Mark.
  
James Clark Jan. 24, 2024, 4:19 p.m. UTC | #12
On 24/01/2024 15:53, Mark Rutland wrote:
> On Mon, Jan 22, 2024 at 10:43:27AM +0000, James Clark wrote:
>>
>>
>> On 16/01/2024 17:03, Mark Rutland wrote:
>>> Currently the perf tool doesn't deteect support for extneded event types
>>> on Apple M1/M2 systems, and will not auto-expand plain PERF_EVENT_TYPE
>>> hardware events into per-PMU events. This is due to the detection of
>>> extended event types not handling mandatory filters required by the
>>> M1/M2 PMU driver.
>>>
>>> PMU drivers and the core perf_events code can require that
>>> perf_event_attr::exclude_* filters are configured in a specific way and
>>> may reject certain configurations of filters, for example:
>>>
>>> (a) Many PMUs lack support for any event filtering, and require all
>>>     perf_event_attr::exclude_* bits to be clear. This includes Alpha's
>>>     CPU PMU, and ARM CPU PMUs prior to the introduction of PMUv2 in
>>>     ARMv7,
>>>
>>> (b) When /proc/sys/kernel/perf_event_paranoid >= 2, the perf core
>>>     requires that perf_event_attr::exclude_kernel is set.
>>>
>>> (c) The Apple M1/M2 PMU requires that perf_event_attr::exclude_guest is
>>>     set as the hardware PMU does not count while a guest is running (but
>>>     might be extended in future to do so).
>>>
>>> In is_event_supported(), we try to account for cases (a) and (b), first
>>> attempting to open an event without any filters, and if this fails,
>>> retrying with perf_event_attr::exclude_kernel set. We do not account for
>>> case (c), or any other filters that drivers could theoretically require
>>> to be set.
>>>
>>> Thus is_event_supported() will fail to detect support for any events
>>> targetting an Apple M1/M2 PMU, even where events would be supported with
>>> perf_event_attr:::exclude_guest set.
>>>
>>> Since commit:
>>>
>>>   82fe2e45cdb00de4 ("perf pmus: Check if we can encode the PMU number in perf_event_attr.type")
>>>
>>> ... we use is_event_supported() to detect support for extended types,
>>> with the PMU ID encoded into the perf_event_attr::type. As above, on an
>>> Apple M1/M2 system this will always fail to detect that the event is
>>> supported, and consequently we fail to detect support for extended types
>>> even when these are supported, as they have been since commit:
>>>
>>>   5c816728651ae425 ("arm_pmu: Add PERF_PMU_CAP_EXTENDED_HW_TYPE capability")
>>>
>>> Due to this, the perf tool will not automatically expand plain
>>> PERF_TYPE_HARDWARE events into per-PMU events, even when all the
>>> necessary kernel support is present.
>>>
>>> This patch updates is_event_supported() to additionally try opening
>>> events with perf_event_attr::exclude_guest set, allowing support for
>>> events to be detected on Apple M1/M2 systems. I beleive that this is
>>> sufficient for all contemporary CPU PMU drivers, though in future it may
>>> be necessary to check for other combinations of filter bits.
>>>
>>> I've deliberately changed the check to not expect a specific error code
>>> for missing filters, as today ;the kernel may return a number of
>>> different error codes for missing filters (e.g. -EACCESS, -EINVAL, or
>>> -EOPNOTSUPP) depending on why and where the filter configuration is
>>> rejected, and retrying for any error is more robust.
>>>
>>> Note that this does not remove the need for commit:
>>>
>>>   a24d9d9dc096fc0d ("perf parse-events: Make legacy events lower priority than sysfs/JSON")
>>>
>>> ... which is still necessary so that named-pmu/event/ events work on
>>> kernels without extended type support, even if the event name happens to
>>> be the same as a PERF_EVENT_TYPE_HARDWARE event (e.g. as is the case for
>>> the M1/M2 PMU's 'cycles' and 'instructions' events).
>>>
>>> Fixes: 82fe2e45cdb00de4 ("perf pmus: Check if we can encode the PMU number in perf_event_attr.type")
>>> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
>>> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
>>> Cc: Hector Martin <marcan@marcan.st>
>>> Cc: Ian Rogers <irogers@google.com>
>>> Cc: James Clark <james.clark@arm.com>
>>> Cc: John Garry <john.g.garry@oracle.com>
>>> Cc: Leo Yan <leo.yan@linaro.org>
>>> Cc: Marc Zyngier <maz@kernel.org>
>>> Cc: Mike Leach <mike.leach@linaro.org>
>>> Cc: Namhyung Kim <namhyung@kernel.org>
>>> Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
>>> Cc: Thomas Richter <tmricht@linux.ibm.com>
>>> Cc: Will Deacon <will@kernel.org>
>>> ---
>>>  tools/perf/util/print-events.c | 27 +++++++++++++++++++--------
>>>  1 file changed, 19 insertions(+), 8 deletions(-)
>>>
>>
>> Tested-by: James Clark <james.clark@arm.com>
>>
>> Tested on Juno and N1SDP, although I wouldn't have expected it to make a
>> difference on those platforms because they support exclude_guest=0.
> 
> It's good to be certain, anyhow!
> 
> I've folded that tag in for v2.
> 
>> Although I do see an interaction with the test "Session topology" if I
>> hack the driver to behave like M1. The test has been failing (on
>> big.LITTLE) since commit 251aa040244a ("perf parse-events: Wildcard most
>> "numeric" events") but the result is that the test actually starts
>> passing with this change. I don't think that should really block this
>> though, it's likely going to require a separate fix which I will look into.
> 
> IIUC that fix is:
> 
>   https://lore.kernel.org/lkml/20240124094358.489372-1-james.clark@arm.com/
> 
> ... right?
> 
> Mark.
> 

Yes that's it. Turned out to be unrelated. Or only semi related and not
blocking for your patch.
  
Mark Rutland Jan. 26, 2024, 2:36 p.m. UTC | #13
On Wed, Jan 17, 2024 at 12:12:05PM +0000, Mark Rutland wrote:
> On Wed, Jan 17, 2024 at 09:05:25AM +0000, Marc Zyngier wrote:
> > However, I'm seeing some slightly odd behaviours:

I believe that this is a separate issue; info dump below.
> > $ sudo ./perf stat -e cycles:k ~/hackbench 100 process 1000
> > Running with 100*40 (== 4000) tasks.
> > Time: 3.313
> > 
> >  Performance counter stats for '/home/maz/hackbench 100 process 1000':
> > 
> >    <not supported>      apple_firestorm_pmu/cycles:k/                                         
> >    <not supported>      apple_icestorm_pmu/cycles:k/                                          
> > 
> >        3.467568841 seconds time elapsed
> > 
> >       13.080111000 seconds user
> >       53.162099000 seconds sys
> > 
> > I would have expected it to count, but it didn't. For that to work, I
> > have to add the 'H' modifier:

I gave that a spin with the aforementioned hacked-up PMUv3 driver, and I see
the same:

| # ./perf-after stat -e cycles true
| 
|  Performance counter stats for 'true':
| 
|      <not counted>      armv8_pmuv3_0/cycles/                                                   (0.00%)
|            1375271      armv8_pmuv3_1/cycles/                                                 
| 
|        0.001153070 seconds time elapsed
| 
|        0.001204000 seconds user
|        0.000000000 seconds sys
| 
| 
| # ./perf-after stat -e cycles:k true
| 
|  Performance counter stats for 'true':
| 
|    <not supported>      armv8_pmuv3_0/cycles:k/                                               
|    <not supported>      armv8_pmuv3_1/cycles:k/                                               
| 
|        0.000983130 seconds time elapsed
| 
|        0.001037000 seconds user
|        0.000000000 seconds sys
| 
| 
| # ./perf-after stat -e cycles:kH true
| 
|  Performance counter stats for 'true':
| 
|      <not counted>      armv8_pmuv3_0/cycles:kH/                                                (0.00%)
|             932067      armv8_pmuv3_1/cycles:kH/                                              
| 
|        0.001090100 seconds time elapsed
| 
|        0.001125000 seconds user
|        0.000000000 seconds sys

.. though interestingly 'cycles:u' works:

| # ./perf-after stat -e cycles:u true
| 
|  Performance counter stats for 'true':
| 
|             369753      armv8_pmuv3_0/cycles:u/                                               
|      <not counted>      armv8_pmuv3_1/cycles:u/                                                 (0.00%)
| 
|        0.001171980 seconds time elapsed
| 
|        0.001245000 seconds user
|        0.000000000 seconds sys

Looking at the output with '-vvv' the perf tool implicitly sets exclude_guest
for 'cycles', 'cycles:u', and 'cycles:kH', but does not set exclude_guest for
'cycles:k'.

It looks like that's consistent with the behaviour of opening separate events
prior to this patch:

| # ./perf-before stat -e armv8_pmuv3_0/cycles/ -e armv8_pmuv3_1/cycles/ true
| 
|  Performance counter stats for 'true':
| 
|            1407624      armv8_pmuv3_0/cycles/                                                 
|      <not counted>      armv8_pmuv3_1/cycles/                                                   (0.00%)
| 
|        0.001179205 seconds time elapsed
| 
|        0.001217000 seconds user
|        0.000000000 seconds sys
| 
| 
| # ./perf-before stat -e armv8_pmuv3_0/cycles/u -e armv8_pmuv3_1/cycles/u true
| 
|  Performance counter stats for 'true':
| 
|             329212      armv8_pmuv3_0/cycles/u                                                
|      <not counted>      armv8_pmuv3_1/cycles/u                                                  (0.00%)
| 
|        0.001050550 seconds time elapsed
| 
|        0.001081000 seconds user
|        0.000000000 seconds sys
| 
| 
| # ./perf-before stat -e armv8_pmuv3_0/cycles/k -e armv8_pmuv3_1/cycles/k true
| 
|  Performance counter stats for 'true':
| 
|    <not supported>      armv8_pmuv3_0/cycles/k                                                
|    <not supported>      armv8_pmuv3_1/cycles/k                                                
| 
|        0.000944285 seconds time elapsed
| 
|        0.000985000 seconds user
|        0.000000000 seconds sys
| 
| 
| # ./perf-before stat -e armv8_pmuv3_0/cycles/kH -e armv8_pmuv3_1/cycles/kH true
| 
|  Performance counter stats for 'true':
| 
|            1016160      armv8_pmuv3_0/cycles/kH                                               
|      <not counted>      armv8_pmuv3_1/cycles/kH                                                 (0.00%)
| 
|        0.001179220 seconds time elapsed
| 
|        0.001239000 seconds user
|        0.000000000 seconds sys

.. and per '-vvv', exclude_guest is set in the same cases.

I agree it's a bit weird that the tool sets exclude_guest for unfilted and ':u'
events, but not ':k' events, but it looks like that's separate from the way
events get expanded.

Thanks,
Mark.
  

Patch

diff --git a/tools/perf/util/print-events.c b/tools/perf/util/print-events.c
index b0fc48be623f3..4f67e8f00a4d6 100644
--- a/tools/perf/util/print-events.c
+++ b/tools/perf/util/print-events.c
@@ -232,7 +232,6 @@  void print_sdt_events(const struct print_callbacks *print_cb, void *print_state)
 bool is_event_supported(u8 type, u64 config)
 {
 	bool ret = true;
-	int open_return;
 	struct evsel *evsel;
 	struct perf_event_attr attr = {
 		.type = type,
@@ -246,20 +245,32 @@  bool is_event_supported(u8 type, u64 config)
 
 	evsel = evsel__new(&attr);
 	if (evsel) {
-		open_return = evsel__open(evsel, NULL, tmap);
-		ret = open_return >= 0;
+		ret = evsel__open(evsel, NULL, tmap) >= 0;
 
-		if (open_return == -EACCES) {
+		if (!ret) {
 			/*
-			 * This happens if the paranoid value
+			 * The event may fail to open if the paranoid value
 			 * /proc/sys/kernel/perf_event_paranoid is set to 2
-			 * Re-run with exclude_kernel set; we don't do that
-			 * by default as some ARM machines do not support it.
-			 *
+			 * Re-run with exclude_kernel set; we don't do that by
+			 * default as some ARM machines do not support it.
 			 */
 			evsel->core.attr.exclude_kernel = 1;
 			ret = evsel__open(evsel, NULL, tmap) >= 0;
 		}
+
+		if (!ret) {
+			/*
+			 * The event may fail to open if the PMU requires
+			 * exclude_guest to be set (e.g. as the Apple M1 PMU
+			 * requires).
+			 * Re-run with exclude_guest set; we don't do that by
+			 * default as it's equally legitimate for another PMU
+			 * driver to require that exclude_guest is clear.
+			 */
+			evsel->core.attr.exclude_guest = 1;
+			ret = evsel__open(evsel, NULL, tmap) >= 0;
+		}
+
 		evsel__delete(evsel);
 	}