[2/6] perf: Support branch events logging

Message ID 20230410204352.1098067-2-kan.liang@linux.intel.com
State New
Series [1/6] perf/x86/intel: Add Grand Ridge and Sierra Forest

Commit Message

Liang, Kan April 10, 2023, 8:43 p.m. UTC
  From: Kan Liang <kan.liang@linux.intel.com>

With the cycle time information between branches, stalls can be easily
observed. But it's difficult to explain what causes the long delay.

Add a new field to collect the occurrences of events since the last
branch entry, which can be used to provide some causality information
for the cycle time values currently recorded in branches.

Add a new branch sample type to indicate whether to include the
occurrences of events in the branch info.

Only support up to 4 events, with each count saturating at a value of 3.
In the current kernel, the events are ordered by either the counter
index or the enabling sequence, but none of that order information is
available to the user-space tool.
Add a new PERF_SAMPLE format, PERF_SAMPLE_BRANCH_EVENT_IDS, and generic
support to dump the event IDs of the branch events.
Add a helper function to detect the branch event flag.
These will be used in the following patch.

Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 include/linux/perf_event.h      | 26 ++++++++++++++++++++++++++
 include/uapi/linux/perf_event.h | 22 ++++++++++++++++++++--
 kernel/events/core.c            | 23 +++++++++++++++++++++++
 3 files changed, 69 insertions(+), 2 deletions(-)
  

Comments

Peter Zijlstra April 14, 2023, 10:38 a.m. UTC | #1
On Mon, Apr 10, 2023 at 01:43:48PM -0700, kan.liang@linux.intel.com wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> With the cycle time information between branches, stalls can be easily
> observed. But it's difficult to explain what causes the long delay.
> 
> Add a new field to collect the occurrences of events since the last
> branch entry, which can be used to provide some causality information
> for the cycle time values currently recorded in branches.
> 
> Add a new branch sample type to indicate whether include occurrences of
> events in branch info.
> 
> Only support up to 4 events with saturating at value 3.
> In the current kernel, the events are ordered by either the counter
> index or the enabling sequence. But none of the order information is
> available to the user space tool.
> Add a new PERF_SAMPLE format, PERF_SAMPLE_BRANCH_EVENT_IDS, and generic
> support to dump the event IDs of the branch events.
> Add a helper function to detect the branch event flag.
> These will be used in the following patch.

I'm having trouble reverse engineering this. Can you more coherently
explain this feature and how you've implemented it?
  
Liang, Kan April 14, 2023, 1:35 p.m. UTC | #2
On 2023-04-14 6:38 a.m., Peter Zijlstra wrote:
> On Mon, Apr 10, 2023 at 01:43:48PM -0700, kan.liang@linux.intel.com wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> With the cycle time information between branches, stalls can be easily
>> observed. But it's difficult to explain what causes the long delay.
>>
>> Add a new field to collect the occurrences of events since the last
>> branch entry, which can be used to provide some causality information
>> for the cycle time values currently recorded in branches.
>>
>> Add a new branch sample type to indicate whether include occurrences of
>> events in branch info.
>>
>> Only support up to 4 events with saturating at value 3.
>> In the current kernel, the events are ordered by either the counter
>> index or the enabling sequence. But none of the order information is
>> available to the user space tool.
>> Add a new PERF_SAMPLE format, PERF_SAMPLE_BRANCH_EVENT_IDS, and generic
>> support to dump the event IDs of the branch events.
>> Add a helper function to detect the branch event flag.
>> These will be used in the following patch.
> 
> I'm having trouble reverse engineering this. Can you more coherently
> explain this feature and how you've implemented it?

Sorry for that.

The feature is an enhancement of ARCH LBR. It adds new fields in the
LBR_INFO MSRs to log the occurrences of events on the first 4 GP
counters. Combined with the existing timed LBR feature, the user can
understand not only the latency between two LBR blocks, but also which
events cause the stall.

The spec can be found in the latest Intel® Architecture Instruction Set
Extensions and Future Features document, v048, Chapter 8.4.
https://cdrdv2.intel.com/v1/dl/getContent/671368

To support the feature, there are three main ABI changes.
- A new branch sample type, PERF_SAMPLE_BRANCH_EVENT, is used as a knob
to enable the feature.
- Extend the struct perf_branch_entry layout, because we have to save
and pass the occurrences of events to user space. Since the information
is only available for 4 counters, each saturating at a value of 3, it
only occupies 8 bits. For the current Intel implementation, the fields
are in counter order.
- Add a new PERF_SAMPLE format, PERF_SAMPLE_BRANCH_EVENT_IDS, to dump
the order information. The user-space tool doesn't know the order of
counters, so it cannot map the new fields in struct perf_branch_entry to
a specific event. We have to dump the order information.
I once considered using the enabling order to avoid this new sample
format. It works for some cases, e.g., a group, but it doesn't work for
some complex cases, e.g., multiplexing, in which the enabling order
keeps changing.
Ideally, we should dump the order information for each LBR entry, but
that would include too much duplicate information. So the order
information is only dumped once per sample. The drawback is that we have
to flush/update the old LBR entries once the events are rescheduled
between samples, e.g., with multiplexing, because it's possible that the
new sample can still see the stale LBR entries. That's specially handled
in the next Intel-specific patch.

For the current implementation, the perf tool has to set both
PERF_SAMPLE_BRANCH_EVENT and PERF_SAMPLE_BRANCH_EVENT_IDS to enable the
feature.
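
As a rough illustration (not part of this patch set), a minimal
user-space sketch of what setting both knobs could look like, assuming a
uapi header that already carries the PERF_SAMPLE_BRANCH_EVENT and
PERF_SAMPLE_BRANCH_EVENT_IDS bits added below:

  #include <linux/perf_event.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static int open_branch_event_logging(void)
  {
          struct perf_event_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.size = sizeof(attr);
          attr.type = PERF_TYPE_HARDWARE;
          attr.config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS;
          attr.sample_period = 100000;
          /* dump the branch stack plus the new event-ID array */
          attr.sample_type = PERF_SAMPLE_BRANCH_STACK |
                             PERF_SAMPLE_BRANCH_EVENT_IDS;
          /* log event occurrences into each branch entry */
          attr.branch_sample_type = PERF_SAMPLE_BRANCH_ANY |
                                    PERF_SAMPLE_BRANCH_EVENT;

          /* pid 0 = current task, cpu -1 = any CPU, no group, no flags */
          return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
  }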

Thanks,
Kan
  
Peter Zijlstra April 14, 2023, 2:53 p.m. UTC | #3
On Fri, Apr 14, 2023 at 09:35:37AM -0400, Liang, Kan wrote:
> 
> 
> On 2023-04-14 6:38 a.m., Peter Zijlstra wrote:
> > On Mon, Apr 10, 2023 at 01:43:48PM -0700, kan.liang@linux.intel.com wrote:
> >> From: Kan Liang <kan.liang@linux.intel.com>
> >>
> >> With the cycle time information between branches, stalls can be easily
> >> observed. But it's difficult to explain what causes the long delay.
> >>
> >> Add a new field to collect the occurrences of events since the last
> >> branch entry, which can be used to provide some causality information
> >> for the cycle time values currently recorded in branches.
> >>
> >> Add a new branch sample type to indicate whether include occurrences of
> >> events in branch info.
> >>
> >> Only support up to 4 events with saturating at value 3.
> >> In the current kernel, the events are ordered by either the counter
> >> index or the enabling sequence. But none of the order information is
> >> available to the user space tool.
> >> Add a new PERF_SAMPLE format, PERF_SAMPLE_BRANCH_EVENT_IDS, and generic
> >> support to dump the event IDs of the branch events.
> >> Add a helper function to detect the branch event flag.
> >> These will be used in the following patch.
> > 
> > I'm having trouble reverse engineering this. Can you more coherently
> > explain this feature and how you've implemented it?
> 
> Sorry for that.
> 
> The feature is an enhancement of ARCH LBR. It adds new fields in the
> LBR_INFO MSRs to log the occurrences of events on the first 4 GP
> counters. Worked with the previous timed LBR feature together, the user
> can understand not only the latency between two LBR blocks, but also
> which events causes the stall.
> 
> The spec can be found at the latest Intel® Architecture Instruction Set
> Extensions and Future Features, v048. Chapter 8.4.
> https://cdrdv2.intel.com/v1/dl/getContent/671368

Oh gawd; that's terse. Why can't these people write comprehensible
things :/ It's almost as if they don't want this stuff to be used.

So IA32_LBR_x_INFO is extended:

  [0:15]	CYC_CNT
  [16:31]	undefined
+ [32:33]	PMC0_CNT
+ [34:35]	PMC1_CNT
+ [36:37]	PMC2_CNT
+ [38:39]	PMC3_CNT
+ [40:41]	PMC4_CNT
+ [42:43]	PMC5_CNT
+ [44:45]	PMC6_CNT
+ [46:47]	PMC7_CNT
  [48:55]	undefined
  [56:59]	BR_TYPE
  [60]		CYC_CNT_VALID
  [61]		TSX_ABORT

Where the PMCx_CNT fields are saturating counters for the respective
PMCs. And we'll run out of bits if we get more than 12 PMCs. Is SMT=n
PMC merging still a thing?

And for some reason this counting is enabled in PERFEVTSELx[35] instead
of in LBR_CTL somewhere :/
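
For illustration only, a decode helper following the bit layout sketched
above (the helper name is made up; nothing here is from the SDM or the
patch):

  /* extract the 2-bit saturating event count of PMC 'pmc' (0-7) from an
   * IA32_LBR_x_INFO value, per the layout above; 3 means "3 or more" */
  static inline unsigned int lbr_info_pmc_cnt(unsigned long long info, int pmc)
  {
          return (info >> (32 + 2 * pmc)) & 0x3;
  }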

> To support the feature, there are three main changes in ABIs.
> - A new branch sample type, PERF_SAMPLE_BRANCH_EVENT, is used as a knob
> to enable the feature.

> - Extend the struct perf_branch_entry layout, because we have to save
> and pass the occurrences of events to user space. Since it's only
> available for 4 counters and saturating at value 3, it only occupies 8
> bits. For the current Intel implementation, the order is the order of
> counters.

Only for 4? Where does it say that? If it were to only support 4, then
we're in counter scheduling constraint hell again and we need to somehow
group all these things together with the LBR event.

  @@ -1410,6 +1423,10 @@ union perf_mem_data_src {
  *    cycles: cycles from last branch (or 0 if not supported)
  *      type: branch type
  *      spec: branch speculation info (or 0 if not supported)
  + *      events: occurrences of events since the last branch entry.
  + *              The fields can store up to 4 events with saturating
  + *              at value 3.
  + *              (or 0 if not supported)
  */
  struct perf_branch_entry {
	  __u64   from;
  @@ -1423,7 +1440,8 @@ struct perf_branch_entry {
		  spec:2,     /* branch speculation info */
		  new_type:4, /* additional branch type */
		  priv:3,     /* privilege level */
  -               reserved:31;
  +               events:8,   /* occurrences of events since the last branch entry */
  +               reserved:23;
  };

  union perf_sample_weight {

This seems properly terrible from an interface pov. What if the next
generation of silicon extends this to all 8 PMCs or another architecture
comes along that does this with 3 bits per counter etc...

> - Add a new PERF_SAMPLE format, PERF_SAMPLE_BRANCH_EVENT_IDS, to dump
> the order information. User space tool doesn't understand the order of
> counters. So it cannot map the new fields in struct perf_branch_entry to
> a specific event. We have to dump the order information.

Sorry; I can't parse this.
  
Liang, Kan April 14, 2023, 3:56 p.m. UTC | #4
On 2023-04-14 10:53 a.m., Peter Zijlstra wrote:
> On Fri, Apr 14, 2023 at 09:35:37AM -0400, Liang, Kan wrote:
>>
>>
>> On 2023-04-14 6:38 a.m., Peter Zijlstra wrote:
>>> On Mon, Apr 10, 2023 at 01:43:48PM -0700, kan.liang@linux.intel.com wrote:
>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>
>>>> With the cycle time information between branches, stalls can be easily
>>>> observed. But it's difficult to explain what causes the long delay.
>>>>
>>>> Add a new field to collect the occurrences of events since the last
>>>> branch entry, which can be used to provide some causality information
>>>> for the cycle time values currently recorded in branches.
>>>>
>>>> Add a new branch sample type to indicate whether include occurrences of
>>>> events in branch info.
>>>>
>>>> Only support up to 4 events with saturating at value 3.
>>>> In the current kernel, the events are ordered by either the counter
>>>> index or the enabling sequence. But none of the order information is
>>>> available to the user space tool.
>>>> Add a new PERF_SAMPLE format, PERF_SAMPLE_BRANCH_EVENT_IDS, and generic
>>>> support to dump the event IDs of the branch events.
>>>> Add a helper function to detect the branch event flag.
>>>> These will be used in the following patch.
>>>
>>> I'm having trouble reverse engineering this. Can you more coherently
>>> explain this feature and how you've implemented it?
>>
>> Sorry for that.
>>
>> The feature is an enhancement of ARCH LBR. It adds new fields in the
>> LBR_INFO MSRs to log the occurrences of events on the first 4 GP
>> counters. Worked with the previous timed LBR feature together, the user
>> can understand not only the latency between two LBR blocks, but also
>> which events causes the stall.
>>
>> The spec can be found at the latest Intel® Architecture Instruction Set
>> Extensions and Future Features, v048. Chapter 8.4.
>> https://cdrdv2.intel.com/v1/dl/getContent/671368
> 
> Oh gawd; that's terse. Why can't these people write comprehensible
> things :/ It's almost as if they don't want this stuff to be used.
> 
> So IA32_LBR_x_INFO is extended:
> 
>   [0:15]	CYC_CNT
>   [16:31]	undefined
> + [32:33]	PMC0_CNT
> + [34:35]	PMC1_CNT
> + [36:37]	PMC2_CNT
> + [38:39]	PMC3_CNT
> + [40:41]	PMC4_CNT
> + [42:43]	PMC5_CNT
> + [44:45]	PMC6_CNT
> + [46:47]	PMC7_CNT
>   [48:55]	undefined
>   [56:59]	BR_TYPE
>   [60]		CYC_CNT_VALID
>   [61]		TSX_ABORT
> 
> Where the PMCx_CNT fields are saturating counters for the respective
> PMCs. And we'll run out of bits if we get more than 12 PMCs. Is SMT=n
> PMC merging still a thing?
> 
> And for some reason this counting is enabled in PERFEVTSELx[35] instead
> of in LBR_CTL somewhere :/
> 
>> To support the feature, there are three main changes in ABIs.
>> - A new branch sample type, PERF_SAMPLE_BRANCH_EVENT, is used as a knob
>> to enable the feature.
> 
>> - Extend the struct perf_branch_entry layout, because we have to save
>> and pass the occurrences of events to user space. Since it's only
>> available for 4 counters and saturating at value 3, it only occupies 8
>> bits. For the current Intel implementation, the order is the order of
>> counters.
> 
> Only for 4? Where does it say that? 

"Per-counter support for LBR Event Logging is indicated by the “Event
Logging Supported” bitmap in
CPUID.(EAX=01CH, ECX=0).ECX[19:16]"

There are only 4 bits to indicate the supported counters.

> If it were to only support 4, then
> we're in counter scheduling contraint hell again 

Unfortunately, yes.

> and we need to somehow
> group all these things together with the LBR event.

Grouping will bring many limits to the usage. For example, I was told
there could be someone who wants to use it with multiplexing.

> 
>   @@ -1410,6 +1423,10 @@ union perf_mem_data_src {
>   *    cycles: cycles from last branch (or 0 if not supported)
>   *      type: branch type
>   *      spec: branch speculation info (or 0 if not supported)
>   + *      events: occurrences of events since the last branch entry.
>   + *              The fields can store up to 4 events with saturating
>   + *              at value 3.
>   + *              (or 0 if not supported)
>   */
>   struct perf_branch_entry {
> 	  __u64   from;
>   @@ -1423,7 +1440,8 @@ struct perf_branch_entry {
> 		  spec:2,     /* branch speculation info */
> 		  new_type:4, /* additional branch type */
> 		  priv:3,     /* privilege level */
>   -               reserved:31;
>   +               events:8,   /* occurrences of events since the last branch entry */
>   +               reserved:23;
>   };
> 
>   union perf_sample_weight {
> 
> This seems properly terrible from an interface pov. What if the next
> generation of silicon extends this to all 8 PMCs or another architecture
> comes along that does this with 3 bits per counter etc...

OK. The reserved space is not enough anymore. I think we have to add
several new fields. I will redesign it.

> 
>> - Add a new PERF_SAMPLE format, PERF_SAMPLE_BRANCH_EVENT_IDS, to dump
>> the order information. User space tool doesn't understand the order of
>> counters. So it cannot map the new fields in struct perf_branch_entry to
>> a specific event. We have to dump the order information.
> 
> Sorry; I can't parse this.

The perf tool has no idea which physical counter is assigned to an event.
The HW has no idea about an event; it only logs the information from
counter 0 into IA32_LBR_x_INFO[32:33]. If we pass the information from
IA32_LBR_x_INFO[32:33] to the perf tool, the perf tool lacks the
knowledge to connect that information to an event. So we have to dump
the event IDs at the same time.
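
A rough user-space sketch of consuming that dump; the struct name and the
assumption that ids[] is stored in counter order are mine, based only on
the "{ u64 nr; u64 ids[nr]; }" layout documented in the patch below:

  struct branch_event_ids {
          unsigned long long nr;          /* number of logged counters */
          unsigned long long ids[];       /* sample ID of the event on counter i */
  };

  static unsigned long long
  branch_event_id_for_counter(const struct branch_event_ids *e,
                              unsigned int counter)
  {
          /* field 'counter' of perf_branch_entry::events belongs to this event */
          return counter < e->nr ? e->ids[counter] : 0;
  }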

Thanks,
Kan
  
Peter Zijlstra April 14, 2023, 4:09 p.m. UTC | #5
On Fri, Apr 14, 2023 at 11:56:41AM -0400, Liang, Kan wrote:
> > If it were to only support 4, then
> > we're in counter scheduling contraint hell again 
> 
> Unfortunately, yes.
> 
> > and we need to somehow
> > group all these things together with the LBR event.
> 
> Group will bring many limits for the usage. For example, I was told
> there could be someone wants to use it with multiplexing.

You can create two groups, each with an LBR event, no?
  
Liang, Kan April 14, 2023, 5:53 p.m. UTC | #6
On 2023-04-14 12:09 p.m., Peter Zijlstra wrote:
> On Fri, Apr 14, 2023 at 11:56:41AM -0400, Liang, Kan wrote:
>>> If it were to only support 4, then
>>> we're in counter scheduling contraint hell again 
>>
>> Unfortunately, yes.
>>
>>> and we need to somehow
>>> group all these things together with the LBR event.
>>
>> Group will bring many limits for the usage. For example, I was told
>> there could be someone wants to use it with multiplexing.
> 
> You can create two groups, each with an LBR event, no?

If we put everything in a group, that will make the enabling much
simpler. I don't think the perf tool needs the order information
anymore, because the kernel enables the events one by one in a group.
The kernel just needs to convert the information from the counter order
to the enabling order and dump it to user space.
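
A hedged sketch of that counter-order to enabling-order conversion, using
the helpers added by the patch below; event_to_counter[] is a made-up
stand-in for the real event->hw.idx lookup:

  static unsigned long long
  remap_counter_to_enabling_order(unsigned long long counter_order_bits,
                                  const int *event_to_counter, int nr_events)
  {
          unsigned long long out = 0;
          int i;

          /* field i of the result describes the i-th event in enabling order */
          for (i = 0; i < nr_events && i < PERF_MAX_BRANCH_EVENTS; i++) {
                  unsigned long long val =
                          perf_branch_event_by_idx(counter_order_bits,
                                                   event_to_counter[i]);

                  out |= val << (i * PERF_BRANCH_EVENTS_STEP);
          }
          return out;
  }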

But if we have two groups with an LBR event, the order information is
still required. Why would we still want to group things?


Thanks,
Kan
  
Peter Zijlstra April 14, 2023, 7:24 p.m. UTC | #7
On Fri, Apr 14, 2023 at 01:53:24PM -0400, Liang, Kan wrote:
> 
> 
> On 2023-04-14 12:09 p.m., Peter Zijlstra wrote:
> > On Fri, Apr 14, 2023 at 11:56:41AM -0400, Liang, Kan wrote:
> >>> If it were to only support 4, then
> >>> we're in counter scheduling contraint hell again 
> >>
> >> Unfortunately, yes.
> >>
> >>> and we need to somehow
> >>> group all these things together with the LBR event.
> >>
> >> Group will bring many limits for the usage. For example, I was told
> >> there could be someone wants to use it with multiplexing.
> > 
> > You can create two groups, each with an LBR event, no?
> 
> If we put everything in a group, that will make the enabling much
> simpler. I don't think the perf tool needs the order information
> anymore. Because the kernel enables the events one by one in a group.
> The kernel just need to convert the information from the counter order
> to the enabling order and dump to user space.

I never understood the whole order thing. What was it trying to do?

> But if we have two groups with LBR event, the order information is still
> required. Why we still want to group things?

Why would you need that; what is that whole order nonsense about?

{e1, e2, e3, e4}, {e5, e6, e7, e8} with e1 and e5 both having LBR on
just works no?

Since they have LBR and that extra sample flag they all get a 0-3
constraint.

Since both e1 and e5 use LBR, they're mutually exclusive, either e1 or
e5 group runs.
  
Liang, Kan April 14, 2023, 8:34 p.m. UTC | #8
On 2023-04-14 3:24 p.m., Peter Zijlstra wrote:
> On Fri, Apr 14, 2023 at 01:53:24PM -0400, Liang, Kan wrote:
>>
>>
>> On 2023-04-14 12:09 p.m., Peter Zijlstra wrote:
>>> On Fri, Apr 14, 2023 at 11:56:41AM -0400, Liang, Kan wrote:
>>>>> If it were to only support 4, then
>>>>> we're in counter scheduling contraint hell again 
>>>>
>>>> Unfortunately, yes.
>>>>
>>>>> and we need to somehow
>>>>> group all these things together with the LBR event.
>>>>
>>>> Group will bring many limits for the usage. For example, I was told
>>>> there could be someone wants to use it with multiplexing.
>>>
>>> You can create two groups, each with an LBR event, no?
>>
>> If we put everything in a group, that will make the enabling much
>> simpler. I don't think the perf tool needs the order information
>> anymore. Because the kernel enables the events one by one in a group.
>> The kernel just need to convert the information from the counter order
>> to the enabling order and dump to user space.
> 
> I never understood the whole order thing. What was it trying to do?

Let's say we have three events with the LBR event logging feature as below.
    perf record -e branches,branches,instructions:ppp -j any,event

Counter 0 will be assigned to instructions:ppp, since PDist is only
supported on GP counters 0 & 1.
Counters 1 & 2 will be assigned to the other two branches events.

If each branches event occurs 1 time and instructions occurs 3 times in
an LBR block, the LBR_INFO will contain 0b010111 (counter order).

But as you can see from the perf command, the first event is actually
branches. Without the event ID information, the perf tool will interpret
it as branches = 3, branches = 1, and instructions:ppp = 1. That's wrong.

If there are multiple users, the situation becomes even worse.
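
To make the counter-order pitfall concrete, a small hedged sketch that
decodes that value; the macro mirrors perf_branch_event_by_idx() from the
patch below, and the event-to-counter mapping is the one from the example
above:

  #include <stdio.h>

  /* same as the perf_branch_event_by_idx() macro added by the patch */
  #define perf_branch_event_by_idx(_events, _idx) \
          (((_events) >> ((_idx) * 2)) & 0x3)

  int main(void)
  {
          unsigned long long events = 0x17;       /* 0b010111, counter order */

          printf("counter 0: %llu\n", perf_branch_event_by_idx(events, 0)); /* 3 -> instructions:ppp */
          printf("counter 1: %llu\n", perf_branch_event_by_idx(events, 1)); /* 1 -> branches */
          printf("counter 2: %llu\n", perf_branch_event_by_idx(events, 2)); /* 1 -> branches */
          return 0;
  }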

> 
>> But if we have two groups with LBR event, the order information is still
>> required. Why we still want to group things?
> 
> Why would you need that; what is that whole order nonsense about?
> 
> {e1, e2, e3, e4}, {e5, e6, e7, e8} with e1 and e5 both having LBR on
> just works no?
> 
> Since they have LBR and that extra sample flag they all get a 0-3
> constraint.
> 
> Since both e1 and e5 use LBR, they're mutually exclusive, either e1 or
> e5 group runs.

It's possible that someone pins an event using LBR and sets more than 4
events for logging, e.g., e0:D,{e1, e2},{e3, e4},{e5, e6}. If so, those
events could be multiplexed. Without the event ID information, the perf
tool has no idea how to interpret the information.


Andi, do you have any other cases which require multiplexing support for
LBR event logging?


Thanks,
Kan
  
Peter Zijlstra April 14, 2023, 10:01 p.m. UTC | #9
On Fri, Apr 14, 2023 at 04:34:45PM -0400, Liang, Kan wrote:

> > I never understood the whole order thing. What was it trying to do?
> 
> Let's say we have three events with the LBR event logging feature as below.
>     perf record -e branches,branches,instructions:ppp -j any,event
> 
> The counter 0 will be assigned to instructions:ppp, since the PDist is
> only supported on GP 0 & 1.
> The count 1 & 2 will be assigned to the other two branches.
> 
> If branches occurs 1 time and instructions occurs 3 times in a LBR
> block, the LBR_INFO will have 0b010111 (counter order).
> 
> But as you can see from the perf command, the first event is actually
> branches. Without the event IDs information, perf tool will interpret
> that branches 3 branches 1 and instructions:ppp 1. That's wrong.
> 
> If there are multiple users, the situation becomes even worse.

But this makes no sense whatsoever; in this case you have no control
over what events you'll actually get in your LBR. It could be that none
of the events you're interested in end up in 0-3 but instead land on 4-7.

> >> But if we have two groups with LBR event, the order information is still
> >> required. Why we still want to group things?
> > 
> > Why would you need that; what is that whole order nonsense about?
> > 
> > {e1, e2, e3, e4}, {e5, e6, e7, e8} with e1 and e5 both having LBR on
> > just works no?
> > 
> > Since they have LBR and that extra sample flag they all get a 0-3
> > constraint.
> > 
> > Since both e1 and e5 use LBR, they're mutually exclusive, either e1 or
> > e5 group runs.
> 
> It's possible that someone pins an event using LBR, and set more than 4
> events for logging, e0:D,{e1, e2},{e3, e4},{e5, e6}. If so, those events
> could do multiplexing. Without the event IDs information, perf tool has
> no idea how to interpret the information.

Yeah, don't do this. There is no guarantee whatsoever you'll get any
of those events in the 0-3 range.

You really *must* make them a group such that perf knows what events to
associate with the LBR event and constrain them to the 0-3 range of
PMCs.

If you want multiplexing, simply create multiple groups with an LBR
event in them.
  
Andi Kleen April 14, 2023, 10:47 p.m. UTC | #10
> Yeah, don't do this. There is no guarantee what so ever you'll get any
> of those events in the 0-3 range.


The kernel can simply force to 0-3 if LBR is enabled and the feature too.
It's in Kan's patch and it isn't particularly complicated.

>
> You really *must* make then a group such that perf knows what events to
> associated with the LBR event and constain them to the 0-3 range of
> PMCs.
>
> If you want multiplexing, simply create multiple groups with an LBR
> event in them.


Well, if you force groups then you require user space or a user which
understands all the constraints to create groups. I thought one of the
basic ideas of perf was to be able to abstract those things.

There are tools today (e.g. the perf builtin metrics[1] and I believe
others) that don't have enough knowledge to set up real groups and rely
on the kernel.  While they could be fixed it's a lot of work (I do it in
pmu-tools, and it's quite hairy code).


-Andi


[1] They are not supported by perf record yet, but since perf script
already supports evaluating them they could be at some point, and then
it would make sense to use them with this feature too.
  
Peter Zijlstra April 17, 2023, 11:46 a.m. UTC | #11
On Fri, Apr 14, 2023 at 03:47:29PM -0700, Andi Kleen wrote:
> 
> > Yeah, don't do this. There is no guarantee what so ever you'll get any
> > of those events in the 0-3 range.
> 
> 
> The kernel can simply force to 0-3 if LBR is enabled and the feature too.
> It's in Kan's patch
> 
> and it isn't particularly complicated.

And what, totally leave 4-7 unused even if those counters were not
related to LBR at all? That seems exceedingly daft.
  
Peter Zijlstra April 17, 2023, 11:55 a.m. UTC | #12
On Fri, Apr 14, 2023 at 03:47:29PM -0700, Andi Kleen wrote:

> > You really *must* make then a group such that perf knows what events to
> > associated with the LBR event and constain them to the 0-3 range of
> > PMCs.
> > 
> > If you want multiplexing, simply create multiple groups with an LBR
> > event in them.
> 
> 
> Well if you force groups then you require user space or a user which
> understands all the constraints

The LBR feature is naturally a group read, very similar to
PERF_SAMPLE_READ+PERF_FORMAT_GROUP.

> to create groups. I thought one of the basic ideas of perf was to be able to
> abstract those things.

Yeah, the abstraction at hand is a group.
  
Andi Kleen April 17, 2023, 1:37 p.m. UTC | #13
On 4/17/2023 4:46 AM, Peter Zijlstra wrote:
> On Fri, Apr 14, 2023 at 03:47:29PM -0700, Andi Kleen wrote:
>>> Yeah, don't do this. There is no guarantee what so ever you'll get any
>>> of those events in the 0-3 range.
>>
>> The kernel can simply force to 0-3 if LBR is enabled and the feature too.
>> It's in Kan's patch
>>
>> and it isn't particularly complicated.
> And what, totally leave 4-7 unused even if those counters were not
> related to LBR at all? That seems exceedingly daft.


Only for the events which enabled LBR, and also only if the branch
events feature is enabled:

-j event -e '{event1:b,event2:b,event3:b,event4:b,event5,event6}'

event5 and 6 can go > 3.

There is currently no syntax to control branch events inside a group
other than fully enabling/disabling LBR. Kan, I guess that could be
added to the user tools.

-Andi
  
Andi Kleen April 17, 2023, 1:41 p.m. UTC | #14
On 4/17/2023 4:55 AM, Peter Zijlstra wrote:
>
>> to create groups. I thought one of the basic ideas of perf was to be able to
>> abstract those things.
> Yeah, the abstraction at hand is a group.

Okay, then if we add perf record metrics and want to support this
feature we'll need to replicate the kernel scheduler in the user tools.

-Andi
  
Liang, Kan April 17, 2023, 2:07 p.m. UTC | #15
On 2023-04-17 9:37 a.m., Andi Kleen wrote:
> 
> On 4/17/2023 4:46 AM, Peter Zijlstra wrote:
>> On Fri, Apr 14, 2023 at 03:47:29PM -0700, Andi Kleen wrote:
>>>> Yeah, don't do this. There is no guarantee what so ever you'll get any
>>>> of those events in the 0-3 range.
>>>
>>> The kernel can simply force to 0-3 if LBR is enabled and the feature
>>> too.
>>> It's in Kan's patch
>>>
>>> and it isn't particularly complicated.
>> And what, totally leave 4-7 unused even if those counters were not
>> related to LBR at all? That seems exceedingly daft.
> 
> 
> Only for the events which enabled LBR and also only if the branch events
> feature is enabled
> 
> -j event -e '{event1:b,event2:b,event3:b,event4:b,event5,event6}'
> 
> event5 and 6 can go > 3
> 
> Given there is currently no syntax to control branch events inside a
> group other than fully enabling/disabling LBR.
> 
> Kan, I guess that could be added to the user tools.

We already have a per-event option for LBR, branch_type, which can be
used to control branch events in a group. With the patch in this series,
we can do, e.g.,

-j call -e
'{cpu/event=0x1,branch_type=event,/,cpu/event=0x2,branch_type=event/,cpu/event=0x3,branch_type=event/,cpu/event=0x4,branch_type=event/,cpu/event=0x5/,cpu/event=0x6/}'


Thanks,
Kan
  

Patch

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d5628a7b5eaa..3b659a57129a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -126,6 +126,11 @@  struct perf_branch_stack {
 	struct perf_branch_entry	entries[];
 };
 
+struct perf_branch_event_ids {
+	__u64				nr;
+	__u64				ids[];
+};
+
 struct task_struct;
 
 /*
@@ -1127,6 +1132,11 @@  static inline bool branch_sample_priv(const struct perf_event *event)
 	return event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_PRIV_SAVE;
 }
 
+static inline bool branch_sample_event(const struct perf_event *event)
+{
+	return event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_EVENT;
+}
+
 
 struct perf_sample_data {
 	/*
@@ -1161,6 +1171,7 @@  struct perf_sample_data {
 	struct perf_callchain_entry	*callchain;
 	struct perf_raw_record		*raw;
 	struct perf_branch_stack	*br_stack;
+	struct perf_branch_event_ids	*br_event_ids;
 	union perf_sample_weight	weight;
 	union  perf_mem_data_src	data_src;
 	u64				txn;
@@ -1250,6 +1261,21 @@  static inline void perf_sample_save_brstack(struct perf_sample_data *data,
 	data->sample_flags |= PERF_SAMPLE_BRANCH_STACK;
 }
 
+static inline void perf_sample_save_event_ids(struct perf_sample_data *data,
+					      struct perf_branch_event_ids *ids)
+{
+	int size = sizeof(u64); /* nr */
+
+	if (WARN_ON_ONCE(ids->nr > PERF_MAX_BRANCH_EVENTS))
+		ids->nr = PERF_MAX_BRANCH_EVENTS;
+
+	size += ids->nr * sizeof(u64);
+
+	data->br_event_ids = ids;
+	data->dyn_size += size;
+	data->sample_flags |= PERF_SAMPLE_BRANCH_EVENT_IDS;
+}
+
 static inline u32 perf_sample_data_size(struct perf_sample_data *data,
 					struct perf_event *event)
 {
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 37675437b768..36d70717ecbd 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -162,8 +162,9 @@  enum perf_event_sample_format {
 	PERF_SAMPLE_DATA_PAGE_SIZE		= 1U << 22,
 	PERF_SAMPLE_CODE_PAGE_SIZE		= 1U << 23,
 	PERF_SAMPLE_WEIGHT_STRUCT		= 1U << 24,
+	PERF_SAMPLE_BRANCH_EVENT_IDS		= 1U << 25,
 
-	PERF_SAMPLE_MAX = 1U << 25,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 26,		/* non-ABI */
 };
 
 #define PERF_SAMPLE_WEIGHT_TYPE	(PERF_SAMPLE_WEIGHT | PERF_SAMPLE_WEIGHT_STRUCT)
@@ -204,6 +205,8 @@  enum perf_branch_sample_type_shift {
 
 	PERF_SAMPLE_BRANCH_PRIV_SAVE_SHIFT	= 18, /* save privilege mode */
 
+	PERF_SAMPLE_BRANCH_EVENT_SHIFT		= 19, /* save occurrences of events */
+
 	PERF_SAMPLE_BRANCH_MAX_SHIFT		/* non-ABI */
 };
 
@@ -235,6 +238,8 @@  enum perf_branch_sample_type {
 
 	PERF_SAMPLE_BRANCH_PRIV_SAVE	= 1U << PERF_SAMPLE_BRANCH_PRIV_SAVE_SHIFT,
 
+	PERF_SAMPLE_BRANCH_EVENT	= 1U << PERF_SAMPLE_BRANCH_EVENT_SHIFT,
+
 	PERF_SAMPLE_BRANCH_MAX		= 1U << PERF_SAMPLE_BRANCH_MAX_SHIFT,
 };
 
@@ -1018,6 +1023,8 @@  enum perf_event_type {
 	 *	  char			data[size]; } && PERF_SAMPLE_AUX
 	 *	{ u64			data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
 	 *	{ u64			code_page_size;} && PERF_SAMPLE_CODE_PAGE_SIZE
+	 *	{ u64			nr;
+	 *	  u64			ids[nr];} && PERF_SAMPLE_BRANCH_EVENT_IDS
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
@@ -1394,6 +1401,12 @@  union perf_mem_data_src {
 #define PERF_MEM_S(a, s) \
 	(((__u64)PERF_MEM_##a##_##s) << PERF_MEM_##a##_SHIFT)
 
+#define PERF_MAX_BRANCH_EVENTS		4
+#define PERF_BRANCH_EVENTS_MASK		0x3
+#define PERF_BRANCH_EVENTS_STEP		2
+
+#define perf_branch_event_by_idx(_events, _idx)		\
+	(((_events) >> ((_idx) * PERF_BRANCH_EVENTS_STEP)) & PERF_BRANCH_EVENTS_MASK)
 /*
  * single taken branch record layout:
  *
@@ -1410,6 +1423,10 @@  union perf_mem_data_src {
  *    cycles: cycles from last branch (or 0 if not supported)
  *      type: branch type
  *      spec: branch speculation info (or 0 if not supported)
+ *      events: occurrences of events since the last branch entry.
+ *              The fields can store up to 4 events with saturating
+ *              at value 3.
+ *              (or 0 if not supported)
  */
 struct perf_branch_entry {
 	__u64	from;
@@ -1423,7 +1440,8 @@  struct perf_branch_entry {
 		spec:2,     /* branch speculation info */
 		new_type:4, /* additional branch type */
 		priv:3,     /* privilege level */
-		reserved:31;
+		events:8,   /* occurrences of events since the last branch entry */
+		reserved:23;
 };
 
 union perf_sample_weight {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f79fd8b87f75..1ec7cc8b0730 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7401,6 +7401,23 @@  void perf_output_sample(struct perf_output_handle *handle,
 			perf_aux_sample_output(event, handle, data);
 	}
 
+	if (sample_type & PERF_SAMPLE_BRANCH_EVENT_IDS) {
+		if (data->br_event_ids) {
+			size_t size;
+
+			size = data->br_event_ids->nr * sizeof(u64);
+			perf_output_put(handle, data->br_event_ids->nr);
+			perf_output_copy(handle, data->br_event_ids->ids, size);
+		} else {
+			/*
+			 * we always store at least the value of nr
+			 */
+			u64 nr = 0;
+
+			perf_output_put(handle, nr);
+		}
+	}
+
 	if (!event->attr.watermark) {
 		int wakeup_events = event->attr.wakeup_events;
 
@@ -7747,6 +7764,12 @@  void perf_prepare_sample(struct perf_sample_data *data,
 		data->dyn_size += size + sizeof(u64); /* size above */
 		data->sample_flags |= PERF_SAMPLE_AUX;
 	}
+
+	if (filtered_sample_type & PERF_SAMPLE_BRANCH_EVENT_IDS) {
+		data->br_event_ids = NULL;
+		data->dyn_size += sizeof(u64);
+		data->sample_flags |= PERF_SAMPLE_BRANCH_EVENT_IDS;
+	}
 }
 
 void perf_prepare_header(struct perf_event_header *header,