[V4,4/7] perf/x86/intel: Support LBR event logging
Commit Message
From: Kan Liang <kan.liang@linux.intel.com>
The LBR event logging introduces a per-counter indication of precise
event occurrences in LBRs. It can provide a means to attribute exposed
retirement latency to combinations of events across a block of
instructions. It also provides a means of attributing Timed LBR
latencies to events.
The feature is first introduced on SRF/GRR. It is an enhancement of the
ARCH LBR. It adds new fields in the LBR_INFO MSRs to log the occurrences
of events on the GP counters. The information is recorded in counter
order.
The design proposed in this patch requires that the events to be logged
be in the same group as the event that has the LBR enabled. If there is
more than one LBR group, only the event logging information from the
current (overflowed) group is stored for the perf tool; otherwise the
perf tool cannot know which other groups were scheduled, and when,
especially once multiplexing is triggered. The user can ensure it uses
the maximum number of counters that support LBR info (4 for now) by
making the group large enough.
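For illustration, a user-space sketch of such a group follows. It is a
sketch only: PERF_SAMPLE_BRANCH_COUNTERS comes from the uapi change in
this series, and the raw event codes 0xc4/0xc5 and the sample period are
placeholder values, not part of this patch.

	#include <linux/perf_event.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static int perf_open(struct perf_event_attr *attr, int group_fd)
	{
		return syscall(SYS_perf_event_open, attr, 0, -1, group_fd, 0);
	}

	int open_branch_counters_group(void)
	{
		struct perf_event_attr leader = { 0 }, sibling = { 0 };
		int lfd, sfd;

		/* Sampling LBR event: the group leader. */
		leader.size = sizeof(leader);
		leader.type = PERF_TYPE_RAW;
		leader.config = 0xc4;			/* placeholder event code */
		leader.sample_period = 100000;
		leader.sample_type = PERF_SAMPLE_BRANCH_STACK;
		leader.branch_sample_type = PERF_SAMPLE_BRANCH_ANY |
					    PERF_SAMPLE_BRANCH_COUNTERS;
		leader.disabled = 1;

		/*
		 * Logged event: it only asks for the per-branch counters,
		 * so it needs no branch stack setup of its own.
		 */
		sibling.size = sizeof(sibling);
		sibling.type = PERF_TYPE_RAW;
		sibling.config = 0xc5;			/* placeholder event code */
		sibling.branch_sample_type = PERF_SAMPLE_BRANCH_COUNTERS;

		lfd = perf_open(&leader, -1);
		if (lfd < 0)
			return -1;
		sfd = perf_open(&sibling, lfd);		/* must be in the LBR group */
		return sfd < 0 ? -1 : lfd;
	}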
The HW logs events only in counter order, which may differ from the
enabled order that the perf tool understands. When parsing the
information of each branch entry, convert the counter order to the
enabled order, and store the enabled order in the extension space.
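For example (hypothetical values): the group's logged events run on GP
counters 3 and 1, enabled in that order. In LBR_INFO the HW records the
per-counter occurrences in counter order, say c3 c2 c1 c0 = 10 00 01 00
(binary). After the conversion, the extension space holds the enabled
order: slot 0 = c3 = 10, slot 1 = c1 = 01, i.e. 0110 (binary).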
Unconditionally reset the LBRs for an LBR event group when it is
deleted. The logged occurrence information is only valid for the current
LBR group. If another LBR group is scheduled later, the information from
the stale LBRs would otherwise be wrongly interpreted.
Add a sanity check in intel_pmu_hw_config(). Disable the feature if other
counter filters (inv, cmask, edge, in_tx) are set or LBR call stack mode
is enabled. (For the LBR call stack mode, we cannot simply flush the
LBR, since that would break the call stack. Also, there is no obvious
use case for the call stack mode for now.)
An event that applies only PERF_SAMPLE_BRANCH_COUNTERS doesn't require
any branch stack setup.
Expose the maximum number of supported counters and the counter width in
sysfs. The perf tool can use this information to parse the logged
counters in each branch.
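A sketch (not the actual perf tool code) of how the logged counters in
one branch entry could be unpacked with the two sysfs values, e.g.
branch_counter_nr = 4 and branch_counter_width = 2 on SRF/GRR. The u64
is assumed to already be converted to the enabled order as described
above.

	#include <stdint.h>
	#include <stdio.h>

	/* 'counters' is the value stored in a branch entry's extension space. */
	static void print_branch_counters(uint64_t counters,
					  unsigned int nr, unsigned int width)
	{
		uint64_t mask = (1ULL << width) - 1;
		unsigned int i;

		for (i = 0; i < nr; i++)
			printf("event %u: %llu occurrences\n", i,
			       (unsigned long long)((counters >> (i * width)) & mask));
	}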
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
Changes since V3
- Support the "branch_counter_nr" and "branch_counter_width"
- Support the PERF_SAMPLE_BRANCH_COUNTERS
arch/x86/events/intel/core.c | 91 +++++++++++++++++++++++++++--
arch/x86/events/intel/ds.c | 2 +-
arch/x86/events/intel/lbr.c | 94 +++++++++++++++++++++++++++++-
arch/x86/events/perf_event.h | 12 ++++
arch/x86/events/perf_event_flags.h | 1 +
arch/x86/include/asm/msr-index.h | 2 +
arch/x86/include/asm/perf_event.h | 4 ++
7 files changed, 198 insertions(+), 8 deletions(-)
Comments
On Wed, Oct 04, 2023 at 11:40:41AM -0700, kan.liang@linux.intel.com wrote:
> diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
> index c3b0d15a9841..1e80a551a4c2 100644
> --- a/arch/x86/events/intel/lbr.c
> +++ b/arch/x86/events/intel/lbr.c
> @@ -676,6 +676,21 @@ void intel_pmu_lbr_del(struct perf_event *event)
> WARN_ON_ONCE(cpuc->lbr_users < 0);
> WARN_ON_ONCE(cpuc->lbr_pebs_users < 0);
> perf_sched_cb_dec(event->pmu);
> +
> + /*
> + * The logged occurrences information is only valid for the
> + * current LBR group. If another LBR group is scheduled in
> + * later, the information from the stale LBRs will be wrongly
> + * interpreted. Reset the LBRs here.
> + * For the context switch, the LBR will be unconditionally
> + * flushed when a new task is scheduled in. If both the new task
> + * and the old task are monitored by a LBR event group. The
> + * reset here is redundant. But the extra reset doesn't impact
> + * the functionality. It's hard to distinguish the above case.
> + * Keep the unconditionally reset for a LBR event group for now.
> + */
I found this really hard to read, also should this not rely on
!cpuc->lbr_users ?
As is, you'll reset the lbr for every event in the group.
> + if (is_branch_counters_group(event))
> + intel_pmu_lbr_reset();
> }
On Wed, Oct 04, 2023 at 11:40:41AM -0700, kan.liang@linux.intel.com wrote:
> +
> static struct attribute *lbr_attrs[] = {
> &dev_attr_branches.attr,
> + &dev_attr_branch_counter_nr.attr,
> + &dev_attr_branch_counter_width.attr,
> NULL
> };
>
> @@ -5590,7 +5665,11 @@ mem_is_visible(struct kobject *kobj, struct attribute *attr, int i)
> static umode_t
> lbr_is_visible(struct kobject *kobj, struct attribute *attr, int i)
> {
> - return x86_pmu.lbr_nr ? attr->mode : 0;
> + /* branches */
> + if (i == 0)
> + return x86_pmu.lbr_nr ? attr->mode : 0;
> +
> + return (x86_pmu.flags & PMU_FL_LBR_EVENT) ? attr->mode : 0;
> }
As in the patch this is fairly readable, but I just checked and in the
code lbr_attrs and lbr_is_visible() are rather far away from one another
which makes the whole i thing hard to interpret.
Should we re-organize that a little?
On Wed, Oct 04, 2023 at 11:40:41AM -0700, kan.liang@linux.intel.com wrote:
> +#define ARCH_LBR_EVENT_LOG_WIDTH 2
> +#define ARCH_LBR_EVENT_LOG_MASK 0x3
event log ?
> +static __always_inline void intel_pmu_update_lbr_event(u64 *lbr_events, int idx, int pos)
> +{
> + u64 logs = *lbr_events >> (LBR_INFO_EVENTS_OFFSET +
> + idx * ARCH_LBR_EVENT_LOG_WIDTH);
> +
> + logs &= ARCH_LBR_EVENT_LOG_MASK;
> + *lbr_events |= logs << (pos * ARCH_LBR_EVENT_LOG_WIDTH);
> +}
> +
> +/*
> + * The enabled order may be different from the counter order.
> + * Update the lbr_events with the enabled order.
> + */
> +static void intel_pmu_lbr_event_reorder(struct cpu_hw_events *cpuc,
> + struct perf_event *event)
> +{
> + int i, j, pos = 0, enabled[X86_PMC_IDX_MAX];
> + struct perf_event *leader, *sibling;
> +
> + leader = event->group_leader;
> + if (branch_sample_counters(leader))
> + enabled[pos++] = leader->hw.idx;
> +
> + for_each_sibling_event(sibling, leader) {
> + if (!branch_sample_counters(sibling))
> + continue;
> + enabled[pos++] = sibling->hw.idx;
> + }
Ok, so far so good: enabled[x] = y, is a mapping of hardware index (y)
to group order (x).
Although I would perhaps name that order[] instead of enabled[].
> +
> + if (!pos)
> + return;
How would we ever get here if this is the case?
> +
> + for (i = 0; i < cpuc->lbr_stack.nr; i++) {
> + for (j = 0; j < pos; j++)
> + intel_pmu_update_lbr_event(&cpuc->lbr_events[i], enabled[j], j);
But this confuses me... per that function it:
- extracts counter value for enabled[j] and,
- or's it into the same variable at j
But what if j is already taken by something else?
That is, suppose enabled[] = {3,2,1,0}, and lbr_events = 11 10 01 00
Then: for (j) intel_pmu_update_lbt_event(&lbr_event, enabled[j], j);
0: 3->0, 11 10 01 00 -> 11 10 01 11
1: 2->1, 11 10 01 11 -> 11 10 11 11
2: 1->2, 11 10 11 11 -> 11 11 11 11
> +
> + /* Clear the original counter order */
> + cpuc->lbr_events[i] &= ~LBR_INFO_EVENTS;
> + }
> +}
Would not something like:
src = lbr_events[i];
dst = 0;
for (j = 0; j < pos; j++) {
cnt = (src >> enabled[j]*2) & 3;
dst |= cnt << j*2
}
lbr_events[i] = dst;
be *FAR* clearer, and actually work?
On Wed, Oct 04, 2023 at 11:40:41AM -0700, kan.liang@linux.intel.com wrote:
> +static ssize_t branch_counter_width_show(struct device *cdev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + return snprintf(buf, PAGE_SIZE, "2\n");
> +}
> +#define ARCH_LBR_EVENT_LOG_WIDTH 2
I'm assuming this is the same '2' ? And having it hard-coded in two
locations is awesome..
> +#define ARCH_LBR_EVENT_LOG_MASK 0x3
Should probably be ((1<<2)-1)
As per that other email, the naming is confusing, should this not be:
ARCH_LBR_EVENT_COUNTER_BITS
or, since it's all local to lbr.c something shorter still, like:
LBR_COUNTER_BITS
hmm?
On Wed, Oct 04, 2023 at 11:40:41AM -0700, kan.liang@linux.intel.com wrote:
> @@ -3905,6 +3915,44 @@ static int intel_pmu_hw_config(struct perf_event *event)
> if (needs_branch_stack(event) && is_sampling_event(event))
> event->hw.flags |= PERF_X86_EVENT_NEEDS_BRANCH_STACK;
>
> + if (branch_sample_counters(event)) {
> + struct perf_event *leader, *sibling;
> +
> + if (!(x86_pmu.flags & PMU_FL_LBR_EVENT) ||
> + (event->attr.config & ~INTEL_ARCH_EVENT_MASK))
> + return -EINVAL;
> +
> + /*
> + * The event logging is not supported in the call stack mode
> + * yet, since we cannot simply flush the LBR during e.g.,
> + * multiplexing. Also, there is no obvious usage with the call
> + * stack mode. Simply forbids it for now.
> + *
> + * If any events in the group enable the LBR event logging
> + * feature, the group is treated as a LBR event logging group,
> + * which requires the extra space to store the counters.
> + */
> + leader = event->group_leader;
> + if (branch_sample_call_stack(leader))
> + return -EINVAL;
> + leader->hw.flags |= PERF_X86_EVENT_BRANCH_COUNTERS;
(superfluous whitespace before operator)
> +
> + for_each_sibling_event(sibling, leader) {
> + if (branch_sample_call_stack(sibling))
> + return -EINVAL;
> + }
> +
> + /*
> + * Only applying the PERF_SAMPLE_BRANCH_COUNTERS doesn't
> + * require any branch stack setup.
> + * Clear the bit to avoid unnecessary branch stack setup.
> + */
> + if (0 == (event->attr.branch_sample_type &
> + ~(PERF_SAMPLE_BRANCH_PLM_ALL |
> + PERF_SAMPLE_BRANCH_COUNTERS)))
> + event->hw.flags &= ~PERF_X86_EVENT_NEEDS_BRANCH_STACK;
> + }
Does this / should this check the number of group members vs supported
number of lbr counters?
On Wed, Oct 04, 2023 at 11:40:41AM -0700, kan.liang@linux.intel.com wrote:
> +static __always_inline void get_lbr_events(struct cpu_hw_events *cpuc,
> + int i, u64 info)
> +{
> + /*
> + * The later code will decide what content can be disclosed
> + * to the perf tool. It's no harmful to unconditionally update
> + * the cpuc->lbr_events.
> + * Pleae see intel_pmu_lbr_event_reorder()
> + */
> + cpuc->lbr_events[i] = info & LBR_INFO_EVENTS;
> +}
You could be forcing an extra cachemiss here. A long time ago I had
hacks to profile perf with perf, but perhaps PT can be abused for that
now?
On 2023-10-19 5:23 a.m., Peter Zijlstra wrote:
> On Wed, Oct 04, 2023 at 11:40:41AM -0700, kan.liang@linux.intel.com wrote:
>
>> diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
>> index c3b0d15a9841..1e80a551a4c2 100644
>> --- a/arch/x86/events/intel/lbr.c
>> +++ b/arch/x86/events/intel/lbr.c
>> @@ -676,6 +676,21 @@ void intel_pmu_lbr_del(struct perf_event *event)
>> WARN_ON_ONCE(cpuc->lbr_users < 0);
>> WARN_ON_ONCE(cpuc->lbr_pebs_users < 0);
>> perf_sched_cb_dec(event->pmu);
>> +
>> + /*
>> + * The logged occurrences information is only valid for the
>> + * current LBR group. If another LBR group is scheduled in
>> + * later, the information from the stale LBRs will be wrongly
>> + * interpreted. Reset the LBRs here.
>> + * For the context switch, the LBR will be unconditionally
>> + * flushed when a new task is scheduled in. If both the new task
>> + * and the old task are monitored by a LBR event group. The
>> + * reset here is redundant. But the extra reset doesn't impact
>> + * the functionality. It's hard to distinguish the above case.
>> + * Keep the unconditionally reset for a LBR event group for now.
>> + */
>
> I found this really hard to read, also should this not rely on
> !cpuc->lbr_users ?
>
It's possible that the last LBR user is not in the branch_counters
group, e.g., a branch_counters group plus several normal LBR events.
In that case, is_branch_counters_group(event) returns false for the
last LBR user and the LBR will not be reset.
> As is, you'll reset the lbr for every event in the group.
>
>> + if (is_branch_counters_group(event))
>> + intel_pmu_lbr_reset();
>> }
Right, I forgot to change it after I modified the flag. :(
Here I think we should only clear the LBRs once for a branch_counters
group, e.g., on the leader event:
	if (is_branch_counters_group(event) && event == event->group_leader)
		intel_pmu_lbr_reset();
The only problem is that the leader event may not be an LBR event. But I
guess it should be OK to require in hw_config() that the leader event of
a branch_counters group is an LBR event.
Thanks,
Kan
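For illustration, one possible shape of the hw_config() restriction
mentioned above (a sketch, not the posted patch), reusing the existing
intel_pmu_needs_branch_stack() helper:

	/*
	 * Force the leader of a branch_counters group to be an LBR event,
	 * so the LBRs can be reset once, with the leader, in
	 * intel_pmu_lbr_del().
	 */
	if (branch_sample_counters(event) &&
	    !intel_pmu_needs_branch_stack(event->group_leader))
		return -EINVAL;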
On 2023-10-19 5:26 a.m., Peter Zijlstra wrote:
> On Wed, Oct 04, 2023 at 11:40:41AM -0700, kan.liang@linux.intel.com wrote:
>
>> +
>> static struct attribute *lbr_attrs[] = {
>> &dev_attr_branches.attr,
>> + &dev_attr_branch_counter_nr.attr,
>> + &dev_attr_branch_counter_width.attr,
>> NULL
>> };
>>
>> @@ -5590,7 +5665,11 @@ mem_is_visible(struct kobject *kobj, struct attribute *attr, int i)
>> static umode_t
>> lbr_is_visible(struct kobject *kobj, struct attribute *attr, int i)
>> {
>> - return x86_pmu.lbr_nr ? attr->mode : 0;
>> + /* branches */
>> + if (i == 0)
>> + return x86_pmu.lbr_nr ? attr->mode : 0;
>> +
>> + return (x86_pmu.flags & PMU_FL_LBR_EVENT) ? attr->mode : 0;
>> }
>
> As in the patch this is fairly readable, but I just checked and in the
> code lbr_attrs and lbr_is_visible() are rather far away from one another
> which makes the whole i thing hard to interpret.
>
> Should we re-organize that a little?
Sure, I will implement a separate patch to re-organize it.
It seems there are only two attribute groups which have both .attrs and
.is_visible: group_default and group_caps_lbr. I will re-organize both
of them.
Thanks,
Kan
On 2023-10-19 6:52 a.m., Peter Zijlstra wrote:
> On Wed, Oct 04, 2023 at 11:40:41AM -0700, kan.liang@linux.intel.com wrote:
>
>> +#define ARCH_LBR_EVENT_LOG_WIDTH 2
>> +#define ARCH_LBR_EVENT_LOG_MASK 0x3
>
> event log ?
That's the name in the Intel spec. I will change it to the name used in
Linux and add a comment mapping the spec's "event log" name to the
"branch counter" name.
>
>
>> +static __always_inline void intel_pmu_update_lbr_event(u64 *lbr_events, int idx, int pos)
>> +{
>> + u64 logs = *lbr_events >> (LBR_INFO_EVENTS_OFFSET +
>> + idx * ARCH_LBR_EVENT_LOG_WIDTH);
>> +
>> + logs &= ARCH_LBR_EVENT_LOG_MASK;
>> + *lbr_events |= logs << (pos * ARCH_LBR_EVENT_LOG_WIDTH);
>> +}
>> +
>> +/*
>> + * The enabled order may be different from the counter order.
>> + * Update the lbr_events with the enabled order.
>> + */
>> +static void intel_pmu_lbr_event_reorder(struct cpu_hw_events *cpuc,
>> + struct perf_event *event)
>> +{
>> + int i, j, pos = 0, enabled[X86_PMC_IDX_MAX];
>> + struct perf_event *leader, *sibling;
>> +
>> + leader = event->group_leader;
>> + if (branch_sample_counters(leader))
>> + enabled[pos++] = leader->hw.idx;
>> +
>> + for_each_sibling_event(sibling, leader) {
>> + if (!branch_sample_counters(sibling))
>> + continue;
>> + enabled[pos++] = sibling->hw.idx;
>> + }
>
> Ok, so far so good: enabled[x] = y, is a mapping of hardware index (y)
> to group order (x).
>
> Although I would perhaps name that order[] instead of enabled[].
Sure
>
>> +
>> + if (!pos)
>> + return;
>
> How would we ever get here if this is the case?
That would be a bug. I will replace it with a WARN_ON_ONCE().
>
>> +
>> + for (i = 0; i < cpuc->lbr_stack.nr; i++) {
>> + for (j = 0; j < pos; j++)
>> + intel_pmu_update_lbr_event(&cpuc->lbr_events[i], enabled[j], j);
>
> But this confuses me... per that function it:
>
> - extracts counter value for enabled[j] and,
> - or's it into the same variable at j
>
> But what if j is already taken by something else?
>
> That is, suppose enabled[] = {3,2,1,0}, and lbr_events = 11 10 01 00
>
> Then: for (j) intel_pmu_update_lbt_event(&lbr_event, enabled[j], j);
>
> 0: 3->0, 11 10 01 00 -> 11 10 01 11
> 1: 2->1, 11 10 01 11 -> 11 10 11 11
> 2: 1->2, 11 10 11 11 -> 11 11 11 11
>
>
>
>> +
>> + /* Clear the original counter order */
>> + cpuc->lbr_events[i] &= ~LBR_INFO_EVENTS;
>> + }
>> +}
>
> Would not something like:
>
> src = lbr_events[i];
> dst = 0;
> for (j = 0; j < pos; j++) {
> cnt = (src >> enabled[j]*2) & 3;
> dst |= cnt << j*2
> }
> lbr_events[i] = dst;
>
> be *FAR* clearer, and actually work?
The original LBR event data is saved at offset 32 of the LBR_INFO
register. In get_lbr_events(), the data is simply copied to offset 32 of
cpuc->lbr_events.
The intel_pmu_update_lbr_event() reorders the value and saves it
starting from offset 0.
I agree it's hard to read since it combines the src and dst into the
same variable.
I will use the suggested code and also update the get_lbr_events().
	cpuc->lbr_events[i] = (info & LBR_INFO_EVENTS) >> LBR_INFO_EVENTS_OFFSET;
Thanks,
Kan
On 2023-10-19 7:00 a.m., Peter Zijlstra wrote:
> On Wed, Oct 04, 2023 at 11:40:41AM -0700, kan.liang@linux.intel.com wrote:
>
>> +static ssize_t branch_counter_width_show(struct device *cdev,
>> + struct device_attribute *attr,
>> + char *buf)
>> +{
>> + return snprintf(buf, PAGE_SIZE, "2\n");
>> +}
>
>> +#define ARCH_LBR_EVENT_LOG_WIDTH 2
>
> I'm assuming this is the same '2' ? And having it hard-coded in two
> locations is awesome..
>
>> +#define ARCH_LBR_EVENT_LOG_MASK 0x3
>
> Should probably be ((1<<2)-1)
>
> As per that other email, the naming is confusing, should this not be:
>
> ARCH_LBR_EVENT_COUNTER_BITS
>
> or, since it's all local to lbr.c something shorter still, like:
>
> LBR_COUNTER_BITS
>
> hmm?
Sure, I will use the name LBR_COUNTER_BITS.
Thanks,
Kan
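For reference, one way the renamed defines could look (a sketch; the
mask name here is illustrative, following the ((1<<2)-1) suggestion
above):

	#define LBR_COUNTER_BITS	2
	#define LBR_COUNTER_MASK	((1 << LBR_COUNTER_BITS) - 1)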
On 2023-10-19 7:09 a.m., Peter Zijlstra wrote:
> On Wed, Oct 04, 2023 at 11:40:41AM -0700, kan.liang@linux.intel.com wrote:
>
>> @@ -3905,6 +3915,44 @@ static int intel_pmu_hw_config(struct perf_event *event)
>> if (needs_branch_stack(event) && is_sampling_event(event))
>> event->hw.flags |= PERF_X86_EVENT_NEEDS_BRANCH_STACK;
>>
>> + if (branch_sample_counters(event)) {
>> + struct perf_event *leader, *sibling;
>> +
>> + if (!(x86_pmu.flags & PMU_FL_LBR_EVENT) ||
>> + (event->attr.config & ~INTEL_ARCH_EVENT_MASK))
>> + return -EINVAL;
>> +
>> + /*
>> + * The event logging is not supported in the call stack mode
>> + * yet, since we cannot simply flush the LBR during e.g.,
>> + * multiplexing. Also, there is no obvious usage with the call
>> + * stack mode. Simply forbids it for now.
>> + *
>> + * If any events in the group enable the LBR event logging
>> + * feature, the group is treated as a LBR event logging group,
>> + * which requires the extra space to store the counters.
>> + */
>> + leader = event->group_leader;
>> + if (branch_sample_call_stack(leader))
>> + return -EINVAL;
>> + leader->hw.flags |= PERF_X86_EVENT_BRANCH_COUNTERS;
>
> (superfluous whitespace before operator)
>
>> +
>> + for_each_sibling_event(sibling, leader) {
>> + if (branch_sample_call_stack(sibling))
>> + return -EINVAL;
>> + }
>> +
>> + /*
>> + * Only applying the PERF_SAMPLE_BRANCH_COUNTERS doesn't
>> + * require any branch stack setup.
>> + * Clear the bit to avoid unnecessary branch stack setup.
>> + */
>> + if (0 == (event->attr.branch_sample_type &
>> + ~(PERF_SAMPLE_BRANCH_PLM_ALL |
>> + PERF_SAMPLE_BRANCH_COUNTERS)))
>> + event->hw.flags &= ~PERF_X86_EVENT_NEEDS_BRANCH_STACK;
>> + }
>
> Does this / should this check the number of group members vs supported
> number of lbr counters?
Sure, I will add a check for the numbers here, so perf can error out
earlier.
Thanks,
Kan
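For illustration, the check could be shaped roughly like this (a sketch
inside the existing branch_sample_counters(event) block; the final form
may differ), counting the group members that request the branch counters
and comparing against the number of logging-capable counters already
used by branch_counter_nr_show():

	int num = 0;

	if (branch_sample_counters(leader))
		num++;
	for_each_sibling_event(sibling, leader) {
		if (branch_sample_counters(sibling))
			num++;
	}
	if (num > fls(x86_pmu.lbr_events))
		return -EINVAL;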
On Thu, Oct 19, 2023 at 10:26:01AM -0400, Liang, Kan wrote:
> The original LBR event data is saved at offset 32 of LBR_INFO register.
> In get_lbr_events(), the data was simply copied to the offset 32 of
> cpuc->lbr_events.
Urgh, missed that. Clearly reading is a skill :-)
>
> The intel_pmu_update_lbr_event() reorders the value and saves it
> starting from the offset 0.
>
> I agree it's hard to read since it combines the src and dst into the
> same variable.
>
> I will use the suggested code and also update the get_lbr_events().
Thanks!
On 2023-10-19 7:12 a.m., Peter Zijlstra wrote:
> On Wed, Oct 04, 2023 at 11:40:41AM -0700, kan.liang@linux.intel.com wrote:
>> +static __always_inline void get_lbr_events(struct cpu_hw_events *cpuc,
>> + int i, u64 info)
>> +{
>> + /*
>> + * The later code will decide what content can be disclosed
>> + * to the perf tool. It's no harmful to unconditionally update
>> + * the cpuc->lbr_events.
>> + * Pleae see intel_pmu_lbr_event_reorder()
>> + */
>> + cpuc->lbr_events[i] = info & LBR_INFO_EVENTS;
>> +}
>
> You could be forcing an extra cachemiss here.
The intent here is to temporarily store the branch counter information.
Maybe we can leverage the reserved field of cpuc->lbr_entries[i] to
avoid the cache miss:
	e->reserved = info & LBR_INFO_COUNTERS;
I tried to add something like a static_assert to check the size of the
reserved field in case the field shrinks later. But the reserved field
is a bit field, and I have no idea how to get its exact size without
defining a macro. Is something like the below OK? Any suggestions are
appreciated.
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 1e80a551a4c2..62675593e39a 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -1582,6 +1582,8 @@ static bool is_arch_lbr_xsave_available(void)
return true;
}
+static_assert((64 - PERF_BRANCH_ENTRY_INFO_BITS_MAX) > LBR_INFO_COUNTERS_MAX_NUM * 2);
+
void __init intel_pmu_arch_lbr_init(void)
{
struct pmu *pmu = x86_get_pmu(smp_processor_id());
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index f220c3598d03..e9ff8eba5efd 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -238,6 +238,7 @@
#define LBR_INFO_BR_TYPE (0xfull << LBR_INFO_BR_TYPE_OFFSET)
#define LBR_INFO_EVENTS_OFFSET 32
#define LBR_INFO_EVENTS (0xffull << LBR_INFO_EVENTS_OFFSET)
+#define LBR_INFO_COUNTERS_MAX_NUM 4
#define MSR_ARCH_LBR_CTL 0x000014ce
#define ARCH_LBR_CTL_LBREN BIT(0)
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 4461f380425b..3a64499b0f5d 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -1437,6 +1437,9 @@ struct perf_branch_entry {
reserved:31;
};
+/* Size of used info bits in struct perf_branch_entry */
+#define PERF_BRANCH_ENTRY_INFO_BITS_MAX 33
+
union perf_sample_weight {
__u64 full;
#if defined(__LITTLE_ENDIAN_BITFIELD)
> A long time ago I had
> hacks to profile perf with perf, but perhaps PT can be abused for that
> now?
To my understanding, PT can only give trace information, and may not
tell whether there is a cache miss or something similar.
I will take a deeper look and see if PT can help with this case.
Thanks,
Kan
@@ -2792,6 +2792,7 @@ static void intel_pmu_enable_fixed(struct perf_event *event)
static void intel_pmu_enable_event(struct perf_event *event)
{
+ u64 enable_mask = ARCH_PERFMON_EVENTSEL_ENABLE;
struct hw_perf_event *hwc = &event->hw;
int idx = hwc->idx;
@@ -2800,8 +2801,10 @@ static void intel_pmu_enable_event(struct perf_event *event)
switch (idx) {
case 0 ... INTEL_PMC_IDX_FIXED - 1:
+ if (branch_sample_counters(event))
+ enable_mask |= ARCH_PERFMON_EVENTSEL_LBR_LOG;
intel_set_masks(event, idx);
- __x86_pmu_enable_event(hwc, ARCH_PERFMON_EVENTSEL_ENABLE);
+ __x86_pmu_enable_event(hwc, enable_mask);
break;
case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS - 1:
case INTEL_PMC_IDX_METRIC_BASE ... INTEL_PMC_IDX_METRIC_END:
@@ -3052,7 +3055,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
perf_sample_data_init(&data, 0, event->hw.last_period);
if (has_branch_stack(event))
- perf_sample_save_brstack(&data, event, &cpuc->lbr_stack, NULL);
+ intel_pmu_lbr_save_brstack(&data, cpuc, event);
if (perf_event_overflow(event, &data, regs))
x86_pmu_stop(event, 0);
@@ -3617,6 +3620,13 @@ intel_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
if (cpuc->excl_cntrs)
return intel_get_excl_constraints(cpuc, event, idx, c2);
+ /* The LBR event logging may not be available for all counters. */
+ if (branch_sample_counters(event)) {
+ c2 = dyn_constraint(cpuc, c2, idx);
+ c2->idxmsk64 &= x86_pmu.lbr_events;
+ c2->weight = hweight64(c2->idxmsk64);
+ }
+
return c2;
}
@@ -3905,6 +3915,44 @@ static int intel_pmu_hw_config(struct perf_event *event)
if (needs_branch_stack(event) && is_sampling_event(event))
event->hw.flags |= PERF_X86_EVENT_NEEDS_BRANCH_STACK;
+ if (branch_sample_counters(event)) {
+ struct perf_event *leader, *sibling;
+
+ if (!(x86_pmu.flags & PMU_FL_LBR_EVENT) ||
+ (event->attr.config & ~INTEL_ARCH_EVENT_MASK))
+ return -EINVAL;
+
+ /*
+ * The event logging is not supported in the call stack mode
+ * yet, since we cannot simply flush the LBR during e.g.,
+ * multiplexing. Also, there is no obvious usage with the call
+ * stack mode. Simply forbids it for now.
+ *
+ * If any events in the group enable the LBR event logging
+ * feature, the group is treated as a LBR event logging group,
+ * which requires the extra space to store the counters.
+ */
+ leader = event->group_leader;
+ if (branch_sample_call_stack(leader))
+ return -EINVAL;
+ leader->hw.flags |= PERF_X86_EVENT_BRANCH_COUNTERS;
+
+ for_each_sibling_event(sibling, leader) {
+ if (branch_sample_call_stack(sibling))
+ return -EINVAL;
+ }
+
+ /*
+ * Only applying the PERF_SAMPLE_BRANCH_COUNTERS doesn't
+ * require any branch stack setup.
+ * Clear the bit to avoid unnecessary branch stack setup.
+ */
+ if (0 == (event->attr.branch_sample_type &
+ ~(PERF_SAMPLE_BRANCH_PLM_ALL |
+ PERF_SAMPLE_BRANCH_COUNTERS)))
+ event->hw.flags &= ~PERF_X86_EVENT_NEEDS_BRANCH_STACK;
+ }
+
if (intel_pmu_needs_branch_stack(event)) {
ret = intel_pmu_setup_lbr_filter(event);
if (ret)
@@ -4383,8 +4431,13 @@ cmt_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
*/
if (event->attr.precise_ip == 3) {
/* Force instruction:ppp on PMC0, 1 and Fixed counter 0 */
- if (constraint_match(&fixed0_constraint, event->hw.config))
- return &fixed0_counter0_1_constraint;
+ if (constraint_match(&fixed0_constraint, event->hw.config)) {
+ /* The fixed counter 0 doesn't support LBR event logging. */
+ if (branch_sample_counters(event))
+ return &counter0_1_constraint;
+ else
+ return &fixed0_counter0_1_constraint;
+ }
switch (c->idxmsk64 & 0x3ull) {
case 0x1:
@@ -4563,7 +4616,7 @@ int intel_cpuc_prepare(struct cpu_hw_events *cpuc, int cpu)
goto err;
}
- if (x86_pmu.flags & (PMU_FL_EXCL_CNTRS | PMU_FL_TFA)) {
+ if (x86_pmu.flags & (PMU_FL_EXCL_CNTRS | PMU_FL_TFA | PMU_FL_LBR_EVENT)) {
size_t sz = X86_PMC_IDX_MAX * sizeof(struct event_constraint);
cpuc->constraint_list = kzalloc_node(sz, GFP_KERNEL, cpu_to_node(cpu));
@@ -5535,8 +5588,30 @@ static ssize_t branches_show(struct device *cdev,
static DEVICE_ATTR_RO(branches);
+static ssize_t branch_counter_nr_show(struct device *cdev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ return snprintf(buf, PAGE_SIZE, "%d\n", fls(x86_pmu.lbr_events));
+}
+
+static DEVICE_ATTR_RO(branch_counter_nr);
+
+static ssize_t branch_counter_width_show(struct device *cdev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ return snprintf(buf, PAGE_SIZE, "2\n");
+}
+
+static DEVICE_ATTR_RO(branch_counter_width);
+
+
+
static struct attribute *lbr_attrs[] = {
&dev_attr_branches.attr,
+ &dev_attr_branch_counter_nr.attr,
+ &dev_attr_branch_counter_width.attr,
NULL
};
@@ -5590,7 +5665,11 @@ mem_is_visible(struct kobject *kobj, struct attribute *attr, int i)
static umode_t
lbr_is_visible(struct kobject *kobj, struct attribute *attr, int i)
{
- return x86_pmu.lbr_nr ? attr->mode : 0;
+ /* branches */
+ if (i == 0)
+ return x86_pmu.lbr_nr ? attr->mode : 0;
+
+ return (x86_pmu.flags & PMU_FL_LBR_EVENT) ? attr->mode : 0;
}
static umode_t
@@ -1912,7 +1912,7 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
if (has_branch_stack(event)) {
intel_pmu_store_pebs_lbrs(lbr);
- perf_sample_save_brstack(data, event, &cpuc->lbr_stack, NULL);
+ intel_pmu_lbr_save_brstack(data, cpuc, event);
}
}
@@ -676,6 +676,21 @@ void intel_pmu_lbr_del(struct perf_event *event)
WARN_ON_ONCE(cpuc->lbr_users < 0);
WARN_ON_ONCE(cpuc->lbr_pebs_users < 0);
perf_sched_cb_dec(event->pmu);
+
+ /*
+ * The logged occurrences information is only valid for the
+ * current LBR group. If another LBR group is scheduled in
+ * later, the information from the stale LBRs will be wrongly
+ * interpreted. Reset the LBRs here.
+ * For the context switch, the LBR will be unconditionally
+ * flushed when a new task is scheduled in. If both the new task
+ * and the old task are monitored by a LBR event group. The
+ * reset here is redundant. But the extra reset doesn't impact
+ * the functionality. It's hard to distinguish the above case.
+ * Keep the unconditionally reset for a LBR event group for now.
+ */
+ if (is_branch_counters_group(event))
+ intel_pmu_lbr_reset();
}
static inline bool vlbr_exclude_host(void)
@@ -866,6 +881,18 @@ static __always_inline u16 get_lbr_cycles(u64 info)
return cycles;
}
+static __always_inline void get_lbr_events(struct cpu_hw_events *cpuc,
+ int i, u64 info)
+{
+ /*
+ * The later code will decide what content can be disclosed
+ * to the perf tool. It's not harmful to unconditionally update
+ * the cpuc->lbr_events.
+ * Please see intel_pmu_lbr_event_reorder()
+ */
+ cpuc->lbr_events[i] = info & LBR_INFO_EVENTS;
+}
+
static void intel_pmu_store_lbr(struct cpu_hw_events *cpuc,
struct lbr_entry *entries)
{
@@ -898,11 +925,70 @@ static void intel_pmu_store_lbr(struct cpu_hw_events *cpuc,
e->abort = !!(info & LBR_INFO_ABORT);
e->cycles = get_lbr_cycles(info);
e->type = get_lbr_br_type(info);
+
+ get_lbr_events(cpuc, i, info);
}
cpuc->lbr_stack.nr = i;
}
+#define ARCH_LBR_EVENT_LOG_WIDTH 2
+#define ARCH_LBR_EVENT_LOG_MASK 0x3
+
+static __always_inline void intel_pmu_update_lbr_event(u64 *lbr_events, int idx, int pos)
+{
+ u64 logs = *lbr_events >> (LBR_INFO_EVENTS_OFFSET +
+ idx * ARCH_LBR_EVENT_LOG_WIDTH);
+
+ logs &= ARCH_LBR_EVENT_LOG_MASK;
+ *lbr_events |= logs << (pos * ARCH_LBR_EVENT_LOG_WIDTH);
+}
+
+/*
+ * The enabled order may be different from the counter order.
+ * Update the lbr_events with the enabled order.
+ */
+static void intel_pmu_lbr_event_reorder(struct cpu_hw_events *cpuc,
+ struct perf_event *event)
+{
+ int i, j, pos = 0, enabled[X86_PMC_IDX_MAX];
+ struct perf_event *leader, *sibling;
+
+ leader = event->group_leader;
+ if (branch_sample_counters(leader))
+ enabled[pos++] = leader->hw.idx;
+
+ for_each_sibling_event(sibling, leader) {
+ if (!branch_sample_counters(sibling))
+ continue;
+ enabled[pos++] = sibling->hw.idx;
+ }
+
+ if (!pos)
+ return;
+
+ for (i = 0; i < cpuc->lbr_stack.nr; i++) {
+ for (j = 0; j < pos; j++)
+ intel_pmu_update_lbr_event(&cpuc->lbr_events[i], enabled[j], j);
+
+ /* Clear the original counter order */
+ cpuc->lbr_events[i] &= ~LBR_INFO_EVENTS;
+ }
+}
+
+void intel_pmu_lbr_save_brstack(struct perf_sample_data *data,
+ struct cpu_hw_events *cpuc,
+ struct perf_event *event)
+{
+ if (is_branch_counters_group(event)) {
+ intel_pmu_lbr_event_reorder(cpuc, event);
+ perf_sample_save_brstack(data, event, &cpuc->lbr_stack, cpuc->lbr_events);
+ return;
+ }
+
+ perf_sample_save_brstack(data, event, &cpuc->lbr_stack, NULL);
+}
+
static void intel_pmu_arch_lbr_read(struct cpu_hw_events *cpuc)
{
intel_pmu_store_lbr(cpuc, NULL);
@@ -1173,8 +1259,10 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
for (i = 0; i < cpuc->lbr_stack.nr; ) {
if (!cpuc->lbr_entries[i].from) {
j = i;
- while (++j < cpuc->lbr_stack.nr)
+ while (++j < cpuc->lbr_stack.nr) {
cpuc->lbr_entries[j-1] = cpuc->lbr_entries[j];
+ cpuc->lbr_events[j-1] = cpuc->lbr_events[j];
+ }
cpuc->lbr_stack.nr--;
if (!cpuc->lbr_entries[i].from)
continue;
@@ -1525,8 +1613,12 @@ void __init intel_pmu_arch_lbr_init(void)
x86_pmu.lbr_mispred = ecx.split.lbr_mispred;
x86_pmu.lbr_timed_lbr = ecx.split.lbr_timed_lbr;
x86_pmu.lbr_br_type = ecx.split.lbr_br_type;
+ x86_pmu.lbr_events = ecx.split.lbr_events;
x86_pmu.lbr_nr = lbr_nr;
+ if (!!x86_pmu.lbr_events)
+ x86_pmu.flags |= PMU_FL_LBR_EVENT;
+
if (x86_pmu.lbr_mispred)
static_branch_enable(&x86_lbr_mispred);
if (x86_pmu.lbr_timed_lbr)
@@ -110,6 +110,11 @@ static inline bool is_topdown_event(struct perf_event *event)
return is_metric_event(event) || is_slots_event(event);
}
+static inline bool is_branch_counters_group(struct perf_event *event)
+{
+ return event->group_leader->hw.flags & PERF_X86_EVENT_BRANCH_COUNTERS;
+}
+
struct amd_nb {
int nb_id; /* NorthBridge id */
int refcnt; /* reference count */
@@ -283,6 +288,7 @@ struct cpu_hw_events {
int lbr_pebs_users;
struct perf_branch_stack lbr_stack;
struct perf_branch_entry lbr_entries[MAX_LBR_ENTRIES];
+ u64 lbr_events[MAX_LBR_ENTRIES]; /* branch stack extra */
union {
struct er_account *lbr_sel;
struct er_account *lbr_ctl;
@@ -888,6 +894,7 @@ struct x86_pmu {
unsigned int lbr_mispred:1;
unsigned int lbr_timed_lbr:1;
unsigned int lbr_br_type:1;
+ unsigned int lbr_events:4;
void (*lbr_reset)(void);
void (*lbr_read)(struct cpu_hw_events *cpuc);
@@ -1012,6 +1019,7 @@ do { \
#define PMU_FL_INSTR_LATENCY 0x80 /* Support Instruction Latency in PEBS Memory Info Record */
#define PMU_FL_MEM_LOADS_AUX 0x100 /* Require an auxiliary event for the complete memory info */
#define PMU_FL_RETIRE_LATENCY 0x200 /* Support Retire Latency in PEBS */
+#define PMU_FL_LBR_EVENT 0x400 /* Support LBR event logging */
#define EVENT_VAR(_id) event_attr_##_id
#define EVENT_PTR(_id) &event_attr_##_id.attr.attr
@@ -1552,6 +1560,10 @@ void intel_pmu_store_pebs_lbrs(struct lbr_entry *lbr);
void intel_ds_init(void);
+void intel_pmu_lbr_save_brstack(struct perf_sample_data *data,
+ struct cpu_hw_events *cpuc,
+ struct perf_event *event);
+
void intel_pmu_lbr_swap_task_ctx(struct perf_event_pmu_context *prev_epc,
struct perf_event_pmu_context *next_epc);
@@ -21,3 +21,4 @@ PERF_ARCH(PEBS_STLAT, 0x08000) /* st+stlat data address sampling */
PERF_ARCH(AMD_BRS, 0x10000) /* AMD Branch Sampling */
PERF_ARCH(PEBS_LAT_HYBRID, 0x20000) /* ld and st lat for hybrid */
PERF_ARCH(NEEDS_BRANCH_STACK, 0x40000) /* require branch stack setup */
+PERF_ARCH(BRANCH_COUNTERS, 0x80000) /* logs the counters in the extra space of each branch */
@@ -236,6 +236,8 @@
#define LBR_INFO_CYCLES 0xffff
#define LBR_INFO_BR_TYPE_OFFSET 56
#define LBR_INFO_BR_TYPE (0xfull << LBR_INFO_BR_TYPE_OFFSET)
+#define LBR_INFO_EVENTS_OFFSET 32
+#define LBR_INFO_EVENTS (0xffull << LBR_INFO_EVENTS_OFFSET)
#define MSR_ARCH_LBR_CTL 0x000014ce
#define ARCH_LBR_CTL_LBREN BIT(0)
@@ -31,6 +31,7 @@
#define ARCH_PERFMON_EVENTSEL_ENABLE (1ULL << 22)
#define ARCH_PERFMON_EVENTSEL_INV (1ULL << 23)
#define ARCH_PERFMON_EVENTSEL_CMASK 0xFF000000ULL
+#define ARCH_PERFMON_EVENTSEL_LBR_LOG (1ULL << 35)
#define INTEL_FIXED_BITS_MASK 0xFULL
#define INTEL_FIXED_BITS_STRIDE 4
@@ -216,6 +217,9 @@ union cpuid28_ecx {
unsigned int lbr_timed_lbr:1;
/* Branch Type Field Supported */
unsigned int lbr_br_type:1;
+ unsigned int reserved:13;
+ /* Event Logging Supported */
+ unsigned int lbr_events:4;
} split;
unsigned int full;
};