Message ID | 20230410204352.1098067-2-kan.liang@linux.intel.com |
---|---|
State | New |
Headers |
From: kan.liang@linux.intel.com
To: peterz@infradead.org, mingo@redhat.com, acme@kernel.org, linux-kernel@vger.kernel.org
Cc: mark.rutland@arm.com, alexander.shishkin@linux.intel.com, jolsa@kernel.org, namhyung@kernel.org, irogers@google.com, adrian.hunter@intel.com, ak@linux.intel.com, eranian@google.com, Kan Liang <kan.liang@linux.intel.com>
Subject: [PATCH 2/6] perf: Support branch events logging
Date: Mon, 10 Apr 2023 13:43:48 -0700
Message-Id: <20230410204352.1098067-2-kan.liang@linux.intel.com>
In-Reply-To: <20230410204352.1098067-1-kan.liang@linux.intel.com>
References: <20230410204352.1098067-1-kan.liang@linux.intel.com> |
Series | [1/6] perf/x86/intel: Add Grand Ridge and Sierra Forest |
Commit Message
Liang, Kan
April 10, 2023, 8:43 p.m. UTC
From: Kan Liang <kan.liang@linux.intel.com>

With the cycle time information between branches, stalls can be easily
observed. But it's difficult to explain what causes the long delay.

Add a new field to collect the occurrences of events since the last
branch entry, which can be used to provide some causality information
for the cycle time values currently recorded in branches.

Add a new branch sample type to indicate whether to include the
occurrences of events in the branch info.

Only up to 4 events are supported, with each count saturating at
value 3.

In the current kernel, the events are ordered by either the counter
index or the enabling sequence, but none of the order information is
available to the user space tool. Add a new PERF_SAMPLE format,
PERF_SAMPLE_BRANCH_EVENT_IDS, and generic support to dump the event
IDs of the branch events.

Add a helper function to detect the branch event flag. These will be
used in the following patch.

Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 include/linux/perf_event.h      | 26 ++++++++++++++++++++++++++
 include/uapi/linux/perf_event.h | 22 ++++++++++++++++++++--
 kernel/events/core.c            | 23 +++++++++++++++++++++++
 3 files changed, 69 insertions(+), 2 deletions(-)
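As a minimal user-space sketch of how the two new pieces of ABI fit together: the per-entry events bit-field holds 2 saturating bits per event (up to 4 events), and PERF_SAMPLE_BRANCH_EVENT_IDS dumps the event IDs that give those slots meaning. The struct and helper names below are hypothetical, and the assumption that ids[i] describes slot i follows the discussion later in this thread.

#include <stdint.h>
#include <stdio.h>

/* Packing taken from the patch: 2 saturating bits per event, 4 events max. */
#define BR_EVENTS_STEP	2
#define BR_EVENTS_MASK	0x3
#define BR_MAX_EVENTS	4

/* Hypothetical, already-parsed view of one sampled branch entry. */
struct branch_entry_view {
	uint64_t from;
	uint64_t to;
	uint8_t  events;	/* the new events:8 bit-field */
};

/*
 * Report the per-event occurrence counts of one branch entry, pairing
 * slot i with the i-th event ID dumped via PERF_SAMPLE_BRANCH_EVENT_IDS
 * (assumed mapping; see the counter-order discussion below).
 */
static void print_branch_events(const struct branch_entry_view *br,
				uint64_t nr_ids, const uint64_t *ids)
{
	for (uint64_t i = 0; i < nr_ids && i < BR_MAX_EVENTS; i++) {
		unsigned int cnt = (br->events >> (i * BR_EVENTS_STEP)) &
				   BR_EVENTS_MASK;

		/* a count of 3 means "3 or more": the hardware saturates */
		printf("  %#llx -> %#llx  id %#llx: %u%s\n",
		       (unsigned long long)br->from,
		       (unsigned long long)br->to,
		       (unsigned long long)ids[i], cnt,
		       cnt == BR_EVENTS_MASK ? "+" : "");
	}
}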
Comments
On Mon, Apr 10, 2023 at 01:43:48PM -0700, kan.liang@linux.intel.com wrote: > From: Kan Liang <kan.liang@linux.intel.com> > > With the cycle time information between branches, stalls can be easily > observed. But it's difficult to explain what causes the long delay. > > Add a new field to collect the occurrences of events since the last > branch entry, which can be used to provide some causality information > for the cycle time values currently recorded in branches. > > Add a new branch sample type to indicate whether include occurrences of > events in branch info. > > Only support up to 4 events with saturating at value 3. > In the current kernel, the events are ordered by either the counter > index or the enabling sequence. But none of the order information is > available to the user space tool. > Add a new PERF_SAMPLE format, PERF_SAMPLE_BRANCH_EVENT_IDS, and generic > support to dump the event IDs of the branch events. > Add a helper function to detect the branch event flag. > These will be used in the following patch. I'm having trouble reverse engineering this. Can you more coherently explain this feature and how you've implemented it?
On 2023-04-14 6:38 a.m., Peter Zijlstra wrote: > On Mon, Apr 10, 2023 at 01:43:48PM -0700, kan.liang@linux.intel.com wrote: >> From: Kan Liang <kan.liang@linux.intel.com> >> >> With the cycle time information between branches, stalls can be easily >> observed. But it's difficult to explain what causes the long delay. >> >> Add a new field to collect the occurrences of events since the last >> branch entry, which can be used to provide some causality information >> for the cycle time values currently recorded in branches. >> >> Add a new branch sample type to indicate whether include occurrences of >> events in branch info. >> >> Only support up to 4 events with saturating at value 3. >> In the current kernel, the events are ordered by either the counter >> index or the enabling sequence. But none of the order information is >> available to the user space tool. >> Add a new PERF_SAMPLE format, PERF_SAMPLE_BRANCH_EVENT_IDS, and generic >> support to dump the event IDs of the branch events. >> Add a helper function to detect the branch event flag. >> These will be used in the following patch. > > I'm having trouble reverse engineering this. Can you more coherently > explain this feature and how you've implemented it? Sorry for that. The feature is an enhancement of ARCH LBR. It adds new fields in the LBR_INFO MSRs to log the occurrences of events on the first 4 GP counters. Worked with the previous timed LBR feature together, the user can understand not only the latency between two LBR blocks, but also which events causes the stall. The spec can be found at the latest Intel® Architecture Instruction Set Extensions and Future Features, v048. Chapter 8.4. https://cdrdv2.intel.com/v1/dl/getContent/671368 To support the feature, there are three main changes in ABIs. - A new branch sample type, PERF_SAMPLE_BRANCH_EVENT, is used as a knob to enable the feature. - Extend the struct perf_branch_entry layout, because we have to save and pass the occurrences of events to user space. Since it's only available for 4 counters and saturating at value 3, it only occupies 8 bits. For the current Intel implementation, the order is the order of counters. - Add a new PERF_SAMPLE format, PERF_SAMPLE_BRANCH_EVENT_IDS, to dump the order information. User space tool doesn't understand the order of counters. So it cannot map the new fields in struct perf_branch_entry to a specific event. We have to dump the order information. I once considered using enabling order to avoid this new sample format. It works for some cases, e.g., group. But it doesn't work for some complex cases, e.g., multiplexing, in which the enabling order keeps changing. Ideally, we should dump the order information for each LBR entry. But that will include too much duplicate information. So the order information is only dumped for each sample. The drawback is that we have to flush/update old LBR entries once the events are rescheduled between samples, e.g., multiplexing. Because it's possible that the new sample can still see the stall LBR entries. That's specially handled in the next Intel specific patch. For the current implementation, perf tool has to apply both PERF_SAMPLE_BRANCH_EVENT and PERF_SAMPLE_BRANCH_EVENT_IDS to enable the feature. Thanks, Kan
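Concretely, the three ABI changes described above imply an attr setup along these lines. This is a sketch only: the two new constants are defined locally with the values taken from the patch at the bottom of this page, since released uapi headers do not carry them yet.

#include <string.h>
#include <linux/perf_event.h>

/* New ABI bits from this series (values per the patch below). */
#define PERF_SAMPLE_BRANCH_EVENT	(1U << 19)	/* branch_sample_type */
#define PERF_SAMPLE_BRANCH_EVENT_IDS	(1ULL << 25)	/* sample_type */

static void setup_branch_event_logging(struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_HARDWARE;
	attr->config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS;
	attr->sample_period = 100000;

	/* sample the LBR stack plus the dumped event IDs */
	attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK |
			    PERF_SAMPLE_BRANCH_EVENT_IDS;

	/* knob that asks for per-entry event occurrence logging */
	attr->branch_sample_type = PERF_SAMPLE_BRANCH_ANY |
				   PERF_SAMPLE_BRANCH_EVENT;
}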
On Fri, Apr 14, 2023 at 09:35:37AM -0400, Liang, Kan wrote: > > > On 2023-04-14 6:38 a.m., Peter Zijlstra wrote: > > On Mon, Apr 10, 2023 at 01:43:48PM -0700, kan.liang@linux.intel.com wrote: > >> From: Kan Liang <kan.liang@linux.intel.com> > >> > >> With the cycle time information between branches, stalls can be easily > >> observed. But it's difficult to explain what causes the long delay. > >> > >> Add a new field to collect the occurrences of events since the last > >> branch entry, which can be used to provide some causality information > >> for the cycle time values currently recorded in branches. > >> > >> Add a new branch sample type to indicate whether include occurrences of > >> events in branch info. > >> > >> Only support up to 4 events with saturating at value 3. > >> In the current kernel, the events are ordered by either the counter > >> index or the enabling sequence. But none of the order information is > >> available to the user space tool. > >> Add a new PERF_SAMPLE format, PERF_SAMPLE_BRANCH_EVENT_IDS, and generic > >> support to dump the event IDs of the branch events. > >> Add a helper function to detect the branch event flag. > >> These will be used in the following patch. > > > > I'm having trouble reverse engineering this. Can you more coherently > > explain this feature and how you've implemented it? > > Sorry for that. > > The feature is an enhancement of ARCH LBR. It adds new fields in the > LBR_INFO MSRs to log the occurrences of events on the first 4 GP > counters. Worked with the previous timed LBR feature together, the user > can understand not only the latency between two LBR blocks, but also > which events causes the stall. > > The spec can be found at the latest Intel® Architecture Instruction Set > Extensions and Future Features, v048. Chapter 8.4. > https://cdrdv2.intel.com/v1/dl/getContent/671368 Oh gawd; that's terse. Why can't these people write comprehensible things :/ It's almost as if they don't want this stuff to be used. So IA32_LBR_x_INFO is extended: [0:15] CYC_CNT [16:31] undefined + [32:33] PMC0_CNT + [34:35] PMC1_CNT + [36:37] PMC2_CNT + [38:39] PMC3_CNT + [40:41] PMC4_CNT + [42:43] PMC5_CNT + [44:45] PMC6_CNT + [46:47] PMC7_CNT [48:55] undefined [56:59] BR_TYPE [60] CYC_CNT_VALID [61] TSX_ABORT Where the PMCx_CNT fields are saturating counters for the respective PMCs. And we'll run out of bits if we get more than 12 PMCs. Is SMT=n PMC merging still a thing? And for some reason this counting is enabled in PERFEVTSELx[35] instead of in LBR_CTL somewhere :/ > To support the feature, there are three main changes in ABIs. > - A new branch sample type, PERF_SAMPLE_BRANCH_EVENT, is used as a knob > to enable the feature. > - Extend the struct perf_branch_entry layout, because we have to save > and pass the occurrences of events to user space. Since it's only > available for 4 counters and saturating at value 3, it only occupies 8 > bits. For the current Intel implementation, the order is the order of > counters. Only for 4? Where does it say that? If it were to only support 4, then we're in counter scheduling contraint hell again and we need to somehow group all these things together with the LBR event. @@ -1410,6 +1423,10 @@ union perf_mem_data_src { * cycles: cycles from last branch (or 0 if not supported) * type: branch type * spec: branch speculation info (or 0 if not supported) + * events: occurrences of events since the last branch entry. + * The fields can store up to 4 events with saturating + * at value 3. 
+ * (or 0 if not supported) */ struct perf_branch_entry { __u64 from; @@ -1423,7 +1440,8 @@ struct perf_branch_entry { spec:2, /* branch speculation info */ new_type:4, /* additional branch type */ priv:3, /* privilege level */ - reserved:31; + events:8, /* occurrences of events since the last branch entry */ + reserved:23; }; union perf_sample_weight { This seems properly terrible from an interface pov. What if the next generation of silicon extends this to all 8 PMCs or another architecture comes along that does this with 3 bits per counter etc... > - Add a new PERF_SAMPLE format, PERF_SAMPLE_BRANCH_EVENT_IDS, to dump > the order information. User space tool doesn't understand the order of > counters. So it cannot map the new fields in struct perf_branch_entry to > a specific event. We have to dump the order information. Sorry; I can't parse this.
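A small decode sketch of the extended IA32_LBR_x_INFO layout as listed above. The field positions are taken from Peter's reading of the ISE v048 spec, not independently verified, and this is illustration only, not kernel code.

#include <stdint.h>

struct lbr_info_fields {
	uint16_t cyc_cnt;	/* [0:15] */
	uint8_t  pmc_cnt[8];	/* [32:47], 2 saturating bits per PMC */
	uint8_t  br_type;	/* [56:59] */
	uint8_t  cyc_cnt_valid;	/* [60] */
	uint8_t  tsx_abort;	/* [61] */
};

static struct lbr_info_fields decode_lbr_info(uint64_t info)
{
	struct lbr_info_fields f = {
		.cyc_cnt       = info & 0xffff,
		.br_type       = (info >> 56) & 0xf,
		.cyc_cnt_valid = (info >> 60) & 0x1,
		.tsx_abort     = (info >> 61) & 0x1,
	};

	for (int i = 0; i < 8; i++)
		f.pmc_cnt[i] = (info >> (32 + 2 * i)) & 0x3;

	return f;
}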
On 2023-04-14 10:53 a.m., Peter Zijlstra wrote: > On Fri, Apr 14, 2023 at 09:35:37AM -0400, Liang, Kan wrote: >> >> >> On 2023-04-14 6:38 a.m., Peter Zijlstra wrote: >>> On Mon, Apr 10, 2023 at 01:43:48PM -0700, kan.liang@linux.intel.com wrote: >>>> From: Kan Liang <kan.liang@linux.intel.com> >>>> >>>> With the cycle time information between branches, stalls can be easily >>>> observed. But it's difficult to explain what causes the long delay. >>>> >>>> Add a new field to collect the occurrences of events since the last >>>> branch entry, which can be used to provide some causality information >>>> for the cycle time values currently recorded in branches. >>>> >>>> Add a new branch sample type to indicate whether include occurrences of >>>> events in branch info. >>>> >>>> Only support up to 4 events with saturating at value 3. >>>> In the current kernel, the events are ordered by either the counter >>>> index or the enabling sequence. But none of the order information is >>>> available to the user space tool. >>>> Add a new PERF_SAMPLE format, PERF_SAMPLE_BRANCH_EVENT_IDS, and generic >>>> support to dump the event IDs of the branch events. >>>> Add a helper function to detect the branch event flag. >>>> These will be used in the following patch. >>> >>> I'm having trouble reverse engineering this. Can you more coherently >>> explain this feature and how you've implemented it? >> >> Sorry for that. >> >> The feature is an enhancement of ARCH LBR. It adds new fields in the >> LBR_INFO MSRs to log the occurrences of events on the first 4 GP >> counters. Worked with the previous timed LBR feature together, the user >> can understand not only the latency between two LBR blocks, but also >> which events causes the stall. >> >> The spec can be found at the latest Intel® Architecture Instruction Set >> Extensions and Future Features, v048. Chapter 8.4. >> https://cdrdv2.intel.com/v1/dl/getContent/671368 > > Oh gawd; that's terse. Why can't these people write comprehensible > things :/ It's almost as if they don't want this stuff to be used. > > So IA32_LBR_x_INFO is extended: > > [0:15] CYC_CNT > [16:31] undefined > + [32:33] PMC0_CNT > + [34:35] PMC1_CNT > + [36:37] PMC2_CNT > + [38:39] PMC3_CNT > + [40:41] PMC4_CNT > + [42:43] PMC5_CNT > + [44:45] PMC6_CNT > + [46:47] PMC7_CNT > [48:55] undefined > [56:59] BR_TYPE > [60] CYC_CNT_VALID > [61] TSX_ABORT > > Where the PMCx_CNT fields are saturating counters for the respective > PMCs. And we'll run out of bits if we get more than 12 PMCs. Is SMT=n > PMC merging still a thing? > > And for some reason this counting is enabled in PERFEVTSELx[35] instead > of in LBR_CTL somewhere :/ > >> To support the feature, there are three main changes in ABIs. >> - A new branch sample type, PERF_SAMPLE_BRANCH_EVENT, is used as a knob >> to enable the feature. > >> - Extend the struct perf_branch_entry layout, because we have to save >> and pass the occurrences of events to user space. Since it's only >> available for 4 counters and saturating at value 3, it only occupies 8 >> bits. For the current Intel implementation, the order is the order of >> counters. > > Only for 4? Where does it say that? "Per-counter support for LBR Event Logging is indicated by the “Event Logging Supported” bitmap in CPUID.(EAX=01CH, ECX=0).ECX[19:16]" There are only 4 bits to indicate the supported counter. > If it were to only support 4, then > we're in counter scheduling contraint hell again Unfortunately, yes. 
> and we need to somehow > group all these things together with the LBR event. Group will bring many limits for the usage. For example, I was told there could be someone wants to use it with multiplexing. > > @@ -1410,6 +1423,10 @@ union perf_mem_data_src { > * cycles: cycles from last branch (or 0 if not supported) > * type: branch type > * spec: branch speculation info (or 0 if not supported) > + * events: occurrences of events since the last branch entry. > + * The fields can store up to 4 events with saturating > + * at value 3. > + * (or 0 if not supported) > */ > struct perf_branch_entry { > __u64 from; > @@ -1423,7 +1440,8 @@ struct perf_branch_entry { > spec:2, /* branch speculation info */ > new_type:4, /* additional branch type */ > priv:3, /* privilege level */ > - reserved:31; > + events:8, /* occurrences of events since the last branch entry */ > + reserved:23; > }; > > union perf_sample_weight { > > This seems properly terrible from an interface pov. What if the next > generation of silicon extends this to all 8 PMCs or another architecture > comes along that does this with 3 bits per counter etc... OK. The reserved space is not enough anymore. I think we have to add several new fields. I will redesign it. > >> - Add a new PERF_SAMPLE format, PERF_SAMPLE_BRANCH_EVENT_IDS, to dump >> the order information. User space tool doesn't understand the order of >> counters. So it cannot map the new fields in struct perf_branch_entry to >> a specific event. We have to dump the order information. > > Sorry; I can't parse this. The perf tool has no idea which physical counter is assigned to an event. The HW has no idea about an event. It only log the information from the counter 0 into IA32_LBR_x_INFO[32:33]. If we pass the information from IA32_LBR_x_INFO[32:33] to the perf tool. The perf tool lacks of knowledge to connect the information to an event. So we have to dump the event ID at the same time. Thanks, Kan
On Fri, Apr 14, 2023 at 11:56:41AM -0400, Liang, Kan wrote: > > If it were to only support 4, then > > we're in counter scheduling contraint hell again > > Unfortunately, yes. > > > and we need to somehow > > group all these things together with the LBR event. > > Group will bring many limits for the usage. For example, I was told > there could be someone wants to use it with multiplexing. You can create two groups, each with an LBR event, no?
On 2023-04-14 12:09 p.m., Peter Zijlstra wrote: > On Fri, Apr 14, 2023 at 11:56:41AM -0400, Liang, Kan wrote: >>> If it were to only support 4, then >>> we're in counter scheduling contraint hell again >> >> Unfortunately, yes. >> >>> and we need to somehow >>> group all these things together with the LBR event. >> >> Group will bring many limits for the usage. For example, I was told >> there could be someone wants to use it with multiplexing. > > You can create two groups, each with an LBR event, no? If we put everything in a group, that will make the enabling much simpler. I don't think the perf tool needs the order information anymore. Because the kernel enables the events one by one in a group. The kernel just need to convert the information from the counter order to the enabling order and dump to user space. But if we have two groups with LBR event, the order information is still required. Why we still want to group things? Thanks, Kan
On Fri, Apr 14, 2023 at 01:53:24PM -0400, Liang, Kan wrote: > > > On 2023-04-14 12:09 p.m., Peter Zijlstra wrote: > > On Fri, Apr 14, 2023 at 11:56:41AM -0400, Liang, Kan wrote: > >>> If it were to only support 4, then > >>> we're in counter scheduling contraint hell again > >> > >> Unfortunately, yes. > >> > >>> and we need to somehow > >>> group all these things together with the LBR event. > >> > >> Group will bring many limits for the usage. For example, I was told > >> there could be someone wants to use it with multiplexing. > > > > You can create two groups, each with an LBR event, no? > > If we put everything in a group, that will make the enabling much > simpler. I don't think the perf tool needs the order information > anymore. Because the kernel enables the events one by one in a group. > The kernel just need to convert the information from the counter order > to the enabling order and dump to user space. I never understood the whole order thing. What was it trying to do? > But if we have two groups with LBR event, the order information is still > required. Why we still want to group things? Why would you need that; what is that whole order nonsense about? {e1, e2, e3, e4}, {e5, e6, e7, e8} with e1 and e5 both having LBR on just works no? Since they have LBR and that extra sample flag they all get a 0-3 constraint. Since both e1 and e5 use LBR, they're mutually exclusive, either e1 or e5 group runs.
On 2023-04-14 3:24 p.m., Peter Zijlstra wrote: > On Fri, Apr 14, 2023 at 01:53:24PM -0400, Liang, Kan wrote: >> >> >> On 2023-04-14 12:09 p.m., Peter Zijlstra wrote: >>> On Fri, Apr 14, 2023 at 11:56:41AM -0400, Liang, Kan wrote: >>>>> If it were to only support 4, then >>>>> we're in counter scheduling contraint hell again >>>> >>>> Unfortunately, yes. >>>> >>>>> and we need to somehow >>>>> group all these things together with the LBR event. >>>> >>>> Group will bring many limits for the usage. For example, I was told >>>> there could be someone wants to use it with multiplexing. >>> >>> You can create two groups, each with an LBR event, no? >> >> If we put everything in a group, that will make the enabling much >> simpler. I don't think the perf tool needs the order information >> anymore. Because the kernel enables the events one by one in a group. >> The kernel just need to convert the information from the counter order >> to the enabling order and dump to user space. > > I never understood the whole order thing. What was it trying to do? Let's say we have three events with the LBR event logging feature as below. perf record -e branches,branches,instructions:ppp -j any,event The counter 0 will be assigned to instructions:ppp, since the PDist is only supported on GP 0 & 1. The count 1 & 2 will be assigned to the other two branches. If branches occurs 1 time and instructions occurs 3 times in a LBR block, the LBR_INFO will have 0b010111 (counter order). But as you can see from the perf command, the first event is actually branches. Without the event IDs information, perf tool will interpret that branches 3 branches 1 and instructions:ppp 1. That's wrong. If there are multiple users, the situation becomes even worse. > >> But if we have two groups with LBR event, the order information is still >> required. Why we still want to group things? > > Why would you need that; what is that whole order nonsense about? > > {e1, e2, e3, e4}, {e5, e6, e7, e8} with e1 and e5 both having LBR on > just works no? > > Since they have LBR and that extra sample flag they all get a 0-3 > constraint. > > Since both e1 and e5 use LBR, they're mutually exclusive, either e1 or > e5 group runs. It's possible that someone pins an event using LBR, and set more than 4 events for logging, e0:D,{e1, e2},{e3, e4},{e5, e6}. If so, those events could do multiplexing. Without the event IDs information, perf tool has no idea how to interpret the information. Andi, do you have any other cases which require the multiplexing support for LBR event logging. Thanks, Kan
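Kan's 0b010111 example, worked through with the 2-bit-per-counter packing, to make the misattribution concrete:

#include <stdio.h>

int main(void)
{
	/* Kan's example: counters 0-2 logged 3, 1 and 1 occurrences. */
	unsigned int events = 0x17;	/* 0b010111 */

	for (int i = 0; i < 3; i++)
		printf("counter %d: %u\n", i, (events >> (2 * i)) & 0x3);
	/*
	 * Prints 3, 1, 1 in counter order. Counter 0 is instructions:ppp,
	 * but the first event on the command line is branches; without the
	 * event-ID dump the tool would attribute the 3 to branches.
	 */
	return 0;
}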
On Fri, Apr 14, 2023 at 04:34:45PM -0400, Liang, Kan wrote: > > I never understood the whole order thing. What was it trying to do? > > Let's say we have three events with the LBR event logging feature as below. > perf record -e branches,branches,instructions:ppp -j any,event > > The counter 0 will be assigned to instructions:ppp, since the PDist is > only supported on GP 0 & 1. > The count 1 & 2 will be assigned to the other two branches. > > If branches occurs 1 time and instructions occurs 3 times in a LBR > block, the LBR_INFO will have 0b010111 (counter order). > > But as you can see from the perf command, the first event is actually > branches. Without the event IDs information, perf tool will interpret > that branches 3 branches 1 and instructions:ppp 1. That's wrong. > > If there are multiple users, the situation becomes even worse. But this makes no sense what so ever; in this case you have no control over what events you'll actually get in your LBR. Could be none of the events you're interested in end up in 0-3 but instead land on 4-7. > >> But if we have two groups with LBR event, the order information is still > >> required. Why we still want to group things? > > > > Why would you need that; what is that whole order nonsense about? > > > > {e1, e2, e3, e4}, {e5, e6, e7, e8} with e1 and e5 both having LBR on > > just works no? > > > > Since they have LBR and that extra sample flag they all get a 0-3 > > constraint. > > > > Since both e1 and e5 use LBR, they're mutually exclusive, either e1 or > > e5 group runs. > > It's possible that someone pins an event using LBR, and set more than 4 > events for logging, e0:D,{e1, e2},{e3, e4},{e5, e6}. If so, those events > could do multiplexing. Without the event IDs information, perf tool has > no idea how to interpret the information. Yeah, don't do this. There is no guarantee what so ever you'll get any of those events in the 0-3 range. You really *must* make then a group such that perf knows what events to associated with the LBR event and constain them to the 0-3 range of PMCs. If you want multiplexing, simply create multiple groups with an LBR event in them.
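A minimal user-space sketch of the arrangement Peter suggests: two groups, each led by an event that samples the LBR with the new branch-event flag, opened with the standard perf_event_open() group_fd mechanism. Attr contents are as in the earlier sketch; error handling is omitted.

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
			   int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/*
 * Two independent groups, each led by an LBR-sampling event; the groups
 * can multiplex against each other while each stays within the 0-3
 * counter constraint implied by the feature.
 */
static void open_two_lbr_groups(struct perf_event_attr *lbr_leader,
				struct perf_event_attr *member,
				int fds[2][2])
{
	for (int g = 0; g < 2; g++) {
		fds[g][0] = perf_event_open(lbr_leader, 0, -1, -1, 0);
		fds[g][1] = perf_event_open(member, 0, -1, fds[g][0], 0);
	}
}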
> Yeah, don't do this. There is no guarantee what so ever you'll get any > of those events in the 0-3 range. The kernel can simply force to 0-3 if LBR is enabled and the feature too. It's in Kan's patch and it isn't particularly complicated. > > You really *must* make then a group such that perf knows what events to > associated with the LBR event and constain them to the 0-3 range of > PMCs. > > If you want multiplexing, simply create multiple groups with an LBR > event in them. Well if you force groups then you require user space or a user which understands all the constraints to create groups. I thought one of the basic ideas of perf was to be able to abstract those things. There are tools today (e.g. the perf builtin metrics[1] and I believe others) that don't have enough knowledge to set up real groups and rely on the kernel. While they could be fixed it's a lot of work (I do it in pmu-tools, and it's quite hairy code) -Andi [1] Given they are are not supported by perf record yet, but since perf script already supports evaluating them they could be at some point, and then it would make sense to use them with this feature too.
On Fri, Apr 14, 2023 at 03:47:29PM -0700, Andi Kleen wrote: > > > Yeah, don't do this. There is no guarantee what so ever you'll get any > > of those events in the 0-3 range. > > > The kernel can simply force to 0-3 if LBR is enabled and the feature too. > It's in Kan's patch > > and it isn't particularly complicated. And what, totally leave 4-7 unused even if those counters were not related to LBR at all? That seems exceedingly daft.
On Fri, Apr 14, 2023 at 03:47:29PM -0700, Andi Kleen wrote: > > You really *must* make then a group such that perf knows what events to > > associated with the LBR event and constain them to the 0-3 range of > > PMCs. > > > > If you want multiplexing, simply create multiple groups with an LBR > > event in them. > > > Well if you force groups then you require user space or a user which > understands all the constraints The LBR feature is naturally a group read, very similar to PERF_SAMPLE_READ+PERF_FORMAT_GROUP. > to create groups. I thought one of the basic ideas of perf was to be able to > abstract those things. Yeah, the abstraction at hand is a group.
On 4/17/2023 4:46 AM, Peter Zijlstra wrote: > On Fri, Apr 14, 2023 at 03:47:29PM -0700, Andi Kleen wrote: >>> Yeah, don't do this. There is no guarantee what so ever you'll get any >>> of those events in the 0-3 range. >> >> The kernel can simply force to 0-3 if LBR is enabled and the feature too. >> It's in Kan's patch >> >> and it isn't particularly complicated. > And what, totally leave 4-7 unused even if those counters were not > related to LBR at all? That seems exceedingly daft. Only for the events which enabled LBR and also only if the branch events feature is enabled -j event -e '{event1:b,event2:b,event3:b,event4:b,event5,event6}' event5 and 6 can go > 3 Given there is currently no syntax to control branch events inside a group other than fully enabling/disabling LBR. Kan, I guess that could be added to the user tools. -Andi
On 4/17/2023 4:55 AM, Peter Zijlstra wrote: > >> to create groups. I thought one of the basic ideas of perf was to be able to >> abstract those things. > Yeah, the abstraction at hand is a group. Okay then if we add perf record metrics and want to support this feature we'll need to replicate the kernel scheduler in the user tools -Andi
On 2023-04-17 9:37 a.m., Andi Kleen wrote: > > On 4/17/2023 4:46 AM, Peter Zijlstra wrote: >> On Fri, Apr 14, 2023 at 03:47:29PM -0700, Andi Kleen wrote: >>>> Yeah, don't do this. There is no guarantee what so ever you'll get any >>>> of those events in the 0-3 range. >>> >>> The kernel can simply force to 0-3 if LBR is enabled and the feature >>> too. >>> It's in Kan's patch >>> >>> and it isn't particularly complicated. >> And what, totally leave 4-7 unused even if those counters were not >> related to LBR at all? That seems exceedingly daft. > > > Only for the events which enabled LBR and also only if the branch events > feature is enabled > > -j event -e '{event1:b,event2:b,event3:b,event4:b,event5,event6}' > > event5 and 6 can go > 3 > > Given there is currently no syntax to control branch events inside a > group other than fully enabling/disabling LBR. > > Kan, I guess that could be added to the user tools. We already have a per-event option for LBR, branch_type, which can be used to control branch events in a group. With the patch in this series, we can do, e.g., -j call -e '{cpu/event=0x1,branch_type=event,/,cpu/event=0x2,branch_type=event/,cpu/event=0x3,branch_type=event/,cpu/event=0x4,branch_type=event/,cpu/event=0x5/,cpu/event=0x6/}' Thanks, Kan
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d5628a7b5eaa..3b659a57129a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -126,6 +126,11 @@ struct perf_branch_stack {
 	struct perf_branch_entry entries[];
 };
 
+struct perf_branch_event_ids {
+	__u64				nr;
+	__u64				ids[];
+};
+
 struct task_struct;
 
 /*
@@ -1127,6 +1132,11 @@ static inline bool branch_sample_priv(const struct perf_event *event)
 	return event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_PRIV_SAVE;
 }
 
+static inline bool branch_sample_event(const struct perf_event *event)
+{
+	return event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_EVENT;
+}
+
 struct perf_sample_data {
 
 	/*
@@ -1161,6 +1171,7 @@ struct perf_sample_data {
 	struct perf_callchain_entry	*callchain;
 	struct perf_raw_record		*raw;
 	struct perf_branch_stack	*br_stack;
+	struct perf_branch_event_ids	*br_event_ids;
 	union perf_sample_weight	weight;
 	union  perf_mem_data_src	data_src;
 	u64				txn;
@@ -1250,6 +1261,21 @@ static inline void perf_sample_save_brstack(struct perf_sample_data *data,
 	data->sample_flags |= PERF_SAMPLE_BRANCH_STACK;
 }
 
+static inline void perf_sample_save_event_ids(struct perf_sample_data *data,
+					      struct perf_branch_event_ids *ids)
+{
+	int size = sizeof(u64); /* nr */
+
+	if (WARN_ON_ONCE(ids->nr > PERF_MAX_BRANCH_EVENTS))
+		ids->nr = PERF_MAX_BRANCH_EVENTS;
+
+	size += ids->nr * sizeof(u64);
+
+	data->br_event_ids = ids;
+	data->dyn_size += size;
+	data->sample_flags |= PERF_SAMPLE_BRANCH_EVENT_IDS;
+}
+
 static inline u32 perf_sample_data_size(struct perf_sample_data *data,
 					struct perf_event *event)
 {
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 37675437b768..36d70717ecbd 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -162,8 +162,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_DATA_PAGE_SIZE		= 1U << 22,
 	PERF_SAMPLE_CODE_PAGE_SIZE		= 1U << 23,
 	PERF_SAMPLE_WEIGHT_STRUCT		= 1U << 24,
+	PERF_SAMPLE_BRANCH_EVENT_IDS		= 1U << 25,
 
-	PERF_SAMPLE_MAX = 1U << 25,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 26,		/* non-ABI */
 };
 
 #define PERF_SAMPLE_WEIGHT_TYPE	(PERF_SAMPLE_WEIGHT | PERF_SAMPLE_WEIGHT_STRUCT)
@@ -204,6 +205,8 @@ enum perf_branch_sample_type_shift {
 
 	PERF_SAMPLE_BRANCH_PRIV_SAVE_SHIFT	= 18, /* save privilege mode */
 
+	PERF_SAMPLE_BRANCH_EVENT_SHIFT		= 19, /* save occurrences of events */
+
 	PERF_SAMPLE_BRANCH_MAX_SHIFT		/* non-ABI */
 };
 
@@ -235,6 +238,8 @@ enum perf_branch_sample_type {
 
 	PERF_SAMPLE_BRANCH_PRIV_SAVE	= 1U << PERF_SAMPLE_BRANCH_PRIV_SAVE_SHIFT,
 
+	PERF_SAMPLE_BRANCH_EVENT	= 1U << PERF_SAMPLE_BRANCH_EVENT_SHIFT,
+
 	PERF_SAMPLE_BRANCH_MAX		= 1U << PERF_SAMPLE_BRANCH_MAX_SHIFT,
 };
 
@@ -1018,6 +1023,8 @@ enum perf_event_type {
 	 *	char			data[size]; } && PERF_SAMPLE_AUX
 	 *	{ u64			data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
 	 *	{ u64			code_page_size;} && PERF_SAMPLE_CODE_PAGE_SIZE
+	 *	{ u64			nr;
+	 *	  u64			ids[nr];} && PERF_SAMPLE_BRANCH_EVENT_IDS
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
@@ -1394,6 +1401,12 @@ union perf_mem_data_src {
 #define PERF_MEM_S(a, s) \
 	(((__u64)PERF_MEM_##a##_##s) << PERF_MEM_##a##_SHIFT)
 
+#define PERF_MAX_BRANCH_EVENTS		4
+#define PERF_BRANCH_EVENTS_MASK		0x3
+#define PERF_BRANCH_EVENTS_STEP		2
+
+#define perf_branch_event_by_idx(_events, _idx)		\
+	(((_events) >> ((_idx) * PERF_BRANCH_EVENTS_STEP)) & PERF_BRANCH_EVENTS_MASK)
 /*
  * single taken branch record layout:
  *
@@ -1410,6 +1423,10 @@ union perf_mem_data_src {
  *    cycles: cycles from last branch (or 0 if not supported)
  *      type: branch type
  *      spec: branch speculation info (or 0 if not supported)
+ *    events: occurrences of events since the last branch entry.
+ *            The fields can store up to 4 events with saturating
+ *            at value 3.
+ *            (or 0 if not supported)
  */
 struct perf_branch_entry {
 	__u64	from;
@@ -1423,7 +1440,8 @@ struct perf_branch_entry {
 		spec:2,     /* branch speculation info */
 		new_type:4, /* additional branch type */
 		priv:3,     /* privilege level */
-		reserved:31;
+		events:8,   /* occurrences of events since the last branch entry */
+		reserved:23;
 };
 
 union perf_sample_weight {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f79fd8b87f75..1ec7cc8b0730 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7401,6 +7401,23 @@ void perf_output_sample(struct perf_output_handle *handle,
 			perf_aux_sample_output(event, handle, data);
 	}
 
+	if (sample_type & PERF_SAMPLE_BRANCH_EVENT_IDS) {
+		if (data->br_event_ids) {
+			size_t size;
+
+			size = data->br_event_ids->nr * sizeof(u64);
+			perf_output_put(handle, data->br_event_ids->nr);
+			perf_output_copy(handle, data->br_event_ids->ids, size);
+		} else {
+			/*
+			 * we always store at least the value of nr
+			 */
+			u64 nr = 0;
+
+			perf_output_put(handle, nr);
+		}
+	}
+
 	if (!event->attr.watermark) {
 		int wakeup_events = event->attr.wakeup_events;
 
@@ -7747,6 +7764,12 @@ void perf_prepare_sample(struct perf_sample_data *data,
 		data->dyn_size += size + sizeof(u64); /* size above */
 		data->sample_flags |= PERF_SAMPLE_AUX;
 	}
+
+	if (filtered_sample_type & PERF_SAMPLE_BRANCH_EVENT_IDS) {
+		data->br_event_ids = NULL;
+		data->dyn_size += sizeof(u64);
+		data->sample_flags |= PERF_SAMPLE_BRANCH_EVENT_IDS;
+	}
 }
 
 void perf_prepare_header(struct perf_event_header *header,
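For reference, a minimal consumer-side sketch of the new sample tail defined in the uapi hunk above: a u64 nr followed by nr event IDs, with nr written as 0 when no IDs are available (as perf_output_sample() does). The helper name is hypothetical.

#include <stdint.h>

/*
 * Parse the PERF_SAMPLE_BRANCH_EVENT_IDS portion at the tail of a
 * PERF_RECORD_SAMPLE: return the number of IDs, point *ids_out at them
 * and advance the cursor past the array.
 */
static uint64_t parse_branch_event_ids(const uint64_t **cursor,
				       const uint64_t **ids_out)
{
	const uint64_t *p = *cursor;
	uint64_t nr = *p++;	/* u64 nr, always present, may be 0 */

	*ids_out = p;		/* u64 ids[nr] follow immediately */
	*cursor = p + nr;
	return nr;
}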