From patchwork Sun Feb 19 09:27:58 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59095 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp772011wrn; Sun, 19 Feb 2023 01:31:56 -0800 (PST) X-Google-Smtp-Source: AK7set8tpWZ+S6qDyKSBHqPMSsAjaE3oIy0GpN5UukVMuIZ0F/0oFLwJANvcACeft/gpt8vy0hLQ X-Received: by 2002:a17:906:d7ac:b0:8b2:3e72:101f with SMTP id pk12-20020a170906d7ac00b008b23e72101fmr8063031ejb.19.1676799115863; Sun, 19 Feb 2023 01:31:55 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799115; cv=none; d=google.com; s=arc-20160816; b=wJwR9VwFx31FA37KyGZM8S2BhtnV+rph22IlQfqUvCGaxIBAIRdBPDQYx6o5wBhxnn E+PeCULgwNJYECidERhnMa5CvQ9t79Ku6xAWS//k6cPAa1vRwUHiijUdGylNqaiPXPtc Tz/BBgxCt3q3WT7r32qM00wt6z+RvygJMAPCIqg9SD2GENn3isdAkvdgArFt5PWjtAyI c/RVkNo8KsqU3QlGuGrzMQO0GQE6UG2YTBdGWd1P7CoH57LktqY2TMpzAe/WtfvQR4wp hw6/0PLyZH2l/OIQzWHigiCeISXhqjs0wjtvci+onahH53uYoemOzimrreElXcAICOoD RwGw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=hFVME4MawZ7uSatJrZd2lz8rTpTsmAg9tHqKzSK6mH4=; b=T5H4lKuHbj716a3hijBlpRWUyEC81AUTNtBS3p0RP3c+LvR4nFaaSFghZNFlMBo5Tl kJrVQsqemuAZBi9MHrrI81BJvY2NKwptAA1qzSQmz0xnwTiHI98S9QI/yUNLLPuHAoz4 pCLCI2WWpqyHrFDsqTaKxfMbMtKPMvtYa0DbEnlo8fMIAsCz+dK5kxaG5kqdaPOfHxqB ADuC179KlrQ0xo16gfIq7uit+0gIhMEXTSjw5xt33R5Sl5J5GXMLLxtxHsDffhDQRDoz yMCftpHbjTUU3Zv3qLcad3DsLEyyOrSlwPz6HEeZhQ1doD9ojsHC5JyYL+t8LzMhdDdf dJhQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=lYts0MZF; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id cm2-20020a170907938200b008b85ed1bbc1si5489846ejc.206.2023.02.19.01.31.31; Sun, 19 Feb 2023 01:31:55 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=lYts0MZF; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229758AbjBSJ3P (ORCPT + 99 others); Sun, 19 Feb 2023 04:29:15 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50498 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229754AbjBSJ3N (ORCPT ); Sun, 19 Feb 2023 04:29:13 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 63957E3B8 for ; Sun, 19 Feb 2023 01:29:10 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id x7-20020a25b907000000b006fdc6aaec4fso363357ybj.20 for ; Sun, 19 Feb 2023 01:29:10 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=hFVME4MawZ7uSatJrZd2lz8rTpTsmAg9tHqKzSK6mH4=; b=lYts0MZFkBMWWWqsvbZ4s4LF7YBZTVWoKqr0XB8T3QZMUTOA2s2Z+G/g2UZslwUZuC Bg0JfwZvu8l1XTdfCWtyr+kOVvUd3vxdajoINwRuBTIhR35jt0E02ChfAVNX7oNfDnUc OQxysWfmRktf3N7SbQdfMUQDwloDegoPhyu+QeutG7viIgr3NSQJg+6Q1X7h+12Q6I/W lCtx21C8+b0DCEVf/bHP/8aPsvD8vdxjq3D6aElrMFCmBpWcEKc/gfnWhBWTMFWPGVfd G0NcqD61UhEmodEZAPS6XL71MQzBlkE/OwDXXgcA/gAT9OXasg2DRifrqgTUe2QIxicB mB5Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=hFVME4MawZ7uSatJrZd2lz8rTpTsmAg9tHqKzSK6mH4=; b=HJIJ6FXUsc35aEBlKPLKVeKvfhb/F/pf/3/f6WfRpLVs6AmhrXdUXmy4B/lsi0rrqP ttQjvecPVl3yvSgJjsppg4FEo0465/5qAzHXyrtwLpZuc46l0UdX3r6a2aP0H/+rQIuL QLt8YpFOT8kuGuH8jS2CCllbK12mObcgHdjMIUv0ER/jmrdebq+xCzxGfTZi+JXuVhJA WRpS4LN3YGK43QD/pI6tn/DyYrHQZQYSmwU2yONkfkJRf0x4w+iPPm03b8md2UQ+zKeH Z1YqOJGvpnGPdQ5vwbmnq7iPCKuEKxtZgbEtbxYZsbRUWnZl4T+tpUfcepRzO8vz1wZ4 H/GQ== X-Gm-Message-State: AO0yUKVaOlNqMQR8bm/zvtDLa7WU4rL0utpT/J+NNiPwgrTd0jmABkFR 3wtuRE0pGG+6XxGcU+sNLhFAlmVW+EU4 X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a25:9a45:0:b0:97a:4493:a7b8 with SMTP id r5-20020a259a45000000b0097a4493a7b8mr128740ybo.505.1676798949427; Sun, 19 Feb 2023 01:29:09 -0800 (PST) Date: Sun, 19 Feb 2023 01:27:58 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-2-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 01/51] perf tools: Ensure evsel name is initialized From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758251310111071046?= X-GMAIL-MSGID: =?utf-8?q?1758251310111071046?= Use the evsel__name accessor as otherwise name may be NULL resulting in a segv. This was observed with the perf stat shell test. Signed-off-by: Ian Rogers Reviewed-by: Kajol Jain --- tools/perf/util/synthetic-events.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c index 9ab9308ee80c..6def01036eb5 100644 --- a/tools/perf/util/synthetic-events.c +++ b/tools/perf/util/synthetic-events.c @@ -2004,7 +2004,7 @@ int perf_event__synthesize_event_update_name(struct perf_tool *tool, struct evse perf_event__handler_t process) { struct perf_record_event_update *ev; - size_t len = strlen(evsel->name); + size_t len = strlen(evsel__name(evsel)); int err; ev = event_update_event__new(len + 1, PERF_EVENT_UPDATE__NAME, evsel->core.id[0]); From patchwork Sun Feb 19 09:27:59 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59091 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp771614wrn; Sun, 19 Feb 2023 01:30:58 -0800 (PST) X-Google-Smtp-Source: AK7set/2YXUqiEA6buDLI3F9TyMqTKeNmU48et4NcSo4nT4OYYHmHYk2EAryXCM8u60CCuRFqulW X-Received: by 2002:aa7:d4d4:0:b0:4ac:b4b1:53fe with SMTP id t20-20020aa7d4d4000000b004acb4b153femr1220203edr.20.1676799057869; Sun, 19 Feb 2023 01:30:57 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799057; cv=none; d=google.com; s=arc-20160816; b=jRBtmEfM9u8S9Q6Jz0A/d8vJi7tC/JPupf+B4fYjxtLSJBudptJuSsXoro6BR1OVeb adZWFk5obVBtbal3ezfa47Qb7Q986n3RK6iTrjBjQnub4JbkoxLtrwlgBeYAtwSa1rHh NPZcqZo4cXYhlxprsgi2VPAOgCKuopZOUik0uMaezbOVsDnKY/Lz7PKEa5dtPO3hHOoM fgc1x4rDeCkTU+ZRmBz3puAPU4eYiCpryBk7xQYw49rYPFb/GxJ5AmNbBPz7ACDfrYh3 QLsYQWoP36cnsqUuS31FAI/R/IicRBRVcc9TskKtQAZNnIyz/izyOGg+9Fb2wUrSUwbk vTUw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=sNT3gLbGiThQ7GyW2mykwD5/AMMBv9vHnEgALQN2uR8=; b=TnW8RUnWx6qIw4kFi3uO5Ovae1/eB6ILr6Bg8TioTu1Fnkx/I5/zjOllyflCJg8+cW cxWq0fMH5UOHyZTeCtdirNwC5RTG0GR6A2jrMu62YN6o9A5XtSZdrc2x9bbnU4vkaaO1 W3K2MHo+UM0CWrgpSwK3cLCfFFAOEABblOHrQm5SV8GCwvcSFaVC8oI781yso8x4TdF1 UkcOP2LuCZqequXY/Cpnc1qstHTbfAba+O0xTGjsV9JSEtueZwQ9BCF6weUk6wZC6B8w dueSNRsVCjKjlZMta0qccueXRpl8bzTH9p9PHvoS0TmBxcyFjk1UVpXwND1WIGs9oMf9 QbyA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=Y4Uf0mLp; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id bf3-20020a0564021a4300b004ada4ddbbf6si7904286edb.426.2023.02.19.01.30.34; Sun, 19 Feb 2023 01:30:57 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=Y4Uf0mLp; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229770AbjBSJ33 (ORCPT + 99 others); Sun, 19 Feb 2023 04:29:29 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50986 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229767AbjBSJ31 (ORCPT ); Sun, 19 Feb 2023 04:29:27 -0500 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id ED9DF11EBE for ; Sun, 19 Feb 2023 01:29:18 -0800 (PST) Received: by mail-yb1-xb4a.google.com with SMTP id 124-20020a250482000000b0090f2c84a6a4so2098813ybe.13 for ; Sun, 19 Feb 2023 01:29:18 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=sNT3gLbGiThQ7GyW2mykwD5/AMMBv9vHnEgALQN2uR8=; b=Y4Uf0mLpp4+NuPEb+TnO5xc7bJkkLosWlqADaN5wSxFmQOe/91hOwRqdwOsnA2dwI8 lOeOYfyo6/3phCF0kMnC6CRSjNuEGHWoLQ4n347Mk7HueB4ctrpOjryVRDEQGLXkzd6e rvIu7zQzPyqi4JgLTyorNcbIgDkhaZ9lqcvEm3zhvEUDd78PLKPII0UTMhArle+NMJCM obu/2gqT85mz/4xWbiFZJLq167Dtf7l7o+cy7Lhcg42quJzavDoS3PWwycFMbMEJVWYE vN1YRagfuIU0eNlebWCwaWKRhJXfu9/tzk89DZPgU7KmkmxzwdcCOb4OESIj1KGmA6On O5Ng== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=sNT3gLbGiThQ7GyW2mykwD5/AMMBv9vHnEgALQN2uR8=; b=y1tFtEhU217lwTw/fbBgyI0cfg8Btib/a5zz0eJjuaOBAziy4/1U+iBXRRBV3ESGbM TZXxyyseew59pXAot8DgTBAPfTTUEkNBEIYxO7tzm5Y7nVvoVvkloZ8WNvrphgnawZpJ Q3XoT8lLkAhVl/kdAnBwWAGnvHOZWVTdDF++AcyPLAR+6CXhr+6CN49VbMJZG66MuQUx n1+57jAk3W0ncspfRVhu2cIDH+/+ZeeYpr4W3ymGnglb8e911UpdeQckUB3fSNXbp7zh hsn/MKE5wBxHsx71Hs36ZgO93t6ISJNakgo8MPIUieys0FJHCJ+YItOgnHgkrokVkgLW 9ZAg== X-Gm-Message-State: AO0yUKUcsi2oVQnzN/L0Hq7u9Hkwi1c6fLEM4blPozEnUBFfNeyrIXbw oSVdxMMTUKb0B6ouls8+PTtI/4kq0t/2 X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a25:9390:0:b0:94a:ebba:cba7 with SMTP id a16-20020a259390000000b0094aebbacba7mr18673ybm.8.1676798958130; Sun, 19 Feb 2023 01:29:18 -0800 (PST) Date: Sun, 19 Feb 2023 01:27:59 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-3-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 02/51] perf metrics: Improve variable names From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758251248823120082?= X-GMAIL-MSGID: =?utf-8?q?1758251248823120082?= has_constraint implies the NMI_WATCHDOG_CONSTRAINT and if the constraint is detected it causes events not to be grouped. Most of the code cares about whether events are grouped or not, so rename has_constraint to group_events. Also remove group from metricgroup___watchdog_constraint_hint as the warning is specific to a metric. Make the warning message agree with this too. Signed-off-by: Ian Rogers --- tools/perf/util/metricgroup.c | 45 +++++++++++++++++------------------ 1 file changed, 22 insertions(+), 23 deletions(-) diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c index f3559be95541..b2aa6e049804 100644 --- a/tools/perf/util/metricgroup.c +++ b/tools/perf/util/metricgroup.c @@ -136,10 +136,9 @@ struct metric { /** Optional null terminated array of referenced metrics. */ struct metric_ref *metric_refs; /** - * Is there a constraint on the group of events? In which case the - * events won't be grouped. + * Should events of the metric be grouped? */ - bool has_constraint; + bool group_events; /** * Parsed events for the metric. Optional as events may be taken from a * different metric whose group contains all the IDs necessary for this @@ -148,12 +147,12 @@ struct metric { struct evlist *evlist; }; -static void metricgroup___watchdog_constraint_hint(const char *name, bool foot) +static void metric__watchdog_constraint_hint(const char *name, bool foot) { static bool violate_nmi_constraint; if (!foot) { - pr_warning("Splitting metric group %s into standalone metrics.\n", name); + pr_warning("Not grouping metric %s's events.\n", name); violate_nmi_constraint = true; return; } @@ -167,18 +166,18 @@ static void metricgroup___watchdog_constraint_hint(const char *name, bool foot) " echo 1 > /proc/sys/kernel/nmi_watchdog\n"); } -static bool metricgroup__has_constraint(const struct pmu_metric *pm) +static bool metric__group_events(const struct pmu_metric *pm) { if (!pm->metric_constraint) - return false; + return true; if (!strcmp(pm->metric_constraint, "NO_NMI_WATCHDOG") && sysctl__nmi_watchdog_enabled()) { - metricgroup___watchdog_constraint_hint(pm->metric_name, false); - return true; + metric__watchdog_constraint_hint(pm->metric_name, /*foot=*/false); + return false; } - return false; + return true; } static void metric__free(struct metric *m) @@ -227,7 +226,7 @@ static struct metric *metric__new(const struct pmu_metric *pm, } m->pctx->sctx.runtime = runtime; m->pctx->sctx.system_wide = system_wide; - m->has_constraint = metric_no_group || metricgroup__has_constraint(pm); + m->group_events = !metric_no_group && metric__group_events(pm); m->metric_refs = NULL; m->evlist = NULL; @@ -637,7 +636,7 @@ static int decode_all_metric_ids(struct evlist *perf_evlist, const char *modifie static int metricgroup__build_event_string(struct strbuf *events, const struct expr_parse_ctx *ctx, const char *modifier, - bool has_constraint) + bool group_events) { struct hashmap_entry *cur; size_t bkt; @@ -662,7 +661,7 @@ static int metricgroup__build_event_string(struct strbuf *events, } /* Separate events with commas and open the group if necessary. */ if (no_group) { - if (!has_constraint) { + if (group_events) { ret = strbuf_addch(events, '{'); RETURN_IF_NON_ZERO(ret); } @@ -716,7 +715,7 @@ static int metricgroup__build_event_string(struct strbuf *events, RETURN_IF_NON_ZERO(ret); } } - if (!no_group && !has_constraint) { + if (!no_group && group_events) { ret = strbuf_addf(events, "}:W"); RETURN_IF_NON_ZERO(ret); } @@ -1252,7 +1251,7 @@ static int metricgroup__add_metric_list(const char *list, bool metric_no_group, * Warn about nmi_watchdog if any parsed metrics had the * NO_NMI_WATCHDOG constraint. */ - metricgroup___watchdog_constraint_hint(NULL, true); + metric__watchdog_constraint_hint(NULL, /*foot=*/true); /* No metrics. */ if (count == 0) return -EINVAL; @@ -1295,7 +1294,7 @@ static void find_tool_events(const struct list_head *metric_list, } /** - * build_combined_expr_ctx - Make an expr_parse_ctx with all has_constraint + * build_combined_expr_ctx - Make an expr_parse_ctx with all !group_events * metric IDs, as the IDs are held in a set, * duplicates will be removed. * @metric_list: List to take metrics from. @@ -1315,7 +1314,7 @@ static int build_combined_expr_ctx(const struct list_head *metric_list, return -ENOMEM; list_for_each_entry(m, metric_list, nd) { - if (m->has_constraint && !m->modifier) { + if (!m->group_events && !m->modifier) { hashmap__for_each_entry(m->pctx->ids, cur, bkt) { dup = strdup(cur->pkey); if (!dup) { @@ -1342,14 +1341,14 @@ static int build_combined_expr_ctx(const struct list_head *metric_list, * @fake_pmu: used when testing metrics not supported by the current CPU. * @ids: the event identifiers parsed from a metric. * @modifier: any modifiers added to the events. - * @has_constraint: false if events should be placed in a weak group. + * @group_events: should events be placed in a weak group. * @tool_events: entries set true if the tool event of index could be present in * the overall list of metrics. * @out_evlist: the created list of events. */ static int parse_ids(bool metric_no_merge, struct perf_pmu *fake_pmu, struct expr_parse_ctx *ids, const char *modifier, - bool has_constraint, const bool tool_events[PERF_TOOL_MAX], + bool group_events, const bool tool_events[PERF_TOOL_MAX], struct evlist **out_evlist) { struct parse_events_error parse_error; @@ -1393,7 +1392,7 @@ static int parse_ids(bool metric_no_merge, struct perf_pmu *fake_pmu, } } ret = metricgroup__build_event_string(&events, ids, modifier, - has_constraint); + group_events); if (ret) return ret; @@ -1458,7 +1457,7 @@ static int parse_groups(struct evlist *perf_evlist, const char *str, if (!ret && combined && hashmap__size(combined->ids)) { ret = parse_ids(metric_no_merge, fake_pmu, combined, /*modifier=*/NULL, - /*has_constraint=*/true, + /*group_events=*/false, tool_events, &combined_evlist); } @@ -1476,7 +1475,7 @@ static int parse_groups(struct evlist *perf_evlist, const char *str, struct metric *n; struct metric_expr *expr; - if (combined_evlist && m->has_constraint) { + if (combined_evlist && !m->group_events) { metric_evlist = combined_evlist; } else if (!metric_no_merge) { /* @@ -1507,7 +1506,7 @@ static int parse_groups(struct evlist *perf_evlist, const char *str, } if (!metric_evlist) { ret = parse_ids(metric_no_merge, fake_pmu, m->pctx, m->modifier, - m->has_constraint, tool_events, &m->evlist); + m->group_events, tool_events, &m->evlist); if (ret) goto out; From patchwork Sun Feb 19 09:28:00 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59096 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp772160wrn; Sun, 19 Feb 2023 01:32:15 -0800 (PST) X-Google-Smtp-Source: AK7set/DnLpndNWsAYG/P0fZbkfxj0i3lqRlZH2hp0CtkLWleim+jvHSAi+IRj5Nm9qC/OdrvDei X-Received: by 2002:a17:906:9f19:b0:879:ec1a:4ac with SMTP id fy25-20020a1709069f1900b00879ec1a04acmr7152465ejc.76.1676799135656; Sun, 19 Feb 2023 01:32:15 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799135; cv=none; d=google.com; s=arc-20160816; b=t7XdTM3TSqD7ARCwhdxS7LclUb8V31FYW4CqSC5Rm0POKYDdBUU24BRYPBy0T+A58U VB8RGjRPjJonJU84lTNACjZdh3cI/U/tPkDv/eW6Z3+8rlA7a4z7mTkRnHr+JC1338Pp 9+st2Dk/WJjDAYf5BAm/04XEdrpSebo4TaFrmlrSc03i03lp+vOPkOKblzB16xiAAljZ JMM8zZX4R4ZrZtSBX8QmAle/dugdgaMagAhUqhxeXrMQitmlufMizNQ8ZVG8MKYoce2e 4ZbaQM8s/gkFwexlJjP4A2B6h+V9NYfJ8DubXvit7J19kFDTUDBtMtkn7LOhWSeH6yDf Vz0Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=d+nYipBPRK9ot6wFg6zmAiDtcnqlM5sx03XMKeU13O4=; b=ibf3SVs6yq6y02KBu+wYSwTIpklTHroG2AfsFF7wgEodMSv89udoqFHoYVzCNpnA3y fwsnQv/zzBzA+rWyXfk1X6uD8dnLl1toE78LUtlF3qYeRFBxAYw2VoEGFibmDoUcNxR4 RAG/yOMe7+eqBiOw3+zYNIMrkJG9eJF/SFfvEAc9jma2ZPv6z0d+ERjKEV68VwcqwKE5 xFdYYrz6u3qbVgX2pLQdWxLMCsAF9OY9xK9bb0RdkoVsCYyV+5HsF8SDfT4wLdtwixcZ R8Kcpqnk3FCH/N0wynpJzT6dEnAxDndTQq9QswjdSCbwBqUEO3MdbjXzAzV9zQHnN5ez y69w== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=mDOqVGr0; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id 24-20020a170906225800b008c926aa3bb5si2842406ejr.342.2023.02.19.01.31.52; Sun, 19 Feb 2023 01:32:15 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=mDOqVGr0; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229580AbjBSJ3l (ORCPT + 99 others); Sun, 19 Feb 2023 04:29:41 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51410 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229733AbjBSJ3j (ORCPT ); Sun, 19 Feb 2023 04:29:39 -0500 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 54D3E10258 for ; Sun, 19 Feb 2023 01:29:27 -0800 (PST) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-53687f6de13so18387797b3.9 for ; Sun, 19 Feb 2023 01:29:27 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=d+nYipBPRK9ot6wFg6zmAiDtcnqlM5sx03XMKeU13O4=; b=mDOqVGr0N132A/Ll3pITLGW4r43KgUDNLPKUyUvW8eIG8vXmMxFkVQb5dbaC0M2IOs 2GoGfqJQUeUuATD9qo9D7VoT+Pze6N0XYIPKKvB4wW3libDyZhB2y4XFUwSAiPjb+fxF d62ecsMidsbi/mZCuKzEfF4tYuUcoP4N6Y0ppQAp/tynm+DGQKu5p21gv51vkhSBFals 0qcg9jft/tM3YZvvPuyYuJbIeHf7ShfSEZr76R9jwVkBrc4zZSl09n8ttK31h8tYFqSp 2UTe/9SYmbm7tRvmPVqXfdfIS1+MPbuPJIL6t6mQjY3yomupc8sIMDOl76CGNIiTqd4V JB5A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=d+nYipBPRK9ot6wFg6zmAiDtcnqlM5sx03XMKeU13O4=; b=aqts44gduNFwFQuGB2fA6qA0TA5ahSrl5lUcX5TcmYu7R809s/eXdnX9QfwsvwFKFO iCYkNEmE++cFHSf6hpPTgvI7vxjjsM/cYcXJ6ihHYg1S3BSRBg1zygj2//wQ9eYFj3Ml N0LAibQLf7+UZKp14BK3m4lW5I/qFjmAvYqFSqsA2XPgMjV4C+MiAX8BHiGac+xkGCj6 yO7aumre2/nxRmTuQQR6esexS1jzJ0Ftx05XGr0N68NF4XfR8YY+z93I+mtr3wTDhu4W 4bllmWlRGE5kpZAj6fyYVIkANb9TK9TAB1qA1kUkHPsH3ofZ2wWGPqLC7f5fSANYYmuH wlzg== X-Gm-Message-State: AO0yUKX3gRcVwZIoMEjTLCpJgTXM2Y3koOC//h2zQ639Yem7tCF/v+MY 7bRAkeYUwidMO5wla/ONepBoR0IXdx4M X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a05:6902:144b:b0:88a:f2f:d004 with SMTP id a11-20020a056902144b00b0088a0f2fd004mr190523ybv.5.1676798966583; Sun, 19 Feb 2023 01:29:26 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:00 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-4-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 03/51] perf pmu-events: Remove aggr_mode from pmu_event From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758251330504338539?= X-GMAIL-MSGID: =?utf-8?q?1758251330504338539?= aggr_mode is used on Power to set a flag for metrics. For pmu_event it is unused. Signed-off-by: Ian Rogers --- tools/perf/pmu-events/jevents.py | 2 +- tools/perf/pmu-events/pmu-events.h | 1 - tools/perf/tests/pmu-events.c | 6 ------ 3 files changed, 1 insertion(+), 8 deletions(-) diff --git a/tools/perf/pmu-events/jevents.py b/tools/perf/pmu-events/jevents.py index 2bcd07ce609f..db8b92de113e 100755 --- a/tools/perf/pmu-events/jevents.py +++ b/tools/perf/pmu-events/jevents.py @@ -44,7 +44,7 @@ _json_event_attributes = [ # Seems useful, put it early. 'event', # Short things in alphabetical order. - 'aggr_mode', 'compat', 'deprecated', 'perpkg', 'unit', + 'compat', 'deprecated', 'perpkg', 'unit', # Longer things (the last won't be iterated over during decompress). 'long_desc' ] diff --git a/tools/perf/pmu-events/pmu-events.h b/tools/perf/pmu-events/pmu-events.h index b7d4a66b8ad2..cee8b83792f8 100644 --- a/tools/perf/pmu-events/pmu-events.h +++ b/tools/perf/pmu-events/pmu-events.h @@ -22,7 +22,6 @@ struct pmu_event { const char *pmu; const char *unit; const char *perpkg; - const char *aggr_mode; const char *deprecated; }; diff --git a/tools/perf/tests/pmu-events.c b/tools/perf/tests/pmu-events.c index accf44b3d968..9b4c94ba5460 100644 --- a/tools/perf/tests/pmu-events.c +++ b/tools/perf/tests/pmu-events.c @@ -331,12 +331,6 @@ static int compare_pmu_events(const struct pmu_event *e1, const struct pmu_event return -1; } - if (!is_same(e1->aggr_mode, e2->aggr_mode)) { - pr_debug2("testing event e1 %s: mismatched aggr_mode, %s vs %s\n", - e1->name, e1->aggr_mode, e2->aggr_mode); - return -1; - } - if (!is_same(e1->deprecated, e2->deprecated)) { pr_debug2("testing event e1 %s: mismatched deprecated, %s vs %s\n", e1->name, e1->deprecated, e2->deprecated); From patchwork Sun Feb 19 09:28:01 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59092 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp771764wrn; Sun, 19 Feb 2023 01:31:16 -0800 (PST) X-Google-Smtp-Source: AK7set9uy8oVOvkP5AFf4PIuFzkLOAZMo/gg8st/+A7pyd/i38yrBeA4Dx85+UBu7ljARxh5X1EI X-Received: by 2002:a05:6402:344f:b0:4ac:be68:9f9b with SMTP id l15-20020a056402344f00b004acbe689f9bmr931591edc.23.1676799076390; Sun, 19 Feb 2023 01:31:16 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799076; cv=none; d=google.com; s=arc-20160816; b=D5+dpa7P5YZ/Y3+AwcgiP5acZyHGwvZctJUJpQjDqSqdpbGV9rqrCSykP7WaqJ2c5s TP0wBX4uIgP5ixs69yOeE13aVft1SXr9DazK88ERuBy+/2p7xHCN3QUmYWz2Amuw/sfo TODk4mj4dpgkpnbaKGVsjKwkLbVgn9sKOZpiqGiGNqVv5nELH5u3yiHx/tAJ3uTvQ0Ef P6rf3ulFZOV2q3or6rqgDWyOdaxHzuVYPUDTWW/ux9DYEKaAsgW71N502VThxlcspphi uVOmiyEUwjmh5Or7AGYWiJp9su/rXjcA4cGkcUDbauP0gBas0lcoOq7eFGNFvni64iIr 21HQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=iXfoLKVrMT/xFDAU9J/MJSSK5kW91sU/HJUqY9Ndk9k=; b=GMeH2BuGjQLQJqrPKXI2mHxvSUvV7xMQAU+cMWQP0FAJwLlvSf6GmKeIwoz2UTRMr3 MsIMx4wTyaRjnfCeDWtS1JAH0g/+EQF4Gti+NrKa9L3rvwQfTLFBlRp6jBoucnMYwnn+ PYm5Cb7kt0VQfKZZ1/3g/V8xaLRmKVBBNjKWSi0pgdLxdvNv1bQBNXX17+NPGGdGu0WM RugXSK+k7DcSNKxFBkt/OhvPfCs4nU5YVU/8/qwCSAc884EV9i/3Rs+2oYl4yXy0+VaQ fHwwF5P8L/e4Ee3gJImnFZPll9zUZ/2QhyAzHeOZLjEATe2kgau/vMFK8nyZbdMRwx/i Ylzw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=hvIt7PlP; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id w25-20020aa7d299000000b004ac1ba3398asi11164721edq.210.2023.02.19.01.30.53; Sun, 19 Feb 2023 01:31:16 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=hvIt7PlP; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229733AbjBSJ3y (ORCPT + 99 others); Sun, 19 Feb 2023 04:29:54 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51788 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229824AbjBSJ3w (ORCPT ); Sun, 19 Feb 2023 04:29:52 -0500 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D473112059 for ; Sun, 19 Feb 2023 01:29:34 -0800 (PST) Received: by mail-yb1-xb4a.google.com with SMTP id 124-20020a250482000000b0090f2c84a6a4so2099150ybe.13 for ; Sun, 19 Feb 2023 01:29:34 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=iXfoLKVrMT/xFDAU9J/MJSSK5kW91sU/HJUqY9Ndk9k=; b=hvIt7PlPUa96zxgL3OClxpS1XAsvx3NC/q+W9HuZaXdVTFcPF5mhHBBPgrxNo/plkE 8SQgtmTt3OY+IeyeApAVO5G8qepG4VqMq1Rr0Pbx4PYZOqQAt6FIvLde6bhwNblKTjVK t3obfWfpr+KpqjPHIHR7LnqVww4FgJg5b5kXeSn61Ikf4P4eddPaykia8uJeA3rCltTk Ui4vESOwkX8LRuB+I4RbXOQOgqoMW16sEo1RhpbJ9scKV16Vp/7uA4jyyz0EdB/cBKp8 a6/8FMaDj+VptzYctbUOoj7Qd4ameeLlWF+41x+qOCatSy9HcN2Qd9Nb4+sI94xObNim NIog== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=iXfoLKVrMT/xFDAU9J/MJSSK5kW91sU/HJUqY9Ndk9k=; b=gwAwbpzDZpljGdHLR3X/yx5dvAuxb2M+TF2fvmX4Lxx8EwXfgpUw2CmMX4nhh7t8eZ zPF8g5ILiOd9I3CB2O+buKrLSvAiobgOyn6BG+BXUjmy7+k0HGxitqGNEcOLLrqrpz15 4VV+FAprzmhq5fchIK42wleJBkDXgidUbFMLg5q1F59PPI6BDfbl9hPoxUs3e7KHjjuq V0jckmAe5c4b55FmO6MnHaIJEGjLdvLvOV0dqH07EuWt+ckt0d/JNTIZiE93GFtLYVOp wb3mVBY/8RkFBCGiDVO1hlMf60iWm5x4tZQSPLpfLq4G5s8UieFIyLG9+Eke65dqwfD7 epmA== X-Gm-Message-State: AO0yUKWnYDr7EoOSK4sLLemToYfTE5j4dVK5Ix6PK4FicmX1gZM5HX1N Glme14yH+RKUQiCyS0n3iccp//Iv4WBV X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a0d:eb03:0:b0:535:83c5:cf3b with SMTP id u3-20020a0deb03000000b0053583c5cf3bmr347945ywe.87.1676798974536; Sun, 19 Feb 2023 01:29:34 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:01 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-5-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 04/51] perf pmu-events: Change aggr_mode to be an enum From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758251268379149945?= X-GMAIL-MSGID: =?utf-8?q?1758251268379149945?= Rather than use a string to encode aggr_mode, use an enum value. Signed-off-by: Ian Rogers --- tools/perf/arch/powerpc/util/header.c | 2 +- tools/perf/pmu-events/jevents.py | 17 +++++++++++------ tools/perf/pmu-events/pmu-events.h | 2 +- 3 files changed, 13 insertions(+), 8 deletions(-) diff --git a/tools/perf/arch/powerpc/util/header.c b/tools/perf/arch/powerpc/util/header.c index 78eef77d8a8d..c8d0dc775e5d 100644 --- a/tools/perf/arch/powerpc/util/header.c +++ b/tools/perf/arch/powerpc/util/header.c @@ -45,6 +45,6 @@ int arch_get_runtimeparam(const struct pmu_metric *pm) int count; char path[PATH_MAX] = "/devices/hv_24x7/interface/"; - atoi(pm->aggr_mode) == PerChip ? strcat(path, "sockets") : strcat(path, "coresperchip"); + strcat(path, pm->aggr_mode == PerChip ? "sockets" : "coresperchip"); return sysfs__read_int(path, &count) < 0 ? 1 : count; } diff --git a/tools/perf/pmu-events/jevents.py b/tools/perf/pmu-events/jevents.py index db8b92de113e..2b08d7c18f4b 100755 --- a/tools/perf/pmu-events/jevents.py +++ b/tools/perf/pmu-events/jevents.py @@ -678,10 +678,13 @@ static void decompress_event(int offset, struct pmu_event *pe) { \tconst char *p = &big_c_string[offset]; """) + enum_attributes = ['aggr_mode'] for attr in _json_event_attributes: - _args.output_file.write(f""" -\tpe->{attr} = (*p == '\\0' ? NULL : p); -""") + _args.output_file.write(f'\n\tpe->{attr} = ') + if attr in enum_attributes: + _args.output_file.write("(*p == '\\0' ? 0 : *p - '0');\n") + else: + _args.output_file.write("(*p == '\\0' ? NULL : p);\n") if attr == _json_event_attributes[-1]: continue _args.output_file.write('\twhile (*p++);') @@ -692,9 +695,11 @@ static void decompress_metric(int offset, struct pmu_metric *pm) \tconst char *p = &big_c_string[offset]; """) for attr in _json_metric_attributes: - _args.output_file.write(f""" -\tpm->{attr} = (*p == '\\0' ? NULL : p); -""") + _args.output_file.write(f'\n\tpm->{attr} = ') + if attr in enum_attributes: + _args.output_file.write("(*p == '\\0' ? 0 : *p - '0');\n") + else: + _args.output_file.write("(*p == '\\0' ? NULL : p);\n") if attr == _json_metric_attributes[-1]: continue _args.output_file.write('\twhile (*p++);') diff --git a/tools/perf/pmu-events/pmu-events.h b/tools/perf/pmu-events/pmu-events.h index cee8b83792f8..7225efc4e4df 100644 --- a/tools/perf/pmu-events/pmu-events.h +++ b/tools/perf/pmu-events/pmu-events.h @@ -31,10 +31,10 @@ struct pmu_metric { const char *metric_expr; const char *unit; const char *compat; - const char *aggr_mode; const char *metric_constraint; const char *desc; const char *long_desc; + enum aggr_mode_class aggr_mode; }; struct pmu_events_table; From patchwork Sun Feb 19 09:28:02 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59093 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp771793wrn; Sun, 19 Feb 2023 01:31:21 -0800 (PST) X-Google-Smtp-Source: AK7set+9JogcYXXtydWoK+FBT0NJooLMykUBPA5aaqp5Q1Q85NKTZor3z3cjVF4oh91J+Hp26j3y X-Received: by 2002:a17:907:8886:b0:88d:ba89:1837 with SMTP id rp6-20020a170907888600b0088dba891837mr2543166ejc.8.1676799080992; Sun, 19 Feb 2023 01:31:20 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799080; cv=none; d=google.com; s=arc-20160816; b=H37uNx0EU1H9hN2hLQ4Bew+lbWW+5o0qhQ61Mnm/yLzeTG/m/iTI0W1gwKt6fdYxge 8N8mQ+JMeEsx97IpJ1wrIfXdM7J22Sdf0VkPwtbh+fQLhNbeFX2FwIs0bH/ihZgkpaZN 1W7k0D666XMDqYuejy75SSiTmGAGwJu7yQPy0J2sdW1RvzA2keoSCKw3e6jFePROZz9M GHiiyTBMCRq6aC+JfkdCew29Cen+SJjYOfIRZZO9/tE21d6Nj+JexmpE67er8AkgVuyw DTIoiNGLJpG+qcjw9SBCKwHeRmqD/hJZC9uUpvdVkGgaY2u0jwoh3MjwwandFtk+gbxU EgXw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=HpSaNFvC8VmLiMMzNqVi6YMEMj22ak1daXzeVt5eMAk=; b=W4bmOgsUOvVe9nYX1XhGWxdJ1tefSfS/hpSpRdCM/7DGpP0Xnjn3mG031yNv1omIzS BjhoDrZuwq1U7Lg+H6zpepELXEvgIAxrerWTLy3tPBs/aeNFu4i2xENZ50lo2YZLs5KD IsXDulmN3U2aEshm4Q78lY39SEO2YZEL2FHG67SS1wKLwbikf/mkWGw8nZzbjTJOkKc9 nulaJJ3TSyDne0w/X2ClwflxJy1yxTeNYETjkmRZQKHN/yL2uFKpqADqgIRQqlAsFMmk KHWAq9lg1yiW85tF2ZlWTpvT3QjHS54BQW5ptlAAbHD+JUfdGRVp5urUIyiG2xwyRMkY kixA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=HcBv1rkb; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id w19-20020a170906969300b008b9b135aef1si6072987ejx.997.2023.02.19.01.30.57; Sun, 19 Feb 2023 01:31:20 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=HcBv1rkb; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229818AbjBSJaF (ORCPT + 99 others); Sun, 19 Feb 2023 04:30:05 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52030 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229558AbjBSJaE (ORCPT ); Sun, 19 Feb 2023 04:30:04 -0500 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D182612583 for ; Sun, 19 Feb 2023 01:29:43 -0800 (PST) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-5369a87b661so2648387b3.11 for ; Sun, 19 Feb 2023 01:29:43 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=HpSaNFvC8VmLiMMzNqVi6YMEMj22ak1daXzeVt5eMAk=; b=HcBv1rkbvDldKyrDrO0aggLoq3rOU9HtvU+pzmjwa8Ux0B1KVOh4j8kILHgYFzodGs q7FHA8VF7wHWGnSmn6YVglkgBiheCph85mVXO2727cgB4U/R416NU+pIV9PVC1yP1IKT nsJ+wXh647jNXsjYXnz/rmI6T3y4ixe98L+h/LwtJ6uZDKOkqF145rV3liTbQfcNuvXw +wvjCDO3XWGrwUhOU6wNzR+0DSatllL4NCLdst2qS9VItUg1x4bigbuFwqNwRCqqlJaH SA2vUGQpGnDR+Dm03t5wFIxSQN7+d3nR0wTN/jsAXGHCxURshr7AqnpGX5wjEOu5SCEw Pa7Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=HpSaNFvC8VmLiMMzNqVi6YMEMj22ak1daXzeVt5eMAk=; b=awqkL4Ev57is+Js7I0RlYrMMP40YnZU8jjjZkbULq4V1Q8IxQxGUCbHVeIzHkd/Sjs T41h2e1wIm2mryKL0bhwu5UoQtjry0vwF26La8vXhnWWyhbOcFdzLCuOC3r9kWdyaSzG EdK+rGzAhdblsljgdptRws8MK1SiUXEugXIoLcU7cdiEnm94WZfmmILMnOl95/in/5KR aPC9S7He4bs4WNTSJT7+h4OhcnDIm7+IldTNaqZeOTU26inemxW2jx7KCNPnOyJYWQtG NyeJPkSPgWqyPqRBCbBqLUezaxQbkpdgHViN8peXFBsgnZGGswC86zjynRkGktIa2TEl AimQ== X-Gm-Message-State: AO0yUKWH75TzBAJ+/Cxny8tS9Wl0c96cSN4gAa/jxOPGJTGzRV5UHoaW i72mX1TyWfs4KkEZPXMTxf/3qBRSxM+f X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a05:6902:10c6:b0:97a:ebd:a594 with SMTP id w6-20020a05690210c600b0097a0ebda594mr389689ybu.3.1676798982991; Sun, 19 Feb 2023 01:29:42 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:02 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-6-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 05/51] perf pmu-events: Change deprecated to be a bool From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758251273261062635?= X-GMAIL-MSGID: =?utf-8?q?1758251273261062635?= Switch to a more natural bool rather than string encoding, where NULL implicitly meant false. Signed-off-by: Ian Rogers --- tools/perf/pmu-events/jevents.py | 2 +- tools/perf/pmu-events/pmu-events.h | 4 +++- tools/perf/tests/pmu-events.c | 4 ++-- tools/perf/util/pmu.c | 10 ++++------ 4 files changed, 10 insertions(+), 10 deletions(-) diff --git a/tools/perf/pmu-events/jevents.py b/tools/perf/pmu-events/jevents.py index 2b08d7c18f4b..35ca34eca74a 100755 --- a/tools/perf/pmu-events/jevents.py +++ b/tools/perf/pmu-events/jevents.py @@ -678,7 +678,7 @@ static void decompress_event(int offset, struct pmu_event *pe) { \tconst char *p = &big_c_string[offset]; """) - enum_attributes = ['aggr_mode'] + enum_attributes = ['aggr_mode', 'deprecated'] for attr in _json_event_attributes: _args.output_file.write(f'\n\tpe->{attr} = ') if attr in enum_attributes: diff --git a/tools/perf/pmu-events/pmu-events.h b/tools/perf/pmu-events/pmu-events.h index 7225efc4e4df..2434bc7cf92d 100644 --- a/tools/perf/pmu-events/pmu-events.h +++ b/tools/perf/pmu-events/pmu-events.h @@ -2,6 +2,8 @@ #ifndef PMU_EVENTS_H #define PMU_EVENTS_H +#include + struct perf_pmu; enum aggr_mode_class { @@ -22,7 +24,7 @@ struct pmu_event { const char *pmu; const char *unit; const char *perpkg; - const char *deprecated; + bool deprecated; }; struct pmu_metric { diff --git a/tools/perf/tests/pmu-events.c b/tools/perf/tests/pmu-events.c index 9b4c94ba5460..937804c84e29 100644 --- a/tools/perf/tests/pmu-events.c +++ b/tools/perf/tests/pmu-events.c @@ -331,8 +331,8 @@ static int compare_pmu_events(const struct pmu_event *e1, const struct pmu_event return -1; } - if (!is_same(e1->deprecated, e2->deprecated)) { - pr_debug2("testing event e1 %s: mismatched deprecated, %s vs %s\n", + if (e1->deprecated != e2->deprecated) { + pr_debug2("testing event e1 %s: mismatched deprecated, %d vs %d\n", e1->name, e1->deprecated, e2->deprecated); return -1; } diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c index c256b29defad..80644e25a568 100644 --- a/tools/perf/util/pmu.c +++ b/tools/perf/util/pmu.c @@ -331,14 +331,15 @@ static int __perf_pmu__new_alias(struct list_head *list, char *dir, char *name, int num; char newval[256]; char *long_desc = NULL, *topic = NULL, *unit = NULL, *perpkg = NULL, - *deprecated = NULL, *pmu_name = NULL; + *pmu_name = NULL; + bool deprecated = false; if (pe) { long_desc = (char *)pe->long_desc; topic = (char *)pe->topic; unit = (char *)pe->unit; perpkg = (char *)pe->perpkg; - deprecated = (char *)pe->deprecated; + deprecated = pe->deprecated; pmu_name = (char *)pe->pmu; } @@ -351,7 +352,7 @@ static int __perf_pmu__new_alias(struct list_head *list, char *dir, char *name, alias->unit[0] = '\0'; alias->per_pkg = false; alias->snapshot = false; - alias->deprecated = false; + alias->deprecated = deprecated; ret = parse_events_terms(&alias->terms, val); if (ret) { @@ -405,9 +406,6 @@ static int __perf_pmu__new_alias(struct list_head *list, char *dir, char *name, alias->str = strdup(newval); alias->pmu_name = pmu_name ? strdup(pmu_name) : NULL; - if (deprecated) - alias->deprecated = true; - if (!perf_pmu_merge_alias(alias, list)) list_add_tail(&alias->list, list); From patchwork Sun Feb 19 09:28:03 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59098 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp772178wrn; Sun, 19 Feb 2023 01:32:17 -0800 (PST) X-Google-Smtp-Source: AK7set+rAkAfw1XDpJkdrVwkCRVKlMNvN4gPJPPnysAW0Ie4VFXjqPjhKb1dZUVLOlPLtt+N32Kr X-Received: by 2002:a17:906:99d6:b0:8b1:d5c:94ec with SMTP id s22-20020a17090699d600b008b10d5c94ecmr5555747ejn.22.1676799137746; Sun, 19 Feb 2023 01:32:17 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799137; cv=none; d=google.com; s=arc-20160816; b=tQ5G3kcXV7ZOLF1eP7FQPm0OrkelTl8IR8LNoyNJrVtiJr2RdE7xq7T10/b6ECRuPt CDWQSung/VLlDqfvi7F5r2hiUiE19cmYJWnqKVDCAY+V5TvFcEdQtQ4148ajF9vGNjZv ZKslG26SspFfi4poH6FdhVVmRxOwycUt6wj3lLhxC5woEiP8pKMHrrGD5Z3JyWStHSBL /f6OcwvUGUA+DF3/58eX00a6wFHwhdWjUIqrocmVUTAwn5AkBlnIa83pAlOsVx+R1XJK MSnM1TZU2NufdbJwgjSNbADsV1zAUSRR81XPuLsCBs30F0Dq45B78rqmtUHAZxFzlGls Wejw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=ty9nxmBgJsUnSpKV2MHBIMxHaMhrZHfa8zZXWCP8Cbg=; b=vbbRSey0bA3OYSYBmdOY3kZzvBtm1SDRQJeqja8rQeE2dXUhV2HGPOoo4z3Gf0wHjD Uzd4WB9jWq2A/Jjsb60pvW0o002w0+0sTdUhAe9RaiuaGgSW9Hh/da2kNXltWyaeAqcD NgyT/cUKD8Nj+p1uD2jJHDVbm8XZ93H0yJhapfu1c1iCIbHoIAKRkT3eiENy4yXQQXl5 XreTbrm+t1YdNqfPxuNFSsxV1VhNgMIGbD1Lg6SkCQ6tPfOD9KLSrG3e9lHbpY8Wyqu0 ZZ0sxkm8JcyN/Q2emQeTl5ry6z9eAKkaAmR1eYqnVWxM5J/Q/Yugw3Ankzuk2h80vSPE RxzA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=fXPFn6oS; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id mu15-20020a1709068a8f00b008cc211391bdsi1622438ejc.820.2023.02.19.01.31.54; Sun, 19 Feb 2023 01:32:17 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=fXPFn6oS; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229867AbjBSJa3 (ORCPT + 99 others); Sun, 19 Feb 2023 04:30:29 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52290 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229837AbjBSJaY (ORCPT ); Sun, 19 Feb 2023 04:30:24 -0500 Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com [IPv6:2607:f8b0:4864:20::1149]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 60F6112045 for ; Sun, 19 Feb 2023 01:29:53 -0800 (PST) Received: by mail-yw1-x1149.google.com with SMTP id 00721157ae682-536885323c1so17938527b3.15 for ; Sun, 19 Feb 2023 01:29:53 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=ty9nxmBgJsUnSpKV2MHBIMxHaMhrZHfa8zZXWCP8Cbg=; b=fXPFn6oSwVykULAqbxpfhUL0GIEFqD2VYVuHvyjznRwBP0ndR6qs7rXslQtWXWUNDh /2XPYoU8PtIy6GaUZ111tSYonx7D26wZEZSKzGtpsztGeEjNKD6TZA1M4wYrnGZz9BuG GVN+QKkPziKsRuGsy1hU2SmnPxPPt+UDETmNsrK3VXcueClddLswkxO1Akd46lSS2YW6 uaJoN9BzoCcsr3DJPtkDs+01a1IBH3ojOB4udgVPPUg7gGlIxu2bwhMSr/KvdeFjGumO BX/oZoE1NUz/N56t7evkLcNhnUn+6CYqcKAZlVrXPxXwzfkxuyJFXxTp6kiBL5aj5LHv Y3WQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=ty9nxmBgJsUnSpKV2MHBIMxHaMhrZHfa8zZXWCP8Cbg=; b=MdZGWXh5AdJn3XaJNJcJqerSfsvbKsvVqbJcW6YVAUhFmrS56KoJvvwwDFF51P56Cb Xd05KwxG0Dt64Q3rEu3aNdkIzwkdDhuIsiujCQoV2ZlIpvm0RHUcWmJJTU3CWxBtSfRQ l6Pj0u3kd4ORojuetsfGMFTdN4/60GWvLA0ajGiHnc1gxVl9E26c0AKm7x8jE31XZkE9 2YjKeRhuxuBcQrglGyJ3JugTWwOO0HBsOF8haNVtqlGfeybny8ISHft5WT1hZkCQeCbm /DeO91A24nGoc0B1u8T3G8+Y7ECgnRPuCJkaZoZ+gr+SLr0Dmk93nIVXiX7IxZWdvzgV fAeQ== X-Gm-Message-State: AO0yUKWfz3zrSze38HJw/JySXdadamm5lKPooSxGpSdSRUxlSExeb9O8 2X3jLfOCAUjJHpXReK0ul89aptVtb3X0 X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a05:690c:f01:b0:52e:ac97:115f with SMTP id dc1-20020a05690c0f0100b0052eac97115fmr228775ywb.5.1676798992575; Sun, 19 Feb 2023 01:29:52 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:03 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-7-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 06/51] perf pmu-events: Change perpkg to be a bool From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758251332464890469?= X-GMAIL-MSGID: =?utf-8?q?1758251332464890469?= Switch to a more natural bool rather than string encoding, where NULL implicitly meant false. The only value of 'PerPkg' in the event json is '1'. Signed-off-by: Ian Rogers --- tools/perf/pmu-events/jevents.py | 2 +- tools/perf/pmu-events/pmu-events.h | 2 +- tools/perf/tests/pmu-events.c | 4 ++-- tools/perf/util/pmu.c | 11 ++++------- 4 files changed, 8 insertions(+), 11 deletions(-) diff --git a/tools/perf/pmu-events/jevents.py b/tools/perf/pmu-events/jevents.py index 35ca34eca74a..2da55408398f 100755 --- a/tools/perf/pmu-events/jevents.py +++ b/tools/perf/pmu-events/jevents.py @@ -678,7 +678,7 @@ static void decompress_event(int offset, struct pmu_event *pe) { \tconst char *p = &big_c_string[offset]; """) - enum_attributes = ['aggr_mode', 'deprecated'] + enum_attributes = ['aggr_mode', 'deprecated', 'perpkg'] for attr in _json_event_attributes: _args.output_file.write(f'\n\tpe->{attr} = ') if attr in enum_attributes: diff --git a/tools/perf/pmu-events/pmu-events.h b/tools/perf/pmu-events/pmu-events.h index 2434bc7cf92d..4d236bb32fd3 100644 --- a/tools/perf/pmu-events/pmu-events.h +++ b/tools/perf/pmu-events/pmu-events.h @@ -23,7 +23,7 @@ struct pmu_event { const char *long_desc; const char *pmu; const char *unit; - const char *perpkg; + bool perpkg; bool deprecated; }; diff --git a/tools/perf/tests/pmu-events.c b/tools/perf/tests/pmu-events.c index 937804c84e29..521557c396bc 100644 --- a/tools/perf/tests/pmu-events.c +++ b/tools/perf/tests/pmu-events.c @@ -325,8 +325,8 @@ static int compare_pmu_events(const struct pmu_event *e1, const struct pmu_event return -1; } - if (!is_same(e1->perpkg, e2->perpkg)) { - pr_debug2("testing event e1 %s: mismatched perpkg, %s vs %s\n", + if (e1->perpkg != e2->perpkg) { + pr_debug2("testing event e1 %s: mismatched perpkg, %d vs %d\n", e1->name, e1->perpkg, e2->perpkg); return -1; } diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c index 80644e25a568..43b6182d96b7 100644 --- a/tools/perf/util/pmu.c +++ b/tools/perf/util/pmu.c @@ -328,17 +328,15 @@ static int __perf_pmu__new_alias(struct list_head *list, char *dir, char *name, struct parse_events_term *term; struct perf_pmu_alias *alias; int ret; - int num; char newval[256]; - char *long_desc = NULL, *topic = NULL, *unit = NULL, *perpkg = NULL, - *pmu_name = NULL; - bool deprecated = false; + char *long_desc = NULL, *topic = NULL, *unit = NULL, *pmu_name = NULL; + bool deprecated = false, perpkg = false; if (pe) { long_desc = (char *)pe->long_desc; topic = (char *)pe->topic; unit = (char *)pe->unit; - perpkg = (char *)pe->perpkg; + perpkg = pe->perpkg; deprecated = pe->deprecated; pmu_name = (char *)pe->pmu; } @@ -350,7 +348,7 @@ static int __perf_pmu__new_alias(struct list_head *list, char *dir, char *name, INIT_LIST_HEAD(&alias->terms); alias->scale = 1.0; alias->unit[0] = '\0'; - alias->per_pkg = false; + alias->per_pkg = perpkg; alias->snapshot = false; alias->deprecated = deprecated; @@ -402,7 +400,6 @@ static int __perf_pmu__new_alias(struct list_head *list, char *dir, char *name, return -1; snprintf(alias->unit, sizeof(alias->unit), "%s", unit); } - alias->per_pkg = perpkg && sscanf(perpkg, "%d", &num) == 1 && num == 1; alias->str = strdup(newval); alias->pmu_name = pmu_name ? strdup(pmu_name) : NULL; From patchwork Sun Feb 19 09:28:04 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59094 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp771964wrn; Sun, 19 Feb 2023 01:31:48 -0800 (PST) X-Google-Smtp-Source: AK7set/IXLwSlQtxzn2Ly676npxA9aioYjNvG5oxw+NiNOaswSsanRSkZoD6yEz3c5j3DC80AirT X-Received: by 2002:aa7:dbd2:0:b0:4ac:d2cd:81c7 with SMTP id v18-20020aa7dbd2000000b004acd2cd81c7mr731740edt.5.1676799108748; Sun, 19 Feb 2023 01:31:48 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799108; cv=none; d=google.com; s=arc-20160816; b=F8QlnJEsg3914V5qJlcFUav1+toIdvjKFC9iydIgEL7RHpfUCZIB9pYfGYow2BA0Sw 9/VP5a+XHo6tV6JlGryg/9S8lSxvpMz0ApwhKGSqvFl/ToOVir0nKkW+zCJTUTpdnkY7 ClHX9eR6eNgN+BVGWnh9RGdYbFyWy0byIHbluX52zbNpVCziiBppFJJXyk3y4vF16w6A 1CODyjQGasH6zXOWBGLnK3cMMKebPISvVy3gmKdxIhbLJ4OmdzOVt2DINX1M+KbTuv8S ol7UfwCFniq1Z6JUqK3gbqLXTw5hJOcwZDysrMbM1hpqz5K6FhhnKVZ/rp8BxjRhe/Ma eLzg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=McpPUh9UUbr+SfZnc3Yy3m0883o69CMJ9U4yySTvUIw=; b=WpKrwcVsFUutz3GYbxW6gAdkmNUytyqDix7P1RnIIUDnFfUmse6GYsaz49Ws6wxKnS 3aXZNkS19KC37QMRZh4iWqob7NikUCKq+h/fo7YRJElIYiLvBxdxkopP0BP47Q97C4yt MMhZGpiEFmEdGFGZT8VKU8maIG5OJWQuCJlQi7EZbAUWembNhrgrenWR8ddEgF52s658 9thlRyKC1PLd2U9t9kOQ+XauQuiPrgm3COqr7kRAj5i0UO0WcWjO6C6J5eVzoTaSAhAC ZQT5oh/2soFJIXkey63o+8dD+oYd9qcPbQah9p+FaXOlKYlbQ01lqJs7+5f/tTyb8JX4 Jaxw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=nr+0lt5V; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id y4-20020aa7c244000000b004ad033944a6si11432093edo.606.2023.02.19.01.31.24; Sun, 19 Feb 2023 01:31:48 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=nr+0lt5V; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229558AbjBSJac (ORCPT + 99 others); Sun, 19 Feb 2023 04:30:32 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52434 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229840AbjBSJa2 (ORCPT ); Sun, 19 Feb 2023 04:30:28 -0500 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 98C7CEB75 for ; Sun, 19 Feb 2023 01:30:00 -0800 (PST) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-53687f6de13so18395537b3.9 for ; Sun, 19 Feb 2023 01:30:00 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=McpPUh9UUbr+SfZnc3Yy3m0883o69CMJ9U4yySTvUIw=; b=nr+0lt5VymGETH/u61CNycPQbXKy2g0gRYNfkLIoR9cI2Xz1zqqpvRQB4b+fwUG7ST vok5Jj6oAiPiqTpK2qYsd0OSEZ1HtpmVkbitSOvgfeo2SX11R5cqjvdJj+7o4l3pP2Cd M9yIKk4dJBKlIyffdAsE5VM7TQMCqt40bg09sqWDvbX5oSWMR+QbG+cvSpUeLzk5gQ5p UBn4OhTfgAaRWfGzVdnvzS/ChmcUbPyAEprZQDL9beqiWSFPi01G1bQDim0d6bj9rTn3 IfeKuaPWVMI8DkuGAhsnoD79dvPT0XMD6WCXnyrM/WWfO+5k2nksBylj3NgndWilb/m6 cCOA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=McpPUh9UUbr+SfZnc3Yy3m0883o69CMJ9U4yySTvUIw=; b=QrO8Nksi1/T5Aih5lWlcF5dMv4gP4Vndngi2MGEw3rYbRj/R0EhmK31K4hQ2Gv1NkS 7Nsxg4SStuPfefUDcPaGbYs1DzMJtmgEY/omURFbXfwC1CVqgQTcNd2sqJp640jF1vZ1 qAampn3PzYKWYhPlpnie9IXgbdnsRR1UBOTZCK4sYUA5WrelRkA1m3ljuXJs1ARQyvnj r/XCiZYG2tHixMb9YvxhHmS7yBCWdLmWiVJHCB235I5qSaDfF2O6HhrIHwSB7DNmZbOL 3RnRQneiOPnSvbHtAdMPWAkc895i/Z85txJungU0plXBkr0YqC98j3eb8MUdFfHjYe7g 38VA== X-Gm-Message-State: AO0yUKW400IArChcxvPY3oUNnWT/z/E+TE+fDP65EysIbVGCGJGrS+fz P/LypTDoOYbQVvvyXfCZnFpRk04Zjh5A X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a05:6902:45:b0:995:ccb:1aae with SMTP id m5-20020a056902004500b009950ccb1aaemr109787ybh.13.1676799000276; Sun, 19 Feb 2023 01:30:00 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:04 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-8-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 07/51] perf expr: Make the online topology accessible globally From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758251302333366733?= X-GMAIL-MSGID: =?utf-8?q?1758251302333366733?= Knowing the topology of online CPUs is useful for more than just expr literals. Move to a global function that caches the value. An additional upside is that this may also avoid computing the CPU topology in some situations. Signed-off-by: Ian Rogers --- tools/perf/tests/expr.c | 7 ++----- tools/perf/util/cputopo.c | 14 ++++++++++++++ tools/perf/util/cputopo.h | 5 +++++ tools/perf/util/expr.c | 16 ++++++---------- tools/perf/util/smt.c | 11 +++++------ tools/perf/util/smt.h | 12 ++++++------ 6 files changed, 38 insertions(+), 27 deletions(-) diff --git a/tools/perf/tests/expr.c b/tools/perf/tests/expr.c index a9eb1ed6bd63..cbf0e0c74906 100644 --- a/tools/perf/tests/expr.c +++ b/tools/perf/tests/expr.c @@ -154,13 +154,10 @@ static int test__expr(struct test_suite *t __maybe_unused, int subtest __maybe_u /* Only EVENT1 or EVENT2 need be measured depending on the value of smt_on. */ { - struct cpu_topology *topology = cpu_topology__new(); - bool smton = smt_on(topology); + bool smton = smt_on(); bool corewide = core_wide(/*system_wide=*/false, - /*user_requested_cpus=*/false, - topology); + /*user_requested_cpus=*/false); - cpu_topology__delete(topology); expr__ctx_clear(ctx); TEST_ASSERT_VAL("find ids", expr__find_ids("EVENT1 if #smt_on else EVENT2", diff --git a/tools/perf/util/cputopo.c b/tools/perf/util/cputopo.c index e08797c3cdbc..ca1d833a0c26 100644 --- a/tools/perf/util/cputopo.c +++ b/tools/perf/util/cputopo.c @@ -238,6 +238,20 @@ static bool has_die_topology(void) return true; } +const struct cpu_topology *online_topology(void) +{ + static const struct cpu_topology *topology; + + if (!topology) { + topology = cpu_topology__new(); + if (!topology) { + pr_err("Error creating CPU topology"); + abort(); + } + } + return topology; +} + struct cpu_topology *cpu_topology__new(void) { struct cpu_topology *tp = NULL; diff --git a/tools/perf/util/cputopo.h b/tools/perf/util/cputopo.h index 969e5920a00e..8d42f6102954 100644 --- a/tools/perf/util/cputopo.h +++ b/tools/perf/util/cputopo.h @@ -56,6 +56,11 @@ struct hybrid_topology { struct hybrid_topology_node nodes[]; }; +/* + * The topology for online CPUs, lazily created. + */ +const struct cpu_topology *online_topology(void); + struct cpu_topology *cpu_topology__new(void); void cpu_topology__delete(struct cpu_topology *tp); /* Determine from the core list whether SMT was enabled. */ diff --git a/tools/perf/util/expr.c b/tools/perf/util/expr.c index c1da20b868db..d46a1878bc9e 100644 --- a/tools/perf/util/expr.c +++ b/tools/perf/util/expr.c @@ -402,7 +402,7 @@ double arch_get_tsc_freq(void) double expr__get_literal(const char *literal, const struct expr_scanner_ctx *ctx) { - static struct cpu_topology *topology; + const struct cpu_topology *topology; double result = NAN; if (!strcmp("#num_cpus", literal)) { @@ -421,31 +421,27 @@ double expr__get_literal(const char *literal, const struct expr_scanner_ctx *ctx * these strings gives an indication of the number of packages, dies, * etc. */ - if (!topology) { - topology = cpu_topology__new(); - if (!topology) { - pr_err("Error creating CPU topology"); - goto out; - } - } if (!strcasecmp("#smt_on", literal)) { - result = smt_on(topology) ? 1.0 : 0.0; + result = smt_on() ? 1.0 : 0.0; goto out; } if (!strcmp("#core_wide", literal)) { - result = core_wide(ctx->system_wide, ctx->user_requested_cpu_list, topology) + result = core_wide(ctx->system_wide, ctx->user_requested_cpu_list) ? 1.0 : 0.0; goto out; } if (!strcmp("#num_packages", literal)) { + topology = online_topology(); result = topology->package_cpus_lists; goto out; } if (!strcmp("#num_dies", literal)) { + topology = online_topology(); result = topology->die_cpus_lists; goto out; } if (!strcmp("#num_cores", literal)) { + topology = online_topology(); result = topology->core_cpus_lists; goto out; } diff --git a/tools/perf/util/smt.c b/tools/perf/util/smt.c index 994e9e418227..650e804d0adc 100644 --- a/tools/perf/util/smt.c +++ b/tools/perf/util/smt.c @@ -4,7 +4,7 @@ #include "cputopo.h" #include "smt.h" -bool smt_on(const struct cpu_topology *topology) +bool smt_on(void) { static bool cached; static bool cached_result; @@ -16,22 +16,21 @@ bool smt_on(const struct cpu_topology *topology) if (sysfs__read_int("devices/system/cpu/smt/active", &fs_value) >= 0) cached_result = (fs_value == 1); else - cached_result = cpu_topology__smt_on(topology); + cached_result = cpu_topology__smt_on(online_topology()); cached = true; return cached_result; } -bool core_wide(bool system_wide, const char *user_requested_cpu_list, - const struct cpu_topology *topology) +bool core_wide(bool system_wide, const char *user_requested_cpu_list) { /* If not everything running on a core is being recorded then we can't use core_wide. */ if (!system_wide) return false; /* Cheap case that SMT is disabled and therefore we're inherently core_wide. */ - if (!smt_on(topology)) + if (!smt_on()) return true; - return cpu_topology__core_wide(topology, user_requested_cpu_list); + return cpu_topology__core_wide(online_topology(), user_requested_cpu_list); } diff --git a/tools/perf/util/smt.h b/tools/perf/util/smt.h index ae9095f2c38c..01441fd2c0a2 100644 --- a/tools/perf/util/smt.h +++ b/tools/perf/util/smt.h @@ -2,16 +2,16 @@ #ifndef __SMT_H #define __SMT_H 1 -struct cpu_topology; - -/* Returns true if SMT (aka hyperthreading) is enabled. */ -bool smt_on(const struct cpu_topology *topology); +/* + * Returns true if SMT (aka hyperthreading) is enabled. Determined via sysfs or + * the online topology. + */ +bool smt_on(void); /* * Returns true when system wide and all SMT threads for a core are in the * user_requested_cpus map. */ -bool core_wide(bool system_wide, const char *user_requested_cpu_list, - const struct cpu_topology *topology); +bool core_wide(bool system_wide, const char *user_requested_cpu_list); #endif /* __SMT_H */ From patchwork Sun Feb 19 09:28:05 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59099 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp772195wrn; Sun, 19 Feb 2023 01:32:20 -0800 (PST) X-Google-Smtp-Source: AK7set894khXznC2uOJRZkUW+ToMY4BMTIp8Ue1sg9+Rqa9jrIEyuL4lv1vk2DBVWYxqr/BILXr7 X-Received: by 2002:a17:907:c10:b0:87b:da74:d272 with SMTP id ga16-20020a1709070c1000b0087bda74d272mr5811699ejc.45.1676799139908; Sun, 19 Feb 2023 01:32:19 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799139; cv=none; d=google.com; s=arc-20160816; b=RTmJzgxD/b50uTJALY14hSzob4ZNuWyYt8B3Ab1beE0ktD15iEFVfYIc5VetadnwAK LK4GQQNveex2axxDnfIKbDG2Z+5ufuXDouKoGi3e7I0usR24sMRdWe+JIyqT6D5Yq+TM bWvJeFg5EYskmMGsyW8V+MnlAxXxpB2r/5xmRLfeG7xCLCHAuXqThucuLe5qFAmSs0KH 9FC+A2MXZK/V1f/Wd8Ck71GCMvfzDILgJ/58c/gsbUwiCFBOOtkLeR3/tu6S4gG3qm0T 6WFyUKrlz2pILUwJRnprc7oSnpAYpBP2zgu/Sf9Yis9oIx9ce6RRycKpRrHkANy3oZJE H7KA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=fuS1q84R0xcAiENu72F9xaB/dldCNXgOZrtbWFpv52E=; b=F2LPrTdzNFqf0AaA4JMT7FZ9skperHngiMnqcD7Qedi/K8+7HsTRjfBCwMC9giSA8k UpfLTqqDRxYeVH/O6li91aySiUZcQWyRH/2wiEj8C1SobyUemmkeEcEzwX0yyJETmpPV 8/H6tGvee2gbpjufLD5iR7FLEWGepXif0Bnhs6avVOk/WJUmEU7E3otYwokW2TD/NrZ1 YmtB1On92XsGDWRDOub8CdFHfudITCqbT8P4ZaKeCqdNJPTH3gHy4nVUBYCM5TomYYsQ MBASy6yGZ/4jAQ61W/U29727BDQ9Qojzr68uWmpVW4hyAhnB/Lyj2s1FpQ7G/vJUAbKV rBeA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=cI8dh0bW; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id mx10-20020a1709065a0a00b008c03ffa5ec8si3708529ejc.689.2023.02.19.01.31.57; Sun, 19 Feb 2023 01:32:19 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=cI8dh0bW; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229847AbjBSJap (ORCPT + 99 others); Sun, 19 Feb 2023 04:30:45 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52670 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229881AbjBSJah (ORCPT ); Sun, 19 Feb 2023 04:30:37 -0500 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CD3EC1207F for ; Sun, 19 Feb 2023 01:30:10 -0800 (PST) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-53686d8ce27so20518817b3.5 for ; Sun, 19 Feb 2023 01:30:10 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=fuS1q84R0xcAiENu72F9xaB/dldCNXgOZrtbWFpv52E=; b=cI8dh0bWWKZeW1y5k5GPM8AzvnkbIA5dJQUUoDzOXJvWxVyShYBYx60369eQUy4OzL E6SVtCAQM+IaKz1bV0Rn3Uuqp+jyypxxArizXArFcTzgMndVSzmzut43pU+fIivj5gSC XKTj5fMpGYRtuDMkea81itweVHTeFMS37/+yA2iSPh75iDbyZ+kgaX2m5UodziZr1WHu sKBH5KHPK9zik8QbzJ6V3vRrwymvXNniazfNki04nQD+n1XXTKZW1LPzmzv18NmS4/2T 4FkiuO6YzbUl/HqF+S/0btFco6XOhO2Bs0WDcFHWpQ5VE+9asDN4xuYeYZaRTGDc6yIO VGLw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=fuS1q84R0xcAiENu72F9xaB/dldCNXgOZrtbWFpv52E=; b=u8q07jSD/7KZrJgC/nzMHUEVKwihYCZHeJo97FXilnoLuAduI1fVB8hdBDIMjj82go jiMrx+PqNFKBiUisKUaRvcqmujq8rbMDwW4dlNem+eHw4WDFTGZkLhkYMNDmjT3t+DKb ZS2h7k0mc1Cokx0E0Twis2Fpjy13zeLECKVYNsdKeGMRaByVadjr/f186pfL2fJPB9O2 Lp/7TLZ7Ny3EqNestCtvZrJT2ttxsFCQji9DPIQyK+iqben/YmgysFiXjmmTWCbZbqyy AAAF8m9ZLOKZVCSRDLY90t0gA7tV6hDaGY+PSwtqUDVTmfbNIq7hsFNVRdb10EEVnvvA OGww== X-Gm-Message-State: AO0yUKXVfXURObbT0XdN52dbkQ+hujW3qF4fmJClonvcy4QFkBgr/CST /GmnGqEjs3yfmeogeXDMuzLAQU0EoCiT X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a0d:dc85:0:b0:530:a183:aa0 with SMTP id f127-20020a0ddc85000000b00530a1830aa0mr1619527ywe.384.1676799008532; Sun, 19 Feb 2023 01:30:08 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:05 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-9-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 08/51] perf pmu-events: Make the metric_constraint an enum From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758251334936959395?= X-GMAIL-MSGID: =?utf-8?q?1758251334936959395?= Rename metric_constraint to event_grouping to better explain what the variable is used for. Switch to use an enum for encoding instead of a string. Rather than just no constraint/grouping information or "NO_NMI_WATCHDOG", have 4 enum values. The values encode whether to group or not, and two cases where the behavior is dependent on either the NMI watchdog being enabled or SMT being enabled. Signed-off-by: Ian Rogers --- tools/perf/pmu-events/jevents.py | 20 ++++++++++++++++---- tools/perf/pmu-events/pmu-events.h | 25 ++++++++++++++++++++++++- tools/perf/util/metricgroup.c | 19 ++++++++++++------- 3 files changed, 52 insertions(+), 12 deletions(-) diff --git a/tools/perf/pmu-events/jevents.py b/tools/perf/pmu-events/jevents.py index 2da55408398f..dc0c56dccb5e 100755 --- a/tools/perf/pmu-events/jevents.py +++ b/tools/perf/pmu-events/jevents.py @@ -51,8 +51,8 @@ _json_event_attributes = [ # Attributes that are in pmu_metric rather than pmu_event. _json_metric_attributes = [ - 'metric_name', 'metric_group', 'metric_constraint', 'metric_expr', 'desc', - 'long_desc', 'unit', 'compat', 'aggr_mode' + 'metric_name', 'metric_group', 'metric_expr', 'desc', + 'long_desc', 'unit', 'compat', 'aggr_mode', 'event_grouping' ] def removesuffix(s: str, suffix: str) -> str: @@ -204,6 +204,18 @@ class JsonEvent: } return aggr_mode_to_enum[aggr_mode] + def convert_metric_constraint(metric_constraint: str) -> Optional[str]: + """Returns the metric_event_groups enum value associated with the JSON string.""" + if not metric_constraint: + return None + metric_constraint_to_enum = { + 'NO_GROUP_EVENTS': '1', + 'NO_GROUP_EVENTS_NMI': '2', + 'NO_NMI_WATCHDOG': '2', + 'NO_GROUP_EVENTS_SMT': '3', + } + return metric_constraint_to_enum[metric_constraint] + def lookup_msr(num: str) -> Optional[str]: """Converts the msr number, or first in a list to the appropriate event field.""" if not num: @@ -288,7 +300,7 @@ class JsonEvent: self.deprecated = jd.get('Deprecated') self.metric_name = jd.get('MetricName') self.metric_group = jd.get('MetricGroup') - self.metric_constraint = jd.get('MetricConstraint') + self.event_grouping = convert_metric_constraint(jd.get('MetricConstraint')) self.metric_expr = None if 'MetricExpr' in jd: self.metric_expr = metric.ParsePerfJson(jd['MetricExpr']).Simplify() @@ -678,7 +690,7 @@ static void decompress_event(int offset, struct pmu_event *pe) { \tconst char *p = &big_c_string[offset]; """) - enum_attributes = ['aggr_mode', 'deprecated', 'perpkg'] + enum_attributes = ['aggr_mode', 'deprecated', 'event_grouping', 'perpkg'] for attr in _json_event_attributes: _args.output_file.write(f'\n\tpe->{attr} = ') if attr in enum_attributes: diff --git a/tools/perf/pmu-events/pmu-events.h b/tools/perf/pmu-events/pmu-events.h index 4d236bb32fd3..57a38e3e5c32 100644 --- a/tools/perf/pmu-events/pmu-events.h +++ b/tools/perf/pmu-events/pmu-events.h @@ -11,6 +11,29 @@ enum aggr_mode_class { PerCore }; +/** + * enum metric_event_groups - How events within a pmu_metric should be grouped. + */ +enum metric_event_groups { + /** + * @MetricGroupEvents: Default, group events within the metric. + */ + MetricGroupEvents = 0, + /** + * @MetricNoGroupEvents: Don't group events for the metric. + */ + MetricNoGroupEvents = 1, + /** + * @MetricNoGroupEventsNmi: Don't group events for the metric if the NMI + * watchdog is enabled. + */ + MetricNoGroupEventsNmi = 2, + /** + * @MetricNoGroupEventsSmt: Don't group events for the metric if SMT is + * enabled. + */ + MetricNoGroupEventsSmt = 3, +}; /* * Describe each PMU event. Each CPU has a table of PMU events. */ @@ -33,10 +56,10 @@ struct pmu_metric { const char *metric_expr; const char *unit; const char *compat; - const char *metric_constraint; const char *desc; const char *long_desc; enum aggr_mode_class aggr_mode; + enum metric_event_groups event_grouping; }; struct pmu_events_table; diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c index b2aa6e049804..868fc9c35606 100644 --- a/tools/perf/util/metricgroup.c +++ b/tools/perf/util/metricgroup.c @@ -13,6 +13,7 @@ #include "pmu.h" #include "pmu-hybrid.h" #include "print-events.h" +#include "smt.h" #include "expr.h" #include "rblist.h" #include @@ -168,16 +169,20 @@ static void metric__watchdog_constraint_hint(const char *name, bool foot) static bool metric__group_events(const struct pmu_metric *pm) { - if (!pm->metric_constraint) - return true; - - if (!strcmp(pm->metric_constraint, "NO_NMI_WATCHDOG") && - sysctl__nmi_watchdog_enabled()) { + switch (pm->event_grouping) { + case MetricNoGroupEvents: + return false; + case MetricNoGroupEventsNmi: + if (!sysctl__nmi_watchdog_enabled()) + return true; metric__watchdog_constraint_hint(pm->metric_name, /*foot=*/false); return false; + case MetricNoGroupEventsSmt: + return !smt_on(); + case MetricGroupEvents: + default: + return true; } - - return true; } static void metric__free(struct metric *m) From patchwork Sun Feb 19 09:28:06 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59097 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp772166wrn; Sun, 19 Feb 2023 01:32:16 -0800 (PST) X-Google-Smtp-Source: AK7set+RGTTiObk92A3SdEcdO3ja7QXYwynOo+DeoIFBgVsL/xMf0/GiqQ4YgvD2EQ9gz0eCYYuT X-Received: by 2002:a17:906:9f19:b0:879:ec1a:4ac with SMTP id fy25-20020a1709069f1900b00879ec1a04acmr7152486ejc.76.1676799136358; Sun, 19 Feb 2023 01:32:16 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799136; cv=none; d=google.com; s=arc-20160816; b=DkdnMTtQpVtMJUW3mslnh9yzbzyBmBmdRpGor7NZOVm5o5rGE93ZnOMmOkTcX0Cl0m hJpbo+g7lsnIWfQEZzsjsqwgcGhBzS5O2z5qxs+w7JSLpW77q/GDBoRGTqqmiil/pyVe 4JbxtX9x5Rh/hEv86aUhItFU24B2SfP3jsLsvQCgo27vryYmZdO/WUUS4JhfCMUtkgaY lKYMwq5XLrKsiNSK7Q0i0Ao87kHnukz0Ii/hHaWb6RToOIawj9rjB3PGUx/Bc/rS3tg+ v3GsSUSFKZ5Zqi0alkIGPwzcewC2LWuI1mMh3AEJk3GyS1Os8f6VtCL5rC5Kg84xMYKK x9AA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=ZekM3DFhsGli0TNli/C0IYKvA9f7AyUZpwfMde3iUAk=; b=e2n+hpKvJz9GH6cF9Yxc1KsjLn/XLe1SCSgMdeRKcH80NJe3H7d7uS1txb00WfQILg j8xB2HW/jpTmEzzqss0f5LwOIpupMlXDV46igD1CtR9Y2Nwpx4qMLSR3TPvWtOnZ/V+n smtrnmYybOrTiNnYyXTYLDEYpczf7ul+zIH3ve7t1VnADvgbtpIThX1MADodqiaOG0K3 GrymbF2mf7UitRP7rK/S4437NoxYsD3iGxg+seiKLletNbo0DmbU0X0FbN8upGg/uwpz axt+vpnjH0p2LqDNeOsgp1yHQGwSAhIEe77OFANAaBhrwbFNm4iDmTKsmI6JGyTF8LvV YbXA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=njwS0zdv; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id i3-20020aa7c703000000b004acc75b447bsi11481741edq.273.2023.02.19.01.31.53; Sun, 19 Feb 2023 01:32:16 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=njwS0zdv; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229846AbjBSJax (ORCPT + 99 others); Sun, 19 Feb 2023 04:30:53 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52786 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229873AbjBSJar (ORCPT ); Sun, 19 Feb 2023 04:30:47 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 95775DBF9 for ; Sun, 19 Feb 2023 01:30:19 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id o14-20020a25730e000000b009419f64f6afso2219572ybc.2 for ; Sun, 19 Feb 2023 01:30:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=ZekM3DFhsGli0TNli/C0IYKvA9f7AyUZpwfMde3iUAk=; b=njwS0zdvPBr2f5Gwgx+MnFdMKJsacyJ3/qym2jLfPEHxAMivAGs+rlWWdk+Du+BWeG roLMmds7cxYanNraNuE8rI1XscoaMpSnYds3csTWiUZXGSDGsb+iX6X/mzHsk16/KvNg JIRRQggidvMZ9LijU87B8gNDY3B0npD5/KmjFMTlcvpZlx5GcIoFBSm4uQcG+OTZrVCB ryA5rcmtM4YDU3yMrGfDKb+E+nYgR15LIJmRsygIIdD202+A6TIyuwJoFUQHcUClbq2W k75D//NmRE3BNcT8UUVZbhzSfbsObFP7TGXybtoz712NdmByjrR0HfRP9//XsBj4ATZb B3Qg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=ZekM3DFhsGli0TNli/C0IYKvA9f7AyUZpwfMde3iUAk=; b=fkIFPKKN1h2UsXuPX7h36j/deB/HdGMeN0ruy33kiSvEwH2LQv3GWCkfSbOwJavdYs JoHNi8WvkDYxmQNpaDwSA0n5RmBseG5fmqJZCuE+8r+c2/XVK+kJFtMaiMMxaUNqsofZ 5MsC/tiqIS3j+zkN1ZB9wdLNRetgyr5NlEG2OGvnNQXAnjWZg71+1lE9vXmFG62ANZAi d9pfzOR0W0eMqVJAhoAu7wi2mmL5kwS4FzBXePV1z2TrQTpem6rT/56aU6bgT841gtYg J6yiaz6zUOahtWYA7SHWq5Gvz1fbu0P4FZ8BFXNviUpeJmcERHMJRUck9JXHfO/6h5LQ pZdA== X-Gm-Message-State: AO0yUKUD9R4054MHU0ys3kU72ZqsCYWq3y3d/6hwLrNc0C/CP7zA/qq8 2MhvjunU6Xo6f7w1ZeYczlMDgMDZi5PZ X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a81:8482:0:b0:533:a071:d7f8 with SMTP id u124-20020a818482000000b00533a071d7f8mr449175ywf.547.1676799016916; Sun, 19 Feb 2023 01:30:16 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:06 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-10-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 09/51] perf pmu-events: Don't '\0' terminate enum values From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758251331334271935?= X-GMAIL-MSGID: =?utf-8?q?1758251331334271935?= Encoding enums like '1\0' wastes a byte and could be '1' (no '\0' terminator) if the 0 case is '0', it also removes a branch for decompressing. Signed-off-by: Ian Rogers --- tools/perf/pmu-events/jevents.py | 26 ++++++++++++++++++-------- 1 file changed, 18 insertions(+), 8 deletions(-) diff --git a/tools/perf/pmu-events/jevents.py b/tools/perf/pmu-events/jevents.py index dc0c56dccb5e..e82dff3a1228 100755 --- a/tools/perf/pmu-events/jevents.py +++ b/tools/perf/pmu-events/jevents.py @@ -54,6 +54,8 @@ _json_metric_attributes = [ 'metric_name', 'metric_group', 'metric_expr', 'desc', 'long_desc', 'unit', 'compat', 'aggr_mode', 'event_grouping' ] +# Attributes that are bools or enum int values, encoded as '0', '1',... +_json_enum_attributes = ['aggr_mode', 'deprecated', 'event_grouping', 'perpkg'] def removesuffix(s: str, suffix: str) -> str: """Remove the suffix from a string @@ -360,7 +362,10 @@ class JsonEvent: # Convert parsed metric expressions into a string. Slashes # must be doubled in the file. x = x.ToPerfJson().replace('\\', '\\\\') - s += f'{x}\\000' if x else '\\000' + if attr in _json_enum_attributes: + s += x if x else '0' + else: + s += f'{x}\\000' if x else '\\000' return s def to_c_string(self, metric: bool) -> str: @@ -690,16 +695,18 @@ static void decompress_event(int offset, struct pmu_event *pe) { \tconst char *p = &big_c_string[offset]; """) - enum_attributes = ['aggr_mode', 'deprecated', 'event_grouping', 'perpkg'] for attr in _json_event_attributes: _args.output_file.write(f'\n\tpe->{attr} = ') - if attr in enum_attributes: - _args.output_file.write("(*p == '\\0' ? 0 : *p - '0');\n") + if attr in _json_enum_attributes: + _args.output_file.write("*p - '0';\n") else: _args.output_file.write("(*p == '\\0' ? NULL : p);\n") if attr == _json_event_attributes[-1]: continue - _args.output_file.write('\twhile (*p++);') + if attr in _json_enum_attributes: + _args.output_file.write('\tp++;') + else: + _args.output_file.write('\twhile (*p++);') _args.output_file.write("""} static void decompress_metric(int offset, struct pmu_metric *pm) @@ -708,13 +715,16 @@ static void decompress_metric(int offset, struct pmu_metric *pm) """) for attr in _json_metric_attributes: _args.output_file.write(f'\n\tpm->{attr} = ') - if attr in enum_attributes: - _args.output_file.write("(*p == '\\0' ? 0 : *p - '0');\n") + if attr in _json_enum_attributes: + _args.output_file.write("*p - '0';\n") else: _args.output_file.write("(*p == '\\0' ? NULL : p);\n") if attr == _json_metric_attributes[-1]: continue - _args.output_file.write('\twhile (*p++);') + if attr in _json_enum_attributes: + _args.output_file.write('\tp++;') + else: + _args.output_file.write('\twhile (*p++);') _args.output_file.write("""} int pmu_events_table_for_each_event(const struct pmu_events_table *table, From patchwork Sun Feb 19 09:28:07 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59100 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp772215wrn; Sun, 19 Feb 2023 01:32:24 -0800 (PST) X-Google-Smtp-Source: AK7set/k4dAxOg8CRxDh/GM90Xcelbvroxcoz5xj2XVbaFbmCi8NYLiHyljiHgqcQRuIKFd5vUy2 X-Received: by 2002:aa7:cccb:0:b0:4ad:1e35:771a with SMTP id y11-20020aa7cccb000000b004ad1e35771amr1239331edt.4.1676799144040; Sun, 19 Feb 2023 01:32:24 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799144; cv=none; d=google.com; s=arc-20160816; b=021I8u9orQ0tUJIjA++OhYnnjId1qDocZIweL4489zrz5e4K3jMPByNvSqdPP6l+4u 1kbMMaXn8RY7BYV1lOLsi731OqmkKzELicQUnPbMBjiNLjKXFwyrkKs3qfOvwoKEJsbg Fq9ubeGy2zqwttObx/gXORG7WHkmaufWJxocwzT2kL4+GsxOLWZVNQeWPcXTMEQTsYAu Znm0uWgLjEjNhOWiQta1el5aJWTlgp0lrCjalwBwTC5+WF4eREysxBMNpH742ee0ZaOl UrLhpNpHmDITvd+BucEWhs1wEI4B5MFC8m2L+mttPqKjY5bYSExF6MqbksN+RARvHGq1 BS7A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:from:subject :references:mime-version:message-id:in-reply-to:date:dkim-signature; bh=Wg6Vhc0tzXLtuHVnUe0I2g0P4qJqToZ80hJnRYQxPs0=; b=RqkHPmSqVVwXU+G4NE56Zo+9Yenur2jz/7QDcPB9JBKH64cMH4Wh9uCWykArllAO6I oKbxdtIWfP7Xbb6L0mJzoQycezAuP9DaN1cJbQQtZzaB6dAEAC9lharKOf1gycxV9qu8 jNzTLIy1G5xUc1qJ+xE2bie8lrHNpc6mpnwhtFojmoIg/2o2OUgsgppV9YBA4H5LRCO1 JlcKXjLw44FC2eukTEwFixVprl6xel0uhHE6bmXj0zEDdHWWf0ZxRqMgikeXAlck/ejK iPsO9m+peLO+rl8ErcriO92YArCaWIMenxQOBlrK1agKp3IVY1GWkvszZZJB1gi1pShB RVqw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b="sp7/fQQs"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id g4-20020aa7c584000000b004abdba04e7dsi11185563edq.393.2023.02.19.01.31.57; Sun, 19 Feb 2023 01:32:23 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b="sp7/fQQs"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229883AbjBSJa5 (ORCPT + 99 others); Sun, 19 Feb 2023 04:30:57 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52952 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229838AbjBSJay (ORCPT ); Sun, 19 Feb 2023 04:30:54 -0500 Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com [IPv6:2607:f8b0:4864:20::1149]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A87D0DBEA for ; Sun, 19 Feb 2023 01:30:28 -0800 (PST) Received: by mail-yw1-x1149.google.com with SMTP id 00721157ae682-53688a1b4fdso17683647b3.22 for ; Sun, 19 Feb 2023 01:30:28 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:from:to:cc:subject:date :message-id:reply-to; bh=Wg6Vhc0tzXLtuHVnUe0I2g0P4qJqToZ80hJnRYQxPs0=; b=sp7/fQQsO9W+LlsOnpJG+P/P7DJMBeB98TfnNytNFYtpz/0yssRiPrGjn+I1r5L/5X c6ElwFWnoa438rqXKPOi9uhXzdOS3quV37lrfudok/YfuJcClffSrjbvFL2u2JUGpmen dcCCJEbX7MOTJuKB3DYnmbAct8VLFk6+Tmp1Hclh29VUXmLBbuLNze2bQzW6XJ0xZvtL /29YtEkDXh+BqKCn0V8uaF5Wooj5HaR/ALQsAxSiLm6Jch7ixxigT29DUqgr+UYtZEoS qbEnEK/GURwxA4mPxq+gNzOuKFCAi5+l9jMfX8N6Xbyp0N/1W+tcsbQtStkmWiLXvz5D K4FA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=Wg6Vhc0tzXLtuHVnUe0I2g0P4qJqToZ80hJnRYQxPs0=; b=O3mYcOZARkPhIayKdjV9B8x95kXVmRoI3e5V1qJsoTxTEbjW0qEYOrbSp+KJr4ZQrr 77ZFKseH6ZXIrjNgmDcC1muQXdhHWyxZDUHcJfT3rWScwap4o858HI6V6u4Gw9nQSN57 uh9TuwUfoEHK2WeVAodf3V0RdHCHM8SIuVXrIsRavWaZOeUOA2sPjTNUrFn68aZi5IQu tspYsR6FU1RfyOuYGQwgtFLO+7N4sUpNi4tnzUxMb6KiuA+ldRuYSZrhM3heVrQ00j/1 VCLnYL5YNkVYVbwyIaaGnFTRtPer4BbnXoZi6ByL4OtmlN+nh/gd+4uYeNfij+s5D5DX VFuw== X-Gm-Message-State: AO0yUKW3TXpBbANmyiVtI2IFTNdi8q7AoQFqhn6uPKQl/i535+Isw9Oa +XPRHRnYgKKWJAxcgimt5z9jWT/YvfwH X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a05:6902:45:b0:995:ccb:1aae with SMTP id m5-20020a056902004500b009950ccb1aaemr109854ybh.13.1676799025326; Sun, 19 Feb 2023 01:30:25 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:07 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-11-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 10/51] perf vendor events intel: Refresh alderlake events From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758251339670705501?= X-GMAIL-MSGID: =?utf-8?q?1758251339670705501?= Update the alderlake events from 1.16 to 1.18. Generation was done using https://github.com/intel/perfmon. Notable changes are new events and event descriptions, TMA metrics are updated to version 4.5, TMA info metrics are renamed from their node name to be lower case and prefixed by tma_info_, MetricThreshold expressions are added, smi_cost and transaction metric groups are added replicating existing hard coded metrics in stat-shadow. Signed-off-by: Ian Rogers --- .../arch/x86/alderlake/adl-metrics.json | 3190 ++++++++++------- .../pmu-events/arch/x86/alderlake/cache.json | 36 +- .../arch/x86/alderlake/floating-point.json | 27 + .../arch/x86/alderlake/frontend.json | 9 + .../pmu-events/arch/x86/alderlake/memory.json | 3 +- .../arch/x86/alderlake/pipeline.json | 14 +- .../arch/x86/alderlake/uncore-other.json | 28 +- tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- 8 files changed, 1902 insertions(+), 1407 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json index 2eb3d7464d9f..7bb8410a2bf9 100644 --- a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json +++ b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json @@ -1,2034 +1,2440 @@ [ { - "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", - "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / SLOTS", - "MetricGroup": "PGO;TopdownL1;tma_L1_group", - "MetricName": "tma_frontend_bound", - "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Machine_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "C10 residency percent per package", + "MetricExpr": "cstate_pkg@c10\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C10_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", - "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / SLOTS", - "MetricGroup": "Frontend;TopdownL2;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_latency", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "C1 residency percent per core", + "MetricExpr": "cstate_core@c1\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C1_Core_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses", - "MetricExpr": "ICACHE_DATA.STALLS / CLKS", - "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_icache_misses", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "C2 residency percent per package", + "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C2_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses", - "MetricExpr": "ICACHE_TAG.STALLS / CLKS", - "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_itlb_misses", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "C3 residency percent per package", + "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", - "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / CLKS + tma_unknown_branches", - "MetricGroup": "FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_branch_resteers", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "C6 residency percent per core", + "MetricExpr": "cstate_core@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Core_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage", - "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_MISC.CLEAR_RESTEER_CYCLES / CLKS", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL4;tma_branch_resteers_group", - "MetricName": "tma_mispredicts_resteers", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "C6 residency percent per package", + "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears", - "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) * INT_MISC.CLEAR_RESTEER_CYCLES / CLKS", - "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_branch_resteers_group", - "MetricName": "tma_clears_resteers", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "C7 residency percent per core", + "MetricExpr": "cstate_core@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Core_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears", - "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / CLKS", - "MetricGroup": "BigFoot;FetchLat;TopdownL4;tma_branch_resteers_group", - "MetricName": "tma_unknown_branches", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (First fetch or hitting BPU capacity limit). Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "C7 residency percent per package", + "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / CLKS", - "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_dsb_switches", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "C8 residency percent per package", + "MetricExpr": "cstate_pkg@c8\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C8_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", - "MetricExpr": "DECODE.LCP / CLKS", - "MetricGroup": "FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_lcp", - "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs.", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "C9 residency percent per package", + "MetricExpr": "cstate_pkg@c9\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C9_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", - "MetricExpr": "3 * IDQ.MS_SWITCHES / CLKS", - "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_ms_switches", - "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: FRONTEND_RETIRED.MS_FLOWS", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Uncore frequency per die [GHZ]", + "MetricExpr": "tma_info_socket_clks / #num_dies / duration_time / 1e9", + "MetricGroup": "SoC", + "MetricName": "UNCORE_FREQ" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", - "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)", - "MetricGroup": "FetchBW;Frontend;TopdownL2;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_bandwidth", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Percentage of cycles spent in System Management Interrupts.", + "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)", + "MetricGroup": "smi", + "MetricName": "smi_cycles", + "MetricThreshold": "smi_cycles > 0.1", + "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", - "MetricExpr": "(IDQ.MITE_CYCLES_ANY - IDQ.MITE_CYCLES_OK) / CORE_CLKS / 2", - "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_fetch_bandwidth_group", - "MetricName": "tma_mite", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sample with: FRONTEND_RETIRED.ANY_DSB_MISS", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Number of SMI interrupts.", + "MetricExpr": "msr@smi@", + "MetricGroup": "smi", + "MetricName": "smi_num", + "ScaleUnit": "1SMI#" }, { - "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder", - "MetricExpr": "(cpu_core@INST_DECODED.DECODERS\\,cmask\\=1@ - cpu_core@INST_DECODED.DECODERS\\,cmask\\=2@) / CORE_CLKS", - "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_mite_group", - "MetricName": "tma_decoder0_alone", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Percentage of cycles in aborted transactions.", + "MetricExpr": "max(cpu@cycles\\-t@ - cpu@cycles\\-ct@, 0) / cycles", + "MetricGroup": "transaction", + "MetricName": "tsx_aborted_cycles", + "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", - "MetricExpr": "(IDQ.DSB_CYCLES_ANY - IDQ.DSB_CYCLES_OK) / CORE_CLKS / 2", - "MetricGroup": "DSB;FetchBW;TopdownL3;tma_fetch_bandwidth_group", - "MetricName": "tma_dsb", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Number of cycles within a transaction divided by the number of elisions.", + "MetricExpr": "cpu@cycles\\-t@ / cpu@el\\-start@", + "MetricGroup": "transaction", + "MetricName": "tsx_cycles_per_elision", + "ScaleUnit": "1cycles / elision" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to LSD (Loop Stream Detector) unit", - "MetricExpr": "(LSD.CYCLES_ACTIVE - LSD.CYCLES_OK) / CORE_CLKS / 2", - "MetricGroup": "FetchBW;LSD;TopdownL3;tma_fetch_bandwidth_group", - "MetricName": "tma_lsd", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to LSD (Loop Stream Detector) unit. LSD typically does well sustaining Uop supply. However; in some rare cases; optimal uop-delivery could not be reached for small loops whose size (in terms of number of uops) does not suit well the LSD structure.", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Number of cycles within a transaction divided by the number of transactions.", + "MetricExpr": "cpu@cycles\\-t@ / cpu@tx\\-start@", + "MetricGroup": "transaction", + "MetricName": "tsx_cycles_per_transaction", + "ScaleUnit": "1cycles / transaction" }, { - "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", - "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)", - "MetricGroup": "TopdownL1;tma_L1_group", - "MetricName": "tma_bad_speculation", - "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Percentage of cycles within a transaction region.", + "MetricExpr": "cpu@cycles\\-t@ / cycles", + "MetricGroup": "transaction", + "MetricName": "tsx_transactional_cycles", + "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", - "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * SLOTS", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL2;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_branch_mispredicts", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS", + "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to certain allocation restrictions.", + "MetricExpr": "TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_alloc_restriction", + "MetricThreshold": "tma_alloc_restriction > 0.1", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", - "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)", - "MetricGroup": "BadSpec;MachineClears;TopdownL2;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_machine_clears", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT", + "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls", + "MetricExpr": "TOPDOWN_BE_BOUND.ALL / tma_info_slots", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_backend_bound", + "MetricThreshold": "tma_backend_bound > 0.1", + "PublicDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls. Note that uops must be available for consumption in order for this event to count. If a uop is not available (IQ is empty), this event will not count. The rest of these subevents count backend stalls, in cycles, due to an outstanding request which is memory bound vs core bound. The subevents are not slot based events and therefore can not be precisely added or subtracted from the Backend_Bound_Aux subevents which are slot based.", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", - "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * SLOTS", + "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls", + "MetricExpr": "tma_backend_bound", "MetricGroup": "TopdownL1;tma_L1_group", - "MetricName": "tma_backend_bound", - "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS", + "MetricName": "tma_backend_bound_aux", + "MetricThreshold": "tma_backend_bound_aux > 0.2", + "PublicDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls. Note that UOPS must be available for consumption in order for this event to count. If a uop is not available (IQ is empty), this event will not count. All of these subevents count backend stalls, in slots, due to a resource limitation. These are not cycle based events and therefore can not be precisely added or subtracted from the Backend_Bound subevents which are cycle based. These subevents are supplementary to Backend_Bound and can be used to analyze results from a resource perspective at allocation.", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", - "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * SLOTS", - "MetricGroup": "Backend;TopdownL2;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_memory_bound", - "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", + "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend because allocation is stalled due to a mispredicted jump or a machine clear", + "MetricExpr": "(tma_info_slots - (TOPDOWN_FE_BOUND.ALL + TOPDOWN_BE_BOUND.ALL + TOPDOWN_RETIRING.ALL)) / tma_info_slots", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_bad_speculation", + "MetricThreshold": "tma_bad_speculation > 0.15", + "PublicDescription": "Counts the total number of issue slots that were not consumed by the backend because allocation is stalled due to a mispredicted jump or a machine clear. Only issue slots wasted due to fast nukes such as memory ordering nukes are counted. Other nukes are not accounted for. Counts all issue slots blocked during this recovery window including relevant microcode flows and while uops are not yet available in the instruction queue (IQ). Also includes the issue slots that were consumed by the backend but were thrown away because they were younger than the mispredict or machine clear.", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", - "MetricExpr": "max((EXE_ACTIVITY.BOUND_ON_LOADS - MEMORY_ACTIVITY.STALLS_L1D_MISS) / CLKS, 0)", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l1_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB_HIT_PS", + "BriefDescription": "Counts the number of uops that are not from the microsequencer.", + "MetricExpr": "(TOPDOWN_RETIRING.ALL - UOPS_RETIRED.MS) / tma_info_slots", + "MetricGroup": "TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_base", + "MetricThreshold": "tma_base > 0.6", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "min(7 * cpu_core@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / CLKS", - "MetricGroup": "MemoryTLB;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_dtlb_load", - "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS", + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to BACLEARS, which occurs when the Branch Target Buffer (BTB) prediction or lack thereof, was corrected by a later branch predictor in the frontend", + "MetricExpr": "TOPDOWN_FE_BOUND.BRANCH_DETECT / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_frontend_latency_group", + "MetricName": "tma_branch_detect", + "MetricThreshold": "tma_branch_detect > 0.05", + "PublicDescription": "Counts the number of issue slots that were not delivered by the frontend due to BACLEARS, which occurs when the Branch Target Buffer (BTB) prediction or lack thereof, was corrected by a later branch predictor in the frontend. Includes BACLEARS due to all branch types including conditional and unconditional jumps, returns, and indirect branches.", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) DTLB was missed by load accesses, that later on hit in second-level TLB (STLB)", - "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", - "MetricGroup": "MemoryTLB;TopdownL5;tma_dtlb_load_group", - "MetricName": "tma_load_stlb_hit", + "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to branch mispredicts.", + "MetricExpr": "TOPDOWN_BAD_SPECULATION.MISPREDICT / tma_info_slots", + "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", + "MetricName": "tma_branch_mispredicts", + "MetricThreshold": "tma_branch_mispredicts > 0.05", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by load accesses, performing a hardware page walk", - "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / CLKS", - "MetricGroup": "MemoryTLB;TopdownL5;tma_dtlb_load_group", - "MetricName": "tma_load_stlb_miss", + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to BTCLEARS, which occurs when the Branch Target Buffer (BTB) predicts a taken branch.", + "MetricExpr": "TOPDOWN_FE_BOUND.BRANCH_RESTEER / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_frontend_latency_group", + "MetricName": "tma_branch_resteer", + "MetricThreshold": "tma_branch_resteer > 0.05", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", - "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_store_fwd_blk", - "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to the microcode sequencer (MS).", + "MetricExpr": "TOPDOWN_FE_BOUND.CISC / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_frontend_bandwidth_group", + "MetricName": "tma_cisc", + "MetricThreshold": "tma_cisc > 0.05", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", - "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS.ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10 * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO))) / CLKS", - "MetricGroup": "Offcore;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_lock_latency", - "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOADS_PS", + "BriefDescription": "Counts the number of cycles due to backend bound stalls that are core execution bound and not attributed to outstanding demand load or store stalls.", + "MetricExpr": "max(0, tma_backend_bound - tma_load_store_bound)", + "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary", - "MetricExpr": "Load_Miss_Real_Latency * LD_BLOCKS.NO_SR / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_split_loads", - "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to decode stalls.", + "MetricExpr": "TOPDOWN_FE_BOUND.DECODE / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_frontend_bandwidth_group", + "MetricName": "tma_decode", + "MetricThreshold": "tma_decode > 0.05", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", - "MetricExpr": "L1D_PEND_MISS.FB_FULL / CLKS", - "MetricGroup": "MemoryBW;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_fb_full", - "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory).", + "BriefDescription": "Counts the number of machine clears relative to the number of nuke slots due to memory disambiguation.", + "MetricExpr": "tma_nuke * (MACHINE_CLEARS.DISAMBIGUATION / MACHINE_CLEARS.SLOW)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_nuke_group", + "MetricName": "tma_disambiguation", + "MetricThreshold": "tma_disambiguation > 0.02", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads", - "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L1D_MISS - MEMORY_ACTIVITY.STALLS_L2_MISS) / CLKS", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l2_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT_PS", + "BriefDescription": "Counts the number of cycles the core is stalled due to a demand load miss which hit in DRAM or MMIO (Non-DRAM).", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "MEM_BOUND_STALLS.LOAD_DRAM_HIT / tma_info_clks - MEM_BOUND_STALLS_AT_RET_CORRECTION * MEM_BOUND_STALLS.LOAD_DRAM_HIT / MEM_BOUND_STALLS.LOAD", + "MetricGroup": "TopdownL3;tma_L3_group;tma_load_store_bound_group", + "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", - "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L2_MISS - MEMORY_ACTIVITY.STALLS_L3_MISS) / CLKS", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l3_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to a machine clear classified as a fast nuke due to memory ordering, memory disambiguation and memory renaming.", + "MetricExpr": "TOPDOWN_BAD_SPECULATION.FASTNUKE / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", + "MetricName": "tma_fast_nuke", + "MetricThreshold": "tma_fast_nuke > 0.05", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", - "MetricExpr": "(25 * Average_Frequency * (MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + 24 * Average_Frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CLKS", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_contested_accesses", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS", + "BriefDescription": "Counts the number of machine clears relative to the number of nuke slots due to FP assists.", + "MetricExpr": "tma_nuke * (MACHINE_CLEARS.FP_ASSIST / MACHINE_CLEARS.SLOW)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_nuke_group", + "MetricName": "tma_fp_assist", + "MetricThreshold": "tma_fp_assist > 0.02", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", - "MetricExpr": "24 * Average_Frequency * (MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD + MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CLKS", - "MetricGroup": "Offcore;Snoop;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_data_sharing", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD", + "BriefDescription": "Counts the number of floating point operations per uop with all default weighting.", + "MetricExpr": "UOPS_RETIRED.FPDIV / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_base_group", + "MetricName": "tma_fp_uops", + "MetricThreshold": "tma_fp_uops > 0.2", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", - "MetricExpr": "9 * Average_Frequency * MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CLKS", - "MetricGroup": "MemoryLat;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_l3_hit_latency", - "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to frontend bandwidth restrictions due to decode, predecode, cisc, and other limitations.", + "MetricExpr": "TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH / tma_info_slots", + "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_frontend_bandwidth", + "MetricThreshold": "tma_frontend_bandwidth > 0.1", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", - "MetricExpr": "(XQ.FULL_CYCLES + L1D_PEND_MISS.L2_STALLS) / CLKS", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_sq_full", - "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). The Super Queue is used for requests to access the L2 cache or to go out to the Uncore.", + "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to frontend stalls.", + "MetricExpr": "TOPDOWN_FE_BOUND.ALL / tma_info_slots", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.2", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", - "MetricExpr": "MEMORY_ACTIVITY.STALLS_L3_MISS / CLKS", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_dram_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS_PS", + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to frontend bandwidth restrictions due to decode, predecode, cisc, and other limitations.", + "MetricExpr": "TOPDOWN_FE_BOUND.FRONTEND_LATENCY / tma_info_slots", + "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_frontend_latency", + "MetricThreshold": "tma_frontend_latency > 0.15", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu_core@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=4@) / CLKS", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_dram_bound_group", - "MetricName": "tma_mem_bandwidth", - "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that).", + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to instruction cache misses.", + "MetricExpr": "TOPDOWN_FE_BOUND.ICACHE / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_frontend_latency_group", + "MetricName": "tma_icache", + "MetricThreshold": "tma_icache > 0.05", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / CLKS - tma_mem_bandwidth", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_dram_bound_group", - "MetricName": "tma_mem_latency", - "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that).", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Percentage of total non-speculative loads with a address aliasing block", + "MetricExpr": "100 * LD_BLOCKS.4K_ALIAS / MEM_UOPS_RETIRED.ALL_LOADS", + "MetricName": "tma_info_address_alias_blocks", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", - "MetricExpr": "tma_st_buffer", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_store_bound", - "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES_PS", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Ratio of all branches which mispredict", + "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": " ", + "MetricName": "tma_info_branch_mispredict_ratio", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", - "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / CLKS", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_store_bound_group", - "MetricName": "tma_store_latency", - "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full)", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Ratio between Mispredicted branches and unknown branches", + "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / BACLEARS.ANY", + "MetricGroup": " ", + "MetricName": "tma_info_branch_mispredict_to_unknown_branch_ratio", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing", - "MetricExpr": "28 * Average_Frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM / CLKS", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_store_bound_group", - "MetricName": "tma_false_sharing", - "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "", + "MetricExpr": "CPU_CLK_UNHALTED.CORE", + "MetricGroup": " ", + "MetricName": "tma_info_clks", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents rate of split store accesses", - "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / CORE_CLKS", - "MetricGroup": "TopdownL4;tma_store_bound_group", - "MetricName": "tma_split_stores", - "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "", + "MetricExpr": "CPU_CLK_UNHALTED.CORE_P", + "MetricGroup": " ", + "MetricName": "tma_info_clks_p", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores", - "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / CLKS", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_store_bound_group", - "MetricName": "tma_streaming_stores", - "PublicDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should Streaming stores be a bottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Cycles Per Instruction", + "MetricExpr": "tma_info_clks / INST_RETIRED.ANY", + "MetricGroup": " ", + "MetricName": "tma_info_cpi", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", - "MetricExpr": "(7 * cpu_core@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / CORE_CLKS", - "MetricGroup": "MemoryTLB;TopdownL4;tma_store_bound_group", - "MetricName": "tma_dtlb_store", - "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Average CPU Utilization", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": " ", + "MetricName": "tma_info_cpu_utilization", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the TLB was missed by store accesses, hitting in the second-level TLB (STLB)", - "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", - "MetricGroup": "MemoryTLB;TopdownL5;tma_dtlb_store_group", - "MetricName": "tma_store_stlb_hit", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Cycle cost per DRAM hit", + "MetricExpr": "MEM_BOUND_STALLS.LOAD_DRAM_HIT / MEM_LOAD_UOPS_RETIRED.DRAM_HIT", + "MetricGroup": " ", + "MetricName": "tma_info_cycles_per_demand_load_dram_hit", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk", - "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / CORE_CLKS", - "MetricGroup": "MemoryTLB;TopdownL5;tma_dtlb_store_group", - "MetricName": "tma_store_stlb_miss", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Cycle cost per L2 hit", + "MetricExpr": "MEM_BOUND_STALLS.LOAD_L2_HIT / MEM_LOAD_UOPS_RETIRED.L2_HIT", + "MetricGroup": " ", + "MetricName": "tma_info_cycles_per_demand_load_l2_hit", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", - "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)", - "MetricGroup": "Backend;Compute;TopdownL2;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_core_bound", - "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Cycle cost per LLC hit", + "MetricExpr": "MEM_BOUND_STALLS.LOAD_LLC_HIT / MEM_LOAD_UOPS_RETIRED.L3_HIT", + "MetricGroup": " ", + "MetricName": "tma_info_cycles_per_demand_load_l3_hit", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", - "MetricExpr": "ARITH.DIVIDER_ACTIVE / CLKS", - "MetricGroup": "TopdownL3;tma_core_bound_group", - "MetricName": "tma_divider", - "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_ACTIVE", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Percentage of all uops which are FPDiv uops", + "MetricExpr": "100 * UOPS_RETIRED.FPDIV / UOPS_RETIRED.ALL", + "MetricGroup": " ", + "MetricName": "tma_info_fpdiv_uop_ratio", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", - "MetricExpr": "((cpu_core@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=0x80@ + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * cpu_core@EXE_ACTIVITY.2_PORTS_UTIL\\,umask\\=0xc@)) / CLKS if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * cpu_core@EXE_ACTIVITY.2_PORTS_UTIL\\,umask\\=0xc@) / CLKS)", - "MetricGroup": "PortsUtil;TopdownL3;tma_core_bound_group", - "MetricName": "tma_ports_utilization", - "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Percentage of all uops which are IDiv uops", + "MetricExpr": "100 * UOPS_RETIRED.IDIV / UOPS_RETIRED.ALL", + "MetricGroup": " ", + "MetricName": "tma_info_idiv_uop_ratio", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "cpu_core@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=0x80@ / CLKS + tma_serializing_operation * (CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS) / CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_0", - "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Percent of instruction miss cost that hit in DRAM", + "MetricExpr": "100 * MEM_BOUND_STALLS.IFETCH_DRAM_HIT / MEM_BOUND_STALLS.IFETCH", + "MetricGroup": " ", + "MetricName": "tma_info_inst_miss_cost_dramhit_percent", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations", - "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / CLKS", - "MetricGroup": "TopdownL5;tma_ports_utilized_0_group", - "MetricName": "tma_serializing_operation", - "PublicDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations. Instructions like CPUID; WRMSR or LFENCE serialize the out-of-order execution which may limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Percent of instruction miss cost that hit in the L2", + "MetricExpr": "100 * MEM_BOUND_STALLS.IFETCH_L2_HIT / MEM_BOUND_STALLS.IFETCH", + "MetricGroup": " ", + "MetricName": "tma_info_inst_miss_cost_l2hit_percent", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions", - "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / CLKS", - "MetricGroup": "TopdownL6;tma_serializing_operation_group", - "MetricName": "tma_slow_pause", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.PAUSE_INST", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Percent of instruction miss cost that hit in the L3", + "MetricExpr": "100 * MEM_BOUND_STALLS.IFETCH_LLC_HIT / MEM_BOUND_STALLS.IFETCH", + "MetricGroup": " ", + "MetricName": "tma_info_inst_miss_cost_l3hit_percent", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to LFENCE Instructions.", - "MetricExpr": "13 * MISC2_RETIRED.LFENCE / CLKS", - "MetricGroup": "TopdownL6;tma_serializing_operation_group", - "MetricName": "tma_memory_fence", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Instructions per Branch (lower number means higher occurance rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": " ", + "MetricName": "tma_info_ipbranch", + "Unit": "cpu_atom" }, { - "BriefDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued", - "MetricExpr": "160 * ASSISTS.SSE_AVX_MIX / CLKS", - "MetricGroup": "TopdownL5;tma_ports_utilized_0_group", - "MetricName": "tma_mixing_vectors", - "PublicDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued. Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic.", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Instructions Per Cycle", + "MetricExpr": "INST_RETIRED.ANY / tma_info_clks", + "MetricGroup": " ", + "MetricName": "tma_info_ipc", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_1", - "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Instruction per (near) call (lower number means higher occurance rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.CALL", + "MetricGroup": " ", + "MetricName": "tma_info_ipcall", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_2", - "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Sample with: EXE_ACTIVITY.2_PORTS_UTIL", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Instructions per Far Branch", + "MetricExpr": "INST_RETIRED.ANY / (BR_INST_RETIRED.FAR_BRANCH / 2)", + "MetricGroup": " ", + "MetricName": "tma_info_ipfarbranch", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_3m", - "PublicDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Sample with: UOPS_EXECUTED.CYCLES_GE_3", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Instructions per Load", + "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_LOADS", + "MetricGroup": " ", + "MetricName": "tma_info_ipload", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", - "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + UOPS_DISPATCHED.PORT_5_11 + UOPS_DISPATCHED.PORT_6) / (5 * CORE_CLKS)", - "MetricGroup": "TopdownL5;tma_ports_utilized_3m_group", - "MetricName": "tma_alu_op_utilization", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": " ", + "MetricName": "tma_info_ipmispredict", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch) Sample with: UOPS_DISPATCHED.PORT_0", - "MetricExpr": "UOPS_DISPATCHED.PORT_0 / CORE_CLKS", - "MetricGroup": "Compute;TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_0", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Instructions per Store", + "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_STORES", + "MetricGroup": " ", + "MetricName": "tma_info_ipstore", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU) Sample with: UOPS_DISPATCHED.PORT_1", - "MetricExpr": "UOPS_DISPATCHED.PORT_1 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_1", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Fraction of cycles spent in Kernel mode", + "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.CORE@k / CPU_CLK_UNHALTED.CORE", + "MetricGroup": " ", + "MetricName": "tma_info_kernel_utilization", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU) Sample with: UOPS_DISPATCHED.PORT_6", - "MetricExpr": "UOPS_DISPATCHED.PORT_6 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_6", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Percentage of total non-speculative loads that are splits", + "MetricExpr": "100 * MEM_UOPS_RETIRED.SPLIT_LOADS / MEM_UOPS_RETIRED.ALL_LOADS", + "MetricName": "tma_info_load_splits", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations Sample with: UOPS_DISPATCHED.PORT_2_3_10", - "MetricExpr": "UOPS_DISPATCHED.PORT_2_3_10 / (3 * CORE_CLKS)", - "MetricGroup": "TopdownL5;tma_ports_utilized_3m_group", - "MetricName": "tma_load_op_utilization", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "load ops retired per 1000 instruction", + "MetricExpr": "1e3 * MEM_UOPS_RETIRED.ALL_LOADS / INST_RETIRED.ANY", + "MetricGroup": " ", + "MetricName": "tma_info_memloadpki", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations Sample with: UOPS_DISPATCHED.PORT_7_8", - "MetricExpr": "(UOPS_DISPATCHED.PORT_4_9 + UOPS_DISPATCHED.PORT_7_8) / (4 * CORE_CLKS)", - "MetricGroup": "TopdownL5;tma_ports_utilized_3m_group", - "MetricName": "tma_store_op_utilization", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Percentage of all uops which are ucode ops", + "MetricExpr": "100 * UOPS_RETIRED.MS / UOPS_RETIRED.ALL", + "MetricGroup": " ", + "MetricName": "tma_info_microcode_uop_ratio", + "Unit": "cpu_atom" }, { - "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group", - "MetricName": "tma_retiring", - "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "", + "MetricExpr": "5 * tma_info_clks", + "MetricGroup": " ", + "MetricName": "tma_info_slots", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", - "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_retiring_group", - "MetricName": "tma_light_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Percentage of total non-speculative loads with a store forward or unknown store address block", + "MetricExpr": "100 * LD_BLOCKS.DATA_UNKNOWN / MEM_UOPS_RETIRED.ALL_LOADS", + "MetricName": "tma_info_store_fwd_blocks", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", - "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", - "MetricGroup": "HPC;TopdownL3;tma_light_operations_group", - "MetricName": "tma_fp_arith", - "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Average Frequency Utilization relative nominal frequency", + "MetricExpr": "tma_info_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": " ", + "MetricName": "tma_info_turbo_utilization", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric serves as an approximation of legacy x87 usage", - "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD", - "MetricGroup": "Compute;TopdownL4;tma_fp_arith_group", - "MetricName": "tma_x87_use", - "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "UOPS_RETIRED.ALL / INST_RETIRED.ANY", + "MetricGroup": " ", + "MetricName": "tma_info_upi", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE) / (tma_retiring * SLOTS)", - "MetricGroup": "Compute;Flops;TopdownL4;tma_fp_arith_group", - "MetricName": "tma_fp_scalar", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting.", - "ScaleUnit": "100%", - "Unit": "cpu_core" + "BriefDescription": "Percentage of all uops which are x87 uops", + "MetricExpr": "100 * UOPS_RETIRED.X87 / UOPS_RETIRED.ALL", + "MetricGroup": " ", + "MetricName": "tma_info_x87_uop_ratio", + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths", - "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / (tma_retiring * SLOTS)", - "MetricGroup": "Compute;Flops;TopdownL4;tma_fp_arith_group", - "MetricName": "tma_fp_vector", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting.", + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to Instruction Table Lookaside Buffer (ITLB) misses.", + "MetricExpr": "TOPDOWN_FE_BOUND.ITLB / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_frontend_latency_group", + "MetricName": "tma_itlb", + "MetricThreshold": "tma_itlb > 0.05", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors", - "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * SLOTS)", - "MetricGroup": "Compute;Flops;TopdownL5;tma_fp_vector_group", - "MetricName": "tma_fp_vector_128b", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting.", + "BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement due to a load block.", + "MetricExpr": "LD_HEAD.L1_BOUND_AT_RET / tma_info_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_load_store_bound_group", + "MetricName": "tma_l1_bound", + "MetricThreshold": "tma_l1_bound > 0.1", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors", - "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / (tma_retiring * SLOTS)", - "MetricGroup": "Compute;Flops;TopdownL5;tma_fp_vector_group", - "MetricName": "tma_fp_vector_256b", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting.", + "BriefDescription": "Counts the number of cycles a core is stalled due to a demand load which hit in the L2 Cache.", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "MEM_BOUND_STALLS.LOAD_L2_HIT / tma_info_clks - MEM_BOUND_STALLS_AT_RET_CORRECTION * MEM_BOUND_STALLS.LOAD_L2_HIT / MEM_BOUND_STALLS.LOAD", + "MetricGroup": "TopdownL3;tma_L3_group;tma_load_store_bound_group", + "MetricName": "tma_l2_bound", + "MetricThreshold": "tma_l2_bound > 0.1", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired)", - "MetricExpr": "tma_int_vector_128b + tma_int_vector_256b + tma_shuffles", - "MetricGroup": "Pipeline;TopdownL3;tma_light_operations_group", - "MetricName": "tma_int_operations", - "PublicDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired). Vector/Matrix Int operations and shuffles are counted. Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain.", + "BriefDescription": "Counts the number of cycles a core is stalled due to a demand load which hit in the Last Level Cache (LLC) or other core with HITE/F/M.", + "MetricExpr": "MEM_BOUND_STALLS.LOAD_LLC_HIT / tma_info_clks - MEM_BOUND_STALLS_AT_RET_CORRECTION * MEM_BOUND_STALLS.LOAD_LLC_HIT / MEM_BOUND_STALLS.LOAD", + "MetricGroup": "TopdownL3;tma_L3_group;tma_load_store_bound_group", + "MetricName": "tma_l3_bound", + "MetricThreshold": "tma_l3_bound > 0.1", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired.", - "MetricExpr": "(INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128) / (tma_retiring * SLOTS)", - "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_int_operations_group", - "MetricName": "tma_int_vector_128b", + "BriefDescription": "Counts the number of cycles, relative to the number of mem_scheduler slots, in which uops are blocked due to load buffer full", + "MetricExpr": "tma_mem_scheduler * MEM_SCHEDULER_BLOCK.LD_BUF / MEM_SCHEDULER_BLOCK.ALL", + "MetricGroup": "TopdownL4;tma_L4_group;tma_mem_scheduler_group", + "MetricName": "tma_ld_buffer", + "MetricThreshold": "tma_ld_buffer > 0.05", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired.", - "MetricExpr": "(INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 + INT_VEC_RETIRED.VNNI_256) / (tma_retiring * SLOTS)", - "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_int_operations_group", - "MetricName": "tma_int_vector_256b", + "BriefDescription": "Counts the number of cycles the core is stalled due to stores or loads.", + "MetricExpr": "min(tma_backend_bound, LD_HEAD.ANY_AT_RET / tma_info_clks + tma_store_bound)", + "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_load_store_bound", + "MetricThreshold": "tma_load_store_bound > 0.2", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents Shuffle (cross \"vector lane\" data transfers) uops fraction the CPU has retired.", - "MetricExpr": "INT_VEC_RETIRED.SHUFFLES / (tma_retiring * SLOTS)", - "MetricGroup": "HPC;Pipeline;TopdownL4;tma_int_operations_group", - "MetricName": "tma_shuffles", + "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend because allocation is stalled due to a machine clear (nuke) of any kind including memory ordering and memory disambiguation.", + "MetricExpr": "TOPDOWN_BAD_SPECULATION.MACHINE_CLEARS / tma_info_slots", + "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.05", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations -- uops for memory load or store accesses.", - "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_retiring * SLOTS)", - "MetricGroup": "Pipeline;TopdownL3;tma_light_operations_group", - "MetricName": "tma_memory_operations", + "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to memory reservation stalls in which a scheduler is not able to accept uops.", + "MetricExpr": "TOPDOWN_BE_BOUND.MEM_SCHEDULER / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_mem_scheduler", + "MetricThreshold": "tma_mem_scheduler > 0.1", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions", - "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (tma_retiring * SLOTS)", - "MetricGroup": "Pipeline;TopdownL3;tma_light_operations_group", - "MetricName": "tma_fused_instructions", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions. The instruction pairs of CMP+JCC or DEC+JCC are commonly used examples.", + "BriefDescription": "Counts the number of machine clears relative to the number of nuke slots due to memory ordering.", + "MetricExpr": "tma_nuke * (MACHINE_CLEARS.MEMORY_ORDERING / MACHINE_CLEARS.SLOW)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_nuke_group", + "MetricName": "tma_memory_ordering", + "MetricThreshold": "tma_memory_ordering > 0.02", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused", - "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHES - INST_RETIRED.MACRO_FUSED) / (tma_retiring * SLOTS)", - "MetricGroup": "Pipeline;TopdownL3;tma_light_operations_group", - "MetricName": "tma_non_fused_branches", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused. Non-conditional branches like direct JMP or CALL would count here. Can be used to examine fusible conditional jumps that were not fused.", + "BriefDescription": "Counts the number of uops that are from the complex flows issued by the micro-sequencer (MS)", + "MetricExpr": "tma_microcode_sequencer", + "MetricGroup": "TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_ms_uops", + "MetricThreshold": "tma_ms_uops > 0.05", + "PublicDescription": "Counts the number of uops that are from the complex flows issued by the micro-sequencer (MS). This includes uops from flows due to complex instructions, faults, assists, and inserted flows.", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions", - "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_retiring * SLOTS)", - "MetricGroup": "Pipeline;TopdownL3;tma_light_operations_group", - "MetricName": "tma_nop_instructions", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions. Compilers often use NOPs for certain address alignments - e.g. start address of a function or loop body. Sample with: INST_RETIRED.NOP", + "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to IEC or FPC RAT stalls, which can be due to FIQ or IEC reservation stalls in which the integer, floating point or SIMD scheduler is not able to accept uops.", + "MetricExpr": "TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_non_mem_scheduler", + "MetricThreshold": "tma_non_mem_scheduler > 0.1", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes. May undercount due to FMA double counting", - "MetricExpr": "max(0, tma_light_operations - (tma_fp_arith + tma_int_operations + tma_memory_operations + tma_fused_instructions + tma_non_fused_branches + tma_nop_instructions))", - "MetricGroup": "Pipeline;TopdownL3;tma_light_operations_group", - "MetricName": "tma_other_light_ops", + "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to a machine clear (slow nuke).", + "MetricExpr": "TOPDOWN_BAD_SPECULATION.NUKE / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", + "MetricName": "tma_nuke", + "MetricThreshold": "tma_nuke > 0.05", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences", - "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * SLOTS", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_retiring_group", - "MetricName": "tma_heavy_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences. This highly-correlates with the uop length of these instructions/sequences. Sample with: UOPS_RETIRED.HEAVY", + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to other common frontend stalls not categorized.", + "MetricExpr": "TOPDOWN_FE_BOUND.OTHER / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_frontend_bandwidth_group", + "MetricName": "tma_other_fb", + "MetricThreshold": "tma_other_fb > 0.05", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops", - "MetricExpr": "tma_heavy_operations - tma_microcode_sequencer", - "MetricGroup": "TopdownL3;tma_heavy_operations_group", - "MetricName": "tma_few_uops_instructions", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number of uops in such instructions.", + "BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement due to a number of other load blocks.", + "MetricExpr": "LD_HEAD.OTHER_AT_RET / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_other_l1", + "MetricThreshold": "tma_other_l1 > 0.05", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", - "MetricExpr": "tma_ms_uops", - "MetricGroup": "MicroSeq;TopdownL3;tma_heavy_operations_group", - "MetricName": "tma_microcode_sequencer", - "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: UOPS_RETIRED.MS", + "BriefDescription": "Counts the number of cycles the core is stalled due to a demand load miss which hits in the L2, LLC, DRAM or MMIO (Non-DRAM) but could not be correctly attributed or cycles in which the load miss is waiting on a request buffer.", + "MetricExpr": "max(0, tma_load_store_bound - (tma_store_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound))", + "MetricGroup": "TopdownL3;tma_L3_group;tma_load_store_bound_group", + "MetricName": "tma_other_load_store", + "MetricThreshold": "tma_other_load_store > 0.1", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", - "MetricExpr": "100 * cpu_core@ASSISTS.ANY\\,umask\\=0x1B@ / SLOTS", - "MetricGroup": "TopdownL4;tma_microcode_sequencer_group", - "MetricName": "tma_assists", - "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: ASSISTS.ANY", + "BriefDescription": "Counts the number of uops retired excluding ms and fp div uops.", + "MetricExpr": "(TOPDOWN_RETIRING.ALL - UOPS_RETIRED.MS - UOPS_RETIRED.FPDIV) / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_base_group", + "MetricName": "tma_other_ret", + "MetricThreshold": "tma_other_ret > 0.3", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Page Faults", - "MetricExpr": "99 * ASSISTS.PAGE_FAULT / SLOTS", - "MetricGroup": "TopdownL5;tma_assists_group", - "MetricName": "tma_page_faults", - "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Page Faults. A Page Fault may apply on first application access to a memory page. Note operating system handling of page faults accounts for the majority of its cost.", + "BriefDescription": "Counts the number of machine clears relative to the number of nuke slots due to page faults.", + "MetricExpr": "tma_nuke * (MACHINE_CLEARS.PAGE_FAULT / MACHINE_CLEARS.SLOW)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_nuke_group", + "MetricName": "tma_page_fault", + "MetricThreshold": "tma_page_fault > 0.02", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists", - "MetricExpr": "30 * ASSISTS.FP / SLOTS", - "MetricGroup": "HPC;TopdownL5;tma_assists_group", - "MetricName": "tma_fp_assists", - "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called denormals).", + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to wrong predecodes.", + "MetricExpr": "TOPDOWN_FE_BOUND.PREDECODE / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_frontend_bandwidth_group", + "MetricName": "tma_predecode", + "MetricThreshold": "tma_predecode > 0.05", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric estimates fraction of slots the CPU retired uops as a result of handing SSE to AVX* or AVX* to SSE transition Assists. ", - "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / SLOTS", - "MetricGroup": "HPC;TopdownL5;tma_assists_group", - "MetricName": "tma_avx_assists", + "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to the physical register file unable to accept an entry (marble stalls).", + "MetricExpr": "TOPDOWN_BE_BOUND.REGISTER / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_register", + "MetricThreshold": "tma_register > 0.1", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", - "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", - "MetricGroup": "TopdownL4;tma_microcode_sequencer_group", - "MetricName": "tma_cisc", - "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources. Sample with: FRONTEND_RETIRED.MS_FLOWS", + "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to the reorder buffer being full (ROB stalls).", + "MetricExpr": "TOPDOWN_BE_BOUND.REORDER_BUFFER / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_reorder_buffer", + "MetricThreshold": "tma_reorder_buffer > 0.1", "ScaleUnit": "100%", - "Unit": "cpu_core" + "Unit": "cpu_atom" }, { - "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks", - "MetricExpr": "100 * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", - "MetricGroup": "Bad;BadSpec;BrMispredicts", - "MetricName": "Mispredictions", - "Unit": "cpu_core" + "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls", + "MetricExpr": "tma_backend_bound", + "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_aux_group", + "MetricName": "tma_resource_bound", + "MetricThreshold": "tma_resource_bound > 0.2", + "PublicDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls. Note that uops must be available for consumption in order for this event to count. If a uop is not available (IQ is empty), this event will not count.", + "ScaleUnit": "100%", + "Unit": "cpu_atom" }, { - "BriefDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks", - "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full))) + tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_fb_full / (tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk))", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "Memory_Bandwidth", - "Unit": "cpu_core" + "BriefDescription": "Counts the numer of issue slots that result in retirement slots.", + "MetricExpr": "TOPDOWN_RETIRING.ALL / tma_info_slots", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.75", + "ScaleUnit": "100%", + "Unit": "cpu_atom" }, { - "BriefDescription": "Total pipeline cost of Memory Latency related bottlenecks (external memory and off-core caches)", - "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound))", - "MetricGroup": "Mem;MemoryLat;Offcore", - "MetricName": "Memory_Latency", - "Unit": "cpu_core" + "BriefDescription": "Counts the number of cycles, relative to the number of mem_scheduler slots, in which uops are blocked due to RSV full relative", + "MetricExpr": "tma_mem_scheduler * MEM_SCHEDULER_BLOCK.RSV / MEM_SCHEDULER_BLOCK.ALL", + "MetricGroup": "TopdownL4;tma_L4_group;tma_mem_scheduler_group", + "MetricName": "tma_rsv", + "MetricThreshold": "tma_rsv > 0.05", + "ScaleUnit": "100%", + "Unit": "cpu_atom" }, { - "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)", - "MetricExpr": "100 * tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming_stores)))", - "MetricGroup": "Mem;MemoryTLB;Offcore", - "MetricName": "Memory_Data_TLBs", - "Unit": "cpu_core" + "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to scoreboards from the instruction queue (IQ), jump execution unit (JEU), or microcode sequencer (MS).", + "MetricExpr": "TOPDOWN_BE_BOUND.SERIALIZATION / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_serialization", + "MetricThreshold": "tma_serialization > 0.1", + "ScaleUnit": "100%", + "Unit": "cpu_atom" }, { - "BriefDescription": "Total pipeline cost of branch related instructions (used for program control-flow including function calls)", - "MetricExpr": "100 * ((BR_INST_RETIRED.COND + 3 * BR_INST_RETIRED.NEAR_CALL + (BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL)) / SLOTS)", - "MetricGroup": "Ret", - "MetricName": "Branching_Overhead", - "Unit": "cpu_core" + "BriefDescription": "Counts the number of machine clears relative to the number of nuke slots due to SMC.", + "MetricExpr": "tma_nuke * (MACHINE_CLEARS.SMC / MACHINE_CLEARS.SLOW)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_nuke_group", + "MetricName": "tma_smc", + "MetricThreshold": "tma_smc > 0.02", + "ScaleUnit": "100%", + "Unit": "cpu_atom" }, { - "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", - "MetricGroup": "BigFoot;Fed;Frontend;IcMiss;MemoryTLB", - "MetricName": "Big_Code", - "Unit": "cpu_core" + "BriefDescription": "Counts the number of cycles, relative to the number of mem_scheduler slots, in which uops are blocked due to store buffer full", + "MetricExpr": "tma_store_bound", + "MetricGroup": "TopdownL4;tma_L4_group;tma_mem_scheduler_group", + "MetricName": "tma_st_buffer", + "MetricThreshold": "tma_st_buffer > 0.05", + "ScaleUnit": "100%", + "Unit": "cpu_atom" }, { - "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_frontend_bound - tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) - Big_Code", - "MetricGroup": "Fed;FetchBW;Frontend", - "MetricName": "Instruction_Fetch_BW", - "Unit": "cpu_core" + "BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement due to a first level TLB miss.", + "MetricExpr": "LD_HEAD.DTLB_MISS_AT_RET / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_stlb_hit", + "MetricThreshold": "tma_stlb_hit > 0.05", + "ScaleUnit": "100%", + "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions Per Cycle (per Logical Processor)", - "MetricExpr": "INST_RETIRED.ANY / CLKS", - "MetricGroup": "Ret;Summary", - "MetricName": "IPC", - "Unit": "cpu_core" + "BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement due to a second level TLB miss requiring a page walk.", + "MetricExpr": "LD_HEAD.PGWALK_AT_RET / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_stlb_miss", + "MetricThreshold": "tma_stlb_miss > 0.05", + "ScaleUnit": "100%", + "Unit": "cpu_atom" }, { - "BriefDescription": "Uops Per Instruction", - "MetricExpr": "tma_retiring * SLOTS / INST_RETIRED.ANY", - "MetricGroup": "Pipeline;Ret;Retire", - "MetricName": "UPI", - "Unit": "cpu_core" + "BriefDescription": "Counts the number of cycles the core is stalled due to store buffer full.", + "MetricExpr": "tma_mem_scheduler * (MEM_SCHEDULER_BLOCK.ST_BUF / MEM_SCHEDULER_BLOCK.ALL)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_load_store_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.1", + "ScaleUnit": "100%", + "Unit": "cpu_atom" }, { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "tma_retiring * SLOTS / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW", - "MetricName": "UpTB", - "Unit": "cpu_core" + "BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement due to a store forward block.", + "MetricExpr": "LD_HEAD.ST_ADDR_AT_RET / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_store_fwd", + "MetricThreshold": "tma_store_fwd > 0.05", + "ScaleUnit": "100%", + "Unit": "cpu_atom" }, { - "BriefDescription": "Cycles Per Instruction (per Logical Processor)", - "MetricExpr": "1 / IPC", - "MetricGroup": "Mem;Pipeline", - "MetricName": "CPI", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", + "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + UOPS_DISPATCHED.PORT_5_11 + UOPS_DISPATCHED.PORT_6) / (5 * tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_alu_op_utilization", + "MetricThreshold": "tma_alu_op_utilization > 0.6", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "Pipeline", - "MetricName": "CLKS", + "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", + "MetricExpr": "100 * cpu_core@ASSISTS.ANY\\,umask\\=0x1B@ / tma_info_slots", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_assists", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: ASSISTS.ANY", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "TOPDOWN.SLOTS", - "MetricGroup": "tma_L1_group", - "MetricName": "SLOTS", + "BriefDescription": "This metric estimates fraction of slots the CPU retired uops as a result of handing SSE to AVX* or AVX* to SSE transition Assists.", + "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / tma_info_slots", + "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", + "MetricName": "tma_avx_assists", + "MetricThreshold": "tma_avx_assists > 0.1", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Fraction of Physical Core issue-slots utilized by this Logical Processor", - "MetricExpr": "(SLOTS / (TOPDOWN.SLOTS / 2) if #SMT_on else 1)", - "MetricGroup": "SMT;tma_L1_group", - "MetricName": "Slots_Utilization", + "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", + "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_backend_bound", + "MetricThreshold": "tma_backend_bound > 0.2", + "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "The ratio of Executed- by Issued-Uops", - "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", - "MetricGroup": "Cor;Pipeline", - "MetricName": "Execute_per_Issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage.", + "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", + "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_bad_speculation", + "MetricThreshold": "tma_bad_speculation > 0.15", + "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", - "MetricExpr": "INST_RETIRED.ANY / CORE_CLKS", - "MetricGroup": "Ret;SMT;tma_L1_group", - "MetricName": "CoreIPC", + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", + "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_slots", + "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM", + "MetricName": "tma_branch_mispredicts", + "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_info_branch_misprediction_cost, tma_info_mispredictions, tma_mispredicts_resteers", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Floating Point Operations Per Cycle", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / CORE_CLKS", - "MetricGroup": "Flops;Ret", - "MetricName": "FLOPc", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", + "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_clks + tma_unknown_branches", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_branch_resteers", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)", - "MetricExpr": "(FP_ARITH_DISPATCHED.PORT_0 + FP_ARITH_DISPATCHED.PORT_1 + FP_ARITH_DISPATCHED.PORT_5) / (2 * CORE_CLKS)", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "FP_Arith_Utilization", - "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common).", + "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", + "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_cisc", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources. Sample with: FRONTEND_RETIRED.MS_FLOWS", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else UOPS_EXECUTED.CORE_CYCLES_GE_1)", - "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", - "MetricName": "ILP", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears", + "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_clks", + "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC", + "MetricName": "tma_clears_resteers", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Probability of Core Bound bottleneck hidden by SMT-profiling artifacts", - "MetricExpr": "((1 - tma_core_bound / tma_ports_utilization if tma_core_bound < tma_ports_utilization else 1) if SMT_2T_Utilization > 0.5 else 0)", - "MetricGroup": "Cor;SMT", - "MetricName": "Core_Bound_Likely", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(25 * tma_info_average_frequency * (MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + 24 * tma_info_average_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_contested_accesses", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", - "MetricExpr": "CPU_CLK_UNHALTED.DISTRIBUTED", - "MetricGroup": "SMT", - "MetricName": "CORE_CLKS", + "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", + "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)", + "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", - "MetricGroup": "InsType", - "MetricName": "IpLoad", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "24 * tma_info_average_frequency * (MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD + MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_data_sharing", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", - "MetricGroup": "InsType", - "MetricName": "IpStore", + "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder", + "MetricExpr": "(cpu_core@INST_DECODED.DECODERS\\,cmask\\=1@ - cpu_core@INST_DECODED.DECODERS\\,cmask\\=2@) / tma_info_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0;tma_mite_group", + "MetricName": "tma_decoder0_alone", + "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 6 > 0.35))", + "PublicDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder. Related metrics: tma_few_uops_instructions", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Branches;Fed;InsType", - "MetricName": "IpBranch", + "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", + "MetricExpr": "ARITH.DIV_ACTIVE / tma_info_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_divider", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_ACTIVE", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "IpCall", + "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "MEMORY_ACTIVITY.STALLS_L3_MISS / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS_PS", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO", - "MetricName": "IpTB", + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", + "MetricExpr": "(IDQ.DSB_CYCLES_ANY - IDQ.DSB_CYCLES_OK) / tma_info_core_clks / 2", + "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_dsb", + "MetricThreshold": "tma_dsb > 0.15 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 6 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Branch instructions per taken branch. ", - "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "BpTkBranch", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_clks", + "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_dsb_switches", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_dsb_misses, tma_info_iptb, tma_lcp", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", - "MetricGroup": "Flops;InsType", - "MetricName": "IpFLOP", + "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", + "MetricExpr": "min(7 * cpu_core@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group", + "MetricName": "tma_dtlb_load", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store, tma_info_memory_data_tlbs", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE))", - "MetricGroup": "Flops;InsType", - "MetricName": "IpArith", - "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). May undercount due to FMA double counting. Approximated prior to BDW.", + "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", + "MetricExpr": "(7 * cpu_core@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group", + "MetricName": "tma_dtlb_store", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, tma_info_memory_data_tlbs", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE", - "MetricGroup": "Flops;FpScalar;InsType", - "MetricName": "IpArith_Scalar_SP", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting.", + "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing", + "MetricExpr": "28 * tma_info_average_frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group", + "MetricName": "tma_false_sharing", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE", - "MetricGroup": "Flops;FpScalar;InsType", - "MetricName": "IpArith_Scalar_DP", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting.", + "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", + "MetricExpr": "L1D_PEND_MISS.FB_FULL / tma_info_clks", + "MetricGroup": "MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricName": "tma_fb_full", + "MetricThreshold": "tma_fb_full > 0.3", + "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_dram_bw_use, tma_info_memory_bandwidth, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX128", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting.", + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", + "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)", + "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 6 > 0.35", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_dsb_coverage, tma_info_dsb_misses, tma_info_iptb, tma_lcp", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX256", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting.", + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", + "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_slots", + "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per Software prefetch instruction (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / cpu_core@SW_PREFETCH_ACCESS.T0\\,umask\\=0xF@", - "MetricGroup": "Prefetches", - "MetricName": "IpSWPF", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops", + "MetricExpr": "max(0, tma_heavy_operations - tma_microcode_sequencer)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueD0", + "MetricName": "tma_few_uops_instructions", + "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Total number of retired Instructions Sample with: INST_RETIRED.PREC_DIST", - "MetricExpr": "INST_RETIRED.ANY", - "MetricGroup": "Summary;tma_L1_group", - "MetricName": "Instructions", + "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", + "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_fp_arith", + "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", - "MetricExpr": "tma_retiring * SLOTS / cpu_core@UOPS_RETIRED.SLOTS\\,cmask\\=1@", - "MetricGroup": "Pipeline;Ret", - "MetricName": "Retire", + "BriefDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists", + "MetricExpr": "30 * ASSISTS.FP / tma_info_slots", + "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", + "MetricName": "tma_fp_assists", + "MetricThreshold": "tma_fp_assists > 0.1", + "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals).", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Estimated fraction of retirement-cycles dealing with repeat instructions", - "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu_core@UOPS_RETIRED.SLOTS\\,cmask\\=1@", - "MetricGroup": "Pipeline;Ret", - "MetricName": "Strings_Cycles", + "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired", + "MetricExpr": "cpu_core@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", + "MetricName": "tma_fp_scalar", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per a microcode Assist invocation. See Assists tree node for details (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / cpu_core@ASSISTS.ANY\\,umask\\=0x1B@", - "MetricGroup": "Pipeline;Ret;Retire", - "MetricName": "IpAssist", + "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths", + "MetricExpr": "cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0x3c@ / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", + "MetricName": "tma_fp_vector", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu_core@UOPS_EXECUTED.THREAD\\,cmask\\=1@", - "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", - "MetricName": "Execute", + "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", + "MetricName": "tma_fp_vector_128b", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Average number of Uops issued by front-end when it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu_core@UOPS_ISSUED.ANY\\,cmask\\=1@", - "MetricGroup": "Fed;FetchBW", - "MetricName": "Fetch_UpC", + "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", + "MetricName": "tma_fp_vector_256b", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Fraction of Uops delivered by the LSD (Loop Stream Detector; aka Loop Cache)", - "MetricExpr": "LSD.UOPS / (IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", - "MetricGroup": "Fed;LSD", - "MetricName": "LSD_Coverage", + "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", + "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_slots", + "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", - "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", - "MetricGroup": "DSB;Fed;FetchBW", - "MetricName": "DSB_Coverage", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (tma_retiring * tma_info_slots)", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_fused_instructions", + "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions. The instruction pairs of CMP+JCC or DEC+JCC are commonly used examples.", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details.", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu_core@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=1\\,edge@", - "MetricGroup": "DSBmiss", - "MetricName": "DSB_Switch_Cost", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences", + "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_slots", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_heavy_operations", + "MetricThreshold": "tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. Sample with: UOPS_RETIRED.HEAVY", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Total penalty related to DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck.", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tma_lsd + tma_mite))", - "MetricGroup": "DSBmiss;Fed", - "MetricName": "DSB_Misses", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses", + "MetricExpr": "ICACHE_DATA.STALLS / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_icache_misses", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Number of Instructions per non-speculative DSB miss (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", - "MetricGroup": "DSBmiss;Fed", - "MetricName": "IpDSB_Miss_Ret", + "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", + "MetricExpr": "tma_info_turbo_utilization * TSC / 1e9 / duration_time", + "MetricGroup": "Power;Summary", + "MetricName": "tma_info_average_frequency", "Unit": "cpu_core" }, { - "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BadSpec;BrMispredicts", - "MetricName": "IpMispredict", + "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", + "MetricGroup": "BigFoot;Fed;Frontend;IcMiss;MemoryTLB;tma_issueBC", + "MetricName": "tma_info_big_code", + "MetricThreshold": "tma_info_big_code > 20", + "PublicDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses). Related metrics: tma_info_branching_overhead", "Unit": "cpu_core" }, { - "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)", - "MetricExpr": "(tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) * SLOTS / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BrMispredicts", - "MetricName": "Branch_Misprediction_Cost", + "BriefDescription": "Branch instructions per taken branch.", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_bptkbranch", "Unit": "cpu_core" }, { - "BriefDescription": "Fraction of branches that are non-taken conditionals", - "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches;CodeGen;PGO", - "MetricName": "Cond_NT", + "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)", + "MetricExpr": "(tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) * tma_info_slots / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_branch_misprediction_cost", + "PublicDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_branch_mispredicts, tma_info_mispredictions, tma_mispredicts_resteers", "Unit": "cpu_core" }, { - "BriefDescription": "Fraction of branches that are taken conditionals", - "MetricExpr": "BR_INST_RETIRED.COND_TAKEN / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches;CodeGen;PGO", - "MetricName": "Cond_TK", + "BriefDescription": "Total pipeline cost of branch related instructions (used for program control-flow including function calls)", + "MetricExpr": "100 * ((BR_INST_RETIRED.COND + 3 * BR_INST_RETIRED.NEAR_CALL + (BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL)) / tma_info_slots)", + "MetricGroup": "Ret;tma_issueBC", + "MetricName": "tma_info_branching_overhead", + "MetricThreshold": "tma_info_branching_overhead > 10", + "PublicDescription": "Total pipeline cost of branch related instructions (used for program control-flow including function calls). Related metrics: tma_info_big_code", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of branches that are CALL or RET", "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_RETURN) / BR_INST_RETIRED.ALL_BRANCHES", "MetricGroup": "Bad;Branches", - "MetricName": "CallRet", + "MetricName": "tma_info_callret", "Unit": "cpu_core" }, { - "BriefDescription": "Fraction of branches that are unconditional (direct or indirect) jumps", - "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches", - "MetricName": "Jump", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Fraction of branches of other types (not individually covered by other metrics in Info.Branches group)", - "MetricExpr": "1 - (Cond_NT + Cond_TK + CallRet + Jump)", - "MetricGroup": "Bad;Branches", - "MetricName": "Other_Branches", + "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "Pipeline", + "MetricName": "tma_info_clks", "Unit": "cpu_core" }, { - "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", - "MetricExpr": "L1D_PEND_MISS.PENDING / MEM_LOAD_COMPLETED.L1_MISS_ANY", - "MetricGroup": "Mem;MemoryBound;MemoryLat", - "MetricName": "Load_Miss_Real_Latency", + "BriefDescription": "STLB (2nd level TLB) code speculative misses per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", + "MetricGroup": "Fed;MemoryTLB", + "MetricName": "tma_info_code_stlb_mpki", "Unit": "cpu_core" }, { - "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)", - "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES", - "MetricGroup": "Mem;MemoryBW;MemoryBound", - "MetricName": "MLP", + "BriefDescription": "Fraction of branches that are non-taken conditionals", + "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_cond_nt", "Unit": "cpu_core" }, { - "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L1MPKI", + "BriefDescription": "Fraction of branches that are taken conditionals", + "MetricExpr": "BR_INST_RETIRED.COND_TAKEN / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_cond_tk", "Unit": "cpu_core" }, { - "BriefDescription": "L1 cache true misses per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L1MPKI_Load", + "BriefDescription": "Probability of Core Bound bottleneck hidden by SMT-profiling artifacts", + "MetricExpr": "(100 * (1 - tma_core_bound / tma_ports_utilization if tma_core_bound < tma_ports_utilization else 1) if tma_info_smt_2t_utilization > 0.5 else 0)", + "MetricGroup": "Cor;SMT", + "MetricName": "tma_info_core_bound_likely", + "MetricThreshold": "tma_info_core_bound_likely > 0.5", "Unit": "cpu_core" }, { - "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", - "MetricGroup": "Backend;CacheMisses;Mem", - "MetricName": "L2MPKI", + "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", + "MetricExpr": "CPU_CLK_UNHALTED.DISTRIBUTED", + "MetricGroup": "SMT", + "MetricName": "tma_info_core_clks", "Unit": "cpu_core" }, { - "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all request types (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem;Offcore", - "MetricName": "L2MPKI_All", + "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_core_clks", + "MetricGroup": "Ret;SMT;TmaL1;tma_L1_group", + "MetricName": "tma_info_coreipc", "Unit": "cpu_core" }, { - "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2MPKI_Load", + "BriefDescription": "Cycles Per Instruction (per Logical Processor)", + "MetricExpr": "1 / tma_info_ipc", + "MetricGroup": "Mem;Pipeline", + "MetricName": "tma_info_cpi", "Unit": "cpu_core" }, { - "BriefDescription": "L2 cache hits per kilo instruction for all request types (including speculative)", - "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2HPKI_All", + "BriefDescription": "Average CPU Utilization", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": "HPC;Summary", + "MetricName": "tma_info_cpu_utilization", "Unit": "cpu_core" }, { - "BriefDescription": "L2 cache hits per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2HPKI_Load", + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD / OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_data_l2_mlp", "Unit": "cpu_core" }, { - "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L3MPKI", + "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", + "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_REQUESTS.ALL) / 1e6 / duration_time / 1e3", + "MetricGroup": "HPC;Mem;MemoryBW;SoC;tma_issueBW", + "MetricName": "tma_info_dram_bw_use", + "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_memory_bandwidth, tma_mem_bandwidth, tma_sq_full", "Unit": "cpu_core" }, { - "BriefDescription": "Fill Buffer (FB) hits per kilo instructions for retired demand loads (L1D misses that merge into ongoing miss-handling entries)", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "FB_HPKI", + "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_dsb_coverage", + "MetricThreshold": "tma_info_dsb_coverage < 0.7 & tma_info_ipc / 6 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_misses, tma_info_iptb, tma_lcp", "Unit": "cpu_core" }, { - "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", - "MetricConstraint": "NO_NMI_WATCHDOG", - "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_PENDING + DTLB_STORE_MISSES.WALK_PENDING) / (4 * CORE_CLKS)", - "MetricGroup": "Mem;MemoryTLB", - "MetricName": "Page_Walks_Utilization", + "BriefDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tma_lsd + tma_mite))", + "MetricGroup": "DSBmiss;Fed;tma_issueFB", + "MetricName": "tma_info_dsb_misses", + "MetricThreshold": "tma_info_dsb_misses > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_iptb, tma_lcp", "Unit": "cpu_core" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW", + "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details.", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu_core@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=1\\,edge@", + "MetricGroup": "DSBmiss", + "MetricName": "tma_info_dsb_switch_cost", "Unit": "cpu_core" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW", + "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-thread", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu_core@UOPS_EXECUTED.THREAD\\,cmask\\=1@", + "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", + "MetricName": "tma_info_execute", "Unit": "cpu_core" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW", + "BriefDescription": "The ratio of Executed- by Issued-Uops", + "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", + "MetricGroup": "Cor;Pipeline", + "MetricName": "tma_info_execute_per_issue", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage.", "Unit": "cpu_core" }, { - "BriefDescription": "Average per-core data access bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "L3_Cache_Access_BW", + "BriefDescription": "Fill Buffer (FB) hits per kilo instructions for retired demand loads (L1D misses that merge into ongoing miss-handling entries)", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_fb_hpki", "Unit": "cpu_core" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "L1D_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW_1T", + "BriefDescription": "Average number of Uops issued by front-end when it issued something", + "MetricExpr": "UOPS_ISSUED.ANY / cpu_core@UOPS_ISSUED.ANY\\,cmask\\=1@", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_fetch_upc", "Unit": "cpu_core" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "L2_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW_1T", + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / tma_info_core_clks", + "MetricGroup": "Flops;Ret", + "MetricName": "tma_info_flopc", "Unit": "cpu_core" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "L3_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW_1T", + "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(FP_ARITH_DISPATCHED.PORT_0 + FP_ARITH_DISPATCHED.PORT_1 + FP_ARITH_DISPATCHED.PORT_5) / (2 * tma_info_core_clks)", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_fp_arith_utilization", + "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common).", "Unit": "cpu_core" }, { - "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "L3_Cache_Access_BW", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "L3_Cache_Access_BW_1T", + "BriefDescription": "Giga Floating Point Operations Per Second", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / duration_time", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_gflops", + "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine.", "Unit": "cpu_core" }, { - "BriefDescription": "Average CPU Utilization", - "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", - "MetricGroup": "HPC;Summary", - "MetricName": "CPU_Utilization", + "BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", + "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", + "MetricName": "tma_info_ic_misses", + "MetricThreshold": "tma_info_ic_misses > 5", + "PublicDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck. Related metrics: ", "Unit": "cpu_core" }, { - "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", - "MetricExpr": "Turbo_Utilization * TSC / 1e9 / duration_time", - "MetricGroup": "Power;Summary", - "MetricName": "Average_Frequency", + "BriefDescription": "Average Latency for L1 instruction cache misses", + "MetricExpr": "ICACHE_DATA.STALLS / cpu_core@ICACHE_DATA.STALLS\\,cmask\\=1\\,edge@", + "MetricGroup": "Fed;FetchLat;IcMiss", + "MetricName": "tma_info_icache_miss_latency", "Unit": "cpu_core" }, { - "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / duration_time", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "GFLOPs", - "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine.", + "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else UOPS_EXECUTED.CORE_CYCLES_GE_1)", + "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", + "MetricName": "tma_info_ilp", "Unit": "cpu_core" }, { - "BriefDescription": "Average Frequency Utilization relative nominal frequency", - "MetricExpr": "CLKS / CPU_CLK_UNHALTED.REF_TSC", - "MetricGroup": "Power", - "MetricName": "Turbo_Utilization", + "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks", + "MetricExpr": "100 * (tma_frontend_bound - tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) - tma_info_big_code", + "MetricGroup": "Fed;FetchBW;Frontend", + "MetricName": "tma_info_instruction_fetch_bw", + "MetricThreshold": "tma_info_instruction_fetch_bw > 20", "Unit": "cpu_core" }, { - "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", - "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_DISTRIBUTED if #SMT_on else 0)", - "MetricGroup": "SMT", - "MetricName": "SMT_2T_Utilization", + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Summary;TmaL1;tma_L1_group", + "MetricName": "tma_info_instructions", + "PublicDescription": "Total number of retired Instructions. Sample with: INST_RETIRED.PREC_DIST", "Unit": "cpu_core" }, { - "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "OS", - "MetricName": "Kernel_Utilization", + "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (cpu_core@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0x3c@)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_iparith", + "MetricThreshold": "tma_info_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). May undercount due to FMA double counting. Approximated prior to BDW.", "Unit": "cpu_core" }, { - "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", - "MetricGroup": "OS", - "MetricName": "Kernel_CPI", + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx128", + "MetricThreshold": "tma_info_iparith_avx128 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting.", "Unit": "cpu_core" }, { - "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", - "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_REQUESTS.ALL) / 1e6 / duration_time / 1e3", - "MetricGroup": "HPC;Mem;MemoryBW;SoC", - "MetricName": "DRAM_BW_Use", + "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx256", + "MetricThreshold": "tma_info_iparith_avx256 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting.", "Unit": "cpu_core" }, { - "BriefDescription": "Average number of parallel requests to external memory. Accounts for all requests", - "MetricExpr": "UNC_ARB_TRK_OCCUPANCY.ALL / UNC_ARB_TRK_REQUESTS.ALL", - "MetricGroup": "Mem;SoC", - "MetricName": "MEM_Parallel_Requests", + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_iparith_scalar_dp", + "MetricThreshold": "tma_info_iparith_scalar_dp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting.", "Unit": "cpu_core" }, { - "BriefDescription": "Socket actual clocks when any core is active on that socket", - "MetricExpr": "UNC_CLOCK.SOCKET", - "MetricGroup": "SoC", - "MetricName": "Socket_CLKS", + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_iparith_scalar_sp", + "MetricThreshold": "tma_info_iparith_scalar_sp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting.", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", - "MetricGroup": "Branches;OS", - "MetricName": "IpFarBranch", + "BriefDescription": "Instructions per a microcode Assist invocation", + "MetricExpr": "INST_RETIRED.ANY / cpu_core@ASSISTS.ANY\\,umask\\=0x1B@", + "MetricGroup": "Pipeline;Ret;Retire", + "MetricName": "tma_info_ipassist", + "MetricThreshold": "tma_info_ipassist < 100e3", + "PublicDescription": "Instructions per a microcode Assist invocation. See Assists tree node for details (lower number means higher occurrence rate)", "Unit": "cpu_core" }, { - "BriefDescription": "Uncore frequency per die [GHZ]", - "MetricExpr": "Socket_CLKS / #num_dies / duration_time / 1e9", - "MetricGroup": "SoC", - "MetricName": "UNCORE_FREQ" + "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Branches;Fed;InsType", + "MetricName": "tma_info_ipbranch", + "MetricThreshold": "tma_info_ipbranch < 8", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to frontend stalls.", - "MetricExpr": "TOPDOWN_FE_BOUND.ALL / SLOTS", - "MetricGroup": "TopdownL1", - "MetricName": "tma_frontend_bound", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Instructions Per Cycle (per Logical Processor)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_clks", + "MetricGroup": "Ret;Summary", + "MetricName": "tma_info_ipc", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to frontend bandwidth restrictions due to decode, predecode, cisc, and other limitations.", - "MetricExpr": "TOPDOWN_FE_BOUND.FRONTEND_LATENCY / SLOTS", - "MetricGroup": "TopdownL2;tma_frontend_bound_group", - "MetricName": "tma_frontend_latency", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_ipcall", + "MetricThreshold": "tma_info_ipcall < 200", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to instruction cache misses.", - "MetricExpr": "TOPDOWN_FE_BOUND.ICACHE / SLOTS", - "MetricGroup": "TopdownL3;tma_frontend_latency_group", - "MetricName": "tma_icache", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Instructions per non-speculative DSB miss (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", + "MetricGroup": "DSBmiss;Fed", + "MetricName": "tma_info_ipdsb_miss_ret", + "MetricThreshold": "tma_info_ipdsb_miss_ret < 50", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to Instruction Table Lookaside Buffer (ITLB) misses.", - "MetricExpr": "TOPDOWN_FE_BOUND.ITLB / SLOTS", - "MetricGroup": "TopdownL3;tma_frontend_latency_group", - "MetricName": "tma_itlb", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", + "MetricGroup": "Branches;OS", + "MetricName": "tma_info_ipfarbranch", + "MetricThreshold": "tma_info_ipfarbranch < 1e6", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to BACLEARS, which occurs when the Branch Target Buffer (BTB) prediction or lack thereof, was corrected by a later branch predictor in the frontend", - "MetricExpr": "TOPDOWN_FE_BOUND.BRANCH_DETECT / SLOTS", - "MetricGroup": "TopdownL3;tma_frontend_latency_group", - "MetricName": "tma_branch_detect", - "PublicDescription": "Counts the number of issue slots that were not delivered by the frontend due to BACLEARS, which occurs when the Branch Target Buffer (BTB) prediction or lack thereof, was corrected by a later branch predictor in the frontend. Includes BACLEARS due to all branch types including conditional and unconditional jumps, returns, and indirect branches.", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_ipflop", + "MetricThreshold": "tma_info_ipflop < 10", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to BTCLEARS, which occurs when the Branch Target Buffer (BTB) predicts a taken branch.", - "MetricExpr": "TOPDOWN_FE_BOUND.BRANCH_RESTEER / SLOTS", - "MetricGroup": "TopdownL3;tma_frontend_latency_group", - "MetricName": "tma_branch_resteer", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipload", + "MetricThreshold": "tma_info_ipload < 3", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to frontend bandwidth restrictions due to decode, predecode, cisc, and other limitations.", - "MetricExpr": "TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH / SLOTS", - "MetricGroup": "TopdownL2;tma_frontend_bound_group", - "MetricName": "tma_frontend_bandwidth", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Instructions per retired mispredicts for conditional non-taken branches (lower number means higher occurrence rate).", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_cond_ntaken", + "MetricThreshold": "tma_info_ipmisp_cond_ntaken < 200", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to the microcode sequencer (MS).", - "MetricExpr": "TOPDOWN_FE_BOUND.CISC / SLOTS", - "MetricGroup": "TopdownL3;tma_frontend_bandwidth_group", - "MetricName": "tma_cisc", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Instructions per retired mispredicts for conditional taken branches (lower number means higher occurrence rate).", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_cond_taken", + "MetricThreshold": "tma_info_ipmisp_cond_taken < 200", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to decode stalls.", - "MetricExpr": "TOPDOWN_FE_BOUND.DECODE / SLOTS", - "MetricGroup": "TopdownL3;tma_frontend_bandwidth_group", - "MetricName": "tma_decode", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).", + "MetricExpr": "cpu_core@BR_MISP_RETIRED.INDIRECT_CALL\\,umask\\=0x80@ / BR_MISP_RETIRED.INDIRECT", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_indirect", + "MetricThreshold": "tma_info_ipmisp_indirect < 1e3", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to wrong predecodes.", - "MetricExpr": "TOPDOWN_FE_BOUND.PREDECODE / SLOTS", - "MetricGroup": "TopdownL3;tma_frontend_bandwidth_group", - "MetricName": "tma_predecode", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Instructions per retired mispredicts for return branches (lower number means higher occurrence rate).", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_ret", + "MetricThreshold": "tma_info_ipmisp_ret < 500", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to other common frontend stalls not categorized.", - "MetricExpr": "TOPDOWN_FE_BOUND.OTHER / SLOTS", - "MetricGroup": "TopdownL3;tma_frontend_bandwidth_group", - "MetricName": "tma_other_fb", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BadSpec;BrMispredicts", + "MetricName": "tma_info_ipmispredict", + "MetricThreshold": "tma_info_ipmispredict < 200", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend because allocation is stalled due to a mispredicted jump or a machine clear", - "MetricExpr": "(SLOTS - (TOPDOWN_FE_BOUND.ALL + TOPDOWN_BE_BOUND.ALL + TOPDOWN_RETIRING.ALL)) / SLOTS", - "MetricGroup": "TopdownL1", - "MetricName": "tma_bad_speculation", - "PublicDescription": "Counts the total number of issue slots that were not consumed by the backend because allocation is stalled due to a mispredicted jump or a machine clear. Only issue slots wasted due to fast nukes such as memory ordering nukes are counted. Other nukes are not accounted for. Counts all issue slots blocked during this recovery window including relevant microcode flows and while uops are not yet available in the instruction queue (IQ). Also includes the issue slots that were consumed by the backend but were thrown away because they were younger than the mispredict or machine clear.", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipstore", + "MetricThreshold": "tma_info_ipstore < 8", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to branch mispredicts.", - "MetricExpr": "TOPDOWN_BAD_SPECULATION.MISPREDICT / SLOTS", - "MetricGroup": "TopdownL2;tma_bad_speculation_group", - "MetricName": "tma_branch_mispredicts", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Instructions per Software prefetch instruction (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / cpu_core@SW_PREFETCH_ACCESS.T0\\,umask\\=0xF@", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_ipswpf", + "MetricThreshold": "tma_info_ipswpf < 100", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend because allocation is stalled due to a machine clear (nuke) of any kind including memory ordering and memory disambiguation.", - "MetricExpr": "TOPDOWN_BAD_SPECULATION.MACHINE_CLEARS / SLOTS", - "MetricGroup": "TopdownL2;tma_bad_speculation_group", - "MetricName": "tma_machine_clears", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", + "MetricName": "tma_info_iptb", + "MetricThreshold": "tma_info_iptb < 13", + "PublicDescription": "Instruction per taken branch. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_dsb_misses, tma_lcp", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to a machine clear (slow nuke).", - "MetricExpr": "TOPDOWN_BAD_SPECULATION.NUKE / SLOTS", - "MetricGroup": "TopdownL3;tma_machine_clears_group", - "MetricName": "tma_nuke", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Instructions per speculative Unknown Branch Misprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_instructions / BACLEARS.ANY", + "MetricGroup": "Fed", + "MetricName": "tma_info_ipunknown_branch", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of machine clears relative to the number of nuke slots due to SMC. ", - "MetricExpr": "tma_nuke * (MACHINE_CLEARS.SMC / MACHINE_CLEARS.SLOW)", - "MetricGroup": "TopdownL4;tma_nuke_group", - "MetricName": "tma_smc", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Fraction of branches that are unconditional (direct or indirect) jumps", + "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_jump", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of machine clears relative to the number of nuke slots due to memory ordering. ", - "MetricExpr": "tma_nuke * (MACHINE_CLEARS.MEMORY_ORDERING / MACHINE_CLEARS.SLOW)", - "MetricGroup": "TopdownL4;tma_nuke_group", - "MetricName": "tma_memory_ordering", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_cpi", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of machine clears relative to the number of nuke slots due to FP assists. ", - "MetricExpr": "tma_nuke * (MACHINE_CLEARS.FP_ASSIST / MACHINE_CLEARS.SLOW)", - "MetricGroup": "TopdownL4;tma_nuke_group", - "MetricName": "tma_fp_assist", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_utilization", + "MetricThreshold": "tma_info_kernel_utilization > 0.05", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of machine clears relative to the number of nuke slots due to memory disambiguation. ", - "MetricExpr": "tma_nuke * (MACHINE_CLEARS.DISAMBIGUATION / MACHINE_CLEARS.SLOW)", - "MetricGroup": "TopdownL4;tma_nuke_group", - "MetricName": "tma_disambiguation", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of machine clears relative to the number of nuke slots due to page faults. ", - "MetricExpr": "tma_nuke * (MACHINE_CLEARS.PAGE_FAULT / MACHINE_CLEARS.SLOW)", - "MetricGroup": "TopdownL4;tma_nuke_group", - "MetricName": "tma_page_fault", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "tma_info_l1d_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw_1t", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to a machine clear classified as a fast nuke due to memory ordering, memory disambiguation and memory renaming.", - "MetricExpr": "TOPDOWN_BAD_SPECULATION.FASTNUKE / SLOTS", - "MetricGroup": "TopdownL3;tma_machine_clears_group", - "MetricName": "tma_fast_nuke", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l1mpki", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls", - "MetricExpr": "TOPDOWN_BE_BOUND.ALL / SLOTS", - "MetricGroup": "TopdownL1", - "MetricName": "tma_backend_bound", - "PublicDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls. Note that uops must be available for consumption in order for this event to count. If a uop is not available (IQ is empty), this event will not count. The rest of these subevents count backend stalls, in cycles, due to an outstanding request which is memory bound vs core bound. The subevents are not slot based events and therefore can not be precisely added or subtracted from the Backend_Bound_Aux subevents which are slot based.", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "L1 cache true misses per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l1mpki_load", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of cycles due to backend bound stalls that are core execution bound and not attributed to outstanding demand load or store stalls. ", - "MetricExpr": "max(0, tma_backend_bound - tma_load_store_bound)", - "MetricGroup": "TopdownL2;tma_backend_bound_group", - "MetricName": "tma_core_bound", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of cycles the core is stalled due to stores or loads. ", - "MetricExpr": "min(tma_backend_bound, LD_HEAD.ANY_AT_RET / CLKS + tma_store_bound)", - "MetricGroup": "TopdownL2;tma_backend_bound_group", - "MetricName": "tma_load_store_bound", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "tma_info_l2_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw_1t", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of cycles the core is stalled due to store buffer full.", - "MetricExpr": "tma_st_buffer", - "MetricGroup": "TopdownL3;tma_load_store_bound_group", - "MetricName": "tma_store_bound", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "L2 cache hits per kilo instruction for all request types (including speculative)", + "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2hpki_all", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement due to a load block.", - "MetricExpr": "LD_HEAD.L1_BOUND_AT_RET / CLKS", - "MetricGroup": "TopdownL3;tma_load_store_bound_group", - "MetricName": "tma_l1_bound", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "L2 cache hits per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2hpki_load", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement due to a store forward block.", - "MetricExpr": "LD_HEAD.ST_ADDR_AT_RET / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_store_fwd", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheMisses;Mem", + "MetricName": "tma_info_l2mpki", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement due to a first level TLB miss.", - "MetricExpr": "LD_HEAD.DTLB_MISS_AT_RET / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_stlb_hit", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all request types (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem;Offcore", + "MetricName": "tma_info_l2mpki_all", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement due to a second level TLB miss requiring a page walk.", - "MetricExpr": "LD_HEAD.PGWALK_AT_RET / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_stlb_miss", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "L2 cache true code cacheline misses per kilo instruction", + "MetricExpr": "1e3 * FRONTEND_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_l2mpki_code", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement due to a number of other load blocks.", - "MetricExpr": "LD_HEAD.OTHER_AT_RET / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_other_l1", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "L2 cache speculative code cacheline misses per kilo instruction", + "MetricExpr": "1e3 * L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_l2mpki_code_all", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of cycles a core is stalled due to a demand load which hit in the L2 Cache.", - "MetricExpr": "MEM_BOUND_STALLS.LOAD_L2_HIT / CLKS - MEM_BOUND_STALLS_AT_RET_CORRECTION * MEM_BOUND_STALLS.LOAD_L2_HIT / MEM_BOUND_STALLS.LOAD", - "MetricGroup": "TopdownL3;tma_load_store_bound_group", - "MetricName": "tma_l2_bound", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2mpki_load", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of cycles a core is stalled due to a demand load which hit in the Last Level Cache (LLC) or other core with HITE/F/M.", - "MetricExpr": "MEM_BOUND_STALLS.LOAD_LLC_HIT / CLKS - MEM_BOUND_STALLS_AT_RET_CORRECTION * MEM_BOUND_STALLS.LOAD_LLC_HIT / MEM_BOUND_STALLS.LOAD", - "MetricGroup": "TopdownL3;tma_load_store_bound_group", - "MetricName": "tma_l3_bound", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Average per-core data access bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_l3_cache_access_bw", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of cycles the core is stalled due to a demand load miss which hit in DRAM or MMIO (Non-DRAM).", - "MetricExpr": "MEM_BOUND_STALLS.LOAD_DRAM_HIT / CLKS - MEM_BOUND_STALLS_AT_RET_CORRECTION * MEM_BOUND_STALLS.LOAD_DRAM_HIT / MEM_BOUND_STALLS.LOAD", - "MetricGroup": "TopdownL3;tma_load_store_bound_group", - "MetricName": "tma_dram_bound", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "tma_info_l3_cache_access_bw", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_l3_cache_access_bw_1t", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of cycles the core is stalled due to a demand load miss which hits in the L2, LLC, DRAM or MMIO (Non-DRAM) but could not be correctly attributed or cycles in which the load miss is waiting on a request buffer.", - "MetricExpr": "max(0, tma_load_store_bound - (tma_store_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound))", - "MetricGroup": "TopdownL3;tma_load_store_bound_group", - "MetricName": "tma_other_load_store", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls", - "MetricExpr": "tma_backend_bound", - "MetricGroup": "TopdownL1", - "MetricName": "tma_backend_bound_aux", - "PublicDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls. Note that UOPS must be available for consumption in order for this event to count. If a uop is not available (IQ is empty), this event will not count. All of these subevents count backend stalls, in slots, due to a resource limitation. These are not cycle based events and therefore can not be precisely added or subtracted from the Backend_Bound subevents which are cycle based. These subevents are supplementary to Backend_Bound and can be used to analyze results from a resource perspective at allocation. ", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "tma_info_l3_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw_1t", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls", - "MetricExpr": "tma_backend_bound", - "MetricGroup": "TopdownL2;tma_backend_bound_aux_group", - "MetricName": "tma_resource_bound", - "PublicDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls. Note that uops must be available for consumption in order for this event to count. If a uop is not available (IQ is empty), this event will not count. ", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l3mpki", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to memory reservation stalls in which a scheduler is not able to accept uops.", - "MetricExpr": "TOPDOWN_BE_BOUND.MEM_SCHEDULER / SLOTS", - "MetricGroup": "TopdownL3;tma_resource_bound_group", - "MetricName": "tma_mem_scheduler", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Average Latency for L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_load_l2_miss_latency", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of cycles, relative to the number of mem_scheduler slots, in which uops are blocked due to store buffer full", - "MetricExpr": "tma_mem_scheduler * (MEM_SCHEDULER_BLOCK.ST_BUF / MEM_SCHEDULER_BLOCK.ALL)", - "MetricGroup": "TopdownL4;tma_mem_scheduler_group", - "MetricName": "tma_st_buffer", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=1@", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_load_l2_mlp", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of cycles, relative to the number of mem_scheduler slots, in which uops are blocked due to load buffer full", - "MetricExpr": "tma_mem_scheduler * MEM_SCHEDULER_BLOCK.LD_BUF / MEM_SCHEDULER_BLOCK.ALL", - "MetricGroup": "TopdownL4;tma_mem_scheduler_group", - "MetricName": "tma_ld_buffer", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Average Latency for L3 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAND_DATA_RD / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_load_l3_miss_latency", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of cycles, relative to the number of mem_scheduler slots, in which uops are blocked due to RSV full relative ", - "MetricExpr": "tma_mem_scheduler * MEM_SCHEDULER_BLOCK.RSV / MEM_SCHEDULER_BLOCK.ALL", - "MetricGroup": "TopdownL4;tma_mem_scheduler_group", - "MetricName": "tma_rsv", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", + "MetricExpr": "L1D_PEND_MISS.PENDING / MEM_LOAD_COMPLETED.L1_MISS_ANY", + "MetricGroup": "Mem;MemoryBound;MemoryLat", + "MetricName": "tma_info_load_miss_real_latency", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to IEC or FPC RAT stalls, which can be due to FIQ or IEC reservation stalls in which the integer, floating point or SIMD scheduler is not able to accept uops.", - "MetricExpr": "TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER / SLOTS", - "MetricGroup": "TopdownL3;tma_resource_bound_group", - "MetricName": "tma_non_mem_scheduler", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "STLB (2nd level TLB) data load speculative misses per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_load_stlb_mpki", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to the physical register file unable to accept an entry (marble stalls).", - "MetricExpr": "TOPDOWN_BE_BOUND.REGISTER / SLOTS", - "MetricGroup": "TopdownL3;tma_resource_bound_group", - "MetricName": "tma_register", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Fraction of Uops delivered by the LSD (Loop Stream Detector; aka Loop Cache)", + "MetricExpr": "LSD.UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "Fed;LSD", + "MetricName": "tma_info_lsd_coverage", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to the reorder buffer being full (ROB stalls).", - "MetricExpr": "TOPDOWN_BE_BOUND.REORDER_BUFFER / SLOTS", - "MetricGroup": "TopdownL3;tma_resource_bound_group", - "MetricName": "tma_reorder_buffer", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Average number of parallel data read requests to external memory", + "MetricExpr": "UNC_ARB_DAT_OCCUPANCY.RD / cpu_core@UNC_ARB_DAT_OCCUPANCY.RD\\,cmask\\=1@", + "MetricGroup": "Mem;MemoryBW;SoC", + "MetricName": "tma_info_mem_parallel_reads", + "PublicDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to certain allocation restrictions.", - "MetricExpr": "TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS / SLOTS", - "MetricGroup": "TopdownL3;tma_resource_bound_group", - "MetricName": "tma_alloc_restriction", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Average latency of data read request to external memory (in nanoseconds)", + "MetricExpr": "(UNC_ARB_TRK_OCCUPANCY.RD + UNC_ARB_DAT_OCCUPANCY.RD) / UNC_ARB_TRK_REQUESTS.RD", + "MetricGroup": "Mem;MemoryLat;SoC", + "MetricName": "tma_info_mem_read_latency", + "PublicDescription": "Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches. ([RKL+]memory-controller only)", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to scoreboards from the instruction queue (IQ), jump execution unit (JEU), or microcode sequencer (MS).", - "MetricExpr": "TOPDOWN_BE_BOUND.SERIALIZATION / SLOTS", - "MetricGroup": "TopdownL3;tma_resource_bound_group", - "MetricName": "tma_serialization", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Average latency of all requests to external memory (in Uncore cycles)", + "MetricExpr": "(UNC_ARB_TRK_OCCUPANCY.ALL + UNC_ARB_DAT_OCCUPANCY.RD) / UNC_ARB_TRK_REQUESTS.ALL", + "MetricGroup": "Mem;SoC", + "MetricName": "tma_info_mem_request_latency", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the numer of issue slots that result in retirement slots. ", - "MetricExpr": "TOPDOWN_RETIRING.ALL / SLOTS", - "MetricGroup": "TopdownL1", - "MetricName": "tma_retiring", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks", + "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full))) + tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_fb_full / (tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk))", + "MetricGroup": "Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_info_memory_bandwidth", + "MetricThreshold": "tma_info_memory_bandwidth > 20", + "PublicDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_mem_bandwidth, tma_sq_full", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of uops that are not from the microsequencer. ", - "MetricExpr": "(TOPDOWN_RETIRING.ALL - UOPS_RETIRED.MS) / SLOTS", - "MetricGroup": "TopdownL2;tma_retiring_group", - "MetricName": "tma_base", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)", + "MetricExpr": "100 * tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming_stores)))", + "MetricGroup": "Mem;MemoryTLB;Offcore;tma_issueTLB", + "MetricName": "tma_info_memory_data_tlbs", + "MetricThreshold": "tma_info_memory_data_tlbs > 20", + "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of floating point operations per uop with all default weighting.", - "MetricExpr": "UOPS_RETIRED.FPDIV / SLOTS", - "MetricGroup": "TopdownL3;tma_base_group", - "MetricName": "tma_fp_uops", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Total pipeline cost of Memory Latency related bottlenecks (external memory and off-core caches)", + "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound))", + "MetricGroup": "Mem;MemoryLat;Offcore;tma_issueLat", + "MetricName": "tma_info_memory_latency", + "MetricThreshold": "tma_info_memory_latency > 20", + "PublicDescription": "Total pipeline cost of Memory Latency related bottlenecks (external memory and off-core caches). Related metrics: tma_l3_hit_latency, tma_mem_latency", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of uops retired excluding ms and fp div uops.", - "MetricExpr": "(TOPDOWN_RETIRING.ALL - UOPS_RETIRED.MS - UOPS_RETIRED.FPDIV) / SLOTS", - "MetricGroup": "TopdownL3;tma_base_group", - "MetricName": "tma_other_ret", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks", + "MetricExpr": "100 * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_mispredictions", + "MetricThreshold": "tma_info_mispredictions > 20", + "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_branch_misprediction_cost, tma_mispredicts_resteers", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of uops that are from the complex flows issued by the micro-sequencer (MS)", - "MetricExpr": "UOPS_RETIRED.MS / SLOTS", - "MetricGroup": "TopdownL2;tma_retiring_group", - "MetricName": "tma_ms_uops", - "PublicDescription": "Counts the number of uops that are from the complex flows issued by the micro-sequencer (MS). This includes uops from flows due to complex instructions, faults, assists, and inserted flows.", - "ScaleUnit": "100%", - "Unit": "cpu_atom" + "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss", + "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES", + "MetricGroup": "Mem;MemoryBW;MemoryBound", + "MetricName": "tma_info_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)", + "Unit": "cpu_core" }, { - "BriefDescription": "", - "MetricExpr": "CPU_CLK_UNHALTED.CORE", - "MetricName": "CLKS", - "Unit": "cpu_atom" + "BriefDescription": "Fraction of branches of other types (not individually covered by other metrics in Info.Branches group)", + "MetricExpr": "1 - (tma_info_cond_nt + tma_info_cond_tk + tma_info_callret + tma_info_jump)", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_other_branches", + "Unit": "cpu_core" }, { - "BriefDescription": "", - "MetricExpr": "CPU_CLK_UNHALTED.CORE_P", - "MetricName": "CLKS_P", - "Unit": "cpu_atom" + "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", + "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_PENDING + DTLB_STORE_MISSES.WALK_PENDING) / (4 * tma_info_core_clks)", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_page_walks_utilization", + "MetricThreshold": "tma_info_page_walks_utilization > 0.5", + "Unit": "cpu_core" }, { - "BriefDescription": "", - "MetricExpr": "5 * CLKS", - "MetricName": "SLOTS", - "Unit": "cpu_atom" + "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", + "MetricExpr": "tma_retiring * tma_info_slots / cpu_core@UOPS_RETIRED.SLOTS\\,cmask\\=1@", + "MetricGroup": "Pipeline;Ret", + "MetricName": "tma_info_retire", + "Unit": "cpu_core" }, { - "BriefDescription": "Instructions Per Cycle", - "MetricExpr": "INST_RETIRED.ANY / CLKS", - "MetricName": "IPC", - "Unit": "cpu_atom" + "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", + "MetricExpr": "TOPDOWN.SLOTS", + "MetricGroup": "TmaL1;tma_L1_group", + "MetricName": "tma_info_slots", + "Unit": "cpu_core" }, { - "BriefDescription": "Cycles Per Instruction", - "MetricExpr": "CLKS / INST_RETIRED.ANY", - "MetricName": "CPI", - "Unit": "cpu_atom" + "BriefDescription": "Fraction of Physical Core issue-slots utilized by this Logical Processor", + "MetricExpr": "(tma_info_slots / (TOPDOWN.SLOTS / 2) if #SMT_on else 1)", + "MetricGroup": "SMT;TmaL1;tma_L1_group", + "MetricName": "tma_info_slots_utilization", + "Unit": "cpu_core" }, { - "BriefDescription": "Uops Per Instruction", - "MetricExpr": "UOPS_RETIRED.ALL / INST_RETIRED.ANY", - "MetricName": "UPI", - "Unit": "cpu_atom" + "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", + "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_DISTRIBUTED if #SMT_on else 0)", + "MetricGroup": "SMT", + "MetricName": "tma_info_smt_2t_utilization", + "Unit": "cpu_core" }, { - "BriefDescription": "Percentage of total non-speculative loads with a store forward or unknown store address block", - "MetricExpr": "100 * LD_BLOCKS.DATA_UNKNOWN / MEM_UOPS_RETIRED.ALL_LOADS", - "MetricName": "Store_Fwd_Blocks", - "Unit": "cpu_atom" + "BriefDescription": "Socket actual clocks when any core is active on that socket", + "MetricExpr": "UNC_CLOCK.SOCKET", + "MetricGroup": "SoC", + "MetricName": "tma_info_socket_clks", + "Unit": "cpu_core" }, { - "BriefDescription": "Percentage of total non-speculative loads with a address aliasing block", - "MetricExpr": "100 * LD_BLOCKS.4K_ALIAS / MEM_UOPS_RETIRED.ALL_LOADS", - "MetricName": "Address_Alias_Blocks", - "Unit": "cpu_atom" + "BriefDescription": "STLB (2nd level TLB) data store speculative misses per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_store_stlb_mpki", + "Unit": "cpu_core" }, { - "BriefDescription": "Percentage of total non-speculative loads that are splits", - "MetricExpr": "100 * MEM_UOPS_RETIRED.SPLIT_LOADS / MEM_UOPS_RETIRED.ALL_LOADS", - "MetricName": "Load_Splits", - "Unit": "cpu_atom" + "BriefDescription": "Estimated fraction of retirement-cycles dealing with repeat instructions", + "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu_core@UOPS_RETIRED.SLOTS\\,cmask\\=1@", + "MetricGroup": "Pipeline;Ret", + "MetricName": "tma_info_strings_cycles", + "MetricThreshold": "tma_info_strings_cycles > 0.1", + "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per Branch (lower number means higher occurance rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", - "MetricName": "IpBranch", - "Unit": "cpu_atom" + "BriefDescription": "Average Frequency Utilization relative nominal frequency", + "MetricExpr": "tma_info_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_turbo_utilization", + "Unit": "cpu_core" }, { - "BriefDescription": "Instruction per (near) call (lower number means higher occurance rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.CALL", - "MetricName": "IpCall", - "Unit": "cpu_atom" + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "tma_retiring * tma_info_slots / INST_RETIRED.ANY", + "MetricGroup": "Pipeline;Ret;Retire", + "MetricName": "tma_info_uoppi", + "MetricThreshold": "tma_info_uoppi > 1.05", + "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per Load", - "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_LOADS", - "MetricName": "IpLoad", - "Unit": "cpu_atom" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "tma_retiring * tma_info_slots / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW", + "MetricName": "tma_info_uptb", + "MetricThreshold": "tma_info_uptb < 9", + "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per Store", - "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_STORES", - "MetricName": "IpStore", - "Unit": "cpu_atom" + "BriefDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_int_vector_128b + tma_int_vector_256b + tma_shuffles", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_int_operations", + "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired). Vector/Matrix Int operations and shuffles are counted. Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain.", + "ScaleUnit": "100%", + "Unit": "cpu_core" }, { - "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction", - "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricName": "IpMispredict", - "Unit": "cpu_atom" + "BriefDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired", + "MetricExpr": "(INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128) / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;tma_int_operations_group;tma_issue2P", + "MetricName": "tma_int_vector_128b", + "MetricThreshold": "tma_int_vector_128b > 0.1 & (tma_int_operations > 0.1 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per Far Branch", - "MetricExpr": "INST_RETIRED.ANY / (BR_INST_RETIRED.FAR_BRANCH / 2)", - "MetricName": "IpFarBranch", - "Unit": "cpu_atom" + "BriefDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired", + "MetricExpr": "(INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 + INT_VEC_RETIRED.VNNI_256) / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;tma_int_operations_group;tma_issue2P", + "MetricName": "tma_int_vector_256b", + "MetricThreshold": "tma_int_vector_256b > 0.1 & (tma_int_operations > 0.1 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_core" }, { - "BriefDescription": "Ratio of all branches which mispredict", - "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.ALL_BRANCHES", - "MetricName": "Branch_Mispredict_Ratio", - "Unit": "cpu_atom" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses", + "MetricExpr": "ICACHE_TAG.STALLS / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_itlb_misses", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", + "ScaleUnit": "100%", + "Unit": "cpu_core" }, { - "BriefDescription": "Ratio between Mispredicted branches and unknown branches", - "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / BACLEARS.ANY", - "MetricName": "Branch_Mispredict_to_Unknown_Branch_Ratio", - "Unit": "cpu_atom" + "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", + "MetricExpr": "max((EXE_ACTIVITY.BOUND_ON_LOADS - MEMORY_ACTIVITY.STALLS_L1D_MISS) / tma_info_clks, 0)", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group", + "MetricName": "tma_l1_bound", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB_HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "ScaleUnit": "100%", + "Unit": "cpu_core" }, { - "BriefDescription": "Percentage of all uops which are ucode ops", - "MetricExpr": "100 * UOPS_RETIRED.MS / UOPS_RETIRED.ALL", - "MetricName": "Microcode_Uop_Ratio", - "Unit": "cpu_atom" + "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L1D_MISS - MEMORY_ACTIVITY.STALLS_L2_MISS) / tma_info_clks", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l2_bound", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT_PS", + "ScaleUnit": "100%", + "Unit": "cpu_core" }, { - "BriefDescription": "Percentage of all uops which are FPDiv uops", - "MetricExpr": "100 * UOPS_RETIRED.FPDIV / UOPS_RETIRED.ALL", - "MetricName": "FPDiv_Uop_Ratio", - "Unit": "cpu_atom" + "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", + "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L2_MISS - MEMORY_ACTIVITY.STALLS_L3_MISS) / tma_info_clks", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l3_bound", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "ScaleUnit": "100%", + "Unit": "cpu_core" }, { - "BriefDescription": "Percentage of all uops which are IDiv uops", - "MetricExpr": "100 * UOPS_RETIRED.IDIV / UOPS_RETIRED.ALL", - "MetricName": "IDiv_Uop_Ratio", - "Unit": "cpu_atom" + "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", + "MetricExpr": "9 * tma_info_average_frequency * MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group", + "MetricName": "tma_l3_hit_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tma_info_memory_latency, tma_mem_latency", + "ScaleUnit": "100%", + "Unit": "cpu_core" }, { - "BriefDescription": "Percentage of all uops which are x87 uops", - "MetricExpr": "100 * UOPS_RETIRED.X87 / UOPS_RETIRED.ALL", - "MetricName": "X87_Uop_Ratio", - "Unit": "cpu_atom" + "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", + "MetricExpr": "DECODE.LCP / tma_info_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_lcp", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_dsb_misses, tma_info_iptb", + "ScaleUnit": "100%", + "Unit": "cpu_core" }, { - "BriefDescription": "Average Frequency Utilization relative nominal frequency", - "MetricExpr": "CLKS / CPU_CLK_UNHALTED.REF_TSC", - "MetricName": "Turbo_Utilization", - "Unit": "cpu_atom" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", + "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_light_operations", + "MetricThreshold": "tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", + "ScaleUnit": "100%", + "Unit": "cpu_core" }, { - "BriefDescription": "Fraction of cycles spent in Kernel mode", - "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.CORE@k / CPU_CLK_UNHALTED.CORE", - "MetricName": "Kernel_Utilization", - "Unit": "cpu_atom" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations", + "MetricExpr": "UOPS_DISPATCHED.PORT_2_3_10 / (3 * tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_load_op_utilization", + "MetricThreshold": "tma_load_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations. Sample with: UOPS_DISPATCHED.PORT_2_3_10", + "ScaleUnit": "100%", + "Unit": "cpu_core" }, { - "BriefDescription": "Average CPU Utilization", - "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", - "MetricName": "CPU_Utilization", - "Unit": "cpu_atom" + "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) DTLB was missed by load accesses, that later on hit in second-level TLB (STLB)", + "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group", + "MetricName": "tma_load_stlb_hit", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "ScaleUnit": "100%", + "Unit": "cpu_core" }, { - "BriefDescription": "Cycle cost per L2 hit", - "MetricExpr": "MEM_BOUND_STALLS.LOAD_L2_HIT / MEM_LOAD_UOPS_RETIRED.L2_HIT", - "MetricName": "Cycles_per_Demand_Load_L2_Hit", - "Unit": "cpu_atom" + "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by load accesses, performing a hardware page walk", + "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group", + "MetricName": "tma_load_stlb_miss", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "ScaleUnit": "100%", + "Unit": "cpu_core" }, { - "BriefDescription": "Cycle cost per LLC hit", - "MetricExpr": "MEM_BOUND_STALLS.LOAD_LLC_HIT / MEM_LOAD_UOPS_RETIRED.L3_HIT", - "MetricName": "Cycles_per_Demand_Load_L3_Hit", - "Unit": "cpu_atom" + "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS.ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10 * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO))) / tma_info_clks", + "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group", + "MetricName": "tma_lock_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOADS_PS. Related metrics: tma_store_latency", + "ScaleUnit": "100%", + "Unit": "cpu_core" }, { - "BriefDescription": "Cycle cost per DRAM hit", - "MetricExpr": "MEM_BOUND_STALLS.LOAD_DRAM_HIT / MEM_LOAD_UOPS_RETIRED.DRAM_HIT", - "MetricName": "Cycles_per_Demand_Load_DRAM_Hit", - "Unit": "cpu_atom" + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to LSD (Loop Stream Detector) unit", + "MetricExpr": "(LSD.CYCLES_ACTIVE - LSD.CYCLES_OK) / tma_info_core_clks / 2", + "MetricGroup": "FetchBW;LSD;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_lsd", + "MetricThreshold": "tma_lsd > 0.15 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 6 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to LSD (Loop Stream Detector) unit. LSD typically does well sustaining Uop supply. However; in some rare cases; optimal uop-delivery could not be reached for small loops whose size (in terms of number of uops) does not suit well the LSD structure.", + "ScaleUnit": "100%", + "Unit": "cpu_core" }, { - "BriefDescription": "Percent of instruction miss cost that hit in the L2", - "MetricExpr": "100 * MEM_BOUND_STALLS.IFETCH_L2_HIT / MEM_BOUND_STALLS.IFETCH", - "MetricName": "Inst_Miss_Cost_L2Hit_Percent", - "Unit": "cpu_atom" + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", + "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)", + "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache", + "ScaleUnit": "100%", + "Unit": "cpu_core" }, { - "BriefDescription": "Percent of instruction miss cost that hit in the L3", - "MetricExpr": "100 * MEM_BOUND_STALLS.IFETCH_LLC_HIT / MEM_BOUND_STALLS.IFETCH", - "MetricName": "Inst_Miss_Cost_L3Hit_Percent", - "Unit": "cpu_atom" + "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu_core@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=4@) / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW", + "MetricName": "tma_mem_bandwidth", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_info_memory_bandwidth, tma_sq_full", + "ScaleUnit": "100%", + "Unit": "cpu_core" }, { - "BriefDescription": "Percent of instruction miss cost that hit in DRAM", - "MetricExpr": "100 * MEM_BOUND_STALLS.IFETCH_DRAM_HIT / MEM_BOUND_STALLS.IFETCH", - "MetricName": "Inst_Miss_Cost_DRAMHit_Percent", - "Unit": "cpu_atom" + "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_clks - tma_mem_bandwidth", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat", + "MetricName": "tma_mem_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_info_memory_latency, tma_l3_hit_latency", + "ScaleUnit": "100%", + "Unit": "cpu_core" }, { - "BriefDescription": "load ops retired per 1000 instruction", - "MetricExpr": "1e3 * MEM_UOPS_RETIRED.ALL_LOADS / INST_RETIRED.ANY", - "MetricName": "MemLoadPKI", - "Unit": "cpu_atom" + "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", + "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_slots", + "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", + "ScaleUnit": "100%", + "Unit": "cpu_core" }, { - "BriefDescription": "C1 residency percent per core", - "MetricExpr": "cstate_core@c1\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C1_Core_Residency", - "ScaleUnit": "100%" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to LFENCE Instructions.", + "MetricExpr": "13 * MISC2_RETIRED.LFENCE / tma_info_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_serializing_operation_group", + "MetricName": "tma_memory_fence", + "MetricThreshold": "tma_memory_fence > 0.05 & (tma_serializing_operation > 0.1 & (tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))))", + "ScaleUnit": "100%", + "Unit": "cpu_core" }, { - "BriefDescription": "C6 residency percent per core", - "MetricExpr": "cstate_core@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Core_Residency", - "ScaleUnit": "100%" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations -- uops for memory load or store accesses.", + "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_retiring * tma_info_slots)", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_memory_operations", + "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operations > 0.6", + "ScaleUnit": "100%", + "Unit": "cpu_core" }, { - "BriefDescription": "C7 residency percent per core", - "MetricExpr": "cstate_core@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Core_Residency", - "ScaleUnit": "100%" + "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", + "MetricExpr": "UOPS_RETIRED.MS / tma_info_slots", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS", + "MetricName": "tma_microcode_sequencer", + "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_core" }, { - "BriefDescription": "C2 residency percent per package", - "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C2_Pkg_Residency", - "ScaleUnit": "100%" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage", + "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_clks", + "MetricGroup": "BadSpec;BrMispredicts;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueBM", + "MetricName": "tma_mispredicts_resteers", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_branch_mispredicts, tma_info_branch_misprediction_cost, tma_info_mispredictions", + "ScaleUnit": "100%", + "Unit": "cpu_core" }, { - "BriefDescription": "C3 residency percent per package", - "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Pkg_Residency", - "ScaleUnit": "100%" + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", + "MetricExpr": "(IDQ.MITE_CYCLES_ANY - IDQ.MITE_CYCLES_OK) / tma_info_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_mite", + "MetricThreshold": "tma_mite > 0.1 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 6 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sample with: FRONTEND_RETIRED.ANY_DSB_MISS", + "ScaleUnit": "100%", + "Unit": "cpu_core" }, { - "BriefDescription": "C6 residency percent per package", - "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Pkg_Residency", - "ScaleUnit": "100%" + "BriefDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued", + "MetricExpr": "160 * ASSISTS.SSE_AVX_MIX / tma_info_clks", + "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utilized_0_group", + "MetricName": "tma_mixing_vectors", + "MetricThreshold": "tma_mixing_vectors > 0.05", + "PublicDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued. Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic. Related metrics: tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_core" }, { - "BriefDescription": "C7 residency percent per package", - "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Pkg_Residency", - "ScaleUnit": "100%" + "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", + "MetricExpr": "3 * cpu_core@UOPS_RETIRED.MS\\,cmask\\=1\\,edge@ / (tma_retiring * tma_info_slots / UOPS_ISSUED.ANY) / tma_info_clks", + "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: FRONTEND_RETIRED.MS_FLOWS. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", + "ScaleUnit": "100%", + "Unit": "cpu_core" }, { - "BriefDescription": "C8 residency percent per package", - "MetricExpr": "cstate_pkg@c8\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C8_Pkg_Residency", - "ScaleUnit": "100%" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused", + "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHES - INST_RETIRED.MACRO_FUSED) / (tma_retiring * tma_info_slots)", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_non_fused_branches", + "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused. Non-conditional branches like direct JMP or CALL would count here. Can be used to examine fusible conditional jumps that were not fused.", + "ScaleUnit": "100%", + "Unit": "cpu_core" }, { - "BriefDescription": "C9 residency percent per package", - "MetricExpr": "cstate_pkg@c9\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C9_Pkg_Residency", - "ScaleUnit": "100%" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_retiring * tma_info_slots)", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_nop_instructions", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions. Compilers often use NOPs for certain address alignments - e.g. start address of a function or loop body. Sample with: INST_RETIRED.NOP", + "ScaleUnit": "100%", + "Unit": "cpu_core" }, { - "BriefDescription": "C10 residency percent per package", - "MetricExpr": "cstate_pkg@c10\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C10_Pkg_Residency", - "ScaleUnit": "100%" + "BriefDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes", + "MetricExpr": "max(0, tma_light_operations - (tma_fp_arith + tma_int_operations + tma_memory_operations + tma_fused_instructions + tma_non_fused_branches + tma_nop_instructions))", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_other_light_ops", + "MetricThreshold": "tma_other_light_ops > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes. May undercount due to FMA double counting", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Page Faults", + "MetricExpr": "99 * ASSISTS.PAGE_FAULT / tma_info_slots", + "MetricGroup": "TopdownL5;tma_L5_group;tma_assists_group", + "MetricName": "tma_page_faults", + "MetricThreshold": "tma_page_faults > 0.05", + "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Page Faults. A Page Fault may apply on first application access to a memory page. Note operating system handling of page faults accounts for the majority of its cost.", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch)", + "MetricExpr": "UOPS_DISPATCHED.PORT_0 / tma_info_core_clks", + "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_0", + "MetricThreshold": "tma_port_0 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU)", + "MetricExpr": "UOPS_DISPATCHED.PORT_1 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_1", + "MetricThreshold": "tma_port_1 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU)", + "MetricExpr": "UOPS_DISPATCHED.PORT_6 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_6", + "MetricThreshold": "tma_port_6 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU). Sample with: UOPS_DISPATCHED.PORT_6. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", + "MetricExpr": "((cpu_core@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=0x80@ + tma_serializing_operation * (CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS) + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * cpu_core@EXE_ACTIVITY.2_PORTS_UTIL\\,umask\\=0xc@)) / tma_info_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * cpu_core@EXE_ACTIVITY.2_PORTS_UTIL\\,umask\\=0xc@) / tma_info_clks)", + "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_ports_utilization", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "cpu_core@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=0x80@ / tma_info_clks + tma_serializing_operation * (CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS) / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_0", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_1", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related metrics: tma_l1_bound", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Sample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_3m", + "MetricThreshold": "tma_ports_utilized_3m > 0.7 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Sample with: UOPS_EXECUTED.CYCLES_GE_3", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1", + "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations", + "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL5;tma_L5_group;tma_issueSO;tma_ports_utilized_0_group", + "MetricName": "tma_serializing_operation", + "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations. Instructions like CPUID; WRMSR or LFENCE serialize the out-of-order execution which may limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metrics: tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Shuffle (cross \"vector lane\" data transfers) uops fraction the CPU has retired.", + "MetricExpr": "INT_VEC_RETIRED.SHUFFLES / (tma_retiring * tma_info_slots)", + "MetricGroup": "HPC;Pipeline;TopdownL4;tma_L4_group;tma_int_operations_group", + "MetricName": "tma_shuffles", + "MetricThreshold": "tma_shuffles > 0.1 & (tma_int_operations > 0.1 & tma_light_operations > 0.6)", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions", + "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / tma_info_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_serializing_operation_group", + "MetricName": "tma_slow_pause", + "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_operation > 0.1 & (tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.PAUSE_INST", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary", + "MetricExpr": "tma_info_load_miss_real_latency * LD_BLOCKS.NO_SR / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_split_loads", + "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents rate of split store accesses", + "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group", + "MetricName": "tma_split_stores", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_4", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", + "MetricExpr": "(XQ.FULL_CYCLES + L1D_PEND_MISS.L2_STALLS) / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group", + "MetricName": "tma_sq_full", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_info_memory_bandwidth, tma_mem_bandwidth", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", + "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES_PS", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", + "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_store_fwd_blk", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", + "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_clks", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group", + "MetricName": "tma_store_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations", + "MetricExpr": "(UOPS_DISPATCHED.PORT_4_9 + UOPS_DISPATCHED.PORT_7_8) / (4 * tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_store_op_utilization", + "MetricThreshold": "tma_store_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations. Sample with: UOPS_DISPATCHED.PORT_7_8", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of cycles where the TLB was missed by store accesses, hitting in the second-level TLB (STLB)", + "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", + "MetricName": "tma_store_stlb_hit", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk", + "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_clks", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", + "MetricName": "tma_store_stlb_miss", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores", + "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueSmSt;tma_store_bound_group", + "MetricName": "tma_streaming_stores", + "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should Streaming stores be a bottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma_fb_full", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group", + "MetricName": "tma_unknown_branches", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (e.g. first time the branch is fetched or hitting BTB capacity limit). Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric serves as an approximation of legacy x87 usage", + "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD", + "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", + "MetricName": "tma_x87_use", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", + "ScaleUnit": "100%", + "Unit": "cpu_core" } ] diff --git a/tools/perf/pmu-events/arch/x86/alderlake/cache.json b/tools/perf/pmu-events/arch/x86/alderlake/cache.json index adc9887b8ae0..51770416bcc2 100644 --- a/tools/perf/pmu-events/arch/x86/alderlake/cache.json +++ b/tools/perf/pmu-events/arch/x86/alderlake/cache.json @@ -92,19 +92,19 @@ "Unit": "cpu_core" }, { - "BriefDescription": "All accesses to L2 cache[This event is alias to L2_RQSTS.REFERENCES]", + "BriefDescription": "All accesses to L2 cache [This event is alias to L2_RQSTS.REFERENCES]", "EventCode": "0x24", "EventName": "L2_REQUEST.ALL", - "PublicDescription": "Counts all requests that were hit or true misses in L2 cache. True-miss excludes misses that were merged with ongoing L2 misses.[This event is alias to L2_RQSTS.REFERENCES]", + "PublicDescription": "Counts all requests that were hit or true misses in L2 cache. True-miss excludes misses that were merged with ongoing L2 misses. [This event is alias to L2_RQSTS.REFERENCES]", "SampleAfterValue": "200003", "UMask": "0xff", "Unit": "cpu_core" }, { - "BriefDescription": "Read requests with true-miss in L2 cache.[This event is alias to L2_RQSTS.MISS]", + "BriefDescription": "Read requests with true-miss in L2 cache. [This event is alias to L2_RQSTS.MISS]", "EventCode": "0x24", "EventName": "L2_REQUEST.MISS", - "PublicDescription": "Counts read requests of any type with true-miss in the L2 cache. True-miss excludes L2 misses that were merged with ongoing L2 misses.[This event is alias to L2_RQSTS.MISS]", + "PublicDescription": "Counts read requests of any type with true-miss in the L2 cache. True-miss excludes L2 misses that were merged with ongoing L2 misses. [This event is alias to L2_RQSTS.MISS]", "SampleAfterValue": "200003", "UMask": "0x3f", "Unit": "cpu_core" @@ -198,19 +198,19 @@ "Unit": "cpu_core" }, { - "BriefDescription": "Read requests with true-miss in L2 cache.[This event is alias to L2_REQUEST.MISS]", + "BriefDescription": "Read requests with true-miss in L2 cache. [This event is alias to L2_REQUEST.MISS]", "EventCode": "0x24", "EventName": "L2_RQSTS.MISS", - "PublicDescription": "Counts read requests of any type with true-miss in the L2 cache. True-miss excludes L2 misses that were merged with ongoing L2 misses.[This event is alias to L2_REQUEST.MISS]", + "PublicDescription": "Counts read requests of any type with true-miss in the L2 cache. True-miss excludes L2 misses that were merged with ongoing L2 misses. [This event is alias to L2_REQUEST.MISS]", "SampleAfterValue": "200003", "UMask": "0x3f", "Unit": "cpu_core" }, { - "BriefDescription": "All accesses to L2 cache[This event is alias to L2_REQUEST.ALL]", + "BriefDescription": "All accesses to L2 cache [This event is alias to L2_REQUEST.ALL]", "EventCode": "0x24", "EventName": "L2_RQSTS.REFERENCES", - "PublicDescription": "Counts all requests that were hit or true misses in L2 cache. True-miss excludes misses that were merged with ongoing L2 misses.[This event is alias to L2_REQUEST.ALL]", + "PublicDescription": "Counts all requests that were hit or true misses in L2 cache. True-miss excludes misses that were merged with ongoing L2 misses. [This event is alias to L2_REQUEST.ALL]", "SampleAfterValue": "200003", "UMask": "0xff", "Unit": "cpu_core" @@ -895,7 +895,7 @@ "Unit": "cpu_atom" }, { - "BriefDescription": "OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD", + "BriefDescription": "Counts demand data reads that resulted in a snoop hit in another cores caches which forwarded the unmodified data to the requesting core.", "EventCode": "0x2A,0x2B", "EventName": "OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD", "MSRIndex": "0x1a6,0x1a7", @@ -980,6 +980,15 @@ "UMask": "0x8", "Unit": "cpu_core" }, + { + "BriefDescription": "Cycles where at least 1 outstanding demand data read request is pending.", + "CounterMask": "1", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_DATA_RD", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, { "BriefDescription": "For every cycle where the core is waiting on at least 1 outstanding Demand RFO request, increments by 1.", "CounterMask": "1", @@ -999,6 +1008,15 @@ "UMask": "0x8", "Unit": "cpu_core" }, + { + "BriefDescription": "For every cycle, increments by the number of outstanding demand data read requests pending.", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD", + "PublicDescription": "For every cycle, increments by the number of outstanding demand data read requests pending. Requests are considered outstanding from the time they miss the core's L2 cache until the transaction completion message is sent to the requestor.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, { "BriefDescription": "Number of PREFETCHNTA instructions executed.", "EventCode": "0x40", diff --git a/tools/perf/pmu-events/arch/x86/alderlake/floating-point.json b/tools/perf/pmu-events/arch/x86/alderlake/floating-point.json index 3eb7cab9b431..c8ba96c4a7f8 100644 --- a/tools/perf/pmu-events/arch/x86/alderlake/floating-point.json +++ b/tools/perf/pmu-events/arch/x86/alderlake/floating-point.json @@ -85,6 +85,24 @@ "UMask": "0x20", "Unit": "cpu_core" }, + { + "BriefDescription": "Number of SSE/AVX computational 128-bit packed single and 256-bit packed double precision FP instructions retired; some instructions will count twice as noted below. Each count represents 2 or/and 4 computation operations, 1 for each element. Applies to SSE* and AVX* packed single precision and packed double precision FP instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB count twice as they perform 2 calculations per element.", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.4_FLOPS", + "PublicDescription": "Number of SSE/AVX computational 128-bit packed single precision and 256-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 2 or/and 4 computation operations, one for each element. Applies to SSE* and AVX* packed single precision floating-point and packed double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.", + "SampleAfterValue": "100003", + "UMask": "0x18", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of SSE/AVX computational scalar floating-point instructions retired; some instructions will count twice as noted below. Applies to SSE* and AVX* scalar, double and single precision floating-point: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14 RANGE SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform multiple calculations per element.", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.SCALAR", + "PublicDescription": "Number of SSE/AVX computational scalar single precision and double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 1 computational operation. Applies to SSE* and AVX* scalar single precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT RCP FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.", + "SampleAfterValue": "1000003", + "UMask": "0x3", + "Unit": "cpu_core" + }, { "BriefDescription": "Counts number of SSE/AVX computational scalar double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 1 computational operation. Applies to SSE* and AVX* scalar double precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.", "EventCode": "0xc7", @@ -103,6 +121,15 @@ "UMask": "0x2", "Unit": "cpu_core" }, + { + "BriefDescription": "Number of any Vector retired FP arithmetic instructions", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.VECTOR", + "PublicDescription": "Number of any Vector retired FP arithmetic instructions. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.", + "SampleAfterValue": "1000003", + "UMask": "0xfc", + "Unit": "cpu_core" + }, { "BriefDescription": "Counts the number of floating point operations retired that required microcode assist.", "EventCode": "0xc3", diff --git a/tools/perf/pmu-events/arch/x86/alderlake/frontend.json b/tools/perf/pmu-events/arch/x86/alderlake/frontend.json index 250cd128b674..81349100fe32 100644 --- a/tools/perf/pmu-events/arch/x86/alderlake/frontend.json +++ b/tools/perf/pmu-events/arch/x86/alderlake/frontend.json @@ -8,6 +8,15 @@ "UMask": "0x1", "Unit": "cpu_atom" }, + { + "BriefDescription": "Clears due to Unknown Branches.", + "EventCode": "0x60", + "EventName": "BACLEARS.ANY", + "PublicDescription": "Number of times the front-end is resteered when it finds a branch instruction in a fetch line. This is called Unknown Branch which occurs for the first time a branch instruction is fetched or when the branch is not tracked by the BPU (Branch Prediction Unit) anymore.", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, { "BriefDescription": "Stalls caused by changing prefix length of the instruction.", "EventCode": "0x87", diff --git a/tools/perf/pmu-events/arch/x86/alderlake/memory.json b/tools/perf/pmu-events/arch/x86/alderlake/memory.json index 7595eb4ab46f..37f3d062a788 100644 --- a/tools/perf/pmu-events/arch/x86/alderlake/memory.json +++ b/tools/perf/pmu-events/arch/x86/alderlake/memory.json @@ -278,10 +278,9 @@ "Unit": "cpu_atom" }, { - "BriefDescription": "Demand Data Read requests who miss L3 cache", + "BriefDescription": "Counts demand data read requests that miss the L3 cache.", "EventCode": "0x21", "EventName": "OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", - "PublicDescription": "Demand Data Read requests who miss L3 cache.", "SampleAfterValue": "100003", "UMask": "0x10", "Unit": "cpu_core" diff --git a/tools/perf/pmu-events/arch/x86/alderlake/pipeline.json b/tools/perf/pmu-events/arch/x86/alderlake/pipeline.json index f46fa7ba168a..2dba3a115f97 100644 --- a/tools/perf/pmu-events/arch/x86/alderlake/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/alderlake/pipeline.json @@ -793,13 +793,25 @@ "Unit": "cpu_core" }, { - "BriefDescription": "INST_RETIRED.REP_ITERATION", + "BriefDescription": "Iterations of Repeat string retired instructions.", "EventCode": "0xc0", "EventName": "INST_RETIRED.REP_ITERATION", + "PublicDescription": "Number of iterations of Repeat (REP) string retired instructions such as MOVS, CMPS, and SCAS. Each has a byte, word, and doubleword version and string instructions can be repeated using a repetition prefix, REP, that allows their architectural execution to be repeated a number of times as specified by the RCX register. Note the number of iterations is implementation-dependent.", "SampleAfterValue": "2000003", "UMask": "0x8", "Unit": "cpu_core" }, + { + "BriefDescription": "Clears speculative count", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0xad", + "EventName": "INT_MISC.CLEARS_COUNT", + "PublicDescription": "Counts the number of speculative clears due to any type of branch misprediction or machine clears", + "SampleAfterValue": "500009", + "UMask": "0x1", + "Unit": "cpu_core" + }, { "BriefDescription": "Counts cycles after recovery from a branch misprediction or machine clear till the first uop is issued from the resteered path.", "EventCode": "0xad", diff --git a/tools/perf/pmu-events/arch/x86/alderlake/uncore-other.json b/tools/perf/pmu-events/arch/x86/alderlake/uncore-other.json index bc5fb6b76065..5f3b4c6e2e39 100644 --- a/tools/perf/pmu-events/arch/x86/alderlake/uncore-other.json +++ b/tools/perf/pmu-events/arch/x86/alderlake/uncore-other.json @@ -24,7 +24,7 @@ "Unit": "ARB" }, { - "BriefDescription": "Number of coherent read requests sent to memory controller that were issued by any core.", + "BriefDescription": "This event is deprecated. Refer to new event UNC_ARB_REQ_TRK_REQUEST.DRD", "EventCode": "0x81", "EventName": "UNC_ARB_DAT_REQUESTS.RD", "PerPkg": "1", @@ -40,7 +40,15 @@ "Unit": "ARB" }, { - "BriefDescription": "This event is deprecated. Refer to new event UNC_ARB_DAT_REQUESTS.RD", + "BriefDescription": "Each cycle count number of 'valid' coherent Data Read entries . Such entry is defined as valid when it is allocated till deallocation. Doesn't include prefetches [This event is alias to UNC_ARB_TRK_OCCUPANCY.RD]", + "EventCode": "0x80", + "EventName": "UNC_ARB_REQ_TRK_OCCUPANCY.DRD", + "PerPkg": "1", + "UMask": "0x2", + "Unit": "ARB" + }, + { + "BriefDescription": "Number of all coherent Data Read entries. Doesn't include prefetches [This event is alias to UNC_ARB_TRK_REQUESTS.RD]", "EventCode": "0x81", "EventName": "UNC_ARB_REQ_TRK_REQUEST.DRD", "PerPkg": "1", @@ -55,6 +63,14 @@ "UMask": "0x1", "Unit": "ARB" }, + { + "BriefDescription": "Each cycle count number of 'valid' coherent Data Read entries . Such entry is defined as valid when it is allocated till deallocation. Doesn't include prefetches [This event is alias to UNC_ARB_REQ_TRK_OCCUPANCY.DRD]", + "EventCode": "0x80", + "EventName": "UNC_ARB_TRK_OCCUPANCY.RD", + "PerPkg": "1", + "UMask": "0x2", + "Unit": "ARB" + }, { "BriefDescription": "Counts the number of coherent and in-coherent requests initiated by IA cores, processor graphic units, or LLC.", "EventCode": "0x81", @@ -63,6 +79,14 @@ "UMask": "0x1", "Unit": "ARB" }, + { + "BriefDescription": "Number of all coherent Data Read entries. Doesn't include prefetches [This event is alias to UNC_ARB_REQ_TRK_REQUEST.DRD]", + "EventCode": "0x81", + "EventName": "UNC_ARB_TRK_REQUESTS.RD", + "PerPkg": "1", + "UMask": "0x2", + "Unit": "ARB" + }, { "BriefDescription": "This 48-bit fixed counter counts the UCLK cycles.", "EventCode": "0xff", diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv index 5facdac6fe8e..4bcccab07ea9 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -1,5 +1,5 @@ Family-model,Version,Filename,EventType -GenuineIntel-6-(97|9A|B7|BA|BF),v1.16,alderlake,core +GenuineIntel-6-(97|9A|B7|BA|BF),v1.18,alderlake,core GenuineIntel-6-BE,v1.16,alderlaken,core GenuineIntel-6-(1C|26|27|35|36),v4,bonnell,core GenuineIntel-6-(3D|47),v26,broadwell,core From patchwork Sun Feb 19 09:28:08 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59101 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp772324wrn; Sun, 19 Feb 2023 01:32:42 -0800 (PST) X-Google-Smtp-Source: AK7set9H7fGG6VYL9hHTTTHddoiBmhZb7Ix/5g8S2jjU4jZkgI13PODuIpj3W6RM0aFjP7430RCx X-Received: by 2002:a17:907:2141:b0:8b1:264d:6187 with SMTP id rk1-20020a170907214100b008b1264d6187mr5710156ejb.46.1676799162268; Sun, 19 Feb 2023 01:32:42 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799162; cv=none; d=google.com; s=arc-20160816; b=ToFzRdzZXqj1W05LSZPCChJ61iODYawr6QZNDzKsRX7sVKm+Bro7tbttzokqU1K9yT 8hunxCcvg49zOgihD+NoJiqkp/KRJvAjdn/pTQS7oJxga39Mw63w+IpBDgmA19QxuobW 2We63ZdEJUaJYLHg975O8s1P/HTEuH1EsRckXelneQyaF+wRq1RMi/lH9uN1Mp2jR3OB TX4q4sHjDPITEESg58gXSmqtkhJOBg0OErdadCqKSmn2KTkZlgzAoqeLOFPsmSLj1eOb hi2ZYdvpqZE1cDkISiGrhzKSk4+ZKA7MCeVM7iDwBBMNeT4yKKTPvzSmhF11XqvcCEfh pbTw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:from:subject :references:mime-version:message-id:in-reply-to:date:dkim-signature; bh=pg8DtD/6AWtZUKUB0xk490e6K26T1+8gvWgbrGLVsfA=; b=hLna5bIerrvOpMaX05Yb2dVHVEoxi4iHyD8lnSDzkyPOd3noclrMGbBtrKrTvgDZfd W/7/ruOSI28N8NOiNmLVTAGJtSThovn8dvaX/zdbf3ruNN154E2jSzzsN74iyXLATB02 BF/DyWS/4lYMPJ4kOQgylay0CbhXkk8yozsxd6V+jerMQ+0UFlSFnOhR+Sz7fsTGLaFM q1sMJWsgEJxOEY39VKvcIdzG4qnZFSpzbLUJpCwHBryczKJk55/ndtwBdLriPw7o0nD6 +3aXVm0nuihBmjpewUss1v2C9xjQL/K8ZR9uHhTTqw3EGENxfxwqXirOskJLfvRnFmK5 1qcA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=hr35H7hV; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id l3-20020a170907914300b008b17ac5c729si8503618ejs.10.2023.02.19.01.32.19; Sun, 19 Feb 2023 01:32:42 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=hr35H7hV; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229898AbjBSJbT (ORCPT + 99 others); Sun, 19 Feb 2023 04:31:19 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52952 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229910AbjBSJbM (ORCPT ); Sun, 19 Feb 2023 04:31:12 -0500 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1745E12595 for ; Sun, 19 Feb 2023 01:30:36 -0800 (PST) Received: by mail-yb1-xb4a.google.com with SMTP id a34-20020a25a1a5000000b0092aabd4fa90so2167713ybi.18 for ; Sun, 19 Feb 2023 01:30:36 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:from:to:cc:subject:date :message-id:reply-to; bh=pg8DtD/6AWtZUKUB0xk490e6K26T1+8gvWgbrGLVsfA=; b=hr35H7hV2yotrjnrYtqxRDqhq5ee1Refjs9MiQhrAo48+q64Pp73OggomQ3mtUkFQ3 lr5RbP06IUlqvH7vhExRFFbU1xlMy4YZP9sKZYWHsahmu887mIOnx/CwpznVSHvuulJ0 bA5d1aCu3iF6f6+AeY2ABMhlsrGhDKm7c6YAU/HD1cfh87Av5h7UwaJYS1yvqSHHvvWs ZijmN7vTOyoiWR7+Y42Fk6Xc1gv2k6FFAOoQRPlcOtW4wja9w35qoGs8+OwsRbB0Wbic xvTkKnyzLhhaz0VD1676Iq7RPRYxXB8do84zqAAXB1OIvayCArQLyGEhMTQidf6JHUzr tQdg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=pg8DtD/6AWtZUKUB0xk490e6K26T1+8gvWgbrGLVsfA=; b=RyHrb/MtPJqCpT3TWRyYUdtyTrHF3rhiW2ORDYanklhuB+1wOraFBs97taj2wGx99I PK5C8UvSPaa2iqu67wp53F3BgHboaVjRnAxY7nWdoDLdZamkKTfIatx4KTdf0CgqekKc geEjTLUic9gwPerhHrP450Mqj5sMu4inv9zvXz8Z6tUgdcKKyF6Un8bFezAYTcl7PxqF FQ339A+923OuRJKRhkNjCzy3HMcx9UCsbvFDwj4FRnB/kur6hzYwrXKKArDRI8pnvswJ rlPlNzcW9eHckrnz4HVPy5Fz+r3e21v0ox9R6MGNq/cwhMhPKtFLTA0WxGzsg33l7teJ IYqg== X-Gm-Message-State: AO0yUKWN2u1mUu2KrtJKrmgeup2fFeFHDLuc1LhqXwXuc5MGFSgV5lkB hjyKFLho+P3afyhALM6yRnl7wzbul6Yt X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a25:8589:0:b0:92c:2340:4de0 with SMTP id x9-20020a258589000000b0092c23404de0mr1502249ybk.310.1676799034298; Sun, 19 Feb 2023 01:30:34 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:08 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-12-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 11/51] perf vendor events intel: Refresh alderlake-n metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758251358025961679?= X-GMAIL-MSGID: =?utf-8?q?1758251358025961679?= Update the alderlake-n events from 1.16 to 1.18 (no change) and metrics. Generation was done using https://github.com/intel/perfmon. Notable changes are TMA metrics are updated to version 4.5, TMA info metrics are renamed from their node name to be lower case and prefixed by tma_info_, MetricThreshold expressions are added and the smi_cost metric group is added replicating existing hard coded metrics in stat-shadow. Signed-off-by: Ian Rogers --- .../arch/x86/alderlaken/adln-metrics.json | 811 ++++++++++-------- tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- 2 files changed, 454 insertions(+), 359 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/alderlaken/adln-metrics.json b/tools/perf/pmu-events/arch/x86/alderlaken/adln-metrics.json index 9ab1d5bcf4a2..5078c468480f 100644 --- a/tools/perf/pmu-events/arch/x86/alderlaken/adln-metrics.json +++ b/tools/perf/pmu-events/arch/x86/alderlaken/adln-metrics.json @@ -1,583 +1,678 @@ [ { - "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to frontend stalls.", - "MetricExpr": "TOPDOWN_FE_BOUND.ALL / SLOTS", - "MetricGroup": "TopdownL1", - "MetricName": "tma_frontend_bound", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to frontend bandwidth restrictions due to decode, predecode, cisc, and other limitations.", - "MetricExpr": "TOPDOWN_FE_BOUND.FRONTEND_LATENCY / SLOTS", - "MetricGroup": "TopdownL2;tma_frontend_bound_group", - "MetricName": "tma_frontend_latency", + "BriefDescription": "C10 residency percent per package", + "MetricExpr": "cstate_pkg@c10\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C10_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to instruction cache misses.", - "MetricExpr": "TOPDOWN_FE_BOUND.ICACHE / SLOTS", - "MetricGroup": "TopdownL3;tma_frontend_latency_group", - "MetricName": "tma_icache", + "BriefDescription": "C1 residency percent per core", + "MetricExpr": "cstate_core@c1\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C1_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to Instruction Table Lookaside Buffer (ITLB) misses.", - "MetricExpr": "TOPDOWN_FE_BOUND.ITLB / SLOTS", - "MetricGroup": "TopdownL3;tma_frontend_latency_group", - "MetricName": "tma_itlb", + "BriefDescription": "C2 residency percent per package", + "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C2_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to BACLEARS, which occurs when the Branch Target Buffer (BTB) prediction or lack thereof, was corrected by a later branch predictor in the frontend", - "MetricExpr": "TOPDOWN_FE_BOUND.BRANCH_DETECT / SLOTS", - "MetricGroup": "TopdownL3;tma_frontend_latency_group", - "MetricName": "tma_branch_detect", - "PublicDescription": "Counts the number of issue slots that were not delivered by the frontend due to BACLEARS, which occurs when the Branch Target Buffer (BTB) prediction or lack thereof, was corrected by a later branch predictor in the frontend. Includes BACLEARS due to all branch types including conditional and unconditional jumps, returns, and indirect branches.", + "BriefDescription": "C3 residency percent per package", + "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to BTCLEARS, which occurs when the Branch Target Buffer (BTB) predicts a taken branch.", - "MetricExpr": "TOPDOWN_FE_BOUND.BRANCH_RESTEER / SLOTS", - "MetricGroup": "TopdownL3;tma_frontend_latency_group", - "MetricName": "tma_branch_resteer", + "BriefDescription": "C6 residency percent per core", + "MetricExpr": "cstate_core@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to frontend bandwidth restrictions due to decode, predecode, cisc, and other limitations.", - "MetricExpr": "TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH / SLOTS", - "MetricGroup": "TopdownL2;tma_frontend_bound_group", - "MetricName": "tma_frontend_bandwidth", + "BriefDescription": "C6 residency percent per package", + "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to the microcode sequencer (MS).", - "MetricExpr": "TOPDOWN_FE_BOUND.CISC / SLOTS", - "MetricGroup": "TopdownL3;tma_frontend_bandwidth_group", - "MetricName": "tma_cisc", + "BriefDescription": "C7 residency percent per core", + "MetricExpr": "cstate_core@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to decode stalls.", - "MetricExpr": "TOPDOWN_FE_BOUND.DECODE / SLOTS", - "MetricGroup": "TopdownL3;tma_frontend_bandwidth_group", - "MetricName": "tma_decode", + "BriefDescription": "C7 residency percent per package", + "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to wrong predecodes.", - "MetricExpr": "TOPDOWN_FE_BOUND.PREDECODE / SLOTS", - "MetricGroup": "TopdownL3;tma_frontend_bandwidth_group", - "MetricName": "tma_predecode", + "BriefDescription": "C8 residency percent per package", + "MetricExpr": "cstate_pkg@c8\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C8_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to other common frontend stalls not categorized.", - "MetricExpr": "TOPDOWN_FE_BOUND.OTHER / SLOTS", - "MetricGroup": "TopdownL3;tma_frontend_bandwidth_group", - "MetricName": "tma_other_fb", + "BriefDescription": "C9 residency percent per package", + "MetricExpr": "cstate_pkg@c9\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C9_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend because allocation is stalled due to a mispredicted jump or a machine clear", - "MetricExpr": "(SLOTS - (TOPDOWN_FE_BOUND.ALL + TOPDOWN_BE_BOUND.ALL + TOPDOWN_RETIRING.ALL)) / SLOTS", - "MetricGroup": "TopdownL1", - "MetricName": "tma_bad_speculation", - "PublicDescription": "Counts the total number of issue slots that were not consumed by the backend because allocation is stalled due to a mispredicted jump or a machine clear. Only issue slots wasted due to fast nukes such as memory ordering nukes are counted. Other nukes are not accounted for. Counts all issue slots blocked during this recovery window including relevant microcode flows and while uops are not yet available in the instruction queue (IQ). Also includes the issue slots that were consumed by the backend but were thrown away because they were younger than the mispredict or machine clear.", + "BriefDescription": "Percentage of cycles spent in System Management Interrupts.", + "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)", + "MetricGroup": "smi", + "MetricName": "smi_cycles", + "MetricThreshold": "smi_cycles > 0.1", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to branch mispredicts.", - "MetricExpr": "TOPDOWN_BAD_SPECULATION.MISPREDICT / SLOTS", - "MetricGroup": "TopdownL2;tma_bad_speculation_group", - "MetricName": "tma_branch_mispredicts", - "ScaleUnit": "100%" + "BriefDescription": "Number of SMI interrupts.", + "MetricExpr": "msr@smi@", + "MetricGroup": "smi", + "MetricName": "smi_num", + "ScaleUnit": "1SMI#" }, { - "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend because allocation is stalled due to a machine clear (nuke) of any kind including memory ordering and memory disambiguation.", - "MetricExpr": "TOPDOWN_BAD_SPECULATION.MACHINE_CLEARS / SLOTS", - "MetricGroup": "TopdownL2;tma_bad_speculation_group", - "MetricName": "tma_machine_clears", + "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to certain allocation restrictions.", + "MetricExpr": "TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_alloc_restriction", + "MetricThreshold": "tma_alloc_restriction > 0.1", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to a machine clear (slow nuke).", - "MetricExpr": "TOPDOWN_BAD_SPECULATION.NUKE / SLOTS", - "MetricGroup": "TopdownL3;tma_machine_clears_group", - "MetricName": "tma_nuke", + "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls", + "MetricExpr": "TOPDOWN_BE_BOUND.ALL / tma_info_slots", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_backend_bound", + "MetricThreshold": "tma_backend_bound > 0.1", + "PublicDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls. Note that uops must be available for consumption in order for this event to count. If a uop is not available (IQ is empty), this event will not count. The rest of these subevents count backend stalls, in cycles, due to an outstanding request which is memory bound vs core bound. The subevents are not slot based events and therefore can not be precisely added or subtracted from the Backend_Bound_Aux subevents which are slot based.", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of machine clears relative to the number of nuke slots due to SMC. ", - "MetricExpr": "tma_nuke * (MACHINE_CLEARS.SMC / MACHINE_CLEARS.SLOW)", - "MetricGroup": "TopdownL4;tma_nuke_group", - "MetricName": "tma_smc", + "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls", + "MetricExpr": "tma_backend_bound", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_backend_bound_aux", + "MetricThreshold": "tma_backend_bound_aux > 0.2", + "PublicDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls. Note that UOPS must be available for consumption in order for this event to count. If a uop is not available (IQ is empty), this event will not count. All of these subevents count backend stalls, in slots, due to a resource limitation. These are not cycle based events and therefore can not be precisely added or subtracted from the Backend_Bound subevents which are cycle based. These subevents are supplementary to Backend_Bound and can be used to analyze results from a resource perspective at allocation.", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of machine clears relative to the number of nuke slots due to memory ordering. ", - "MetricExpr": "tma_nuke * (MACHINE_CLEARS.MEMORY_ORDERING / MACHINE_CLEARS.SLOW)", - "MetricGroup": "TopdownL4;tma_nuke_group", - "MetricName": "tma_memory_ordering", + "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend because allocation is stalled due to a mispredicted jump or a machine clear", + "MetricExpr": "(tma_info_slots - (TOPDOWN_FE_BOUND.ALL + TOPDOWN_BE_BOUND.ALL + TOPDOWN_RETIRING.ALL)) / tma_info_slots", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_bad_speculation", + "MetricThreshold": "tma_bad_speculation > 0.15", + "PublicDescription": "Counts the total number of issue slots that were not consumed by the backend because allocation is stalled due to a mispredicted jump or a machine clear. Only issue slots wasted due to fast nukes such as memory ordering nukes are counted. Other nukes are not accounted for. Counts all issue slots blocked during this recovery window including relevant microcode flows and while uops are not yet available in the instruction queue (IQ). Also includes the issue slots that were consumed by the backend but were thrown away because they were younger than the mispredict or machine clear.", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of machine clears relative to the number of nuke slots due to FP assists. ", - "MetricExpr": "tma_nuke * (MACHINE_CLEARS.FP_ASSIST / MACHINE_CLEARS.SLOW)", - "MetricGroup": "TopdownL4;tma_nuke_group", - "MetricName": "tma_fp_assist", + "BriefDescription": "Counts the number of uops that are not from the microsequencer.", + "MetricExpr": "(TOPDOWN_RETIRING.ALL - UOPS_RETIRED.MS) / tma_info_slots", + "MetricGroup": "TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_base", + "MetricThreshold": "tma_base > 0.6", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of machine clears relative to the number of nuke slots due to memory disambiguation. ", - "MetricExpr": "tma_nuke * (MACHINE_CLEARS.DISAMBIGUATION / MACHINE_CLEARS.SLOW)", - "MetricGroup": "TopdownL4;tma_nuke_group", - "MetricName": "tma_disambiguation", + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to BACLEARS, which occurs when the Branch Target Buffer (BTB) prediction or lack thereof, was corrected by a later branch predictor in the frontend", + "MetricExpr": "TOPDOWN_FE_BOUND.BRANCH_DETECT / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_frontend_latency_group", + "MetricName": "tma_branch_detect", + "MetricThreshold": "tma_branch_detect > 0.05", + "PublicDescription": "Counts the number of issue slots that were not delivered by the frontend due to BACLEARS, which occurs when the Branch Target Buffer (BTB) prediction or lack thereof, was corrected by a later branch predictor in the frontend. Includes BACLEARS due to all branch types including conditional and unconditional jumps, returns, and indirect branches.", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of machine clears relative to the number of nuke slots due to page faults. ", - "MetricExpr": "tma_nuke * (MACHINE_CLEARS.PAGE_FAULT / MACHINE_CLEARS.SLOW)", - "MetricGroup": "TopdownL4;tma_nuke_group", - "MetricName": "tma_page_fault", + "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to branch mispredicts.", + "MetricExpr": "TOPDOWN_BAD_SPECULATION.MISPREDICT / tma_info_slots", + "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", + "MetricName": "tma_branch_mispredicts", + "MetricThreshold": "tma_branch_mispredicts > 0.05", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to a machine clear classified as a fast nuke due to memory ordering, memory disambiguation and memory renaming.", - "MetricExpr": "TOPDOWN_BAD_SPECULATION.FASTNUKE / SLOTS", - "MetricGroup": "TopdownL3;tma_machine_clears_group", - "MetricName": "tma_fast_nuke", + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to BTCLEARS, which occurs when the Branch Target Buffer (BTB) predicts a taken branch.", + "MetricExpr": "TOPDOWN_FE_BOUND.BRANCH_RESTEER / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_frontend_latency_group", + "MetricName": "tma_branch_resteer", + "MetricThreshold": "tma_branch_resteer > 0.05", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls", - "MetricExpr": "TOPDOWN_BE_BOUND.ALL / SLOTS", - "MetricGroup": "TopdownL1", - "MetricName": "tma_backend_bound", - "PublicDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls. Note that uops must be available for consumption in order for this event to count. If a uop is not available (IQ is empty), this event will not count. The rest of these subevents count backend stalls, in cycles, due to an outstanding request which is memory bound vs core bound. The subevents are not slot based events and therefore can not be precisely added or subtracted from the Backend_Bound_Aux subevents which are slot based.", + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to the microcode sequencer (MS).", + "MetricExpr": "TOPDOWN_FE_BOUND.CISC / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_frontend_bandwidth_group", + "MetricName": "tma_cisc", + "MetricThreshold": "tma_cisc > 0.05", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of cycles due to backend bound stalls that are core execution bound and not attributed to outstanding demand load or store stalls. ", + "BriefDescription": "Counts the number of cycles due to backend bound stalls that are core execution bound and not attributed to outstanding demand load or store stalls.", "MetricExpr": "max(0, tma_backend_bound - tma_load_store_bound)", - "MetricGroup": "TopdownL2;tma_backend_bound_group", + "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of cycles the core is stalled due to stores or loads. ", - "MetricExpr": "min(tma_backend_bound, LD_HEAD.ANY_AT_RET / CLKS + tma_store_bound)", - "MetricGroup": "TopdownL2;tma_backend_bound_group", - "MetricName": "tma_load_store_bound", + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to decode stalls.", + "MetricExpr": "TOPDOWN_FE_BOUND.DECODE / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_frontend_bandwidth_group", + "MetricName": "tma_decode", + "MetricThreshold": "tma_decode > 0.05", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of cycles the core is stalled due to store buffer full.", - "MetricExpr": "tma_st_buffer", - "MetricGroup": "TopdownL3;tma_load_store_bound_group", - "MetricName": "tma_store_bound", + "BriefDescription": "Counts the number of machine clears relative to the number of nuke slots due to memory disambiguation.", + "MetricExpr": "tma_nuke * (MACHINE_CLEARS.DISAMBIGUATION / MACHINE_CLEARS.SLOW)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_nuke_group", + "MetricName": "tma_disambiguation", + "MetricThreshold": "tma_disambiguation > 0.02", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement due to a load block.", - "MetricExpr": "LD_HEAD.L1_BOUND_AT_RET / CLKS", - "MetricGroup": "TopdownL3;tma_load_store_bound_group", - "MetricName": "tma_l1_bound", + "BriefDescription": "Counts the number of cycles the core is stalled due to a demand load miss which hit in DRAM or MMIO (Non-DRAM).", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "MEM_BOUND_STALLS.LOAD_DRAM_HIT / tma_info_clks - MEM_BOUND_STALLS_AT_RET_CORRECTION * MEM_BOUND_STALLS.LOAD_DRAM_HIT / MEM_BOUND_STALLS.LOAD", + "MetricGroup": "TopdownL3;tma_L3_group;tma_load_store_bound_group", + "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement due to a store forward block.", - "MetricExpr": "LD_HEAD.ST_ADDR_AT_RET / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_store_fwd", + "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to a machine clear classified as a fast nuke due to memory ordering, memory disambiguation and memory renaming.", + "MetricExpr": "TOPDOWN_BAD_SPECULATION.FASTNUKE / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", + "MetricName": "tma_fast_nuke", + "MetricThreshold": "tma_fast_nuke > 0.05", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement due to a first level TLB miss.", - "MetricExpr": "LD_HEAD.DTLB_MISS_AT_RET / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_stlb_hit", + "BriefDescription": "Counts the number of machine clears relative to the number of nuke slots due to FP assists.", + "MetricExpr": "tma_nuke * (MACHINE_CLEARS.FP_ASSIST / MACHINE_CLEARS.SLOW)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_nuke_group", + "MetricName": "tma_fp_assist", + "MetricThreshold": "tma_fp_assist > 0.02", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement due to a second level TLB miss requiring a page walk.", - "MetricExpr": "LD_HEAD.PGWALK_AT_RET / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_stlb_miss", + "BriefDescription": "Counts the number of floating point operations per uop with all default weighting.", + "MetricExpr": "UOPS_RETIRED.FPDIV / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_base_group", + "MetricName": "tma_fp_uops", + "MetricThreshold": "tma_fp_uops > 0.2", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement due to a number of other load blocks.", - "MetricExpr": "LD_HEAD.OTHER_AT_RET / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_other_l1", + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to frontend bandwidth restrictions due to decode, predecode, cisc, and other limitations.", + "MetricExpr": "TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH / tma_info_slots", + "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_frontend_bandwidth", + "MetricThreshold": "tma_frontend_bandwidth > 0.1", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of cycles a core is stalled due to a demand load which hit in the L2 Cache.", - "MetricExpr": "MEM_BOUND_STALLS.LOAD_L2_HIT / CLKS - MEM_BOUND_STALLS_AT_RET_CORRECTION * MEM_BOUND_STALLS.LOAD_L2_HIT / MEM_BOUND_STALLS.LOAD", - "MetricGroup": "TopdownL3;tma_load_store_bound_group", - "MetricName": "tma_l2_bound", + "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to frontend stalls.", + "MetricExpr": "TOPDOWN_FE_BOUND.ALL / tma_info_slots", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.2", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of cycles a core is stalled due to a demand load which hit in the Last Level Cache (LLC) or other core with HITE/F/M.", - "MetricExpr": "MEM_BOUND_STALLS.LOAD_LLC_HIT / CLKS - MEM_BOUND_STALLS_AT_RET_CORRECTION * MEM_BOUND_STALLS.LOAD_LLC_HIT / MEM_BOUND_STALLS.LOAD", - "MetricGroup": "TopdownL3;tma_load_store_bound_group", - "MetricName": "tma_l3_bound", + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to frontend bandwidth restrictions due to decode, predecode, cisc, and other limitations.", + "MetricExpr": "TOPDOWN_FE_BOUND.FRONTEND_LATENCY / tma_info_slots", + "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_frontend_latency", + "MetricThreshold": "tma_frontend_latency > 0.15", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of cycles the core is stalled due to a demand load miss which hit in DRAM or MMIO (Non-DRAM).", - "MetricExpr": "MEM_BOUND_STALLS.LOAD_DRAM_HIT / CLKS - MEM_BOUND_STALLS_AT_RET_CORRECTION * MEM_BOUND_STALLS.LOAD_DRAM_HIT / MEM_BOUND_STALLS.LOAD", - "MetricGroup": "TopdownL3;tma_load_store_bound_group", - "MetricName": "tma_dram_bound", + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to instruction cache misses.", + "MetricExpr": "TOPDOWN_FE_BOUND.ICACHE / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_frontend_latency_group", + "MetricName": "tma_icache", + "MetricThreshold": "tma_icache > 0.05", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of cycles the core is stalled due to a demand load miss which hits in the L2, LLC, DRAM or MMIO (Non-DRAM) but could not be correctly attributed or cycles in which the load miss is waiting on a request buffer.", - "MetricExpr": "max(0, tma_load_store_bound - (tma_store_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound))", - "MetricGroup": "TopdownL3;tma_load_store_bound_group", - "MetricName": "tma_other_load_store", - "ScaleUnit": "100%" + "BriefDescription": "Percentage of total non-speculative loads with a address aliasing block", + "MetricExpr": "100 * LD_BLOCKS.4K_ALIAS / MEM_UOPS_RETIRED.ALL_LOADS", + "MetricName": "tma_info_address_alias_blocks" }, { - "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls", - "MetricExpr": "tma_backend_bound", - "MetricGroup": "TopdownL1", - "MetricName": "tma_backend_bound_aux", - "PublicDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls. Note that UOPS must be available for consumption in order for this event to count. If a uop is not available (IQ is empty), this event will not count. All of these subevents count backend stalls, in slots, due to a resource limitation. These are not cycle based events and therefore can not be precisely added or subtracted from the Backend_Bound subevents which are cycle based. These subevents are supplementary to Backend_Bound and can be used to analyze results from a resource perspective at allocation. ", - "ScaleUnit": "100%" + "BriefDescription": "Ratio of all branches which mispredict", + "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": " ", + "MetricName": "tma_info_branch_mispredict_ratio" }, { - "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls", - "MetricExpr": "tma_backend_bound", - "MetricGroup": "TopdownL2;tma_backend_bound_aux_group", - "MetricName": "tma_resource_bound", - "PublicDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls. Note that uops must be available for consumption in order for this event to count. If a uop is not available (IQ is empty), this event will not count. ", - "ScaleUnit": "100%" + "BriefDescription": "Ratio between Mispredicted branches and unknown branches", + "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / BACLEARS.ANY", + "MetricGroup": " ", + "MetricName": "tma_info_branch_mispredict_to_unknown_branch_ratio" }, { - "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to memory reservation stalls in which a scheduler is not able to accept uops.", - "MetricExpr": "TOPDOWN_BE_BOUND.MEM_SCHEDULER / SLOTS", - "MetricGroup": "TopdownL3;tma_resource_bound_group", - "MetricName": "tma_mem_scheduler", - "ScaleUnit": "100%" + "BriefDescription": "", + "MetricExpr": "CPU_CLK_UNHALTED.CORE", + "MetricGroup": " ", + "MetricName": "tma_info_clks" }, { - "BriefDescription": "Counts the number of cycles, relative to the number of mem_scheduler slots, in which uops are blocked due to store buffer full", - "MetricExpr": "tma_mem_scheduler * (MEM_SCHEDULER_BLOCK.ST_BUF / MEM_SCHEDULER_BLOCK.ALL)", - "MetricGroup": "TopdownL4;tma_mem_scheduler_group", - "MetricName": "tma_st_buffer", - "ScaleUnit": "100%" + "BriefDescription": "", + "MetricExpr": "CPU_CLK_UNHALTED.CORE_P", + "MetricGroup": " ", + "MetricName": "tma_info_clks_p" }, { - "BriefDescription": "Counts the number of cycles, relative to the number of mem_scheduler slots, in which uops are blocked due to load buffer full", - "MetricExpr": "tma_mem_scheduler * MEM_SCHEDULER_BLOCK.LD_BUF / MEM_SCHEDULER_BLOCK.ALL", - "MetricGroup": "TopdownL4;tma_mem_scheduler_group", - "MetricName": "tma_ld_buffer", - "ScaleUnit": "100%" + "BriefDescription": "Cycles Per Instruction", + "MetricExpr": "tma_info_clks / INST_RETIRED.ANY", + "MetricGroup": " ", + "MetricName": "tma_info_cpi" }, { - "BriefDescription": "Counts the number of cycles, relative to the number of mem_scheduler slots, in which uops are blocked due to RSV full relative ", - "MetricExpr": "tma_mem_scheduler * MEM_SCHEDULER_BLOCK.RSV / MEM_SCHEDULER_BLOCK.ALL", - "MetricGroup": "TopdownL4;tma_mem_scheduler_group", - "MetricName": "tma_rsv", - "ScaleUnit": "100%" + "BriefDescription": "Average CPU Utilization", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": " ", + "MetricName": "tma_info_cpu_utilization" }, { - "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to IEC or FPC RAT stalls, which can be due to FIQ or IEC reservation stalls in which the integer, floating point or SIMD scheduler is not able to accept uops.", - "MetricExpr": "TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER / SLOTS", - "MetricGroup": "TopdownL3;tma_resource_bound_group", - "MetricName": "tma_non_mem_scheduler", - "ScaleUnit": "100%" + "BriefDescription": "Cycle cost per DRAM hit", + "MetricExpr": "MEM_BOUND_STALLS.LOAD_DRAM_HIT / MEM_LOAD_UOPS_RETIRED.DRAM_HIT", + "MetricGroup": " ", + "MetricName": "tma_info_cycles_per_demand_load_dram_hit" }, { - "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to the physical register file unable to accept an entry (marble stalls).", - "MetricExpr": "TOPDOWN_BE_BOUND.REGISTER / SLOTS", - "MetricGroup": "TopdownL3;tma_resource_bound_group", - "MetricName": "tma_register", - "ScaleUnit": "100%" + "BriefDescription": "Cycle cost per L2 hit", + "MetricExpr": "MEM_BOUND_STALLS.LOAD_L2_HIT / MEM_LOAD_UOPS_RETIRED.L2_HIT", + "MetricGroup": " ", + "MetricName": "tma_info_cycles_per_demand_load_l2_hit" }, { - "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to the reorder buffer being full (ROB stalls).", - "MetricExpr": "TOPDOWN_BE_BOUND.REORDER_BUFFER / SLOTS", - "MetricGroup": "TopdownL3;tma_resource_bound_group", - "MetricName": "tma_reorder_buffer", - "ScaleUnit": "100%" + "BriefDescription": "Cycle cost per LLC hit", + "MetricExpr": "MEM_BOUND_STALLS.LOAD_LLC_HIT / MEM_LOAD_UOPS_RETIRED.L3_HIT", + "MetricGroup": " ", + "MetricName": "tma_info_cycles_per_demand_load_l3_hit" }, { - "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to certain allocation restrictions.", - "MetricExpr": "TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS / SLOTS", - "MetricGroup": "TopdownL3;tma_resource_bound_group", - "MetricName": "tma_alloc_restriction", - "ScaleUnit": "100%" + "BriefDescription": "Percentage of all uops which are FPDiv uops", + "MetricExpr": "100 * UOPS_RETIRED.FPDIV / UOPS_RETIRED.ALL", + "MetricGroup": " ", + "MetricName": "tma_info_fpdiv_uop_ratio" }, { - "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to scoreboards from the instruction queue (IQ), jump execution unit (JEU), or microcode sequencer (MS).", - "MetricExpr": "TOPDOWN_BE_BOUND.SERIALIZATION / SLOTS", - "MetricGroup": "TopdownL3;tma_resource_bound_group", - "MetricName": "tma_serialization", - "ScaleUnit": "100%" + "BriefDescription": "Percentage of all uops which are IDiv uops", + "MetricExpr": "100 * UOPS_RETIRED.IDIV / UOPS_RETIRED.ALL", + "MetricGroup": " ", + "MetricName": "tma_info_idiv_uop_ratio" }, { - "BriefDescription": "Counts the numer of issue slots that result in retirement slots. ", - "MetricExpr": "TOPDOWN_RETIRING.ALL / SLOTS", - "MetricGroup": "TopdownL1", - "MetricName": "tma_retiring", - "ScaleUnit": "100%" + "BriefDescription": "Percent of instruction miss cost that hit in DRAM", + "MetricExpr": "100 * MEM_BOUND_STALLS.IFETCH_DRAM_HIT / MEM_BOUND_STALLS.IFETCH", + "MetricGroup": " ", + "MetricName": "tma_info_inst_miss_cost_dramhit_percent" }, { - "BriefDescription": "Counts the number of uops that are not from the microsequencer. ", - "MetricExpr": "(TOPDOWN_RETIRING.ALL - UOPS_RETIRED.MS) / SLOTS", - "MetricGroup": "TopdownL2;tma_retiring_group", - "MetricName": "tma_base", - "ScaleUnit": "100%" + "BriefDescription": "Percent of instruction miss cost that hit in the L2", + "MetricExpr": "100 * MEM_BOUND_STALLS.IFETCH_L2_HIT / MEM_BOUND_STALLS.IFETCH", + "MetricGroup": " ", + "MetricName": "tma_info_inst_miss_cost_l2hit_percent" }, { - "BriefDescription": "Counts the number of floating point operations per uop with all default weighting.", - "MetricExpr": "UOPS_RETIRED.FPDIV / SLOTS", - "MetricGroup": "TopdownL3;tma_base_group", - "MetricName": "tma_fp_uops", - "ScaleUnit": "100%" + "BriefDescription": "Percent of instruction miss cost that hit in the L3", + "MetricExpr": "100 * MEM_BOUND_STALLS.IFETCH_LLC_HIT / MEM_BOUND_STALLS.IFETCH", + "MetricGroup": " ", + "MetricName": "tma_info_inst_miss_cost_l3hit_percent" }, { - "BriefDescription": "Counts the number of uops retired excluding ms and fp div uops.", - "MetricExpr": "(TOPDOWN_RETIRING.ALL - UOPS_RETIRED.MS - UOPS_RETIRED.FPDIV) / SLOTS", - "MetricGroup": "TopdownL3;tma_base_group", - "MetricName": "tma_other_ret", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per Branch (lower number means higher occurance rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": " ", + "MetricName": "tma_info_ipbranch" }, { - "BriefDescription": "Counts the number of uops that are from the complex flows issued by the micro-sequencer (MS)", - "MetricExpr": "UOPS_RETIRED.MS / SLOTS", - "MetricGroup": "TopdownL2;tma_retiring_group", - "MetricName": "tma_ms_uops", - "PublicDescription": "Counts the number of uops that are from the complex flows issued by the micro-sequencer (MS). This includes uops from flows due to complex instructions, faults, assists, and inserted flows.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions Per Cycle", + "MetricExpr": "INST_RETIRED.ANY / tma_info_clks", + "MetricGroup": " ", + "MetricName": "tma_info_ipc" }, { - "BriefDescription": "", - "MetricExpr": "CPU_CLK_UNHALTED.CORE", - "MetricName": "CLKS" + "BriefDescription": "Instruction per (near) call (lower number means higher occurance rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.CALL", + "MetricGroup": " ", + "MetricName": "tma_info_ipcall" }, { - "BriefDescription": "", - "MetricExpr": "CPU_CLK_UNHALTED.CORE_P", - "MetricName": "CLKS_P" + "BriefDescription": "Instructions per Far Branch", + "MetricExpr": "INST_RETIRED.ANY / (BR_INST_RETIRED.FAR_BRANCH / 2)", + "MetricGroup": " ", + "MetricName": "tma_info_ipfarbranch" }, { - "BriefDescription": "", - "MetricExpr": "5 * CLKS", - "MetricName": "SLOTS" + "BriefDescription": "Instructions per Load", + "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_LOADS", + "MetricGroup": " ", + "MetricName": "tma_info_ipload" }, { - "BriefDescription": "Instructions Per Cycle", - "MetricExpr": "INST_RETIRED.ANY / CLKS", - "MetricName": "IPC" + "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": " ", + "MetricName": "tma_info_ipmispredict" }, { - "BriefDescription": "Cycles Per Instruction", - "MetricExpr": "CLKS / INST_RETIRED.ANY", - "MetricName": "CPI" + "BriefDescription": "Instructions per Store", + "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_STORES", + "MetricGroup": " ", + "MetricName": "tma_info_ipstore" }, { - "BriefDescription": "Uops Per Instruction", - "MetricExpr": "UOPS_RETIRED.ALL / INST_RETIRED.ANY", - "MetricName": "UPI" + "BriefDescription": "Fraction of cycles spent in Kernel mode", + "MetricExpr": "cpu@CPU_CLK_UNHALTED.CORE@k / CPU_CLK_UNHALTED.CORE", + "MetricGroup": " ", + "MetricName": "tma_info_kernel_utilization" }, { - "BriefDescription": "Percentage of total non-speculative loads with a store forward or unknown store address block", - "MetricExpr": "100 * LD_BLOCKS.DATA_UNKNOWN / MEM_UOPS_RETIRED.ALL_LOADS", - "MetricName": "Store_Fwd_Blocks" + "BriefDescription": "Percentage of total non-speculative loads that are splits", + "MetricExpr": "100 * MEM_UOPS_RETIRED.SPLIT_LOADS / MEM_UOPS_RETIRED.ALL_LOADS", + "MetricName": "tma_info_load_splits" }, { - "BriefDescription": "Percentage of total non-speculative loads with a address aliasing block", - "MetricExpr": "100 * LD_BLOCKS.4K_ALIAS / MEM_UOPS_RETIRED.ALL_LOADS", - "MetricName": "Address_Alias_Blocks" + "BriefDescription": "load ops retired per 1000 instruction", + "MetricExpr": "1e3 * MEM_UOPS_RETIRED.ALL_LOADS / INST_RETIRED.ANY", + "MetricGroup": " ", + "MetricName": "tma_info_memloadpki" }, { - "BriefDescription": "Percentage of total non-speculative loads that are splits", - "MetricExpr": "100 * MEM_UOPS_RETIRED.SPLIT_LOADS / MEM_UOPS_RETIRED.ALL_LOADS", - "MetricName": "Load_Splits" + "BriefDescription": "Percentage of all uops which are ucode ops", + "MetricExpr": "100 * UOPS_RETIRED.MS / UOPS_RETIRED.ALL", + "MetricGroup": " ", + "MetricName": "tma_info_microcode_uop_ratio" }, { - "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", - "MetricName": "IpBranch" + "BriefDescription": "", + "MetricExpr": "5 * tma_info_clks", + "MetricGroup": " ", + "MetricName": "tma_info_slots" }, { - "BriefDescription": "Instruction per (near) call (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.CALL", - "MetricName": "IpCall" + "BriefDescription": "Percentage of total non-speculative loads with a store forward or unknown store address block", + "MetricExpr": "100 * LD_BLOCKS.DATA_UNKNOWN / MEM_UOPS_RETIRED.ALL_LOADS", + "MetricName": "tma_info_store_fwd_blocks" }, { - "BriefDescription": "Instructions per Load", - "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_LOADS", - "MetricName": "IpLoad" + "BriefDescription": "Average Frequency Utilization relative nominal frequency", + "MetricExpr": "tma_info_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": " ", + "MetricName": "tma_info_turbo_utilization" }, { - "BriefDescription": "Instructions per Store", - "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_STORES", - "MetricName": "IpStore" + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "UOPS_RETIRED.ALL / INST_RETIRED.ANY", + "MetricGroup": " ", + "MetricName": "tma_info_upi" }, { - "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction", - "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricName": "IpMispredict" + "BriefDescription": "Percentage of all uops which are x87 uops", + "MetricExpr": "100 * UOPS_RETIRED.X87 / UOPS_RETIRED.ALL", + "MetricGroup": " ", + "MetricName": "tma_info_x87_uop_ratio" }, { - "BriefDescription": "Instructions per Far Branch", - "MetricExpr": "INST_RETIRED.ANY / (BR_INST_RETIRED.FAR_BRANCH / 2)", - "MetricName": "IpFarBranch" + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to Instruction Table Lookaside Buffer (ITLB) misses.", + "MetricExpr": "TOPDOWN_FE_BOUND.ITLB / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_frontend_latency_group", + "MetricName": "tma_itlb", + "MetricThreshold": "tma_itlb > 0.05", + "ScaleUnit": "100%" }, { - "BriefDescription": "Ratio of all branches which mispredict", - "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.ALL_BRANCHES", - "MetricName": "Branch_Mispredict_Ratio" + "BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement due to a load block.", + "MetricExpr": "LD_HEAD.L1_BOUND_AT_RET / tma_info_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_load_store_bound_group", + "MetricName": "tma_l1_bound", + "MetricThreshold": "tma_l1_bound > 0.1", + "ScaleUnit": "100%" }, { - "BriefDescription": "Ratio between Mispredicted branches and unknown branches", - "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / BACLEARS.ANY", - "MetricName": "Branch_Mispredict_to_Unknown_Branch_Ratio" + "BriefDescription": "Counts the number of cycles a core is stalled due to a demand load which hit in the L2 Cache.", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "MEM_BOUND_STALLS.LOAD_L2_HIT / tma_info_clks - MEM_BOUND_STALLS_AT_RET_CORRECTION * MEM_BOUND_STALLS.LOAD_L2_HIT / MEM_BOUND_STALLS.LOAD", + "MetricGroup": "TopdownL3;tma_L3_group;tma_load_store_bound_group", + "MetricName": "tma_l2_bound", + "MetricThreshold": "tma_l2_bound > 0.1", + "ScaleUnit": "100%" }, { - "BriefDescription": "Percentage of all uops which are ucode ops", - "MetricExpr": "100 * UOPS_RETIRED.MS / UOPS_RETIRED.ALL", - "MetricName": "Microcode_Uop_Ratio" + "BriefDescription": "Counts the number of cycles a core is stalled due to a demand load which hit in the Last Level Cache (LLC) or other core with HITE/F/M.", + "MetricExpr": "MEM_BOUND_STALLS.LOAD_LLC_HIT / tma_info_clks - MEM_BOUND_STALLS_AT_RET_CORRECTION * MEM_BOUND_STALLS.LOAD_LLC_HIT / MEM_BOUND_STALLS.LOAD", + "MetricGroup": "TopdownL3;tma_L3_group;tma_load_store_bound_group", + "MetricName": "tma_l3_bound", + "MetricThreshold": "tma_l3_bound > 0.1", + "ScaleUnit": "100%" }, { - "BriefDescription": "Percentage of all uops which are FPDiv uops", - "MetricExpr": "100 * UOPS_RETIRED.FPDIV / UOPS_RETIRED.ALL", - "MetricName": "FPDiv_Uop_Ratio" + "BriefDescription": "Counts the number of cycles, relative to the number of mem_scheduler slots, in which uops are blocked due to load buffer full", + "MetricExpr": "tma_mem_scheduler * MEM_SCHEDULER_BLOCK.LD_BUF / MEM_SCHEDULER_BLOCK.ALL", + "MetricGroup": "TopdownL4;tma_L4_group;tma_mem_scheduler_group", + "MetricName": "tma_ld_buffer", + "MetricThreshold": "tma_ld_buffer > 0.05", + "ScaleUnit": "100%" }, { - "BriefDescription": "Percentage of all uops which are IDiv uops", - "MetricExpr": "100 * UOPS_RETIRED.IDIV / UOPS_RETIRED.ALL", - "MetricName": "IDiv_Uop_Ratio" + "BriefDescription": "Counts the number of cycles the core is stalled due to stores or loads.", + "MetricExpr": "min(tma_backend_bound, LD_HEAD.ANY_AT_RET / tma_info_clks + tma_store_bound)", + "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_load_store_bound", + "MetricThreshold": "tma_load_store_bound > 0.2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Percentage of all uops which are x87 uops", - "MetricExpr": "100 * UOPS_RETIRED.X87 / UOPS_RETIRED.ALL", - "MetricName": "X87_Uop_Ratio" + "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend because allocation is stalled due to a machine clear (nuke) of any kind including memory ordering and memory disambiguation.", + "MetricExpr": "TOPDOWN_BAD_SPECULATION.MACHINE_CLEARS / tma_info_slots", + "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.05", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average Frequency Utilization relative nominal frequency", - "MetricExpr": "CLKS / CPU_CLK_UNHALTED.REF_TSC", - "MetricName": "Turbo_Utilization" + "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to memory reservation stalls in which a scheduler is not able to accept uops.", + "MetricExpr": "TOPDOWN_BE_BOUND.MEM_SCHEDULER / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_mem_scheduler", + "MetricThreshold": "tma_mem_scheduler > 0.1", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of cycles spent in Kernel mode", - "MetricExpr": "cpu@CPU_CLK_UNHALTED.CORE@k / CPU_CLK_UNHALTED.CORE", - "MetricName": "Kernel_Utilization" + "BriefDescription": "Counts the number of machine clears relative to the number of nuke slots due to memory ordering.", + "MetricExpr": "tma_nuke * (MACHINE_CLEARS.MEMORY_ORDERING / MACHINE_CLEARS.SLOW)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_nuke_group", + "MetricName": "tma_memory_ordering", + "MetricThreshold": "tma_memory_ordering > 0.02", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average CPU Utilization", - "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", - "MetricName": "CPU_Utilization" + "BriefDescription": "Counts the number of uops that are from the complex flows issued by the micro-sequencer (MS)", + "MetricExpr": "UOPS_RETIRED.MS / tma_info_slots", + "MetricGroup": "TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_ms_uops", + "MetricThreshold": "tma_ms_uops > 0.05", + "PublicDescription": "Counts the number of uops that are from the complex flows issued by the micro-sequencer (MS). This includes uops from flows due to complex instructions, faults, assists, and inserted flows.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Cycle cost per L2 hit", - "MetricExpr": "MEM_BOUND_STALLS.LOAD_L2_HIT / MEM_LOAD_UOPS_RETIRED.L2_HIT", - "MetricName": "Cycles_per_Demand_Load_L2_Hit" + "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to IEC or FPC RAT stalls, which can be due to FIQ or IEC reservation stalls in which the integer, floating point or SIMD scheduler is not able to accept uops.", + "MetricExpr": "TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_non_mem_scheduler", + "MetricThreshold": "tma_non_mem_scheduler > 0.1", + "ScaleUnit": "100%" }, { - "BriefDescription": "Cycle cost per LLC hit", - "MetricExpr": "MEM_BOUND_STALLS.LOAD_LLC_HIT / MEM_LOAD_UOPS_RETIRED.L3_HIT", - "MetricName": "Cycles_per_Demand_Load_L3_Hit" + "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to a machine clear (slow nuke).", + "MetricExpr": "TOPDOWN_BAD_SPECULATION.NUKE / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", + "MetricName": "tma_nuke", + "MetricThreshold": "tma_nuke > 0.05", + "ScaleUnit": "100%" }, { - "BriefDescription": "Cycle cost per DRAM hit", - "MetricExpr": "MEM_BOUND_STALLS.LOAD_DRAM_HIT / MEM_LOAD_UOPS_RETIRED.DRAM_HIT", - "MetricName": "Cycles_per_Demand_Load_DRAM_Hit" + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to other common frontend stalls not categorized.", + "MetricExpr": "TOPDOWN_FE_BOUND.OTHER / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_frontend_bandwidth_group", + "MetricName": "tma_other_fb", + "MetricThreshold": "tma_other_fb > 0.05", + "ScaleUnit": "100%" }, { - "BriefDescription": "Percent of instruction miss cost that hit in the L2", - "MetricExpr": "100 * MEM_BOUND_STALLS.IFETCH_L2_HIT / MEM_BOUND_STALLS.IFETCH", - "MetricName": "Inst_Miss_Cost_L2Hit_Percent" + "BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement due to a number of other load blocks.", + "MetricExpr": "LD_HEAD.OTHER_AT_RET / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_other_l1", + "MetricThreshold": "tma_other_l1 > 0.05", + "ScaleUnit": "100%" }, { - "BriefDescription": "Percent of instruction miss cost that hit in the L3", - "MetricExpr": "100 * MEM_BOUND_STALLS.IFETCH_LLC_HIT / MEM_BOUND_STALLS.IFETCH", - "MetricName": "Inst_Miss_Cost_L3Hit_Percent" + "BriefDescription": "Counts the number of cycles the core is stalled due to a demand load miss which hits in the L2, LLC, DRAM or MMIO (Non-DRAM) but could not be correctly attributed or cycles in which the load miss is waiting on a request buffer.", + "MetricExpr": "max(0, tma_load_store_bound - (tma_store_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound))", + "MetricGroup": "TopdownL3;tma_L3_group;tma_load_store_bound_group", + "MetricName": "tma_other_load_store", + "MetricThreshold": "tma_other_load_store > 0.1", + "ScaleUnit": "100%" }, { - "BriefDescription": "Percent of instruction miss cost that hit in DRAM", - "MetricExpr": "100 * MEM_BOUND_STALLS.IFETCH_DRAM_HIT / MEM_BOUND_STALLS.IFETCH", - "MetricName": "Inst_Miss_Cost_DRAMHit_Percent" + "BriefDescription": "Counts the number of uops retired excluding ms and fp div uops.", + "MetricExpr": "(TOPDOWN_RETIRING.ALL - UOPS_RETIRED.MS - UOPS_RETIRED.FPDIV) / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_base_group", + "MetricName": "tma_other_ret", + "MetricThreshold": "tma_other_ret > 0.3", + "ScaleUnit": "100%" }, { - "BriefDescription": "load ops retired per 1000 instruction", - "MetricExpr": "1e3 * MEM_UOPS_RETIRED.ALL_LOADS / INST_RETIRED.ANY", - "MetricName": "MemLoadPKI" + "BriefDescription": "Counts the number of machine clears relative to the number of nuke slots due to page faults.", + "MetricExpr": "tma_nuke * (MACHINE_CLEARS.PAGE_FAULT / MACHINE_CLEARS.SLOW)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_nuke_group", + "MetricName": "tma_page_fault", + "MetricThreshold": "tma_page_fault > 0.02", + "ScaleUnit": "100%" }, { - "BriefDescription": "C1 residency percent per core", - "MetricExpr": "cstate_core@c1\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C1_Core_Residency", + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to wrong predecodes.", + "MetricExpr": "TOPDOWN_FE_BOUND.PREDECODE / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_frontend_bandwidth_group", + "MetricName": "tma_predecode", + "MetricThreshold": "tma_predecode > 0.05", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per core", - "MetricExpr": "cstate_core@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Core_Residency", + "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to the physical register file unable to accept an entry (marble stalls).", + "MetricExpr": "TOPDOWN_BE_BOUND.REGISTER / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_register", + "MetricThreshold": "tma_register > 0.1", "ScaleUnit": "100%" }, { - "BriefDescription": "C7 residency percent per core", - "MetricExpr": "cstate_core@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Core_Residency", + "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to the reorder buffer being full (ROB stalls).", + "MetricExpr": "TOPDOWN_BE_BOUND.REORDER_BUFFER / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_reorder_buffer", + "MetricThreshold": "tma_reorder_buffer > 0.1", "ScaleUnit": "100%" }, { - "BriefDescription": "C2 residency percent per package", - "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C2_Pkg_Residency", + "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls", + "MetricExpr": "tma_backend_bound", + "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_aux_group", + "MetricName": "tma_resource_bound", + "MetricThreshold": "tma_resource_bound > 0.2", + "PublicDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls. Note that uops must be available for consumption in order for this event to count. If a uop is not available (IQ is empty), this event will not count.", "ScaleUnit": "100%" }, { - "BriefDescription": "C3 residency percent per package", - "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Pkg_Residency", + "BriefDescription": "Counts the numer of issue slots that result in retirement slots.", + "MetricExpr": "TOPDOWN_RETIRING.ALL / tma_info_slots", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.75", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per package", - "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Pkg_Residency", + "BriefDescription": "Counts the number of cycles, relative to the number of mem_scheduler slots, in which uops are blocked due to RSV full relative", + "MetricExpr": "tma_mem_scheduler * MEM_SCHEDULER_BLOCK.RSV / MEM_SCHEDULER_BLOCK.ALL", + "MetricGroup": "TopdownL4;tma_L4_group;tma_mem_scheduler_group", + "MetricName": "tma_rsv", + "MetricThreshold": "tma_rsv > 0.05", "ScaleUnit": "100%" }, { - "BriefDescription": "C7 residency percent per package", - "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Pkg_Residency", + "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to scoreboards from the instruction queue (IQ), jump execution unit (JEU), or microcode sequencer (MS).", + "MetricExpr": "TOPDOWN_BE_BOUND.SERIALIZATION / tma_info_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_serialization", + "MetricThreshold": "tma_serialization > 0.1", "ScaleUnit": "100%" }, { - "BriefDescription": "C8 residency percent per package", - "MetricExpr": "cstate_pkg@c8\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C8_Pkg_Residency", + "BriefDescription": "Counts the number of machine clears relative to the number of nuke slots due to SMC.", + "MetricExpr": "tma_nuke * (MACHINE_CLEARS.SMC / MACHINE_CLEARS.SLOW)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_nuke_group", + "MetricName": "tma_smc", + "MetricThreshold": "tma_smc > 0.02", "ScaleUnit": "100%" }, { - "BriefDescription": "C9 residency percent per package", - "MetricExpr": "cstate_pkg@c9\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C9_Pkg_Residency", + "BriefDescription": "Counts the number of cycles, relative to the number of mem_scheduler slots, in which uops are blocked due to store buffer full", + "MetricExpr": "tma_store_bound", + "MetricGroup": "TopdownL4;tma_L4_group;tma_mem_scheduler_group", + "MetricName": "tma_st_buffer", + "MetricThreshold": "tma_st_buffer > 0.05", "ScaleUnit": "100%" }, { - "BriefDescription": "C10 residency percent per package", - "MetricExpr": "cstate_pkg@c10\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C10_Pkg_Residency", + "BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement due to a first level TLB miss.", + "MetricExpr": "LD_HEAD.DTLB_MISS_AT_RET / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_stlb_hit", + "MetricThreshold": "tma_stlb_hit > 0.05", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement due to a second level TLB miss requiring a page walk.", + "MetricExpr": "LD_HEAD.PGWALK_AT_RET / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_stlb_miss", + "MetricThreshold": "tma_stlb_miss > 0.05", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Counts the number of cycles the core is stalled due to store buffer full.", + "MetricExpr": "tma_mem_scheduler * (MEM_SCHEDULER_BLOCK.ST_BUF / MEM_SCHEDULER_BLOCK.ALL)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_load_store_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.1", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement due to a store forward block.", + "MetricExpr": "LD_HEAD.ST_ADDR_AT_RET / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_store_fwd", + "MetricThreshold": "tma_store_fwd > 0.05", "ScaleUnit": "100%" } ] diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv index 4bcccab07ea9..cad51223d0ea 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -1,6 +1,6 @@ Family-model,Version,Filename,EventType GenuineIntel-6-(97|9A|B7|BA|BF),v1.18,alderlake,core -GenuineIntel-6-BE,v1.16,alderlaken,core +GenuineIntel-6-BE,v1.18,alderlaken,core GenuineIntel-6-(1C|26|27|35|36),v4,bonnell,core GenuineIntel-6-(3D|47),v26,broadwell,core GenuineIntel-6-56,v7,broadwellde,core From patchwork Sun Feb 19 09:28:09 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59102 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp772353wrn; Sun, 19 Feb 2023 01:32:47 -0800 (PST) X-Google-Smtp-Source: AK7set/0d7eUpXEho9lSXcMJoCQTfZSMN3u4kCAAD6ohQGlOvTibFqJx50RMG3ZLrtbYrMwGdNaf X-Received: by 2002:a05:6402:50d0:b0:4ac:d90e:92b with SMTP id h16-20020a05640250d000b004acd90e092bmr3210813edb.10.1676799166868; Sun, 19 Feb 2023 01:32:46 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799166; cv=none; d=google.com; s=arc-20160816; b=uASMaCaZcMSk3nTE5+xX/c2oMbs4TrDjUJFXLq5I31IMyJ6a6TVuqQdL0bR9mm0pL5 TCCgi8Z8ZsiQbzF/zYTQKKK/PCVAcY3rm8I61Dtn0NmltSct2S2w28CNa/jW0ZipetDP gZlWv0EfoIz8j5Q4N6GvCE8hyv+FwbGCt/cx/B4JFf2ri/RTTEIuUHk1XQa9v0GLVJjV +1R25tw3bIi6eL1QkB+wPvum+UUZQ45t+BhcCCa5xvZVSNWqHgbFCqRmgIw353wzaqhh b6Apfx9dXeEJHERSBKq6d8efHauW61lN1os9Ui1TLkdCer/9grwjpHYS9mq354RQPvuw DyRw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:from:subject :references:mime-version:message-id:in-reply-to:date:dkim-signature; bh=xj7HoqjVzWtTG1WEzlSYw7nbEo3g/+joKzJ7OferbPM=; b=c6xBUmwfOBycqrG94cJvWhFTxQrmtddCmOhfQAnEWri0/LiKkHUjRGiyEOS+6de6np 4Bsyc2A7R997UuhonVA5zvRxeHhtsSaWlpHuX+6FuV5kRbi06Zu/BS5mV242zbqJZ1fV OtvLDl3zYNzpxq68UjaNtV8K5JU7NF1OUnUoEkunGaW6Fv11dd8FVdI6h43r94f8aHV8 of9u4OkIesi+ik6AmPogXuB3QFwzFKCGd0QkP0kVXidRdHJAsfqb9gLmtsA8qZgq4pCc 9UeIXTVcPv5Xwy7Jv+iRyM33np1SfxEqKEQkvjJu32ZCcAGhPE//ZkPNmd6MusSlM8nr RHlA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=VFXOgSzE; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id n13-20020aa7c44d000000b004ad038e9ec7si11916254edr.308.2023.02.19.01.32.23; Sun, 19 Feb 2023 01:32:46 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=VFXOgSzE; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229902AbjBSJbX (ORCPT + 99 others); Sun, 19 Feb 2023 04:31:23 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53522 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229884AbjBSJbU (ORCPT ); Sun, 19 Feb 2023 04:31:20 -0500 Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com [IPv6:2607:f8b0:4864:20::1149]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2596912586 for ; Sun, 19 Feb 2023 01:30:47 -0800 (PST) Received: by mail-yw1-x1149.google.com with SMTP id 00721157ae682-53687aec426so19084817b3.4 for ; Sun, 19 Feb 2023 01:30:47 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:from:to:cc:subject:date :message-id:reply-to; bh=xj7HoqjVzWtTG1WEzlSYw7nbEo3g/+joKzJ7OferbPM=; b=VFXOgSzEe7xgPv8qGuqAZCZJ/GgNwCQnb23GJi01HPUFBwI2tO9Vlu/TSb7ih/MqNr R585nxBAMRiaVoql2/apEaSplcVeN2hAgjS58PyPcyPVVgXE5VXeDCA6Ssu/ksbWQ4gI g9k05/38+3QlC9xxaUEcJlC3DIz8W4dZvH+bsBTXrkTflW0tvbMHP1Rb2TJkZO7vLOc9 G8vWn4+IMEbGaM01zxloynAXbVuoFcIYiU5lKLTBkdwWvfKzXsDI5rUzpe01EjOU24Qo Nm5KtSRfq6yzMXexdtyAzJ3iih7NcJkP4p3SVxotLkRBVOeZrN94Ef2wz/AIP3r5lPIL futQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=xj7HoqjVzWtTG1WEzlSYw7nbEo3g/+joKzJ7OferbPM=; b=xVLiHLhYgmO38eBf3WVonxB/bE44HAEEvAq6+e+vxllo8AE7v9otc6o7z4XpR45FfZ Txio+5qHL8RzTDxqtn1ssTvTNW3oei7f/0VDK4x4NPL1YtDlmaoxXAQ6DRNDbuQfraSy JIpUQ9OluISU++jD7/4gDIOxbrnjrTlWoQ9Wa/1eOy5+azE9G3x7+gAaV12JzwqeAvlU p4PmQLpTfyQeHiuv5hKFPby6POCtAwht6YDcCOPRlWC37n8WZVPai7FEvJUSNyINYU3S nY9vXDMhTnWwY8zMs3OJuNXPcbfCFa7b2o0gn+Lgndml8MKZ/hkAEa8Auxkh4YmnPX5q /YMA== X-Gm-Message-State: AO0yUKXs287sNT2vhcuA+WHRdxM4xl2j19DCspojge1Ao8vOv2f7GcQS ElMmyQTqF6MbbJQVDaAoDzwP89abrmG+ X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a25:e90d:0:b0:80b:b818:e841 with SMTP id n13-20020a25e90d000000b0080bb818e841mr1579149ybd.94.1676799043073; Sun, 19 Feb 2023 01:30:43 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:09 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-13-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 12/51] perf vendor events intel: Refresh broadwell metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758251363303626828?= X-GMAIL-MSGID: =?utf-8?q?1758251363303626828?= Update the broadwell metrics to TMA version 4.5. Generation was done using https://github.com/intel/perfmon. Notable changes are TMA info metrics are renamed from their node name to be lower case and prefixed by tma_info_, MetricThreshold expressions are added, and the smi_cost metric group is added replicating existing hard coded metrics in stat-shadow. Signed-off-by: Ian Rogers --- .../arch/x86/broadwell/bdw-metrics.json | 1439 +++++++++-------- 1 file changed, 805 insertions(+), 634 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/broadwell/bdw-metrics.json b/tools/perf/pmu-events/arch/x86/broadwell/bdw-metrics.json index c3ea39d6c944..51cf8560a8d3 100644 --- a/tools/perf/pmu-events/arch/x86/broadwell/bdw-metrics.json +++ b/tools/perf/pmu-events/arch/x86/broadwell/bdw-metrics.json @@ -1,965 +1,1136 @@ [ { - "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", - "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / SLOTS", - "MetricGroup": "PGO;TopdownL1;tma_L1_group", - "MetricName": "tma_frontend_bound", - "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Machine_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound.", + "BriefDescription": "C2 residency percent per package", + "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C2_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", - "MetricExpr": "4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / SLOTS", - "MetricGroup": "Frontend;TopdownL2;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_latency", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: RS_EVENTS.EMPTY_END", + "BriefDescription": "C3 residency percent per core", + "MetricExpr": "cstate_core@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses.", - "MetricExpr": "ICACHE.IFDATA_STALL / CLKS", - "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_icache_misses", + "BriefDescription": "C3 residency percent per package", + "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses", - "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + cpu@ITLB_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * ITLB_MISSES.WALK_COMPLETED) / CLKS", - "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_itlb_misses", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: ITLB_MISSES.WALK_COMPLETED", + "BriefDescription": "C6 residency percent per core", + "MetricExpr": "cstate_core@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", - "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY) / CLKS", - "MetricGroup": "FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_branch_resteers", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "BriefDescription": "C6 residency percent per package", + "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. ", - "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES * tma_branch_resteers / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL4;tma_branch_resteers_group", - "MetricName": "tma_mispredicts_resteers", + "BriefDescription": "C7 residency percent per core", + "MetricExpr": "cstate_core@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. ", - "MetricExpr": "MACHINE_CLEARS.COUNT * tma_branch_resteers / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)", - "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_branch_resteers_group", - "MetricName": "tma_clears_resteers", + "BriefDescription": "C7 residency percent per package", + "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears", - "MetricExpr": "tma_branch_resteers - tma_mispredicts_resteers - tma_clears_resteers", - "MetricGroup": "BigFoot;FetchLat;TopdownL4;tma_branch_resteers_group", - "MetricName": "tma_unknown_branches", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (First fetch or hitting BPU capacity limit). Sample with: BACLEARS.ANY", - "ScaleUnit": "100%" + "BriefDescription": "Uncore frequency per die [GHZ]", + "MetricExpr": "tma_info_socket_clks / #num_dies / duration_time / 1e9", + "MetricGroup": "SoC", + "MetricName": "UNCORE_FREQ" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / CLKS", - "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_dsb_switches", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty.", + "BriefDescription": "Percentage of cycles spent in System Management Interrupts.", + "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)", + "MetricGroup": "smi", + "MetricName": "smi_cycles", + "MetricThreshold": "smi_cycles > 0.1", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", - "MetricExpr": "ILD_STALL.LCP / CLKS", - "MetricGroup": "FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_lcp", - "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs.", - "ScaleUnit": "100%" + "BriefDescription": "Number of SMI interrupts.", + "MetricExpr": "msr@smi@", + "MetricGroup": "smi", + "MetricName": "smi_num", + "ScaleUnit": "1SMI#" }, { - "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", - "MetricExpr": "2 * IDQ.MS_SWITCHES / CLKS", - "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_ms_switches", - "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES", + "BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset", + "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_4k_aliasing", + "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", - "MetricExpr": "tma_frontend_bound - tma_fetch_latency", - "MetricGroup": "FetchBW;Frontend;TopdownL2;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_bandwidth", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend.", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_info_slots", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_alu_op_utilization", + "MetricThreshold": "tma_alu_op_utilization > 0.6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", - "MetricExpr": "(IDQ.ALL_MITE_CYCLES_ANY_UOPS - IDQ.ALL_MITE_CYCLES_4_UOPS) / CORE_CLKS / 2", - "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_fetch_bandwidth_group", - "MetricName": "tma_mite", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", + "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", + "MetricExpr": "100 * OTHER_ASSISTS.ANY_WB_ASSIST / tma_info_slots", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_assists", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.ANY", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", - "MetricExpr": "(IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4_UOPS) / CORE_CLKS / 2", - "MetricGroup": "DSB;FetchBW;TopdownL3;tma_fetch_bandwidth_group", - "MetricName": "tma_dsb", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", + "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "1 - (tma_frontend_bound + tma_bad_speculation + tma_retiring)", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_backend_bound", + "MetricThreshold": "tma_backend_bound > 0.2", + "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound.", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", - "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group", + "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_bad_speculation", + "MetricThreshold": "tma_bad_speculation > 0.15", "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", + "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL2;tma_L2_group;tma_bad_speculation_group", + "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM", "MetricName": "tma_branch_mispredicts", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_info_branch_misprediction_cost, tma_mispredicts_resteers", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", - "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts", - "MetricGroup": "BadSpec;MachineClears;TopdownL2;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_machine_clears", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", + "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY) / tma_info_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_branch_resteers", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", "ScaleUnit": "100%" }, { - "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", - "MetricExpr": "1 - (tma_frontend_bound + tma_bad_speculation + tma_retiring)", - "MetricGroup": "TopdownL1;tma_L1_group", - "MetricName": "tma_backend_bound", - "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound.", + "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_cisc", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", - "MetricExpr": "(CYCLE_ACTIVITY.STALLS_MEM_ANY + RESOURCE_STALLS.SB) / (CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if IPC > 1.8 else (UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC - RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else RESOURCE_STALLS.SB)) * tma_backend_bound", - "MetricGroup": "Backend;TopdownL2;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_memory_bound", - "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears", + "MetricExpr": "MACHINE_CLEARS.COUNT * tma_branch_resteers / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)", + "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC", + "MetricName": "tma_clears_resteers", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", - "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_MISS) / CLKS, 0)", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l1_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT_PS;MEM_LOAD_UOPS_RETIRED.HIT_LFB_PS", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(60 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS))) + 43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS)))) / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_contested_accesses", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + cpu@DTLB_LOAD_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * DTLB_LOAD_MISSES.WALK_COMPLETED) / CLKS", - "MetricGroup": "MemoryTLB;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_dtlb_load", - "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS", + "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_backend_bound - tma_memory_bound", + "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", - "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_store_fwd_blk", - "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS))) / tma_info_clks", + "MetricGroup": "Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_data_sharing", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", - "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO) / CLKS", - "MetricGroup": "Offcore;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_lock_latency", - "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOADS_PS", + "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", + "MetricExpr": "ARITH.FPU_DIV_ACTIVE / tma_info_core_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_divider", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_UOPS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary", - "MetricExpr": "Load_Miss_Real_Latency * LD_BLOCKS.NO_SR / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_split_loads", - "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS_PS", + "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", + "MetricConstraint": "NO_GROUP_EVENTS_SMT", + "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS)) * CYCLE_ACTIVITY.STALLS_L2_MISS / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_MISS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset", - "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_4k_aliasing", - "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", + "MetricExpr": "(IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4_UOPS) / tma_info_core_clks / 2", + "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_dsb", + "MetricThreshold": "tma_dsb > 0.15 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", - "MetricExpr": "Load_Miss_Real_Latency * cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=1@ / CLKS", - "MetricGroup": "MemoryBW;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_fb_full", - "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory).", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_clks", + "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_dsb_switches", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Related metrics: tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_iptb, tma_lcp", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads", - "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / CLKS", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l2_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L2_HIT_PS", + "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", + "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + cpu@DTLB_LOAD_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * DTLB_LOAD_MISSES.WALK_COMPLETED) / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group", + "MetricName": "tma_dtlb_load", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", - "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS) * CYCLE_ACTIVITY.STALLS_L2_MISS / CLKS", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l3_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", + "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", + "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + cpu@DTLB_STORE_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * DTLB_STORE_MISSES.WALK_COMPLETED) / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group", + "MetricName": "tma_dtlb_store", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_UOPS_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", - "MetricExpr": "(60 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS))) + 43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS)))) / CLKS", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_contested_accesses", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS", + "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing", + "MetricExpr": "60 * OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group", + "MetricName": "tma_false_sharing", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", - "MetricExpr": "43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS))) / CLKS", - "MetricGroup": "Offcore;Snoop;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_data_sharing", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT_PS", + "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_info_load_miss_real_latency * cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=1@ / tma_info_clks", + "MetricGroup": "MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricName": "tma_fb_full", + "MetricThreshold": "tma_fb_full > 0.3", + "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", - "MetricExpr": "29 * (MEM_LOAD_UOPS_RETIRED.L3_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS))) / CLKS", - "MetricGroup": "MemoryLat;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_l3_hit_latency", - "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", + "MetricExpr": "tma_frontend_bound - tma_fetch_latency", + "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Related metrics: tma_dsb_switches, tma_info_dsb_coverage, tma_info_iptb, tma_lcp", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", - "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on else OFFCORE_REQUESTS_BUFFER.SQ_FULL) / CORE_CLKS", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_sq_full", - "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). The Super Queue is used for requests to access the L2 cache or to go out to the Uncore.", + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", + "MetricExpr": "4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / tma_info_slots", + "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: RS_EVENTS.EMPTY_END", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", - "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS)) * CYCLE_ACTIVITY.STALLS_L2_MISS / CLKS", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_dram_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_MISS_PS", + "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", + "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_fp_arith", + "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=4@) / CLKS", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_dram_bound_group", - "MetricName": "tma_mem_bandwidth", - "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that).", + "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired", + "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", + "MetricName": "tma_fp_scalar", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / CLKS - tma_mem_bandwidth", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_dram_bound_group", - "MetricName": "tma_mem_latency", - "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that).", + "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths", + "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0x3c@ / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", + "MetricName": "tma_fp_vector", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", - "MetricExpr": "RESOURCE_STALLS.SB / CLKS", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_store_bound", - "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_UOPS_RETIRED.ALL_STORES_PS", + "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", + "MetricName": "tma_fp_vector_128b", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", - "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / CLKS", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_store_bound_group", - "MetricName": "tma_store_latency", - "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full)", + "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", + "MetricName": "tma_fp_vector_256b", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing", - "MetricExpr": "60 * OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM / CLKS", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_store_bound_group", - "MetricName": "tma_false_sharing", - "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM", + "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", + "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / tma_info_slots", + "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents rate of split store accesses", - "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / CORE_CLKS", - "MetricGroup": "TopdownL4;tma_store_bound_group", - "MetricName": "tma_split_stores", - "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_UOPS_RETIRED.SPLIT_STORES_PS", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences", + "MetricExpr": "tma_microcode_sequencer", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_heavy_operations", + "MetricThreshold": "tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", - "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + cpu@DTLB_STORE_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * DTLB_STORE_MISSES.WALK_COMPLETED) / CLKS", - "MetricGroup": "MemoryTLB;TopdownL4;tma_store_bound_group", - "MetricName": "tma_dtlb_store", - "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_UOPS_RETIRED.STLB_MISS_STORES_PS", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses.", + "MetricExpr": "ICACHE.IFDATA_STALL / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_icache_misses", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", - "MetricExpr": "tma_backend_bound - tma_memory_bound", - "MetricGroup": "Backend;Compute;TopdownL2;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_core_bound", - "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", - "ScaleUnit": "100%" + "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", + "MetricExpr": "tma_info_turbo_utilization * TSC / 1e9 / duration_time", + "MetricGroup": "Power;Summary", + "MetricName": "tma_info_average_frequency" }, { - "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", - "MetricExpr": "ARITH.FPU_DIV_ACTIVE / CORE_CLKS", - "MetricGroup": "TopdownL3;tma_core_bound_group", - "MetricName": "tma_divider", - "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_UOPS", - "ScaleUnit": "100%" + "BriefDescription": "Branch instructions per taken branch.", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_bptkbranch" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", - "MetricExpr": "((CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if IPC > 1.8 else (UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC - RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else RESOURCE_STALLS.SB)) - RESOURCE_STALLS.SB - CYCLE_ACTIVITY.STALLS_MEM_ANY) / CLKS", - "MetricGroup": "PortsUtil;TopdownL3;tma_core_bound_group", - "MetricName": "tma_ports_utilization", - "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", - "ScaleUnit": "100%" + "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)", + "MetricExpr": "(tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) * tma_info_slots / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_branch_misprediction_cost", + "PublicDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_branch_mispredicts, tma_mispredicts_resteers" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\,cmask\\=1@ / 2 if #SMT_on else (CYCLE_ACTIVITY.STALLS_TOTAL - RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) / CORE_CLKS)", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_0", - "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", - "ScaleUnit": "100%" + "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "Pipeline", + "MetricName": "tma_info_clks" }, { - "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) / CORE_CLKS)", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_1", - "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful.", - "ScaleUnit": "100%" + "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_clks))", + "MetricGroup": "SMT", + "MetricName": "tma_info_core_clks" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / CORE_CLKS)", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_2", - "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_core_clks", + "MetricGroup": "Ret;SMT;TmaL1;tma_L1_group", + "MetricName": "tma_info_coreipc" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise).", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / CORE_CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_3m", - "ScaleUnit": "100%" + "BriefDescription": "Cycles Per Instruction (per Logical Processor)", + "MetricExpr": "1 / tma_info_ipc", + "MetricGroup": "Mem;Pipeline", + "MetricName": "tma_info_cpi" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", - "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / SLOTS", - "MetricGroup": "TopdownL5;tma_ports_utilized_3m_group", - "MetricName": "tma_alu_op_utilization", - "ScaleUnit": "100%" + "BriefDescription": "Average CPU Utilization", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": "HPC;Summary", + "MetricName": "tma_info_cpu_utilization" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch) Sample with: UOPS_DISPATCHED_PORT.PORT_0", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_0 / CORE_CLKS", - "MetricGroup": "Compute;TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_0", - "ScaleUnit": "100%" + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD / OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_data_l2_mlp" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU) Sample with: UOPS_DISPATCHED_PORT.PORT_1", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_1 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_1", - "ScaleUnit": "100%" + "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", + "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_REQUESTS.ALL) / 1e6 / duration_time / 1e3", + "MetricGroup": "HPC;Mem;MemoryBW;SoC;tma_issueBW", + "MetricName": "tma_info_dram_bw_use", + "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_fb_full, tma_mem_bandwidth, tma_sq_full" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU) Sample with: UOPS_DISPATCHED.PORT_5", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_5 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_5", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_dsb_coverage", + "MetricThreshold": "tma_info_dsb_coverage < 0.7 & tma_info_ipc / 4 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_iptb, tma_lcp" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU) Sample with: UOPS_DISPATCHED_PORT.PORT_6", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_6 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_6", - "ScaleUnit": "100%" + "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-thread", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@", + "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", + "MetricName": "tma_info_execute" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations Sample with: UOPS_DISPATCHED.PORT_2_3", - "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_2 + UOPS_DISPATCHED_PORT.PORT_3 + UOPS_DISPATCHED_PORT.PORT_7 - UOPS_DISPATCHED_PORT.PORT_4) / (2 * CORE_CLKS)", - "MetricGroup": "TopdownL5;tma_ports_utilized_3m_group", - "MetricName": "tma_load_op_utilization", - "ScaleUnit": "100%" + "BriefDescription": "The ratio of Executed- by Issued-Uops", + "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", + "MetricGroup": "Cor;Pipeline", + "MetricName": "tma_info_execute_per_issue", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage." }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [ICL+] Loads) Sample with: UOPS_DISPATCHED_PORT.PORT_2", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_2 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_load_op_utilization_group", - "MetricName": "tma_port_2", - "ScaleUnit": "100%" + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / tma_info_core_clks", + "MetricGroup": "Flops;Ret", + "MetricName": "tma_info_flopc" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [ICL+] Loads) Sample with: UOPS_DISPATCHED_PORT.PORT_3", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_3 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_load_op_utilization_group", - "MetricName": "tma_port_3", - "ScaleUnit": "100%" + "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)", + "MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0x3c@) / (2 * tma_info_core_clks)", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_fp_arith_utilization", + "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)." }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations", - "MetricExpr": "tma_port_4", - "MetricGroup": "TopdownL5;tma_ports_utilized_3m_group", - "MetricName": "tma_store_op_utilization", - "ScaleUnit": "100%" + "BriefDescription": "Giga Floating Point Operations Per Second", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / duration_time", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_gflops", + "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine." }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data) Sample with: UOPS_DISPATCHED_PORT.PORT_4", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_4 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_store_op_utilization_group", - "MetricName": "tma_port_4", - "ScaleUnit": "100%" + "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", + "MetricExpr": "UOPS_EXECUTED.THREAD / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)", + "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", + "MetricName": "tma_info_ilp" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 7 ([HSW+]simple Store-address) Sample with: UOPS_DISPATCHED_PORT.PORT_7", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_7 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_store_op_utilization_group", - "MetricName": "tma_port_7", - "ScaleUnit": "100%" + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Summary;TmaL1;tma_L1_group", + "MetricName": "tma_info_instructions", + "PublicDescription": "Total number of retired Instructions. Sample with: INST_RETIRED.PREC_DIST" }, { - "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group", - "MetricName": "tma_retiring", - "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.RETIRE_SLOTS", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0x3c@)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_iparith", + "MetricThreshold": "tma_info_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). May undercount due to FMA double counting. Approximated prior to BDW." }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", - "MetricExpr": "tma_retiring - tma_heavy_operations", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_retiring_group", - "MetricName": "tma_light_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx128", + "MetricThreshold": "tma_info_iparith_avx128 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", - "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", - "MetricGroup": "HPC;TopdownL3;tma_light_operations_group", - "MetricName": "tma_fp_arith", - "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx256", + "MetricThreshold": "tma_info_iparith_avx256 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "This metric serves as an approximation of legacy x87 usage", - "MetricExpr": "INST_RETIRED.X87 * UPI / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Compute;TopdownL4;tma_fp_arith_group", - "MetricName": "tma_x87_use", - "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_iparith_scalar_dp", + "MetricThreshold": "tma_info_iparith_scalar_dp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE) / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Compute;Flops;TopdownL4;tma_fp_arith_group", - "MetricName": "tma_fp_scalar", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_iparith_scalar_sp", + "MetricThreshold": "tma_info_iparith_scalar_sp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths", - "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Compute;Flops;TopdownL4;tma_fp_arith_group", - "MetricName": "tma_fp_vector", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Branches;Fed;InsType", + "MetricName": "tma_info_ipbranch", + "MetricThreshold": "tma_info_ipbranch < 8" }, { - "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors", - "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Compute;Flops;TopdownL5;tma_fp_vector_group", - "MetricName": "tma_fp_vector_128b", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions Per Cycle (per Logical Processor)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_clks", + "MetricGroup": "Ret;Summary", + "MetricName": "tma_info_ipc" }, { - "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors", - "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Compute;Flops;TopdownL5;tma_fp_vector_group", - "MetricName": "tma_fp_vector_256b", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_ipcall", + "MetricThreshold": "tma_info_ipcall < 200" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences", - "MetricExpr": "tma_microcode_sequencer", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_retiring_group", - "MetricName": "tma_heavy_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences. This highly-correlates with the uop length of these instructions/sequences.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", + "MetricGroup": "Branches;OS", + "MetricName": "tma_info_ipfarbranch", + "MetricThreshold": "tma_info_ipfarbranch < 1e6" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / SLOTS", - "MetricGroup": "MicroSeq;TopdownL3;tma_heavy_operations_group", - "MetricName": "tma_microcode_sequencer", - "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_ipflop", + "MetricThreshold": "tma_info_ipflop < 10" }, { - "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", - "MetricExpr": "100 * OTHER_ASSISTS.ANY_WB_ASSIST / SLOTS", - "MetricGroup": "TopdownL4;tma_microcode_sequencer_group", - "MetricName": "tma_assists", - "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.ANY", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_LOADS", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipload", + "MetricThreshold": "tma_info_ipload < 3" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", - "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", - "MetricGroup": "TopdownL4;tma_microcode_sequencer_group", - "MetricName": "tma_cisc", - "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).", + "MetricExpr": "tma_info_instructions / (UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * cpu@BR_MISP_EXEC.ALL_BRANCHES\\,umask\\=0xE4@)", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_indirect", + "MetricThreshold": "tma_info_ipmisp_indirect < 1e3" }, { - "BriefDescription": "Instructions Per Cycle (per Logical Processor)", - "MetricExpr": "INST_RETIRED.ANY / CLKS", - "MetricGroup": "Ret;Summary", - "MetricName": "IPC" + "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BadSpec;BrMispredicts", + "MetricName": "tma_info_ipmispredict", + "MetricThreshold": "tma_info_ipmispredict < 200" }, { - "BriefDescription": "Uops Per Instruction", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY", - "MetricGroup": "Pipeline;Ret;Retire", - "MetricName": "UPI" + "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_STORES", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipstore", + "MetricThreshold": "tma_info_ipstore < 8" }, { "BriefDescription": "Instruction per taken branch", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW", - "MetricName": "UpTB" + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", + "MetricName": "tma_info_iptb", + "MetricThreshold": "tma_info_iptb < 9", + "PublicDescription": "Instruction per taken branch. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_lcp" }, { - "BriefDescription": "Cycles Per Instruction (per Logical Processor)", - "MetricExpr": "1 / IPC", - "MetricGroup": "Mem;Pipeline", - "MetricName": "CPI" + "BriefDescription": "Instructions per speculative Unknown Branch Misprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_instructions / BACLEARS.ANY", + "MetricGroup": "Fed", + "MetricName": "tma_info_ipunknown_branch" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "Pipeline", - "MetricName": "CLKS" + "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_cpi" }, { - "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "4 * CORE_CLKS", - "MetricGroup": "tma_L1_group", - "MetricName": "SLOTS" + "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_utilization", + "MetricThreshold": "tma_info_kernel_utilization > 0.05" }, { - "BriefDescription": "The ratio of Executed- by Issued-Uops", - "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", - "MetricGroup": "Cor;Pipeline", - "MetricName": "Execute_per_Issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage." + "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw" }, { - "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", - "MetricExpr": "INST_RETIRED.ANY / CORE_CLKS", - "MetricGroup": "Ret;SMT;tma_L1_group", - "MetricName": "CoreIPC" + "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "tma_info_l1d_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw_1t" }, { - "BriefDescription": "Floating Point Operations Per Cycle", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / CORE_CLKS", - "MetricGroup": "Flops;Ret", - "MetricName": "FLOPc" + "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l1mpki" }, { - "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)) / (2 * CORE_CLKS)", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "FP_Arith_Utilization", - "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)." + "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)", - "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", - "MetricName": "ILP" + "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "tma_info_l2_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw_1t" }, { - "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else CLKS))", - "MetricGroup": "SMT", - "MetricName": "CORE_CLKS" + "BriefDescription": "L2 cache hits per kilo instruction for all request types (including speculative)", + "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2hpki_all" }, { - "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_LOADS", - "MetricGroup": "InsType", - "MetricName": "IpLoad" + "BriefDescription": "L2 cache hits per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2hpki_load" }, { - "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_STORES", - "MetricGroup": "InsType", - "MetricName": "IpStore" + "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheMisses;Mem", + "MetricName": "tma_info_l2mpki" }, { - "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Branches;Fed;InsType", - "MetricName": "IpBranch" + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all request types (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem;Offcore", + "MetricName": "tma_info_l2mpki_all" }, { - "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "IpCall" + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2mpki_load" }, { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO", - "MetricName": "IpTB" + "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "0", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_l3_cache_access_bw_1t" }, { - "BriefDescription": "Branch instructions per taken branch. ", - "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "BpTkBranch" + "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw" }, { - "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", - "MetricGroup": "Flops;InsType", - "MetricName": "IpFLOP" + "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "tma_info_l3_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw_1t" }, { - "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE))", - "MetricGroup": "Flops;InsType", - "MetricName": "IpArith", - "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). May undercount due to FMA double counting. Approximated prior to BDW." + "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L3_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l3mpki" }, { - "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE", - "MetricGroup": "Flops;FpScalar;InsType", - "MetricName": "IpArith_Scalar_SP", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "Average Latency for L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_load_l2_miss_latency" }, { - "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE", - "MetricGroup": "Flops;FpScalar;InsType", - "MetricName": "IpArith_Scalar_DP", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_load_l2_mlp" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX128", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_UOPS_RETIRED.L1_MISS + MEM_LOAD_UOPS_RETIRED.HIT_LFB)", + "MetricGroup": "Mem;MemoryBound;MemoryLat", + "MetricName": "tma_info_load_miss_real_latency" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX256", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "Average number of parallel requests to external memory", + "MetricExpr": "UNC_ARB_TRK_OCCUPANCY.ALL / UNC_ARB_TRK_OCCUPANCY.CYCLES_WITH_ANY_REQUEST", + "MetricGroup": "Mem;SoC", + "MetricName": "tma_info_mem_parallel_requests", + "PublicDescription": "Average number of parallel requests to external memory. Accounts for all requests" }, { - "BriefDescription": "Total number of retired Instructions Sample with: INST_RETIRED.PREC_DIST", - "MetricExpr": "INST_RETIRED.ANY", - "MetricGroup": "Summary;tma_L1_group", - "MetricName": "Instructions" + "BriefDescription": "Average latency of all requests to external memory (in Uncore cycles)", + "MetricExpr": "UNC_ARB_TRK_OCCUPANCY.ALL / UNC_ARB_TRK_REQUESTS.ALL", + "MetricGroup": "Mem;SoC", + "MetricName": "tma_info_mem_request_latency" + }, + { + "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES", + "MetricGroup": "Mem;MemoryBW;MemoryBound", + "MetricName": "tma_info_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)" + }, + { + "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", + "MetricExpr": "(cpu@ITLB_MISSES.WALK_DURATION\\,cmask\\=1@ + cpu@DTLB_LOAD_MISSES.WALK_DURATION\\,cmask\\=1@ + cpu@DTLB_STORE_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * (DTLB_STORE_MISSES.WALK_COMPLETED + DTLB_LOAD_MISSES.WALK_COMPLETED + ITLB_MISSES.WALK_COMPLETED)) / tma_info_core_clks", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_page_walks_utilization", + "MetricThreshold": "tma_info_page_walks_utilization > 0.5" }, { "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=1@", "MetricGroup": "Pipeline;Ret", - "MetricName": "Retire" + "MetricName": "tma_info_retire" }, { - "BriefDescription": "", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@", - "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", - "MetricName": "Execute" + "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", + "MetricExpr": "4 * tma_info_core_clks", + "MetricGroup": "TmaL1;tma_L1_group", + "MetricName": "tma_info_slots" }, { - "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", - "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", - "MetricGroup": "DSB;Fed;FetchBW", - "MetricName": "DSB_Coverage" + "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", + "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", + "MetricGroup": "SMT", + "MetricName": "tma_info_smt_2t_utilization" }, { - "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BadSpec;BrMispredicts", - "MetricName": "IpMispredict" + "BriefDescription": "Socket actual clocks when any core is active on that socket", + "MetricExpr": "UNC_CLOCK.SOCKET", + "MetricGroup": "SoC", + "MetricName": "tma_info_socket_clks" }, { - "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)", - "MetricExpr": "(tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) * SLOTS / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BrMispredicts", - "MetricName": "Branch_Misprediction_Cost" + "BriefDescription": "Average Frequency Utilization relative nominal frequency", + "MetricExpr": "tma_info_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_turbo_utilization" }, { - "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", - "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_UOPS_RETIRED.L1_MISS + MEM_LOAD_UOPS_RETIRED.HIT_LFB)", - "MetricGroup": "Mem;MemoryBound;MemoryLat", - "MetricName": "Load_Miss_Real_Latency" + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY", + "MetricGroup": "Pipeline;Ret;Retire", + "MetricName": "tma_info_uoppi", + "MetricThreshold": "tma_info_uoppi > 1.05" }, { - "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)", - "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES", - "MetricGroup": "Mem;MemoryBW;MemoryBound", - "MetricName": "MLP" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW", + "MetricName": "tma_info_uptb", + "MetricThreshold": "tma_info_uptb < 6" }, { - "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L1_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L1MPKI" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses", + "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + cpu@ITLB_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * ITLB_MISSES.WALK_COMPLETED) / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_itlb_misses", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: ITLB_MISSES.WALK_COMPLETED", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L2_MISS / INST_RETIRED.ANY", - "MetricGroup": "Backend;CacheMisses;Mem", - "MetricName": "L2MPKI" + "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", + "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_MISS) / tma_info_clks, 0)", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group", + "MetricName": "tma_l1_bound", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT_PS;MEM_LOAD_UOPS_RETIRED.HIT_LFB_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all request types (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem;Offcore", - "MetricName": "L2MPKI_All" + "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads", + "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_clks", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l2_bound", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L2_HIT_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2MPKI_Load" + "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", + "MetricConstraint": "NO_GROUP_EVENTS_SMT", + "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS) * CYCLE_ACTIVITY.STALLS_L2_MISS / tma_info_clks", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l3_bound", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache hits per kilo instruction for all request types (including speculative)", - "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2HPKI_All" + "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "29 * (MEM_LOAD_UOPS_RETIRED.L3_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS))) / tma_info_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group", + "MetricName": "tma_l3_hit_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS. Related metrics: tma_mem_latency", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache hits per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2HPKI_Load" + "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", + "MetricExpr": "ILD_STALL.LCP / tma_info_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_lcp", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_iptb", + "ScaleUnit": "100%" }, { - "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L3_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L3MPKI" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", + "MetricExpr": "tma_retiring - tma_heavy_operations", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_light_operations", + "MetricThreshold": "tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", + "ScaleUnit": "100%" }, { - "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", - "MetricConstraint": "NO_NMI_WATCHDOG", - "MetricExpr": "(cpu@ITLB_MISSES.WALK_DURATION\\,cmask\\=1@ + cpu@DTLB_LOAD_MISSES.WALK_DURATION\\,cmask\\=1@ + cpu@DTLB_STORE_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * (DTLB_STORE_MISSES.WALK_COMPLETED + DTLB_LOAD_MISSES.WALK_COMPLETED + ITLB_MISSES.WALK_COMPLETED)) / CORE_CLKS", - "MetricGroup": "Mem;MemoryTLB", - "MetricName": "Page_Walks_Utilization" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_2 + UOPS_DISPATCHED_PORT.PORT_3 + UOPS_DISPATCHED_PORT.PORT_7 - UOPS_DISPATCHED_PORT.PORT_4) / (2 * tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_load_op_utilization", + "MetricThreshold": "tma_load_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations. Sample with: UOPS_DISPATCHED.PORT_2_3", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW" + "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO) / tma_info_clks", + "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group", + "MetricName": "tma_lock_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOADS_PS. Related metrics: tma_store_latency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW" + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts", + "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW" + "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=4@) / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW", + "MetricName": "tma_mem_bandwidth", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_sq_full", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "L1D_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW_1T" + "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_clks - tma_mem_bandwidth", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat", + "MetricName": "tma_mem_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_l3_hit_latency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "L2_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW_1T" + "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(CYCLE_ACTIVITY.STALLS_MEM_ANY + RESOURCE_STALLS.SB) / (CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - (UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if tma_info_ipc > 1.8 else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) * tma_backend_bound", + "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "L3_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW_1T" + "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / tma_info_slots", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS", + "MetricName": "tma_microcode_sequencer", + "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "0", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "L3_Cache_Access_BW_1T" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage", + "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES * tma_branch_resteers / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)", + "MetricGroup": "BadSpec;BrMispredicts;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueBM", + "MetricName": "tma_mispredicts_resteers", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Related metrics: tma_branch_mispredicts, tma_info_branch_misprediction_cost", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average CPU Utilization", - "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", - "MetricGroup": "HPC;Summary", - "MetricName": "CPU_Utilization" + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", + "MetricExpr": "(IDQ.ALL_MITE_CYCLES_ANY_UOPS - IDQ.ALL_MITE_CYCLES_4_UOPS) / tma_info_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_mite", + "MetricThreshold": "tma_mite > 0.1 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", - "MetricExpr": "Turbo_Utilization * TSC / 1e9 / duration_time", - "MetricGroup": "Power;Summary", - "MetricName": "Average_Frequency" + "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", + "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_clks", + "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", + "ScaleUnit": "100%" }, { - "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / duration_time", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "GFLOPs", - "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine." + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_0 / tma_info_core_clks", + "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_0", + "MetricThreshold": "tma_port_0 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average Frequency Utilization relative nominal frequency", - "MetricExpr": "CLKS / CPU_CLK_UNHALTED.REF_TSC", - "MetricGroup": "Power", - "MetricName": "Turbo_Utilization" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_1 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_1", + "MetricThreshold": "tma_port_1 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", - "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", - "MetricGroup": "SMT", - "MetricName": "SMT_2T_Utilization" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [ICL+] Loads)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_2 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_group", + "MetricName": "tma_port_2", + "MetricThreshold": "tma_port_2 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [ICL+] Loads). Sample with: UOPS_DISPATCHED_PORT.PORT_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "OS", - "MetricName": "Kernel_Utilization" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [ICL+] Loads)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_3 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_group", + "MetricName": "tma_port_3", + "MetricThreshold": "tma_port_3 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [ICL+] Loads). Sample with: UOPS_DISPATCHED_PORT.PORT_3", + "ScaleUnit": "100%" }, { - "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", - "MetricGroup": "OS", - "MetricName": "Kernel_CPI" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data)", + "MetricExpr": "tma_store_op_utilization", + "MetricGroup": "TopdownL6;tma_L6_group;tma_issueSpSt;tma_store_op_utilization_group", + "MetricName": "tma_port_4", + "MetricThreshold": "tma_port_4 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data). Sample with: UOPS_DISPATCHED_PORT.PORT_4. Related metrics: tma_split_stores", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", - "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_REQUESTS.ALL) / 1e6 / duration_time / 1e3", - "MetricGroup": "HPC;Mem;MemoryBW;SoC", - "MetricName": "DRAM_BW_Use" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_5 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_5", + "MetricThreshold": "tma_port_5 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average latency of all requests to external memory (in Uncore cycles)", - "MetricExpr": "MEM_Parallel_Requests", - "MetricGroup": "Mem;SoC", - "MetricName": "MEM_Request_Latency" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_6 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_6", + "MetricThreshold": "tma_port_6 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_6. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average number of parallel requests to external memory. Accounts for all requests", - "MetricExpr": "UNC_ARB_TRK_OCCUPANCY.ALL / UNC_ARB_TRK_REQUESTS.ALL", - "MetricGroup": "Mem;SoC", - "MetricName": "MEM_Parallel_Requests" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 7 ([HSW+]simple Store-address)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_7 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_store_op_utilization_group", + "MetricName": "tma_port_7", + "MetricThreshold": "tma_port_7 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 7 ([HSW+]simple Store-address). Sample with: UOPS_DISPATCHED_PORT.PORT_7", + "ScaleUnit": "100%" }, { - "BriefDescription": "Socket actual clocks when any core is active on that socket", - "MetricExpr": "UNC_CLOCK.SOCKET", - "MetricGroup": "SoC", - "MetricName": "Socket_CLKS" + "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - (UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if tma_info_ipc > 1.8 else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB - RESOURCE_STALLS.SB - CYCLE_ACTIVITY.STALLS_MEM_ANY) / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_ports_utilization", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", - "MetricGroup": "Branches;OS", - "MetricName": "IpFarBranch" + "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\,cmask\\=1@ / 2 if #SMT_on else (CYCLE_ACTIVITY.STALLS_TOTAL - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0)) / tma_info_core_clks)", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_0", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Uncore frequency per die [GHZ]", - "MetricExpr": "Socket_CLKS / #num_dies / duration_time / 1e9", - "MetricGroup": "SoC", - "MetricName": "UNCORE_FREQ" + "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) / tma_info_core_clks)", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_1", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Related metrics: tma_l1_bound", + "ScaleUnit": "100%" }, { - "BriefDescription": "C3 residency percent per core", - "MetricExpr": "cstate_core@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Core_Residency", + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_clks)", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per core", - "MetricExpr": "cstate_core@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Core_Residency", + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise).", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_3m", + "MetricThreshold": "tma_ports_utilized_3m > 0.7 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { - "BriefDescription": "C7 residency percent per core", - "MetricExpr": "cstate_core@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Core_Residency", + "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1", + "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.RETIRE_SLOTS", "ScaleUnit": "100%" }, { - "BriefDescription": "C2 residency percent per package", - "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C2_Pkg_Residency", + "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_info_load_miss_real_latency * LD_BLOCKS.NO_SR / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_split_loads", + "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "C3 residency percent per package", - "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Pkg_Residency", + "BriefDescription": "This metric represents rate of split store accesses", + "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / tma_info_core_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group", + "MetricName": "tma_split_stores", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_UOPS_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_4", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per package", - "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Pkg_Residency", + "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", + "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on else OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group", + "MetricName": "tma_sq_full", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_mem_bandwidth", "ScaleUnit": "100%" }, { - "BriefDescription": "C7 residency percent per package", - "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Pkg_Residency", + "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", + "MetricExpr": "RESOURCE_STALLS.SB / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_UOPS_RETIRED.ALL_STORES_PS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", + "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_store_fwd_blk", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_clks", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group", + "MetricName": "tma_store_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_4 / tma_info_core_clks", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_store_op_utilization", + "MetricThreshold": "tma_store_op_utilization > 0.6", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears", + "MetricExpr": "tma_branch_resteers - tma_mispredicts_resteers - tma_clears_resteers", + "MetricGroup": "BigFoot;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group", + "MetricName": "tma_unknown_branches", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (e.g. first time the branch is fetched or hitting BTB capacity limit). Sample with: BACLEARS.ANY", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric serves as an approximation of legacy x87 usage", + "MetricExpr": "INST_RETIRED.X87 * tma_info_uoppi / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", + "MetricName": "tma_x87_use", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", "ScaleUnit": "100%" } ] From patchwork Sun Feb 19 09:28:10 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59103 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp772408wrn; Sun, 19 Feb 2023 01:32:56 -0800 (PST) X-Google-Smtp-Source: AK7set/iB5RasSgi+GngaAjQShqdwGOzSZKee1RUj0MsTRhXKhMcCIye00F9gfBQXKWwYbdpZ6fm X-Received: by 2002:a05:6402:b04:b0:4ad:7abe:f41a with SMTP id bm4-20020a0564020b0400b004ad7abef41amr2484140edb.25.1676799176200; Sun, 19 Feb 2023 01:32:56 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799176; cv=none; d=google.com; s=arc-20160816; b=KajZIgxQTTiQK8MzuKfHO81X+LhYRgSxRtATfHHFnRpfuWk28uUzK4eJ1xG8FQjMIx TygWah635rHIQ3k6urFRKP+0QJe8fSkkJkXzrksQ0lgcNHA8Rs2QuqK1kxZ/ATMDG5Un GLgF8cWuo9JjylRxdzZ9q9rPeykqjKItfipz2G3vfKw+f4ZD2EOwuR0f0eqm+vbfGV3W s/WXoq8MFIsuUgrVqKu5Yr4G6x9tS3IoV7/QRreSPgfqaqrLX869cuJ9zD1L/3WZ8Rmd e5qx6JMYVgiNTw4DURJ5M3kIy6XjIsoe8lgwhWNAWm7FkhKuirPKxGuoNdPv43/AQjLu qGGg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:from:subject :references:mime-version:message-id:in-reply-to:date:dkim-signature; bh=2W5bF1+eYfCMToRpqKWibN9HPaokpnt6JrTQGYqwxL0=; b=fU+FVvpqDsPn720mD15WcQd4LStP6357fvy448qrzEXUYwzwFjmdzNH9bT454PuXiC ATx9ZaatQ3yiuSSWwx2IWquKV4mIqB2VGXr9OZyzUOMrpUeiA2OeMfiIlp831Xvuw5Hs LhQwNmCYTWzpN1fnlfIcsG2jgv16KZ0l3ygy67ipfb2sy7yDy0Z31lL69RSYBf03h0WW cC5FtR1MEiNkviZfTbyS+pe5Gbs9otKtlUJqMr6MNuP9cvcDuUEYA43PMWg6Qyxms2VX FeYHyrw2TVLn2hc9yd98TuyirfGdlWIdWs3EF0nO2bqSgFbT388qy5h/RoR3Tk/z41TD UKYw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=JpfDCIA4; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id z25-20020aa7c659000000b004ace655ef51si11474950edr.192.2023.02.19.01.32.32; Sun, 19 Feb 2023 01:32:56 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=JpfDCIA4; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229892AbjBSJbk (ORCPT + 99 others); Sun, 19 Feb 2023 04:31:40 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53746 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229917AbjBSJb2 (ORCPT ); Sun, 19 Feb 2023 04:31:28 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 04B75125A7 for ; Sun, 19 Feb 2023 01:30:54 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id m13-20020a25260d000000b0087f69905709so348565ybm.10 for ; Sun, 19 Feb 2023 01:30:53 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:from:to:cc:subject:date :message-id:reply-to; bh=2W5bF1+eYfCMToRpqKWibN9HPaokpnt6JrTQGYqwxL0=; b=JpfDCIA4ppKmESuQFzB6W/7DruT1DHMOVh/VWjtI7KRZRQoLnR3/sZ1Bohsk/eLYEA U+7YLD9yVs2xukqi/gPf38x8plfb8cTGsBNXk9LZt5iUJuhxbym6qfK+Pl63CTMGx3eF nZGlnyTSv4egaClLA6bd3B83xyf/BcBfMvOYi70RDcqdUh5I6hFE0DvpVpAw5lNggp5q 4vxkS0J/haba0eEx57Pga6Zvx6M0IxBnVNN5ezr2aQsgjTyxqKepri2EijnusFItiBbM /Rku0HnNF5CNwSPBIe72D0kAKoYVsLF5cDjvznGJkrpdNBsCsryuRKK6oKdB7MvSwlcp kUDA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=2W5bF1+eYfCMToRpqKWibN9HPaokpnt6JrTQGYqwxL0=; b=WCMIfKHvbjj48yWSxwG5ocTZDCRSnMGXgudV30BM9WEvDHkg9coUCnhW+7SVasBI3t rIY1BHo4t6R+vcLe2DkeadwFPFrBE+bOSu0o4zH9r/zo63qbWAZvtlOj3z2AyExjDGhP H30eG0VBFqPSXKZHM/dOed8N/lonvKwPE7voGNVlJ3Vwbm5LFNYldyoC9IxMMrFoDxKE YC8yGYpQVlloPbXx/Il7s2KhWqwCmrPf6Xi9sxB4lFA+dTNYJ7aLLLy8hl760oBXY7ke G11ZfS6p+GWjeFtUcIoI6Jf2QIy2vHaM0mD3NUtsJi84swPNNsc6oog99F7ahdYsdh1o OuDg== X-Gm-Message-State: AO0yUKW8yl6OSWiWOlGro/xLnyLxvQKlFegH7eLCIqWOa4WLjscMLBHd 4XmoCgk1Cv7J9YBqw6sxA5cRfqsjLVgT X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a05:6902:10c6:b0:97a:ebd:a594 with SMTP id w6-20020a05690210c600b0097a0ebda594mr389997ybu.3.1676799051317; Sun, 19 Feb 2023 01:30:51 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:10 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-14-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 13/51] perf vendor events intel: Refresh broadwellde metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758251372564848621?= X-GMAIL-MSGID: =?utf-8?q?1758251372564848621?= Update the broadwellde metrics to TMA version 4.5. Generation was done using https://github.com/intel/perfmon. Notable changes are TMA info metrics are renamed from their node name to be lower case and prefixed by tma_info_, MetricThreshold expressions are added, and the smi_cost metric group is added replicating existing hard coded metrics in stat-shadow. Signed-off-by: Ian Rogers --- .../arch/x86/broadwellde/bdwde-metrics.json | 1405 +++++++++-------- 1 file changed, 785 insertions(+), 620 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/broadwellde/bdwde-metrics.json b/tools/perf/pmu-events/arch/x86/broadwellde/bdwde-metrics.json index d35d30932b68..fb57c7382408 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellde/bdwde-metrics.json +++ b/tools/perf/pmu-events/arch/x86/broadwellde/bdwde-metrics.json @@ -1,937 +1,1102 @@ [ { - "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", - "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / SLOTS", - "MetricGroup": "PGO;TopdownL1;tma_L1_group", - "MetricName": "tma_frontend_bound", - "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Machine_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", - "MetricExpr": "4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / SLOTS", - "MetricGroup": "Frontend;TopdownL2;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_latency", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", + "BriefDescription": "C2 residency percent per package", + "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C2_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses", - "MetricExpr": "ICACHE.IFDATA_STALL / CLKS", - "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_icache_misses", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", + "BriefDescription": "C3 residency percent per core", + "MetricExpr": "cstate_core@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses", - "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + cpu@ITLB_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * ITLB_MISSES.WALK_COMPLETED) / CLKS", - "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_itlb_misses", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", + "BriefDescription": "C3 residency percent per package", + "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", - "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY) / CLKS", - "MetricGroup": "FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_branch_resteers", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "BriefDescription": "C6 residency percent per core", + "MetricExpr": "cstate_core@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage", - "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES * tma_branch_resteers / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL4;tma_branch_resteers_group", - "MetricName": "tma_mispredicts_resteers", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES", + "BriefDescription": "C6 residency percent per package", + "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears", - "MetricExpr": "MACHINE_CLEARS.COUNT * tma_branch_resteers / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)", - "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_branch_resteers_group", - "MetricName": "tma_clears_resteers", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES", + "BriefDescription": "C7 residency percent per core", + "MetricExpr": "cstate_core@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears", - "MetricExpr": "tma_branch_resteers - tma_mispredicts_resteers - tma_clears_resteers", - "MetricGroup": "BigFoot;FetchLat;TopdownL4;tma_branch_resteers_group", - "MetricName": "tma_unknown_branches", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (First fetch or hitting BPU capacity limit). Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH", + "BriefDescription": "C7 residency percent per package", + "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / CLKS", - "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_dsb_switches", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS", + "BriefDescription": "Percentage of cycles spent in System Management Interrupts.", + "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)", + "MetricGroup": "smi", + "MetricName": "smi_cycles", + "MetricThreshold": "smi_cycles > 0.1", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", - "MetricExpr": "ILD_STALL.LCP / CLKS", - "MetricGroup": "FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_lcp", - "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs.", - "ScaleUnit": "100%" + "BriefDescription": "Number of SMI interrupts.", + "MetricExpr": "msr@smi@", + "MetricGroup": "smi", + "MetricName": "smi_num", + "ScaleUnit": "1SMI#" }, { - "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", - "MetricExpr": "2 * IDQ.MS_SWITCHES / CLKS", - "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_ms_switches", - "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES", + "BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset", + "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_4k_aliasing", + "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", - "MetricExpr": "tma_frontend_bound - tma_fetch_latency", - "MetricGroup": "FetchBW;Frontend;TopdownL2;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_bandwidth", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_info_slots", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_alu_op_utilization", + "MetricThreshold": "tma_alu_op_utilization > 0.6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", - "MetricExpr": "(IDQ.ALL_MITE_CYCLES_ANY_UOPS - IDQ.ALL_MITE_CYCLES_4_UOPS) / CORE_CLKS / 2", - "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_fetch_bandwidth_group", - "MetricName": "tma_mite", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sample with: FRONTEND_RETIRED.ANY_DSB_MISS", + "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", + "MetricExpr": "100 * OTHER_ASSISTS.ANY_WB_ASSIST / tma_info_slots", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_assists", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: ASSISTS.ANY", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", - "MetricExpr": "(IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4_UOPS) / CORE_CLKS / 2", - "MetricGroup": "DSB;FetchBW;TopdownL3;tma_fetch_bandwidth_group", - "MetricName": "tma_dsb", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", + "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "1 - (tma_frontend_bound + tma_bad_speculation + tma_retiring)", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_backend_bound", + "MetricThreshold": "tma_backend_bound > 0.2", + "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", - "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group", + "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_bad_speculation", + "MetricThreshold": "tma_bad_speculation > 0.15", "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", + "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL2;tma_L2_group;tma_bad_speculation_group", + "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM", "MetricName": "tma_branch_mispredicts", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS", + "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_info_branch_misprediction_cost, tma_mispredicts_resteers", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", - "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts", - "MetricGroup": "BadSpec;MachineClears;TopdownL2;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_machine_clears", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", + "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY) / tma_info_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_branch_resteers", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", "ScaleUnit": "100%" }, { - "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", - "MetricExpr": "1 - (tma_frontend_bound + tma_bad_speculation + tma_retiring)", - "MetricGroup": "TopdownL1;tma_L1_group", - "MetricName": "tma_backend_bound", - "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS", + "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_cisc", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", - "MetricExpr": "(CYCLE_ACTIVITY.STALLS_MEM_ANY + RESOURCE_STALLS.SB) / (CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if IPC > 1.8 else (UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC - RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else RESOURCE_STALLS.SB)) * tma_backend_bound", - "MetricGroup": "Backend;TopdownL2;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_memory_bound", - "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears", + "MetricExpr": "MACHINE_CLEARS.COUNT * tma_branch_resteers / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)", + "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC", + "MetricName": "tma_clears_resteers", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", - "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_MISS) / CLKS, 0)", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l1_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB_HIT_PS", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(60 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS))) + 43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS)))) / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_contested_accesses", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + cpu@DTLB_LOAD_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * DTLB_LOAD_MISSES.WALK_COMPLETED) / CLKS", - "MetricGroup": "MemoryTLB;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_dtlb_load", - "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS", + "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_backend_bound - tma_memory_bound", + "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", - "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_store_fwd_blk", - "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS))) / tma_info_clks", + "MetricGroup": "Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_data_sharing", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", - "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO) / CLKS", - "MetricGroup": "Offcore;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_lock_latency", - "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOADS_PS", + "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", + "MetricExpr": "ARITH.FPU_DIV_ACTIVE / tma_info_core_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_divider", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary", - "MetricExpr": "Load_Miss_Real_Latency * LD_BLOCKS.NO_SR / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_split_loads", - "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", + "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", + "MetricConstraint": "NO_GROUP_EVENTS_SMT", + "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS)) * CYCLE_ACTIVITY.STALLS_L2_MISS / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset", - "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_4k_aliasing", - "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", + "MetricExpr": "(IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4_UOPS) / tma_info_core_clks / 2", + "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_dsb", + "MetricThreshold": "tma_dsb > 0.15 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", - "MetricExpr": "Load_Miss_Real_Latency * cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=1@ / CLKS", - "MetricGroup": "MemoryBW;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_fb_full", - "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory).", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_clks", + "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_dsb_switches", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_iptb, tma_lcp", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads", - "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / CLKS", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l2_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT_PS", + "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", + "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + cpu@DTLB_LOAD_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * DTLB_LOAD_MISSES.WALK_COMPLETED) / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group", + "MetricName": "tma_dtlb_load", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", - "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS) * CYCLE_ACTIVITY.STALLS_L2_MISS / CLKS", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l3_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", + "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + cpu@DTLB_STORE_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * DTLB_STORE_MISSES.WALK_COMPLETED) / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group", + "MetricName": "tma_dtlb_store", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", - "MetricExpr": "(60 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS))) + 43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS)))) / CLKS", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_contested_accesses", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS", + "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_info_load_miss_real_latency * cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=1@ / tma_info_clks", + "MetricGroup": "MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricName": "tma_fb_full", + "MetricThreshold": "tma_fb_full > 0.3", + "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", - "MetricExpr": "43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS))) / CLKS", - "MetricGroup": "Offcore;Snoop;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_data_sharing", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD", + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", + "MetricExpr": "tma_frontend_bound - tma_fetch_latency", + "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_dsb_coverage, tma_info_iptb, tma_lcp", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", - "MetricExpr": "29 * (MEM_LOAD_UOPS_RETIRED.L3_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS))) / CLKS", - "MetricGroup": "MemoryLat;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_l3_hit_latency", - "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", + "MetricExpr": "4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / tma_info_slots", + "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", - "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on else OFFCORE_REQUESTS_BUFFER.SQ_FULL) / CORE_CLKS", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_sq_full", - "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). The Super Queue is used for requests to access the L2 cache or to go out to the Uncore.", + "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", + "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_fp_arith", + "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", - "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS)) * CYCLE_ACTIVITY.STALLS_L2_MISS / CLKS", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_dram_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS_PS", + "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired", + "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", + "MetricName": "tma_fp_scalar", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=4@) / CLKS", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_dram_bound_group", - "MetricName": "tma_mem_bandwidth", - "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that).", + "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths", + "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0x3c@ / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", + "MetricName": "tma_fp_vector", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / CLKS - tma_mem_bandwidth", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_dram_bound_group", - "MetricName": "tma_mem_latency", - "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that).", + "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", + "MetricName": "tma_fp_vector_128b", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", - "MetricExpr": "RESOURCE_STALLS.SB / CLKS", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_store_bound", - "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES_PS", + "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", + "MetricName": "tma_fp_vector_256b", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", - "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / CLKS", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_store_bound_group", - "MetricName": "tma_store_latency", - "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full)", + "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", + "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / tma_info_slots", + "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents rate of split store accesses", - "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / CORE_CLKS", - "MetricGroup": "TopdownL4;tma_store_bound_group", - "MetricName": "tma_split_stores", - "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences", + "MetricExpr": "tma_microcode_sequencer", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_heavy_operations", + "MetricThreshold": "tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", - "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + cpu@DTLB_STORE_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * DTLB_STORE_MISSES.WALK_COMPLETED) / CLKS", - "MetricGroup": "MemoryTLB;TopdownL4;tma_store_bound_group", - "MetricName": "tma_dtlb_store", - "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses", + "MetricExpr": "ICACHE.IFDATA_STALL / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_icache_misses", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", - "MetricExpr": "tma_backend_bound - tma_memory_bound", - "MetricGroup": "Backend;Compute;TopdownL2;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_core_bound", - "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", - "ScaleUnit": "100%" + "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", + "MetricExpr": "tma_info_turbo_utilization * TSC / 1e9 / duration_time", + "MetricGroup": "Power;Summary", + "MetricName": "tma_info_average_frequency" }, { - "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", - "MetricExpr": "ARITH.FPU_DIV_ACTIVE / CORE_CLKS", - "MetricGroup": "TopdownL3;tma_core_bound_group", - "MetricName": "tma_divider", - "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_ACTIVE", - "ScaleUnit": "100%" + "BriefDescription": "Branch instructions per taken branch.", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_bptkbranch" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", - "MetricExpr": "((CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if IPC > 1.8 else (UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC - RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else RESOURCE_STALLS.SB)) - RESOURCE_STALLS.SB - CYCLE_ACTIVITY.STALLS_MEM_ANY) / CLKS", - "MetricGroup": "PortsUtil;TopdownL3;tma_core_bound_group", - "MetricName": "tma_ports_utilization", - "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", - "ScaleUnit": "100%" + "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)", + "MetricExpr": "(tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) * tma_info_slots / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_branch_misprediction_cost", + "PublicDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_branch_mispredicts, tma_mispredicts_resteers" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\,cmask\\=1@ / 2 if #SMT_on else (CYCLE_ACTIVITY.STALLS_TOTAL - RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) / CORE_CLKS)", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_0", - "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", - "ScaleUnit": "100%" + "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "Pipeline", + "MetricName": "tma_info_clks" }, { - "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) / CORE_CLKS)", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_1", - "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL", - "ScaleUnit": "100%" + "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_clks))", + "MetricGroup": "SMT", + "MetricName": "tma_info_core_clks" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / CORE_CLKS)", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_2", - "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Sample with: EXE_ACTIVITY.2_PORTS_UTIL", - "ScaleUnit": "100%" + "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_core_clks", + "MetricGroup": "Ret;SMT;TmaL1;tma_L1_group", + "MetricName": "tma_info_coreipc" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / CORE_CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_3m", - "PublicDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Sample with: UOPS_EXECUTED.CYCLES_GE_3", - "ScaleUnit": "100%" + "BriefDescription": "Cycles Per Instruction (per Logical Processor)", + "MetricExpr": "1 / tma_info_ipc", + "MetricGroup": "Mem;Pipeline", + "MetricName": "tma_info_cpi" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", - "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / SLOTS", - "MetricGroup": "TopdownL5;tma_ports_utilized_3m_group", - "MetricName": "tma_alu_op_utilization", - "ScaleUnit": "100%" + "BriefDescription": "Average CPU Utilization", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": "HPC;Summary", + "MetricName": "tma_info_cpu_utilization" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch) Sample with: UOPS_DISPATCHED.PORT_0", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_0 / CORE_CLKS", - "MetricGroup": "Compute;TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_0", - "ScaleUnit": "100%" + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD / OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_data_l2_mlp" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU) Sample with: UOPS_DISPATCHED.PORT_1", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_1 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_1", - "ScaleUnit": "100%" + "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", + "MetricExpr": "64 * (arb@event\\=0x81\\,umask\\=0x1@ + arb@event\\=0x84\\,umask\\=0x1@) / 1e6 / duration_time / 1e3", + "MetricGroup": "HPC;Mem;MemoryBW;SoC;tma_issueBW", + "MetricName": "tma_info_dram_bw_use", + "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_fb_full, tma_mem_bandwidth, tma_sq_full" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_5 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_5", - "ScaleUnit": "100%" - }, + "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_dsb_coverage", + "MetricThreshold": "tma_info_dsb_coverage < 0.7 & tma_info_ipc / 4 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_iptb, tma_lcp" + }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU) Sample with: UOPS_DISPATCHED.PORT_6", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_6 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_6", - "ScaleUnit": "100%" + "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-thread", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@", + "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", + "MetricName": "tma_info_execute" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations Sample with: UOPS_DISPATCHED.PORT_2_3_10", - "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_2 + UOPS_DISPATCHED_PORT.PORT_3 + UOPS_DISPATCHED_PORT.PORT_7 - UOPS_DISPATCHED_PORT.PORT_4) / (2 * CORE_CLKS)", - "MetricGroup": "TopdownL5;tma_ports_utilized_3m_group", - "MetricName": "tma_load_op_utilization", - "ScaleUnit": "100%" + "BriefDescription": "The ratio of Executed- by Issued-Uops", + "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", + "MetricGroup": "Cor;Pipeline", + "MetricName": "tma_info_execute_per_issue", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage." }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [ICL+] Loads)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_2 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_load_op_utilization_group", - "MetricName": "tma_port_2", - "ScaleUnit": "100%" + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / tma_info_core_clks", + "MetricGroup": "Flops;Ret", + "MetricName": "tma_info_flopc" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [ICL+] Loads)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_3 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_load_op_utilization_group", - "MetricName": "tma_port_3", - "ScaleUnit": "100%" + "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)", + "MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0x3c@) / (2 * tma_info_core_clks)", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_fp_arith_utilization", + "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)." }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations Sample with: UOPS_DISPATCHED.PORT_7_8", - "MetricExpr": "tma_port_4", - "MetricGroup": "TopdownL5;tma_ports_utilized_3m_group", - "MetricName": "tma_store_op_utilization", - "ScaleUnit": "100%" + "BriefDescription": "Giga Floating Point Operations Per Second", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / duration_time", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_gflops", + "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine." }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_4 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_store_op_utilization_group", - "MetricName": "tma_port_4", - "ScaleUnit": "100%" + "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", + "MetricExpr": "UOPS_EXECUTED.THREAD / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)", + "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", + "MetricName": "tma_info_ilp" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 7 ([HSW+]simple Store-address)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_7 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_store_op_utilization_group", - "MetricName": "tma_port_7", - "ScaleUnit": "100%" + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Summary;TmaL1;tma_L1_group", + "MetricName": "tma_info_instructions", + "PublicDescription": "Total number of retired Instructions. Sample with: INST_RETIRED.PREC_DIST" }, { - "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group", - "MetricName": "tma_retiring", - "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0x3c@)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_iparith", + "MetricThreshold": "tma_info_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). May undercount due to FMA double counting. Approximated prior to BDW." }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", - "MetricExpr": "tma_retiring - tma_heavy_operations", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_retiring_group", - "MetricName": "tma_light_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx128", + "MetricThreshold": "tma_info_iparith_avx128 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", - "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", - "MetricGroup": "HPC;TopdownL3;tma_light_operations_group", - "MetricName": "tma_fp_arith", - "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx256", + "MetricThreshold": "tma_info_iparith_avx256 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "This metric serves as an approximation of legacy x87 usage", - "MetricExpr": "INST_RETIRED.X87 * UPI / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Compute;TopdownL4;tma_fp_arith_group", - "MetricName": "tma_x87_use", - "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_iparith_scalar_dp", + "MetricThreshold": "tma_info_iparith_scalar_dp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE) / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Compute;Flops;TopdownL4;tma_fp_arith_group", - "MetricName": "tma_fp_scalar", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_iparith_scalar_sp", + "MetricThreshold": "tma_info_iparith_scalar_sp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths", - "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Compute;Flops;TopdownL4;tma_fp_arith_group", - "MetricName": "tma_fp_vector", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Branches;Fed;InsType", + "MetricName": "tma_info_ipbranch", + "MetricThreshold": "tma_info_ipbranch < 8" }, { - "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors", - "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Compute;Flops;TopdownL5;tma_fp_vector_group", - "MetricName": "tma_fp_vector_128b", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions Per Cycle (per Logical Processor)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_clks", + "MetricGroup": "Ret;Summary", + "MetricName": "tma_info_ipc" }, { - "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors", - "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Compute;Flops;TopdownL5;tma_fp_vector_group", - "MetricName": "tma_fp_vector_256b", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_ipcall", + "MetricThreshold": "tma_info_ipcall < 200" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences", - "MetricExpr": "tma_microcode_sequencer", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_retiring_group", - "MetricName": "tma_heavy_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences. This highly-correlates with the uop length of these instructions/sequences.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", + "MetricGroup": "Branches;OS", + "MetricName": "tma_info_ipfarbranch", + "MetricThreshold": "tma_info_ipfarbranch < 1e6" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / SLOTS", - "MetricGroup": "MicroSeq;TopdownL3;tma_heavy_operations_group", - "MetricName": "tma_microcode_sequencer", - "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: UOPS_RETIRED.MS", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_ipflop", + "MetricThreshold": "tma_info_ipflop < 10" }, { - "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", - "MetricExpr": "100 * OTHER_ASSISTS.ANY_WB_ASSIST / SLOTS", - "MetricGroup": "TopdownL4;tma_microcode_sequencer_group", - "MetricName": "tma_assists", - "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: ASSISTS.ANY", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_LOADS", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipload", + "MetricThreshold": "tma_info_ipload < 3" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", - "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", - "MetricGroup": "TopdownL4;tma_microcode_sequencer_group", - "MetricName": "tma_cisc", - "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).", + "MetricExpr": "tma_info_instructions / (UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * cpu@BR_MISP_EXEC.ALL_BRANCHES\\,umask\\=0xE4@)", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_indirect", + "MetricThreshold": "tma_info_ipmisp_indirect < 1e3" }, { - "BriefDescription": "Instructions Per Cycle (per Logical Processor)", - "MetricExpr": "INST_RETIRED.ANY / CLKS", - "MetricGroup": "Ret;Summary", - "MetricName": "IPC" + "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BadSpec;BrMispredicts", + "MetricName": "tma_info_ipmispredict", + "MetricThreshold": "tma_info_ipmispredict < 200" }, { - "BriefDescription": "Uops Per Instruction", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY", - "MetricGroup": "Pipeline;Ret;Retire", - "MetricName": "UPI" + "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_STORES", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipstore", + "MetricThreshold": "tma_info_ipstore < 8" }, { "BriefDescription": "Instruction per taken branch", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW", - "MetricName": "UpTB" + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", + "MetricName": "tma_info_iptb", + "MetricThreshold": "tma_info_iptb < 9", + "PublicDescription": "Instruction per taken branch. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_lcp" }, { - "BriefDescription": "Cycles Per Instruction (per Logical Processor)", - "MetricExpr": "1 / IPC", - "MetricGroup": "Mem;Pipeline", - "MetricName": "CPI" + "BriefDescription": "Instructions per speculative Unknown Branch Misprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_instructions / BACLEARS.ANY", + "MetricGroup": "Fed", + "MetricName": "tma_info_ipunknown_branch" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "Pipeline", - "MetricName": "CLKS" + "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_cpi" }, { - "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "4 * CORE_CLKS", - "MetricGroup": "tma_L1_group", - "MetricName": "SLOTS" + "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_utilization", + "MetricThreshold": "tma_info_kernel_utilization > 0.05" }, { - "BriefDescription": "The ratio of Executed- by Issued-Uops", - "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", - "MetricGroup": "Cor;Pipeline", - "MetricName": "Execute_per_Issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage." + "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw" }, { - "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", - "MetricExpr": "INST_RETIRED.ANY / CORE_CLKS", - "MetricGroup": "Ret;SMT;tma_L1_group", - "MetricName": "CoreIPC" + "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "tma_info_l1d_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw_1t" }, { - "BriefDescription": "Floating Point Operations Per Cycle", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / CORE_CLKS", - "MetricGroup": "Flops;Ret", - "MetricName": "FLOPc" + "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l1mpki" }, { - "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)) / (2 * CORE_CLKS)", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "FP_Arith_Utilization", - "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)." + "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)", - "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", - "MetricName": "ILP" + "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "tma_info_l2_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw_1t" }, { - "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else CLKS))", - "MetricGroup": "SMT", - "MetricName": "CORE_CLKS" + "BriefDescription": "L2 cache hits per kilo instruction for all request types (including speculative)", + "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2hpki_all" }, { - "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_LOADS", - "MetricGroup": "InsType", - "MetricName": "IpLoad" + "BriefDescription": "L2 cache hits per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2hpki_load" }, { - "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_STORES", - "MetricGroup": "InsType", - "MetricName": "IpStore" + "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheMisses;Mem", + "MetricName": "tma_info_l2mpki" }, { - "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Branches;Fed;InsType", - "MetricName": "IpBranch" + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all request types (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem;Offcore", + "MetricName": "tma_info_l2mpki_all" }, { - "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "IpCall" + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2mpki_load" }, { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO", - "MetricName": "IpTB" + "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "0", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_l3_cache_access_bw_1t" }, { - "BriefDescription": "Branch instructions per taken branch. ", - "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "BpTkBranch" + "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw" }, { - "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", - "MetricGroup": "Flops;InsType", - "MetricName": "IpFLOP" + "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "tma_info_l3_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw_1t" }, { - "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE))", - "MetricGroup": "Flops;InsType", - "MetricName": "IpArith", - "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). May undercount due to FMA double counting. Approximated prior to BDW." + "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L3_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l3mpki" }, { - "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE", - "MetricGroup": "Flops;FpScalar;InsType", - "MetricName": "IpArith_Scalar_SP", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "Average Latency for L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_load_l2_miss_latency" }, { - "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE", - "MetricGroup": "Flops;FpScalar;InsType", - "MetricName": "IpArith_Scalar_DP", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_load_l2_mlp" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX128", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_UOPS_RETIRED.L1_MISS + MEM_LOAD_UOPS_RETIRED.HIT_LFB)", + "MetricGroup": "Mem;MemoryBound;MemoryLat", + "MetricName": "tma_info_load_miss_real_latency" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX256", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES", + "MetricGroup": "Mem;MemoryBW;MemoryBound", + "MetricName": "tma_info_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)" }, { - "BriefDescription": "Total number of retired Instructions Sample with: INST_RETIRED.PREC_DIST", - "MetricExpr": "INST_RETIRED.ANY", - "MetricGroup": "Summary;tma_L1_group", - "MetricName": "Instructions" + "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", + "MetricExpr": "(cpu@ITLB_MISSES.WALK_DURATION\\,cmask\\=1@ + cpu@DTLB_LOAD_MISSES.WALK_DURATION\\,cmask\\=1@ + cpu@DTLB_STORE_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * (DTLB_STORE_MISSES.WALK_COMPLETED + DTLB_LOAD_MISSES.WALK_COMPLETED + ITLB_MISSES.WALK_COMPLETED)) / tma_info_core_clks", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_page_walks_utilization", + "MetricThreshold": "tma_info_page_walks_utilization > 0.5" }, { "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=1@", "MetricGroup": "Pipeline;Ret", - "MetricName": "Retire" + "MetricName": "tma_info_retire" }, { - "BriefDescription": "", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@", - "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", - "MetricName": "Execute" + "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", + "MetricExpr": "4 * tma_info_core_clks", + "MetricGroup": "TmaL1;tma_L1_group", + "MetricName": "tma_info_slots" }, { - "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", - "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", - "MetricGroup": "DSB;Fed;FetchBW", - "MetricName": "DSB_Coverage" + "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", + "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", + "MetricGroup": "SMT", + "MetricName": "tma_info_smt_2t_utilization" }, { - "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BadSpec;BrMispredicts", - "MetricName": "IpMispredict" + "BriefDescription": "Average Frequency Utilization relative nominal frequency", + "MetricExpr": "tma_info_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_turbo_utilization" }, { - "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)", - "MetricExpr": "(tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) * SLOTS / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BrMispredicts", - "MetricName": "Branch_Misprediction_Cost" + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY", + "MetricGroup": "Pipeline;Ret;Retire", + "MetricName": "tma_info_uoppi", + "MetricThreshold": "tma_info_uoppi > 1.05" }, { - "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", - "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_UOPS_RETIRED.L1_MISS + MEM_LOAD_UOPS_RETIRED.HIT_LFB)", - "MetricGroup": "Mem;MemoryBound;MemoryLat", - "MetricName": "Load_Miss_Real_Latency" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW", + "MetricName": "tma_info_uptb", + "MetricThreshold": "tma_info_uptb < 6" }, { - "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)", - "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES", - "MetricGroup": "Mem;MemoryBW;MemoryBound", - "MetricName": "MLP" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses", + "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + cpu@ITLB_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * ITLB_MISSES.WALK_COMPLETED) / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_itlb_misses", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L1_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L1MPKI" + "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", + "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_MISS) / tma_info_clks, 0)", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group", + "MetricName": "tma_l1_bound", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB_HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L2_MISS / INST_RETIRED.ANY", - "MetricGroup": "Backend;CacheMisses;Mem", - "MetricName": "L2MPKI" + "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads", + "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_clks", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l2_bound", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all request types (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem;Offcore", - "MetricName": "L2MPKI_All" + "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", + "MetricConstraint": "NO_GROUP_EVENTS_SMT", + "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS) * CYCLE_ACTIVITY.STALLS_L2_MISS / tma_info_clks", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l3_bound", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2MPKI_Load" + "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "29 * (MEM_LOAD_UOPS_RETIRED.L3_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS))) / tma_info_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group", + "MetricName": "tma_l3_hit_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tma_mem_latency", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache hits per kilo instruction for all request types (including speculative)", - "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2HPKI_All" + "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", + "MetricExpr": "ILD_STALL.LCP / tma_info_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_lcp", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_iptb", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache hits per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2HPKI_Load" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", + "MetricExpr": "tma_retiring - tma_heavy_operations", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_light_operations", + "MetricThreshold": "tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", + "ScaleUnit": "100%" }, { - "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L3_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L3MPKI" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_2 + UOPS_DISPATCHED_PORT.PORT_3 + UOPS_DISPATCHED_PORT.PORT_7 - UOPS_DISPATCHED_PORT.PORT_4) / (2 * tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_load_op_utilization", + "MetricThreshold": "tma_load_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations. Sample with: UOPS_DISPATCHED.PORT_2_3_10", + "ScaleUnit": "100%" }, { - "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", - "MetricConstraint": "NO_NMI_WATCHDOG", - "MetricExpr": "(ITLB_MISSES.WALK_DURATION + DTLB_LOAD_MISSES.WALK_DURATION + DTLB_STORE_MISSES.WALK_DURATION + 7 * (DTLB_STORE_MISSES.WALK_COMPLETED + DTLB_LOAD_MISSES.WALK_COMPLETED + ITLB_MISSES.WALK_COMPLETED)) / (2 * CORE_CLKS)", - "MetricGroup": "Mem;MemoryTLB", - "MetricName": "Page_Walks_Utilization" + "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO) / tma_info_clks", + "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group", + "MetricName": "tma_lock_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOADS_PS. Related metrics: tma_store_latency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW" + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts", + "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW" + "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=4@) / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW", + "MetricName": "tma_mem_bandwidth", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_sq_full", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW" + "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_clks - tma_mem_bandwidth", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat", + "MetricName": "tma_mem_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_l3_hit_latency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "L1D_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW_1T" + "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(CYCLE_ACTIVITY.STALLS_MEM_ANY + RESOURCE_STALLS.SB) / (CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - (UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if tma_info_ipc > 1.8 else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) * tma_backend_bound", + "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "L2_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW_1T" + "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / tma_info_slots", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS", + "MetricName": "tma_microcode_sequencer", + "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "L3_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW_1T" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage", + "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES * tma_branch_resteers / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)", + "MetricGroup": "BadSpec;BrMispredicts;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueBM", + "MetricName": "tma_mispredicts_resteers", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_branch_mispredicts, tma_info_branch_misprediction_cost", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "0", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "L3_Cache_Access_BW_1T" + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", + "MetricExpr": "(IDQ.ALL_MITE_CYCLES_ANY_UOPS - IDQ.ALL_MITE_CYCLES_4_UOPS) / tma_info_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_mite", + "MetricThreshold": "tma_mite > 0.1 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sample with: FRONTEND_RETIRED.ANY_DSB_MISS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average CPU Utilization", - "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", - "MetricGroup": "HPC;Summary", - "MetricName": "CPU_Utilization" + "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", + "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_clks", + "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", + "ScaleUnit": "100%" }, { - "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", - "MetricExpr": "Turbo_Utilization * TSC / 1e9 / duration_time", - "MetricGroup": "Power;Summary", - "MetricName": "Average_Frequency" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_0 / tma_info_core_clks", + "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_0", + "MetricThreshold": "tma_port_0 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / duration_time", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "GFLOPs", - "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine." + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_1 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_1", + "MetricThreshold": "tma_port_1 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average Frequency Utilization relative nominal frequency", - "MetricExpr": "CLKS / CPU_CLK_UNHALTED.REF_TSC", - "MetricGroup": "Power", - "MetricName": "Turbo_Utilization" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [ICL+] Loads)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_2 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_group", + "MetricName": "tma_port_2", + "MetricThreshold": "tma_port_2 > 0.6", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", - "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", - "MetricGroup": "SMT", - "MetricName": "SMT_2T_Utilization" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [ICL+] Loads)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_3 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_group", + "MetricName": "tma_port_3", + "MetricThreshold": "tma_port_3 > 0.6", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "OS", - "MetricName": "Kernel_Utilization" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data)", + "MetricExpr": "tma_store_op_utilization", + "MetricGroup": "TopdownL6;tma_L6_group;tma_issueSpSt;tma_store_op_utilization_group", + "MetricName": "tma_port_4", + "MetricThreshold": "tma_port_4 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data). Related metrics: tma_split_stores", + "ScaleUnit": "100%" }, { - "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", - "MetricGroup": "OS", - "MetricName": "Kernel_CPI" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_5 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_5", + "MetricThreshold": "tma_port_5 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU). Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", - "MetricExpr": "64 * (arb@event\\=0x81\\,umask\\=0x1@ + arb@event\\=0x84\\,umask\\=0x1@) / 1e6 / duration_time / 1e3", - "MetricGroup": "HPC;Mem;MemoryBW;SoC", - "MetricName": "DRAM_BW_Use" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_6 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_6", + "MetricThreshold": "tma_port_6 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU). Sample with: UOPS_DISPATCHED.PORT_6. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", - "MetricGroup": "Branches;OS", - "MetricName": "IpFarBranch" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 7 ([HSW+]simple Store-address)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_7 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_store_op_utilization_group", + "MetricName": "tma_port_7", + "MetricThreshold": "tma_port_7 > 0.6", + "ScaleUnit": "100%" }, { - "BriefDescription": "C3 residency percent per core", - "MetricExpr": "cstate_core@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Core_Residency", + "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - (UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if tma_info_ipc > 1.8 else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB - RESOURCE_STALLS.SB - CYCLE_ACTIVITY.STALLS_MEM_ANY) / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_ports_utilization", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per core", - "MetricExpr": "cstate_core@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Core_Residency", + "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\,cmask\\=1@ / 2 if #SMT_on else (CYCLE_ACTIVITY.STALLS_TOTAL - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0)) / tma_info_core_clks)", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_0", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", "ScaleUnit": "100%" }, { - "BriefDescription": "C7 residency percent per core", - "MetricExpr": "cstate_core@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Core_Residency", + "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) / tma_info_core_clks)", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_1", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, { - "BriefDescription": "C2 residency percent per package", - "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C2_Pkg_Residency", + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_clks)", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Sample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, { - "BriefDescription": "C3 residency percent per package", - "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Pkg_Residency", + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_3m", + "MetricThreshold": "tma_ports_utilized_3m > 0.7 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Sample with: UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per package", - "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Pkg_Residency", + "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1", + "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS", "ScaleUnit": "100%" }, { - "BriefDescription": "C7 residency percent per package", - "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Pkg_Residency", + "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_info_load_miss_real_latency * LD_BLOCKS.NO_SR / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_split_loads", + "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents rate of split store accesses", + "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / tma_info_core_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group", + "MetricName": "tma_split_stores", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_4", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", + "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on else OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group", + "MetricName": "tma_sq_full", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_mem_bandwidth", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", + "MetricExpr": "RESOURCE_STALLS.SB / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES_PS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", + "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_store_fwd_blk", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_clks", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group", + "MetricName": "tma_store_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_4 / tma_info_core_clks", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_store_op_utilization", + "MetricThreshold": "tma_store_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations. Sample with: UOPS_DISPATCHED.PORT_7_8", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears", + "MetricExpr": "tma_branch_resteers - tma_mispredicts_resteers - tma_clears_resteers", + "MetricGroup": "BigFoot;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group", + "MetricName": "tma_unknown_branches", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (e.g. first time the branch is fetched or hitting BTB capacity limit). Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric serves as an approximation of legacy x87 usage", + "MetricExpr": "INST_RETIRED.X87 * tma_info_uoppi / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", + "MetricName": "tma_x87_use", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", "ScaleUnit": "100%" } ] From patchwork Sun Feb 19 09:28:11 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59104 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp772417wrn; Sun, 19 Feb 2023 01:32:57 -0800 (PST) X-Google-Smtp-Source: AK7set8RVdZLqz2N2uvisDz6bLoQvpnbr1tNDtL7KM26Vb15V0xGxgUwoJdfZGBASSjsMvcaya61 X-Received: by 2002:a17:902:c402:b0:19a:a5c4:afef with SMTP id k2-20020a170902c40200b0019aa5c4afefmr2543713plk.64.1676799177300; Sun, 19 Feb 2023 01:32:57 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799177; cv=none; d=google.com; s=arc-20160816; b=rlaZVOx5OgmmkTAEjdlOUlTPTbSCX7gah54+OX1At5MivkLscsYyiY/Kg+sUr9hF7e H7lNOXdW71X/aFAVqyrELVSGu35P2Z2ZkN0hkd4RJuGSw1RbDDopdvy4MqmXh+7+CcLn dG9cjUdQ+LW1HKfNR8ZE0qA9N6woFfWWFQdI/BM0Hnf/R6xPOBnNqk8jXlXbWEXnZVyb +rnqlPGFI4H6c4jYxX4pMtAl3+VwmhJgW9ZUwoObgQkW5KX7xDavBy0tXaDwN84dWN0x vUZDdRIqF+VKUL8cGqvO/+6ajxARnrIAXy/83+1gnMxeT6wAUINs6MWIPNVD5UkdL82N pIwQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:from:subject :references:mime-version:message-id:in-reply-to:date:dkim-signature; bh=2wvJNBFtzajYi6R5dh+10ijP9wjGKOWLD2Nx4Wagg34=; b=wIVWHd+tMbqjq4tuBYuftS/aKGXUhX/NvVkRo+N6iLyZ2a4/eNk/auLBAaNnK8lK/k 8aZvPfSzEqb5fKpvNSLzF0Z8sJVy13FP53VCazfIAYKkK0Cn0OPKnD8oPxNWtH0ZOSFE zhk2J+n44FJd8V8aV/LmoFLBFMGlVR3ThUddpmjgzvmt057W8v/hMEDWVS3xLlIlFblw m/I/U2RA9LGlnMQDS3Pd7q3eMhIzu89wp0viAQpXXCPoWaN0PJcH+NDfwtTSFZAFP/lS qUNgV3/HWNGm061dFfB2jfTrjClVM6C3BNDLDHi64nhkniBa0WYVdA4pOY+mEy+/2N7J fOOg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=MNL4A804; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id km3-20020a17090327c300b0019868a3d86asi5074202plb.38.2023.02.19.01.32.44; Sun, 19 Feb 2023 01:32:57 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=MNL4A804; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229903AbjBSJbx (ORCPT + 99 others); Sun, 19 Feb 2023 04:31:53 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53506 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229838AbjBSJbh (ORCPT ); Sun, 19 Feb 2023 04:31:37 -0500 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 07A58E045 for ; Sun, 19 Feb 2023 01:31:01 -0800 (PST) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-536885323c1so17956327b3.15 for ; Sun, 19 Feb 2023 01:31:00 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:from:to:cc:subject:date :message-id:reply-to; bh=2wvJNBFtzajYi6R5dh+10ijP9wjGKOWLD2Nx4Wagg34=; b=MNL4A804Y0heXhzUZa1nxN9jJb6sRSQYDdChK5TO5YcqocaZ/yqDRWcFwHakubBHpk EJXZocwa+6YL7sHN+770EZcqWqK8WMtRGtRIad7FkZAVC4cCv+jUyLVKOVX6epfRAsGw iN3mkmaNKhzbKWowpag06diTUdqz4KU6KSG5t/zzqqlLja4dHHfKd5E8d0f08FWUNffL WxJt5miz+eK0GYkKiZYmsGlsrzW1iQx7AA0VJaJ6fYe9BuWmPTmN2Dkv9kSXhxovV7jT 88jYo9vocbRsH3D51oI06ILSi0v423B2/wZOzGR5CxnLHkxziAQSTa5Ilh3+KdcOM3xj doCw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=2wvJNBFtzajYi6R5dh+10ijP9wjGKOWLD2Nx4Wagg34=; b=lQIfaLM2afajTYGTa+TgSFSHd4Q0cbRNihEIv9H2LMF5IqUawjOLsEtwnAXBDBG4iW dIj2B4L8HhOkEmZK4B+QXV2pNZEwhfsNwWa7t2hbKh1iT0CDi+HHZdlwrwsD1kh8oCBl iEb2L7+goyPW8t0PpGkH0loRw/+BC1auOtty1neNVpHuLK4720KNMFP/MNucvEj3DBGu yuDM6jDd70SpanUhwoRu8RADd66fHZ71nWvCGBv3hCty6KzwXTHyONwpDSsnklpDNoEL vrZj+zxarsQ4+d7SWsJNhmy8kc865ZlqBjx0Eafgv13CVg7DgFj6k1dYhv8Frf2b7I/0 nJZg== X-Gm-Message-State: AO0yUKVBuUr4QgBuI62yhvNnKqXKojSXUjHoJ5kEEMPeWbbZrqkgqBwR hHkNF07M6wKD6rglCwHVrgfaFxHCAyaB X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a05:6902:154b:b0:8e3:6aea:973 with SMTP id r11-20020a056902154b00b008e36aea0973mr399044ybu.4.1676799059917; Sun, 19 Feb 2023 01:30:59 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:11 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-15-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 14/51] perf vendor events intel: Refresh broadwellx metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758251374358753116?= X-GMAIL-MSGID: =?utf-8?q?1758251374358753116?= Update the broadwellx metrics to TMA version 4.5. Generation was done using https://github.com/intel/perfmon. Notable changes are TMA info metrics are renamed from their node name to be lower case and prefixed by tma_info_, MetricThreshold expressions are added, "Sample with" documentation is added to many TMA metrics, and the smi_cost metric group is added replicating existing hard coded metrics in stat-shadow. Signed-off-by: Ian Rogers --- .../arch/x86/broadwellx/bdx-metrics.json | 1626 ++++++++--------- .../arch/x86/broadwellx/uncore-cache.json | 74 +- .../x86/broadwellx/uncore-interconnect.json | 64 +- .../arch/x86/broadwellx/uncore-other.json | 4 +- 4 files changed, 873 insertions(+), 895 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/bdx-metrics.json b/tools/perf/pmu-events/arch/x86/broadwellx/bdx-metrics.json index f5c8f707c692..65ec0c9e55d1 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/bdx-metrics.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/bdx-metrics.json @@ -1,1189 +1,1167 @@ [ { - "BriefDescription": "Instructions Per Cycle (per Logical Processor)", - "MetricExpr": "INST_RETIRED.ANY / CLKS", - "MetricGroup": "Ret;Summary", - "MetricName": "IPC" - }, - { - "BriefDescription": "Uops Per Instruction", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY", - "MetricGroup": "Pipeline;Ret;Retire", - "MetricName": "UPI" - }, - { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW", - "MetricName": "UpTB" - }, - { - "BriefDescription": "Cycles Per Instruction (per Logical Processor)", - "MetricExpr": "1 / IPC", - "MetricGroup": "Mem;Pipeline", - "MetricName": "CPI" - }, - { - "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "Pipeline", - "MetricName": "CLKS" - }, - { - "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "4 * CORE_CLKS", - "MetricGroup": "tma_L1_group", - "MetricName": "SLOTS" - }, - { - "BriefDescription": "The ratio of Executed- by Issued-Uops", - "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", - "MetricGroup": "Cor;Pipeline", - "MetricName": "Execute_per_Issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage." - }, - { - "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", - "MetricExpr": "INST_RETIRED.ANY / CORE_CLKS", - "MetricGroup": "Ret;SMT;tma_L1_group", - "MetricName": "CoreIPC" - }, - { - "BriefDescription": "Floating Point Operations Per Cycle", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / CORE_CLKS", - "MetricGroup": "Flops;Ret", - "MetricName": "FLOPc" + "BriefDescription": "C2 residency percent per package", + "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C2_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)) / (2 * CORE_CLKS)", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "FP_Arith_Utilization", - "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)." + "BriefDescription": "C3 residency percent per core", + "MetricExpr": "cstate_core@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Core_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)", - "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", - "MetricName": "ILP" + "BriefDescription": "C3 residency percent per package", + "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else CLKS))", - "MetricGroup": "SMT", - "MetricName": "CORE_CLKS" + "BriefDescription": "C6 residency percent per core", + "MetricExpr": "cstate_core@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Core_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_LOADS", - "MetricGroup": "InsType", - "MetricName": "IpLoad" + "BriefDescription": "C6 residency percent per package", + "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_STORES", - "MetricGroup": "InsType", - "MetricName": "IpStore" + "BriefDescription": "C7 residency percent per core", + "MetricExpr": "cstate_core@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Core_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Branches;Fed;InsType", - "MetricName": "IpBranch" + "BriefDescription": "C7 residency percent per package", + "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "IpCall" + "BriefDescription": "Uncore frequency per die [GHZ]", + "MetricExpr": "tma_info_socket_clks / #num_dies / duration_time / 1e9", + "MetricGroup": "SoC", + "MetricName": "UNCORE_FREQ" }, { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO", - "MetricName": "IpTB" + "BriefDescription": "Percentage of cycles spent in System Management Interrupts.", + "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)", + "MetricGroup": "smi", + "MetricName": "smi_cycles", + "MetricThreshold": "smi_cycles > 0.1", + "ScaleUnit": "100%" }, { - "BriefDescription": "Branch instructions per taken branch. ", - "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "BpTkBranch" + "BriefDescription": "Number of SMI interrupts.", + "MetricExpr": "msr@smi@", + "MetricGroup": "smi", + "MetricName": "smi_num", + "ScaleUnit": "1SMI#" }, { - "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", - "MetricGroup": "Flops;InsType", - "MetricName": "IpFLOP" + "BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset", + "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_4k_aliasing", + "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE))", - "MetricGroup": "Flops;InsType", - "MetricName": "IpArith", - "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). May undercount due to FMA double counting. Approximated prior to BDW." + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_info_slots", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_alu_op_utilization", + "MetricThreshold": "tma_alu_op_utilization > 0.6", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE", - "MetricGroup": "Flops;FpScalar;InsType", - "MetricName": "IpArith_Scalar_SP", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", + "MetricExpr": "100 * OTHER_ASSISTS.ANY_WB_ASSIST / tma_info_slots", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_assists", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.ANY", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE", - "MetricGroup": "Flops;FpScalar;InsType", - "MetricName": "IpArith_Scalar_DP", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "1 - (tma_frontend_bound + tma_bad_speculation + tma_retiring)", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_backend_bound", + "MetricThreshold": "tma_backend_bound > 0.2", + "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX128", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", + "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_bad_speculation", + "MetricThreshold": "tma_bad_speculation > 0.15", + "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX256", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation", + "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM", + "MetricName": "tma_branch_mispredicts", + "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_info_branch_misprediction_cost, tma_mispredicts_resteers", + "ScaleUnit": "100%" }, { - "BriefDescription": "Total number of retired Instructions Sample with: INST_RETIRED.PREC_DIST", - "MetricExpr": "INST_RETIRED.ANY", - "MetricGroup": "Summary;tma_L1_group", - "MetricName": "Instructions" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", + "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY) / tma_info_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_branch_resteers", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=1@", - "MetricGroup": "Pipeline;Ret", - "MetricName": "Retire" + "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_cisc", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", + "ScaleUnit": "100%" }, { - "BriefDescription": "", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@", - "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", - "MetricName": "Execute" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears", + "MetricExpr": "MACHINE_CLEARS.COUNT * tma_branch_resteers / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)", + "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC", + "MetricName": "tma_clears_resteers", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", - "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", - "MetricGroup": "DSB;Fed;FetchBW", - "MetricName": "DSB_Coverage" + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(60 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) + 43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD)))) / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_contested_accesses", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BadSpec;BrMispredicts", - "MetricName": "IpMispredict" + "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_backend_bound - tma_memory_bound", + "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", + "ScaleUnit": "100%" }, { - "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)", - "MetricExpr": "(tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) * SLOTS / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BrMispredicts", - "MetricName": "Branch_Misprediction_Cost" + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) / tma_info_clks", + "MetricGroup": "Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_data_sharing", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", - "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_UOPS_RETIRED.L1_MISS + MEM_LOAD_UOPS_RETIRED.HIT_LFB)", - "MetricGroup": "Mem;MemoryBound;MemoryLat", - "MetricName": "Load_Miss_Real_Latency" + "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", + "MetricExpr": "ARITH.FPU_DIV_ACTIVE / tma_info_core_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_divider", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_UOPS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)", - "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES", - "MetricGroup": "Mem;MemoryBW;MemoryBound", - "MetricName": "MLP" + "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", + "MetricConstraint": "NO_GROUP_EVENTS_SMT", + "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS)) * CYCLE_ACTIVITY.STALLS_L2_MISS / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_MISS_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L1_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L1MPKI" + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", + "MetricExpr": "(IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4_UOPS) / tma_info_core_clks / 2", + "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_dsb", + "MetricThreshold": "tma_dsb > 0.15 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L2_MISS / INST_RETIRED.ANY", - "MetricGroup": "Backend;CacheMisses;Mem", - "MetricName": "L2MPKI" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_clks", + "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_dsb_switches", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Related metrics: tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_iptb, tma_lcp", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all request types (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem;Offcore", - "MetricName": "L2MPKI_All" + "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", + "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + cpu@DTLB_LOAD_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * DTLB_LOAD_MISSES.WALK_COMPLETED) / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group", + "MetricName": "tma_dtlb_load", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2MPKI_Load" + "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", + "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + cpu@DTLB_STORE_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * DTLB_STORE_MISSES.WALK_COMPLETED) / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group", + "MetricName": "tma_dtlb_store", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_UOPS_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache hits per kilo instruction for all request types (including speculative)", - "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2HPKI_All" + "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing", + "MetricExpr": "(200 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_MISS.REMOTE_HITM + 60 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_HIT.HITM_OTHER_CORE) / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group", + "MetricName": "tma_false_sharing", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache hits per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2HPKI_Load" + "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_info_load_miss_real_latency * cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=1@ / tma_info_clks", + "MetricGroup": "MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricName": "tma_fb_full", + "MetricThreshold": "tma_fb_full > 0.3", + "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", + "ScaleUnit": "100%" }, { - "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L3_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L3MPKI" + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", + "MetricExpr": "tma_frontend_bound - tma_fetch_latency", + "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Related metrics: tma_dsb_switches, tma_info_dsb_coverage, tma_info_iptb, tma_lcp", + "ScaleUnit": "100%" }, { - "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", - "MetricConstraint": "NO_NMI_WATCHDOG", - "MetricExpr": "(ITLB_MISSES.WALK_DURATION + DTLB_LOAD_MISSES.WALK_DURATION + DTLB_STORE_MISSES.WALK_DURATION + 7 * (DTLB_STORE_MISSES.WALK_COMPLETED + DTLB_LOAD_MISSES.WALK_COMPLETED + ITLB_MISSES.WALK_COMPLETED)) / (2 * CORE_CLKS)", - "MetricGroup": "Mem;MemoryTLB", - "MetricName": "Page_Walks_Utilization" + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", + "MetricExpr": "4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / tma_info_slots", + "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: RS_EVENTS.EMPTY_END", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW" + "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", + "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_fp_arith", + "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW" + "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired", + "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", + "MetricName": "tma_fp_scalar", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW" + "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths", + "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0x3c@ / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", + "MetricName": "tma_fp_vector", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "L1D_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW_1T" + "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", + "MetricName": "tma_fp_vector_128b", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "L2_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW_1T" + "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", + "MetricName": "tma_fp_vector_256b", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "L3_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW_1T" + "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", + "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / tma_info_slots", + "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "0", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "L3_Cache_Access_BW_1T" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences", + "MetricExpr": "tma_microcode_sequencer", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_heavy_operations", + "MetricThreshold": "tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average CPU Utilization", - "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", - "MetricGroup": "HPC;Summary", - "MetricName": "CPU_Utilization" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses.", + "MetricExpr": "ICACHE.IFDATA_STALL / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_icache_misses", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "ScaleUnit": "100%" }, { "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", - "MetricExpr": "Turbo_Utilization * TSC / 1e9 / duration_time", + "MetricExpr": "tma_info_turbo_utilization * TSC / 1e9 / duration_time", "MetricGroup": "Power;Summary", - "MetricName": "Average_Frequency" + "MetricName": "tma_info_average_frequency" }, { - "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / duration_time", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "GFLOPs", - "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine." - }, - { - "BriefDescription": "Average Frequency Utilization relative nominal frequency", - "MetricExpr": "CLKS / CPU_CLK_UNHALTED.REF_TSC", - "MetricGroup": "Power", - "MetricName": "Turbo_Utilization" - }, - { - "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", - "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", - "MetricGroup": "SMT", - "MetricName": "SMT_2T_Utilization" - }, - { - "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "OS", - "MetricName": "Kernel_Utilization" - }, - { - "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", - "MetricGroup": "OS", - "MetricName": "Kernel_CPI" - }, - { - "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", - "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e9 / duration_time", - "MetricGroup": "HPC;Mem;MemoryBW;SoC", - "MetricName": "DRAM_BW_Use" + "BriefDescription": "Branch instructions per taken branch.", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_bptkbranch" }, { - "BriefDescription": "Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches", - "MetricExpr": "1e9 * (UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=0x182@ / UNC_C_TOR_INSERTS.MISS_OPCODE@filter_opc\\=0x182@) / (Socket_CLKS / duration_time)", - "MetricGroup": "Mem;MemoryLat;SoC", - "MetricName": "MEM_Read_Latency" + "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)", + "MetricExpr": "(tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) * tma_info_slots / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_branch_misprediction_cost", + "PublicDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_branch_mispredicts, tma_mispredicts_resteers" }, { - "BriefDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches", - "MetricExpr": "UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=0x182@ / UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=0x182\\,thresh\\=1@", - "MetricGroup": "Mem;MemoryBW;SoC", - "MetricName": "MEM_Parallel_Reads" + "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "Pipeline", + "MetricName": "tma_info_clks" }, { - "BriefDescription": "Socket actual clocks when any core is active on that socket", - "MetricExpr": "cbox_0@event\\=0x0@", - "MetricGroup": "SoC", - "MetricName": "Socket_CLKS" + "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_clks))", + "MetricGroup": "SMT", + "MetricName": "tma_info_core_clks" }, { - "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", - "MetricGroup": "Branches;OS", - "MetricName": "IpFarBranch" + "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_core_clks", + "MetricGroup": "Ret;SMT;TmaL1;tma_L1_group", + "MetricName": "tma_info_coreipc" }, { - "BriefDescription": "Uncore frequency per die [GHZ]", - "MetricExpr": "Socket_CLKS / #num_dies / duration_time / 1e9", - "MetricGroup": "SoC", - "MetricName": "UNCORE_FREQ" + "BriefDescription": "Cycles Per Instruction (per Logical Processor)", + "MetricExpr": "1 / tma_info_ipc", + "MetricGroup": "Mem;Pipeline", + "MetricName": "tma_info_cpi" }, { - "BriefDescription": "CPU operating frequency (in GHz)", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / duration_time", - "MetricName": "cpu_operating_frequency", - "ScaleUnit": "1GHz" + "BriefDescription": "Average CPU Utilization", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": "HPC;Summary", + "MetricName": "tma_info_cpu_utilization" }, { - "BriefDescription": "Cycles per instruction retired; indicating how much time each executed instruction took; in units of cycles.", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY", - "MetricName": "cpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD / OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_data_l2_mlp" }, { - "BriefDescription": "The ratio of number of completed memory load instructions to the total number completed instructions", - "MetricExpr": "MEM_UOPS_RETIRED.ALL_LOADS / INST_RETIRED.ANY", - "MetricName": "loads_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", + "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e9 / duration_time", + "MetricGroup": "HPC;Mem;MemoryBW;SoC;tma_issueBW", + "MetricName": "tma_info_dram_bw_use", + "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_fb_full, tma_mem_bandwidth, tma_sq_full" }, { - "BriefDescription": "The ratio of number of completed memory store instructions to the total number completed instructions", - "MetricExpr": "MEM_UOPS_RETIRED.ALL_STORES / INST_RETIRED.ANY", - "MetricName": "stores_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_dsb_coverage", + "MetricThreshold": "tma_info_dsb_coverage < 0.7 & tma_info_ipc / 4 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_iptb, tma_lcp" }, { - "BriefDescription": "Ratio of number of requests missing L1 data cache (includes data+rfo w/ prefetches) to the total number of completed instructions", - "MetricExpr": "L1D.REPLACEMENT / INST_RETIRED.ANY", - "MetricName": "l1d_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-thread", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@", + "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", + "MetricName": "tma_info_execute" }, { - "BriefDescription": "Ratio of number of demand load requests hitting in L1 data cache to the total number of completed instructions", - "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L1_HIT / INST_RETIRED.ANY", - "MetricName": "l1d_demand_data_read_hits_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "The ratio of Executed- by Issued-Uops", + "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", + "MetricGroup": "Cor;Pipeline", + "MetricName": "tma_info_execute_per_issue", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage." }, { - "BriefDescription": "Ratio of number of code read requests missing in L1 instruction cache (includes prefetches) to the total number of completed instructions", - "MetricExpr": "L2_RQSTS.ALL_CODE_RD / INST_RETIRED.ANY", - "MetricName": "l1_i_code_read_misses_with_prefetches_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / tma_info_core_clks", + "MetricGroup": "Flops;Ret", + "MetricName": "tma_info_flopc" }, { - "BriefDescription": "Ratio of number of completed demand load requests hitting in L2 cache to the total number of completed instructions", - "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L2_HIT / INST_RETIRED.ANY", - "MetricName": "l2_demand_data_read_hits_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)", + "MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0x3c@) / (2 * tma_info_core_clks)", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_fp_arith_utilization", + "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)." }, { - "BriefDescription": "Ratio of number of requests missing L2 cache (includes code+data+rfo w/ prefetches) to the total number of completed instructions", - "MetricExpr": "L2_LINES_IN.ALL / INST_RETIRED.ANY", - "MetricName": "l2_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Giga Floating Point Operations Per Second", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / duration_time", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_gflops", + "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine." }, { - "BriefDescription": "Ratio of number of completed data read request missing L2 cache to the total number of completed instructions", - "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L2_MISS / INST_RETIRED.ANY", - "MetricName": "l2_demand_data_read_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", + "MetricExpr": "UOPS_EXECUTED.THREAD / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)", + "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", + "MetricName": "tma_info_ilp" }, { - "BriefDescription": "Ratio of number of code read request missing L2 cache to the total number of completed instructions", - "MetricExpr": "L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", - "MetricName": "l2_demand_code_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Summary;TmaL1;tma_L1_group", + "MetricName": "tma_info_instructions", + "PublicDescription": "Total number of retired Instructions. Sample with: INST_RETIRED.PREC_DIST" }, { - "BriefDescription": "Ratio of number of data read requests missing last level core cache (includes demand w/ prefetches) to the total number of completed instructions", - "MetricExpr": "(cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=0x182@ + cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=0x192@) / INST_RETIRED.ANY", - "MetricName": "llc_data_read_mpi_demand_plus_prefetch", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0x3c@)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_iparith", + "MetricThreshold": "tma_info_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). May undercount due to FMA double counting. Approximated prior to BDW." }, { - "BriefDescription": "Ratio of number of code read requests missing last level core cache (includes demand w/ prefetches) to the total number of completed instructions", - "MetricExpr": "(cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=0x181@ + cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=0x191@) / INST_RETIRED.ANY", - "MetricName": "llc_code_read_mpi_demand_plus_prefetch", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx128", + "MetricThreshold": "tma_info_iparith_avx128 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "Average latency of a last level cache (LLC) demand and prefetch data read miss (read memory access) in nano seconds", - "MetricExpr": "1e9 * (cbox@UNC_C_TOR_OCCUPANCY.MISS_OPCODE\\,filter_opc\\=0x182@ / cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=0x182@) / (UNC_C_CLOCKTICKS / (#num_cores / #num_packages * #num_packages)) * duration_time", - "MetricName": "llc_data_read_demand_plus_prefetch_miss_latency", - "ScaleUnit": "1ns" + "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx256", + "MetricThreshold": "tma_info_iparith_avx256 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "Average latency of a last level cache (LLC) demand and prefetch data read miss (read memory access) addressed to local memory in nano seconds", - "MetricExpr": "1e9 * (cbox@UNC_C_TOR_OCCUPANCY.MISS_LOCAL_OPCODE\\,filter_opc\\=0x182@ / cbox@UNC_C_TOR_INSERTS.MISS_LOCAL_OPCODE\\,filter_opc\\=0x182@) / (UNC_C_CLOCKTICKS / (#num_cores / #num_packages * #num_packages)) * duration_time", - "MetricName": "llc_data_read_demand_plus_prefetch_miss_latency_for_local_requests", - "ScaleUnit": "1ns" + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_iparith_scalar_dp", + "MetricThreshold": "tma_info_iparith_scalar_dp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "Average latency of a last level cache (LLC) demand and prefetch data read miss (read memory access) addressed to remote memory in nano seconds", - "MetricExpr": "1e9 * (cbox@UNC_C_TOR_OCCUPANCY.MISS_REMOTE_OPCODE\\,filter_opc\\=0x182@ / cbox@UNC_C_TOR_INSERTS.MISS_REMOTE_OPCODE\\,filter_opc\\=0x182@) / (UNC_C_CLOCKTICKS / (#num_cores / #num_packages * #num_packages)) * duration_time", - "MetricName": "llc_data_read_demand_plus_prefetch_miss_latency_for_remote_requests", - "ScaleUnit": "1ns" + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_iparith_scalar_sp", + "MetricThreshold": "tma_info_iparith_scalar_sp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions", - "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", - "MetricName": "itlb_mpi", - "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the ITLB (Instruction TLB) and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Branches;Fed;InsType", + "MetricName": "tma_info_ipbranch", + "MetricThreshold": "tma_info_ipbranch < 8" }, { - "BriefDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions", - "MetricExpr": "ITLB_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY", - "MetricName": "itlb_large_page_mpi", - "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the Instruction Translation Lookaside Buffer (ITLB) and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions Per Cycle (per Logical Processor)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_clks", + "MetricGroup": "Ret;Summary", + "MetricName": "tma_info_ipc" }, { - "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions", - "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", - "MetricName": "dtlb_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_ipcall", + "MetricThreshold": "tma_info_ipcall < 200" }, { - "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions", - "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", - "MetricName": "dtlb_store_mpi", - "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", + "MetricGroup": "Branches;OS", + "MetricName": "tma_info_ipfarbranch", + "MetricThreshold": "tma_info_ipfarbranch < 1e6" }, { - "BriefDescription": "Memory read that miss the last level cache (LLC) addressed to local DRAM as a percentage of total memory read accesses, does not include LLC prefetches.", - "MetricExpr": "cbox@UNC_C_TOR_INSERTS.MISS_LOCAL_OPCODE\\,filter_opc\\=0x182@ / (cbox@UNC_C_TOR_INSERTS.MISS_LOCAL_OPCODE\\,filter_opc\\=0x182@ + cbox@UNC_C_TOR_INSERTS.MISS_REMOTE_OPCODE\\,filter_opc\\=0x182@)", - "MetricName": "numa_reads_addressed_to_local_dram", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_ipflop", + "MetricThreshold": "tma_info_ipflop < 10" }, { - "BriefDescription": "Memory reads that miss the last level cache (LLC) addressed to remote DRAM as a percentage of total memory read accesses, does not include LLC prefetches.", - "MetricExpr": "cbox@UNC_C_TOR_INSERTS.MISS_REMOTE_OPCODE\\,filter_opc\\=0x182@ / (cbox@UNC_C_TOR_INSERTS.MISS_LOCAL_OPCODE\\,filter_opc\\=0x182@ + cbox@UNC_C_TOR_INSERTS.MISS_REMOTE_OPCODE\\,filter_opc\\=0x182@)", - "MetricName": "numa_reads_addressed_to_remote_dram", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_LOADS", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipload", + "MetricThreshold": "tma_info_ipload < 3" }, { - "BriefDescription": "Uncore operating frequency in GHz", - "MetricExpr": "UNC_C_CLOCKTICKS / (#num_cores / #num_packages * #num_packages) / 1e9 / duration_time", - "MetricName": "uncore_frequency", - "ScaleUnit": "1GHz" + "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).", + "MetricExpr": "tma_info_instructions / (UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * cpu@BR_MISP_EXEC.ALL_BRANCHES\\,umask\\=0xE4@)", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_indirect", + "MetricThreshold": "tma_info_ipmisp_indirect < 1e3" }, { - "BriefDescription": "Intel(R) Quick Path Interconnect (QPI) data transmit bandwidth (MB/sec)", - "MetricExpr": "UNC_Q_TxL_FLITS_G0.DATA * 8 / 1e6 / duration_time", - "MetricName": "qpi_data_transmit_bw", - "ScaleUnit": "1MB/s" + "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BadSpec;BrMispredicts", + "MetricName": "tma_info_ipmispredict", + "MetricThreshold": "tma_info_ipmispredict < 200" }, { - "BriefDescription": "DDR memory read bandwidth (MB/sec)", - "MetricExpr": "UNC_M_CAS_COUNT.RD * 64 / 1e6 / duration_time", - "MetricName": "memory_bandwidth_read", - "ScaleUnit": "1MB/s" + "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_STORES", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipstore", + "MetricThreshold": "tma_info_ipstore < 8" }, { - "BriefDescription": "DDR memory write bandwidth (MB/sec)", - "MetricExpr": "UNC_M_CAS_COUNT.WR * 64 / 1e6 / duration_time", - "MetricName": "memory_bandwidth_write", - "ScaleUnit": "1MB/s" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", + "MetricName": "tma_info_iptb", + "MetricThreshold": "tma_info_iptb < 9", + "PublicDescription": "Instruction per taken branch. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_lcp" }, { - "BriefDescription": "DDR memory bandwidth (MB/sec)", - "MetricExpr": "(UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) * 64 / 1e6 / duration_time", - "MetricName": "memory_bandwidth_total", - "ScaleUnit": "1MB/s" + "BriefDescription": "Instructions per speculative Unknown Branch Misprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_instructions / BACLEARS.ANY", + "MetricGroup": "Fed", + "MetricName": "tma_info_ipunknown_branch" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the CPU.", - "MetricExpr": "cbox@UNC_C_TOR_INSERTS.OPCODE\\,filter_opc\\=0x19e@ * 64 / 1e6 / duration_time", - "MetricName": "io_bandwidth_disk_or_network_writes", - "ScaleUnit": "1MB/s" + "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_cpi" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the CPU.", - "MetricExpr": "(cbox@UNC_C_TOR_INSERTS.OPCODE\\,filter_opc\\=0x1c8\\,filter_tid\\=0x3e@ + cbox@UNC_C_TOR_INSERTS.OPCODE\\,filter_opc\\=0x180\\,filter_tid\\=0x3e@) * 64 / 1e6 / duration_time", - "MetricName": "io_bandwidth_disk_or_network_reads", - "ScaleUnit": "1MB/s" + "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_utilization", + "MetricThreshold": "tma_info_kernel_utilization > 0.05" }, { - "BriefDescription": "Uops delivered from decoded instruction cache (decoded stream buffer or DSB) as a percent of total uops delivered to Instruction Decode Queue", - "MetricExpr": "IDQ.DSB_UOPS / UOPS_ISSUED.ANY", - "MetricName": "percent_uops_delivered_from_decoded_icache", - "ScaleUnit": "100%" + "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw" }, { - "BriefDescription": "Uops delivered from legacy decode pipeline (Micro-instruction Translation Engine or MITE) as a percent of total uops delivered to Instruction Decode Queue", - "MetricExpr": "IDQ.MITE_UOPS / UOPS_ISSUED.ANY", - "MetricName": "percent_uops_delivered_from_legacy_decode_pipeline", - "ScaleUnit": "100%" + "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "tma_info_l1d_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw_1t" }, { - "BriefDescription": "Uops delivered from microcode sequencer (MS) as a percent of total uops delivered to Instruction Decode Queue", - "MetricExpr": "IDQ.MS_UOPS / UOPS_ISSUED.ANY", - "MetricName": "percent_uops_delivered_from_microcode_sequencer", - "ScaleUnit": "100%" + "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l1mpki" }, { - "BriefDescription": "Uops delivered from loop stream detector(LSD) as a percent of total uops delivered to Instruction Decode Queue", - "MetricExpr": "LSD.UOPS / UOPS_ISSUED.ANY", - "MetricName": "percent_uops_delivered_from_loop_stream_detector", - "ScaleUnit": "100%" + "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw" }, { - "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", - "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / SLOTS", - "MetricGroup": "PGO;TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_frontend_bound", - "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Machine_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound.", - "ScaleUnit": "100%" + "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "tma_info_l2_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw_1t" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", - "MetricExpr": "4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / SLOTS", - "MetricGroup": "Frontend;TopdownL2;tma_L2_group;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_latency", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period.", - "ScaleUnit": "100%" + "BriefDescription": "L2 cache hits per kilo instruction for all request types (including speculative)", + "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2hpki_all" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses.", - "MetricExpr": "ICACHE.IFDATA_STALL / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_icache_misses", - "ScaleUnit": "100%" + "BriefDescription": "L2 cache hits per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2hpki_load" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses.", - "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + cpu@ITLB_MISSES.WALK_DURATION\\,cmask\\=0x1@ + 7 * ITLB_MISSES.WALK_COMPLETED) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_itlb_misses", - "ScaleUnit": "100%" + "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheMisses;Mem", + "MetricName": "tma_info_l2mpki" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", - "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_branch_resteers", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings.", - "ScaleUnit": "100%" + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all request types (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem;Offcore", + "MetricName": "tma_info_l2mpki_all" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. ", - "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES * tma_branch_resteers / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL4;tma_L4_group;tma_branch_resteers_group", - "MetricName": "tma_mispredicts_resteers", - "ScaleUnit": "100%" + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2mpki_load" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. ", - "MetricExpr": "MACHINE_CLEARS.COUNT * tma_branch_resteers / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)", - "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group", - "MetricName": "tma_clears_resteers", - "ScaleUnit": "100%" + "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "0", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_l3_cache_access_bw_1t" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears", - "MetricExpr": "tma_branch_resteers - tma_mispredicts_resteers - tma_clears_resteers", - "MetricGroup": "BigFoot;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group", - "MetricName": "tma_unknown_branches", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (First fetch or hitting BPU capacity limit).", - "ScaleUnit": "100%" + "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_dsb_switches", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty.", - "ScaleUnit": "100%" + "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "tma_info_l3_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw_1t" }, { - "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", - "MetricExpr": "ILD_STALL.LCP / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_lcp", - "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs.", - "ScaleUnit": "100%" + "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L3_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l3mpki" }, { - "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", - "MetricExpr": "2 * IDQ.MS_SWITCHES / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_ms_switches", - "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals.", - "ScaleUnit": "100%" + "BriefDescription": "Average Latency for L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_load_l2_miss_latency" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", - "MetricExpr": "tma_frontend_bound - tma_fetch_latency", - "MetricGroup": "FetchBW;Frontend;TopdownL2;tma_L2_group;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_bandwidth", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend.", - "ScaleUnit": "100%" + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_load_l2_mlp" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", - "MetricExpr": "(IDQ.ALL_MITE_CYCLES_ANY_UOPS - IDQ.ALL_MITE_CYCLES_4_UOPS) / CORE_CLKS / 2", - "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", - "MetricName": "tma_mite", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", - "ScaleUnit": "100%" + "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_UOPS_RETIRED.L1_MISS + MEM_LOAD_UOPS_RETIRED.HIT_LFB)", + "MetricGroup": "Mem;MemoryBound;MemoryLat", + "MetricName": "tma_info_load_miss_real_latency" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", - "MetricExpr": "(IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4_UOPS) / CORE_CLKS / 2", - "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", - "MetricName": "tma_dsb", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", - "ScaleUnit": "100%" + "BriefDescription": "Average number of parallel data read requests to external memory", + "MetricExpr": "UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=0x182@ / UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=0x182\\,thresh\\=1@", + "MetricGroup": "Mem;MemoryBW;SoC", + "MetricName": "tma_info_mem_parallel_reads", + "PublicDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches" }, { - "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", - "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_bad_speculation", - "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", - "ScaleUnit": "100%" + "BriefDescription": "Average latency of data read request to external memory (in nanoseconds)", + "MetricExpr": "1e9 * (UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=0x182@ / UNC_C_TOR_INSERTS.MISS_OPCODE@filter_opc\\=0x182@) / (tma_info_socket_clks / duration_time)", + "MetricGroup": "Mem;MemoryLat;SoC", + "MetricName": "tma_info_mem_read_latency", + "PublicDescription": "Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches. ([RKL+]memory-controller only)" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", - "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL2;tma_L2_group;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_branch_mispredicts", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path.", - "ScaleUnit": "100%" + "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES", + "MetricGroup": "Mem;MemoryBW;MemoryBound", + "MetricName": "tma_info_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", - "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts", - "MetricGroup": "BadSpec;MachineClears;TopdownL2;tma_L2_group;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_machine_clears", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes.", - "ScaleUnit": "100%" + "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", + "MetricExpr": "(ITLB_MISSES.WALK_DURATION + DTLB_LOAD_MISSES.WALK_DURATION + DTLB_STORE_MISSES.WALK_DURATION + 7 * (DTLB_STORE_MISSES.WALK_COMPLETED + DTLB_LOAD_MISSES.WALK_COMPLETED + ITLB_MISSES.WALK_COMPLETED)) / (2 * tma_info_core_clks)", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_page_walks_utilization", + "MetricThreshold": "tma_info_page_walks_utilization > 0.5" }, { - "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", - "MetricExpr": "1 - (tma_frontend_bound + tma_bad_speculation + UOPS_RETIRED.RETIRE_SLOTS / SLOTS)", - "MetricGroup": "TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_backend_bound", - "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound.", - "ScaleUnit": "100%" + "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=1@", + "MetricGroup": "Pipeline;Ret", + "MetricName": "tma_info_retire" }, { - "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", - "MetricExpr": "(CYCLE_ACTIVITY.STALLS_MEM_ANY + RESOURCE_STALLS.SB) / (CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - (UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if INST_RETIRED.ANY / CPU_CLK_UNHALTED.THREAD > 1.8 else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) * tma_backend_bound", - "MetricGroup": "Backend;TopdownL2;tma_L2_group;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_memory_bound", - "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", - "ScaleUnit": "100%" + "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", + "MetricExpr": "4 * tma_info_core_clks", + "MetricGroup": "TmaL1;tma_L1_group", + "MetricName": "tma_info_slots" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", - "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_MISS) / CPU_CLK_UNHALTED.THREAD, 0)", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", - "MetricName": "tma_l1_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache.", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", + "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", + "MetricGroup": "SMT", + "MetricName": "tma_info_smt_2t_utilization" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + cpu@DTLB_LOAD_MISSES.WALK_DURATION\\,cmask\\=0x1@ + 7 * DTLB_LOAD_MISSES.WALK_COMPLETED) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_dtlb_load", - "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss.", - "ScaleUnit": "100%" + "BriefDescription": "Socket actual clocks when any core is active on that socket", + "MetricExpr": "cbox_0@event\\=0x0@", + "MetricGroup": "SoC", + "MetricName": "tma_info_socket_clks" }, { - "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", - "MetricExpr": "min(13 * LD_BLOCKS.STORE_FORWARD / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_store_fwd_blk", - "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", - "ScaleUnit": "100%" + "BriefDescription": "Average Frequency Utilization relative nominal frequency", + "MetricExpr": "tma_info_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_turbo_utilization" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", - "MetricExpr": "min(MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_lock_latency", - "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them.", - "ScaleUnit": "100%" + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY", + "MetricGroup": "Pipeline;Ret;Retire", + "MetricName": "tma_info_uoppi", + "MetricThreshold": "tma_info_uoppi > 1.05" }, { - "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. ", - "MetricExpr": "min(Load_Miss_Real_Latency * LD_BLOCKS.NO_SR / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_split_loads", - "ScaleUnit": "100%" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW", + "MetricName": "tma_info_uptb", + "MetricThreshold": "tma_info_uptb < 6" }, { - "BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset", - "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_4k_aliasing", - "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses", + "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + cpu@ITLB_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * ITLB_MISSES.WALK_COMPLETED) / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_itlb_misses", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: ITLB_MISSES.WALK_COMPLETED", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", - "MetricExpr": "Load_Miss_Real_Latency * cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=0x1@ / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryBW;TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_fb_full", - "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory).", + "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", + "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_MISS) / tma_info_clks, 0)", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group", + "MetricName": "tma_l1_bound", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT_PS;MEM_LOAD_UOPS_RETIRED.HIT_LFB_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads", - "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / CPU_CLK_UNHALTED.THREAD", + "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_clks", "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance.", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L2_HIT_PS", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", - "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS) * CYCLE_ACTIVITY.STALLS_L2_MISS / CPU_CLK_UNHALTED.THREAD", + "MetricConstraint": "NO_GROUP_EVENTS_SMT", + "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS) * CYCLE_ACTIVITY.STALLS_L2_MISS / tma_info_clks", "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance.", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", - "MetricExpr": "min((60 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) + 43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD)))) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_l3_bound_group", - "MetricName": "tma_contested_accesses", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing.", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", - "MetricExpr": "min(43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Offcore;Snoop;TopdownL4;tma_L4_group;tma_l3_bound_group", - "MetricName": "tma_data_sharing", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance.", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", - "MetricExpr": "min(41 * (MEM_LOAD_UOPS_RETIRED.L3_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l3_bound_group", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "41 * (MEM_LOAD_UOPS_RETIRED.L3_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) / tma_info_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings.", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", - "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on else OFFCORE_REQUESTS_BUFFER.SQ_FULL) / CORE_CLKS", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_l3_bound_group", - "MetricName": "tma_sq_full", - "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). The Super Queue is used for requests to access the L2 cache or to go out to the Uncore.", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS. Related metrics: tma_mem_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", - "MetricExpr": "min((1 - MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS)) * CYCLE_ACTIVITY.STALLS_L2_MISS / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", - "MetricName": "tma_dram_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance.", + "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", + "MetricExpr": "ILD_STALL.LCP / tma_info_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_lcp", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=0x4@) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group", - "MetricName": "tma_mem_bandwidth", - "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that).", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", + "MetricExpr": "tma_retiring - tma_heavy_operations", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_light_operations", + "MetricThreshold": "tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / CPU_CLK_UNHALTED.THREAD - tma_mem_bandwidth", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group", - "MetricName": "tma_mem_latency", - "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that).", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_2 + UOPS_DISPATCHED_PORT.PORT_3 + UOPS_DISPATCHED_PORT.PORT_7 - UOPS_DISPATCHED_PORT.PORT_4) / (2 * tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_load_op_utilization", + "MetricThreshold": "tma_load_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations. Sample with: UOPS_DISPATCHED.PORT_2_3", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory", - "MetricExpr": "min(200 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) / CPU_CLK_UNHALTED.THREAD, 1)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "200 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) / tma_info_clks", "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_group", "MetricName": "tma_local_dram", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory. Caching will improve the latency and increase performance.", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory", - "MetricExpr": "min(310 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latency_group", - "MetricName": "tma_remote_dram", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory. This is caused often due to non-optimal NUMA allocations. #link to NUMA article", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues", - "MetricExpr": "min((200 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) + 180 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD)))) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latency_group", - "MetricName": "tma_remote_cache", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues. This is caused often due to non-optimal NUMA allocations. #link to NUMA article", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", - "MetricExpr": "RESOURCE_STALLS.SB / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", - "MetricName": "tma_store_bound", - "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck.", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", - "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_store_bound_group", - "MetricName": "tma_store_latency", - "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full)", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing", - "MetricExpr": "min((200 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_MISS.REMOTE_HITM + 60 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_HIT.HITM_OTHER_CORE) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_store_bound_group", - "MetricName": "tma_false_sharing", - "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. ", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents rate of split store accesses", - "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / CORE_CLKS", - "MetricGroup": "TopdownL4;tma_L4_group;tma_store_bound_group", - "MetricName": "tma_split_stores", - "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity.", + "MetricThreshold": "tma_local_dram > 0.1 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory. Caching will improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", - "MetricExpr": "min((8 * DTLB_STORE_MISSES.STLB_HIT + cpu@DTLB_STORE_MISSES.WALK_DURATION\\,cmask\\=0x1@ + 7 * DTLB_STORE_MISSES.WALK_COMPLETED) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_store_bound_group", - "MetricName": "tma_dtlb_store", - "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data.", + "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO) / tma_info_clks", + "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group", + "MetricName": "tma_lock_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOADS_PS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", - "MetricExpr": "tma_backend_bound - tma_memory_bound", - "MetricGroup": "Backend;Compute;TopdownL2;tma_L2_group;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_core_bound", - "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts", + "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", - "MetricExpr": "ARITH.FPU_DIV_ACTIVE / CORE_CLKS", - "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", - "MetricName": "tma_divider", - "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication.", + "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=4@) / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW", + "MetricName": "tma_mem_bandwidth", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_sq_full", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", - "MetricExpr": "(CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - (UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if INST_RETIRED.ANY / CPU_CLK_UNHALTED.THREAD > 1.8 else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB - RESOURCE_STALLS.SB - CYCLE_ACTIVITY.STALLS_MEM_ANY) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group", - "MetricName": "tma_ports_utilization", - "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", + "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_clks - tma_mem_bandwidth", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat", + "MetricName": "tma_mem_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_l3_hit_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\=0x1\\,cmask\\=0x1@ / 2 if #SMT_on else CYCLE_ACTIVITY.STALLS_TOTAL - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0)) / CORE_CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_0", - "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", + "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(CYCLE_ACTIVITY.STALLS_MEM_ANY + RESOURCE_STALLS.SB) / (CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - (UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if tma_info_ipc > 1.8 else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) * tma_backend_bound", + "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x2@) / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) / CORE_CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_1", - "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful.", + "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / tma_info_slots", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS", + "MetricName": "tma_microcode_sequencer", + "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x3@) / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / CORE_CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_2", - "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop.", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage", + "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES * tma_branch_resteers / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)", + "MetricGroup": "BadSpec;BrMispredicts;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueBM", + "MetricName": "tma_mispredicts_resteers", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Related metrics: tma_branch_mispredicts, tma_info_branch_misprediction_cost", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise).", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x3@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / CORE_CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_3m", + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", + "MetricExpr": "(IDQ.ALL_MITE_CYCLES_ANY_UOPS - IDQ.ALL_MITE_CYCLES_4_UOPS) / tma_info_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_mite", + "MetricThreshold": "tma_mite > 0.1 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", - "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / SLOTS", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", - "MetricName": "tma_alu_op_utilization", + "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", + "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_clks", + "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_0 / CORE_CLKS", - "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilization_group", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_0 / tma_info_core_clks", + "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", "MetricName": "tma_port_0", + "MetricThreshold": "tma_port_0 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_1 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_1 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", "MetricName": "tma_port_1", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_5 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group", - "MetricName": "tma_port_5", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_6 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group", - "MetricName": "tma_port_6", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations", - "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_2 + UOPS_DISPATCHED_PORT.PORT_3 + UOPS_DISPATCHED_PORT.PORT_7 - UOPS_DISPATCHED_PORT.PORT_4) / (2 * CORE_CLKS)", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", - "MetricName": "tma_load_op_utilization", + "MetricThreshold": "tma_port_1 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [ICL+] Loads)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_2 / CORE_CLKS", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_2 / tma_info_core_clks", "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_group", "MetricName": "tma_port_2", + "MetricThreshold": "tma_port_2 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [ICL+] Loads). Sample with: UOPS_DISPATCHED_PORT.PORT_2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [ICL+] Loads)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_3 / CORE_CLKS", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_3 / tma_info_core_clks", "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_group", "MetricName": "tma_port_3", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_4 / CORE_CLKS", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", - "MetricName": "tma_store_op_utilization", + "MetricThreshold": "tma_port_3 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [ICL+] Loads). Sample with: UOPS_DISPATCHED_PORT.PORT_3", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data)", "MetricExpr": "tma_store_op_utilization", - "MetricGroup": "TopdownL6;tma_L6_group;tma_store_op_utilization_group", + "MetricGroup": "TopdownL6;tma_L6_group;tma_issueSpSt;tma_store_op_utilization_group", "MetricName": "tma_port_4", + "MetricThreshold": "tma_port_4 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data). Sample with: UOPS_DISPATCHED_PORT.PORT_4. Related metrics: tma_split_stores", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 7 ([HSW+]simple Store-address)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_7 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_L6_group;tma_store_op_utilization_group", - "MetricName": "tma_port_7", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_5 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_5", + "MetricThreshold": "tma_port_5 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_retiring", - "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. ", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_6 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_6", + "MetricThreshold": "tma_port_6 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_6. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", - "MetricExpr": "tma_retiring - UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / SLOTS", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_L2_group;tma_retiring_group", - "MetricName": "tma_light_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved.", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 7 ([HSW+]simple Store-address)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_7 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_store_op_utilization_group", + "MetricName": "tma_port_7", + "MetricThreshold": "tma_port_7 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 7 ([HSW+]simple Store-address). Sample with: UOPS_DISPATCHED_PORT.PORT_7", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", - "MetricExpr": "INST_RETIRED.X87 * UPI / UOPS_RETIRED.RETIRE_SLOTS + tma_fp_scalar + tma_fp_vector", - "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group", - "MetricName": "tma_fp_arith", - "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", + "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - (UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if tma_info_ipc > 1.8 else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB - RESOURCE_STALLS.SB - CYCLE_ACTIVITY.STALLS_MEM_ANY) / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_ports_utilization", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric serves as an approximation of legacy x87 usage", - "MetricExpr": "INST_RETIRED.X87 * UPI / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", - "MetricName": "tma_x87_use", - "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", + "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\,cmask\\=1@ / 2 if #SMT_on else (CYCLE_ACTIVITY.STALLS_TOTAL - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0)) / tma_info_core_clks)", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_0", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE) / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group", - "MetricName": "tma_fp_scalar", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting.", + "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) / tma_info_core_clks)", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_1", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths", - "MetricExpr": "min((FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS, 1)", - "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group", - "MetricName": "tma_fp_vector", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting.", + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_clks)", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors", - "MetricExpr": "min((FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS, 1)", - "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group", - "MetricName": "tma_fp_vector_128b", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting.", + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise).", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_3m", + "MetricThreshold": "tma_ports_utilized_3m > 0.7 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors", - "MetricExpr": "min((FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS, 1)", - "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group", - "MetricName": "tma_fp_vector_256b", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting.", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(200 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) + 180 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD)))) / tma_info_clks", + "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_issueSyncxn;tma_mem_latency_group", + "MetricName": "tma_remote_cache", + "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues. This is caused often due to non-optimal NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM_PS;MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD_PS. Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / SLOTS", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_L2_group;tma_retiring_group", - "MetricName": "tma_heavy_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences. This highly-correlates with the uop length of these instructions/sequences.", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "310 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) / tma_info_clks", + "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latency_group", + "MetricName": "tma_remote_dram", + "MetricThreshold": "tma_remote_dram > 0.1 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory. This is caused often due to non-optimal NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", - "MetricExpr": "tma_heavy_operations", - "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group", - "MetricName": "tma_microcode_sequencer", - "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.", + "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1", + "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.RETIRE_SLOTS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", - "MetricExpr": "min(100 * OTHER_ASSISTS.ANY_WB_ASSIST / SLOTS, 1)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", - "MetricName": "tma_assists", - "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases.", + "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_info_load_miss_real_latency * LD_BLOCKS.NO_SR / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_split_loads", + "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", - "MetricExpr": "max(0, tma_heavy_operations - tma_assists)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", - "MetricName": "tma_cisc", - "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", + "BriefDescription": "This metric represents rate of split store accesses", + "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / tma_info_core_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group", + "MetricName": "tma_split_stores", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_UOPS_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_4", "ScaleUnit": "100%" }, { - "BriefDescription": "C3 residency percent per core", - "MetricExpr": "cstate_core@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Core_Residency", + "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", + "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on else OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group", + "MetricName": "tma_sq_full", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_mem_bandwidth", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per core", - "MetricExpr": "cstate_core@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Core_Residency", + "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", + "MetricExpr": "RESOURCE_STALLS.SB / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_UOPS_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "C7 residency percent per core", - "MetricExpr": "cstate_core@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Core_Residency", + "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", + "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_store_fwd_blk", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", "ScaleUnit": "100%" }, { - "BriefDescription": "C2 residency percent per package", - "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C2_Pkg_Residency", + "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_clks", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group", + "MetricName": "tma_store_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "C3 residency percent per package", - "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Pkg_Residency", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_4 / tma_info_core_clks", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_store_op_utilization", + "MetricThreshold": "tma_store_op_utilization > 0.6", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per package", - "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Pkg_Residency", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears", + "MetricExpr": "tma_branch_resteers - tma_mispredicts_resteers - tma_clears_resteers", + "MetricGroup": "BigFoot;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group", + "MetricName": "tma_unknown_branches", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (e.g. first time the branch is fetched or hitting BTB capacity limit). Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, { - "BriefDescription": "C7 residency percent per package", - "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Pkg_Residency", + "BriefDescription": "This metric serves as an approximation of legacy x87 usage", + "MetricExpr": "INST_RETIRED.X87 * tma_info_uoppi / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", + "MetricName": "tma_x87_use", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", "ScaleUnit": "100%" } ] diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/uncore-cache.json b/tools/perf/pmu-events/arch/x86/broadwellx/uncore-cache.json index 38eaac5afd4b..746954775437 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/uncore-cache.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/uncore-cache.json @@ -5,7 +5,7 @@ "EventName": "LLC_MISSES.CODE_LLC_PREFETCH", "Filter": "filter_opc=0x191", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions inserted into the TOR that match an opcode.", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions inserted into the TOR that match an opcode.", "ScaleUnit": "64Bytes", "UMask": "0x3", "Unit": "CBO" @@ -16,7 +16,7 @@ "EventName": "LLC_MISSES.DATA_LLC_PREFETCH", "Filter": "filter_opc=0x192", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions inserted into the TOR that match an opcode.", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions inserted into the TOR that match an opcode.", "ScaleUnit": "64Bytes", "UMask": "0x3", "Unit": "CBO" @@ -27,7 +27,7 @@ "EventName": "LLC_MISSES.DATA_READ", "Filter": "filter_opc=0x182", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions inserted into the TOR that match an opcode.", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions inserted into the TOR that match an opcode.", "ScaleUnit": "64Bytes", "UMask": "0x3", "Unit": "CBO" @@ -38,7 +38,7 @@ "EventName": "LLC_MISSES.MMIO_READ", "Filter": "filter_opc=0x187,filter_nc=1", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions inserted into the TOR that match an opcode.", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions inserted into the TOR that match an opcode.", "ScaleUnit": "64Bytes", "UMask": "0x3", "Unit": "CBO" @@ -49,7 +49,7 @@ "EventName": "LLC_MISSES.MMIO_WRITE", "Filter": "filter_opc=0x18f,filter_nc=1", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions inserted into the TOR that match an opcode.", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions inserted into the TOR that match an opcode.", "ScaleUnit": "64Bytes", "UMask": "0x3", "Unit": "CBO" @@ -60,7 +60,7 @@ "EventName": "LLC_MISSES.PCIE_NON_SNOOP_WRITE", "Filter": "filter_opc=0x1c8,filter_tid=0x3e", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions inserted into the TOR that match an opcode.", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions inserted into the TOR that match an opcode.", "ScaleUnit": "64Bytes", "UMask": "0x3", "Unit": "CBO" @@ -71,7 +71,7 @@ "EventName": "LLC_MISSES.PCIE_READ", "Filter": "filter_opc=0x19e", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions inserted into the TOR that match an opcode.", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions inserted into the TOR that match an opcode.", "ScaleUnit": "64Bytes", "UMask": "0x3", "Unit": "CBO" @@ -82,7 +82,7 @@ "EventName": "LLC_MISSES.PCIE_WRITE", "Filter": "filter_opc=0x1c8", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions inserted into the TOR that match an opcode.", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions inserted into the TOR that match an opcode.", "ScaleUnit": "64Bytes", "UMask": "0x3", "Unit": "CBO" @@ -93,7 +93,7 @@ "EventName": "LLC_MISSES.RFO_LLC_PREFETCH", "Filter": "filter_opc=0x190", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions inserted into the TOR that match an opcode.", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions inserted into the TOR that match an opcode.", "ScaleUnit": "64Bytes", "UMask": "0x3", "Unit": "CBO" @@ -104,7 +104,7 @@ "EventName": "LLC_MISSES.UNCACHEABLE", "Filter": "filter_opc=0x187", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions inserted into the TOR that match an opcode.", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions inserted into the TOR that match an opcode.", "ScaleUnit": "64Bytes", "UMask": "0x3", "Unit": "CBO" @@ -115,7 +115,7 @@ "EventName": "LLC_REFERENCES.CODE_LLC_PREFETCH", "Filter": "filter_opc=0x181", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Transactions inserted into the TOR that match an opcode (matched by Cn_MSR_PMON_BOX_FILTER.opc)", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Transactions inserted into the TOR that match an opcode (matched by Cn_MSR_PMON_BOX_FILTER.opc)", "ScaleUnit": "64Bytes", "UMask": "0x1", "Unit": "CBO" @@ -126,7 +126,7 @@ "EventName": "LLC_REFERENCES.PCIE_NS_PARTIAL_WRITE", "Filter": "filter_opc=0x180,filter_tid=0x3e", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Transactions inserted into the TOR that match an opcode (matched by Cn_MSR_PMON_BOX_FILTER.opc)", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Transactions inserted into the TOR that match an opcode (matched by Cn_MSR_PMON_BOX_FILTER.opc)", "UMask": "0x1", "Unit": "CBO" }, @@ -136,7 +136,7 @@ "EventName": "LLC_REFERENCES.PCIE_READ", "Filter": "filter_opc=0x19e", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Transactions inserted into the TOR that match an opcode (matched by Cn_MSR_PMON_BOX_FILTER.opc)", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Transactions inserted into the TOR that match an opcode (matched by Cn_MSR_PMON_BOX_FILTER.opc)", "ScaleUnit": "64Bytes", "UMask": "0x1", "Unit": "CBO" @@ -147,7 +147,7 @@ "EventName": "LLC_REFERENCES.PCIE_WRITE", "Filter": "filter_opc=0x1c8,filter_tid=0x3e", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Transactions inserted into the TOR that match an opcode (matched by Cn_MSR_PMON_BOX_FILTER.opc)", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Transactions inserted into the TOR that match an opcode (matched by Cn_MSR_PMON_BOX_FILTER.opc)", "ScaleUnit": "64Bytes", "UMask": "0x1", "Unit": "CBO" @@ -158,7 +158,7 @@ "EventName": "LLC_REFERENCES.STREAMING_FULL", "Filter": "filter_opc=0x18c", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Transactions inserted into the TOR that match an opcode (matched by Cn_MSR_PMON_BOX_FILTER.opc)", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Transactions inserted into the TOR that match an opcode (matched by Cn_MSR_PMON_BOX_FILTER.opc)", "ScaleUnit": "64Bytes", "UMask": "0x1", "Unit": "CBO" @@ -169,7 +169,7 @@ "EventName": "LLC_REFERENCES.STREAMING_PARTIAL", "Filter": "filter_opc=0x18d", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Transactions inserted into the TOR that match an opcode (matched by Cn_MSR_PMON_BOX_FILTER.opc)", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Transactions inserted into the TOR that match an opcode (matched by Cn_MSR_PMON_BOX_FILTER.opc)", "ScaleUnit": "64Bytes", "UMask": "0x1", "Unit": "CBO" @@ -1157,7 +1157,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.ALL", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; All transactions inserted into the TOR. This includes requests that reside in the TOR for a short time, such as LLC Hits that do not need to snoop cores or requests that get rejected and have to be retried through one of the ingress queues. The TOR is more commonly a bottleneck in skews with smaller core counts, where the ratio of RTIDs to TOR entries is larger. Note that there are reserved TOR entries for various request types, so it is possible that a given request type be blocked with an occupancy that is less than 20. Also note that generally requests will not be able to arbitrate into the TOR pipeline if there are no available TOR slots.", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; All transactions inserted into the TOR. This includes requests that reside in the TOR for a short time, such as LLC Hits that do not need to snoop cores or requests that get rejected and have to be retried through one of the ingress queues. The TOR is more commonly a bottleneck in skews with smaller core counts, where the ratio of RTIDs to TOR entries is larger. Note that there are reserved TOR entries for various request types, so it is possible that a given request type be blocked with an occupancy that is less than 20. Also note that generally requests will not be able to arbitrate into the TOR pipeline if there are no available TOR slots.", "UMask": "0x8", "Unit": "CBO" }, @@ -1166,7 +1166,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.EVICTION", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Eviction transactions inserted into the TOR. Evictions can be quick, such as when the line is in the F, S, or E states and no core valid bits are set. They can also be longer if either CV bits are set (so the cores need to be snooped) and/or if there is a HitM (in which case it is necessary to write the request out to memory).", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Eviction transactions inserted into the TOR. Evictions can be quick, such as when the line is in the F, S, or E states and no core valid bits are set. They can also be longer if either CV bits are set (so the cores need to be snooped) and/or if there is a HitM (in which case it is necessary to write the request out to memory).", "UMask": "0x4", "Unit": "CBO" }, @@ -1175,7 +1175,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.LOCAL", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; All transactions inserted into the TOR that are satisifed by locally HOMed memory.", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; All transactions inserted into the TOR that are satisifed by locally HOMed memory.", "UMask": "0x28", "Unit": "CBO" }, @@ -1184,7 +1184,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.LOCAL_OPCODE", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; All transactions, satisifed by an opcode, inserted into the TOR that are satisifed by locally HOMed memory.", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; All transactions, satisifed by an opcode, inserted into the TOR that are satisifed by locally HOMed memory.", "UMask": "0x21", "Unit": "CBO" }, @@ -1193,7 +1193,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.MISS_LOCAL", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions inserted into the TOR that are satisifed by locally HOMed memory.", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions inserted into the TOR that are satisifed by locally HOMed memory.", "UMask": "0x2a", "Unit": "CBO" }, @@ -1202,7 +1202,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.MISS_LOCAL_OPCODE", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions, satisifed by an opcode, inserted into the TOR that are satisifed by locally HOMed memory.", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions, satisifed by an opcode, inserted into the TOR that are satisifed by locally HOMed memory.", "UMask": "0x23", "Unit": "CBO" }, @@ -1211,7 +1211,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.MISS_OPCODE", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions inserted into the TOR that match an opcode.", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions inserted into the TOR that match an opcode.", "UMask": "0x3", "Unit": "CBO" }, @@ -1220,7 +1220,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.MISS_REMOTE", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions inserted into the TOR that are satisifed by remote caches or remote memory.", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions inserted into the TOR that are satisifed by remote caches or remote memory.", "UMask": "0x8a", "Unit": "CBO" }, @@ -1229,7 +1229,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.MISS_REMOTE_OPCODE", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions, satisifed by an opcode, inserted into the TOR that are satisifed by remote caches or remote memory.", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions, satisifed by an opcode, inserted into the TOR that are satisifed by remote caches or remote memory.", "UMask": "0x83", "Unit": "CBO" }, @@ -1238,7 +1238,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.NID_ALL", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; All NID matched (matches an RTID destination) transactions inserted into the TOR. The NID is programmed in Cn_MSR_PMON_BOX_FILTER.nid. In conjunction with STATE = I, it is possible to monitor misses to specific NIDs in the system.", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; All NID matched (matches an RTID destination) transactions inserted into the TOR. The NID is programmed in Cn_MSR_PMON_BOX_FILTER.nid. In conjunction with STATE = I, it is possible to monitor misses to specific NIDs in the system.", "UMask": "0x48", "Unit": "CBO" }, @@ -1247,7 +1247,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.NID_EVICTION", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; NID matched eviction transactions inserted into the TOR.", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; NID matched eviction transactions inserted into the TOR.", "UMask": "0x44", "Unit": "CBO" }, @@ -1256,7 +1256,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.NID_MISS_ALL", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; All NID matched miss requests that were inserted into the TOR.", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; All NID matched miss requests that were inserted into the TOR.", "UMask": "0x4a", "Unit": "CBO" }, @@ -1265,7 +1265,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.NID_MISS_OPCODE", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions inserted into the TOR that match a NID and an opcode.", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Miss transactions inserted into the TOR that match a NID and an opcode.", "UMask": "0x43", "Unit": "CBO" }, @@ -1274,7 +1274,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.NID_OPCODE", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Transactions inserted into the TOR that match a NID and an opcode.", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Transactions inserted into the TOR that match a NID and an opcode.", "UMask": "0x41", "Unit": "CBO" }, @@ -1283,7 +1283,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.NID_WB", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; NID matched write transactions inserted into the TOR.", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; NID matched write transactions inserted into the TOR.", "UMask": "0x50", "Unit": "CBO" }, @@ -1292,7 +1292,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.OPCODE", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Transactions inserted into the TOR that match an opcode (matched by Cn_MSR_PMON_BOX_FILTER.opc)", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Transactions inserted into the TOR that match an opcode (matched by Cn_MSR_PMON_BOX_FILTER.opc)", "UMask": "0x1", "Unit": "CBO" }, @@ -1301,7 +1301,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.REMOTE", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; All transactions inserted into the TOR that are satisifed by remote caches or remote memory.", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; All transactions inserted into the TOR that are satisifed by remote caches or remote memory.", "UMask": "0x88", "Unit": "CBO" }, @@ -1310,7 +1310,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.REMOTE_OPCODE", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; All transactions, satisifed by an opcode, inserted into the TOR that are satisifed by remote caches or remote memory.", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; All transactions, satisifed by an opcode, inserted into the TOR that are satisifed by remote caches or remote memory.", "UMask": "0x81", "Unit": "CBO" }, @@ -1319,7 +1319,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.WB", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Write transactions inserted into the TOR. This does not include RFO, but actual operations that contain data being sent from the core.", + "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select MISS_OPC_MATCH and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).; Write transactions inserted into the TOR. This does not include RFO, but actual operations that contain data being sent from the core.", "UMask": "0x10", "Unit": "CBO" }, @@ -1590,7 +1590,7 @@ "EventCode": "0x2", "EventName": "UNC_C_TxR_INSERTS.BL_CORE", "PerPkg": "1", - "PublicDescription": "Number of allocations into the Cbo Egress. The Egress is used to queue up requests destined for the ring.; Ring transactions from the Corebo destined for the BL ring. This is commonly used for transferring writeback data to the cache.", + "PublicDescription": "Number of allocations into the Cbo Egress. The Egress is used to queue up requests destined for the ring.; Ring transactions from the Corebo destined for the BL ring. This is commonly used for transfering writeback data to the cache.", "UMask": "0x40", "Unit": "CBO" }, @@ -1737,7 +1737,7 @@ "EventCode": "0x41", "EventName": "UNC_H_DIRECTORY_LAT_OPT", "PerPkg": "1", - "PublicDescription": "Directory Latency Optimization Data Return Path Taken. When directory mode is enabled and the directory returned for a read is Dir=I, then data can be returned using a faster path if certain conditions are met (credits, free pipeline, etc).", + "PublicDescription": "Directory Latency Optimization Data Return Path Taken. When directory mode is enabled and the directory retuned for a read is Dir=I, then data can be returned using a faster path if certain conditions are met (credits, free pipeline, etc).", "Unit": "HA" }, { diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/uncore-interconnect.json b/tools/perf/pmu-events/arch/x86/broadwellx/uncore-interconnect.json index a5457c7ba58b..489a3673323d 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/uncore-interconnect.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/uncore-interconnect.json @@ -3,7 +3,7 @@ "BriefDescription": "Number of non data (control) flits transmitted . Derived from unc_q_txl_flits_g0.non_data", "EventName": "QPI_CTL_BANDWIDTH_TX", "PerPkg": "1", - "PublicDescription": "Counts the number of flits transmitted across the QPI Link. It includes filters for Idle, protocol, and Data Flits. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time (for L0) or 4B instead of 8B for L0p.; Number of non-NULL non-data flits transmitted across QPI. This basically tracks the protocol overhead on the QPI link. One can get a good picture of the QPI-link characteristics by evaluating the protocol flits, data flits, and idle/null flits. This includes the header flits for data packets.", + "PublicDescription": "Counts the number of flits transmitted across the QPI Link. It includes filters for Idle, protocol, and Data Flits. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time (for L0) or 4B instead of 8B for L0p.; Number of non-NULL non-data flits transmitted across QPI. This basically tracks the protocol overhead on the QPI link. One can get a good picture of the QPI-link characteristics by evaluating the protocol flits, data flits, and idle/null flits. This includes the header flits for data packets.", "ScaleUnit": "8Bytes", "UMask": "0x4", "Unit": "QPI LL" @@ -12,7 +12,7 @@ "BriefDescription": "Number of data flits transmitted . Derived from unc_q_txl_flits_g0.data", "EventName": "QPI_DATA_BANDWIDTH_TX", "PerPkg": "1", - "PublicDescription": "Counts the number of flits transmitted across the QPI Link. It includes filters for Idle, protocol, and Data Flits. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time (for L0) or 4B instead of 8B for L0p.; Number of data flits transmitted over QPI. Each flit contains 64b of data. This includes both DRS and NCB data flits (coherent and non-coherent). This can be used to calculate the data bandwidth of the QPI link. One can get a good picture of the QPI-link characteristics by evaluating the protocol flits, data flits, and idle/null flits. This does not include the header flits that go in data packets.", + "PublicDescription": "Counts the number of flits transmitted across the QPI Link. It includes filters for Idle, protocol, and Data Flits. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time (for L0) or 4B instead of 8B for L0p.; Number of data flits transmitted over QPI. Each flit contains 64b of data. This includes both DRS and NCB data flits (coherent and non-coherent). This can be used to calculate the data bandwidth of the QPI link. One can get a good picture of the QPI-link characteristics by evaluating the protocol flits, data flits, and idle/null flits. This does not include the header flits that go in data packets.", "ScaleUnit": "8Bytes", "UMask": "0x2", "Unit": "QPI LL" @@ -134,7 +134,7 @@ "EventCode": "0x9", "EventName": "UNC_Q_RxL_BYPASSED", "PerPkg": "1", - "PublicDescription": "Counts the number of times that an incoming flit was able to bypass the flit buffer and pass directly across the BGF and into the Egress. This is a latency optimization, and should generally be the common case. If this value is less than the number of flits transferred, it implies that there was queueing getting onto the ring, and thus the transactions saw higher latency.", + "PublicDescription": "Counts the number of times that an incoming flit was able to bypass the flit buffer and pass directly across the BGF and into the Egress. This is a latency optimization, and should generally be the common case. If this value is less than the number of flits transfered, it implies that there was queueing getting onto the ring, and thus the transactions saw higher latency.", "Unit": "QPI LL" }, { @@ -391,7 +391,7 @@ "EventCode": "0x1", "EventName": "UNC_Q_RxL_FLITS_G0.IDLE", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. It includes filters for Idle, protocol, and Data Flits. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time (for L0) or 4B instead of 8B for L0p.; Number of flits received over QPI that do not hold protocol payload. When QPI is not in a power saving state, it continuously transmits flits across the link. When there are no protocol flits to send, it will send IDLE and NULL flits across. These flits sometimes do carry a payload, such as credit returns, but are generall not considered part of the QPI bandwidth.", + "PublicDescription": "Counts the number of flits received from the QPI Link. It includes filters for Idle, protocol, and Data Flits. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time (for L0) or 4B instead of 8B for L0p.; Number of flits received over QPI that do not hold protocol payload. When QPI is not in a power saving state, it continuously transmits flits across the link. When there are no protocol flits to send, it will send IDLE and NULL flits across. These flits sometimes do carry a payload, such as credit returns, but are generall not considered part of the QPI bandwidth.", "UMask": "0x1", "Unit": "QPI LL" }, @@ -400,7 +400,7 @@ "EventCode": "0x2", "EventName": "UNC_Q_RxL_FLITS_G1.DRS", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three groups that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the total number of flits received over QPI on the DRS (Data Response) channel. DRS flits are used to transmit data with coherency. This does not count data flits received over the NCB channel which transmits non-coherent data.", + "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three groups that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the total number of flits received over QPI on the DRS (Data Response) channel. DRS flits are used to transmit data with coherency. This does not count data flits received over the NCB channel which transmits non-coherent data.", "UMask": "0x18", "Unit": "QPI LL" }, @@ -409,7 +409,7 @@ "EventCode": "0x2", "EventName": "UNC_Q_RxL_FLITS_G1.DRS_DATA", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three groups that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the total number of data flits received over QPI on the DRS (Data Response) channel. DRS flits are used to transmit data with coherency. This does not count data flits received over the NCB channel which transmits non-coherent data. This includes only the data flits (not the header).", + "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three groups that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the total number of data flits received over QPI on the DRS (Data Response) channel. DRS flits are used to transmit data with coherency. This does not count data flits received over the NCB channel which transmits non-coherent data. This includes only the data flits (not the header).", "UMask": "0x8", "Unit": "QPI LL" }, @@ -418,7 +418,7 @@ "EventCode": "0x2", "EventName": "UNC_Q_RxL_FLITS_G1.DRS_NONDATA", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three groups that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the total number of protocol flits received over QPI on the DRS (Data Response) channel. DRS flits are used to transmit data with coherency. This does not count data flits received over the NCB channel which transmits non-coherent data. This includes only the header flits (not the data). This includes extended headers.", + "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three groups that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the total number of protocol flits received over QPI on the DRS (Data Response) channel. DRS flits are used to transmit data with coherency. This does not count data flits received over the NCB channel which transmits non-coherent data. This includes only the header flits (not the data). This includes extended headers.", "UMask": "0x10", "Unit": "QPI LL" }, @@ -427,7 +427,7 @@ "EventCode": "0x2", "EventName": "UNC_Q_RxL_FLITS_G1.HOM", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three groups that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the number of flits received over QPI on the home channel.", + "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three groups that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the number of flits received over QPI on the home channel.", "UMask": "0x6", "Unit": "QPI LL" }, @@ -436,7 +436,7 @@ "EventCode": "0x2", "EventName": "UNC_Q_RxL_FLITS_G1.HOM_NONREQ", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three groups that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the number of non-request flits received over QPI on the home channel. These are most commonly snoop responses, and this event can be used as a proxy for that.", + "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three groups that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the number of non-request flits received over QPI on the home channel. These are most commonly snoop responses, and this event can be used as a proxy for that.", "UMask": "0x4", "Unit": "QPI LL" }, @@ -445,7 +445,7 @@ "EventCode": "0x2", "EventName": "UNC_Q_RxL_FLITS_G1.HOM_REQ", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three groups that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the number of data request received over QPI on the home channel. This basically counts the number of remote memory requests received over QPI. In conjunction with the local read count in the Home Agent, one can calculate the number of LLC Misses.", + "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three groups that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the number of data request received over QPI on the home channel. This basically counts the number of remote memory requests received over QPI. In conjunction with the local read count in the Home Agent, one can calculate the number of LLC Misses.", "UMask": "0x2", "Unit": "QPI LL" }, @@ -454,7 +454,7 @@ "EventCode": "0x2", "EventName": "UNC_Q_RxL_FLITS_G1.SNP", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three groups that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the number of snoop request flits received over QPI. These requests are contained in the snoop channel. This does not include snoop responses, which are received on the home channel.", + "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three groups that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the number of snoop request flits received over QPI. These requests are contained in the snoop channel. This does not include snoop responses, which are received on the home channel.", "UMask": "0x1", "Unit": "QPI LL" }, @@ -463,7 +463,7 @@ "EventCode": "0x3", "EventName": "UNC_Q_RxL_FLITS_G2.NCB", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three groups that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Number of Non-Coherent Bypass flits. These packets are generally used to transmit non-coherent data across QPI.", + "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three groups that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Number of Non-Coherent Bypass flits. These packets are generally used to transmit non-coherent data across QPI.", "UMask": "0xc", "Unit": "QPI LL" }, @@ -472,7 +472,7 @@ "EventCode": "0x3", "EventName": "UNC_Q_RxL_FLITS_G2.NCB_DATA", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three groups that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Number of Non-Coherent Bypass data flits. These flits are generally used to transmit non-coherent data across QPI. This does not include a count of the DRS (coherent) data flits. This only counts the data flits, not the NCB headers.", + "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three groups that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Number of Non-Coherent Bypass data flits. These flits are generally used to transmit non-coherent data across QPI. This does not include a count of the DRS (coherent) data flits. This only counts the data flits, not the NCB headers.", "UMask": "0x4", "Unit": "QPI LL" }, @@ -481,7 +481,7 @@ "EventCode": "0x3", "EventName": "UNC_Q_RxL_FLITS_G2.NCB_NONDATA", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three groups that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Number of Non-Coherent Bypass non-data flits. These packets are generally used to transmit non-coherent data across QPI, and the flits counted here are for headers and other non-data flits. This includes extended headers.", + "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three groups that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Number of Non-Coherent Bypass non-data flits. These packets are generally used to transmit non-coherent data across QPI, and the flits counted here are for headers and other non-data flits. This includes extended headers.", "UMask": "0x8", "Unit": "QPI LL" }, @@ -490,7 +490,7 @@ "EventCode": "0x3", "EventName": "UNC_Q_RxL_FLITS_G2.NCS", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three groups that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Number of NCS (non-coherent standard) flits received over QPI. This includes extended headers.", + "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three groups that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Number of NCS (non-coherent standard) flits received over QPI. This includes extended headers.", "UMask": "0x10", "Unit": "QPI LL" }, @@ -499,7 +499,7 @@ "EventCode": "0x3", "EventName": "UNC_Q_RxL_FLITS_G2.NDR_AD", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three groups that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the total number of flits received over the NDR (Non-Data Response) channel. This channel is used to send a variety of protocol flits including grants and completions. This is only for NDR packets to the local socket which use the AK ring.", + "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three groups that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the total number of flits received over the NDR (Non-Data Response) channel. This channel is used to send a variety of protocol flits including grants and completions. This is only for NDR packets to the local socket which use the AK ring.", "UMask": "0x1", "Unit": "QPI LL" }, @@ -508,7 +508,7 @@ "EventCode": "0x3", "EventName": "UNC_Q_RxL_FLITS_G2.NDR_AK", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three groups that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the total number of flits received over the NDR (Non-Data Response) channel. This channel is used to send a variety of protocol flits including grants and completions. This is only for NDR packets destined for Route-thru to a remote socket.", + "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three groups that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the total number of flits received over the NDR (Non-Data Response) channel. This channel is used to send a variety of protocol flits including grants and completions. This is only for NDR packets destined for Route-thru to a remote socket.", "UMask": "0x2", "Unit": "QPI LL" }, @@ -924,7 +924,7 @@ "BriefDescription": "Flits Transferred - Group 0; Data Tx Flits", "EventName": "UNC_Q_TxL_FLITS_G0.DATA", "PerPkg": "1", - "PublicDescription": "Counts the number of flits transmitted across the QPI Link. It includes filters for Idle, protocol, and Data Flits. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time (for L0) or 4B instead of 8B for L0p.; Number of data flits transmitted over QPI. Each flit contains 64b of data. This includes both DRS and NCB data flits (coherent and non-coherent). This can be used to calculate the data bandwidth of the QPI link. One can get a good picture of the QPI-link characteristics by evaluating the protocol flits, data flits, and idle/null flits. This does not include the header flits that go in data packets.", + "PublicDescription": "Counts the number of flits transmitted across the QPI Link. It includes filters for Idle, protocol, and Data Flits. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time (for L0) or 4B instead of 8B for L0p.; Number of data flits transmitted over QPI. Each flit contains 64b of data. This includes both DRS and NCB data flits (coherent and non-coherent). This can be used to calculate the data bandwidth of the QPI link. One can get a good picture of the QPI-link characteristics by evaluating the protocol flits, data flits, and idle/null flits. This does not include the header flits that go in data packets.", "UMask": "0x2", "Unit": "QPI LL" }, @@ -932,7 +932,7 @@ "BriefDescription": "Flits Transferred - Group 0; Non-Data protocol Tx Flits", "EventName": "UNC_Q_TxL_FLITS_G0.NON_DATA", "PerPkg": "1", - "PublicDescription": "Counts the number of flits transmitted across the QPI Link. It includes filters for Idle, protocol, and Data Flits. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time (for L0) or 4B instead of 8B for L0p.; Number of non-NULL non-data flits transmitted across QPI. This basically tracks the protocol overhead on the QPI link. One can get a good picture of the QPI-link characteristics by evaluating the protocol flits, data flits, and idle/null flits. This includes the header flits for data packets.", + "PublicDescription": "Counts the number of flits transmitted across the QPI Link. It includes filters for Idle, protocol, and Data Flits. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time (for L0) or 4B instead of 8B for L0p.; Number of non-NULL non-data flits transmitted across QPI. This basically tracks the protocol overhead on the QPI link. One can get a good picture of the QPI-link characteristics by evaluating the protocol flits, data flits, and idle/null flits. This includes the header flits for data packets.", "UMask": "0x4", "Unit": "QPI LL" }, @@ -940,7 +940,7 @@ "BriefDescription": "Flits Transferred - Group 1; DRS Flits (both Header and Data)", "EventName": "UNC_Q_TxL_FLITS_G1.DRS", "PerPkg": "1", - "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three groups that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the total number of flits transmitted over QPI on the DRS (Data Response) channel. DRS flits are used to transmit data with coherency.", + "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three groups that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the total number of flits transmitted over QPI on the DRS (Data Response) channel. DRS flits are used to transmit data with coherency.", "UMask": "0x18", "Unit": "QPI LL" }, @@ -948,7 +948,7 @@ "BriefDescription": "Flits Transferred - Group 1; DRS Data Flits", "EventName": "UNC_Q_TxL_FLITS_G1.DRS_DATA", "PerPkg": "1", - "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three groups that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the total number of data flits transmitted over QPI on the DRS (Data Response) channel. DRS flits are used to transmit data with coherency. This does not count data flits transmitted over the NCB channel which transmits non-coherent data. This includes only the data flits (not the header).", + "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three groups that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the total number of data flits transmitted over QPI on the DRS (Data Response) channel. DRS flits are used to transmit data with coherency. This does not count data flits transmitted over the NCB channel which transmits non-coherent data. This includes only the data flits (not the header).", "UMask": "0x8", "Unit": "QPI LL" }, @@ -956,7 +956,7 @@ "BriefDescription": "Flits Transferred - Group 1; DRS Header Flits", "EventName": "UNC_Q_TxL_FLITS_G1.DRS_NONDATA", "PerPkg": "1", - "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three groups that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the total number of protocol flits transmitted over QPI on the DRS (Data Response) channel. DRS flits are used to transmit data with coherency. This does not count data flits transmitted over the NCB channel which transmits non-coherent data. This includes only the header flits (not the data). This includes extended headers.", + "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three groups that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the total number of protocol flits transmitted over QPI on the DRS (Data Response) channel. DRS flits are used to transmit data with coherency. This does not count data flits transmitted over the NCB channel which transmits non-coherent data. This includes only the header flits (not the data). This includes extended headers.", "UMask": "0x10", "Unit": "QPI LL" }, @@ -964,7 +964,7 @@ "BriefDescription": "Flits Transferred - Group 1; HOM Flits", "EventName": "UNC_Q_TxL_FLITS_G1.HOM", "PerPkg": "1", - "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three groups that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the number of flits transmitted over QPI on the home channel.", + "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three groups that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the number of flits transmitted over QPI on the home channel.", "UMask": "0x6", "Unit": "QPI LL" }, @@ -972,7 +972,7 @@ "BriefDescription": "Flits Transferred - Group 1; HOM Non-Request Flits", "EventName": "UNC_Q_TxL_FLITS_G1.HOM_NONREQ", "PerPkg": "1", - "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three groups that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the number of non-request flits transmitted over QPI on the home channel. These are most commonly snoop responses, and this event can be used as a proxy for that.", + "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three groups that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the number of non-request flits transmitted over QPI on the home channel. These are most commonly snoop responses, and this event can be used as a proxy for that.", "UMask": "0x4", "Unit": "QPI LL" }, @@ -980,7 +980,7 @@ "BriefDescription": "Flits Transferred - Group 1; HOM Request Flits", "EventName": "UNC_Q_TxL_FLITS_G1.HOM_REQ", "PerPkg": "1", - "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three groups that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the number of data request transmitted over QPI on the home channel. This basically counts the number of remote memory requests transmitted over QPI. In conjunction with the local read count in the Home Agent, one can calculate the number of LLC Misses.", + "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three groups that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the number of data request transmitted over QPI on the home channel. This basically counts the number of remote memory requests transmitted over QPI. In conjunction with the local read count in the Home Agent, one can calculate the number of LLC Misses.", "UMask": "0x2", "Unit": "QPI LL" }, @@ -988,7 +988,7 @@ "BriefDescription": "Flits Transferred - Group 1; SNP Flits", "EventName": "UNC_Q_TxL_FLITS_G1.SNP", "PerPkg": "1", - "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three groups that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the number of snoop request flits transmitted over QPI. These requests are contained in the snoop channel. This does not include snoop responses, which are transmitted on the home channel.", + "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three groups that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the number of snoop request flits transmitted over QPI. These requests are contained in the snoop channel. This does not include snoop responses, which are transmitted on the home channel.", "UMask": "0x1", "Unit": "QPI LL" }, @@ -997,7 +997,7 @@ "EventCode": "0x1", "EventName": "UNC_Q_TxL_FLITS_G2.NCB", "PerPkg": "1", - "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three groups that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Number of Non-Coherent Bypass flits. These packets are generally used to transmit non-coherent data across QPI.", + "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three groups that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Number of Non-Coherent Bypass flits. These packets are generally used to transmit non-coherent data across QPI.", "UMask": "0xc", "Unit": "QPI LL" }, @@ -1006,7 +1006,7 @@ "EventCode": "0x1", "EventName": "UNC_Q_TxL_FLITS_G2.NCB_DATA", "PerPkg": "1", - "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three groups that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Number of Non-Coherent Bypass data flits. These flits are generally used to transmit non-coherent data across QPI. This does not include a count of the DRS (coherent) data flits. This only counts the data flits, not te NCB headers.", + "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three groups that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Number of Non-Coherent Bypass data flits. These flits are generally used to transmit non-coherent data across QPI. This does not include a count of the DRS (coherent) data flits. This only counts the data flits, not te NCB headers.", "UMask": "0x4", "Unit": "QPI LL" }, @@ -1015,7 +1015,7 @@ "EventCode": "0x1", "EventName": "UNC_Q_TxL_FLITS_G2.NCB_NONDATA", "PerPkg": "1", - "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three groups that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Number of Non-Coherent Bypass non-data flits. These packets are generally used to transmit non-coherent data across QPI, and the flits counted here are for headers and other non-data flits. This includes extended headers.", + "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three groups that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Number of Non-Coherent Bypass non-data flits. These packets are generally used to transmit non-coherent data across QPI, and the flits counted here are for headers and other non-data flits. This includes extended headers.", "UMask": "0x8", "Unit": "QPI LL" }, @@ -1024,7 +1024,7 @@ "EventCode": "0x1", "EventName": "UNC_Q_TxL_FLITS_G2.NCS", "PerPkg": "1", - "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three groups that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Number of NCS (non-coherent standard) flits transmitted over QPI. This includes extended headers.", + "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three groups that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Number of NCS (non-coherent standard) flits transmitted over QPI. This includes extended headers.", "UMask": "0x10", "Unit": "QPI LL" }, @@ -1033,7 +1033,7 @@ "EventCode": "0x1", "EventName": "UNC_Q_TxL_FLITS_G2.NDR_AD", "PerPkg": "1", - "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three groups that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the total number of flits transmitted over the NDR (Non-Data Response) channel. This channel is used to send a variety of protocol flits including grants and completions. This is only for NDR packets to the local socket which use the AK ring.", + "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three groups that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the total number of flits transmitted over the NDR (Non-Data Response) channel. This channel is used to send a variety of protocol flits including grants and completions. This is only for NDR packets to the local socket which use the AK ring.", "UMask": "0x1", "Unit": "QPI LL" }, @@ -1042,7 +1042,7 @@ "EventCode": "0x1", "EventName": "UNC_Q_TxL_FLITS_G2.NDR_AK", "PerPkg": "1", - "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three groups that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the total number of flits transmitted over the NDR (Non-Data Response) channel. This channel is used to send a variety of protocol flits including grants and completions. This is only for NDR packets destined for Route-thru to a remote socket.", + "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three groups that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time.; Counts the total number of flits transmitted over the NDR (Non-Data Response) channel. This channel is used to send a variety of protocol flits including grants and completions. This is only for NDR packets destined for Route-thru to a remote socket.", "UMask": "0x2", "Unit": "QPI LL" }, diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/uncore-other.json b/tools/perf/pmu-events/arch/x86/broadwellx/uncore-other.json index 495e34ee5bfb..a80d931dc3d5 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/uncore-other.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/uncore-other.json @@ -2312,7 +2312,7 @@ "EventCode": "0x33", "EventName": "UNC_R3_VNA_CREDITS_ACQUIRED.AD", "PerPkg": "1", - "PublicDescription": "Number of QPI VNA Credit acquisitions. This event can be used in conjunction with the VNA In-Use Accumulator to calculate the average lifetime of a credit holder. VNA credits are used by all message classes in order to communicate across QPI. If a packet is unable to acquire credits, it will then attempt to use credts from the VN0 pool. Note that a single packet may require multiple flit buffers (i.e. when data is being transferred). Therefore, this event will increment by the number of credits acquired in each cycle. Filtering based on message class is not provided. One can count the number of packets transferred in a given message class using an qfclk event.; Filter for the Home (HOM) message class. HOM is generally used to send requests, request responses, and snoop responses.", + "PublicDescription": "Number of QPI VNA Credit acquisitions. This event can be used in conjunction with the VNA In-Use Accumulator to calculate the average lifetime of a credit holder. VNA credits are used by all message classes in order to communicate across QPI. If a packet is unable to acquire credits, it will then attempt to use credts from the VN0 pool. Note that a single packet may require multiple flit buffers (i.e. when data is being transfered). Therefore, this event will increment by the number of credits acquired in each cycle. Filtering based on message class is not provided. One can count the number of packets transfered in a given message class using an qfclk event.; Filter for the Home (HOM) message class. HOM is generally used to send requests, request responses, and snoop responses.", "UMask": "0x1", "Unit": "R3QPI" }, @@ -2321,7 +2321,7 @@ "EventCode": "0x33", "EventName": "UNC_R3_VNA_CREDITS_ACQUIRED.BL", "PerPkg": "1", - "PublicDescription": "Number of QPI VNA Credit acquisitions. This event can be used in conjunction with the VNA In-Use Accumulator to calculate the average lifetime of a credit holder. VNA credits are used by all message classes in order to communicate across QPI. If a packet is unable to acquire credits, it will then attempt to use credts from the VN0 pool. Note that a single packet may require multiple flit buffers (i.e. when data is being transferred). Therefore, this event will increment by the number of credits acquired in each cycle. Filtering based on message class is not provided. One can count the number of packets transferred in a given message class using an qfclk event.; Filter for the Home (HOM) message class. HOM is generally used to send requests, request responses, and snoop responses.", + "PublicDescription": "Number of QPI VNA Credit acquisitions. This event can be used in conjunction with the VNA In-Use Accumulator to calculate the average lifetime of a credit holder. VNA credits are used by all message classes in order to communicate across QPI. If a packet is unable to acquire credits, it will then attempt to use credts from the VN0 pool. Note that a single packet may require multiple flit buffers (i.e. when data is being transfered). Therefore, this event will increment by the number of credits acquired in each cycle. Filtering based on message class is not provided. One can count the number of packets transfered in a given message class using an qfclk event.; Filter for the Home (HOM) message class. HOM is generally used to send requests, request responses, and snoop responses.", "UMask": "0x4", "Unit": "R3QPI" }, From patchwork Sun Feb 19 09:28:12 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59105 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp772470wrn; Sun, 19 Feb 2023 01:33:12 -0800 (PST) X-Google-Smtp-Source: AK7set8j5qVLrdhDBR/0IlPSOHnybIJB6YwjrOeT+y3TYfudMvCNux35w3b+vPsQ6M7fMZNygNTf X-Received: by 2002:a05:6a20:1585:b0:be:a604:c683 with SMTP id h5-20020a056a20158500b000bea604c683mr7622266pzj.45.1676799192441; Sun, 19 Feb 2023 01:33:12 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799192; cv=none; d=google.com; s=arc-20160816; b=iVz+l86kcv0y9ruD/lAG+o/4BAn2e/xBbPBQEaj5BAkeCh8MnlMbReiUjHdrRqpIA6 FLtBmzX/GitJbTdHoh3/TDpxwVLO/wcxCujg7hoPIQ3yqDI/I3/pC1ovYISb4ndaYwoG 9xvbgvjX3pGHL3eMpPnrShgtctOo+Zf7Aivwa9aQne4wh1XGSOsx2PE9zRllVmh7ldl0 oYh4zLLOyvdVjXHBx/T2Uryhx/PvB/kN+W7D0QVlhYM15zMbTgbUR1+Y3kskEMnYcBMF A1K8fnK8pDbf1GBUhLulP/igEG8yspFsLTZ8OKl152ZBbOKpnLYtaJgVHiXjYli9GiyA Lxnw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:from:subject :references:mime-version:message-id:in-reply-to:date:dkim-signature; bh=GwlLCukzT6GbHSyONT3+T7aL0ArA6deTd4KeYIE1IWM=; b=uRHuJD/zNpyi5CE3xoLBxCKuSvLKpFelFqWzaLwji2yQTbgDA8rWILfUXgovW2/jp9 UzpZtXmxrm5cxl1mOUGdUj129eZcxQnFIzNe7buTgp2B0kN5l9iRFq28AFkoICleNJKu aIjwV96FQqEzQCl3DC/aIo0GqZVF028feQ8JaWYRL6XJMVnFhqPM5FqK6NX/swxZqrhd fJie8CQAt1NyyAYqHbaJN0nyUc2SzwZieKDvvaS3zqE9Py+j5fWbiFx1jLBzLc/Rg5Qh rT+CrGxTVFAqyBrL6uAd5W6ubHCk0/oLaaz6NX174t5YLDJpxr6fPqV3A1JenV55tIF/ Q30w== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=QCefFBie; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id i3-20020a636d03000000b004fb8d2b25d2si10894740pgc.827.2023.02.19.01.32.59; Sun, 19 Feb 2023 01:33:12 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=QCefFBie; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229917AbjBSJb4 (ORCPT + 99 others); Sun, 19 Feb 2023 04:31:56 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53214 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229907AbjBSJbn (ORCPT ); Sun, 19 Feb 2023 04:31:43 -0500 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B63BC126E3 for ; Sun, 19 Feb 2023 01:31:07 -0800 (PST) Received: by mail-yb1-xb4a.google.com with SMTP id a34-20020a25a1a5000000b0092aabd4fa90so2168436ybi.18 for ; Sun, 19 Feb 2023 01:31:07 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:from:to:cc:subject:date :message-id:reply-to; bh=GwlLCukzT6GbHSyONT3+T7aL0ArA6deTd4KeYIE1IWM=; b=QCefFBiefJ8mMMIoBJW5NW2xO3ly5g+vB3A/Czk1u8PYC5a6dmPEP9IFC/fKPfEWBp VzUjQGu7upG3cyhsQNgm9JwVx4D2Mzu/LWTFfKleRABhs5cZwBNoE4xDTl5MJ/BQ9TcJ NJEpXE09uBFTbQdNly/C3nou3HH4UE2/09Nb5Xvkw/np08fA8h4NCmMlWEPeYLG3DFVB qHNsszNwU3AiMBkQsXTnp0MovjqvuvcWLZkUy/cd0wttBdEzbCkclBFTMOcdAkoUNbmU +4fKQN1O3Uy+dSLlfE/6v6ONu2uKbaU++lt9Uux9/u/TZ887npe3KiZLNpTnV/kQl4CP BuRg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=GwlLCukzT6GbHSyONT3+T7aL0ArA6deTd4KeYIE1IWM=; b=gFeRWJgN/KT1pPEegIEIIfd5z7LaE4yrvn6PTGmKMg9FOpQks8Y9b0WWKIn78HpYJ+ qiK9xEyfZrTnC1WtlgEuDmrc22esN8VGCbJT4qi0UrOvHH/baxfQl0gxfgEGnfm+79EQ y/m4Xw34HUKuRQKBChNl16Pb/RzO/TG1FrhWSdig7/rHgIkJlHwvnrEOlaXW8gEBrGvG EUzlInKnnVIMI+Brbc36/za4FyjxEU7NmrKUz3XMJGBG95XCNfFSghaCNMkHOk0AuuDf Ql5yPDSQC3sWx7aYFvrHofnnM0fKz0QnoKZhrcwlpy9fu+cn2ynAhlippZKqKVsWaFhH 8B/w== X-Gm-Message-State: AO0yUKWHw8HYNebv8sNOYLQ1aqRyDXavKWVu+HpZwrLtAGW1WRZSzlCp QFTCNnstQR6sDJRwPIdSJXKdho7Ehvs9 X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a05:6902:46:b0:8da:d0ab:258a with SMTP id m6-20020a056902004600b008dad0ab258amr21007ybh.5.1676799067361; Sun, 19 Feb 2023 01:31:07 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:12 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-16-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 15/51] perf vendor events intel: Refresh cascadelakex events From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758251389769426141?= X-GMAIL-MSGID: =?utf-8?q?1758251389769426141?= Update the cascadelakex events from 1.16 to 1.17. Generation was done using https://github.com/intel/perfmon. Notable changes are new events and event descriptions, TMA metrics are updated to version 4.5, TMA info metrics are renamed from their node name to be lower case and prefixed by tma_info_, MetricThreshold expressions are added, "Sample with" documentation is added to many TMA metrics, smi_cost and transaction metric groups are added replicating existing hard coded metrics in stat-shadow. Signed-off-by: Ian Rogers --- .../arch/x86/cascadelakex/cache.json | 24 +- .../arch/x86/cascadelakex/clx-metrics.json | 2198 +++++++++-------- .../arch/x86/cascadelakex/frontend.json | 8 +- .../arch/x86/cascadelakex/pipeline.json | 16 + .../arch/x86/cascadelakex/uncore-memory.json | 18 +- .../arch/x86/cascadelakex/uncore-other.json | 120 +- .../arch/x86/cascadelakex/uncore-power.json | 8 +- tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- 8 files changed, 1236 insertions(+), 1158 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/cache.json b/tools/perf/pmu-events/arch/x86/cascadelakex/cache.json index 1070ad317ec9..a842f05cb60d 100644 --- a/tools/perf/pmu-events/arch/x86/cascadelakex/cache.json +++ b/tools/perf/pmu-events/arch/x86/cascadelakex/cache.json @@ -234,20 +234,22 @@ "UMask": "0x4f" }, { - "BriefDescription": "All retired load instructions.", + "BriefDescription": "Retired load instructions.", "Data_LA": "1", "EventCode": "0xD0", "EventName": "MEM_INST_RETIRED.ALL_LOADS", "PEBS": "1", + "PublicDescription": "Counts all retired load instructions. This event accounts for SW prefetch instructions of PREFETCHNTA or PREFETCHT0/1/2 or PREFETCHW.", "SampleAfterValue": "2000003", "UMask": "0x81" }, { - "BriefDescription": "All retired store instructions.", + "BriefDescription": "Retired store instructions.", "Data_LA": "1", "EventCode": "0xD0", "EventName": "MEM_INST_RETIRED.ALL_STORES", "PEBS": "1", + "PublicDescription": "Counts all retired store instructions.", "SampleAfterValue": "2000003", "UMask": "0x82" }, @@ -388,12 +390,12 @@ "UMask": "0x4" }, { - "BriefDescription": "Retired load instructions with remote Intel(R) Optane(TM) DC persistent memory as the data source where the data request missed all caches. Precise event.", + "BriefDescription": "Retired load instructions with remote Intel(R) Optane(TM) DC persistent memory as the data source where the data request missed all caches.", "Data_LA": "1", "EventCode": "0xD3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM", "PEBS": "1", - "PublicDescription": "Counts retired load instructions with remote Intel(R) Optane(TM) DC persistent memory as the data source and the data request missed L3 (AppDirect or Memory Mode) and DRAM cache(Memory Mode). Precise event", + "PublicDescription": "Counts retired load instructions with remote Intel(R) Optane(TM) DC persistent memory as the data source and the data request missed L3 (AppDirect or Memory Mode) and DRAM cache(Memory Mode).", "SampleAfterValue": "100007", "UMask": "0x10" }, @@ -477,12 +479,12 @@ "UMask": "0x20" }, { - "BriefDescription": "Retired load instructions with local Intel(R) Optane(TM) DC persistent memory as the data source where the data request missed all caches. Precise event.", + "BriefDescription": "Retired load instructions with local Intel(R) Optane(TM) DC persistent memory as the data source where the data request missed all caches.", "Data_LA": "1", "EventCode": "0xD1", "EventName": "MEM_LOAD_RETIRED.LOCAL_PMM", "PEBS": "1", - "PublicDescription": "Counts retired load instructions with local Intel(R) Optane(TM) DC persistent memory as the data source and the data request missed L3 (AppDirect or Memory Mode) and DRAM cache(Memory Mode). Precise event", + "PublicDescription": "Counts retired load instructions with local Intel(R) Optane(TM) DC persistent memory as the data source and the data request missed L3 (AppDirect or Memory Mode) and DRAM cache(Memory Mode).", "SampleAfterValue": "100003", "UMask": "0x80" }, @@ -5039,7 +5041,7 @@ "UMask": "0x80" }, { - "BriefDescription": "Cacheable and noncachaeble code read requests", + "BriefDescription": "Cacheable and non-cacheable code read requests", "EventCode": "0xB0", "EventName": "OFFCORE_REQUESTS.DEMAND_CODE_RD", "PublicDescription": "Counts both cacheable and non-cacheable code read requests.", @@ -5146,14 +5148,6 @@ "SampleAfterValue": "2000003", "UMask": "0x4" }, - { - "BriefDescription": "Offcore response can be programmed only with a specific pair of event select and counter MSR, and with specific event codes and predefine mask bit value in a dedicated MSR to specify attributes of the offcore transaction", - "EventCode": "0xB7, 0xBB", - "EventName": "OFFCORE_RESPONSE", - "PublicDescription": "Offcore response can be programmed only with a specific pair of event select and counter MSR, and with specific event codes and predefine mask bit value in a dedicated MSR to specify attributes of the offcore transaction.", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, { "BriefDescription": "This event is deprecated. Refer to new event OCR.ALL_DATA_RD.ANY_RESPONSE", "Deprecated": "1", diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json b/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json index 356cf6603b69..4e993a3220e3 100644 --- a/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json +++ b/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json @@ -1,1548 +1,1608 @@ [ { - "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks", - "MetricExpr": "100 * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", - "MetricGroup": "Bad;BadSpec;BrMispredicts", - "MetricName": "Mispredictions" - }, - { - "BriefDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks", - "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full))) + tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk))", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "Memory_Bandwidth" - }, - { - "BriefDescription": "Total pipeline cost of Memory Latency related bottlenecks (external memory and off-core caches)", - "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound))", - "MetricGroup": "Mem;MemoryLat;Offcore", - "MetricName": "Memory_Latency" - }, - { - "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)", - "MetricExpr": "100 * tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency)))", - "MetricGroup": "Mem;MemoryTLB;Offcore", - "MetricName": "Memory_Data_TLBs" - }, - { - "BriefDescription": "Total pipeline cost of branch related instructions (used for program control-flow including function calls)", - "MetricExpr": "100 * ((BR_INST_RETIRED.CONDITIONAL + 3 * BR_INST_RETIRED.NEAR_CALL + (BR_INST_RETIRED.NEAR_TAKEN - (BR_INST_RETIRED.CONDITIONAL - BR_INST_RETIRED.NOT_TAKEN) - 2 * BR_INST_RETIRED.NEAR_CALL)) / SLOTS)", - "MetricGroup": "Ret", - "MetricName": "Branching_Overhead" - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", - "MetricGroup": "BigFoot;Fed;Frontend;IcMiss;MemoryTLB", - "MetricName": "Big_Code" - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_frontend_bound - tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) - Big_Code", - "MetricGroup": "Fed;FetchBW;Frontend", - "MetricName": "Instruction_Fetch_BW" - }, - { - "BriefDescription": "Instructions Per Cycle (per Logical Processor)", - "MetricExpr": "INST_RETIRED.ANY / CLKS", - "MetricGroup": "Ret;Summary", - "MetricName": "IPC" - }, - { - "BriefDescription": "Uops Per Instruction", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY", - "MetricGroup": "Pipeline;Ret;Retire", - "MetricName": "UPI" - }, - { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW", - "MetricName": "UpTB" - }, - { - "BriefDescription": "Cycles Per Instruction (per Logical Processor)", - "MetricExpr": "1 / IPC", - "MetricGroup": "Mem;Pipeline", - "MetricName": "CPI" - }, - { - "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "Pipeline", - "MetricName": "CLKS" - }, - { - "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "4 * CORE_CLKS", - "MetricGroup": "tma_L1_group", - "MetricName": "SLOTS" - }, - { - "BriefDescription": "The ratio of Executed- by Issued-Uops", - "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", - "MetricGroup": "Cor;Pipeline", - "MetricName": "Execute_per_Issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage." - }, - { - "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", - "MetricExpr": "INST_RETIRED.ANY / CORE_CLKS", - "MetricGroup": "Ret;SMT;tma_L1_group", - "MetricName": "CoreIPC" - }, - { - "BriefDescription": "Floating Point Operations Per Cycle", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE) + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / CORE_CLKS", - "MetricGroup": "Flops;Ret", - "MetricName": "FLOPc" - }, - { - "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)) / (2 * CORE_CLKS)", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "FP_Arith_Utilization", - "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)." - }, - { - "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else UOPS_EXECUTED.CORE_CYCLES_GE_1)", - "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", - "MetricName": "ILP" - }, - { - "BriefDescription": "Probability of Core Bound bottleneck hidden by SMT-profiling artifacts", - "MetricExpr": "((1 - tma_core_bound / tma_ports_utilization if tma_core_bound < tma_ports_utilization else 1) if SMT_2T_Utilization > 0.5 else 0)", - "MetricGroup": "Cor;SMT", - "MetricName": "Core_Bound_Likely" - }, - { - "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else CLKS))", - "MetricGroup": "SMT", - "MetricName": "CORE_CLKS" - }, - { - "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", - "MetricGroup": "InsType", - "MetricName": "IpLoad" - }, - { - "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", - "MetricGroup": "InsType", - "MetricName": "IpStore" - }, - { - "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Branches;Fed;InsType", - "MetricName": "IpBranch" + "BriefDescription": "C2 residency percent per package", + "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C2_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "IpCall" + "BriefDescription": "C3 residency percent per core", + "MetricExpr": "cstate_core@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Core_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO", - "MetricName": "IpTB" + "BriefDescription": "C3 residency percent per package", + "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Branch instructions per taken branch. ", - "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "BpTkBranch" + "BriefDescription": "C6 residency percent per core", + "MetricExpr": "cstate_core@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Core_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE) + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)", - "MetricGroup": "Flops;InsType", - "MetricName": "IpFLOP" + "BriefDescription": "C6 residency percent per package", + "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE))", - "MetricGroup": "Flops;InsType", - "MetricName": "IpArith", - "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). May undercount due to FMA double counting. Approximated prior to BDW." + "BriefDescription": "C7 residency percent per core", + "MetricExpr": "cstate_core@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Core_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE", - "MetricGroup": "Flops;FpScalar;InsType", - "MetricName": "IpArith_Scalar_SP", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "C7 residency percent per package", + "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE", - "MetricGroup": "Flops;FpScalar;InsType", - "MetricName": "IpArith_Scalar_DP", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "Uncore frequency per die [GHZ]", + "MetricExpr": "tma_info_socket_clks / #num_dies / duration_time / 1e9", + "MetricGroup": "SoC", + "MetricName": "UNCORE_FREQ" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX128", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "Percentage of cycles spent in System Management Interrupts.", + "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)", + "MetricGroup": "smi", + "MetricName": "smi_cycles", + "MetricThreshold": "smi_cycles > 0.1", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX256", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "Number of SMI interrupts.", + "MetricExpr": "msr@smi@", + "MetricGroup": "smi", + "MetricName": "smi_num", + "ScaleUnit": "1SMI#" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX512", - "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset", + "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_4k_aliasing", + "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Software prefetch instruction (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umask\\=0xF@", - "MetricGroup": "Prefetches", - "MetricName": "IpSWPF" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", + "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_info_slots", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_alu_op_utilization", + "MetricThreshold": "tma_alu_op_utilization > 0.6", + "ScaleUnit": "100%" }, { - "BriefDescription": "Total number of retired Instructions Sample with: INST_RETIRED.PREC_DIST", - "MetricExpr": "INST_RETIRED.ANY", - "MetricGroup": "Summary;tma_L1_group", - "MetricName": "Instructions" + "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", + "MetricExpr": "100 * (FP_ASSIST.ANY + OTHER_ASSISTS.ANY) / tma_info_slots", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_assists", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.ANY", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=1@", - "MetricGroup": "Pipeline;Ret", - "MetricName": "Retire" + "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", + "MetricExpr": "1 - tma_frontend_bound - (UOPS_ISSUED.ANY + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_backend_bound", + "MetricThreshold": "tma_backend_bound > 0.2", + "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound.", + "ScaleUnit": "100%" }, { - "BriefDescription": "", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@", - "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", - "MetricName": "Execute" + "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", + "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_bad_speculation", + "MetricThreshold": "tma_bad_speculation > 0.15", + "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average number of Uops issued by front-end when it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=1@", - "MetricGroup": "Fed;FetchBW", - "MetricName": "Fetch_UpC" + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation", + "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM", + "MetricName": "tma_branch_mispredicts", + "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_info_branch_misprediction_cost, tma_info_mispredictions, tma_mispredicts_resteers", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", - "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", - "MetricGroup": "DSB;Fed;FetchBW", - "MetricName": "DSB_Coverage" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", + "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_clks + tma_unknown_branches", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_branch_resteers", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details.", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / DSB2MITE_SWITCHES.COUNT", - "MetricGroup": "DSBmiss", - "MetricName": "DSB_Switch_Cost" + "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", + "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_cisc", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Total penalty related to DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck.", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tma_mite))", - "MetricGroup": "DSBmiss;Fed", - "MetricName": "DSB_Misses" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears", + "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_clks", + "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC", + "MetricName": "tma_clears_resteers", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches", + "ScaleUnit": "100%" }, { - "BriefDescription": "Number of Instructions per non-speculative DSB miss (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", - "MetricGroup": "DSBmiss;Fed", - "MetricName": "IpDSB_Miss_Ret" + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(44 * tma_info_average_frequency * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE / (OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE + OCR.DEMAND_DATA_RD.L3_HIT.HIT_OTHER_CORE_FWD))) + 44 * tma_info_average_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_contested_accesses", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BadSpec;BrMispredicts", - "MetricName": "IpMispredict" + "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_backend_bound - tma_memory_bound", + "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", + "ScaleUnit": "100%" }, { - "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)", - "MetricExpr": "(tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) * SLOTS / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BrMispredicts", - "MetricName": "Branch_Misprediction_Cost" + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "44 * tma_info_average_frequency * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (1 - OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE / (OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE + OCR.DEMAND_DATA_RD.L3_HIT.HIT_OTHER_CORE_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_data_sharing", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of branches that are non-taken conditionals", - "MetricExpr": "BR_INST_RETIRED.NOT_TAKEN / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches;CodeGen;PGO", - "MetricName": "Cond_NT" + "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=2@) / tma_info_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0;tma_mite_group", + "MetricName": "tma_decoder0_alone", + "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35))", + "PublicDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder. Related metrics: tma_few_uops_instructions", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of branches that are taken conditionals", - "MetricExpr": "(BR_INST_RETIRED.CONDITIONAL - BR_INST_RETIRED.NOT_TAKEN) / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches;CodeGen;PGO", - "MetricName": "Cond_TK" + "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", + "MetricExpr": "ARITH.DIVIDER_ACTIVE / tma_info_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_divider", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_ACTIVE", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of branches that are CALL or RET", - "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_RETURN) / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches", - "MetricName": "CallRet" + "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_clks + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_clks - tma_l2_bound - tma_pmm_bound", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of branches that are unconditional (direct or indirect) jumps", - "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - (BR_INST_RETIRED.CONDITIONAL - BR_INST_RETIRED.NOT_TAKEN) - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches", - "MetricName": "Jump" + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", + "MetricExpr": "(IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4_UOPS) / tma_info_core_clks / 2", + "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_dsb", + "MetricThreshold": "tma_dsb > 0.15 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", - "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_RETIRED.L1_MISS + MEM_LOAD_RETIRED.FB_HIT)", - "MetricGroup": "Mem;MemoryBound;MemoryLat", - "MetricName": "Load_Miss_Real_Latency" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_clks", + "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_dsb_switches", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_dsb_misses, tma_info_iptb, tma_lcp", + "ScaleUnit": "100%" }, { - "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)", - "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES", - "MetricGroup": "Mem;MemoryBW;MemoryBound", - "MetricName": "MLP" + "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "min(9 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group", + "MetricName": "tma_dtlb_load", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store, tma_info_memory_data_tlbs", + "ScaleUnit": "100%" }, { - "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L1MPKI" + "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", + "MetricExpr": "(9 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group", + "MetricName": "tma_dtlb_store", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, tma_info_memory_data_tlbs", + "ScaleUnit": "100%" }, { - "BriefDescription": "L1 cache true misses per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L1MPKI_Load" + "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(110 * tma_info_average_frequency * (OCR.DEMAND_RFO.L3_MISS.REMOTE_HITM + OCR.PF_L2_RFO.L3_MISS.REMOTE_HITM) + 47.5 * tma_info_average_frequency * (OCR.DEMAND_RFO.L3_HIT.HITM_OTHER_CORE + OCR.PF_L2_RFO.L3_HIT.HITM_OTHER_CORE)) / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group", + "MetricName": "tma_false_sharing", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", - "MetricGroup": "Backend;CacheMisses;Mem", - "MetricName": "L2MPKI" + "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "tma_info_load_miss_real_latency * cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=1@ / tma_info_clks", + "MetricGroup": "MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricName": "tma_fb_full", + "MetricThreshold": "tma_fb_full > 0.3", + "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_dram_bw_use, tma_info_memory_bandwidth, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all request types (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem;Offcore", - "MetricName": "L2MPKI_All" + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", + "MetricExpr": "tma_frontend_bound - tma_fetch_latency", + "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_dsb_coverage, tma_info_dsb_misses, tma_info_iptb, tma_lcp", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2MPKI_Load" + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", + "MetricExpr": "4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / tma_info_slots", + "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache hits per kilo instruction for all request types (including speculative)", - "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2HPKI_All" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops", + "MetricExpr": "tma_heavy_operations - tma_microcode_sequencer", + "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueD0", + "MetricName": "tma_few_uops_instructions", + "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache hits per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2HPKI_Load" + "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", + "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_fp_arith", + "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", + "ScaleUnit": "100%" }, { - "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L3MPKI" + "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired", + "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", + "MetricName": "tma_fp_scalar", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fill Buffer (FB) hits per kilo instructions for retired demand loads (L1D misses that merge into ongoing miss-handling entries)", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "FB_HPKI" + "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xfc@ / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", + "MetricName": "tma_fp_vector", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", - "MetricConstraint": "NO_NMI_WATCHDOG", - "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_PENDING + DTLB_STORE_MISSES.WALK_PENDING + EPT.WALK_PENDING) / (2 * CORE_CLKS)", - "MetricGroup": "Mem;MemoryTLB", - "MetricName": "Page_Walks_Utilization" + "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", + "MetricName": "tma_fp_vector_128b", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW" + "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", + "MetricName": "tma_fp_vector_256b", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW" + "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", + "MetricName": "tma_fp_vector_512b", + "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW" + "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", + "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / tma_info_slots", + "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data access bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "L3_Cache_Access_BW" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions", + "MetricExpr": "tma_light_operations * UOPS_RETIRED.MACRO_FUSED / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_fused_instructions", + "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions. The instruction pairs of CMP+JCC or DEC+JCC are commonly used examples.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Rate of silent evictions from the L2 cache per Kilo instruction where the evicted lines are dropped (no writeback to L3 or memory)", - "MetricExpr": "1e3 * L2_LINES_OUT.SILENT / Instructions", - "MetricGroup": "L2Evicts;Mem;Server", - "MetricName": "L2_Evictions_Silent_PKI" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences", + "MetricExpr": "(UOPS_RETIRED.RETIRE_SLOTS + UOPS_RETIRED.MACRO_FUSED - INST_RETIRED.ANY) / tma_info_slots", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_heavy_operations", + "MetricThreshold": "tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Rate of non silent evictions from the L2 cache per Kilo instruction", - "MetricExpr": "1e3 * L2_LINES_OUT.NON_SILENT / Instructions", - "MetricGroup": "L2Evicts;Mem;Server", - "MetricName": "L2_Evictions_NonSilent_PKI" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses", + "MetricExpr": "(ICACHE_16B.IFDATA_STALL + 2 * cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=1\\,edge@) / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_icache_misses", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "L1D_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW_1T" + "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", + "MetricExpr": "tma_info_turbo_utilization * TSC / 1e9 / duration_time", + "MetricGroup": "Power;Summary", + "MetricName": "tma_info_average_frequency" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "L2_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW_1T" + "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", + "MetricGroup": "BigFoot;Fed;Frontend;IcMiss;MemoryTLB;tma_issueBC", + "MetricName": "tma_info_big_code", + "MetricThreshold": "tma_info_big_code > 20", + "PublicDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses). Related metrics: tma_info_branching_overhead" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "L3_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW_1T" + "BriefDescription": "Branch instructions per taken branch.", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_bptkbranch" }, { - "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "L3_Cache_Access_BW", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "L3_Cache_Access_BW_1T" + "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)", + "MetricExpr": "(tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) * tma_info_slots / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_branch_misprediction_cost", + "PublicDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_branch_mispredicts, tma_info_mispredictions, tma_mispredicts_resteers" }, { - "BriefDescription": "Average CPU Utilization", - "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", - "MetricGroup": "HPC;Summary", - "MetricName": "CPU_Utilization" + "BriefDescription": "Total pipeline cost of branch related instructions (used for program control-flow including function calls)", + "MetricExpr": "100 * ((BR_INST_RETIRED.CONDITIONAL + 3 * BR_INST_RETIRED.NEAR_CALL + (BR_INST_RETIRED.NEAR_TAKEN - (BR_INST_RETIRED.CONDITIONAL - BR_INST_RETIRED.NOT_TAKEN) - 2 * BR_INST_RETIRED.NEAR_CALL)) / tma_info_slots)", + "MetricGroup": "Ret;tma_issueBC", + "MetricName": "tma_info_branching_overhead", + "MetricThreshold": "tma_info_branching_overhead > 10", + "PublicDescription": "Total pipeline cost of branch related instructions (used for program control-flow including function calls). Related metrics: tma_info_big_code" }, { - "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", - "MetricExpr": "Turbo_Utilization * TSC / 1e9 / duration_time", - "MetricGroup": "Power;Summary", - "MetricName": "Average_Frequency" + "BriefDescription": "Fraction of branches that are CALL or RET", + "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_RETURN) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_callret" }, { - "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE) + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1e9 / duration_time", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "GFLOPs", - "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine." + "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "Pipeline", + "MetricName": "tma_info_clks" }, { - "BriefDescription": "Average Frequency Utilization relative nominal frequency", - "MetricExpr": "CLKS / CPU_CLK_UNHALTED.REF_TSC", - "MetricGroup": "Power", - "MetricName": "Turbo_Utilization" + "BriefDescription": "STLB (2nd level TLB) code speculative misses per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", + "MetricGroup": "Fed;MemoryTLB", + "MetricName": "tma_info_code_stlb_mpki" }, { - "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0", - "MetricExpr": "(CORE_POWER.LVL0_TURBO_LICENSE / 2 / CORE_CLKS if #SMT_on else CORE_POWER.LVL0_TURBO_LICENSE / CORE_CLKS)", - "MetricGroup": "Power", - "MetricName": "Power_License0_Utilization", - "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0. This includes non-AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes." + "BriefDescription": "Fraction of branches that are non-taken conditionals", + "MetricExpr": "BR_INST_RETIRED.NOT_TAKEN / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_cond_nt" }, { - "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1", - "MetricExpr": "(CORE_POWER.LVL1_TURBO_LICENSE / 2 / CORE_CLKS if #SMT_on else CORE_POWER.LVL1_TURBO_LICENSE / CORE_CLKS)", - "MetricGroup": "Power", - "MetricName": "Power_License1_Utilization", - "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1. This includes high current AVX 256-bit instructions as well as low current AVX 512-bit instructions." + "BriefDescription": "Fraction of branches that are taken conditionals", + "MetricExpr": "(BR_INST_RETIRED.CONDITIONAL - BR_INST_RETIRED.NOT_TAKEN) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_cond_tk" }, { - "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX)", - "MetricExpr": "(CORE_POWER.LVL2_TURBO_LICENSE / 2 / CORE_CLKS if #SMT_on else CORE_POWER.LVL2_TURBO_LICENSE / CORE_CLKS)", - "MetricGroup": "Power", - "MetricName": "Power_License2_Utilization", - "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX). This includes high current AVX 512-bit instructions." + "BriefDescription": "Probability of Core Bound bottleneck hidden by SMT-profiling artifacts", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(100 * (1 - tma_core_bound / tma_ports_utilization if tma_core_bound < tma_ports_utilization else 1) if tma_info_smt_2t_utilization > 0.5 else 0)", + "MetricGroup": "Cor;SMT", + "MetricName": "tma_info_core_bound_likely", + "MetricThreshold": "tma_info_core_bound_likely > 0.5" }, { - "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", - "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", + "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_clks))", "MetricGroup": "SMT", - "MetricName": "SMT_2T_Utilization" + "MetricName": "tma_info_core_clks" }, { - "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "OS", - "MetricName": "Kernel_Utilization" + "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_core_clks", + "MetricGroup": "Ret;SMT;TmaL1;tma_L1_group", + "MetricName": "tma_info_coreipc" }, { - "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", - "MetricGroup": "OS", - "MetricName": "Kernel_CPI" + "BriefDescription": "Cycles Per Instruction (per Logical Processor)", + "MetricExpr": "1 / tma_info_ipc", + "MetricGroup": "Mem;Pipeline", + "MetricName": "tma_info_cpi" }, { - "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", - "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e9 / duration_time", - "MetricGroup": "HPC;Mem;MemoryBW;SoC", - "MetricName": "DRAM_BW_Use" + "BriefDescription": "Average CPU Utilization", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": "HPC;Summary", + "MetricName": "tma_info_cpu_utilization" }, { - "BriefDescription": "Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_INSERTS.IA_MISS_DRD) / (Socket_CLKS / duration_time)", - "MetricGroup": "Mem;MemoryLat;SoC", - "MetricName": "MEM_Read_Latency" + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD / OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_data_l2_mlp" }, { - "BriefDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches", - "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD@thresh\\=1@", - "MetricGroup": "Mem;MemoryBW;SoC", - "MetricName": "MEM_Parallel_Reads" + "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", + "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e9 / duration_time", + "MetricGroup": "HPC;Mem;MemoryBW;SoC;tma_issueBW", + "MetricName": "tma_info_dram_bw_use", + "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_memory_bandwidth, tma_mem_bandwidth, tma_sq_full" }, { - "BriefDescription": "Average latency of data read request to external 3D X-Point memory [in nanoseconds]. Accounts for demand loads and L1/L2 data-read prefetches", - "MetricExpr": "1e9 * (UNC_M_PMM_RPQ_OCCUPANCY.ALL / UNC_M_PMM_RPQ_INSERTS) / imc_0@event\\=0x0@", - "MetricGroup": "Mem;MemoryLat;Server;SoC", - "MetricName": "MEM_PMM_Read_Latency" + "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_dsb_coverage", + "MetricThreshold": "tma_info_dsb_coverage < 0.7 & tma_info_ipc / 4 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_misses, tma_info_iptb, tma_lcp" }, { - "BriefDescription": "Average latency of data read request to external DRAM memory [in nanoseconds]. Accounts for demand loads and L1/L2 data-read prefetches", - "MetricExpr": "1e9 * (UNC_M_RPQ_OCCUPANCY / UNC_M_RPQ_INSERTS) / imc_0@event\\=0x0@", - "MetricGroup": "Mem;MemoryLat;Server;SoC", - "MetricName": "MEM_DRAM_Read_Latency" + "BriefDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tma_mite))", + "MetricGroup": "DSBmiss;Fed;tma_issueFB", + "MetricName": "tma_info_dsb_misses", + "MetricThreshold": "tma_info_dsb_misses > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_iptb, tma_lcp" }, { - "BriefDescription": "Average 3DXP Memory Bandwidth Use for reads [GB / sec]", - "MetricExpr": "64 * UNC_M_PMM_RPQ_INSERTS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW;Server;SoC", - "MetricName": "PMM_Read_BW" + "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details.", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / DSB2MITE_SWITCHES.COUNT", + "MetricGroup": "DSBmiss", + "MetricName": "tma_info_dsb_switch_cost" }, { - "BriefDescription": "Average 3DXP Memory Bandwidth Use for Writes [GB / sec]", - "MetricExpr": "64 * UNC_M_PMM_WPQ_INSERTS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW;Server;SoC", - "MetricName": "PMM_Write_BW" + "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-thread", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@", + "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", + "MetricName": "tma_info_execute" }, { - "BriefDescription": "Average IO (network or disk) Bandwidth Use for Writes [GB / sec]", - "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART0 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART2 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART3) * 4 / 1e9 / duration_time", - "MetricGroup": "IoBW;Mem;Server;SoC", - "MetricName": "IO_Write_BW" + "BriefDescription": "The ratio of Executed- by Issued-Uops", + "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", + "MetricGroup": "Cor;Pipeline", + "MetricName": "tma_info_execute_per_issue", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage." }, { - "BriefDescription": "Average IO (network or disk) Bandwidth Use for Reads [GB / sec]", - "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART0 + UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART2 + UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART3) * 4 / 1e9 / duration_time", - "MetricGroup": "IoBW;Mem;Server;SoC", - "MetricName": "IO_Read_BW" + "BriefDescription": "Fill Buffer (FB) hits per kilo instructions for retired demand loads (L1D misses that merge into ongoing miss-handling entries)", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_fb_hpki" }, { - "BriefDescription": "Socket actual clocks when any core is active on that socket", - "MetricExpr": "cha_0@event\\=0x0@", - "MetricGroup": "SoC", - "MetricName": "Socket_CLKS" + "BriefDescription": "Average number of Uops issued by front-end when it issued something", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=1@", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_fetch_upc" }, { - "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", - "MetricGroup": "Branches;OS", - "MetricName": "IpFarBranch" + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE) + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / tma_info_core_clks", + "MetricGroup": "Flops;Ret", + "MetricName": "tma_info_flopc" }, { - "BriefDescription": "Uncore frequency per die [GHZ]", - "MetricExpr": "Socket_CLKS / #num_dies / duration_time / 1e9", - "MetricGroup": "SoC", - "MetricName": "UNCORE_FREQ" + "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xfc@) / (2 * tma_info_core_clks)", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_fp_arith_utilization", + "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)." }, { - "BriefDescription": "Percentage of time spent in the active CPU power state C0", - "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", - "MetricName": "cpu_utilization", - "ScaleUnit": "100%" + "BriefDescription": "Giga Floating Point Operations Per Second", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE) + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1e9 / duration_time", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_gflops", + "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine." }, { - "BriefDescription": "CPU operating frequency (in GHz)", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / duration_time", - "MetricName": "cpu_operating_frequency", - "ScaleUnit": "1GHz" + "BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", + "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", + "MetricName": "tma_info_ic_misses", + "MetricThreshold": "tma_info_ic_misses > 5", + "PublicDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck. Related metrics: " }, { - "BriefDescription": "Cycles per instruction retired; indicating how much time each executed instruction took; in units of cycles.", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY", - "MetricName": "cpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Average Latency for L1 instruction cache misses", + "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=1\\,edge@ + 2", + "MetricGroup": "Fed;FetchLat;IcMiss", + "MetricName": "tma_info_icache_miss_latency" }, { - "BriefDescription": "The ratio of number of completed memory load instructions to the total number completed instructions", - "MetricExpr": "MEM_INST_RETIRED.ALL_LOADS / INST_RETIRED.ANY", - "MetricName": "loads_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else UOPS_EXECUTED.CORE_CYCLES_GE_1)", + "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", + "MetricName": "tma_info_ilp" }, { - "BriefDescription": "The ratio of number of completed memory store instructions to the total number completed instructions", - "MetricExpr": "MEM_INST_RETIRED.ALL_STORES / INST_RETIRED.ANY", - "MetricName": "stores_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * (tma_frontend_bound - tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) - tma_info_big_code", + "MetricGroup": "Fed;FetchBW;Frontend", + "MetricName": "tma_info_instruction_fetch_bw", + "MetricThreshold": "tma_info_instruction_fetch_bw > 20" }, { - "BriefDescription": "Ratio of number of requests missing L1 data cache (includes data+rfo w/ prefetches) to the total number of completed instructions", - "MetricExpr": "L1D.REPLACEMENT / INST_RETIRED.ANY", - "MetricName": "l1d_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Summary;TmaL1;tma_L1_group", + "MetricName": "tma_info_instructions", + "PublicDescription": "Total number of retired Instructions. Sample with: INST_RETIRED.PREC_DIST" }, { - "BriefDescription": "Ratio of number of demand load requests hitting in L1 data cache to the total number of completed instructions ", - "MetricExpr": "MEM_LOAD_RETIRED.L1_HIT / INST_RETIRED.ANY", - "MetricName": "l1d_demand_data_read_hits_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Average IO (network or disk) Bandwidth Use for Reads [GB / sec]", + "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART0 + UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART2 + UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART3) * 4 / 1e9 / duration_time", + "MetricGroup": "IoBW;Mem;Server;SoC", + "MetricName": "tma_info_io_read_bw" }, { - "BriefDescription": "Ratio of number of code read requests missing in L1 instruction cache (includes prefetches) to the total number of completed instructions", - "MetricExpr": "L2_RQSTS.ALL_CODE_RD / INST_RETIRED.ANY", - "MetricName": "l1_i_code_read_misses_with_prefetches_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Average IO (network or disk) Bandwidth Use for Writes [GB / sec]", + "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART0 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART2 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART3) * 4 / 1e9 / duration_time", + "MetricGroup": "IoBW;Mem;Server;SoC", + "MetricName": "tma_info_io_write_bw" }, { - "BriefDescription": "Ratio of number of completed demand load requests hitting in L2 cache to the total number of completed instructions ", - "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT / INST_RETIRED.ANY", - "MetricName": "l2_demand_data_read_hits_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "INST_RETIRED.ANY / (cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xfc@)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_iparith", + "MetricThreshold": "tma_info_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). May undercount due to FMA double counting. Approximated prior to BDW." }, { - "BriefDescription": "Ratio of number of requests missing L2 cache (includes code+data+rfo w/ prefetches) to the total number of completed instructions", - "MetricExpr": "L2_LINES_IN.ALL / INST_RETIRED.ANY", - "MetricName": "l2_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx128", + "MetricThreshold": "tma_info_iparith_avx128 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "Ratio of number of completed data read request missing L2 cache to the total number of completed instructions", - "MetricExpr": "MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", - "MetricName": "l2_demand_data_read_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx256", + "MetricThreshold": "tma_info_iparith_avx256 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "Ratio of number of code read request missing L2 cache to the total number of completed instructions", - "MetricExpr": "L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", - "MetricName": "l2_demand_code_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx512", + "MetricThreshold": "tma_info_iparith_avx512 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "Ratio of number of data read requests missing last level core cache (includes demand w/ prefetches) to the total number of completed instructions", - "MetricExpr": "cha@unc_cha_tor_inserts.ia_miss\\,config1\\=0x12D4043300000000@ / INST_RETIRED.ANY", - "MetricName": "llc_data_read_mpi_demand_plus_prefetch", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_iparith_scalar_dp", + "MetricThreshold": "tma_info_iparith_scalar_dp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "Ratio of number of code read requests missing last level core cache (includes demand w/ prefetches) to the total number of completed instructions", - "MetricExpr": "cha@unc_cha_tor_inserts.ia_miss\\,config1\\=0x12CC023300000000@ / INST_RETIRED.ANY", - "MetricName": "llc_code_read_mpi_demand_plus_prefetch", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_iparith_scalar_sp", + "MetricThreshold": "tma_info_iparith_scalar_sp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "Average latency of a last level cache (LLC) demand and prefetch data read miss (read memory access) in nano seconds", - "MetricExpr": "1e9 * (cha@unc_cha_tor_occupancy.ia_miss\\,config1\\=0x4043300000000@ / cha@unc_cha_tor_inserts.ia_miss\\,config1\\=0x4043300000000@) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_CLOCKTICKS) * #num_packages)) * duration_time", - "MetricName": "llc_data_read_demand_plus_prefetch_miss_latency", - "ScaleUnit": "1ns" + "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Branches;Fed;InsType", + "MetricName": "tma_info_ipbranch", + "MetricThreshold": "tma_info_ipbranch < 8" }, { - "BriefDescription": "Average latency of a last level cache (LLC) demand and prefetch data read miss (read memory access) addressed to local memory in nano seconds", - "MetricExpr": "1e9 * (cha@unc_cha_tor_occupancy.ia_miss\\,config1\\=0x4043200000000@ / cha@unc_cha_tor_inserts.ia_miss\\,config1\\=0x4043200000000@) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_CLOCKTICKS) * #num_packages)) * duration_time", - "MetricName": "llc_data_read_demand_plus_prefetch_miss_latency_for_local_requests", - "ScaleUnit": "1ns" + "BriefDescription": "Instructions Per Cycle (per Logical Processor)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_clks", + "MetricGroup": "Ret;Summary", + "MetricName": "tma_info_ipc" }, { - "BriefDescription": "Average latency of a last level cache (LLC) demand and prefetch data read miss (read memory access) addressed to remote memory in nano seconds", - "MetricExpr": "1e9 * (cha@unc_cha_tor_occupancy.ia_miss\\,config1\\=0x4043100000000@ / cha@unc_cha_tor_inserts.ia_miss\\,config1\\=0x4043100000000@) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_CLOCKTICKS) * #num_packages)) * duration_time", - "MetricName": "llc_data_read_demand_plus_prefetch_miss_latency_for_remote_requests", - "ScaleUnit": "1ns" + "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_ipcall", + "MetricThreshold": "tma_info_ipcall < 200" }, { - "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions", - "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", - "MetricName": "itlb_mpi", - "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the ITLB (Instruction TLB) and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per non-speculative DSB miss (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", + "MetricGroup": "DSBmiss;Fed", + "MetricName": "tma_info_ipdsb_miss_ret", + "MetricThreshold": "tma_info_ipdsb_miss_ret < 50" }, { - "BriefDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions", - "MetricExpr": "ITLB_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY", - "MetricName": "itlb_large_page_mpi", - "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the Instruction Translation Lookaside Buffer (ITLB) and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", + "MetricGroup": "Branches;OS", + "MetricName": "tma_info_ipfarbranch", + "MetricThreshold": "tma_info_ipfarbranch < 1e6" }, { - "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions", - "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", - "MetricName": "dtlb_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE) + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_ipflop", + "MetricThreshold": "tma_info_ipflop < 10" }, { - "BriefDescription": "Ratio of number of completed page walks (for 2 megabyte page sizes) caused by demand data loads to the total number of completed instructions", - "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY", - "MetricName": "dtlb_2mb_large_page_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the Data Translation Lookaside Buffer (DTLB) and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipload", + "MetricThreshold": "tma_info_ipload < 3" }, { - "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions", - "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", - "MetricName": "dtlb_store_mpi", - "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).", + "MetricExpr": "tma_info_instructions / (UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * cpu@BR_MISP_EXEC.ALL_BRANCHES\\,umask\\=0xE4@)", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_indirect", + "MetricThreshold": "tma_info_ipmisp_indirect < 1e3" }, { - "BriefDescription": "Memory read that miss the last level cache (LLC) addressed to local DRAM as a percentage of total memory read accesses, does not include LLC prefetches.", - "MetricExpr": "cha@unc_cha_tor_inserts.ia_miss\\,config1\\=0x4043200000000@ / (cha@unc_cha_tor_inserts.ia_miss\\,config1\\=0x4043200000000@ + cha@unc_cha_tor_inserts.ia_miss\\,config1\\=0x4043100000000@)", - "MetricName": "numa_reads_addressed_to_local_dram", - "ScaleUnit": "100%" + "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BadSpec;BrMispredicts", + "MetricName": "tma_info_ipmispredict", + "MetricThreshold": "tma_info_ipmispredict < 200" }, { - "BriefDescription": "Memory reads that miss the last level cache (LLC) addressed to remote DRAM as a percentage of total memory read accesses, does not include LLC prefetches.", - "MetricExpr": "cha@unc_cha_tor_inserts.ia_miss\\,config1\\=0x4043100000000@ / (cha@unc_cha_tor_inserts.ia_miss\\,config1\\=0x4043200000000@ + cha@unc_cha_tor_inserts.ia_miss\\,config1\\=0x4043100000000@)", - "MetricName": "numa_reads_addressed_to_remote_dram", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipstore", + "MetricThreshold": "tma_info_ipstore < 8" }, { - "BriefDescription": "Uncore operating frequency in GHz", - "MetricExpr": "UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_CLOCKTICKS) * #num_packages) / 1e9 / duration_time", - "MetricName": "uncore_frequency", - "ScaleUnit": "1GHz" + "BriefDescription": "Instructions per Software prefetch instruction (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umask\\=0xF@", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_ipswpf", + "MetricThreshold": "tma_info_ipswpf < 100" }, { - "BriefDescription": "Intel(R) Ultra Path Interconnect (UPI) data transmit bandwidth (MB/sec)", - "MetricExpr": "UNC_UPI_TxL_FLITS.ALL_DATA * 7.111111111111111 / 1e6 / duration_time", - "MetricName": "upi_data_transmit_bw", - "ScaleUnit": "1MB/s" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", + "MetricName": "tma_info_iptb", + "MetricThreshold": "tma_info_iptb < 9", + "PublicDescription": "Instruction per taken branch. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_dsb_misses, tma_lcp" }, { - "BriefDescription": "DDR memory read bandwidth (MB/sec)", - "MetricExpr": "UNC_M_CAS_COUNT.RD * 64 / 1e6 / duration_time", - "MetricName": "memory_bandwidth_read", - "ScaleUnit": "1MB/s" + "BriefDescription": "Instructions per speculative Unknown Branch Misprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_instructions / BACLEARS.ANY", + "MetricGroup": "Fed", + "MetricName": "tma_info_ipunknown_branch" }, { - "BriefDescription": "DDR memory write bandwidth (MB/sec)", - "MetricExpr": "UNC_M_CAS_COUNT.WR * 64 / 1e6 / duration_time", - "MetricName": "memory_bandwidth_write", - "ScaleUnit": "1MB/s" + "BriefDescription": "Fraction of branches that are unconditional (direct or indirect) jumps", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - (BR_INST_RETIRED.CONDITIONAL - BR_INST_RETIRED.NOT_TAKEN) - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_jump" }, { - "BriefDescription": "DDR memory bandwidth (MB/sec)", - "MetricExpr": "(UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) * 64 / 1e6 / duration_time", - "MetricName": "memory_bandwidth_total", - "ScaleUnit": "1MB/s" + "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_cpi" }, { - "BriefDescription": "Intel(R) Optane(TM) Persistent Memory(PMEM) memory read bandwidth (MB/sec)", - "MetricExpr": "UNC_M_PMM_RPQ_INSERTS * 64 / 1e6 / duration_time", - "MetricName": "pmem_memory_bandwidth_read", - "ScaleUnit": "1MB/s" + "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_utilization", + "MetricThreshold": "tma_info_kernel_utilization > 0.05" }, { - "BriefDescription": "Intel(R) Optane(TM) Persistent Memory(PMEM) memory write bandwidth (MB/sec)", - "MetricExpr": "UNC_M_PMM_WPQ_INSERTS * 64 / 1e6 / duration_time", - "MetricName": "pmem_memory_bandwidth_write", - "ScaleUnit": "1MB/s" + "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw" }, { - "BriefDescription": "Intel(R) Optane(TM) Persistent Memory(PMEM) memory bandwidth (MB/sec)", - "MetricExpr": "(UNC_M_PMM_RPQ_INSERTS + UNC_M_PMM_WPQ_INSERTS) * 64 / 1e6 / duration_time", - "MetricName": "pmem_memory_bandwidth_total", - "ScaleUnit": "1MB/s" + "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "tma_info_l1d_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw_1t" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the CPU.", - "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART0 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART2 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART3) * 4 / 1e6 / duration_time", - "MetricName": "io_bandwidth_disk_or_network_writes", - "ScaleUnit": "1MB/s" + "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l1mpki" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the CPU.", - "MetricExpr": "(UNC_IIO_PAYLOAD_BYTES_IN.MEM_WRITE.PART0 + UNC_IIO_PAYLOAD_BYTES_IN.MEM_WRITE.PART1 + UNC_IIO_PAYLOAD_BYTES_IN.MEM_WRITE.PART2 + UNC_IIO_PAYLOAD_BYTES_IN.MEM_WRITE.PART3) * 4 / 1e6 / duration_time", - "MetricName": "io_bandwidth_disk_or_network_reads", - "ScaleUnit": "1MB/s" + "BriefDescription": "L1 cache true misses per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l1mpki_load" }, { - "BriefDescription": "Uops delivered from decoded instruction cache (decoded stream buffer or DSB) as a percent of total uops delivered to Instruction Decode Queue", - "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS + LSD.UOPS)", - "MetricName": "percent_uops_delivered_from_decoded_icache", - "ScaleUnit": "100%" + "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw" }, { - "BriefDescription": "Uops delivered from legacy decode pipeline (Micro-instruction Translation Engine or MITE) as a percent of total uops delivered to Instruction Decode Queue", - "MetricExpr": "IDQ.MITE_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS + LSD.UOPS)", - "MetricName": "percent_uops_delivered_from_legacy_decode_pipeline", - "ScaleUnit": "100%" + "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "tma_info_l2_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw_1t" }, { - "BriefDescription": "Uops delivered from microcode sequencer (MS) as a percent of total uops delivered to Instruction Decode Queue", - "MetricExpr": "IDQ.MS_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS + LSD.UOPS)", - "MetricName": "percent_uops_delivered_from_microcode_sequencer", - "ScaleUnit": "100%" + "BriefDescription": "Rate of non silent evictions from the L2 cache per Kilo instruction", + "MetricExpr": "1e3 * L2_LINES_OUT.NON_SILENT / tma_info_instructions", + "MetricGroup": "L2Evicts;Mem;Server", + "MetricName": "tma_info_l2_evictions_nonsilent_pki" }, { - "BriefDescription": "Bandwidth (MB/sec) of read requests that miss the last level cache (LLC) and go to local memory.", - "MetricExpr": "UNC_CHA_REQUESTS.READS_LOCAL * 64 / 1e6 / duration_time", - "MetricName": "llc_miss_local_memory_bandwidth_read", - "ScaleUnit": "1MB/s" + "BriefDescription": "Rate of silent evictions from the L2 cache per Kilo instruction where the evicted lines are dropped (no writeback to L3 or memory)", + "MetricExpr": "1e3 * L2_LINES_OUT.SILENT / tma_info_instructions", + "MetricGroup": "L2Evicts;Mem;Server", + "MetricName": "tma_info_l2_evictions_silent_pki" }, { - "BriefDescription": "Bandwidth (MB/sec) of write requests that miss the last level cache (LLC) and go to local memory.", - "MetricExpr": "UNC_CHA_REQUESTS.WRITES_LOCAL * 64 / 1e6 / duration_time", - "MetricName": "llc_miss_local_memory_bandwidth_write", - "ScaleUnit": "1MB/s" + "BriefDescription": "L2 cache hits per kilo instruction for all request types (including speculative)", + "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2hpki_all" }, { - "BriefDescription": "Bandwidth (MB/sec) of read requests that miss the last level cache (LLC) and go to remote memory.", - "MetricExpr": "UNC_CHA_REQUESTS.READS_REMOTE * 64 / 1e6 / duration_time", - "MetricName": "llc_miss_remote_memory_bandwidth_read", - "ScaleUnit": "1MB/s" + "BriefDescription": "L2 cache hits per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2hpki_load" }, { - "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", - "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / SLOTS", - "MetricGroup": "PGO;TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_frontend_bound", - "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Machine_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound.", - "ScaleUnit": "100%" + "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheMisses;Mem", + "MetricName": "tma_info_l2mpki" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", - "MetricExpr": "4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / SLOTS", - "MetricGroup": "Frontend;TopdownL2;tma_L2_group;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_latency", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period.", - "ScaleUnit": "100%" + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all request types (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem;Offcore", + "MetricName": "tma_info_l2mpki_all" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses.", - "MetricExpr": "(ICACHE_16B.IFDATA_STALL + 2 * cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=0x1\\,edge\\=0x1@) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_icache_misses", - "ScaleUnit": "100%" + "BriefDescription": "L2 cache true code cacheline misses per kilo instruction", + "MetricExpr": "1e3 * FRONTEND_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_l2mpki_code" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses.", - "MetricExpr": "ICACHE_64B.IFTAG_STALL / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_itlb_misses", - "ScaleUnit": "100%" + "BriefDescription": "L2 cache speculative code cacheline misses per kilo instruction", + "MetricExpr": "1e3 * L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_l2mpki_code_all" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", - "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / CPU_CLK_UNHALTED.THREAD + tma_unknown_branches", - "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_branch_resteers", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings.", - "ScaleUnit": "100%" + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2mpki_load" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. ", - "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL4;tma_L4_group;tma_branch_resteers_group", - "MetricName": "tma_mispredicts_resteers", - "ScaleUnit": "100%" + "BriefDescription": "Average per-core data access bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_l3_cache_access_bw" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. ", - "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group", - "MetricName": "tma_clears_resteers", - "ScaleUnit": "100%" + "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "tma_info_l3_cache_access_bw", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_l3_cache_access_bw_1t" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears", - "MetricExpr": "9 * BACLEARS.ANY / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BigFoot;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group", - "MetricName": "tma_unknown_branches", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (First fetch or hitting BPU capacity limit).", - "ScaleUnit": "100%" + "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_dsb_switches", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty.", - "ScaleUnit": "100%" + "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "tma_info_l3_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw_1t" }, { - "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", - "MetricExpr": "ILD_STALL.LCP / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_lcp", - "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs.", - "ScaleUnit": "100%" + "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l3mpki" }, { - "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", - "MetricExpr": "2 * IDQ.MS_SWITCHES / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_ms_switches", - "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals.", - "ScaleUnit": "100%" + "BriefDescription": "Average Latency for L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_load_l2_miss_latency" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", - "MetricExpr": "tma_frontend_bound - tma_fetch_latency", - "MetricGroup": "FetchBW;Frontend;TopdownL2;tma_L2_group;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_bandwidth", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend.", - "ScaleUnit": "100%" + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_load_l2_mlp" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", - "MetricExpr": "(IDQ.ALL_MITE_CYCLES_ANY_UOPS - IDQ.ALL_MITE_CYCLES_4_UOPS) / CORE_CLKS / 2", - "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", - "MetricName": "tma_mite", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", - "ScaleUnit": "100%" + "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", + "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_RETIRED.L1_MISS + MEM_LOAD_RETIRED.FB_HIT)", + "MetricGroup": "Mem;MemoryBound;MemoryLat", + "MetricName": "tma_info_load_miss_real_latency" }, { - "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=0x1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=0x2@) / CORE_CLKS", - "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_mite_group", - "MetricName": "tma_decoder0_alone", - "ScaleUnit": "100%" + "BriefDescription": "STLB (2nd level TLB) data load speculative misses per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_load_stlb_mpki" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", - "MetricExpr": "(IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4_UOPS) / CORE_CLKS / 2", - "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", - "MetricName": "tma_dsb", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", - "ScaleUnit": "100%" + "BriefDescription": "Average latency of data read request to external DRAM memory [in nanoseconds]", + "MetricExpr": "1e9 * (UNC_M_RPQ_OCCUPANCY / UNC_M_RPQ_INSERTS) / imc_0@event\\=0x0@", + "MetricGroup": "Mem;MemoryLat;Server;SoC", + "MetricName": "tma_info_mem_dram_read_latency", + "PublicDescription": "Average latency of data read request to external DRAM memory [in nanoseconds]. Accounts for demand loads and L1/L2 data-read prefetches" }, { - "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", - "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_bad_speculation", - "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", - "ScaleUnit": "100%" + "BriefDescription": "Average number of parallel data read requests to external memory", + "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD@thresh\\=1@", + "MetricGroup": "Mem;MemoryBW;SoC", + "MetricName": "tma_info_mem_parallel_reads", + "PublicDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", - "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL2;tma_L2_group;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_branch_mispredicts", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path.", - "ScaleUnit": "100%" + "BriefDescription": "Average latency of data read request to external 3D X-Point memory [in nanoseconds]", + "MetricExpr": "1e9 * (UNC_M_PMM_RPQ_OCCUPANCY.ALL / UNC_M_PMM_RPQ_INSERTS) / imc_0@event\\=0x0@", + "MetricGroup": "Mem;MemoryLat;Server;SoC", + "MetricName": "tma_info_mem_pmm_read_latency", + "PublicDescription": "Average latency of data read request to external 3D X-Point memory [in nanoseconds]. Accounts for demand loads and L1/L2 data-read prefetches" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", - "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts", - "MetricGroup": "BadSpec;MachineClears;TopdownL2;tma_L2_group;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_machine_clears", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes.", - "ScaleUnit": "100%" + "BriefDescription": "Average latency of data read request to external memory (in nanoseconds)", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_INSERTS.IA_MISS_DRD) / (tma_info_socket_clks / duration_time)", + "MetricGroup": "Mem;MemoryLat;SoC", + "MetricName": "tma_info_mem_read_latency", + "PublicDescription": "Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches. ([RKL+]memory-controller only)" }, { - "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", - "MetricExpr": "1 - tma_frontend_bound - (UOPS_ISSUED.ANY + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_backend_bound", - "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound.", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full))) + tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk))", + "MetricGroup": "Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_info_memory_bandwidth", + "MetricThreshold": "tma_info_memory_bandwidth > 20", + "PublicDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_mem_bandwidth, tma_sq_full" }, { - "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", - "MetricExpr": "(CYCLE_ACTIVITY.STALLS_MEM_ANY + EXE_ACTIVITY.BOUND_ON_STORES) / (CYCLE_ACTIVITY.STALLS_TOTAL + (EXE_ACTIVITY.1_PORTS_UTIL + UOPS_RETIRED.RETIRE_SLOTS / SLOTS * EXE_ACTIVITY.2_PORTS_UTIL) + EXE_ACTIVITY.BOUND_ON_STORES) * tma_backend_bound", - "MetricGroup": "Backend;TopdownL2;tma_L2_group;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_memory_bound", - "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency)))", + "MetricGroup": "Mem;MemoryTLB;Offcore;tma_issueTLB", + "MetricName": "tma_info_memory_data_tlbs", + "MetricThreshold": "tma_info_memory_data_tlbs > 20", + "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", - "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_MISS) / CPU_CLK_UNHALTED.THREAD, 0)", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", - "MetricName": "tma_l1_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache.", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of Memory Latency related bottlenecks (external memory and off-core caches)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound))", + "MetricGroup": "Mem;MemoryLat;Offcore;tma_issueLat", + "MetricName": "tma_info_memory_latency", + "MetricThreshold": "tma_info_memory_latency > 20", + "PublicDescription": "Total pipeline cost of Memory Latency related bottlenecks (external memory and off-core caches). Related metrics: tma_l3_hit_latency, tma_mem_latency" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "min(9 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=0x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_dtlb_load", - "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss.", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_mispredictions", + "MetricThreshold": "tma_info_mispredictions > 20", + "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_branch_misprediction_cost, tma_mispredicts_resteers" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) DTLB was missed by load accesses, that later on hit in second-level TLB (STLB)", - "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", - "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group", - "MetricName": "tma_load_stlb_hit", - "ScaleUnit": "100%" + "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss", + "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES", + "MetricGroup": "Mem;MemoryBW;MemoryBound", + "MetricName": "tma_info_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)" }, { - "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by load accesses, performing a hardware page walk", - "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group", - "MetricName": "tma_load_stlb_miss", - "ScaleUnit": "100%" + "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_PENDING + DTLB_STORE_MISSES.WALK_PENDING + EPT.WALK_PENDING) / (2 * tma_info_core_clks)", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_page_walks_utilization", + "MetricThreshold": "tma_info_page_walks_utilization > 0.5" }, { - "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", - "MetricExpr": "min(13 * LD_BLOCKS.STORE_FORWARD / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_store_fwd_blk", - "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", - "ScaleUnit": "100%" + "BriefDescription": "Average 3DXP Memory Bandwidth Use for reads [GB / sec]", + "MetricExpr": "64 * UNC_M_PMM_RPQ_INSERTS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW;Server;SoC", + "MetricName": "tma_info_pmm_read_bw" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", - "MetricExpr": "min((12 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS.ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (11 * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO))) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_lock_latency", - "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them.", - "ScaleUnit": "100%" + "BriefDescription": "Average 3DXP Memory Bandwidth Use for Writes [GB / sec]", + "MetricExpr": "64 * UNC_M_PMM_WPQ_INSERTS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW;Server;SoC", + "MetricName": "tma_info_pmm_write_bw" }, { - "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. ", - "MetricExpr": "min(Load_Miss_Real_Latency * LD_BLOCKS.NO_SR / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_split_loads", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0", + "MetricExpr": "(CORE_POWER.LVL0_TURBO_LICENSE / 2 / tma_info_core_clks if #SMT_on else CORE_POWER.LVL0_TURBO_LICENSE / tma_info_core_clks)", + "MetricGroup": "Power", + "MetricName": "tma_info_power_license0_utilization", + "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0. This includes non-AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes." }, { - "BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset", - "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_4k_aliasing", - "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1", + "MetricExpr": "(CORE_POWER.LVL1_TURBO_LICENSE / 2 / tma_info_core_clks if #SMT_on else CORE_POWER.LVL1_TURBO_LICENSE / tma_info_core_clks)", + "MetricGroup": "Power", + "MetricName": "tma_info_power_license1_utilization", + "MetricThreshold": "tma_info_power_license1_utilization > 0.5", + "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1. This includes high current AVX 256-bit instructions as well as low current AVX 512-bit instructions." }, { - "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", - "MetricExpr": "Load_Miss_Real_Latency * cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=0x1@ / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryBW;TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_fb_full", - "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory).", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX)", + "MetricExpr": "(CORE_POWER.LVL2_TURBO_LICENSE / 2 / tma_info_core_clks if #SMT_on else CORE_POWER.LVL2_TURBO_LICENSE / tma_info_core_clks)", + "MetricGroup": "Power", + "MetricName": "tma_info_power_license2_utilization", + "MetricThreshold": "tma_info_power_license2_utilization > 0.5", + "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX). This includes high current AVX 512-bit instructions." }, { - "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads", - "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=0x1@) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / CPU_CLK_UNHALTED.THREAD)", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", - "MetricName": "tma_l2_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance.", - "ScaleUnit": "100%" + "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=1@", + "MetricGroup": "Pipeline;Ret", + "MetricName": "tma_info_retire" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", - "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STALLS_L3_MISS) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", - "MetricName": "tma_l3_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance.", - "ScaleUnit": "100%" + "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", + "MetricExpr": "4 * tma_info_core_clks", + "MetricGroup": "TmaL1;tma_L1_group", + "MetricName": "tma_info_slots" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", - "MetricExpr": "min(((47.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 3.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE / (OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE + OCR.DEMAND_DATA_RD.L3_HIT.HIT_OTHER_CORE_FWD))) + (47.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 3.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_l3_bound_group", - "MetricName": "tma_contested_accesses", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing.", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", + "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", + "MetricGroup": "SMT", + "MetricName": "tma_info_smt_2t_utilization" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", - "MetricExpr": "min((47.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 3.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (1 - OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE / (OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE + OCR.DEMAND_DATA_RD.L3_HIT.HIT_OTHER_CORE_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Offcore;Snoop;TopdownL4;tma_L4_group;tma_l3_bound_group", - "MetricName": "tma_data_sharing", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance.", - "ScaleUnit": "100%" + "BriefDescription": "Socket actual clocks when any core is active on that socket", + "MetricExpr": "cha_0@event\\=0x0@", + "MetricGroup": "SoC", + "MetricName": "tma_info_socket_clks" }, { - "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", - "MetricExpr": "min((20.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 3.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l3_bound_group", - "MetricName": "tma_l3_hit_latency", - "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings.", - "ScaleUnit": "100%" + "BriefDescription": "STLB (2nd level TLB) data store speculative misses per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_store_stlb_mpki" }, { - "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", - "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on else OFFCORE_REQUESTS_BUFFER.SQ_FULL) / CORE_CLKS", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_l3_bound_group", - "MetricName": "tma_sq_full", - "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). The Super Queue is used for requests to access the L2 cache or to go out to the Uncore.", - "ScaleUnit": "100%" + "BriefDescription": "Average Frequency Utilization relative nominal frequency", + "MetricExpr": "tma_info_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_turbo_utilization" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", - "MetricExpr": "min(CYCLE_ACTIVITY.STALLS_L3_MISS / CPU_CLK_UNHALTED.THREAD + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / CPU_CLK_UNHALTED.THREAD - tma_l2_bound - min(((1 - (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))) / (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + (25 * (MEM_LOAD_RETIRED.LOCAL_PMM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 33 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))))) * (CYCLE_ACTIVITY.STALLS_L3_MISS / CPU_CLK_UNHALTED.THREAD + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / CPU_CLK_UNHALTED.THREAD - tma_l2_bound) if 1e6 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM + MEM_LOAD_RETIRED.LOCAL_PMM) > MEM_LOAD_RETIRED.L1_MISS else 0), 1), 1)", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", - "MetricName": "tma_dram_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance.", - "ScaleUnit": "100%" + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY", + "MetricGroup": "Pipeline;Ret;Retire", + "MetricName": "tma_info_uoppi", + "MetricThreshold": "tma_info_uoppi > 1.05" }, { - "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=0x4@) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group", - "MetricName": "tma_mem_bandwidth", - "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that).", - "ScaleUnit": "100%" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW", + "MetricName": "tma_info_uptb", + "MetricThreshold": "tma_info_uptb < 6" }, { - "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / CPU_CLK_UNHALTED.THREAD - tma_mem_bandwidth", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group", - "MetricName": "tma_mem_latency", - "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that).", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses", + "MetricExpr": "ICACHE_64B.IFTAG_STALL / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_itlb_misses", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory", - "MetricExpr": "min((80 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 20.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_group", - "MetricName": "tma_local_dram", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory. Caching will improve the latency and increase performance.", + "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", + "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_MISS) / tma_info_clks, 0)", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group", + "MetricName": "tma_l1_bound", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB_HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory", - "MetricExpr": "min((147.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 20.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latency_group", - "MetricName": "tma_remote_dram", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory. This is caused often due to non-optimal NUMA allocations. #link to NUMA article", + "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=1@) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_clks)", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l2_bound", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues", - "MetricExpr": "min(((110 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 20.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM + (110 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 20.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latency_group", - "MetricName": "tma_remote_cache", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues. This is caused often due to non-optimal NUMA allocations. #link to NUMA article", + "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", + "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STALLS_L3_MISS) / tma_info_clks", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l3_bound", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates (based on idle latencies) how often the CPU was stalled on accesses to external 3D-Xpoint (Crystal Ridge, a.k.a", - "MetricExpr": "min(((1 - (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))) / (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + (25 * (MEM_LOAD_RETIRED.LOCAL_PMM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 33 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))))) * (CYCLE_ACTIVITY.STALLS_L3_MISS / CPU_CLK_UNHALTED.THREAD + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / CPU_CLK_UNHALTED.THREAD - tma_l2_bound) if 1e6 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM + MEM_LOAD_RETIRED.LOCAL_PMM) > MEM_LOAD_RETIRED.L1_MISS else 0), 1)", - "MetricGroup": "MemoryBound;Server;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", - "MetricName": "tma_pmm_bound", - "PublicDescription": "This metric roughly estimates (based on idle latencies) how often the CPU was stalled on accesses to external 3D-Xpoint (Crystal Ridge, a.k.a. IXP) memory by loads, PMM stands for Persistent Memory Module. ", + "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", + "MetricExpr": "17 * tma_info_average_frequency * MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group", + "MetricName": "tma_l3_hit_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tma_info_memory_latency, tma_mem_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", - "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", - "MetricName": "tma_store_bound", - "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck.", + "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", + "MetricExpr": "ILD_STALL.LCP / tma_info_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_lcp", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_dsb_misses, tma_info_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", - "MetricExpr": "(L2_RQSTS.RFO_HIT * 11 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_store_bound_group", - "MetricName": "tma_store_latency", - "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full)", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", + "MetricExpr": "tma_retiring - tma_heavy_operations", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_light_operations", + "MetricThreshold": "tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing", - "MetricExpr": "min((110 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) * (OCR.DEMAND_RFO.L3_MISS.REMOTE_HITM + OCR.PF_L2_RFO.L3_MISS.REMOTE_HITM) + 47.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) * (OCR.DEMAND_RFO.L3_HIT.HITM_OTHER_CORE + OCR.PF_L2_RFO.L3_HIT.HITM_OTHER_CORE)) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_store_bound_group", - "MetricName": "tma_false_sharing", - "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. ", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations", + "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_2 + UOPS_DISPATCHED_PORT.PORT_3 + UOPS_DISPATCHED_PORT.PORT_7 - UOPS_DISPATCHED_PORT.PORT_4) / (2 * tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_load_op_utilization", + "MetricThreshold": "tma_load_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations. Sample with: UOPS_DISPATCHED.PORT_2_3", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents rate of split store accesses", - "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / CORE_CLKS", - "MetricGroup": "TopdownL4;tma_L4_group;tma_store_bound_group", - "MetricName": "tma_split_stores", - "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity.", + "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) DTLB was missed by load accesses, that later on hit in second-level TLB (STLB)", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group", + "MetricName": "tma_load_stlb_hit", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", - "MetricExpr": "min((9 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=0x1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / CORE_CLKS, 1)", - "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_store_bound_group", - "MetricName": "tma_dtlb_store", - "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data.", + "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by load accesses, performing a hardware page walk", + "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group", + "MetricName": "tma_load_stlb_miss", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the TLB was missed by store accesses, hitting in the second-level TLB (STLB)", - "MetricExpr": "tma_dtlb_store - DTLB_STORE_MISSES.WALK_ACTIVE / CORE_CLKS", - "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", - "MetricName": "tma_store_stlb_hit", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory", + "MetricExpr": "59.5 * tma_info_average_frequency * MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_group", + "MetricName": "tma_local_dram", + "MetricThreshold": "tma_local_dram > 0.1 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory. Caching will improve the latency and increase performance. Sample with: MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk", - "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / CORE_CLKS", - "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", - "MetricName": "tma_store_stlb_miss", + "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", + "MetricExpr": "(12 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS.ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (11 * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO))) / tma_info_clks", + "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group", + "MetricName": "tma_lock_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOADS_PS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", - "MetricExpr": "tma_backend_bound - tma_memory_bound", - "MetricGroup": "Backend;Compute;TopdownL2;tma_L2_group;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_core_bound", - "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts", + "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", - "MetricExpr": "ARITH.DIVIDER_ACTIVE / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", - "MetricName": "tma_divider", - "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication.", + "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=4@) / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW", + "MetricName": "tma_mem_bandwidth", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_info_memory_bandwidth, tma_sq_full", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", - "MetricExpr": "((EXE_ACTIVITY.EXE_BOUND_0_PORTS + (EXE_ACTIVITY.1_PORTS_UTIL + UOPS_RETIRED.RETIRE_SLOTS / SLOTS * EXE_ACTIVITY.2_PORTS_UTIL)) / CPU_CLK_UNHALTED.THREAD if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + UOPS_RETIRED.RETIRE_SLOTS / SLOTS * EXE_ACTIVITY.2_PORTS_UTIL) / CPU_CLK_UNHALTED.THREAD)", - "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group", - "MetricName": "tma_ports_utilization", - "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", + "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_clks - tma_mem_bandwidth", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat", + "MetricName": "tma_mem_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_info_memory_latency, tma_l3_hit_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "(UOPS_EXECUTED.CORE_CYCLES_NONE / 2 if #SMT_on else CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY) / CORE_CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_0", - "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", + "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(CYCLE_ACTIVITY.STALLS_MEM_ANY + EXE_ACTIVITY.BOUND_ON_STORES) / (CYCLE_ACTIVITY.STALLS_TOTAL + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL) + EXE_ACTIVITY.BOUND_ON_STORES) * tma_backend_bound", + "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations", - "MetricExpr": "PARTIAL_RAT_STALLS.SCOREBOARD / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_0_group", - "MetricName": "tma_serializing_operation", - "PublicDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations. Instructions like CPUID; WRMSR or LFENCE serialize the out-of-order execution which may limit performance.", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations -- uops for memory load or store accesses.", + "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_RETIRED.ANY", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_memory_operations", + "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operations > 0.6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions.", - "MetricExpr": "40 * ROB_MISC_EVENTS.PAUSE_INST / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "TopdownL6;tma_L6_group;tma_serializing_operation_group", - "MetricName": "tma_slow_pause", + "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / tma_info_slots", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS", + "MetricName": "tma_microcode_sequencer", + "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD * UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY, 1)", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_0_group", - "MetricName": "tma_mixing_vectors", - "PublicDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued. Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic.", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage", + "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_clks", + "MetricGroup": "BadSpec;BrMispredicts;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueBM", + "MetricName": "tma_mispredicts_resteers", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_branch_mispredicts, tma_info_branch_misprediction_cost, tma_info_mispredictions", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_1 - UOPS_EXECUTED.CORE_CYCLES_GE_2) / 2 if #SMT_on else EXE_ACTIVITY.1_PORTS_UTIL) / CORE_CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_1", - "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful.", + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", + "MetricExpr": "(IDQ.ALL_MITE_CYCLES_ANY_UOPS - IDQ.ALL_MITE_CYCLES_4_UOPS) / tma_info_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_mite", + "MetricThreshold": "tma_mite > 0.1 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sample with: FRONTEND_RETIRED.ANY_DSB_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_2 - UOPS_EXECUTED.CORE_CYCLES_GE_3) / 2 if #SMT_on else EXE_ACTIVITY.2_PORTS_UTIL) / CORE_CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_2", - "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop.", + "BriefDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued", + "MetricExpr": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY", + "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utilized_0_group", + "MetricName": "tma_mixing_vectors", + "MetricThreshold": "tma_mixing_vectors > 0.05", + "PublicDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued. Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic. Related metrics: tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise).", - "MetricExpr": "(UOPS_EXECUTED.CORE_CYCLES_GE_3 / 2 if #SMT_on else UOPS_EXECUTED.CORE_CYCLES_GE_3) / CORE_CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_3m", + "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", + "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_clks", + "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", - "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / SLOTS", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", - "MetricName": "tma_alu_op_utilization", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused", + "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHES - UOPS_RETIRED.MACRO_FUSED) / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_non_fused_branches", + "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused. Non-conditional branches like direct JMP or CALL would count here. Can be used to examine fusible conditional jumps that were not fused.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_0 / CORE_CLKS", - "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilization_group", - "MetricName": "tma_port_0", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_nop_instructions", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions. Compilers often use NOPs for certain address alignments - e.g. start address of a function or loop body. Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_1 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group", - "MetricName": "tma_port_1", + "BriefDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes", + "MetricExpr": "max(0, tma_light_operations - (tma_fp_arith + tma_memory_operations + tma_fused_instructions + tma_non_fused_branches + tma_nop_instructions))", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_other_light_ops", + "MetricThreshold": "tma_other_light_ops > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes. May undercount due to FMA double counting", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_5 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group", - "MetricName": "tma_port_5", + "BriefDescription": "This metric roughly estimates (based on idle latencies) how often the CPU was stalled on accesses to external 3D-Xpoint (Crystal Ridge, a.k.a", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "((1 - (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))) / (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + (25 * (MEM_LOAD_RETIRED.LOCAL_PMM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 33 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))))) * (CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_clks + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_clks - tma_l2_bound) if 1e6 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM + MEM_LOAD_RETIRED.LOCAL_PMM) > MEM_LOAD_RETIRED.L1_MISS else 0)", + "MetricGroup": "MemoryBound;Server;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_pmm_bound", + "MetricThreshold": "tma_pmm_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric roughly estimates (based on idle latencies) how often the CPU was stalled on accesses to external 3D-Xpoint (Crystal Ridge, a.k.a. IXP) memory by loads, PMM stands for Persistent Memory Module.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_6 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group", - "MetricName": "tma_port_6", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_0 / tma_info_core_clks", + "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_0", + "MetricThreshold": "tma_port_0 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations", - "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_2 + UOPS_DISPATCHED_PORT.PORT_3 + UOPS_DISPATCHED_PORT.PORT_7 - UOPS_DISPATCHED_PORT.PORT_4) / (2 * CORE_CLKS)", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", - "MetricName": "tma_load_op_utilization", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_1 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_1", + "MetricThreshold": "tma_port_1 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [ICL+] Loads)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_2 / CORE_CLKS", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_2 / tma_info_core_clks", "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_group", "MetricName": "tma_port_2", + "MetricThreshold": "tma_port_2 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [ICL+] Loads). Sample with: UOPS_DISPATCHED_PORT.PORT_2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [ICL+] Loads)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_3 / CORE_CLKS", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_3 / tma_info_core_clks", "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_group", "MetricName": "tma_port_3", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_4 / CORE_CLKS", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", - "MetricName": "tma_store_op_utilization", + "MetricThreshold": "tma_port_3 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [ICL+] Loads). Sample with: UOPS_DISPATCHED_PORT.PORT_3", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data)", "MetricExpr": "tma_store_op_utilization", - "MetricGroup": "TopdownL6;tma_L6_group;tma_store_op_utilization_group", + "MetricGroup": "TopdownL6;tma_L6_group;tma_issueSpSt;tma_store_op_utilization_group", "MetricName": "tma_port_4", + "MetricThreshold": "tma_port_4 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data). Sample with: UOPS_DISPATCHED_PORT.PORT_4. Related metrics: tma_split_stores", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 7 ([HSW+]simple Store-address)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_7 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_L6_group;tma_store_op_utilization_group", - "MetricName": "tma_port_7", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_5 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_5", + "MetricThreshold": "tma_port_5 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_retiring", - "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. ", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_6 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_6", + "MetricThreshold": "tma_port_6 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_6. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", - "MetricExpr": "tma_retiring - (UOPS_RETIRED.RETIRE_SLOTS + UOPS_RETIRED.MACRO_FUSED - INST_RETIRED.ANY) / SLOTS", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_L2_group;tma_retiring_group", - "MetricName": "tma_light_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved.", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 7 ([HSW+]simple Store-address)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_7 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_store_op_utilization_group", + "MetricName": "tma_port_7", + "MetricThreshold": "tma_port_7 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 7 ([HSW+]simple Store-address). Sample with: UOPS_DISPATCHED_PORT.PORT_7", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", - "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD + tma_fp_scalar + tma_fp_vector", - "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group", - "MetricName": "tma_fp_arith", - "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", + "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", + "MetricExpr": "((EXE_ACTIVITY.EXE_BOUND_0_PORTS + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / tma_info_clks if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_clks)", + "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_ports_utilization", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric serves as an approximation of legacy x87 usage", - "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD", - "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", - "MetricName": "tma_x87_use", - "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", + "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "(UOPS_EXECUTED.CORE_CYCLES_NONE / 2 if #SMT_on else CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY) / tma_info_core_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_0", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE) / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group", - "MetricName": "tma_fp_scalar", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting.", + "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_1 - UOPS_EXECUTED.CORE_CYCLES_GE_2) / 2 if #SMT_on else EXE_ACTIVITY.1_PORTS_UTIL) / tma_info_core_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_1", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths", - "MetricExpr": "min((FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS, 1)", - "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group", - "MetricName": "tma_fp_vector", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting.", + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_2 - UOPS_EXECUTED.CORE_CYCLES_GE_3) / 2 if #SMT_on else EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_core_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors", - "MetricExpr": "min((FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS, 1)", - "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group", - "MetricName": "tma_fp_vector_128b", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting.", + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise).", + "MetricExpr": "(UOPS_EXECUTED.CORE_CYCLES_GE_3 / 2 if #SMT_on else UOPS_EXECUTED.CORE_CYCLES_GE_3) / tma_info_core_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_3m", + "MetricThreshold": "tma_ports_utilized_3m > 0.7 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors", - "MetricExpr": "min((FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS, 1)", - "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group", - "MetricName": "tma_fp_vector_256b", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting.", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "(89.5 * tma_info_average_frequency * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM + 89.5 * tma_info_average_frequency * MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_issueSyncxn;tma_mem_latency_group", + "MetricName": "tma_remote_cache", + "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues. This is caused often due to non-optimal NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM_PS;MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD_PS. Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors", - "MetricExpr": "min((FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS, 1)", - "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group", - "MetricName": "tma_fp_vector_512b", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors. May overcount due to FMA double counting.", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory", + "MetricExpr": "127 * tma_info_average_frequency * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latency_group", + "MetricName": "tma_remote_dram", + "MetricThreshold": "tma_remote_dram > 0.1 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory. This is caused often due to non-optimal NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations -- uops for memory load or store accesses.", - "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_RETIRED.ANY", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", - "MetricName": "tma_memory_operations", + "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1", + "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.RETIRE_SLOTS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions", - "MetricExpr": "tma_light_operations * UOPS_RETIRED.MACRO_FUSED / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", - "MetricName": "tma_fused_instructions", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions. The instruction pairs of CMP+JCC or DEC+JCC are commonly used examples.", + "BriefDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations", + "MetricExpr": "PARTIAL_RAT_STALLS.SCOREBOARD / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL5;tma_L5_group;tma_issueSO;tma_ports_utilized_0_group", + "MetricName": "tma_serializing_operation", + "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations. Instructions like CPUID; WRMSR or LFENCE serialize the out-of-order execution which may limit performance. Sample with: PARTIAL_RAT_STALLS.SCOREBOARD. Related metrics: tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused", - "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHES - UOPS_RETIRED.MACRO_FUSED) / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", - "MetricName": "tma_non_fused_branches", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused. Non-conditional branches like direct JMP or CALL would count here. Can be used to examine fusible conditional jumps that were not fused.", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions", + "MetricExpr": "40 * ROB_MISC_EVENTS.PAUSE_INST / tma_info_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_serializing_operation_group", + "MetricName": "tma_slow_pause", + "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_operation > 0.1 & (tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions. Sample with: MISC_RETIRED.PAUSE_INST", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions", - "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", - "MetricName": "tma_nop_instructions", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions. Compilers often use NOPs for certain address alignments - e.g. start address of a function or loop body.", + "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "tma_info_load_miss_real_latency * LD_BLOCKS.NO_SR / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_split_loads", + "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes. May undercount due to FMA double counting", - "MetricExpr": "max(0, tma_light_operations - (tma_fp_arith + tma_memory_operations + tma_fused_instructions + tma_non_fused_branches + tma_nop_instructions))", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", - "MetricName": "tma_other_light_ops", + "BriefDescription": "This metric represents rate of split store accesses", + "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group", + "MetricName": "tma_split_stores", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_4", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences", - "MetricExpr": "(UOPS_RETIRED.RETIRE_SLOTS + UOPS_RETIRED.MACRO_FUSED - INST_RETIRED.ANY) / SLOTS", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_L2_group;tma_retiring_group", - "MetricName": "tma_heavy_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences. This highly-correlates with the uop length of these instructions/sequences.", + "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", + "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on else OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group", + "MetricName": "tma_sq_full", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_info_memory_bandwidth, tma_mem_bandwidth", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops", - "MetricExpr": "tma_heavy_operations - UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / SLOTS", - "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group", - "MetricName": "tma_few_uops_instructions", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number of uops in such instructions.", + "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", + "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / SLOTS", - "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group", - "MetricName": "tma_microcode_sequencer", - "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.", + "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", + "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_store_fwd_blk", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", - "MetricExpr": "min(100 * (FP_ASSIST.ANY + OTHER_ASSISTS.ANY) / SLOTS, 1)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", - "MetricName": "tma_assists", - "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases.", + "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "(L2_RQSTS.RFO_HIT * 11 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_clks", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group", + "MetricName": "tma_store_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", - "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", - "MetricName": "tma_cisc", - "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_4 / tma_info_core_clks", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_store_op_utilization", + "MetricThreshold": "tma_store_op_utilization > 0.6", "ScaleUnit": "100%" }, { - "BriefDescription": "C3 residency percent per core", - "MetricExpr": "cstate_core@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Core_Residency", + "BriefDescription": "This metric roughly estimates the fraction of cycles where the TLB was missed by store accesses, hitting in the second-level TLB (STLB)", + "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", + "MetricName": "tma_store_stlb_hit", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per core", - "MetricExpr": "cstate_core@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Core_Residency", + "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk", + "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_clks", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", + "MetricName": "tma_store_stlb_miss", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "C7 residency percent per core", - "MetricExpr": "cstate_core@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Core_Residency", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears", + "MetricExpr": "9 * BACLEARS.ANY / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group", + "MetricName": "tma_unknown_branches", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (e.g. first time the branch is fetched or hitting BTB capacity limit). Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, { - "BriefDescription": "C2 residency percent per package", - "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C2_Pkg_Residency", + "BriefDescription": "This metric serves as an approximation of legacy x87 usage", + "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD", + "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", + "MetricName": "tma_x87_use", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", "ScaleUnit": "100%" }, { - "BriefDescription": "C3 residency percent per package", - "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Pkg_Residency", + "BriefDescription": "Percentage of cycles in aborted transactions.", + "MetricExpr": "max(cpu@cycles\\-t@ - cpu@cycles\\-ct@, 0) / cycles", + "MetricGroup": "transaction", + "MetricName": "tsx_aborted_cycles", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per package", - "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Pkg_Residency", - "ScaleUnit": "100%" + "BriefDescription": "Number of cycles within a transaction divided by the number of elisions.", + "MetricExpr": "cpu@cycles\\-t@ / cpu@el\\-start@", + "MetricGroup": "transaction", + "MetricName": "tsx_cycles_per_elision", + "ScaleUnit": "1cycles / elision" }, { - "BriefDescription": "C7 residency percent per package", - "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Pkg_Residency", + "BriefDescription": "Number of cycles within a transaction divided by the number of transactions.", + "MetricExpr": "cpu@cycles\\-t@ / cpu@tx\\-start@", + "MetricGroup": "transaction", + "MetricName": "tsx_cycles_per_transaction", + "ScaleUnit": "1cycles / transaction" + }, + { + "BriefDescription": "Percentage of cycles within a transaction region.", + "MetricExpr": "cpu@cycles\\-t@ / cycles", + "MetricGroup": "transaction", + "MetricName": "tsx_transactional_cycles", "ScaleUnit": "100%" } ] diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/frontend.json b/tools/perf/pmu-events/arch/x86/cascadelakex/frontend.json index 13ccf50db43d..04f08e4d2402 100644 --- a/tools/perf/pmu-events/arch/x86/cascadelakex/frontend.json +++ b/tools/perf/pmu-events/arch/x86/cascadelakex/frontend.json @@ -322,7 +322,7 @@ "UMask": "0x4" }, { - "BriefDescription": "Cycles when uops are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequenser (MS) is busy", + "BriefDescription": "Cycles when uops are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy", "CounterMask": "1", "EventCode": "0x79", "EventName": "IDQ.MS_CYCLES", @@ -331,7 +331,7 @@ "UMask": "0x30" }, { - "BriefDescription": "Cycles when uops initiated by Decode Stream Buffer (DSB) are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequenser (MS) is busy", + "BriefDescription": "Cycles when uops initiated by Decode Stream Buffer (DSB) are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy", "CounterMask": "1", "EventCode": "0x79", "EventName": "IDQ.MS_DSB_CYCLES", @@ -340,7 +340,7 @@ "UMask": "0x10" }, { - "BriefDescription": "Uops initiated by MITE and delivered to Instruction Decode Queue (IDQ) while Microcode Sequenser (MS) is busy", + "BriefDescription": "Uops initiated by MITE and delivered to Instruction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy", "EventCode": "0x79", "EventName": "IDQ.MS_MITE_UOPS", "PublicDescription": "Counts the number of uops initiated by MITE and delivered to Instruction Decode Queue (IDQ) while the Microcode Sequencer (MS) is busy. Counting includes uops that may 'bypass' the IDQ.", @@ -358,7 +358,7 @@ "UMask": "0x30" }, { - "BriefDescription": "Uops delivered to Instruction Decode Queue (IDQ) while Microcode Sequenser (MS) is busy", + "BriefDescription": "Uops delivered to Instruction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy", "EventCode": "0x79", "EventName": "IDQ.MS_UOPS", "PublicDescription": "Counts the total number of uops delivered by the Microcode Sequencer (MS). Any instruction over 4 uops will be delivered by the MS. Some instructions such as transcendentals may additionally generate uops from the MS.", diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/pipeline.json b/tools/perf/pmu-events/arch/x86/cascadelakex/pipeline.json index 64e1fe351333..0f06e314fe36 100644 --- a/tools/perf/pmu-events/arch/x86/cascadelakex/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/cascadelakex/pipeline.json @@ -93,6 +93,22 @@ "SampleAfterValue": "400009", "UMask": "0x10" }, + { + "BriefDescription": "Speculative and retired mispredicted macro conditional branches", + "EventCode": "0x89", + "EventName": "BR_MISP_EXEC.ALL_BRANCHES", + "PublicDescription": "This event counts both taken and not taken speculative and retired mispredicted branch instructions.", + "SampleAfterValue": "200003", + "UMask": "0xff" + }, + { + "BriefDescription": "Speculative mispredicted indirect branches", + "EventCode": "0x89", + "EventName": "BR_MISP_EXEC.INDIRECT", + "PublicDescription": "Counts speculatively miss-predicted indirect branches at execution time. Counts for indirect near CALL or JMP instructions (RET excluded).", + "SampleAfterValue": "200003", + "UMask": "0xe4" + }, { "BriefDescription": "All mispredicted macro branch instructions retired.", "EventCode": "0xC5", diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-memory.json b/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-memory.json index 70a2c0ff8dfd..aafd2c9b813b 100644 --- a/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-memory.json +++ b/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-memory.json @@ -192,7 +192,7 @@ "EventCode": "0x9", "EventName": "UNC_M_ECC_CORRECTABLE_ERRORS", "PerPkg": "1", - "PublicDescription": "Counts the number of ECC errors detected and corrected by the iMC on this channel. This counter is only useful with ECC DRAM devices. This count will increment one time for each correction regardless of the number of bits corrected. The iMC can correct up to 4 bit errors in independent channel mode and 8 bit erros in lockstep mode.", + "PublicDescription": "Counts the number of ECC errors detected and corrected by the iMC on this channel. This counter is only useful with ECC DRAM devices. This count will increment one time for each correction regardless of the number of bits corrected. The iMC can correct up to 4 bit errors in independent channel mode and 8 bit errors in lockstep mode.", "Unit": "iMC" }, { @@ -212,7 +212,7 @@ "Unit": "iMC" }, { - "BriefDescription": "UNC_M_MAJMODE2.PMM_CYC", + "BriefDescription": "Major Mode 2 : Cycles in PMM major mode", "EventCode": "0xED", "EventName": "UNC_M_MAJMODE2.PMM_CYC", "PerPkg": "1", @@ -220,7 +220,7 @@ "Unit": "iMC" }, { - "BriefDescription": "UNC_M_MAJMODE2.PMM_ENTER", + "BriefDescription": "Major Mode 2 : Entered PMM major mode", "EventCode": "0xED", "EventName": "UNC_M_MAJMODE2.PMM_ENTER", "PerPkg": "1", @@ -290,7 +290,7 @@ "Unit": "iMC" }, { - "BriefDescription": "All commands for Intel Optane DC persistent memory", + "BriefDescription": "All commands for Intel(R) Optane(TM) DC persistent memory", "EventCode": "0xEA", "EventName": "UNC_M_PMM_CMD1.ALL", "PerPkg": "1", @@ -314,7 +314,7 @@ "Unit": "iMC" }, { - "BriefDescription": "Regular reads(RPQ) commands for Intel Optane DC persistent memory", + "BriefDescription": "Regular reads(RPQ) commands for Intel(R) Optane(TM) DC persistent memory", "EventCode": "0xEA", "EventName": "UNC_M_PMM_CMD1.RD", "PerPkg": "1", @@ -331,7 +331,7 @@ "Unit": "iMC" }, { - "BriefDescription": "Underfill read commands for Intel Optane DC persistent memory", + "BriefDescription": "Underfill read commands for Intel(R) Optane(TM) DC persistent memory", "EventCode": "0xEA", "EventName": "UNC_M_PMM_CMD1.UFILL_RD", "PerPkg": "1", @@ -348,7 +348,7 @@ "Unit": "iMC" }, { - "BriefDescription": "Write commands for Intel Optane DC persistent memory", + "BriefDescription": "Write commands for Intel(R) Optane(TM) DC persistent memory", "EventCode": "0xEA", "EventName": "UNC_M_PMM_CMD1.WR", "PerPkg": "1", @@ -522,7 +522,7 @@ "Unit": "iMC" }, { - "BriefDescription": "Write Pending Queue Occupancy of all write requests for Intel Optane DC persistent memory", + "BriefDescription": "Write Pending Queue Occupancy of all write requests for Intel(R) Optane(TM) DC persistent memory", "EventCode": "0xE4", "EventName": "UNC_M_PMM_WPQ_OCCUPANCY.ALL", "PerPkg": "1", @@ -2735,7 +2735,7 @@ "EventCode": "0x81", "EventName": "UNC_M_WPQ_OCCUPANCY", "PerPkg": "1", - "PublicDescription": "Counts the number of entries in the Write Pending Queue (WPQ) at each cycle. This can then be used to calculate both the average queue occupancy (in conjunction with the number of cycles not empty) and the average latency (in conjunction with the number of allocations). The WPQ is used to schedule writes out to the memory controller and to track the requests. Requests allocate into the WPQ soon after they enter the memory controller, and need credits for an entry in this buffer before being sent from the CHA to the iMC (memory controller). They deallocate after being issued to DRAM. Write requests themselves are able to complete (from the perspective of the rest of the system) as soon they have 'posted' to the iMC. This is not to be confused with actually performing the write to DRAM. Therefore, the average latency for this queue is actually not useful for deconstruction intermediate write latencies. So, we provide filtering based on if the request has posted or not. By using the 'not posted' filter, we can track how long writes spent in the iMC before completions were sent to the HA. The 'posted' filter, on the other hand, provides information about how much queueing is actually happenning in the iMC for writes before they are actually issued to memory. High average occupancies will generally coincide with high write major mode counts. Is there a filter of sorts???", + "PublicDescription": "Counts the number of entries in the Write Pending Queue (WPQ) at each cycle. This can then be used to calculate both the average queue occupancy (in conjunction with the number of cycles not empty) and the average latency (in conjunction with the number of allocations). The WPQ is used to schedule writes out to the memory controller and to track the requests. Requests allocate into the WPQ soon after they enter the memory controller, and need credits for an entry in this buffer before being sent from the CHA to the iMC (memory controller). They deallocate after being issued to DRAM. Write requests themselves are able to complete (from the perspective of the rest of the system) as soon they have 'posted' to the iMC. This is not to be confused with actually performing the write to DRAM. Therefore, the average latency for this queue is actually not useful for deconstruction intermediate write latencies. So, we provide filtering based on if the request has posted or not. By using the 'not posted' filter, we can track how long writes spent in the iMC before completions were sent to the HA. The 'posted' filter, on the other hand, provides information about how much queueing is actually happening in the iMC for writes before they are actually issued to memory. High average occupancies will generally coincide with high write major mode counts. Is there a filter of sorts???", "Unit": "iMC" }, { diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-other.json b/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-other.json index ef4767feb4e2..5f3ed5e843b9 100644 --- a/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-other.json +++ b/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-other.json @@ -44,7 +44,7 @@ "MetricName": "LLC_MISSES.PCIE_WRITE", "PerPkg": "1", "PortMask": "0x01", - "PublicDescription": "Counts every write request of 4 bytes of data made by IIO Part0 to a unit onthe main die (generally memory). In the general case, Part0 refers to a standard PCIe card of any size (x16,x8,x4) that is plugged directly into one of the PCIe slots. Part0 could also refer to any device plugged into the first slot of a PCIe riser card or to a device attached to the IIO unit which starts its use of the bus using lane 0 of the 16 lanes supported by the bus.", + "PublicDescription": "Counts every write request of 4 bytes of data made by IIO Part0 to a unit on the main die (generally memory). In the general case, Part0 refers to a standard PCIe card of any size (x16,x8,x4) that is plugged directly into one of the PCIe slots. Part0 could also refer to any device plugged into the first slot of a PCIe riser card or to a device attached to the IIO unit which starts its use of the bus using lane 0 of the 16 lanes supported by the bus.", "ScaleUnit": "4Bytes", "UMask": "0x1", "Unit": "IIO" @@ -856,7 +856,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Counts the number of Allocate/Update to HitMe Cache; Deallocate HtiME$ on Reads without RspFwdI*", + "BriefDescription": "Counts the number of Allocate/Update to HitMe Cache; Deallocate HitME$ on Reads without RspFwdI*", "EventCode": "0x61", "EventName": "UNC_CHA_HITME_UPDATE.DEALLOCATE", "PerPkg": "1", @@ -1210,7 +1210,7 @@ "EventCode": "0x34", "EventName": "UNC_CHA_LLC_LOOKUP.WRITE", "PerPkg": "1", - "PublicDescription": "Counts the number of times the LLC was accessed - this includes code, data, prefetches and hints coming from L2. This has numerous filters available. Note the non-standard filtering equation. This event will count requests that lookup the cache multiple times with multiple increments. One must ALWAYS set umask bit 0 and select a state or states to match. Otherwise, the event will count nothing. CHAFilter0[24:21,17] bits correspond to [FMESI] state.; Writeback transactions from L2 to the LLC This includes all write transactions -- both Cachable and UC.", + "PublicDescription": "Counts the number of times the LLC was accessed - this includes code, data, prefetches and hints coming from L2. This has numerous filters available. Note the non-standard filtering equation. This event will count requests that lookup the cache multiple times with multiple increments. One must ALWAYS set umask bit 0 and select a state or states to match. Otherwise, the event will count nothing. CHAFilter0[24:21,17] bits correspond to [FMESI] state.; Writeback transactions from L2 to the LLC This includes all write transactions -- both Cacheable and UC.", "UMask": "0x5", "Unit": "CHA" }, @@ -3481,7 +3481,7 @@ "EventCode": "0x5D", "EventName": "UNC_CHA_SNOOP_RESP_LOCAL.RSPSFWD", "PerPkg": "1", - "PublicDescription": "Number of snoop responses received for a Local request; Filters for a snoop response of RspSFwd to local CA requests. This is returned when a remote caching agent forwards data but holds on to its currentl copy. This is common for data and code reads that hit in a remote socket in E or F state.", + "PublicDescription": "Number of snoop responses received for a Local request; Filters for a snoop response of RspSFwd to local CA requests. This is returned when a remote caching agent forwards data but holds on to its current copy. This is common for data and code reads that hit in a remote socket in E or F state.", "UMask": "0x8", "Unit": "CHA" }, @@ -4082,10 +4082,11 @@ "Unit": "CHA" }, { - "BriefDescription": "UNC_CHA_TOR_OCCUPANCY.ALL", + "BriefDescription": "TOR Occupancy : All", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.ALL", "PerPkg": "1", + "PublicDescription": "TOR Occupancy : All : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.", "UMask": "0xff", "Unit": "CHA" }, @@ -4153,20 +4154,22 @@ "Unit": "CHA" }, { - "BriefDescription": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_CRD", + "BriefDescription": "TOR Occupancy : CRds issued by iA Cores that Hit the LLC", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_CRD", "Filter": "config1=0x40233", "PerPkg": "1", + "PublicDescription": "TOR Occupancy : CRds issued by iA Cores that Hit the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.", "UMask": "0x11", "Unit": "CHA" }, { - "BriefDescription": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_DRD", + "BriefDescription": "TOR Occupancy : DRds issued by iA Cores that Hit the LLC", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_DRD", "Filter": "config1=0x40433", "PerPkg": "1", + "PublicDescription": "TOR Occupancy : DRds issued by iA Cores that Hit the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.", "UMask": "0x11", "Unit": "CHA" }, @@ -4189,20 +4192,22 @@ "Unit": "CHA" }, { - "BriefDescription": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_LlcPrefRFO", + "BriefDescription": "TOR Occupancy : LLCPrefRFO issued by iA Cores that hit the LLC", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_LlcPrefRFO", "Filter": "config1=0x4b033", "PerPkg": "1", + "PublicDescription": "TOR Occupancy : LLCPrefRFO issued by iA Cores that hit the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.", "UMask": "0x11", "Unit": "CHA" }, { - "BriefDescription": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_RFO", + "BriefDescription": "TOR Occupancy : RFOs issued by iA Cores that Hit the LLC", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_RFO", "Filter": "config1=0x40033", "PerPkg": "1", + "PublicDescription": "TOR Occupancy : RFOs issued by iA Cores that Hit the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.", "UMask": "0x11", "Unit": "CHA" }, @@ -4216,11 +4221,12 @@ "Unit": "CHA" }, { - "BriefDescription": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_CRD", + "BriefDescription": "TOR Occupancy : CRds issued by iA Cores that Missed the LLC", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_CRD", "Filter": "config1=0x40233", "PerPkg": "1", + "PublicDescription": "TOR Occupancy : CRds issued by iA Cores that Missed the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.", "UMask": "0x21", "Unit": "CHA" }, @@ -4253,20 +4259,22 @@ "Unit": "CHA" }, { - "BriefDescription": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_LlcPrefRFO", + "BriefDescription": "TOR Occupancy : LLCPrefRFO issued by iA Cores that missed the LLC", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_LlcPrefRFO", "Filter": "config1=0x4b033", "PerPkg": "1", + "PublicDescription": "TOR Occupancy : LLCPrefRFO issued by iA Cores that missed the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.", "UMask": "0x21", "Unit": "CHA" }, { - "BriefDescription": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_RFO", + "BriefDescription": "TOR Occupancy : RFOs issued by iA Cores that Missed the LLC", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_RFO", "Filter": "config1=0x40033", "PerPkg": "1", + "PublicDescription": "TOR Occupancy : RFOs issued by iA Cores that Missed the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.", "UMask": "0x21", "Unit": "CHA" }, @@ -4308,7 +4316,7 @@ "Unit": "CHA" }, { - "BriefDescription": "TOR Occupancy; RDCUR isses from Local IO", + "BriefDescription": "TOR Occupancy; RDCUR misses from Local IO", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IO_MISS_RDCUR", "Filter": "config1=0x43C33", @@ -11637,7 +11645,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x01", - "PublicDescription": "Counts every write request of 4 bytes of data made by IIO Part0 to a unit onthe main die (generally memory). In the general case, Part0 refers to a standard PCIe card of any size (x16,x8,x4) that is plugged directly into one of the PCIe slots. Part0 could also refer to any device plugged into the first slot of a PCIe riser card or to a device attached to the IIO unit which starts its use of the bus using lane 0 of the 16 lanes supported by the bus.", + "PublicDescription": "Counts every write request of 4 bytes of data made by IIO Part0 to a unit on the main die (generally memory). In the general case, Part0 refers to a standard PCIe card of any size (x16,x8,x4) that is plugged directly into one of the PCIe slots. Part0 could also refer to any device plugged into the first slot of a PCIe riser card or to a device attached to the IIO unit which starts its use of the bus using lane 0 of the 16 lanes supported by the bus.", "UMask": "0x1", "Unit": "IIO" }, @@ -12024,7 +12032,7 @@ "Unit": "IIO" }, { - "BriefDescription": "UNC_IIO_NOTHING", + "BriefDescription": "Counting disabled", "EventName": "UNC_IIO_NOTHING", "PerPkg": "1", "Unit": "IIO" @@ -15622,7 +15630,7 @@ "EventCode": "0xC", "EventName": "UNC_I_TxS_REQUEST_OCCUPANCY", "PerPkg": "1", - "PublicDescription": "Accumultes the number of outstanding outbound requests from the IRP to the switch (towards the devices). This can be used in conjuection with the allocations event in order to calculate average latency of outbound requests.", + "PublicDescription": "Accumulates the number of outstanding outbound requests from the IRP to the switch (towards the devices). This can be used in conjunction with the allocations event in order to calculate average latency of outbound requests.", "Unit": "IRP" }, { @@ -16128,35 +16136,35 @@ "Unit": "M2M" }, { - "BriefDescription": "Number of reads in which direct to Intel UPI transactions were overridden", + "BriefDescription": "Number of reads in which direct to Intel(R) UPI transactions were overridden", "EventCode": "0x28", "EventName": "UNC_M2M_DIRECT2UPI_NOT_TAKEN_CREDITS", "PerPkg": "1", - "PublicDescription": "Counts reads in which direct to Intel Ultra Path Interconnect (UPI) transactions (which would have bypassed the CHA) were overridden", + "PublicDescription": "Counts reads in which direct to Intel(R) Ultra Path Interconnect (UPI) transactions (which would have bypassed the CHA) were overridden", "Unit": "M2M" }, { - "BriefDescription": "Cycles when direct to Intel UPI was disabled", + "BriefDescription": "Cycles when direct to Intel(R) UPI was disabled", "EventCode": "0x27", "EventName": "UNC_M2M_DIRECT2UPI_NOT_TAKEN_DIRSTATE", "PerPkg": "1", - "PublicDescription": "Counts cycles when the ability to send messages direct to the Intel Ultra Path Interconnect (bypassing the CHA) was disabled", + "PublicDescription": "Counts cycles when the ability to send messages direct to the Intel(R) Ultra Path Interconnect (bypassing the CHA) was disabled", "Unit": "M2M" }, { - "BriefDescription": "Messages sent direct to the Intel UPI", + "BriefDescription": "Messages sent direct to the Intel(R) UPI", "EventCode": "0x26", "EventName": "UNC_M2M_DIRECT2UPI_TAKEN", "PerPkg": "1", - "PublicDescription": "Counts when messages were sent direct to the Intel Ultra Path Interconnect (bypassing the CHA)", + "PublicDescription": "Counts when messages were sent direct to the Intel(R) Ultra Path Interconnect (bypassing the CHA)", "Unit": "M2M" }, { - "BriefDescription": "Number of reads that a message sent direct2 Intel UPI was overridden", + "BriefDescription": "Number of reads that a message sent direct2 Intel(R) UPI was overridden", "EventCode": "0x29", "EventName": "UNC_M2M_DIRECT2UPI_TXN_OVERRIDE", "PerPkg": "1", - "PublicDescription": "Counts when a read message that was sent direct to the Intel Ultra Path Interconnect (bypassing the CHA) was overridden", + "PublicDescription": "Counts when a read message that was sent direct to the Intel(R) Ultra Path Interconnect (bypassing the CHA) was overridden", "Unit": "M2M" }, { @@ -16583,7 +16591,7 @@ "Unit": "M2M" }, { - "BriefDescription": "Read requests to Intel Optane DC persistent memory issued to the iMC from M2M", + "BriefDescription": "Read requests to Intel(R) Optane(TM) DC persistent memory issued to the iMC from M2M", "EventCode": "0x37", "EventName": "UNC_M2M_IMC_READS.TO_PMM", "PerPkg": "1", @@ -16650,7 +16658,7 @@ "Unit": "M2M" }, { - "BriefDescription": "Write requests to Intel Optane DC persistent memory issued to the iMC from M2M", + "BriefDescription": "Write requests to Intel(R) Optane(TM) DC persistent memory issued to the iMC from M2M", "EventCode": "0x38", "EventName": "UNC_M2M_IMC_WRITES.TO_PMM", "PerPkg": "1", @@ -16675,7 +16683,7 @@ "Unit": "M2M" }, { - "BriefDescription": "M2M->iMC RPQ Cycles w/Credits - Regular; Channel 0", + "BriefDescription": "M2M->iMC RPQ Cycles w/Credits - Regular; Channel 0", "EventCode": "0x4F", "EventName": "UNC_M2M_PMM_RPQ_CYCLES_REG_CREDITS.CHN0", "PerPkg": "1", @@ -16683,7 +16691,7 @@ "Unit": "M2M" }, { - "BriefDescription": "M2M->iMC RPQ Cycles w/Credits - Regular; Channel 1", + "BriefDescription": "M2M->iMC RPQ Cycles w/Credits - Regular; Channel 1", "EventCode": "0x4F", "EventName": "UNC_M2M_PMM_RPQ_CYCLES_REG_CREDITS.CHN1", "PerPkg": "1", @@ -16691,7 +16699,7 @@ "Unit": "M2M" }, { - "BriefDescription": "M2M->iMC RPQ Cycles w/Credits - Regular; Channel 2", + "BriefDescription": "M2M->iMC RPQ Cycles w/Credits - Regular; Channel 2", "EventCode": "0x4F", "EventName": "UNC_M2M_PMM_RPQ_CYCLES_REG_CREDITS.CHN2", "PerPkg": "1", @@ -16699,7 +16707,7 @@ "Unit": "M2M" }, { - "BriefDescription": "M2M->iMC WPQ Cycles w/Credits - Regular; Channel 0", + "BriefDescription": "M2M->iMC WPQ Cycles w/Credits - Regular; Channel 0", "EventCode": "0x51", "EventName": "UNC_M2M_PMM_WPQ_CYCLES_REG_CREDITS.CHN0", "PerPkg": "1", @@ -16707,7 +16715,7 @@ "Unit": "M2M" }, { - "BriefDescription": "M2M->iMC WPQ Cycles w/Credits - Regular; Channel 1", + "BriefDescription": "M2M->iMC WPQ Cycles w/Credits - Regular; Channel 1", "EventCode": "0x51", "EventName": "UNC_M2M_PMM_WPQ_CYCLES_REG_CREDITS.CHN1", "PerPkg": "1", @@ -16715,7 +16723,7 @@ "Unit": "M2M" }, { - "BriefDescription": "M2M->iMC WPQ Cycles w/Credits - Regular; Channel 2", + "BriefDescription": "M2M->iMC WPQ Cycles w/Credits - Regular; Channel 2", "EventCode": "0x51", "EventName": "UNC_M2M_PMM_WPQ_CYCLES_REG_CREDITS.CHN2", "PerPkg": "1", @@ -16737,11 +16745,11 @@ "Unit": "M2M" }, { - "BriefDescription": "Prefecth requests that got turn into a demand request", + "BriefDescription": "Prefetch requests that got turn into a demand request", "EventCode": "0x56", "EventName": "UNC_M2M_PREFCAM_DEMAND_PROMOTIONS", "PerPkg": "1", - "PublicDescription": "Counts when the M2M (Mesh to Memory) promotes a outstanding request in the prefetch queue due to a subsequent demand read request that entered the M2M with the same address. Explanatory Side Note: The Prefecth queue is made of CAM (Content Addressable Memory)", + "PublicDescription": "Counts when the M2M (Mesh to Memory) promotes a outstanding request in the prefetch queue due to a subsequent demand read request that entered the M2M with the same address. Explanatory Side Note: The Prefetch queue is made of CAM (Content Addressable Memory)", "Unit": "M2M" }, { @@ -20804,7 +20812,7 @@ "EventCode": "0x40", "EventName": "UNC_M3UPI_RxC_BYPASSED.AD_S0_BL_ARB", "PerPkg": "1", - "PublicDescription": "Number ot times message is bypassed around the Ingress Queue; AD is taking bypass to slot 0 of independent flit while bl message is in arbitration", + "PublicDescription": "Number of times message is bypassed around the Ingress Queue; AD is taking bypass to slot 0 of independent flit while bl message is in arbitration", "UMask": "0x2", "Unit": "M3UPI" }, @@ -20813,7 +20821,7 @@ "EventCode": "0x40", "EventName": "UNC_M3UPI_RxC_BYPASSED.AD_S0_IDLE", "PerPkg": "1", - "PublicDescription": "Number ot times message is bypassed around the Ingress Queue; AD is taking bypass to slot 0 of independent flit while pipeline is idle", + "PublicDescription": "Number of times message is bypassed around the Ingress Queue; AD is taking bypass to slot 0 of independent flit while pipeline is idle", "UMask": "0x1", "Unit": "M3UPI" }, @@ -20822,7 +20830,7 @@ "EventCode": "0x40", "EventName": "UNC_M3UPI_RxC_BYPASSED.AD_S1_BL_SLOT", "PerPkg": "1", - "PublicDescription": "Number ot times message is bypassed around the Ingress Queue; AD is taking bypass to flit slot 1 while merging with bl message in same flit", + "PublicDescription": "Number of times message is bypassed around the Ingress Queue; AD is taking bypass to flit slot 1 while merging with bl message in same flit", "UMask": "0x4", "Unit": "M3UPI" }, @@ -20831,7 +20839,7 @@ "EventCode": "0x40", "EventName": "UNC_M3UPI_RxC_BYPASSED.AD_S2_BL_SLOT", "PerPkg": "1", - "PublicDescription": "Number ot times message is bypassed around the Ingress Queue; AD is taking bypass to flit slot 2 while merging with bl message in same flit", + "PublicDescription": "Number of times message is bypassed around the Ingress Queue; AD is taking bypass to flit slot 2 while merging with bl message in same flit", "UMask": "0x8", "Unit": "M3UPI" }, @@ -21397,7 +21405,7 @@ "Unit": "M3UPI" }, { - "BriefDescription": "Flit Gen - Header 1; Acumullate", + "BriefDescription": "Flit Gen - Header 1; Accumulate", "EventCode": "0x53", "EventName": "UNC_M3UPI_RxC_FLIT_GEN_HDR1.ACCUM", "PerPkg": "1", @@ -24618,7 +24626,7 @@ "EventCode": "0x29", "EventName": "UNC_M3UPI_UPI_PREFETCH_SPAWN", "PerPkg": "1", - "PublicDescription": "Count cases where flow control queue that sits between the Intel Ultra Path Interconnect (UPI) and the mesh spawns a prefetch to the iMC (Memory Controller)", + "PublicDescription": "Count cases where flow control queue that sits between the Intel(R) Ultra Path Interconnect (UPI) and the mesh spawns a prefetch to the iMC (Memory Controller)", "Unit": "M3UPI" }, { @@ -24973,11 +24981,11 @@ "Unit": "M2M" }, { - "BriefDescription": "Clocks of the Intel Ultra Path Interconnect (UPI)", + "BriefDescription": "Clocks of the Intel(R) Ultra Path Interconnect (UPI)", "EventCode": "0x1", "EventName": "UNC_UPI_CLOCKTICKS", "PerPkg": "1", - "PublicDescription": "Counts clockticks of the fixed frequency clock controlling the Intel Ultra Path Interconnect (UPI). This clock runs at1/8th the 'GT/s' speed of the UPI link. For example, a 9.6GT/s link will have a fixed Frequency of 1.2 Ghz.", + "PublicDescription": "Counts clockticks of the fixed frequency clock controlling the Intel(R) Ultra Path Interconnect (UPI). This clock runs at1/8th the 'GT/s' speed of the UPI link. For example, a 9.6GT/s link will have a fixed Frequency of 1.2 Ghz.", "Unit": "UPI LL" }, { @@ -24999,11 +25007,11 @@ "Unit": "UPI LL" }, { - "BriefDescription": "Data Response packets that go direct to Intel UPI", + "BriefDescription": "Data Response packets that go direct to Intel(R) UPI", "EventCode": "0x12", "EventName": "UNC_UPI_DIRECT_ATTEMPTS.D2U", "PerPkg": "1", - "PublicDescription": "Counts Data Response (DRS) packets that attempted to go direct to Intel Ultra Path Interconnect (UPI) bypassing the CHA .", + "PublicDescription": "Counts Data Response (DRS) packets that attempted to go direct to Intel(R) Ultra Path Interconnect (UPI) bypassing the CHA .", "UMask": "0x2", "Unit": "UPI LL" }, @@ -25072,11 +25080,11 @@ "Unit": "UPI LL" }, { - "BriefDescription": "Cycles Intel UPI is in L1 power mode (shutdown)", + "BriefDescription": "Cycles Intel(R) UPI is in L1 power mode (shutdown)", "EventCode": "0x21", "EventName": "UNC_UPI_L1_POWER_CYCLES", "PerPkg": "1", - "PublicDescription": "Counts cycles when the Intel Ultra Path Interconnect (UPI) is in L1 power mode. L1 is a mode that totally shuts down the UPI link. Link power states are per link and per direction, so for example the Tx direction could be in one state while Rx was in another, this event only coutns when both links are shutdown.", + "PublicDescription": "Counts cycles when the Intel(R) Ultra Path Interconnect (UPI) is in L1 power mode. L1 is a mode that totally shuts down the UPI link. Link power states are per link and per direction, so for example the Tx direction could be in one state while Rx was in another, this event only coutns when both links are shutdown.", "Unit": "UPI LL" }, { @@ -25238,11 +25246,11 @@ "Unit": "UPI LL" }, { - "BriefDescription": "Cycles the Rx of the Intel UPI is in L0p power mode", + "BriefDescription": "Cycles the Rx of the Intel(R) UPI is in L0p power mode", "EventCode": "0x25", "EventName": "UNC_UPI_RxL0P_POWER_CYCLES", "PerPkg": "1", - "PublicDescription": "Counts cycles when the receive side (Rx) of the Intel Ultra Path Interconnect(UPI) is in L0p power mode. L0p is a mode where we disable 60% of the UPI lanes, decreasing our bandwidth in order to save power.", + "PublicDescription": "Counts cycles when the receive side (Rx) of the Intel(R) Ultra Path Interconnect(UPI) is in L0p power mode. L0p is a mode where we disable 60% of the UPI lanes, decreasing our bandwidth in order to save power.", "Unit": "UPI LL" }, { @@ -25451,7 +25459,7 @@ "EventCode": "0x3", "EventName": "UNC_UPI_RxL_FLITS.ALL_DATA", "PerPkg": "1", - "PublicDescription": "Counts valid data FLITs (80 bit FLow control unITs: 64bits of data) received from any of the 3 Intel Ultra Path Interconnect (UPI) Receive Queue slots on this UPI unit.", + "PublicDescription": "Counts valid data FLITs (80 bit FLow control unITs: 64bits of data) received from any of the 3 Intel(R) Ultra Path Interconnect (UPI) Receive Queue slots on this UPI unit.", "UMask": "0xf", "Unit": "UPI LL" }, @@ -25460,7 +25468,7 @@ "EventCode": "0x3", "EventName": "UNC_UPI_RxL_FLITS.ALL_NULL", "PerPkg": "1", - "PublicDescription": "Counts null FLITs (80 bit FLow control unITs) received from any of the 3 Intel Ultra Path Interconnect (UPI) Receive Queue slots on this UPI unit.", + "PublicDescription": "Counts null FLITs (80 bit FLow control unITs) received from any of the 3 Intel(R) Ultra Path Interconnect (UPI) Receive Queue slots on this UPI unit.", "UMask": "0x27", "Unit": "UPI LL" }, @@ -25784,11 +25792,11 @@ "Unit": "UPI LL" }, { - "BriefDescription": "Cycles in which the Tx of the Intel Ultra Path Interconnect (UPI) is in L0p power mode", + "BriefDescription": "Cycles in which the Tx of the Intel(R) Ultra Path Interconnect (UPI) is in L0p power mode", "EventCode": "0x27", "EventName": "UNC_UPI_TxL0P_POWER_CYCLES", "PerPkg": "1", - "PublicDescription": "Counts cycles when the transmit side (Tx) of the Intel Ultra Path Interconnect(UPI) is in L0p power mode. L0p is a mode where we disable 60% of the UPI lanes, decreasing our bandwidth in order to save power.", + "PublicDescription": "Counts cycles when the transmit side (Tx) of the Intel(R) Ultra Path Interconnect(UPI) is in L0p power mode. L0p is a mode where we disable 60% of the UPI lanes, decreasing our bandwidth in order to save power.", "Unit": "UPI LL" }, { @@ -25960,7 +25968,7 @@ "EventCode": "0x41", "EventName": "UNC_UPI_TxL_BYPASSED", "PerPkg": "1", - "PublicDescription": "Counts incoming FLITs (FLow control unITs) which bypassed the TxL(transmit) FLIT buffer and pass directly out the UPI Link. Generally, when data is transmitted across the Intel Ultra Path Interconnect (UPI), it will bypass the TxQ and pass directly to the link. However, the TxQ will be used in L0p (Low Power) mode and (Link Layer Retry) LLR mode, increasing latency to transfer out to the link.", + "PublicDescription": "Counts incoming FLITs (FLow control unITs) which bypassed the TxL(transmit) FLIT buffer and pass directly out the UPI Link. Generally, when data is transmitted across the Intel(R) Ultra Path Interconnect (UPI), it will bypass the TxQ and pass directly to the link. However, the TxQ will be used in L0p (Low Power) mode and (Link Layer Retry) LLR mode, increasing latency to transfer out to the link.", "Unit": "UPI LL" }, { @@ -25968,7 +25976,7 @@ "EventCode": "0x2", "EventName": "UNC_UPI_TxL_FLITS.ALL_DATA", "PerPkg": "1", - "PublicDescription": "Counts valid data FLITs (80 bit FLow control unITs: 64bits of data) transmitted (TxL) via any of the 3 Intel Ultra Path Interconnect (UPI) slots on this UPI unit.", + "PublicDescription": "Counts valid data FLITs (80 bit FLow control unITs: 64bits of data) transmitted (TxL) via any of the 3 Intel(R) Ultra Path Interconnect (UPI) slots on this UPI unit.", "UMask": "0xf", "Unit": "UPI LL" }, @@ -25977,7 +25985,7 @@ "EventCode": "0x2", "EventName": "UNC_UPI_TxL_FLITS.ALL_NULL", "PerPkg": "1", - "PublicDescription": "Counts null FLITs (80 bit FLow control unITs) transmitted via any of the 3 Intel Ulra Path Interconnect (UPI) slots on this UPI unit.", + "PublicDescription": "Counts null FLITs (80 bit FLow control unITs) transmitted via any of the 3 Intel(R) Ulra Path Interconnect (UPI) slots on this UPI unit.", "UMask": "0x27", "Unit": "UPI LL" }, @@ -26328,7 +26336,7 @@ "EventCode": "0x2", "EventName": "UPI_DATA_BANDWIDTH_TX", "PerPkg": "1", - "PublicDescription": "Counts valid data FLITs (80 bit FLow control unITs: 64bits of data) transmitted (TxL) via any of the 3 Intel Ultra Path Interconnect (UPI) slots on this UPI unit.", + "PublicDescription": "Counts valid data FLITs (80 bit FLow control unITs: 64bits of data) transmitted (TxL) via any of the 3 Intel(R) Ultra Path Interconnect (UPI) slots on this UPI unit.", "ScaleUnit": "7.11E-06Bytes", "UMask": "0xf", "Unit": "UPI LL" diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-power.json b/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-power.json index 6835e14cd42c..c6254af7a468 100644 --- a/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-power.json +++ b/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-power.json @@ -143,7 +143,7 @@ "EventCode": "0x80", "EventName": "UNC_P_POWER_STATE_OCCUPANCY.CORES_C0", "PerPkg": "1", - "PublicDescription": "This is an occupancy event that tracks the number of cores that are in the chosen C-State. It can be used by itself to get the average number of cores in that C-state with threshholding to generate histograms, or with other PCU events and occupancy triggering to capture other details.", + "PublicDescription": "This is an occupancy event that tracks the number of cores that are in the chosen C-State. It can be used by itself to get the average number of cores in that C-state with thresholding to generate histograms, or with other PCU events and occupancy triggering to capture other details.", "Unit": "PCU" }, { @@ -151,7 +151,7 @@ "EventCode": "0x80", "EventName": "UNC_P_POWER_STATE_OCCUPANCY.CORES_C3", "PerPkg": "1", - "PublicDescription": "This is an occupancy event that tracks the number of cores that are in the chosen C-State. It can be used by itself to get the average number of cores in that C-state with threshholding to generate histograms, or with other PCU events and occupancy triggering to capture other details.", + "PublicDescription": "This is an occupancy event that tracks the number of cores that are in the chosen C-State. It can be used by itself to get the average number of cores in that C-state with thresholding to generate histograms, or with other PCU events and occupancy triggering to capture other details.", "Unit": "PCU" }, { @@ -159,7 +159,7 @@ "EventCode": "0x80", "EventName": "UNC_P_POWER_STATE_OCCUPANCY.CORES_C6", "PerPkg": "1", - "PublicDescription": "This is an occupancy event that tracks the number of cores that are in the chosen C-State. It can be used by itself to get the average number of cores in that C-state with threshholding to generate histograms, or with other PCU events and occupancy triggering to capture other details.", + "PublicDescription": "This is an occupancy event that tracks the number of cores that are in the chosen C-State. It can be used by itself to get the average number of cores in that C-state with thresholding to generate histograms, or with other PCU events and occupancy triggering to capture other details.", "Unit": "PCU" }, { @@ -175,7 +175,7 @@ "EventCode": "0x9", "EventName": "UNC_P_PROCHOT_INTERNAL_CYCLES", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles that we are in Interal PROCHOT mode. This mode is triggered when a sensor on the die determines that we are too hot and must throttle to avoid damaging the chip.", + "PublicDescription": "Counts the number of cycles that we are in Internal PROCHOT mode. This mode is triggered when a sensor on the die determines that we are too hot and must throttle to avoid damaging the chip.", "Unit": "PCU" }, { diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv index cad51223d0ea..793076e00188 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -5,7 +5,7 @@ GenuineIntel-6-(1C|26|27|35|36),v4,bonnell,core GenuineIntel-6-(3D|47),v26,broadwell,core GenuineIntel-6-56,v7,broadwellde,core GenuineIntel-6-4F,v19,broadwellx,core -GenuineIntel-6-55-[56789ABCDEF],v1.16,cascadelakex,core +GenuineIntel-6-55-[56789ABCDEF],v1.17,cascadelakex,core GenuineIntel-6-9[6C],v1.03,elkhartlake,core GenuineIntel-6-5[CF],v13,goldmont,core GenuineIntel-6-7A,v1.01,goldmontplus,core From patchwork Sun Feb 19 09:28:13 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59106 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp772481wrn; Sun, 19 Feb 2023 01:33:16 -0800 (PST) X-Google-Smtp-Source: AK7set/oWeYYuL41Hp9vim4YhrbAIp1ng1z9ZMJD7ddGMIoLpC2kNmttQMvF9fw3x6WiFMTJBEeh X-Received: by 2002:a17:90b:17d0:b0:233:f393:f6cd with SMTP id me16-20020a17090b17d000b00233f393f6cdmr2314807pjb.5.1676799195734; Sun, 19 Feb 2023 01:33:15 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799195; cv=none; d=google.com; s=arc-20160816; b=RLL+5tcgoZjuE8RDQtVSFKpbt/F8cThz72FhJcaWKeIhu58RNWYfcMT90auu6LCFhd 8qGAszHPxilkX6oluiHU8uVCJFCMxJeM+d8EqYbQA+dD2uLq3fcC1yJpEs6Vo3hL9CZL cIHFnPkWs8YFGW608mdnpotnwBTuiHDQf4brWmjpIGb63e5vIAems/C5WSoME7tOx3lH rQl70rNAN8SRvdTBdceuId4hrezeDo9WhUigA1T1R5hA2EeZvS9dEJE0vvoejYGtKbiH 2Eir7iQmo245ia7LW3kCY12a0LjcON/g+TwzeD+6LFUdEHnl6of6sriS/dFa0fxhtT45 S9Hg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:from:subject :references:mime-version:message-id:in-reply-to:date:dkim-signature; bh=AiAycDrybveoMuJFEB+r5dLTNOyKRMMZ7bT1bkVL/x4=; b=gZUtDZp+pzeLlRh89+gMXpv0IY1nRmZm98HKymquR8D8phRurFzJmT+XacP9LRbANc oBROjzYzxs5jTKhFow+ypJVM7D1HSPIB5J+cadvX8ez2355BwNGlaoDuteXil9fTZZxh UAbECa4iJwROOHTJ7gxWWtpf4Mxk6264FdGW9TS/dUWUHtC06FMCeo/bd6XI/RA14r3A 2zTv6J9x3ZvbXtZ5ZjP5zIA89+IsFK+uWApxVHEz75SOTO3EjdlUXtY2o7dS0rQc27/q d9Kz85Z2UcNR9hkE5oOMGJdpnSyk6z4nIFt1DmuU7bENN8bSqQPGF61KqcwU2hlXCdWz MOaw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=mAGNMChI; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id r5-20020a632b05000000b004fbc2116e12si10777249pgr.587.2023.02.19.01.33.03; Sun, 19 Feb 2023 01:33:15 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=mAGNMChI; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229914AbjBSJcS (ORCPT + 99 others); Sun, 19 Feb 2023 04:32:18 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53772 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229761AbjBSJby (ORCPT ); Sun, 19 Feb 2023 04:31:54 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 20B72CDF0 for ; Sun, 19 Feb 2023 01:31:16 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id k2-20020a056902070200b007eba3f8e3baso279243ybt.4 for ; Sun, 19 Feb 2023 01:31:16 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:from:to:cc:subject:date :message-id:reply-to; bh=AiAycDrybveoMuJFEB+r5dLTNOyKRMMZ7bT1bkVL/x4=; b=mAGNMChIxmCLNL67y+hstMcY9oXuPTwb2KsmXQc1lkpPbxb+zHBfH+Visl9kjNSuSh wX1Z7ZX062igns/0RVUrhLbGGAog8e6kKBKbUTxV/xW3AfTK0T93XWMO/RfNP6DqI2oe vyrPpPeGUg5U3OkeT/X5v4XKwr08PlzWm6SUW/2xKwkgFutbxQ/hpiKmh5PUfMniw6yC L2pItgOROrWWhX3BVpGABcNH6gBD6qpnu19VQGXlhj4V1/O6TFK5/bco3NXfaIAO87q2 V7LOi/WUATuoO9aAwd0Q0QdbEpOmnQYO9Vt31PcprgN2Z2X5Spx9WVMCgXzZ3O+dZpYO BSug== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=AiAycDrybveoMuJFEB+r5dLTNOyKRMMZ7bT1bkVL/x4=; b=Bbl3pdgiBym+p6V9ZsGd04TgSStGlUEHDCGaS8mieyC0ShkVk4Zc8+vj8eQAyM5Na5 q1RcQsJwuOVF7C5KKsHuZiE0ko3w33071dTomY+5oB/xTIH4E/6Rym6n3QXsurAlR+6r FBgiqx7o7Y2cjXmMM+HylfHt35FfU6MimZBmevqbvCu7EiD2ereCj2EGfQKxKFb1jdsP fSyh6S7x5tTBEoG0VPzXm5vx1eEUgJff5q/TU1oEOUXK94F2Qn9bsRYuDaNCGG02I8Fd xx6GhGGTmXxvezMFx9ohrdCkAMiKEJKIJmZAGiDfYucxE+6BJicgvJuaHAdtiQWcq9As LqHg== X-Gm-Message-State: AO0yUKXwwfA/lGpUIMFbh9mnHamQJb3Bb5jcaB4TvUm4Xah3jmE6ww+F /yT+bgDxiIth1TVMDM4Fa+dM4VUcWWfH X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a05:6902:351:b0:8ac:72e3:c743 with SMTP id e17-20020a056902035100b008ac72e3c743mr14973ybs.9.1676799074539; Sun, 19 Feb 2023 01:31:14 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:13 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-17-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 16/51] perf vendor events intel: Add graniterapids events From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758251393797229665?= X-GMAIL-MSGID: =?utf-8?q?1758251393797229665?= Add version 1.00 of the graniterapids events from https://github.com/intel/perfmon. Signed-off-by: Ian Rogers --- .../arch/x86/graniterapids/cache.json | 54 ++++++ .../arch/x86/graniterapids/frontend.json | 10 + .../arch/x86/graniterapids/memory.json | 174 ++++++++++++++++++ .../arch/x86/graniterapids/other.json | 29 +++ .../arch/x86/graniterapids/pipeline.json | 102 ++++++++++ .../x86/graniterapids/virtual-memory.json | 26 +++ tools/perf/pmu-events/arch/x86/mapfile.csv | 1 + 7 files changed, 396 insertions(+) create mode 100644 tools/perf/pmu-events/arch/x86/graniterapids/cache.json create mode 100644 tools/perf/pmu-events/arch/x86/graniterapids/frontend.json create mode 100644 tools/perf/pmu-events/arch/x86/graniterapids/memory.json create mode 100644 tools/perf/pmu-events/arch/x86/graniterapids/other.json create mode 100644 tools/perf/pmu-events/arch/x86/graniterapids/pipeline.json create mode 100644 tools/perf/pmu-events/arch/x86/graniterapids/virtual-memory.json diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/cache.json b/tools/perf/pmu-events/arch/x86/graniterapids/cache.json new file mode 100644 index 000000000000..56212827870c --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/graniterapids/cache.json @@ -0,0 +1,54 @@ +[ + { + "BriefDescription": "L2 code requests", + "EventCode": "0x24", + "EventName": "L2_RQSTS.ALL_CODE_RD", + "PublicDescription": "Counts the total number of L2 code requests.", + "SampleAfterValue": "200003", + "UMask": "0xe4" + }, + { + "BriefDescription": "Demand Data Read access L2 cache", + "EventCode": "0x24", + "EventName": "L2_RQSTS.ALL_DEMAND_DATA_RD", + "PublicDescription": "Counts Demand Data Read requests accessing the L2 cache. These requests may hit or miss L2 cache. True-miss exclude misses that were merged with ongoing L2 misses. An access is counted once.", + "SampleAfterValue": "200003", + "UMask": "0xe1" + }, + { + "BriefDescription": "Core-originated cacheable requests that missed L3 (Except hardware prefetches to the L3)", + "EventCode": "0x2e", + "EventName": "LONGEST_LAT_CACHE.MISS", + "PublicDescription": "Counts core-originated cacheable requests that miss the L3 cache (Longest Latency cache). Requests include data and code reads, Reads-for-Ownership (RFOs), speculative accesses and hardware prefetches to the L1 and L2. It does not include hardware prefetches to the L3, and may not count other types of requests to the L3.", + "SampleAfterValue": "100003", + "UMask": "0x41" + }, + { + "BriefDescription": "Core-originated cacheable requests that refer to L3 (Except hardware prefetches to the L3)", + "EventCode": "0x2e", + "EventName": "LONGEST_LAT_CACHE.REFERENCE", + "PublicDescription": "Counts core-originated cacheable requests to the L3 cache (Longest Latency cache). Requests include data and code reads, Reads-for-Ownership (RFOs), speculative accesses and hardware prefetches to the L1 and L2. It does not include hardware prefetches to the L3, and may not count other types of requests to the L3.", + "SampleAfterValue": "100003", + "UMask": "0x4f" + }, + { + "BriefDescription": "Retired load instructions.", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.ALL_LOADS", + "PEBS": "1", + "PublicDescription": "Counts all retired load instructions. This event accounts for SW prefetch instructions of PREFETCHNTA or PREFETCHT0/1/2 or PREFETCHW.", + "SampleAfterValue": "1000003", + "UMask": "0x81" + }, + { + "BriefDescription": "Retired store instructions.", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.ALL_STORES", + "PEBS": "1", + "PublicDescription": "Counts all retired store instructions.", + "SampleAfterValue": "1000003", + "UMask": "0x82" + } +] diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/frontend.json b/tools/perf/pmu-events/arch/x86/graniterapids/frontend.json new file mode 100644 index 000000000000..dfd9c5ea1584 --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/graniterapids/frontend.json @@ -0,0 +1,10 @@ +[ + { + "BriefDescription": "This event counts a subset of the Topdown Slots event that were no operation was delivered to the back-end pipeline due to instruction fetch limitations when the back-end could have accepted more operations. Common examples include instruction cache misses or x86 instruction decode limitations.", + "EventCode": "0x9c", + "EventName": "IDQ_BUBBLES.CORE", + "PublicDescription": "This event counts a subset of the Topdown Slots event that were no operation was delivered to the back-end pipeline due to instruction fetch limitations when the back-end could have accepted more operations. Common examples include instruction cache misses or x86 instruction decode limitations.\nThe count may be distributed among unhalted logical processors (hyper-threads) who share the same physical core, in processors that support Intel Hyper-Threading Technology. Software can use this event as the nominator for the Frontend Bound metric (or top-level category) of the Top-down Microarchitecture Analysis method.", + "SampleAfterValue": "1000003", + "UMask": "0x1" + } +] diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/memory.json b/tools/perf/pmu-events/arch/x86/graniterapids/memory.json new file mode 100644 index 000000000000..1c0e0e86e58e --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/graniterapids/memory.json @@ -0,0 +1,174 @@ +[ + { + "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 128 cycles.", + "Data_LA": "1", + "EventCode": "0xcd", + "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_128", + "MSRIndex": "0x3F6", + "MSRValue": "0x80", + "PEBS": "2", + "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 128 cycles. Reported latency may be longer than just the memory latency.", + "SampleAfterValue": "1009", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 16 cycles.", + "Data_LA": "1", + "EventCode": "0xcd", + "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_16", + "MSRIndex": "0x3F6", + "MSRValue": "0x10", + "PEBS": "2", + "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 16 cycles. Reported latency may be longer than just the memory latency.", + "SampleAfterValue": "20011", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 256 cycles.", + "Data_LA": "1", + "EventCode": "0xcd", + "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_256", + "MSRIndex": "0x3F6", + "MSRValue": "0x100", + "PEBS": "2", + "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 256 cycles. Reported latency may be longer than just the memory latency.", + "SampleAfterValue": "503", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 32 cycles.", + "Data_LA": "1", + "EventCode": "0xcd", + "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32", + "MSRIndex": "0x3F6", + "MSRValue": "0x20", + "PEBS": "2", + "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 32 cycles. Reported latency may be longer than just the memory latency.", + "SampleAfterValue": "100007", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 4 cycles.", + "Data_LA": "1", + "EventCode": "0xcd", + "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4", + "MSRIndex": "0x3F6", + "MSRValue": "0x4", + "PEBS": "2", + "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 4 cycles. Reported latency may be longer than just the memory latency.", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 512 cycles.", + "Data_LA": "1", + "EventCode": "0xcd", + "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512", + "MSRIndex": "0x3F6", + "MSRValue": "0x200", + "PEBS": "2", + "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 512 cycles. Reported latency may be longer than just the memory latency.", + "SampleAfterValue": "101", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 64 cycles.", + "Data_LA": "1", + "EventCode": "0xcd", + "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64", + "MSRIndex": "0x3F6", + "MSRValue": "0x40", + "PEBS": "2", + "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 64 cycles. Reported latency may be longer than just the memory latency.", + "SampleAfterValue": "2003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 8 cycles.", + "Data_LA": "1", + "EventCode": "0xcd", + "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_8", + "MSRIndex": "0x3F6", + "MSRValue": "0x8", + "PEBS": "2", + "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 8 cycles. Reported latency may be longer than just the memory latency.", + "SampleAfterValue": "50021", + "UMask": "0x1" + }, + { + "BriefDescription": "Retired memory store access operations. A PDist event for PEBS Store Latency Facility.", + "Data_LA": "1", + "EventCode": "0xcd", + "EventName": "MEM_TRANS_RETIRED.STORE_SAMPLE", + "PEBS": "2", + "PublicDescription": "Counts Retired memory accesses with at least 1 store operation. This PEBS event is the precisely-distributed (PDist) trigger covering all stores uops for sampling by the PEBS Store Latency Facility. The facility is described in Intel SDM Volume 3 section 19.9.8", + "SampleAfterValue": "1000003", + "UMask": "0x2" + }, + { + "BriefDescription": "Counts demand data reads that were not supplied by the local socket's L1, L2, or L3 caches.", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.L3_MISS", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3FBFC00001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) requests and software prefetches for exclusive ownership (PREFETCHW) that were not supplied by the local socket's L1, L2, or L3 caches.", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_RFO.L3_MISS", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3F3FC00002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Number of times an RTM execution aborted.", + "EventCode": "0xc9", + "EventName": "RTM_RETIRED.ABORTED", + "PublicDescription": "Counts the number of times RTM abort was triggered.", + "SampleAfterValue": "100003", + "UMask": "0x4" + }, + { + "BriefDescription": "Number of times an RTM execution successfully committed", + "EventCode": "0xc9", + "EventName": "RTM_RETIRED.COMMIT", + "PublicDescription": "Counts the number of times RTM commit succeeded.", + "SampleAfterValue": "100003", + "UMask": "0x2" + }, + { + "BriefDescription": "Number of times an RTM execution started.", + "EventCode": "0xc9", + "EventName": "RTM_RETIRED.START", + "PublicDescription": "Counts the number of times we entered an RTM region. Does not count nested transactions.", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Speculatively counts the number of TSX aborts due to a data capacity limitation for transactional reads", + "EventCode": "0x54", + "EventName": "TX_MEM.ABORT_CAPACITY_READ", + "PublicDescription": "Speculatively counts the number of Transactional Synchronization Extensions (TSX) aborts due to a data capacity limitation for transactional reads", + "SampleAfterValue": "100003", + "UMask": "0x80" + }, + { + "BriefDescription": "Speculatively counts the number of TSX aborts due to a data capacity limitation for transactional writes.", + "EventCode": "0x54", + "EventName": "TX_MEM.ABORT_CAPACITY_WRITE", + "PublicDescription": "Speculatively counts the number of Transactional Synchronization Extensions (TSX) aborts due to a data capacity limitation for transactional writes.", + "SampleAfterValue": "100003", + "UMask": "0x2" + }, + { + "BriefDescription": "Number of times a transactional abort was signaled due to a data conflict on a transactionally accessed address", + "EventCode": "0x54", + "EventName": "TX_MEM.ABORT_CONFLICT", + "PublicDescription": "Counts the number of times a TSX line had a cache conflict.", + "SampleAfterValue": "100003", + "UMask": "0x1" + } +] diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/other.json b/tools/perf/pmu-events/arch/x86/graniterapids/other.json new file mode 100644 index 000000000000..5e799bae03ea --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/graniterapids/other.json @@ -0,0 +1,29 @@ +[ + { + "BriefDescription": "Counts demand data reads that have any type of response.", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads that were supplied by DRAM attached to this socket, unless in Sub NUMA Cluster(SNC) Mode. In SNC Mode counts only those DRAM accesses that are controlled by the close SNC Cluster.", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x104000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) requests and software prefetches for exclusive ownership (PREFETCHW) that have any type of response.", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3F3FFC0002", + "SampleAfterValue": "100003", + "UMask": "0x1" + } +] diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/pipeline.json b/tools/perf/pmu-events/arch/x86/graniterapids/pipeline.json new file mode 100644 index 000000000000..d6aafb258708 --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/graniterapids/pipeline.json @@ -0,0 +1,102 @@ +[ + { + "BriefDescription": "All branch instructions retired.", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.ALL_BRANCHES", + "PEBS": "1", + "PublicDescription": "Counts all branch instructions retired.", + "SampleAfterValue": "400009" + }, + { + "BriefDescription": "All mispredicted branch instructions retired.", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", + "PEBS": "1", + "PublicDescription": "Counts all the retired branch instructions that were mispredicted by the processor. A branch misprediction occurs when the processor incorrectly predicts the destination of the branch. When the misprediction is discovered at execution, all the instructions executed in the wrong (speculative) path must be discarded, and the processor must start fetching from the correct path.", + "SampleAfterValue": "400009" + }, + { + "BriefDescription": "Reference cycles when the core is not in halt state.", + "EventName": "CPU_CLK_UNHALTED.REF_TSC", + "PublicDescription": "Counts the number of reference cycles when the core is not in a halt state. The core enters the halt state when it is running the HLT instruction or the MWAIT instruction. This event is not affected by core frequency changes (for example, P states, TM2 transitions) but has the same incrementing frequency as the time stamp counter. This event can approximate elapsed time while the core was not in a halt state. It is counted on a dedicated fixed counter, leaving the eight programmable counters available for other events. Note: On all current platforms this event stops counting during 'throttling (TM)' states duty off periods the processor is 'halted'. The counter update is done at a lower clock rate then the core clock the overflow status bit for this counter may appear 'sticky'. After the counter has overflowed and software clears the overflow status bit and resets the counter to less than MAX. The reset value to the counter is not clocked immediately so the overflow status bit will flip 'high (1)' and generate another PMI (if enabled) after which the reset value gets clocked into the counter. Therefore, software will get the interrupt, read the overflow status bit '1 for bit 34 while the counter value is less than MAX. Software should ignore this case.", + "SampleAfterValue": "2000003", + "UMask": "0x3" + }, + { + "BriefDescription": "Reference cycles when the core is not in halt state.", + "EventCode": "0x3c", + "EventName": "CPU_CLK_UNHALTED.REF_TSC_P", + "PublicDescription": "Counts the number of reference cycles when the core is not in a halt state. The core enters the halt state when it is running the HLT instruction or the MWAIT instruction. This event is not affected by core frequency changes (for example, P states, TM2 transitions) but has the same incrementing frequency as the time stamp counter. This event can approximate elapsed time while the core was not in a halt state. It is counted on a dedicated fixed counter, leaving the four (eight when Hyperthreading is disabled) programmable counters available for other events. Note: On all current platforms this event stops counting during 'throttling (TM)' states duty off periods the processor is 'halted'. The counter update is done at a lower clock rate then the core clock the overflow status bit for this counter may appear 'sticky'. After the counter has overflowed and software clears the overflow status bit and resets the counter to less than MAX. The reset value to the counter is not clocked immediately so the overflow status bit will flip 'high (1)' and generate another PMI (if enabled) after which the reset value gets clocked into the counter. Therefore, software will get the interrupt, read the overflow status bit '1 for bit 34 while the counter value is less than MAX. Software should ignore this case.", + "SampleAfterValue": "2000003", + "UMask": "0x1" + }, + { + "BriefDescription": "Core cycles when the thread is not in halt state", + "EventName": "CPU_CLK_UNHALTED.THREAD", + "PublicDescription": "Counts the number of core cycles while the thread is not in a halt state. The thread enters the halt state when it is running the HLT instruction. This event is a component in many key event ratios. The core frequency may change from time to time due to transitions associated with Enhanced Intel SpeedStep Technology or TM2. For this reason this event may have a changing ratio with regards to time. When the core frequency is constant, this event can approximate elapsed time while the core was not in the halt state. It is counted on a dedicated fixed counter, leaving the eight programmable counters available for other events.", + "SampleAfterValue": "2000003", + "UMask": "0x2" + }, + { + "BriefDescription": "Thread cycles when thread is not in halt state", + "EventCode": "0x3c", + "EventName": "CPU_CLK_UNHALTED.THREAD_P", + "PublicDescription": "This is an architectural event that counts the number of thread cycles while the thread is not in a halt state. The thread enters the halt state when it is running the HLT instruction. The core frequency may change from time to time due to power or thermal throttling. For this reason, this event may have a changing ratio with regards to wall clock time.", + "SampleAfterValue": "2000003" + }, + { + "BriefDescription": "Number of instructions retired. Fixed Counter - architectural event", + "EventName": "INST_RETIRED.ANY", + "PEBS": "1", + "PublicDescription": "Counts the number of X86 instructions retired - an Architectural PerfMon event. Counting continues during hardware interrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed counter freeing up programmable counters to count other events. INST_RETIRED.ANY_P is counted by a programmable counter.", + "SampleAfterValue": "2000003", + "UMask": "0x1" + }, + { + "BriefDescription": "Number of instructions retired. General Counter - architectural event", + "EventCode": "0xc0", + "EventName": "INST_RETIRED.ANY_P", + "PEBS": "1", + "PublicDescription": "Counts the number of X86 instructions retired - an Architectural PerfMon event. Counting continues during hardware interrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed counter freeing up programmable counters to count other events. INST_RETIRED.ANY_P is counted by a programmable counter.", + "SampleAfterValue": "2000003" + }, + { + "BriefDescription": "Loads blocked due to overlapping with a preceding store that cannot be forwarded.", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.STORE_FORWARD", + "PublicDescription": "Counts the number of times where store forwarding was prevented for a load operation. The most common case is a load blocked due to the address of memory access (partially) overlapping with a preceding uncompleted store. Note: See the table of not supported store forwards in the Optimization Guide.", + "SampleAfterValue": "100003", + "UMask": "0x82" + }, + { + "BriefDescription": "This event counts a subset of the Topdown Slots event that were not consumed by the back-end pipeline due to lack of back-end resources, as a result of memory subsystem delays, execution units limitations, or other conditions.", + "EventCode": "0xa4", + "EventName": "TOPDOWN.BACKEND_BOUND_SLOTS", + "PublicDescription": "This event counts a subset of the Topdown Slots event that were not consumed by the back-end pipeline due to lack of back-end resources, as a result of memory subsystem delays, execution units limitations, or other conditions.\nThe count is distributed among unhalted logical processors (hyper-threads) who share the same physical core, in processors that support Intel Hyper-Threading Technology. Software can use this event as the nominator for the Backend Bound metric (or top-level category) of the Top-down Microarchitecture Analysis method.", + "SampleAfterValue": "10000003", + "UMask": "0x2" + }, + { + "BriefDescription": "TMA slots available for an unhalted logical processor. Fixed counter - architectural event", + "EventName": "TOPDOWN.SLOTS", + "PublicDescription": "Number of available slots for an unhalted logical processor. The event increments by machine-width of the narrowest pipeline as employed by the Top-down Microarchitecture Analysis method (TMA). The count is distributed among unhalted logical processors (hyper-threads) who share the same physical core. Software can use this event as the denominator for the top-level metrics of the TMA method. This architectural event is counted on a designated fixed counter (Fixed Counter 3).", + "SampleAfterValue": "10000003", + "UMask": "0x4" + }, + { + "BriefDescription": "TMA slots available for an unhalted logical processor. General counter - architectural event", + "EventCode": "0xa4", + "EventName": "TOPDOWN.SLOTS_P", + "PublicDescription": "Counts the number of available slots for an unhalted logical processor. The event increments by machine-width of the narrowest pipeline as employed by the Top-down Microarchitecture Analysis method. The count is distributed among unhalted logical processors (hyper-threads) who share the same physical core.", + "SampleAfterValue": "10000003", + "UMask": "0x1" + }, + { + "BriefDescription": "This event counts a subset of the Topdown Slots event that are utilized by operations that eventually get retired (committed) by the processor pipeline. Usually, this event positively correlates with higher performance for example, as measured by the instructions-per-cycle metric.", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.SLOTS", + "PublicDescription": "This event counts a subset of the Topdown Slots event that are utilized by operations that eventually get retired (committed) by the processor pipeline. Usually, this event positively correlates with higher performance for example, as measured by the instructions-per-cycle metric.\nSoftware can use this event as the nominator for the Retiring metric (or top-level category) of the Top-down Microarchitecture Analysis method.", + "SampleAfterValue": "2000003", + "UMask": "0x2" + } +] diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/virtual-memory.json b/tools/perf/pmu-events/arch/x86/graniterapids/virtual-memory.json new file mode 100644 index 000000000000..8784c97b7534 --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/graniterapids/virtual-memory.json @@ -0,0 +1,26 @@ +[ + { + "BriefDescription": "Load miss in all TLB levels causes a page walk that completes. (All page sizes)", + "EventCode": "0x12", + "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED", + "PublicDescription": "Counts completed page walks (all page sizes) caused by demand data loads. This implies it missed in the DTLB and further levels of TLB. The page walk can end with or without a fault.", + "SampleAfterValue": "100003", + "UMask": "0xe" + }, + { + "BriefDescription": "Store misses in all TLB levels causes a page walk that completes. (All page sizes)", + "EventCode": "0x13", + "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED", + "PublicDescription": "Counts completed page walks (all page sizes) caused by demand data stores. This implies it missed in the DTLB and further levels of TLB. The page walk can end with or without a fault.", + "SampleAfterValue": "100003", + "UMask": "0xe" + }, + { + "BriefDescription": "Code miss in all TLB levels causes a page walk that completes. (All page sizes)", + "EventCode": "0x11", + "EventName": "ITLB_MISSES.WALK_COMPLETED", + "PublicDescription": "Counts completed page walks (all page sizes) caused by a code fetch. This implies it missed in the ITLB (Instruction TLB) and further levels of TLB. The page walk can end with or without a fault.", + "SampleAfterValue": "100003", + "UMask": "0xe" + } +] diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv index 793076e00188..1677ec22e2e3 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -9,6 +9,7 @@ GenuineIntel-6-55-[56789ABCDEF],v1.17,cascadelakex,core GenuineIntel-6-9[6C],v1.03,elkhartlake,core GenuineIntel-6-5[CF],v13,goldmont,core GenuineIntel-6-7A,v1.01,goldmontplus,core +GenuineIntel-6-A[DE],v1.00,graniterapids,core GenuineIntel-6-(3C|45|46),v32,haswell,core GenuineIntel-6-3F,v26,haswellx,core GenuineIntel-6-(7D|7E|A7),v1.15,icelake,core From patchwork Sun Feb 19 09:28:14 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59107 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp772614wrn; Sun, 19 Feb 2023 01:33:44 -0800 (PST) X-Google-Smtp-Source: AK7set+vqKblAhLY5/5lbnBZHksrPRjzTolZVKXNnJcqV2axi+jhUhjydy1tbDgpsqEc93O0Ae7y X-Received: by 2002:a05:6a20:430e:b0:bf:d67e:5517 with SMTP id h14-20020a056a20430e00b000bfd67e5517mr8006648pzk.42.1676799224404; Sun, 19 Feb 2023 01:33:44 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799224; cv=none; d=google.com; s=arc-20160816; b=mbjJv2uPvhP5Gh3WgRFBetkSDMhkebyU1UC/03xk2ccGG9/q+i/cl0H2v17IGXEsM1 NRGZwUfnXDi58kgFGyJZWsGzMnqJ2yuzX6TYSG7F7UHOWpDlqe4lMm9KjR4je4aAiI42 ygRLOdyirGM1olfo7pnU+AkhGhZtJQfCnD4svVG2Xxuzo16RKyGFFcqB8BCxSuTUartb VLvt6Fk1WX9KjOaKE440FcHA54czp0oVfArscli3g1yeDtUXY4QA3dG6pFXz2OTIXh0y zQy4zqEc/Ktb7hbIEbbm9vcmC8r9LV2ox1fg9T5sd8MZStQoHy4B5M81OH9Sqy+gXPeq 6bJQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:from:subject :references:mime-version:message-id:in-reply-to:date:dkim-signature; bh=12j/l8hIL4EqqBqbMskbfCCagV0mCtiQ6sELKNyzC6k=; b=ABswsa9JvmLFyFuJgF2KEAbVgMATnWf2EeGYeHfBMrWyhb2s4cfYcQx3KqPn9c9VeL po1GntRNbux2b/ffqX/h65axOyEuEpQ3ifh/yl45Q+iiFbY7KMQcVZ3t0pnSPAwOhdPu lnLzVKkuEeX9lufECLUBvSpCVxZifDTECfJdWH52wjLIhqzNfeY5OonEqWOcymxeMs5p CKbfZ+PTuFxDrISHI5QjFLIBjzVI4l7Ahe/Klvxacy0hRf+DTzGW3wTZRbSBxq6H7FNw hanWmm60yzvUlsxoSx2gzoXJFcgBScoML66w9CpMyUel2ktDhCXe6y5oyLqSJge1JZUS VP1A== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=tHgg68q4; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id u134-20020a63798c000000b004fb95f728a4si10393880pgc.872.2023.02.19.01.33.32; Sun, 19 Feb 2023 01:33:44 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=tHgg68q4; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229882AbjBSJcb (ORCPT + 99 others); Sun, 19 Feb 2023 04:32:31 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53222 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229967AbjBSJcQ (ORCPT ); Sun, 19 Feb 2023 04:32:16 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 15440126F3 for ; Sun, 19 Feb 2023 01:31:34 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id 5-20020a250105000000b009802c10698cso865279ybb.22 for ; Sun, 19 Feb 2023 01:31:34 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:from:to:cc:subject:date :message-id:reply-to; bh=12j/l8hIL4EqqBqbMskbfCCagV0mCtiQ6sELKNyzC6k=; b=tHgg68q4PFukFvRMOJrzBpew87QAmd0Eda9PNGBpg+FesTkt6527LVyhmfWZJdnfWK fs/PHs/NfmafjBvmXYqmZAqRg5sNHt1ucUiGm4v0ok+QIWfRDbdHGOqoBhoC12uyCyIf i/OpifNLB8+1Zw4hgTHyDIcmF7GqPtK9mHRiYAcfoeYpkSV68NDszPXt6Tyomm1IQj43 mi8pDkUYpMY/fzWXKoWRN4WFUqUIR1Eg6qWQwevYistLw2jMfMzyAxhH3pLDZ6ON5Qq7 LGnE9/TfU5tw1IlwVS7o5u3i0FJzLidKrze2Jh8pa7/tQPkJKvlxCrldcANE63s2l+B6 wlXg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=12j/l8hIL4EqqBqbMskbfCCagV0mCtiQ6sELKNyzC6k=; b=D4UFfI3NIkxnZ3V5lCyFVwyvQJXqcqwO3ofFX/9Q8icmkFoLfK4ovUQIfkYKzBpuR/ oC61ZLgvgxFmvZOBRE+NPmQtMmRv6hThCkiUeSHbllJjaFhW5s1Zhbh5jXg0r2340qGn yKT01zAeT9NcmjkHs9cQZYfNqASHNjabOIQgdcrjiSuWaNAYyihwdSGXMpNub85Th07y YPB4QvmCNKIQiFUpgfn4RLudDB9fdWKnSMpgPDBjMcuoUYmdhz8PrwIMl6KM8dU5ZWY0 MzzZYdihS7I2AhJjLGXUXPPMhv2KOigkFgYjZ+5QfVc98GGlTgZ0BcTl09ZnIW2PhTjq ikcA== X-Gm-Message-State: AO0yUKWz5IPM9MQvNBqHFLlXp3oWir8sOxZT0hbpt5Wz2XMfyCUudW3d K2fYKWa40QO98N+9jssXW0dKiBI6TVmJ X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a05:6902:2:b0:915:6891:3e15 with SMTP id l2-20020a056902000200b0091568913e15mr34614ybh.12.1676799082985; Sun, 19 Feb 2023 01:31:22 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:14 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-18-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 17/51] perf vendor events intel: Refresh haswell metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758251423508682918?= X-GMAIL-MSGID: =?utf-8?q?1758251423508682918?= Update the haswell metrics to TMA version 4.5. Generation was done using https://github.com/intel/perfmon. Notable changes are TMA info metrics are renamed from their node name to be lower case and prefixed by tma_info_, MetricThreshold expressions are added, and the smi_cost metric group is added replicating existing hard coded metrics in stat-shadow. Signed-off-by: Ian Rogers --- .../arch/x86/haswell/hsw-metrics.json | 1220 ++++++++++------- 1 file changed, 687 insertions(+), 533 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/haswell/hsw-metrics.json b/tools/perf/pmu-events/arch/x86/haswell/hsw-metrics.json index 2e032beee542..2528418200bb 100644 --- a/tools/perf/pmu-events/arch/x86/haswell/hsw-metrics.json +++ b/tools/perf/pmu-events/arch/x86/haswell/hsw-metrics.json @@ -1,799 +1,953 @@ [ { - "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", - "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / SLOTS", - "MetricGroup": "PGO;TopdownL1;tma_L1_group", - "MetricName": "tma_frontend_bound", - "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Machine_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound.", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", - "MetricExpr": "4 * min(CPU_CLK_UNHALTED.THREAD, IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE) / SLOTS", - "MetricGroup": "Frontend;TopdownL2;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_latency", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: RS_EVENTS.EMPTY_END", + "BriefDescription": "C2 residency percent per package", + "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C2_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses.", - "MetricExpr": "ICACHE.IFDATA_STALL / CLKS", - "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_icache_misses", + "BriefDescription": "C3 residency percent per core", + "MetricExpr": "cstate_core@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses", - "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + ITLB_MISSES.WALK_DURATION) / CLKS", - "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_itlb_misses", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: ITLB_MISSES.WALK_COMPLETED", + "BriefDescription": "C3 residency percent per package", + "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", - "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY) / CLKS", - "MetricGroup": "FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_branch_resteers", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "BriefDescription": "C6 residency percent per core", + "MetricExpr": "cstate_core@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / CLKS", - "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_dsb_switches", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty.", + "BriefDescription": "C6 residency percent per package", + "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", - "MetricExpr": "ILD_STALL.LCP / CLKS", - "MetricGroup": "FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_lcp", - "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs.", + "BriefDescription": "C7 residency percent per core", + "MetricExpr": "cstate_core@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", - "MetricExpr": "2 * IDQ.MS_SWITCHES / CLKS", - "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_ms_switches", - "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES", + "BriefDescription": "C7 residency percent per package", + "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", - "MetricExpr": "tma_frontend_bound - tma_fetch_latency", - "MetricGroup": "FetchBW;Frontend;TopdownL2;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_bandwidth", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend.", - "ScaleUnit": "100%" + "BriefDescription": "Uncore frequency per die [GHZ]", + "MetricExpr": "tma_info_socket_clks / #num_dies / duration_time / 1e9", + "MetricGroup": "SoC", + "MetricName": "UNCORE_FREQ" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", - "MetricExpr": "(IDQ.ALL_MITE_CYCLES_ANY_UOPS - IDQ.ALL_MITE_CYCLES_4_UOPS) / CORE_CLKS / 2", - "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_fetch_bandwidth_group", - "MetricName": "tma_mite", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", + "BriefDescription": "Percentage of cycles spent in System Management Interrupts.", + "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)", + "MetricGroup": "smi", + "MetricName": "smi_cycles", + "MetricThreshold": "smi_cycles > 0.1", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", - "MetricExpr": "(IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4_UOPS) / CORE_CLKS / 2", - "MetricGroup": "DSB;FetchBW;TopdownL3;tma_fetch_bandwidth_group", - "MetricName": "tma_dsb", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", - "ScaleUnit": "100%" + "BriefDescription": "Number of SMI interrupts.", + "MetricExpr": "msr@smi@", + "MetricGroup": "smi", + "MetricName": "smi_num", + "ScaleUnit": "1SMI#" }, { - "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", - "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group", - "MetricName": "tma_bad_speculation", - "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", + "BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset", + "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_4k_aliasing", + "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", - "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL2;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_branch_mispredicts", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_info_slots", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_alu_op_utilization", + "MetricThreshold": "tma_alu_op_utilization > 0.6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", - "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts", - "MetricGroup": "BadSpec;MachineClears;TopdownL2;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_machine_clears", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT", + "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", + "MetricExpr": "100 * OTHER_ASSISTS.ANY_WB_ASSIST / tma_info_slots", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_assists", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.ANY", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "1 - (tma_frontend_bound + tma_bad_speculation + tma_retiring)", - "MetricGroup": "TopdownL1;tma_L1_group", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", + "MetricThreshold": "tma_backend_bound > 0.2", "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", - "MetricExpr": "((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING) + RESOURCE_STALLS.SB) / (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@ if IPC > 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@) / 2 - RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else RESOURCE_STALLS.SB) if #SMT_on else (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@ if IPC > 1.8 else (cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@ - RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else RESOURCE_STALLS.SB))) * tma_backend_bound", - "MetricGroup": "Backend;TopdownL2;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_memory_bound", - "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", - "MetricExpr": "max((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING) - CYCLE_ACTIVITY.STALLS_L1D_PENDING) / CLKS, 0)", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l1_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT_PS;MEM_LOAD_UOPS_RETIRED.HIT_LFB_PS", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + DTLB_LOAD_MISSES.WALK_DURATION) / CLKS", - "MetricGroup": "MemoryTLB;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_dtlb_load", - "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", - "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_store_fwd_blk", - "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", - "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO) / CLKS", - "MetricGroup": "Offcore;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_lock_latency", - "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOADS_PS", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary", - "MetricExpr": "Load_Miss_Real_Latency * LD_BLOCKS.NO_SR / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_split_loads", - "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS_PS", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset", - "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_4k_aliasing", - "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", + "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", + "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_bad_speculation", + "MetricThreshold": "tma_bad_speculation > 0.15", + "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", - "MetricExpr": "Load_Miss_Real_Latency * cpu@L1D_PEND_MISS.REQUEST_FB_FULL\\,cmask\\=1@ / CLKS", - "MetricGroup": "MemoryBW;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_fb_full", - "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory).", + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation", + "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM", + "MetricName": "tma_branch_mispredicts", + "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_info_branch_misprediction_cost, tma_mispredicts_resteers", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads", - "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_PENDING - CYCLE_ACTIVITY.STALLS_L2_PENDING) / CLKS", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l2_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L2_HIT_PS", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", + "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY) / tma_info_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_branch_resteers", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", - "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS) * CYCLE_ACTIVITY.STALLS_L2_PENDING / CLKS", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l3_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", + "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_cisc", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", - "MetricExpr": "(60 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS))) + 43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS)))) / CLKS", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_l3_bound_group", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(60 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS))) + 43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS)))) / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", - "MetricExpr": "43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS))) / CLKS", - "MetricGroup": "Offcore;Snoop;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_data_sharing", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT_PS", + "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_backend_bound - tma_memory_bound", + "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", - "MetricExpr": "29 * (MEM_LOAD_UOPS_RETIRED.L3_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS))) / CLKS", - "MetricGroup": "MemoryLat;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_l3_hit_latency", - "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS))) / tma_info_clks", + "MetricGroup": "Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_data_sharing", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", - "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on else OFFCORE_REQUESTS_BUFFER.SQ_FULL) / CORE_CLKS", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_sq_full", - "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). The Super Queue is used for requests to access the L2 cache or to go out to the Uncore.", + "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", + "MetricExpr": "10 * ARITH.DIVIDER_UOPS / tma_info_core_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_divider", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_UOPS", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", - "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS)) * CYCLE_ACTIVITY.STALLS_L2_PENDING / CLKS", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", + "MetricConstraint": "NO_GROUP_EVENTS_SMT", + "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS)) * CYCLE_ACTIVITY.STALLS_L2_PENDING / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_MISS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=6@) / CLKS", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_dram_bound_group", - "MetricName": "tma_mem_bandwidth", - "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that).", + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", + "MetricExpr": "(IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4_UOPS) / tma_info_core_clks / 2", + "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_dsb", + "MetricThreshold": "tma_dsb > 0.15 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / CLKS - tma_mem_bandwidth", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_dram_bound_group", - "MetricName": "tma_mem_latency", - "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that).", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_clks", + "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_dsb_switches", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Related metrics: tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_iptb, tma_lcp", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", - "MetricExpr": "RESOURCE_STALLS.SB / CLKS", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_store_bound", - "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_UOPS_RETIRED.ALL_STORES_PS", + "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", + "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + DTLB_LOAD_MISSES.WALK_DURATION) / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group", + "MetricName": "tma_dtlb_load", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", - "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / CLKS", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_store_bound_group", - "MetricName": "tma_store_latency", - "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full)", + "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", + "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + DTLB_STORE_MISSES.WALK_DURATION) / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group", + "MetricName": "tma_dtlb_store", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_UOPS_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing", - "MetricExpr": "60 * OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.HITM_OTHER_CORE / CLKS", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_store_bound_group", + "MetricExpr": "60 * OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.HITM_OTHER_CORE / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents rate of split store accesses", - "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / CORE_CLKS", - "MetricGroup": "TopdownL4;tma_store_bound_group", - "MetricName": "tma_split_stores", - "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_UOPS_RETIRED.SPLIT_STORES_PS", + "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_info_load_miss_real_latency * cpu@L1D_PEND_MISS.REQUEST_FB_FULL\\,cmask\\=1@ / tma_info_clks", + "MetricGroup": "MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricName": "tma_fb_full", + "MetricThreshold": "tma_fb_full > 0.3", + "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", - "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + DTLB_STORE_MISSES.WALK_DURATION) / CLKS", - "MetricGroup": "MemoryTLB;TopdownL4;tma_store_bound_group", - "MetricName": "tma_dtlb_store", - "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_UOPS_RETIRED.STLB_MISS_STORES_PS", + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", + "MetricExpr": "tma_frontend_bound - tma_fetch_latency", + "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Related metrics: tma_dsb_switches, tma_info_dsb_coverage, tma_info_iptb, tma_lcp", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", - "MetricExpr": "tma_backend_bound - tma_memory_bound", - "MetricGroup": "Backend;Compute;TopdownL2;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_core_bound", - "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", + "MetricExpr": "4 * min(CPU_CLK_UNHALTED.THREAD, IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE) / tma_info_slots", + "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: RS_EVENTS.EMPTY_END", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", - "MetricExpr": "10 * ARITH.DIVIDER_UOPS / CORE_CLKS", - "MetricGroup": "TopdownL3;tma_core_bound_group", - "MetricName": "tma_divider", - "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_UOPS", + "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", + "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / tma_info_slots", + "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", - "MetricExpr": "((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@ if IPC > 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@) / 2 - RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else RESOURCE_STALLS.SB) if #SMT_on else (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@ if IPC > 1.8 else (cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@ - RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else RESOURCE_STALLS.SB)) - RESOURCE_STALLS.SB - min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING)) / CLKS", - "MetricGroup": "PortsUtil;TopdownL3;tma_core_bound_group", - "MetricName": "tma_ports_utilization", - "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences", + "MetricExpr": "tma_microcode_sequencer", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_heavy_operations", + "MetricThreshold": "tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\,cmask\\=1@ / 2 if #SMT_on else (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) - RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) / CORE_CLKS)", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_0", - "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses.", + "MetricExpr": "ICACHE.IFDATA_STALL / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_icache_misses", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@) / 2 if #SMT_on else (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@) / CORE_CLKS)", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_1", - "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful.", - "ScaleUnit": "100%" + "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", + "MetricExpr": "tma_info_turbo_utilization * TSC / 1e9 / duration_time", + "MetricGroup": "Power;Summary", + "MetricName": "tma_info_average_frequency" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@) / 2 if #SMT_on else (cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@) / CORE_CLKS)", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_2", - "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop.", - "ScaleUnit": "100%" + "BriefDescription": "Branch instructions per taken branch.", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_bptkbranch" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise).", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@) / CORE_CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_3m", - "ScaleUnit": "100%" + "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "Pipeline", + "MetricName": "tma_info_clks" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", - "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / SLOTS", - "MetricGroup": "TopdownL5;tma_ports_utilized_3m_group", - "MetricName": "tma_alu_op_utilization", - "ScaleUnit": "100%" + "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_clks))", + "MetricGroup": "SMT", + "MetricName": "tma_info_core_clks" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch) Sample with: UOPS_DISPATCHED_PORT.PORT_0", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_0 / CORE_CLKS", - "MetricGroup": "Compute;TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_0", - "ScaleUnit": "100%" + "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_core_clks", + "MetricGroup": "Ret;SMT;TmaL1;tma_L1_group", + "MetricName": "tma_info_coreipc" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU) Sample with: UOPS_DISPATCHED_PORT.PORT_1", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_1 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_1", - "ScaleUnit": "100%" + "BriefDescription": "Cycles Per Instruction (per Logical Processor)", + "MetricExpr": "1 / tma_info_ipc", + "MetricGroup": "Mem;Pipeline", + "MetricName": "tma_info_cpi" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU) Sample with: UOPS_DISPATCHED.PORT_5", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_5 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_5", - "ScaleUnit": "100%" + "BriefDescription": "Average CPU Utilization", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": "HPC;Summary", + "MetricName": "tma_info_cpu_utilization" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU) Sample with: UOPS_DISPATCHED_PORT.PORT_6", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_6 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_6", - "ScaleUnit": "100%" + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD / OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_data_l2_mlp" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations Sample with: UOPS_DISPATCHED.PORT_2_3", - "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_2 + UOPS_DISPATCHED_PORT.PORT_3 + UOPS_DISPATCHED_PORT.PORT_7 - UOPS_DISPATCHED_PORT.PORT_4) / (2 * CORE_CLKS)", - "MetricGroup": "TopdownL5;tma_ports_utilized_3m_group", - "MetricName": "tma_load_op_utilization", - "ScaleUnit": "100%" + "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", + "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_REQUESTS.ALL) / 1e6 / duration_time / 1e3", + "MetricGroup": "HPC;Mem;MemoryBW;SoC;tma_issueBW", + "MetricName": "tma_info_dram_bw_use", + "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_fb_full, tma_mem_bandwidth, tma_sq_full" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [ICL+] Loads) Sample with: UOPS_DISPATCHED_PORT.PORT_2", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_2 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_load_op_utilization_group", - "MetricName": "tma_port_2", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_dsb_coverage", + "MetricThreshold": "tma_info_dsb_coverage < 0.7 & tma_info_ipc / 4 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_iptb, tma_lcp" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [ICL+] Loads) Sample with: UOPS_DISPATCHED_PORT.PORT_3", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_3 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_load_op_utilization_group", - "MetricName": "tma_port_3", - "ScaleUnit": "100%" + "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", + "MetricExpr": "(UOPS_EXECUTED.CORE / 2 / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@) if #SMT_on else UOPS_EXECUTED.CORE / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@))", + "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", + "MetricName": "tma_info_ilp" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations", - "MetricExpr": "tma_port_4", - "MetricGroup": "TopdownL5;tma_ports_utilized_3m_group", - "MetricName": "tma_store_op_utilization", - "ScaleUnit": "100%" + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Summary;TmaL1;tma_L1_group", + "MetricName": "tma_info_instructions", + "PublicDescription": "Total number of retired Instructions. Sample with: INST_RETIRED.PREC_DIST" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data) Sample with: UOPS_DISPATCHED_PORT.PORT_4", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_4 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_store_op_utilization_group", - "MetricName": "tma_port_4", - "ScaleUnit": "100%" - }, + "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Branches;Fed;InsType", + "MetricName": "tma_info_ipbranch", + "MetricThreshold": "tma_info_ipbranch < 8" + }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 7 ([HSW+]simple Store-address) Sample with: UOPS_DISPATCHED_PORT.PORT_7", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_7 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_store_op_utilization_group", - "MetricName": "tma_port_7", - "ScaleUnit": "100%" + "BriefDescription": "Instructions Per Cycle (per Logical Processor)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_clks", + "MetricGroup": "Ret;Summary", + "MetricName": "tma_info_ipc" }, { - "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group", - "MetricName": "tma_retiring", - "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.RETIRE_SLOTS", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_ipcall", + "MetricThreshold": "tma_info_ipcall < 200" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", - "MetricExpr": "tma_retiring - tma_heavy_operations", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_retiring_group", - "MetricName": "tma_light_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", + "MetricGroup": "Branches;OS", + "MetricName": "tma_info_ipfarbranch", + "MetricThreshold": "tma_info_ipfarbranch < 1e6" }, { - "BriefDescription": "This metric serves as an approximation of legacy x87 usage", - "MetricExpr": "INST_RETIRED.X87 * UPI / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Compute;TopdownL4;tma_fp_arith_group", - "MetricName": "tma_x87_use", - "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_LOADS", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipload", + "MetricThreshold": "tma_info_ipload < 3" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences", - "MetricExpr": "tma_microcode_sequencer", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_retiring_group", - "MetricName": "tma_heavy_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences. This highly-correlates with the uop length of these instructions/sequences.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).", + "MetricExpr": "tma_info_instructions / (UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * cpu@BR_MISP_EXEC.ALL_BRANCHES\\,umask\\=0xE4@)", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_indirect", + "MetricThreshold": "tma_info_ipmisp_indirect < 1e3" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / SLOTS", - "MetricGroup": "MicroSeq;TopdownL3;tma_heavy_operations_group", - "MetricName": "tma_microcode_sequencer", - "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS", - "ScaleUnit": "100%" + "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BadSpec;BrMispredicts", + "MetricName": "tma_info_ipmispredict", + "MetricThreshold": "tma_info_ipmispredict < 200" }, { - "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", - "MetricExpr": "100 * OTHER_ASSISTS.ANY_WB_ASSIST / SLOTS", - "MetricGroup": "TopdownL4;tma_microcode_sequencer_group", - "MetricName": "tma_assists", - "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.ANY", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_STORES", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipstore", + "MetricThreshold": "tma_info_ipstore < 8" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", - "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", - "MetricGroup": "TopdownL4;tma_microcode_sequencer_group", - "MetricName": "tma_cisc", - "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", - "ScaleUnit": "100%" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", + "MetricName": "tma_info_iptb", + "MetricThreshold": "tma_info_iptb < 9", + "PublicDescription": "Instruction per taken branch. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_lcp" }, { - "BriefDescription": "Instructions Per Cycle (per Logical Processor)", - "MetricExpr": "INST_RETIRED.ANY / CLKS", - "MetricGroup": "Ret;Summary", - "MetricName": "IPC" + "BriefDescription": "Instructions per speculative Unknown Branch Misprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_instructions / BACLEARS.ANY", + "MetricGroup": "Fed", + "MetricName": "tma_info_ipunknown_branch" }, { - "BriefDescription": "Uops Per Instruction", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY", - "MetricGroup": "Pipeline;Ret;Retire", - "MetricName": "UPI" + "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_cpi" }, { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW", - "MetricName": "UpTB" + "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_utilization", + "MetricThreshold": "tma_info_kernel_utilization > 0.05" }, { - "BriefDescription": "Cycles Per Instruction (per Logical Processor)", - "MetricExpr": "1 / IPC", - "MetricGroup": "Mem;Pipeline", - "MetricName": "CPI" + "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "Pipeline", - "MetricName": "CLKS" + "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "tma_info_l1d_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw_1t" }, { - "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "4 * CORE_CLKS", - "MetricGroup": "tma_L1_group", - "MetricName": "SLOTS" + "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l1mpki" }, { - "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", - "MetricExpr": "INST_RETIRED.ANY / CORE_CLKS", - "MetricGroup": "Ret;SMT;tma_L1_group", - "MetricName": "CoreIPC" + "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", - "MetricExpr": "(UOPS_EXECUTED.CORE / 2 / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@) if #SMT_on else UOPS_EXECUTED.CORE / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@))", - "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", - "MetricName": "ILP" + "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "tma_info_l2_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw_1t" }, { - "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else CLKS))", - "MetricGroup": "SMT", - "MetricName": "CORE_CLKS" + "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheMisses;Mem", + "MetricName": "tma_info_l2mpki" }, { - "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_LOADS", - "MetricGroup": "InsType", - "MetricName": "IpLoad" + "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "0", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_l3_cache_access_bw_1t" }, { - "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_STORES", - "MetricGroup": "InsType", - "MetricName": "IpStore" + "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw" }, { - "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Branches;Fed;InsType", - "MetricName": "IpBranch" + "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "tma_info_l3_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw_1t" }, { - "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "IpCall" + "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L3_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l3mpki" }, { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO", - "MetricName": "IpTB" + "BriefDescription": "Average Latency for L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_load_l2_miss_latency" }, { - "BriefDescription": "Branch instructions per taken branch. ", - "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "BpTkBranch" + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_load_l2_mlp" }, { - "BriefDescription": "Total number of retired Instructions Sample with: INST_RETIRED.PREC_DIST", - "MetricExpr": "INST_RETIRED.ANY", - "MetricGroup": "Summary;tma_L1_group", - "MetricName": "Instructions" + "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_UOPS_RETIRED.L1_MISS + MEM_LOAD_UOPS_RETIRED.HIT_LFB)", + "MetricGroup": "Mem;MemoryBound;MemoryLat", + "MetricName": "tma_info_load_miss_real_latency" + }, + { + "BriefDescription": "Average number of parallel requests to external memory", + "MetricExpr": "UNC_ARB_TRK_OCCUPANCY.ALL / UNC_ARB_TRK_OCCUPANCY.CYCLES_WITH_ANY_REQUEST", + "MetricGroup": "Mem;SoC", + "MetricName": "tma_info_mem_parallel_requests", + "PublicDescription": "Average number of parallel requests to external memory. Accounts for all requests" + }, + { + "BriefDescription": "Average latency of all requests to external memory (in Uncore cycles)", + "MetricExpr": "UNC_ARB_TRK_OCCUPANCY.ALL / UNC_ARB_TRK_REQUESTS.ALL", + "MetricGroup": "Mem;SoC", + "MetricName": "tma_info_mem_request_latency" + }, + { + "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES", + "MetricGroup": "Mem;MemoryBW;MemoryBound", + "MetricName": "tma_info_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)" + }, + { + "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", + "MetricExpr": "(ITLB_MISSES.WALK_DURATION + DTLB_LOAD_MISSES.WALK_DURATION + DTLB_STORE_MISSES.WALK_DURATION) / tma_info_core_clks", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_page_walks_utilization", + "MetricThreshold": "tma_info_page_walks_utilization > 0.5" }, { "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=1@", "MetricGroup": "Pipeline;Ret", - "MetricName": "Retire" + "MetricName": "tma_info_retire" }, { - "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", - "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", - "MetricGroup": "DSB;Fed;FetchBW", - "MetricName": "DSB_Coverage" + "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", + "MetricExpr": "4 * tma_info_core_clks", + "MetricGroup": "TmaL1;tma_L1_group", + "MetricName": "tma_info_slots" }, { - "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BadSpec;BrMispredicts", - "MetricName": "IpMispredict" + "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", + "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", + "MetricGroup": "SMT", + "MetricName": "tma_info_smt_2t_utilization" }, { - "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", - "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_UOPS_RETIRED.L1_MISS + MEM_LOAD_UOPS_RETIRED.HIT_LFB)", - "MetricGroup": "Mem;MemoryBound;MemoryLat", - "MetricName": "Load_Miss_Real_Latency" + "BriefDescription": "Socket actual clocks when any core is active on that socket", + "MetricExpr": "UNC_CLOCK.SOCKET", + "MetricGroup": "SoC", + "MetricName": "tma_info_socket_clks" }, { - "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)", - "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES", - "MetricGroup": "Mem;MemoryBW;MemoryBound", - "MetricName": "MLP" + "BriefDescription": "Average Frequency Utilization relative nominal frequency", + "MetricExpr": "tma_info_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_turbo_utilization" }, { - "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L1_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L1MPKI" + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY", + "MetricGroup": "Pipeline;Ret;Retire", + "MetricName": "tma_info_uoppi", + "MetricThreshold": "tma_info_uoppi > 1.05" }, { - "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L2_MISS / INST_RETIRED.ANY", - "MetricGroup": "Backend;CacheMisses;Mem", - "MetricName": "L2MPKI" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW", + "MetricName": "tma_info_uptb", + "MetricThreshold": "tma_info_uptb < 6" }, { - "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L3_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L3MPKI" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses", + "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + ITLB_MISSES.WALK_DURATION) / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_itlb_misses", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: ITLB_MISSES.WALK_COMPLETED", + "ScaleUnit": "100%" }, { - "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", - "MetricConstraint": "NO_NMI_WATCHDOG", - "MetricExpr": "(ITLB_MISSES.WALK_DURATION + DTLB_LOAD_MISSES.WALK_DURATION + DTLB_STORE_MISSES.WALK_DURATION) / CORE_CLKS", - "MetricGroup": "Mem;MemoryTLB", - "MetricName": "Page_Walks_Utilization" + "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", + "MetricExpr": "max((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING) - CYCLE_ACTIVITY.STALLS_L1D_PENDING) / tma_info_clks, 0)", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group", + "MetricName": "tma_l1_bound", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT_PS;MEM_LOAD_UOPS_RETIRED.HIT_LFB_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW" + "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads", + "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_PENDING - CYCLE_ACTIVITY.STALLS_L2_PENDING) / tma_info_clks", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l2_bound", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L2_HIT_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW" + "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", + "MetricConstraint": "NO_GROUP_EVENTS_SMT", + "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS) * CYCLE_ACTIVITY.STALLS_L2_PENDING / tma_info_clks", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l3_bound", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW" + "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "29 * (MEM_LOAD_UOPS_RETIRED.L3_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS))) / tma_info_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group", + "MetricName": "tma_l3_hit_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS. Related metrics: tma_mem_latency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "L1D_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW_1T" + "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", + "MetricExpr": "ILD_STALL.LCP / tma_info_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_lcp", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_iptb", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "L2_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW_1T" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", + "MetricExpr": "tma_retiring - tma_heavy_operations", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_light_operations", + "MetricThreshold": "tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "L3_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW_1T" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_2 + UOPS_DISPATCHED_PORT.PORT_3 + UOPS_DISPATCHED_PORT.PORT_7 - UOPS_DISPATCHED_PORT.PORT_4) / (2 * tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_load_op_utilization", + "MetricThreshold": "tma_load_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations. Sample with: UOPS_DISPATCHED.PORT_2_3", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "0", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "L3_Cache_Access_BW_1T" + "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO) / tma_info_clks", + "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group", + "MetricName": "tma_lock_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOADS_PS. Related metrics: tma_store_latency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average CPU Utilization", - "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", - "MetricGroup": "HPC;Summary", - "MetricName": "CPU_Utilization" + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts", + "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", - "MetricExpr": "Turbo_Utilization * TSC / 1e9 / duration_time", - "MetricGroup": "Power;Summary", - "MetricName": "Average_Frequency" + "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=6@) / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW", + "MetricName": "tma_mem_bandwidth", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_sq_full", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average Frequency Utilization relative nominal frequency", - "MetricExpr": "CLKS / CPU_CLK_UNHALTED.REF_TSC", - "MetricGroup": "Power", - "MetricName": "Turbo_Utilization" + "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_clks - tma_mem_bandwidth", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat", + "MetricName": "tma_mem_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_l3_hit_latency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", - "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", - "MetricGroup": "SMT", - "MetricName": "SMT_2T_Utilization" + "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING) + RESOURCE_STALLS.SB) / (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@ if tma_info_ipc > 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) if #SMT_on else min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@ if tma_info_ipc > 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) * tma_backend_bound", + "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "OS", - "MetricName": "Kernel_Utilization" + "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / tma_info_slots", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS", + "MetricName": "tma_microcode_sequencer", + "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches", + "ScaleUnit": "100%" }, { - "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", - "MetricGroup": "OS", - "MetricName": "Kernel_CPI" + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", + "MetricExpr": "(IDQ.ALL_MITE_CYCLES_ANY_UOPS - IDQ.ALL_MITE_CYCLES_4_UOPS) / tma_info_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_mite", + "MetricThreshold": "tma_mite > 0.1 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", - "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_REQUESTS.ALL) / 1e6 / duration_time / 1e3", - "MetricGroup": "HPC;Mem;MemoryBW;SoC", - "MetricName": "DRAM_BW_Use" + "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", + "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_clks", + "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average latency of all requests to external memory (in Uncore cycles)", - "MetricExpr": "MEM_Parallel_Requests", - "MetricGroup": "Mem;SoC", - "MetricName": "MEM_Request_Latency" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_0 / tma_info_core_clks", + "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_0", + "MetricThreshold": "tma_port_0 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average number of parallel requests to external memory. Accounts for all requests", - "MetricExpr": "UNC_ARB_TRK_OCCUPANCY.ALL / UNC_ARB_TRK_REQUESTS.ALL", - "MetricGroup": "Mem;SoC", - "MetricName": "MEM_Parallel_Requests" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_1 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_1", + "MetricThreshold": "tma_port_1 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_512b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Socket actual clocks when any core is active on that socket", - "MetricExpr": "UNC_CLOCK.SOCKET", - "MetricGroup": "SoC", - "MetricName": "Socket_CLKS" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [ICL+] Loads)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_2 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_group", + "MetricName": "tma_port_2", + "MetricThreshold": "tma_port_2 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [ICL+] Loads). Sample with: UOPS_DISPATCHED_PORT.PORT_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", - "MetricGroup": "Branches;OS", - "MetricName": "IpFarBranch" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [ICL+] Loads)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_3 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_group", + "MetricName": "tma_port_3", + "MetricThreshold": "tma_port_3 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [ICL+] Loads). Sample with: UOPS_DISPATCHED_PORT.PORT_3", + "ScaleUnit": "100%" }, { - "BriefDescription": "Uncore frequency per die [GHZ]", - "MetricExpr": "Socket_CLKS / #num_dies / duration_time / 1e9", - "MetricGroup": "SoC", - "MetricName": "UNCORE_FREQ" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data)", + "MetricExpr": "tma_store_op_utilization", + "MetricGroup": "TopdownL6;tma_L6_group;tma_issueSpSt;tma_store_op_utilization_group", + "MetricName": "tma_port_4", + "MetricThreshold": "tma_port_4 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data). Sample with: UOPS_DISPATCHED_PORT.PORT_4. Related metrics: tma_split_stores", + "ScaleUnit": "100%" }, { - "BriefDescription": "C3 residency percent per core", - "MetricExpr": "cstate_core@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Core_Residency", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_5 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_5", + "MetricThreshold": "tma_port_5 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per core", - "MetricExpr": "cstate_core@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Core_Residency", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_6 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_6", + "MetricThreshold": "tma_port_6 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_6. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "C7 residency percent per core", - "MetricExpr": "cstate_core@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Core_Residency", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 7 ([HSW+]simple Store-address)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_7 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_store_op_utilization_group", + "MetricName": "tma_port_7", + "MetricThreshold": "tma_port_7 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 7 ([HSW+]simple Store-address). Sample with: UOPS_DISPATCHED_PORT.PORT_7", "ScaleUnit": "100%" }, { - "BriefDescription": "C2 residency percent per package", - "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C2_Pkg_Residency", + "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@ if tma_info_ipc > 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB if #SMT_on else min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@ if tma_info_ipc > 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB - RESOURCE_STALLS.SB - min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING)) / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_ports_utilization", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, { - "BriefDescription": "C3 residency percent per package", - "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Pkg_Residency", + "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\,cmask\\=1@ / 2 if #SMT_on else (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0)) / tma_info_core_clks)", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_0", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per package", - "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Pkg_Residency", + "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@) / 2 if #SMT_on else (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@) / tma_info_core_clks)", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_1", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, { - "BriefDescription": "C7 residency percent per package", - "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Pkg_Residency", + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@) / 2 if #SMT_on else (cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@) / tma_info_core_clks)", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise).", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@) / tma_info_core_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_3m", + "MetricThreshold": "tma_ports_utilized_3m > 0.7 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1", + "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.RETIRE_SLOTS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_info_load_miss_real_latency * LD_BLOCKS.NO_SR / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_split_loads", + "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS_PS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents rate of split store accesses", + "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / tma_info_core_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group", + "MetricName": "tma_split_stores", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_UOPS_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_4", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", + "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on else OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group", + "MetricName": "tma_sq_full", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_mem_bandwidth", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", + "MetricExpr": "RESOURCE_STALLS.SB / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_UOPS_RETIRED.ALL_STORES_PS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", + "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_store_fwd_blk", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_clks", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group", + "MetricName": "tma_store_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_4 / tma_info_core_clks", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_store_op_utilization", + "MetricThreshold": "tma_store_op_utilization > 0.6", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric serves as an approximation of legacy x87 usage", + "MetricExpr": "INST_RETIRED.X87 * tma_info_uoppi / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", + "MetricName": "tma_x87_use", + "MetricThreshold": "tma_x87_use > 0.1", + "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", "ScaleUnit": "100%" } ] From patchwork Sun Feb 19 09:28:15 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59108 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp772644wrn; Sun, 19 Feb 2023 01:33:50 -0800 (PST) X-Google-Smtp-Source: AK7set9jS5aLxWbCG6ndzePSfn1Si14ZYerQdf7HF2d5Gi5W488V4VKa53vFhTHuc+D1266RTFhO X-Received: by 2002:a62:7b07:0:b0:5a8:2008:d1eb with SMTP id w7-20020a627b07000000b005a82008d1ebmr208187pfc.17.1676799229641; Sun, 19 Feb 2023 01:33:49 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799229; cv=none; d=google.com; s=arc-20160816; b=x0cCkkMdxFPflNkA7jhrSZorCSEBiq4wmDu+LbCz2cSN3s1zBcHFLnt6bZEA11qWaS OMyjWwtyPgWlYxLlS/cAxZm79yYFIto0bCcN9Ym3cSoIkkngisn/BPeIuZqJS8g2pg9i BoAktoPfApDohnFOdsMmot+PVgioUGA31LD48YuqLLjNqqVB3IlYSj4HeZeyxHhJYENs N4JDkIYh5i/N2UhtztzrEbEL8SSbn1x6bfln/89/JIszhs2coZ0yo5GoflfEKPE0DMlA oQrsmLGcPGIaA/juSS1k5DvlEM55VmNqunskehlr3jysg1fHsQj0VgRF+kYDuh8REK1S oMFA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:from:subject :references:mime-version:message-id:in-reply-to:date:dkim-signature; bh=HEvZvF4Oc3vexy4XIoQ+XAYsAN/Zh5qFP9D1bfkhcFg=; b=QwdHbUIl/D5aCcts/qPkGh1Yobjj/LabGYS+gyl9qNjUpxe7p0Xg0/6LgDGQJyo/D/ UImCTusY4HWJmhKQJpKWnTDsuAVCBN5DoGgaALBDRp1CW9tUWxoblaa+xZ153TPwtCG+ Vh9J3fqAblVTJRHY16eUjezC6Eif78fqcK1nJQpE8N/+6HEC1lJOQywuCj0WFJPF3Vpn 2DG29NX/lgHdnD8NKbF/dzjxZtPunfCqoL0xWpgpvt+G+JI8URGg5QG1ZVc21045uags 7kFEZ5ieOsHiaq6WrkuMOo4GvhH4qQvzv3uBDAU+pQKs1ixqL0rHzD+o3I5AEqyJbAqV ZHVw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=r8XLJOdD; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id g23-20020a056a00079700b005a9caf70f2dsi7874611pfu.267.2023.02.19.01.33.37; Sun, 19 Feb 2023 01:33:49 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=r8XLJOdD; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229939AbjBSJcn (ORCPT + 99 others); Sun, 19 Feb 2023 04:32:43 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:54334 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229937AbjBSJcZ (ORCPT ); Sun, 19 Feb 2023 04:32:25 -0500 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0E4EA12585 for ; Sun, 19 Feb 2023 01:31:42 -0800 (PST) Received: by mail-yb1-xb4a.google.com with SMTP id k18-20020a25b292000000b0091231592671so17481ybj.1 for ; Sun, 19 Feb 2023 01:31:42 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:from:to:cc:subject:date :message-id:reply-to; bh=HEvZvF4Oc3vexy4XIoQ+XAYsAN/Zh5qFP9D1bfkhcFg=; b=r8XLJOdDvuuB2bUXNDwDRvoZ7FMRiOQKeA6SoXunYiJ0eszntMdgWluAfrIdSjAAPr 7l0dxRPNBy0vDuXB5TgKaEGEUBg2nadUL2+dFK9B7MiuoRTDIaer/VtDdOB35jZNtMGZ /oWyPXsWzxJEBehWf02NDBWkkylNks43hZ/Kkhz7K5QAnF/wIdScrY1Nmbx5NZ6xu78I QKwnQnDBSUduoUscQYfIz5R0rxICeKf1kmfEkFhRU60cZOaUO1IQ880wGHqFRi3bcdh2 vLtrBHyTSUnVA0RuOmhn6r8SO7atXDNC0mFzHDh6r5sR9yxdFmVtctw4oO9bgYEc43BF r+Sg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=HEvZvF4Oc3vexy4XIoQ+XAYsAN/Zh5qFP9D1bfkhcFg=; b=OpZKg++qxkrlwui11Vkz9gbcTcJBu4cZs//Q+QvYOzHDvefV2eGayf+bb9xPfthoMa +wTEGeTQXT6vGovkQLk7rORz392hD85mEjzFaQy87RjjeiJKusZ6lXHvSXxfcxoMq6lb 7Vt4VrLGmPRvGSxQSB8rWP29Kb1Db+I1oOrxnOEok0mVwWy3jqEjBd299sQkAkZKb5Nk 5+w+12M4uc4/9GDErCwQspnG0nRPOslVqYv7rNwiKe/jhXNoG34LatN4sF6lWk5Kahug GxohZpPYEpZirJinCm8gW1D1+mqN32aYi/g3AvLGDDbigjYOYbbyq/LXWlffROUT2mQh y2Eg== X-Gm-Message-State: AO0yUKXqcyn1vaK3umBxocR1SnIkxOlFR17pNvuAHyfb+YP6XHr1Qk1N qKOg3MJyNrBnyAmUsS+OCZxvxseDa6xz X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a25:6408:0:b0:911:cc15:db74 with SMTP id y8-20020a256408000000b00911cc15db74mr28176ybb.13.1676799091379; Sun, 19 Feb 2023 01:31:31 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:15 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-19-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 18/51] perf vendor events intel: Refresh haswellx metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758251429230862086?= X-GMAIL-MSGID: =?utf-8?q?1758251429230862086?= Update the haswellx metrics to TMA version 4.5. Generation was done using https://github.com/intel/perfmon. Notable changes are TMA info metrics are renamed from their node name to be lower case and prefixed by tma_info_, MetricThreshold expressions are added and "Sample with" documentation is added to many TMA metrics, and the smi_cost metric group is added replicating existing hard coded metrics in stat-shadow. Signed-off-by: Ian Rogers --- .../arch/x86/haswellx/hsx-metrics.json | 1397 ++++++++--------- 1 file changed, 679 insertions(+), 718 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/haswellx/hsx-metrics.json b/tools/perf/pmu-events/arch/x86/haswellx/hsx-metrics.json index 2e1fbc936d25..11f152c346eb 100644 --- a/tools/perf/pmu-events/arch/x86/haswellx/hsx-metrics.json +++ b/tools/perf/pmu-events/arch/x86/haswellx/hsx-metrics.json @@ -1,1023 +1,984 @@ [ { - "BriefDescription": "Instructions Per Cycle (per Logical Processor)", - "MetricExpr": "INST_RETIRED.ANY / CLKS", - "MetricGroup": "Ret;Summary", - "MetricName": "IPC" - }, - { - "BriefDescription": "Uops Per Instruction", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY", - "MetricGroup": "Pipeline;Ret;Retire", - "MetricName": "UPI" - }, - { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW", - "MetricName": "UpTB" - }, - { - "BriefDescription": "Cycles Per Instruction (per Logical Processor)", - "MetricExpr": "1 / IPC", - "MetricGroup": "Mem;Pipeline", - "MetricName": "CPI" - }, - { - "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "Pipeline", - "MetricName": "CLKS" - }, - { - "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "4 * CORE_CLKS", - "MetricGroup": "tma_L1_group", - "MetricName": "SLOTS" - }, - { - "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", - "MetricExpr": "INST_RETIRED.ANY / CORE_CLKS", - "MetricGroup": "Ret;SMT;tma_L1_group", - "MetricName": "CoreIPC" - }, - { - "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", - "MetricExpr": "(UOPS_EXECUTED.CORE / 2 / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@) if #SMT_on else UOPS_EXECUTED.CORE / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@))", - "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", - "MetricName": "ILP" - }, - { - "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else CLKS))", - "MetricGroup": "SMT", - "MetricName": "CORE_CLKS" - }, - { - "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_LOADS", - "MetricGroup": "InsType", - "MetricName": "IpLoad" - }, - { - "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_STORES", - "MetricGroup": "InsType", - "MetricName": "IpStore" - }, - { - "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Branches;Fed;InsType", - "MetricName": "IpBranch" - }, - { - "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "IpCall" - }, - { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO", - "MetricName": "IpTB" - }, - { - "BriefDescription": "Branch instructions per taken branch. ", - "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "BpTkBranch" - }, - { - "BriefDescription": "Total number of retired Instructions Sample with: INST_RETIRED.PREC_DIST", - "MetricExpr": "INST_RETIRED.ANY", - "MetricGroup": "Summary;tma_L1_group", - "MetricName": "Instructions" - }, - { - "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=1@", - "MetricGroup": "Pipeline;Ret", - "MetricName": "Retire" + "BriefDescription": "C2 residency percent per package", + "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C2_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", - "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", - "MetricGroup": "DSB;Fed;FetchBW", - "MetricName": "DSB_Coverage" + "BriefDescription": "C3 residency percent per core", + "MetricExpr": "cstate_core@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Core_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BadSpec;BrMispredicts", - "MetricName": "IpMispredict" + "BriefDescription": "C3 residency percent per package", + "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", - "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_UOPS_RETIRED.L1_MISS + MEM_LOAD_UOPS_RETIRED.HIT_LFB)", - "MetricGroup": "Mem;MemoryBound;MemoryLat", - "MetricName": "Load_Miss_Real_Latency" + "BriefDescription": "C6 residency percent per core", + "MetricExpr": "cstate_core@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Core_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)", - "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES", - "MetricGroup": "Mem;MemoryBW;MemoryBound", - "MetricName": "MLP" + "BriefDescription": "C6 residency percent per package", + "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L1_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L1MPKI" + "BriefDescription": "C7 residency percent per core", + "MetricExpr": "cstate_core@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Core_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L2_MISS / INST_RETIRED.ANY", - "MetricGroup": "Backend;CacheMisses;Mem", - "MetricName": "L2MPKI" + "BriefDescription": "C7 residency percent per package", + "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L3_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L3MPKI" + "BriefDescription": "Uncore frequency per die [GHZ]", + "MetricExpr": "tma_info_socket_clks / #num_dies / duration_time / 1e9", + "MetricGroup": "SoC", + "MetricName": "UNCORE_FREQ" }, { - "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", - "MetricConstraint": "NO_NMI_WATCHDOG", - "MetricExpr": "(ITLB_MISSES.WALK_DURATION + DTLB_LOAD_MISSES.WALK_DURATION + DTLB_STORE_MISSES.WALK_DURATION) / CORE_CLKS", - "MetricGroup": "Mem;MemoryTLB", - "MetricName": "Page_Walks_Utilization" + "BriefDescription": "Percentage of cycles spent in System Management Interrupts.", + "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)", + "MetricGroup": "smi", + "MetricName": "smi_cycles", + "MetricThreshold": "smi_cycles > 0.1", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW" + "BriefDescription": "Number of SMI interrupts.", + "MetricExpr": "msr@smi@", + "MetricGroup": "smi", + "MetricName": "smi_num", + "ScaleUnit": "1SMI#" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW" + "BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset", + "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_4k_aliasing", + "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_info_slots", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_alu_op_utilization", + "MetricThreshold": "tma_alu_op_utilization > 0.6", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "L1D_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW_1T" + "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", + "MetricExpr": "100 * OTHER_ASSISTS.ANY_WB_ASSIST / tma_info_slots", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_assists", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.ANY", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "L2_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW_1T" + "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "1 - (tma_frontend_bound + tma_bad_speculation + tma_retiring)", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_backend_bound", + "MetricThreshold": "tma_backend_bound > 0.2", + "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "L3_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW_1T" + "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", + "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_bad_speculation", + "MetricThreshold": "tma_bad_speculation > 0.15", + "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "0", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "L3_Cache_Access_BW_1T" + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation", + "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM", + "MetricName": "tma_branch_mispredicts", + "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_info_branch_misprediction_cost, tma_mispredicts_resteers", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average CPU Utilization", - "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", - "MetricGroup": "HPC;Summary", - "MetricName": "CPU_Utilization" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", + "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY) / tma_info_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_branch_resteers", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "ScaleUnit": "100%" }, { - "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", - "MetricExpr": "Turbo_Utilization * TSC / 1e9 / duration_time", - "MetricGroup": "Power;Summary", - "MetricName": "Average_Frequency" + "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_cisc", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average Frequency Utilization relative nominal frequency", - "MetricExpr": "CLKS / CPU_CLK_UNHALTED.REF_TSC", - "MetricGroup": "Power", - "MetricName": "Turbo_Utilization" + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(60 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) + 43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD)))) / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_contested_accesses", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", - "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", - "MetricGroup": "SMT", - "MetricName": "SMT_2T_Utilization" + "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_backend_bound - tma_memory_bound", + "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "OS", - "MetricName": "Kernel_Utilization" + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) / tma_info_clks", + "MetricGroup": "Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_data_sharing", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", - "MetricGroup": "OS", - "MetricName": "Kernel_CPI" + "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", + "MetricExpr": "10 * ARITH.DIVIDER_UOPS / tma_info_core_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_divider", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_UOPS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", - "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e9 / duration_time", - "MetricGroup": "HPC;Mem;MemoryBW;SoC", - "MetricName": "DRAM_BW_Use" + "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", + "MetricConstraint": "NO_GROUP_EVENTS_SMT", + "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS)) * CYCLE_ACTIVITY.STALLS_L2_PENDING / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_MISS_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches", - "MetricExpr": "1e9 * (UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=0x182@ / UNC_C_TOR_INSERTS.MISS_OPCODE@filter_opc\\=0x182@) / (Socket_CLKS / duration_time)", - "MetricGroup": "Mem;MemoryLat;SoC", - "MetricName": "MEM_Read_Latency" + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", + "MetricExpr": "(IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4_UOPS) / tma_info_core_clks / 2", + "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_dsb", + "MetricThreshold": "tma_dsb > 0.15 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches", - "MetricExpr": "UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=0x182@ / UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=0x182\\,thresh\\=1@", - "MetricGroup": "Mem;MemoryBW;SoC", - "MetricName": "MEM_Parallel_Reads" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_clks", + "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_dsb_switches", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Related metrics: tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_iptb, tma_lcp", + "ScaleUnit": "100%" }, { - "BriefDescription": "Socket actual clocks when any core is active on that socket", - "MetricExpr": "cbox_0@event\\=0x0@", - "MetricGroup": "SoC", - "MetricName": "Socket_CLKS" + "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", + "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + DTLB_LOAD_MISSES.WALK_DURATION) / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group", + "MetricName": "tma_dtlb_load", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", - "MetricGroup": "Branches;OS", - "MetricName": "IpFarBranch" + "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", + "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + DTLB_STORE_MISSES.WALK_DURATION) / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group", + "MetricName": "tma_dtlb_store", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_UOPS_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load", + "ScaleUnit": "100%" }, { - "BriefDescription": "Uncore frequency per die [GHZ]", - "MetricExpr": "Socket_CLKS / #num_dies / duration_time / 1e9", - "MetricGroup": "SoC", - "MetricName": "UNCORE_FREQ" + "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing", + "MetricExpr": "(200 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_MISS.REMOTE_HITM + 60 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_HIT.HITM_OTHER_CORE) / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group", + "MetricName": "tma_false_sharing", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "CPU operating frequency (in GHz)", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / duration_time", - "MetricName": "cpu_operating_frequency", - "ScaleUnit": "1GHz" + "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_info_load_miss_real_latency * cpu@L1D_PEND_MISS.REQUEST_FB_FULL\\,cmask\\=1@ / tma_info_clks", + "MetricGroup": "MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricName": "tma_fb_full", + "MetricThreshold": "tma_fb_full > 0.3", + "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", + "ScaleUnit": "100%" }, { - "BriefDescription": "Cycles per instruction retired; indicating how much time each executed instruction took; in units of cycles.", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY", - "MetricName": "cpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", + "MetricExpr": "tma_frontend_bound - tma_fetch_latency", + "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Related metrics: tma_dsb_switches, tma_info_dsb_coverage, tma_info_iptb, tma_lcp", + "ScaleUnit": "100%" }, { - "BriefDescription": "The ratio of number of completed memory load instructions to the total number completed instructions", - "MetricExpr": "MEM_UOPS_RETIRED.ALL_LOADS / INST_RETIRED.ANY", - "MetricName": "loads_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", + "MetricExpr": "4 * min(CPU_CLK_UNHALTED.THREAD, IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE) / tma_info_slots", + "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: RS_EVENTS.EMPTY_END", + "ScaleUnit": "100%" }, { - "BriefDescription": "The ratio of number of completed memory store instructions to the total number completed instructions", - "MetricExpr": "MEM_UOPS_RETIRED.ALL_STORES / INST_RETIRED.ANY", - "MetricName": "stores_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", + "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / tma_info_slots", + "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Ratio of number of requests missing L1 data cache (includes data+rfo w/ prefetches) to the total number of completed instructions", - "MetricExpr": "L1D.REPLACEMENT / INST_RETIRED.ANY", - "MetricName": "l1d_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences", + "MetricExpr": "tma_microcode_sequencer", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_heavy_operations", + "MetricThreshold": "tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Ratio of number of demand load requests hitting in L1 data cache to the total number of completed instructions", - "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L1_HIT / INST_RETIRED.ANY", - "MetricName": "l1d_demand_data_read_hits_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses.", + "MetricExpr": "ICACHE.IFDATA_STALL / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_icache_misses", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "ScaleUnit": "100%" }, { - "BriefDescription": "Ratio of number of code read requests missing in L1 instruction cache (includes prefetches) to the total number of completed instructions", - "MetricExpr": "L2_RQSTS.ALL_CODE_RD / INST_RETIRED.ANY", - "MetricName": "l1_i_code_read_misses_with_prefetches_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", + "MetricExpr": "tma_info_turbo_utilization * TSC / 1e9 / duration_time", + "MetricGroup": "Power;Summary", + "MetricName": "tma_info_average_frequency" }, { - "BriefDescription": "Ratio of number of completed demand load requests hitting in L2 cache to the total number of completed instructions", - "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L2_HIT / INST_RETIRED.ANY", - "MetricName": "l2_demand_data_read_hits_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Branch instructions per taken branch.", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_bptkbranch" }, { - "BriefDescription": "Ratio of number of requests missing L2 cache (includes code+data+rfo w/ prefetches) to the total number of completed instructions", - "MetricExpr": "L2_LINES_IN.ALL / INST_RETIRED.ANY", - "MetricName": "l2_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "Pipeline", + "MetricName": "tma_info_clks" }, { - "BriefDescription": "Ratio of number of completed data read request missing L2 cache to the total number of completed instructions", - "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L2_MISS / INST_RETIRED.ANY", - "MetricName": "l2_demand_data_read_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_clks))", + "MetricGroup": "SMT", + "MetricName": "tma_info_core_clks" }, { - "BriefDescription": "Ratio of number of code read request missing L2 cache to the total number of completed instructions", - "MetricExpr": "L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", - "MetricName": "l2_demand_code_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_core_clks", + "MetricGroup": "Ret;SMT;TmaL1;tma_L1_group", + "MetricName": "tma_info_coreipc" }, { - "BriefDescription": "Average latency of a last level cache (LLC) demand and prefetch data read miss (read memory access) in nano seconds", - "MetricExpr": "1e9 * (cbox@UNC_C_TOR_OCCUPANCY.MISS_OPCODE\\,filter_opc\\=0x182@ / cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=0x182@) / (UNC_C_CLOCKTICKS / (#num_cores / #num_packages * #num_packages)) * duration_time", - "MetricName": "llc_data_read_demand_plus_prefetch_miss_latency", - "ScaleUnit": "1ns" + "BriefDescription": "Cycles Per Instruction (per Logical Processor)", + "MetricExpr": "1 / tma_info_ipc", + "MetricGroup": "Mem;Pipeline", + "MetricName": "tma_info_cpi" }, { - "BriefDescription": "Average latency of a last level cache (LLC) demand and prefetch data read miss (read memory access) addressed to local memory in nano seconds", - "MetricExpr": "1e9 * (cbox@UNC_C_TOR_OCCUPANCY.MISS_LOCAL_OPCODE\\,filter_opc\\=0x182@ / cbox@UNC_C_TOR_INSERTS.MISS_LOCAL_OPCODE\\,filter_opc\\=0x182@) / (UNC_C_CLOCKTICKS / (#num_cores / #num_packages * #num_packages)) * duration_time", - "MetricName": "llc_data_read_demand_plus_prefetch_miss_latency_for_local_requests", - "ScaleUnit": "1ns" + "BriefDescription": "Average CPU Utilization", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": "HPC;Summary", + "MetricName": "tma_info_cpu_utilization" }, { - "BriefDescription": "Average latency of a last level cache (LLC) demand and prefetch data read miss (read memory access) addressed to remote memory in nano seconds", - "MetricExpr": "1e9 * (cbox@UNC_C_TOR_OCCUPANCY.MISS_REMOTE_OPCODE\\,filter_opc\\=0x182@ / cbox@UNC_C_TOR_INSERTS.MISS_REMOTE_OPCODE\\,filter_opc\\=0x182@) / (UNC_C_CLOCKTICKS / (#num_cores / #num_packages * #num_packages)) * duration_time", - "MetricName": "llc_data_read_demand_plus_prefetch_miss_latency_for_remote_requests", - "ScaleUnit": "1ns" + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD / OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_data_l2_mlp" }, { - "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions", - "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", - "MetricName": "itlb_mpi", - "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the ITLB (Instruction TLB) and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", + "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e9 / duration_time", + "MetricGroup": "HPC;Mem;MemoryBW;SoC;tma_issueBW", + "MetricName": "tma_info_dram_bw_use", + "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_fb_full, tma_mem_bandwidth, tma_sq_full" }, { - "BriefDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions", - "MetricExpr": "ITLB_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY", - "MetricName": "itlb_large_page_mpi", - "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the Instruction Translation Lookaside Buffer (ITLB) and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_dsb_coverage", + "MetricThreshold": "tma_info_dsb_coverage < 0.7 & tma_info_ipc / 4 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_iptb, tma_lcp" }, { - "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions", - "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", - "MetricName": "dtlb_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", + "MetricExpr": "(UOPS_EXECUTED.CORE / 2 / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@) if #SMT_on else UOPS_EXECUTED.CORE / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@))", + "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", + "MetricName": "tma_info_ilp" }, { - "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions", - "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", - "MetricName": "dtlb_store_mpi", - "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Summary;TmaL1;tma_L1_group", + "MetricName": "tma_info_instructions", + "PublicDescription": "Total number of retired Instructions. Sample with: INST_RETIRED.PREC_DIST" }, { - "BriefDescription": "Uncore operating frequency in GHz", - "MetricExpr": "UNC_C_CLOCKTICKS / (#num_cores / #num_packages * #num_packages) / 1e9 / duration_time", - "MetricName": "uncore_frequency", - "ScaleUnit": "1GHz" + "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Branches;Fed;InsType", + "MetricName": "tma_info_ipbranch", + "MetricThreshold": "tma_info_ipbranch < 8" }, { - "BriefDescription": "Intel(R) Quick Path Interconnect (QPI) data transmit bandwidth (MB/sec)", - "MetricExpr": "UNC_Q_TxL_FLITS_G0.DATA * 8 / 1e6 / duration_time", - "MetricName": "qpi_data_transmit_bw", - "ScaleUnit": "1MB/s" + "BriefDescription": "Instructions Per Cycle (per Logical Processor)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_clks", + "MetricGroup": "Ret;Summary", + "MetricName": "tma_info_ipc" }, { - "BriefDescription": "DDR memory read bandwidth (MB/sec)", - "MetricExpr": "UNC_M_CAS_COUNT.RD * 64 / 1e6 / duration_time", - "MetricName": "memory_bandwidth_read", - "ScaleUnit": "1MB/s" + "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_ipcall", + "MetricThreshold": "tma_info_ipcall < 200" }, { - "BriefDescription": "DDR memory write bandwidth (MB/sec)", - "MetricExpr": "UNC_M_CAS_COUNT.WR * 64 / 1e6 / duration_time", - "MetricName": "memory_bandwidth_write", - "ScaleUnit": "1MB/s" + "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", + "MetricGroup": "Branches;OS", + "MetricName": "tma_info_ipfarbranch", + "MetricThreshold": "tma_info_ipfarbranch < 1e6" }, { - "BriefDescription": "DDR memory bandwidth (MB/sec)", - "MetricExpr": "(UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) * 64 / 1e6 / duration_time", - "MetricName": "memory_bandwidth_total", - "ScaleUnit": "1MB/s" + "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_LOADS", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipload", + "MetricThreshold": "tma_info_ipload < 3" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the CPU.", - "MetricExpr": "cbox@UNC_C_TOR_INSERTS.OPCODE\\,filter_opc\\=0x19e@ * 64 / 1e6 / duration_time", - "MetricName": "io_bandwidth_disk_or_network_writes", - "ScaleUnit": "1MB/s" + "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).", + "MetricExpr": "tma_info_instructions / (UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * cpu@BR_MISP_EXEC.ALL_BRANCHES\\,umask\\=0xE4@)", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_indirect", + "MetricThreshold": "tma_info_ipmisp_indirect < 1e3" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the CPU.", - "MetricExpr": "cbox@UNC_C_TOR_INSERTS.OPCODE\\,filter_opc\\=0x1c8\\,filter_tid\\=0x3e@ * 64 / 1e6 / duration_time", - "MetricName": "io_bandwidth_disk_or_network_reads", - "ScaleUnit": "1MB/s" + "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BadSpec;BrMispredicts", + "MetricName": "tma_info_ipmispredict", + "MetricThreshold": "tma_info_ipmispredict < 200" }, { - "BriefDescription": "Uops delivered from decoded instruction cache (decoded stream buffer or DSB) as a percent of total uops delivered to Instruction Decode Queue", - "MetricExpr": "IDQ.DSB_UOPS / UOPS_ISSUED.ANY", - "MetricName": "percent_uops_delivered_from_decoded_icache", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_STORES", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipstore", + "MetricThreshold": "tma_info_ipstore < 8" }, { - "BriefDescription": "Uops delivered from legacy decode pipeline (Micro-instruction Translation Engine or MITE) as a percent of total uops delivered to Instruction Decode Queue", - "MetricExpr": "IDQ.MITE_UOPS / UOPS_ISSUED.ANY", - "MetricName": "percent_uops_delivered_from_legacy_decode_pipeline", - "ScaleUnit": "100%" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", + "MetricName": "tma_info_iptb", + "MetricThreshold": "tma_info_iptb < 9", + "PublicDescription": "Instruction per taken branch. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_lcp" }, { - "BriefDescription": "Uops delivered from microcode sequencer (MS) as a percent of total uops delivered to Instruction Decode Queue", - "MetricExpr": "IDQ.MS_UOPS / UOPS_ISSUED.ANY", - "MetricName": "percent_uops_delivered_from_microcode_sequencer", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per speculative Unknown Branch Misprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_instructions / BACLEARS.ANY", + "MetricGroup": "Fed", + "MetricName": "tma_info_ipunknown_branch" }, { - "BriefDescription": "Uops delivered from loop stream detector(LSD) as a percent of total uops delivered to Instruction Decode Queue", - "MetricExpr": "(UOPS_ISSUED.ANY - IDQ.MITE_UOPS - IDQ.MS_UOPS - IDQ.DSB_UOPS) / UOPS_ISSUED.ANY", - "MetricName": "percent_uops_delivered_from_loop_stream_detector", - "ScaleUnit": "100%" + "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_cpi" }, { - "BriefDescription": "Ratio of number of data read requests missing last level core cache (includes demand w/ prefetches) to the total number of completed instructions", - "MetricExpr": "(cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=0x182@ + cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=0x192@) / INST_RETIRED.ANY", - "MetricName": "llc_data_read_mpi_demand_plus_prefetch", - "ScaleUnit": "1per_instr" + "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_utilization", + "MetricThreshold": "tma_info_kernel_utilization > 0.05" }, { - "BriefDescription": "Ratio of number of code read requests missing last level core cache (includes demand w/ prefetches) to the total number of completed instructions", - "MetricExpr": "(cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=0x181@ + cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=0x191@) / INST_RETIRED.ANY", - "MetricName": "llc_code_read_mpi_demand_plus_prefetch", - "ScaleUnit": "1per_instr" + "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw" }, { - "BriefDescription": "Memory read that miss the last level cache (LLC) addressed to local DRAM as a percentage of total memory read accesses, does not include LLC prefetches.", - "MetricExpr": "cbox@UNC_C_TOR_INSERTS.MISS_LOCAL_OPCODE\\,filter_opc\\=0x182@ / (cbox@UNC_C_TOR_INSERTS.MISS_LOCAL_OPCODE\\,filter_opc\\=0x182@ + cbox@UNC_C_TOR_INSERTS.MISS_REMOTE_OPCODE\\,filter_opc\\=0x182@)", - "MetricName": "numa_reads_addressed_to_local_dram", - "ScaleUnit": "100%" + "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "tma_info_l1d_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw_1t" }, { - "BriefDescription": "Memory reads that miss the last level cache (LLC) addressed to remote DRAM as a percentage of total memory read accesses, does not include LLC prefetches.", - "MetricExpr": "cbox@UNC_C_TOR_INSERTS.MISS_REMOTE_OPCODE\\,filter_opc\\=0x182@ / (cbox@UNC_C_TOR_INSERTS.MISS_LOCAL_OPCODE\\,filter_opc\\=0x182@ + cbox@UNC_C_TOR_INSERTS.MISS_REMOTE_OPCODE\\,filter_opc\\=0x182@)", - "MetricName": "numa_reads_addressed_to_remote_dram", - "ScaleUnit": "100%" + "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l1mpki" }, { - "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", - "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / SLOTS", - "MetricGroup": "PGO;TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_frontend_bound", - "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Machine_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound.", - "ScaleUnit": "100%" + "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", - "MetricExpr": "4 * min(CPU_CLK_UNHALTED.THREAD, IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE) / SLOTS", - "MetricGroup": "Frontend;TopdownL2;tma_L2_group;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_latency", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period.", - "ScaleUnit": "100%" + "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "tma_info_l2_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw_1t" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses.", - "MetricExpr": "ICACHE.IFDATA_STALL / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_icache_misses", - "ScaleUnit": "100%" + "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheMisses;Mem", + "MetricName": "tma_info_l2mpki" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses.", - "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + ITLB_MISSES.WALK_DURATION) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_itlb_misses", - "ScaleUnit": "100%" + "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "0", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_l3_cache_access_bw_1t" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", - "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_branch_resteers", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings.", - "ScaleUnit": "100%" + "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_dsb_switches", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty.", - "ScaleUnit": "100%" + "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "tma_info_l3_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw_1t" }, { - "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", - "MetricExpr": "ILD_STALL.LCP / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_lcp", - "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs.", - "ScaleUnit": "100%" + "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L3_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l3mpki" }, { - "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", - "MetricExpr": "2 * IDQ.MS_SWITCHES / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_ms_switches", - "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals.", - "ScaleUnit": "100%" + "BriefDescription": "Average Latency for L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_load_l2_miss_latency" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", - "MetricExpr": "tma_frontend_bound - tma_fetch_latency", - "MetricGroup": "FetchBW;Frontend;TopdownL2;tma_L2_group;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_bandwidth", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend.", - "ScaleUnit": "100%" + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_load_l2_mlp" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", - "MetricExpr": "(IDQ.ALL_MITE_CYCLES_ANY_UOPS - IDQ.ALL_MITE_CYCLES_4_UOPS) / CORE_CLKS / 2", - "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", - "MetricName": "tma_mite", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", - "ScaleUnit": "100%" + "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_UOPS_RETIRED.L1_MISS + MEM_LOAD_UOPS_RETIRED.HIT_LFB)", + "MetricGroup": "Mem;MemoryBound;MemoryLat", + "MetricName": "tma_info_load_miss_real_latency" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", - "MetricExpr": "(IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4_UOPS) / CORE_CLKS / 2", - "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", - "MetricName": "tma_dsb", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", - "ScaleUnit": "100%" + "BriefDescription": "Average number of parallel data read requests to external memory", + "MetricExpr": "UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=0x182@ / UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=0x182\\,thresh\\=1@", + "MetricGroup": "Mem;MemoryBW;SoC", + "MetricName": "tma_info_mem_parallel_reads", + "PublicDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches" }, { - "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", - "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_bad_speculation", - "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", - "ScaleUnit": "100%" + "BriefDescription": "Average latency of data read request to external memory (in nanoseconds)", + "MetricExpr": "1e9 * (UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=0x182@ / UNC_C_TOR_INSERTS.MISS_OPCODE@filter_opc\\=0x182@) / (tma_info_socket_clks / duration_time)", + "MetricGroup": "Mem;MemoryLat;SoC", + "MetricName": "tma_info_mem_read_latency", + "PublicDescription": "Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches. ([RKL+]memory-controller only)" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", - "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL2;tma_L2_group;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_branch_mispredicts", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path.", - "ScaleUnit": "100%" + "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES", + "MetricGroup": "Mem;MemoryBW;MemoryBound", + "MetricName": "tma_info_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", - "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts", - "MetricGroup": "BadSpec;MachineClears;TopdownL2;tma_L2_group;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_machine_clears", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes.", - "ScaleUnit": "100%" + "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", + "MetricExpr": "(ITLB_MISSES.WALK_DURATION + DTLB_LOAD_MISSES.WALK_DURATION + DTLB_STORE_MISSES.WALK_DURATION) / tma_info_core_clks", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_page_walks_utilization", + "MetricThreshold": "tma_info_page_walks_utilization > 0.5" }, { - "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", - "MetricExpr": "1 - (tma_frontend_bound + tma_bad_speculation + UOPS_RETIRED.RETIRE_SLOTS / SLOTS)", - "MetricGroup": "TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_backend_bound", - "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound.", - "ScaleUnit": "100%" + "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=1@", + "MetricGroup": "Pipeline;Ret", + "MetricName": "tma_info_retire" }, { - "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", - "MetricExpr": "(min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING) + RESOURCE_STALLS.SB) / (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x3@ if INST_RETIRED.ANY / CPU_CLK_UNHALTED.THREAD > 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB if #SMT_on else min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x3@ if INST_RETIRED.ANY / CPU_CLK_UNHALTED.THREAD > 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x2@) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) * tma_backend_bound", - "MetricGroup": "Backend;TopdownL2;tma_L2_group;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_memory_bound", - "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", - "ScaleUnit": "100%" + "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", + "MetricExpr": "4 * tma_info_core_clks", + "MetricGroup": "TmaL1;tma_L1_group", + "MetricName": "tma_info_slots" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", - "MetricExpr": "max((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING) - CYCLE_ACTIVITY.STALLS_L1D_PENDING) / CPU_CLK_UNHALTED.THREAD, 0)", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", - "MetricName": "tma_l1_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache.", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", + "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", + "MetricGroup": "SMT", + "MetricName": "tma_info_smt_2t_utilization" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + DTLB_LOAD_MISSES.WALK_DURATION) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_dtlb_load", - "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss.", - "ScaleUnit": "100%" + "BriefDescription": "Socket actual clocks when any core is active on that socket", + "MetricExpr": "cbox_0@event\\=0x0@", + "MetricGroup": "SoC", + "MetricName": "tma_info_socket_clks" }, { - "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", - "MetricExpr": "min(13 * LD_BLOCKS.STORE_FORWARD / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_store_fwd_blk", - "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", - "ScaleUnit": "100%" + "BriefDescription": "Average Frequency Utilization relative nominal frequency", + "MetricExpr": "tma_info_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_turbo_utilization" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", - "MetricExpr": "min(MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_lock_latency", - "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them.", - "ScaleUnit": "100%" + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY", + "MetricGroup": "Pipeline;Ret;Retire", + "MetricName": "tma_info_uoppi", + "MetricThreshold": "tma_info_uoppi > 1.05" }, { - "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. ", - "MetricExpr": "min(Load_Miss_Real_Latency * LD_BLOCKS.NO_SR / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_split_loads", - "ScaleUnit": "100%" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW", + "MetricName": "tma_info_uptb", + "MetricThreshold": "tma_info_uptb < 6" }, { - "BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset", - "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_4k_aliasing", - "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses", + "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + ITLB_MISSES.WALK_DURATION) / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_itlb_misses", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: ITLB_MISSES.WALK_COMPLETED", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", - "MetricExpr": "Load_Miss_Real_Latency * cpu@L1D_PEND_MISS.REQUEST_FB_FULL\\,cmask\\=0x1@ / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryBW;TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_fb_full", - "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory).", + "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", + "MetricExpr": "max((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING) - CYCLE_ACTIVITY.STALLS_L1D_PENDING) / tma_info_clks, 0)", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group", + "MetricName": "tma_l1_bound", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT_PS;MEM_LOAD_UOPS_RETIRED.HIT_LFB_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads", - "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_PENDING - CYCLE_ACTIVITY.STALLS_L2_PENDING) / CPU_CLK_UNHALTED.THREAD", + "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_PENDING - CYCLE_ACTIVITY.STALLS_L2_PENDING) / tma_info_clks", "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance.", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L2_HIT_PS", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", - "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS) * CYCLE_ACTIVITY.STALLS_L2_PENDING / CPU_CLK_UNHALTED.THREAD", + "MetricConstraint": "NO_GROUP_EVENTS_SMT", + "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS) * CYCLE_ACTIVITY.STALLS_L2_PENDING / tma_info_clks", "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance.", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", - "MetricExpr": "min((60 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) + 43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD)))) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_l3_bound_group", - "MetricName": "tma_contested_accesses", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing.", + "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "41 * (MEM_LOAD_UOPS_RETIRED.L3_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) / tma_info_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group", + "MetricName": "tma_l3_hit_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS. Related metrics: tma_mem_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", - "MetricExpr": "min(43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Offcore;Snoop;TopdownL4;tma_L4_group;tma_l3_bound_group", - "MetricName": "tma_data_sharing", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance.", + "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", + "MetricExpr": "ILD_STALL.LCP / tma_info_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_lcp", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", - "MetricExpr": "min(41 * (MEM_LOAD_UOPS_RETIRED.L3_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l3_bound_group", - "MetricName": "tma_l3_hit_latency", - "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings.", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", + "MetricExpr": "tma_retiring - tma_heavy_operations", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_light_operations", + "MetricThreshold": "tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", - "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on else OFFCORE_REQUESTS_BUFFER.SQ_FULL) / CORE_CLKS", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_l3_bound_group", - "MetricName": "tma_sq_full", - "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). The Super Queue is used for requests to access the L2 cache or to go out to the Uncore.", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_2 + UOPS_DISPATCHED_PORT.PORT_3 + UOPS_DISPATCHED_PORT.PORT_7 - UOPS_DISPATCHED_PORT.PORT_4) / (2 * tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_load_op_utilization", + "MetricThreshold": "tma_load_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations. Sample with: UOPS_DISPATCHED.PORT_2_3", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", - "MetricExpr": "min((1 - MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS)) * CYCLE_ACTIVITY.STALLS_L2_PENDING / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", - "MetricName": "tma_dram_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance.", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "200 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) / tma_info_clks", + "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_group", + "MetricName": "tma_local_dram", + "MetricThreshold": "tma_local_dram > 0.1 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory. Caching will improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM_PS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO) / tma_info_clks", + "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group", + "MetricName": "tma_lock_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOADS_PS. Related metrics: tma_store_latency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts", + "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=0x6@) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=6@) / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that).", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_sq_full", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / CPU_CLK_UNHALTED.THREAD - tma_mem_bandwidth", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_clks - tma_mem_bandwidth", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that).", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_l3_hit_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory", - "MetricExpr": "min(200 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_group", - "MetricName": "tma_local_dram", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory. Caching will improve the latency and increase performance.", + "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING) + RESOURCE_STALLS.SB) / (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@ if tma_info_ipc > 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) if #SMT_on else min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@ if tma_info_ipc > 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) * tma_backend_bound", + "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory", - "MetricExpr": "min(310 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latency_group", - "MetricName": "tma_remote_dram", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory. This is caused often due to non-optimal NUMA allocations. #link to NUMA article", + "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / tma_info_slots", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS", + "MetricName": "tma_microcode_sequencer", + "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues", - "MetricExpr": "min((200 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) + 180 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD)))) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latency_group", - "MetricName": "tma_remote_cache", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues. This is caused often due to non-optimal NUMA allocations. #link to NUMA article", + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", + "MetricExpr": "(IDQ.ALL_MITE_CYCLES_ANY_UOPS - IDQ.ALL_MITE_CYCLES_4_UOPS) / tma_info_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_mite", + "MetricThreshold": "tma_mite > 0.1 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", - "MetricExpr": "RESOURCE_STALLS.SB / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", - "MetricName": "tma_store_bound", - "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck.", + "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", + "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_clks", + "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", - "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_store_bound_group", - "MetricName": "tma_store_latency", - "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full)", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_0 / tma_info_core_clks", + "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_0", + "MetricThreshold": "tma_port_0 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing", - "MetricExpr": "min((200 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_MISS.REMOTE_HITM + 60 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_HIT.HITM_OTHER_CORE) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_store_bound_group", - "MetricName": "tma_false_sharing", - "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. ", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_1 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_1", + "MetricThreshold": "tma_port_1 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_512b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents rate of split store accesses", - "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / CORE_CLKS", - "MetricGroup": "TopdownL4;tma_L4_group;tma_store_bound_group", - "MetricName": "tma_split_stores", - "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity.", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [ICL+] Loads)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_2 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_group", + "MetricName": "tma_port_2", + "MetricThreshold": "tma_port_2 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [ICL+] Loads). Sample with: UOPS_DISPATCHED_PORT.PORT_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", - "MetricExpr": "min((8 * DTLB_STORE_MISSES.STLB_HIT + DTLB_STORE_MISSES.WALK_DURATION) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_store_bound_group", - "MetricName": "tma_dtlb_store", - "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data.", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [ICL+] Loads)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_3 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_group", + "MetricName": "tma_port_3", + "MetricThreshold": "tma_port_3 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [ICL+] Loads). Sample with: UOPS_DISPATCHED_PORT.PORT_3", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", - "MetricExpr": "tma_backend_bound - tma_memory_bound", - "MetricGroup": "Backend;Compute;TopdownL2;tma_L2_group;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_core_bound", - "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data)", + "MetricExpr": "tma_store_op_utilization", + "MetricGroup": "TopdownL6;tma_L6_group;tma_issueSpSt;tma_store_op_utilization_group", + "MetricName": "tma_port_4", + "MetricThreshold": "tma_port_4 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data). Sample with: UOPS_DISPATCHED_PORT.PORT_4. Related metrics: tma_split_stores", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", - "MetricExpr": "10 * ARITH.DIVIDER_UOPS / CORE_CLKS", - "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", - "MetricName": "tma_divider", - "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication.", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_5 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_5", + "MetricThreshold": "tma_port_5 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_6 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_6", + "MetricThreshold": "tma_port_6 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_6. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 7 ([HSW+]simple Store-address)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_7 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_store_op_utilization_group", + "MetricName": "tma_port_7", + "MetricThreshold": "tma_port_7 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 7 ([HSW+]simple Store-address). Sample with: UOPS_DISPATCHED_PORT.PORT_7", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", - "MetricExpr": "((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x3@ if INST_RETIRED.ANY / CPU_CLK_UNHALTED.THREAD > 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB if #SMT_on else min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x3@ if INST_RETIRED.ANY / CPU_CLK_UNHALTED.THREAD > 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x2@) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) - RESOURCE_STALLS.SB - min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING)) / CPU_CLK_UNHALTED.THREAD", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@ if tma_info_ipc > 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB if #SMT_on else min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@ if tma_info_ipc > 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB - RESOURCE_STALLS.SB - min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING)) / tma_info_clks", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_ports_utilization", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\=0x1\\,cmask\\=0x1@ / 2 if #SMT_on else min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0)) / CORE_CLKS", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\,cmask\\=1@ / 2 if #SMT_on else (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0)) / tma_info_core_clks)", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", "MetricName": "tma_ports_utilized_0", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x2@) / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x2@) / CORE_CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@) / 2 if #SMT_on else (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@) / tma_info_core_clks)", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group", "MetricName": "tma_ports_utilized_1", - "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful.", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x3@) / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x3@) / CORE_CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@) / 2 if #SMT_on else (cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@) / tma_info_core_clks)", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group", "MetricName": "tma_ports_utilized_2", - "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop.", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise).", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x3@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x3@) / CORE_CLKS", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@) / tma_info_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", "MetricName": "tma_ports_utilized_3m", + "MetricThreshold": "tma_ports_utilized_3m > 0.7 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", - "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / SLOTS", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", - "MetricName": "tma_alu_op_utilization", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_0 / CORE_CLKS", - "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilization_group", - "MetricName": "tma_port_0", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_1 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group", - "MetricName": "tma_port_1", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_5 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group", - "MetricName": "tma_port_5", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(200 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) + 180 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD)))) / tma_info_clks", + "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_issueSyncxn;tma_mem_latency_group", + "MetricName": "tma_remote_cache", + "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues. This is caused often due to non-optimal NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM_PS;MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD_PS. Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_6 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group", - "MetricName": "tma_port_6", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "310 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) / tma_info_clks", + "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latency_group", + "MetricName": "tma_remote_dram", + "MetricThreshold": "tma_remote_dram > 0.1 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory. This is caused often due to non-optimal NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations", - "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_2 + UOPS_DISPATCHED_PORT.PORT_3 + UOPS_DISPATCHED_PORT.PORT_7 - UOPS_DISPATCHED_PORT.PORT_4) / (2 * CORE_CLKS)", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", - "MetricName": "tma_load_op_utilization", + "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1", + "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.RETIRE_SLOTS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [ICL+] Loads)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_2 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_group", - "MetricName": "tma_port_2", + "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_info_load_miss_real_latency * LD_BLOCKS.NO_SR / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_split_loads", + "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [ICL+] Loads)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_3 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_group", - "MetricName": "tma_port_3", + "BriefDescription": "This metric represents rate of split store accesses", + "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / tma_info_core_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group", + "MetricName": "tma_split_stores", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_UOPS_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_4", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_4 / CORE_CLKS", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", - "MetricName": "tma_store_op_utilization", + "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", + "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on else OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group", + "MetricName": "tma_sq_full", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_mem_bandwidth", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data)", - "MetricExpr": "tma_store_op_utilization", - "MetricGroup": "TopdownL6;tma_L6_group;tma_store_op_utilization_group", - "MetricName": "tma_port_4", + "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", + "MetricExpr": "RESOURCE_STALLS.SB / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_UOPS_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 7 ([HSW+]simple Store-address)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_7 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_L6_group;tma_store_op_utilization_group", - "MetricName": "tma_port_7", + "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", + "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_store_fwd_blk", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", "ScaleUnit": "100%" }, { - "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_retiring", - "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. ", + "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_clks", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group", + "MetricName": "tma_store_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", - "MetricExpr": "tma_retiring - UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / SLOTS", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_L2_group;tma_retiring_group", - "MetricName": "tma_light_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved.", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_4 / tma_info_core_clks", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_store_op_utilization", + "MetricThreshold": "tma_store_op_utilization > 0.6", "ScaleUnit": "100%" }, { "BriefDescription": "This metric serves as an approximation of legacy x87 usage", - "MetricExpr": "INST_RETIRED.X87 * UPI / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricExpr": "INST_RETIRED.X87 * tma_info_uoppi / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", + "MetricThreshold": "tma_x87_use > 0.1", "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / SLOTS", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_L2_group;tma_retiring_group", - "MetricName": "tma_heavy_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences. This highly-correlates with the uop length of these instructions/sequences.", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", - "MetricExpr": "tma_heavy_operations", - "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group", - "MetricName": "tma_microcode_sequencer", - "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", - "MetricExpr": "min(100 * OTHER_ASSISTS.ANY_WB_ASSIST / SLOTS, 1)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", - "MetricName": "tma_assists", - "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases.", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", - "MetricExpr": "max(0, tma_heavy_operations - tma_assists)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", - "MetricName": "tma_cisc", - "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "C3 residency percent per core", - "MetricExpr": "cstate_core@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Core_Residency", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "C6 residency percent per core", - "MetricExpr": "cstate_core@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Core_Residency", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "C7 residency percent per core", - "MetricExpr": "cstate_core@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Core_Residency", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "C2 residency percent per package", - "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C2_Pkg_Residency", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "C3 residency percent per package", - "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Pkg_Residency", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "C6 residency percent per package", - "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Pkg_Residency", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "C7 residency percent per package", - "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Pkg_Residency", - "ScaleUnit": "100%" } ] From patchwork Sun Feb 19 09:28:16 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59109 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp772734wrn; Sun, 19 Feb 2023 01:34:08 -0800 (PST) X-Google-Smtp-Source: AK7set//ZuPK9SasmHezRm0p14oZZ76OltFlprYIvRfYgzmAg/YuLyHtU5X/elDcbSIMut0rPlh/ X-Received: by 2002:a62:31c7:0:b0:5a9:d734:7d2 with SMTP id x190-20020a6231c7000000b005a9d73407d2mr6812120pfx.32.1676799248080; Sun, 19 Feb 2023 01:34:08 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799248; cv=none; d=google.com; s=arc-20160816; b=A4h4/5iMlrH2WdzeDEGXPEGjX1EnGk4+5U5dl024dS0DwHZGDpRTZZ3fSGMJ9um+ua AI+Z3+p+aTJLH+3tw8wF5/xx4t1md/ovGZS0DRLQvhNHoOJgR7PohNg2eLCIvN10Rz2O UMIN9HNdQPzoo107CUtlPlX7IcqwUKmYhIF9935JIqaHeNTJEwRNMe1wOrKa5BFIRBfv 79mAX9yddy8vWQfWu69s2QL0dfeIKVrFRWMc1aF2hL2+Z6Tf1vAA9ozAzgveVJDO/a8a 2M3h8gVhahxNUmdVZ23WnuIrc5DPUkeeTl4R2pQ2rj2lZkIKI5FUNfmtyCXKFBx1ml09 RLLA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:from:subject :references:mime-version:message-id:in-reply-to:date:dkim-signature; bh=KQfysMeKlgE+jIUjOmzXBqFSCTlH0ae4syUK6W7l0PA=; b=ICO9pebJ47WTiWA//BIVRShmipsv0izAL4POsjvfRx++cxltHwxegZ0vplQVOuIQNa 2z71PXMdY/FYP0Cq/JuikUFzUMtb8wAUL41XBaqJ3Yuf2vbdXCgaxRtDuPKDMWgSdUCK 7di6CO/XMMSOpZ29z4QPeXOkhnaxxblZ0gPE8NCgSXCj7pJYn9j93Vq+IjmiZB7p63+C TjDHBPAUwX+p7aJxC36as1bQYzmY6GX7XzO+/seuHsReMyAxDjkwDXAopUscVHMQgaLj 4Q6dw14vf8TBQ4/BwYO/CpM9tYZDjhbhHIKBDGqSoKT2SlZM403rjsmhs+jusQ1TXX6Y ltlQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=NcbhWt38; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id f13-20020a056a001acd00b0056cb4662b9csi11608016pfv.16.2023.02.19.01.33.55; Sun, 19 Feb 2023 01:34:08 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=NcbhWt38; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229988AbjBSJc7 (ORCPT + 99 others); Sun, 19 Feb 2023 04:32:59 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:54842 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229957AbjBSJcj (ORCPT ); Sun, 19 Feb 2023 04:32:39 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C9EADE3AB for ; Sun, 19 Feb 2023 01:31:53 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id f33-20020a25b0a1000000b009433a21be0dso2105466ybj.19 for ; Sun, 19 Feb 2023 01:31:53 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:from:to:cc:subject:date :message-id:reply-to; bh=KQfysMeKlgE+jIUjOmzXBqFSCTlH0ae4syUK6W7l0PA=; b=NcbhWt38wnTvWPHF6Hxj7T1OVnt724H9kOafS5s205R0YWzKdm4NnWhe3P3Ey/IPuG iuHgcCGLEdD4pm88yr40lhL2K2xMef++OsRl7iXiYaRQcG5ZSZRYFc/qVnhbovv3p3S8 gEyi3GcEIkuS0MsgI0Eb7UzCtaMEuAbUCMqyASSfqnFkv1OEXAoNeji43Hpjmppt5r8k /JVKN51dhjKF2tt1DBrFpJCTKVfhXFOggbzHkNYR0WfTN5kdTAb6Vz5W4+CwtlXF1TW6 TXSH3kisc7BNpGrqntQQ4eHCNHTxIKRzoGfWIy+6BweAaH9TZbOZJet+hUG8DZSUvcTJ rX/Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=KQfysMeKlgE+jIUjOmzXBqFSCTlH0ae4syUK6W7l0PA=; b=qdFMtSqBQAtYlqQzeWSmx6WaNjB9l7l9SNC/TYuE95hy+XAH5JAYBLQjmUiPOhybYZ 7If5IcF/LA6/7y+Ppl42Afui2MzOCq3zEqPyAleg9bN7PnLs39rx/mAGZc1AAtIvJJlp FJx6+F4XAbtwF/65N2SFFjTeNTFUn5ZPTih5ka3trlgNC6ZIG5lmnVmCuCYWmy57FdKZ /vx160fVdZffjr1rFlay2g+YEvVtqx3GBnY7/jJVlyzIhRGJoq6CcLSfvykVHHTawUcA Kl9BijLg2JPyaTe0xhr2WBrunTR71LTCqx4BHM2bQffnV37VRoxJFt7T0WpHsff+94iF Q2Ww== X-Gm-Message-State: AO0yUKXSISjLEtz9GYxZD3gISFBPgj0Zo8bRDaAVj2O3s4wjakDeTOTx foXF3/Cvf/0Svxpz4nuy2FBf0K3GGneB X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a05:6902:11cd:b0:8a3:d147:280b with SMTP id n13-20020a05690211cd00b008a3d147280bmr180885ybu.3.1676799099000; Sun, 19 Feb 2023 01:31:39 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:16 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-20-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 19/51] perf vendor events intel: Refresh icelake events From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758251448551016038?= X-GMAIL-MSGID: =?utf-8?q?1758251448551016038?= Update the icelake events from 1.15 to 1.17. Generation was done using https://github.com/intel/perfmon. Notable changes are new events and event descriptions, TMA metrics are updated to version 4.5, TMA info metrics are renamed from their node name to be lower case and prefixed by tma_info_, MetricThreshold expressions are added, smi_cost and transaction metric groups are added replicating existing hard coded metrics in stat-shadow. Signed-off-by: Ian Rogers --- .../pmu-events/arch/x86/icelake/cache.json | 16 + .../arch/x86/icelake/floating-point.json | 31 + .../arch/x86/icelake/icl-metrics.json | 1932 ++++++++++------- .../pmu-events/arch/x86/icelake/pipeline.json | 23 +- .../arch/x86/icelake/uncore-other.json | 56 + tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- 6 files changed, 1235 insertions(+), 825 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/icelake/cache.json b/tools/perf/pmu-events/arch/x86/icelake/cache.json index bc6587391760..a9174a0837f0 100644 --- a/tools/perf/pmu-events/arch/x86/icelake/cache.json +++ b/tools/perf/pmu-events/arch/x86/icelake/cache.json @@ -154,6 +154,22 @@ "SampleAfterValue": "200003", "UMask": "0x21" }, + { + "BriefDescription": "All requests that miss L2 cache. This event is not supported on ICL and ICX products, only supported on RKL products.", + "EventCode": "0x24", + "EventName": "L2_RQSTS.MISS", + "PublicDescription": "Counts all requests that miss L2 cache. This event is not supported on ICL and ICX products, only supported on RKL products.", + "SampleAfterValue": "200003", + "UMask": "0x3f" + }, + { + "BriefDescription": "All L2 requests. This event is not supported on ICL and ICX products, only supported on RKL products.", + "EventCode": "0x24", + "EventName": "L2_RQSTS.REFERENCES", + "PublicDescription": "Counts all L2 requests. This event is not supported on ICL and ICX products, only supported on RKL products.", + "SampleAfterValue": "200003", + "UMask": "0xff" + }, { "BriefDescription": "RFO requests that hit L2 cache", "EventCode": "0x24", diff --git a/tools/perf/pmu-events/arch/x86/icelake/floating-point.json b/tools/perf/pmu-events/arch/x86/icelake/floating-point.json index 655342dadac6..85c26c889088 100644 --- a/tools/perf/pmu-events/arch/x86/icelake/floating-point.json +++ b/tools/perf/pmu-events/arch/x86/icelake/floating-point.json @@ -39,6 +39,14 @@ "SampleAfterValue": "100003", "UMask": "0x20" }, + { + "BriefDescription": "Number of SSE/AVX computational 128-bit packed single and 256-bit packed double precision FP instructions retired; some instructions will count twice as noted below. Each count represents 2 or/and 4 computation operations, 1 for each element. Applies to SSE* and AVX* packed single precision and packed double precision FP instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB count twice as they perform 2 calculations per element.", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.4_FLOPS", + "PublicDescription": "Number of SSE/AVX computational 128-bit packed single precision and 256-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 2 or/and 4 computation operations, one for each element. Applies to SSE* and AVX* packed single precision floating-point and packed double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.", + "SampleAfterValue": "100003", + "UMask": "0x18" + }, { "BriefDescription": "Counts number of SSE/AVX computational 512-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 8 computation operations, one for each element. Applies to SSE* and AVX* packed double precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT14 RCP14 FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.", "EventCode": "0xc7", @@ -55,6 +63,22 @@ "SampleAfterValue": "100003", "UMask": "0x80" }, + { + "BriefDescription": "Number of SSE/AVX computational 256-bit packed single precision and 512-bit packed double precision FP instructions retired; some instructions will count twice as noted below. Each count represents 8 computation operations, 1 for each element. Applies to SSE* and AVX* packed single precision and double precision FP instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT RSQRT RSQRT14 RCP RCP14 DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB count twice as they perform 2 calculations per element.", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.8_FLOPS", + "PublicDescription": "Number of SSE/AVX computational 256-bit packed single precision and 512-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 8 computation operations, one for each element. Applies to SSE* and AVX* packed single precision and double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT RSQRT RSQRT14 RCP RCP14 DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.", + "SampleAfterValue": "100003", + "UMask": "0x60" + }, + { + "BriefDescription": "Number of SSE/AVX computational scalar floating-point instructions retired; some instructions will count twice as noted below. Applies to SSE* and AVX* scalar, double and single precision floating-point: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform multiple calculations per element.", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.SCALAR", + "PublicDescription": "Number of SSE/AVX computational scalar single precision and double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 1 computational operation. Applies to SSE* and AVX* scalar single precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT RCP FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.", + "SampleAfterValue": "1000003", + "UMask": "0x3" + }, { "BriefDescription": "Counts number of SSE/AVX computational scalar double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 1 computational operation. Applies to SSE* and AVX* scalar double precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.", "EventCode": "0xc7", @@ -70,5 +94,12 @@ "PublicDescription": "Number of SSE/AVX computational scalar single precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 1 computational operation. Applies to SSE* and AVX* scalar single precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT RCP FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.", "SampleAfterValue": "100003", "UMask": "0x2" + }, + { + "BriefDescription": "Number of any Vector retired FP arithmetic instructions", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.VECTOR", + "SampleAfterValue": "1000003", + "UMask": "0xfc" } ] diff --git a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json index 2ad36e00d289..f45ae3483df4 100644 --- a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json +++ b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json @@ -1,1230 +1,1518 @@ [ { - "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", - "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / SLOTS", - "MetricGroup": "PGO;TopdownL1;tma_L1_group", - "MetricName": "tma_frontend_bound", - "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Machine_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", - "MetricExpr": "(5 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE - INT_MISC.UOP_DROPPING) / SLOTS", - "MetricGroup": "Frontend;TopdownL2;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_latency", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", + "BriefDescription": "C10 residency percent per package", + "MetricExpr": "cstate_pkg@c10\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C10_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses", - "MetricExpr": "ICACHE_16B.IFDATA_STALL / CLKS", - "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_icache_misses", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", + "BriefDescription": "C2 residency percent per package", + "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C2_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses", - "MetricExpr": "ICACHE_64B.IFTAG_STALL / CLKS", - "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_itlb_misses", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", + "BriefDescription": "C3 residency percent per package", + "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", - "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / CLKS + tma_unknown_branches", - "MetricGroup": "FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_branch_resteers", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "BriefDescription": "C6 residency percent per core", + "MetricExpr": "cstate_core@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage", - "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / CLKS", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL4;tma_branch_resteers_group", - "MetricName": "tma_mispredicts_resteers", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES", + "BriefDescription": "C6 residency percent per package", + "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears", - "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / CLKS", - "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_branch_resteers_group", - "MetricName": "tma_clears_resteers", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES", + "BriefDescription": "C7 residency percent per core", + "MetricExpr": "cstate_core@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears", - "MetricExpr": "10 * BACLEARS.ANY / CLKS", - "MetricGroup": "BigFoot;FetchLat;TopdownL4;tma_branch_resteers_group", - "MetricName": "tma_unknown_branches", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (First fetch or hitting BPU capacity limit). Sample with: BACLEARS.ANY", + "BriefDescription": "C7 residency percent per package", + "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / CLKS", - "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_dsb_switches", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS", + "BriefDescription": "C8 residency percent per package", + "MetricExpr": "cstate_pkg@c8\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C8_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", - "MetricExpr": "ILD_STALL.LCP / CLKS", - "MetricGroup": "FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_lcp", - "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs.", + "BriefDescription": "C9 residency percent per package", + "MetricExpr": "cstate_pkg@c9\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C9_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", - "MetricExpr": "3 * IDQ.MS_SWITCHES / CLKS", - "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_ms_switches", - "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES", - "ScaleUnit": "100%" + "BriefDescription": "Uncore frequency per die [GHZ]", + "MetricExpr": "tma_info_socket_clks / #num_dies / duration_time / 1e9", + "MetricGroup": "SoC", + "MetricName": "UNCORE_FREQ" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", - "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)", - "MetricGroup": "FetchBW;Frontend;TopdownL2;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_bandwidth", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS", + "BriefDescription": "Percentage of cycles spent in System Management Interrupts.", + "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)", + "MetricGroup": "smi", + "MetricName": "smi_cycles", + "MetricThreshold": "smi_cycles > 0.1", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", - "MetricExpr": "(IDQ.MITE_CYCLES_ANY - IDQ.MITE_CYCLES_OK) / CORE_CLKS / 2", - "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_fetch_bandwidth_group", - "MetricName": "tma_mite", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sample with: FRONTEND_RETIRED.ANY_DSB_MISS", - "ScaleUnit": "100%" + "BriefDescription": "Number of SMI interrupts.", + "MetricExpr": "msr@smi@", + "MetricGroup": "smi", + "MetricName": "smi_num", + "ScaleUnit": "1SMI#" }, { - "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=2@) / CORE_CLKS", - "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_mite_group", - "MetricName": "tma_decoder0_alone", + "BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset", + "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_4k_aliasing", + "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles where (only) 4 uops were delivered by the MITE pipeline", - "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=4@ - cpu@IDQ.MITE_UOPS\\,cmask\\=5@) / CLKS", - "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_mite_group", - "MetricName": "tma_mite_4wide", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", + "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + UOPS_DISPATCHED.PORT_5 + UOPS_DISPATCHED.PORT_6) / (4 * tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_alu_op_utilization", + "MetricThreshold": "tma_alu_op_utilization > 0.6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", - "MetricExpr": "(IDQ.DSB_CYCLES_ANY - IDQ.DSB_CYCLES_OK) / CORE_CLKS / 2", - "MetricGroup": "DSB;FetchBW;TopdownL3;tma_fetch_bandwidth_group", - "MetricName": "tma_dsb", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", + "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", + "MetricExpr": "100 * ASSISTS.ANY / tma_info_slots", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_assists", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: ASSISTS.ANY", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to LSD (Loop Stream Detector) unit", - "MetricExpr": "(LSD.CYCLES_ACTIVE - LSD.CYCLES_OK) / CORE_CLKS / 2", - "MetricGroup": "FetchBW;LSD;TopdownL3;tma_fetch_bandwidth_group", - "MetricName": "tma_lsd", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to LSD (Loop Stream Detector) unit. LSD typically does well sustaining Uop supply. However; in some rare cases; optimal uop-delivery could not be reached for small loops whose size (in terms of number of uops) does not suit well the LSD structure.", + "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", + "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_backend_bound", + "MetricThreshold": "tma_backend_bound > 0.2", + "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)", - "MetricGroup": "TopdownL1;tma_L1_group", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_bad_speculation", + "MetricThreshold": "tma_bad_speculation > 0.15", "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", "ScaleUnit": "100%" }, + { + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions.", + "MetricExpr": "tma_light_operations * BR_INST_RETIRED.ALL_BRANCHES / (tma_retiring * tma_info_slots)", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_branch_instructions", + "MetricThreshold": "tma_branch_instructions > 0.1 & tma_light_operations > 0.6", + "ScaleUnit": "100%" + }, { "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL2;tma_L2_group;tma_bad_speculation_group", + "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM", "MetricName": "tma_branch_mispredicts", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_info_branch_misprediction_cost, tma_info_mispredictions, tma_mispredicts_resteers", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", - "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)", - "MetricGroup": "BadSpec;MachineClears;TopdownL2;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_machine_clears", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", + "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_clks + tma_unknown_branches", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_branch_resteers", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", "ScaleUnit": "100%" }, { - "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", - "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group", - "MetricName": "tma_backend_bound", - "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS", + "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", + "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_cisc", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", - "MetricExpr": "(CYCLE_ACTIVITY.STALLS_MEM_ANY + EXE_ACTIVITY.BOUND_ON_STORES) / (CYCLE_ACTIVITY.STALLS_TOTAL + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL) + EXE_ACTIVITY.BOUND_ON_STORES) * tma_backend_bound", - "MetricGroup": "Backend;TopdownL2;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_memory_bound", - "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears", + "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_clks", + "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC", + "MetricName": "tma_clears_resteers", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", - "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_MISS) / CLKS, 0)", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l1_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB_HIT_PS", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(29 * tma_info_average_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM + 23.5 * tma_info_average_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_contested_accesses", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / CLKS", - "MetricGroup": "MemoryTLB;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_dtlb_load", - "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS", + "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", + "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)", + "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) DTLB was missed by load accesses, that later on hit in second-level TLB (STLB)", - "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", - "MetricGroup": "MemoryTLB;TopdownL5;tma_dtlb_load_group", - "MetricName": "tma_load_stlb_hit", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "23.5 * tma_info_average_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_data_sharing", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by load accesses, performing a hardware page walk", - "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / CLKS", - "MetricGroup": "MemoryTLB;TopdownL5;tma_dtlb_load_group", - "MetricName": "tma_load_stlb_miss", + "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=2@) / tma_info_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0;tma_mite_group", + "MetricName": "tma_decoder0_alone", + "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 5 > 0.35))", + "PublicDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder. Related metrics: tma_few_uops_instructions", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", - "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_store_fwd_blk", - "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", + "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", + "MetricExpr": "ARITH.DIVIDER_ACTIVE / tma_info_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_divider", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", - "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS.ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10 * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO))) / CLKS", - "MetricGroup": "Offcore;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_lock_latency", - "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOADS_PS", + "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_clks + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_clks - tma_l2_bound", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary", - "MetricExpr": "Load_Miss_Real_Latency * LD_BLOCKS.NO_SR / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_split_loads", - "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", + "MetricExpr": "(IDQ.DSB_CYCLES_ANY - IDQ.DSB_CYCLES_OK) / tma_info_core_clks / 2", + "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_dsb", + "MetricThreshold": "tma_dsb > 0.15 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 5 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset", - "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_4k_aliasing", - "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_clks", + "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_dsb_switches", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_dsb_misses, tma_info_iptb, tma_lcp", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", - "MetricExpr": "L1D_PEND_MISS.FB_FULL / CLKS", - "MetricGroup": "MemoryBW;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_fb_full", - "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory).", + "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", + "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group", + "MetricName": "tma_dtlb_load", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store, tma_info_memory_data_tlbs", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads", - "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + L1D_PEND_MISS.FB_FULL_PERIODS) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / CLKS)", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l2_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT_PS", + "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", + "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group", + "MetricName": "tma_dtlb_store", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, tma_info_memory_data_tlbs", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", - "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STALLS_L3_MISS) / CLKS", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l3_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing", + "MetricExpr": "32.5 * tma_info_average_frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group", + "MetricName": "tma_false_sharing", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", - "MetricExpr": "(29 * Average_Frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM + 23.5 * Average_Frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CLKS", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_contested_accesses", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS", + "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", + "MetricExpr": "L1D_PEND_MISS.FB_FULL / tma_info_clks", + "MetricGroup": "MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricName": "tma_fb_full", + "MetricThreshold": "tma_fb_full > 0.3", + "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_dram_bw_use, tma_info_memory_bandwidth, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", - "MetricExpr": "23.5 * Average_Frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CLKS", - "MetricGroup": "Offcore;Snoop;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_data_sharing", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT_PS", + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", + "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)", + "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 5 > 0.35", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_dsb_coverage, tma_info_dsb_misses, tma_info_iptb, tma_lcp", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", - "MetricExpr": "9 * Average_Frequency * MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CLKS", - "MetricGroup": "MemoryLat;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_l3_hit_latency", - "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", + "MetricExpr": "(5 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE - INT_MISC.UOP_DROPPING) / tma_info_slots", + "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", - "MetricExpr": "L1D_PEND_MISS.L2_STALL / CLKS", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_sq_full", - "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). The Super Queue is used for requests to access the L2 cache or to go out to the Uncore.", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops", + "MetricExpr": "tma_heavy_operations - tma_microcode_sequencer", + "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueD0", + "MetricName": "tma_few_uops_instructions", + "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", - "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / CLKS + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / CLKS - tma_l2_bound", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_dram_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS_PS", + "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", + "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_fp_arith", + "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=4@) / CLKS", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_dram_bound_group", - "MetricName": "tma_mem_bandwidth", - "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that).", + "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired", + "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", + "MetricName": "tma_fp_scalar", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / CLKS - tma_mem_bandwidth", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_dram_bound_group", - "MetricName": "tma_mem_latency", - "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that).", + "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths", + "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xfc@ / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", + "MetricName": "tma_fp_vector", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", - "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / CLKS", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_store_bound", - "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES_PS", + "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", + "MetricName": "tma_fp_vector_128b", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", - "MetricExpr": "(L2_RQSTS.RFO_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / CLKS", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_store_bound_group", - "MetricName": "tma_store_latency", - "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full)", + "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", + "MetricName": "tma_fp_vector_256b", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing", - "MetricExpr": "32.5 * Average_Frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM / CLKS", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_store_bound_group", - "MetricName": "tma_false_sharing", - "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM", + "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", + "MetricName": "tma_fp_vector_512b", + "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents rate of split store accesses", - "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / CORE_CLKS", - "MetricGroup": "TopdownL4;tma_store_bound_group", - "MetricName": "tma_split_stores", - "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS", + "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", + "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_slots", + "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores", - "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / CLKS", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_store_bound_group", - "MetricName": "tma_streaming_stores", - "PublicDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should Streaming stores be a bottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences", + "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECODED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=1@) / IDQ.MITE_UOPS", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_heavy_operations", + "MetricThreshold": "tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", - "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / CORE_CLKS", - "MetricGroup": "MemoryTLB;TopdownL4;tma_store_bound_group", - "MetricName": "tma_dtlb_store", - "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses", + "MetricExpr": "ICACHE_16B.IFDATA_STALL / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_icache_misses", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the TLB was missed by store accesses, hitting in the second-level TLB (STLB)", - "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", - "MetricGroup": "MemoryTLB;TopdownL5;tma_dtlb_store_group", - "MetricName": "tma_store_stlb_hit", - "ScaleUnit": "100%" + "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", + "MetricExpr": "tma_info_turbo_utilization * TSC / 1e9 / duration_time", + "MetricGroup": "Power;Summary", + "MetricName": "tma_info_average_frequency" }, { - "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk", - "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / CORE_CLKS", - "MetricGroup": "MemoryTLB;TopdownL5;tma_dtlb_store_group", - "MetricName": "tma_store_stlb_miss", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", + "MetricGroup": "BigFoot;Fed;Frontend;IcMiss;MemoryTLB;tma_issueBC", + "MetricName": "tma_info_big_code", + "MetricThreshold": "tma_info_big_code > 20", + "PublicDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses). Related metrics: tma_info_branching_overhead" }, { - "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", - "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)", - "MetricGroup": "Backend;Compute;TopdownL2;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_core_bound", - "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", - "ScaleUnit": "100%" + "BriefDescription": "Branch instructions per taken branch.", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_bptkbranch" }, { - "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", - "MetricExpr": "ARITH.DIVIDER_ACTIVE / CLKS", - "MetricGroup": "TopdownL3;tma_core_bound_group", - "MetricName": "tma_divider", - "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_ACTIVE", - "ScaleUnit": "100%" + "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)", + "MetricExpr": "(tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) * tma_info_slots / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_branch_misprediction_cost", + "PublicDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_branch_mispredicts, tma_info_mispredictions, tma_mispredicts_resteers" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", - "MetricExpr": "((cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=0x80@ + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / CLKS if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL) / CLKS)", - "MetricGroup": "PortsUtil;TopdownL3;tma_core_bound_group", - "MetricName": "tma_ports_utilization", - "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of branch related instructions (used for program control-flow including function calls)", + "MetricExpr": "100 * ((BR_INST_RETIRED.COND + 3 * BR_INST_RETIRED.NEAR_CALL + (BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL)) / tma_info_slots)", + "MetricGroup": "Ret;tma_issueBC", + "MetricName": "tma_info_branching_overhead", + "MetricThreshold": "tma_info_branching_overhead > 10", + "PublicDescription": "Total pipeline cost of branch related instructions (used for program control-flow including function calls). Related metrics: tma_info_big_code" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=0x80@ / CLKS + tma_serializing_operation * (CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY) / CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_0", - "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of branches that are CALL or RET", + "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_RETURN) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_callret" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations", - "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / CLKS", - "MetricGroup": "TopdownL5;tma_ports_utilized_0_group", - "MetricName": "tma_serializing_operation", - "PublicDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations. Instructions like CPUID; WRMSR or LFENCE serialize the out-of-order execution which may limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD", - "ScaleUnit": "100%" + "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "Pipeline", + "MetricName": "tma_info_clks" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions", - "MetricExpr": "140 * MISC_RETIRED.PAUSE_INST / CLKS", - "MetricGroup": "TopdownL6;tma_serializing_operation_group", - "MetricName": "tma_slow_pause", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions. Sample with: MISC_RETIRED.PAUSE_INST", - "ScaleUnit": "100%" + "BriefDescription": "STLB (2nd level TLB) code speculative misses per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", + "MetricGroup": "Fed;MemoryTLB", + "MetricName": "tma_info_code_stlb_mpki" }, { - "BriefDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued", - "MetricExpr": "CLKS * UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY", - "MetricGroup": "TopdownL5;tma_ports_utilized_0_group", - "MetricName": "tma_mixing_vectors", - "PublicDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued. Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic.", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of branches that are non-taken conditionals", + "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_cond_nt" }, { - "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_1", - "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of branches that are taken conditionals", + "MetricExpr": "BR_INST_RETIRED.COND_TAKEN / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_cond_tk" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_2", - "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Sample with: EXE_ACTIVITY.2_PORTS_UTIL", - "ScaleUnit": "100%" + "BriefDescription": "Probability of Core Bound bottleneck hidden by SMT-profiling artifacts", + "MetricExpr": "(100 * (1 - tma_core_bound / tma_ports_utilization if tma_core_bound < tma_ports_utilization else 1) if tma_info_smt_2t_utilization > 0.5 else 0)", + "MetricGroup": "Cor;SMT", + "MetricName": "tma_info_core_bound_likely", + "MetricThreshold": "tma_info_core_bound_likely > 0.5" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_3m", - "PublicDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Sample with: UOPS_EXECUTED.CYCLES_GE_3", - "ScaleUnit": "100%" + "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", + "MetricExpr": "CPU_CLK_UNHALTED.DISTRIBUTED", + "MetricGroup": "SMT", + "MetricName": "tma_info_core_clks" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", - "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + UOPS_DISPATCHED.PORT_5 + UOPS_DISPATCHED.PORT_6) / (4 * CORE_CLKS)", - "MetricGroup": "TopdownL5;tma_ports_utilized_3m_group", - "MetricName": "tma_alu_op_utilization", - "ScaleUnit": "100%" + "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_core_clks", + "MetricGroup": "Ret;SMT;TmaL1;tma_L1_group", + "MetricName": "tma_info_coreipc" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch) Sample with: UOPS_DISPATCHED.PORT_0", - "MetricExpr": "UOPS_DISPATCHED.PORT_0 / CORE_CLKS", - "MetricGroup": "Compute;TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_0", - "ScaleUnit": "100%" + "BriefDescription": "Cycles Per Instruction (per Logical Processor)", + "MetricExpr": "1 / tma_info_ipc", + "MetricGroup": "Mem;Pipeline", + "MetricName": "tma_info_cpi" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU) Sample with: UOPS_DISPATCHED.PORT_1", - "MetricExpr": "UOPS_DISPATCHED.PORT_1 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_1", - "ScaleUnit": "100%" + "BriefDescription": "Average CPU Utilization", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": "HPC;Summary", + "MetricName": "tma_info_cpu_utilization" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU) Sample with: UOPS_DISPATCHED.PORT_5", - "MetricExpr": "UOPS_DISPATCHED.PORT_5 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_5", - "ScaleUnit": "100%" + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD / OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_data_l2_mlp" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU) Sample with: UOPS_DISPATCHED.PORT_6", - "MetricExpr": "UOPS_DISPATCHED.PORT_6 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_6", - "ScaleUnit": "100%" + "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", + "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_REQUESTS.ALL) / 1e6 / duration_time / 1e3", + "MetricGroup": "HPC;Mem;MemoryBW;SoC;tma_issueBW", + "MetricName": "tma_info_dram_bw_use", + "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_memory_bandwidth, tma_mem_bandwidth, tma_sq_full" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations Sample with: UOPS_DISPATCHED.PORT_2_3", - "MetricExpr": "UOPS_DISPATCHED.PORT_2_3 / (2 * CORE_CLKS)", - "MetricGroup": "TopdownL5;tma_ports_utilized_3m_group", - "MetricName": "tma_load_op_utilization", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_dsb_coverage", + "MetricThreshold": "tma_info_dsb_coverage < 0.7 & tma_info_ipc / 5 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_misses, tma_info_iptb, tma_lcp" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations Sample with: UOPS_DISPATCHED.PORT_7_8", - "MetricExpr": "(UOPS_DISPATCHED.PORT_4_9 + UOPS_DISPATCHED.PORT_7_8) / (4 * CORE_CLKS)", - "MetricGroup": "TopdownL5;tma_ports_utilized_3m_group", - "MetricName": "tma_store_op_utilization", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tma_lsd + tma_mite))", + "MetricGroup": "DSBmiss;Fed;tma_issueFB", + "MetricName": "tma_info_dsb_misses", + "MetricThreshold": "tma_info_dsb_misses > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_iptb, tma_lcp" }, { - "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group", - "MetricName": "tma_retiring", - "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS", - "ScaleUnit": "100%" + "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details.", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=1\\,edge@", + "MetricGroup": "DSBmiss", + "MetricName": "tma_info_dsb_switch_cost" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", - "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_retiring_group", - "MetricName": "tma_light_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", - "ScaleUnit": "100%" + "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-thread", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@", + "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", + "MetricName": "tma_info_execute" }, { - "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", - "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", - "MetricGroup": "HPC;TopdownL3;tma_light_operations_group", - "MetricName": "tma_fp_arith", - "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", - "ScaleUnit": "100%" + "BriefDescription": "The ratio of Executed- by Issued-Uops", + "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", + "MetricGroup": "Cor;Pipeline", + "MetricName": "tma_info_execute_per_issue", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage." }, { - "BriefDescription": "This metric serves as an approximation of legacy x87 usage", - "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD", - "MetricGroup": "Compute;TopdownL4;tma_fp_arith_group", - "MetricName": "tma_x87_use", - "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", - "ScaleUnit": "100%" + "BriefDescription": "Fill Buffer (FB) hits per kilo instructions for retired demand loads (L1D misses that merge into ongoing miss-handling entries)", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_fb_hpki" }, { - "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE) / (tma_retiring * SLOTS)", - "MetricGroup": "Compute;Flops;TopdownL4;tma_fp_arith_group", - "MetricName": "tma_fp_scalar", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting.", - "ScaleUnit": "100%" + "BriefDescription": "Average number of Uops issued by front-end when it issued something", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=1@", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_fetch_upc" }, { - "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths", - "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / (tma_retiring * SLOTS)", - "MetricGroup": "Compute;Flops;TopdownL4;tma_fp_arith_group", - "MetricName": "tma_fp_vector", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting.", - "ScaleUnit": "100%" + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * cpu@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE\\,umask\\=0x18@ + 8 * cpu@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE\\,umask\\=0x60@ + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / tma_info_core_clks", + "MetricGroup": "Flops;Ret", + "MetricName": "tma_info_flopc" }, { - "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors", - "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * SLOTS)", - "MetricGroup": "Compute;Flops;TopdownL5;tma_fp_vector_group", - "MetricName": "tma_fp_vector_128b", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting.", - "ScaleUnit": "100%" + "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xfc@) / (2 * tma_info_core_clks)", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_fp_arith_utilization", + "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)." }, { - "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors", - "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / (tma_retiring * SLOTS)", - "MetricGroup": "Compute;Flops;TopdownL5;tma_fp_vector_group", - "MetricName": "tma_fp_vector_256b", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting.", - "ScaleUnit": "100%" + "BriefDescription": "Giga Floating Point Operations Per Second", + "MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * cpu@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE\\,umask\\=0x18@ + 8 * cpu@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE\\,umask\\=0x60@ + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1e9 / duration_time", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_gflops", + "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine." }, { - "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors", - "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / (tma_retiring * SLOTS)", - "MetricGroup": "Compute;Flops;TopdownL5;tma_fp_vector_group", - "MetricName": "tma_fp_vector_512b", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors. May overcount due to FMA double counting.", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", + "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", + "MetricName": "tma_info_ic_misses", + "MetricThreshold": "tma_info_ic_misses > 5", + "PublicDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck. Related metrics: " }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations -- uops for memory load or store accesses.", - "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_RETIRED.ANY", - "MetricGroup": "Pipeline;TopdownL3;tma_light_operations_group", - "MetricName": "tma_memory_operations", - "ScaleUnit": "100%" + "BriefDescription": "Average Latency for L1 instruction cache misses", + "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=1\\,edge@", + "MetricGroup": "Fed;FetchLat;IcMiss", + "MetricName": "tma_info_icache_miss_latency" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions.", - "MetricExpr": "tma_light_operations * BR_INST_RETIRED.ALL_BRANCHES / (tma_retiring * SLOTS)", - "MetricGroup": "Pipeline;TopdownL3;tma_light_operations_group", - "MetricName": "tma_branch_instructions", - "ScaleUnit": "100%" + "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else UOPS_EXECUTED.CORE_CYCLES_GE_1)", + "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", + "MetricName": "tma_info_ilp" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions", - "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_retiring * SLOTS)", - "MetricGroup": "Pipeline;TopdownL3;tma_light_operations_group", - "MetricName": "tma_nop_instructions", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions. Compilers often use NOPs for certain address alignments - e.g. start address of a function or loop body. Sample with: INST_RETIRED.NOP", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks", + "MetricExpr": "100 * (tma_frontend_bound - tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) - tma_info_big_code", + "MetricGroup": "Fed;FetchBW;Frontend", + "MetricName": "tma_info_instruction_fetch_bw", + "MetricThreshold": "tma_info_instruction_fetch_bw > 20" }, { - "BriefDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes. May undercount due to FMA double counting", - "MetricExpr": "max(0, tma_light_operations - (tma_fp_arith + tma_memory_operations + tma_branch_instructions + tma_nop_instructions))", - "MetricGroup": "Pipeline;TopdownL3;tma_light_operations_group", - "MetricName": "tma_other_light_ops", - "ScaleUnit": "100%" + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Summary;TmaL1;tma_L1_group", + "MetricName": "tma_info_instructions", + "PublicDescription": "Total number of retired Instructions. Sample with: INST_RETIRED.PREC_DIST" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences", - "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECODED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=1@) / IDQ.MITE_UOPS", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_retiring_group", - "MetricName": "tma_heavy_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences. This highly-correlates with the uop length of these instructions/sequences.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xfc@)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_iparith", + "MetricThreshold": "tma_info_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). May undercount due to FMA double counting. Approximated prior to BDW." }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops", - "MetricExpr": "tma_heavy_operations - tma_microcode_sequencer", - "MetricGroup": "TopdownL3;tma_heavy_operations_group", - "MetricName": "tma_few_uops_instructions", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number of uops in such instructions.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx128", + "MetricThreshold": "tma_info_iparith_avx128 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", - "MetricExpr": "tma_retiring * SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / SLOTS", - "MetricGroup": "MicroSeq;TopdownL3;tma_heavy_operations_group", - "MetricName": "tma_microcode_sequencer", - "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx256", + "MetricThreshold": "tma_info_iparith_avx256 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", - "MetricExpr": "100 * ASSISTS.ANY / SLOTS", - "MetricGroup": "TopdownL4;tma_microcode_sequencer_group", - "MetricName": "tma_assists", - "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: ASSISTS.ANY", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx512", + "MetricThreshold": "tma_info_iparith_avx512 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", - "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", - "MetricGroup": "TopdownL4;tma_microcode_sequencer_group", - "MetricName": "tma_cisc", - "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_iparith_scalar_dp", + "MetricThreshold": "tma_info_iparith_scalar_dp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks", - "MetricExpr": "100 * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", - "MetricGroup": "Bad;BadSpec;BrMispredicts", - "MetricName": "Mispredictions" + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_iparith_scalar_sp", + "MetricThreshold": "tma_info_iparith_scalar_sp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks", - "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full))) + tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk))", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "Memory_Bandwidth" + "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Branches;Fed;InsType", + "MetricName": "tma_info_ipbranch", + "MetricThreshold": "tma_info_ipbranch < 8" }, { - "BriefDescription": "Total pipeline cost of Memory Latency related bottlenecks (external memory and off-core caches)", - "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound))", - "MetricGroup": "Mem;MemoryLat;Offcore", - "MetricName": "Memory_Latency" + "BriefDescription": "Instructions Per Cycle (per Logical Processor)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_clks", + "MetricGroup": "Ret;Summary", + "MetricName": "tma_info_ipc" }, { - "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)", - "MetricExpr": "100 * tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming_stores)))", - "MetricGroup": "Mem;MemoryTLB;Offcore", - "MetricName": "Memory_Data_TLBs" + "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_ipcall", + "MetricThreshold": "tma_info_ipcall < 200" }, { - "BriefDescription": "Total pipeline cost of branch related instructions (used for program control-flow including function calls)", - "MetricExpr": "100 * ((BR_INST_RETIRED.COND + 3 * BR_INST_RETIRED.NEAR_CALL + (BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL)) / SLOTS)", - "MetricGroup": "Ret", - "MetricName": "Branching_Overhead" + "BriefDescription": "Instructions per non-speculative DSB miss (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", + "MetricGroup": "DSBmiss;Fed", + "MetricName": "tma_info_ipdsb_miss_ret", + "MetricThreshold": "tma_info_ipdsb_miss_ret < 50" }, { - "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", - "MetricGroup": "BigFoot;Fed;Frontend;IcMiss;MemoryTLB", - "MetricName": "Big_Code" + "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", + "MetricGroup": "Branches;OS", + "MetricName": "tma_info_ipfarbranch", + "MetricThreshold": "tma_info_ipfarbranch < 1e6" }, { - "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_frontend_bound - tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) - Big_Code", - "MetricGroup": "Fed;FetchBW;Frontend", - "MetricName": "Instruction_Fetch_BW" + "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * cpu@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE\\,umask\\=0x18@ + 8 * cpu@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE\\,umask\\=0x60@ + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_ipflop", + "MetricThreshold": "tma_info_ipflop < 10" }, { - "BriefDescription": "Instructions Per Cycle (per Logical Processor)", - "MetricExpr": "INST_RETIRED.ANY / CLKS", - "MetricGroup": "Ret;Summary", - "MetricName": "IPC" + "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipload", + "MetricThreshold": "tma_info_ipload < 3" }, { - "BriefDescription": "Uops Per Instruction", - "MetricExpr": "tma_retiring * SLOTS / INST_RETIRED.ANY", - "MetricGroup": "Pipeline;Ret;Retire", - "MetricName": "UPI" + "BriefDescription": "Instructions per retired mispredicts for conditional non-taken branches (lower number means higher occurrence rate).", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_cond_ntaken", + "MetricThreshold": "tma_info_ipmisp_cond_ntaken < 200" }, { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "tma_retiring * SLOTS / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW", - "MetricName": "UpTB" + "BriefDescription": "Instructions per retired mispredicts for conditional taken branches (lower number means higher occurrence rate).", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_cond_taken", + "MetricThreshold": "tma_info_ipmisp_cond_taken < 200" }, { - "BriefDescription": "Cycles Per Instruction (per Logical Processor)", - "MetricExpr": "1 / IPC", - "MetricGroup": "Mem;Pipeline", - "MetricName": "CPI" + "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_indirect", + "MetricThreshold": "tma_info_ipmisp_indirect < 1e3" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "Pipeline", - "MetricName": "CLKS" + "BriefDescription": "Instructions per retired mispredicts for return branches (lower number means higher occurrence rate).", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_ret", + "MetricThreshold": "tma_info_ipmisp_ret < 500" }, { - "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "TOPDOWN.SLOTS", - "MetricGroup": "tma_L1_group", - "MetricName": "SLOTS" + "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BadSpec;BrMispredicts", + "MetricName": "tma_info_ipmispredict", + "MetricThreshold": "tma_info_ipmispredict < 200" }, { - "BriefDescription": "Fraction of Physical Core issue-slots utilized by this Logical Processor", - "MetricExpr": "(SLOTS / (TOPDOWN.SLOTS / 2) if #SMT_on else 1)", - "MetricGroup": "SMT;tma_L1_group", - "MetricName": "Slots_Utilization" + "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipstore", + "MetricThreshold": "tma_info_ipstore < 8" }, { - "BriefDescription": "The ratio of Executed- by Issued-Uops", - "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", - "MetricGroup": "Cor;Pipeline", - "MetricName": "Execute_per_Issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage." + "BriefDescription": "Instructions per Software prefetch instruction (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umask\\=0xF@", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_ipswpf", + "MetricThreshold": "tma_info_ipswpf < 100" }, { - "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", - "MetricExpr": "INST_RETIRED.ANY / CORE_CLKS", - "MetricGroup": "Ret;SMT;tma_L1_group", - "MetricName": "CoreIPC" - }, - { - "BriefDescription": "Floating Point Operations Per Cycle", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE) + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / CORE_CLKS", - "MetricGroup": "Flops;Ret", - "MetricName": "FLOPc" - }, - { - "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)) / (2 * CORE_CLKS)", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "FP_Arith_Utilization", - "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)." + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", + "MetricName": "tma_info_iptb", + "MetricThreshold": "tma_info_iptb < 11", + "PublicDescription": "Instruction per taken branch. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_dsb_misses, tma_lcp" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else UOPS_EXECUTED.CORE_CYCLES_GE_1)", - "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", - "MetricName": "ILP" + "BriefDescription": "Instructions per speculative Unknown Branch Misprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_instructions / BACLEARS.ANY", + "MetricGroup": "Fed", + "MetricName": "tma_info_ipunknown_branch" }, { - "BriefDescription": "Probability of Core Bound bottleneck hidden by SMT-profiling artifacts", - "MetricExpr": "((1 - tma_core_bound / tma_ports_utilization if tma_core_bound < tma_ports_utilization else 1) if SMT_2T_Utilization > 0.5 else 0)", - "MetricGroup": "Cor;SMT", - "MetricName": "Core_Bound_Likely" + "BriefDescription": "Fraction of branches that are unconditional (direct or indirect) jumps", + "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_jump" }, { - "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", - "MetricExpr": "CPU_CLK_UNHALTED.DISTRIBUTED", - "MetricGroup": "SMT", - "MetricName": "CORE_CLKS" + "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_cpi" }, { - "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", - "MetricGroup": "InsType", - "MetricName": "IpLoad" + "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_utilization", + "MetricThreshold": "tma_info_kernel_utilization > 0.05" }, { - "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", - "MetricGroup": "InsType", - "MetricName": "IpStore" + "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw" }, { - "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Branches;Fed;InsType", - "MetricName": "IpBranch" + "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "tma_info_l1d_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw_1t" }, { - "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "IpCall" + "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l1mpki" }, { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO", - "MetricName": "IpTB" + "BriefDescription": "L1 cache true misses per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l1mpki_load" }, { - "BriefDescription": "Branch instructions per taken branch. ", - "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "BpTkBranch" + "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw" }, { - "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE) + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)", - "MetricGroup": "Flops;InsType", - "MetricName": "IpFLOP" + "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "tma_info_l2_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw_1t" }, { - "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE))", - "MetricGroup": "Flops;InsType", - "MetricName": "IpArith", - "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). May undercount due to FMA double counting. Approximated prior to BDW." + "BriefDescription": "L2 cache hits per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2hpki_load" }, { - "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE", - "MetricGroup": "Flops;FpScalar;InsType", - "MetricName": "IpArith_Scalar_SP", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheMisses;Mem", + "MetricName": "tma_info_l2mpki" }, { - "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE", - "MetricGroup": "Flops;FpScalar;InsType", - "MetricName": "IpArith_Scalar_DP", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all request types (including speculative)", + "MetricExpr": "1e3 * (OFFCORE_REQUESTS.ALL_DATA_RD - OFFCORE_REQUESTS.DEMAND_DATA_RD + L2_RQSTS.ALL_DEMAND_MISS + L2_RQSTS.SWPF_MISS) / tma_info_instructions", + "MetricGroup": "CacheMisses;Mem;Offcore", + "MetricName": "tma_info_l2mpki_all" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX128", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "L2 cache true code cacheline misses per kilo instruction", + "MetricExpr": "1e3 * FRONTEND_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_l2mpki_code" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX256", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "L2 cache speculative code cacheline misses per kilo instruction", + "MetricExpr": "1e3 * L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_l2mpki_code_all" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX512", - "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2mpki_load" }, { - "BriefDescription": "Instructions per Software prefetch instruction (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umask\\=0xF@", - "MetricGroup": "Prefetches", - "MetricName": "IpSWPF" + "BriefDescription": "Average per-core data access bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_l3_cache_access_bw" }, { - "BriefDescription": "Total number of retired Instructions Sample with: INST_RETIRED.PREC_DIST", - "MetricExpr": "INST_RETIRED.ANY", - "MetricGroup": "Summary;tma_L1_group", - "MetricName": "Instructions" + "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "tma_info_l3_cache_access_bw", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_l3_cache_access_bw_1t" }, { - "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", - "MetricExpr": "tma_retiring * SLOTS / cpu@UOPS_RETIRED.SLOTS\\,cmask\\=1@", - "MetricGroup": "Pipeline;Ret", - "MetricName": "Retire" + "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw" }, { - "BriefDescription": "", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@", - "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", - "MetricName": "Execute" + "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "tma_info_l3_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw_1t" }, { - "BriefDescription": "Average number of Uops issued by front-end when it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=1@", - "MetricGroup": "Fed;FetchBW", - "MetricName": "Fetch_UpC" + "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l3mpki" }, { - "BriefDescription": "Fraction of Uops delivered by the LSD (Loop Stream Detector; aka Loop Cache)", - "MetricExpr": "LSD.UOPS / (IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", - "MetricGroup": "Fed;LSD", - "MetricName": "LSD_Coverage" + "BriefDescription": "Average Latency for L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_load_l2_miss_latency" }, { - "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", - "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", - "MetricGroup": "DSB;Fed;FetchBW", - "MetricName": "DSB_Coverage" + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=1@", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_load_l2_mlp" }, { - "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details.", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=1\\,edge@", - "MetricGroup": "DSBmiss", - "MetricName": "DSB_Switch_Cost" + "BriefDescription": "Average Latency for L3 cache miss demand Loads", + "MetricExpr": "cpu@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,umask\\=0x10@ / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_load_l3_miss_latency" }, { - "BriefDescription": "Total penalty related to DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck.", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tma_lsd + tma_mite))", - "MetricGroup": "DSBmiss;Fed", - "MetricName": "DSB_Misses" + "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", + "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_RETIRED.L1_MISS + MEM_LOAD_RETIRED.FB_HIT)", + "MetricGroup": "Mem;MemoryBound;MemoryLat", + "MetricName": "tma_info_load_miss_real_latency" }, { - "BriefDescription": "Number of Instructions per non-speculative DSB miss (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", - "MetricGroup": "DSBmiss;Fed", - "MetricName": "IpDSB_Miss_Ret" + "BriefDescription": "STLB (2nd level TLB) data load speculative misses per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_load_stlb_mpki" }, { - "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BadSpec;BrMispredicts", - "MetricName": "IpMispredict" + "BriefDescription": "Fraction of Uops delivered by the LSD (Loop Stream Detector; aka Loop Cache)", + "MetricExpr": "LSD.UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "Fed;LSD", + "MetricName": "tma_info_lsd_coverage" }, { - "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)", - "MetricExpr": "(tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) * SLOTS / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BrMispredicts", - "MetricName": "Branch_Misprediction_Cost" + "BriefDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks", + "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full))) + tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk))", + "MetricGroup": "Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_info_memory_bandwidth", + "MetricThreshold": "tma_info_memory_bandwidth > 20", + "PublicDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_mem_bandwidth, tma_sq_full" }, { - "BriefDescription": "Fraction of branches that are non-taken conditionals", - "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches;CodeGen;PGO", - "MetricName": "Cond_NT" + "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)", + "MetricExpr": "100 * tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming_stores)))", + "MetricGroup": "Mem;MemoryTLB;Offcore;tma_issueTLB", + "MetricName": "tma_info_memory_data_tlbs", + "MetricThreshold": "tma_info_memory_data_tlbs > 20", + "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store" }, { - "BriefDescription": "Fraction of branches that are taken conditionals", - "MetricExpr": "BR_INST_RETIRED.COND_TAKEN / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches;CodeGen;PGO", - "MetricName": "Cond_TK" + "BriefDescription": "Total pipeline cost of Memory Latency related bottlenecks (external memory and off-core caches)", + "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound))", + "MetricGroup": "Mem;MemoryLat;Offcore;tma_issueLat", + "MetricName": "tma_info_memory_latency", + "MetricThreshold": "tma_info_memory_latency > 20", + "PublicDescription": "Total pipeline cost of Memory Latency related bottlenecks (external memory and off-core caches). Related metrics: tma_l3_hit_latency, tma_mem_latency" }, { - "BriefDescription": "Fraction of branches that are CALL or RET", - "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_RETURN) / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches", - "MetricName": "CallRet" + "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks", + "MetricExpr": "100 * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_mispredictions", + "MetricThreshold": "tma_info_mispredictions > 20", + "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_branch_misprediction_cost, tma_mispredicts_resteers" }, { - "BriefDescription": "Fraction of branches that are unconditional (direct or indirect) jumps", - "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches", - "MetricName": "Jump" + "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss", + "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES", + "MetricGroup": "Mem;MemoryBW;MemoryBound", + "MetricName": "tma_info_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)" }, { "BriefDescription": "Fraction of branches of other types (not individually covered by other metrics in Info.Branches group)", - "MetricExpr": "1 - (Cond_NT + Cond_TK + CallRet + Jump)", + "MetricExpr": "1 - (tma_info_cond_nt + tma_info_cond_tk + tma_info_callret + tma_info_jump)", "MetricGroup": "Bad;Branches", - "MetricName": "Other_Branches" - }, - { - "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", - "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_RETIRED.L1_MISS + MEM_LOAD_RETIRED.FB_HIT)", - "MetricGroup": "Mem;MemoryBound;MemoryLat", - "MetricName": "Load_Miss_Real_Latency" + "MetricName": "tma_info_other_branches" }, { - "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)", - "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES", - "MetricGroup": "Mem;MemoryBW;MemoryBound", - "MetricName": "MLP" + "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", + "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_PENDING + DTLB_STORE_MISSES.WALK_PENDING) / (2 * tma_info_core_clks)", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_page_walks_utilization", + "MetricThreshold": "tma_info_page_walks_utilization > 0.5" }, { - "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L1MPKI" + "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0", + "MetricExpr": "CORE_POWER.LVL0_TURBO_LICENSE / tma_info_core_clks", + "MetricGroup": "Power", + "MetricName": "tma_info_power_license0_utilization", + "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0. This includes non-AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes." }, { - "BriefDescription": "L1 cache true misses per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L1MPKI_Load" + "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1", + "MetricExpr": "CORE_POWER.LVL1_TURBO_LICENSE / tma_info_core_clks", + "MetricGroup": "Power", + "MetricName": "tma_info_power_license1_utilization", + "MetricThreshold": "tma_info_power_license1_utilization > 0.5", + "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1. This includes high current AVX 256-bit instructions as well as low current AVX 512-bit instructions." }, { - "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", - "MetricGroup": "Backend;CacheMisses;Mem", - "MetricName": "L2MPKI" + "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX)", + "MetricExpr": "CORE_POWER.LVL2_TURBO_LICENSE / tma_info_core_clks", + "MetricGroup": "Power", + "MetricName": "tma_info_power_license2_utilization", + "MetricThreshold": "tma_info_power_license2_utilization > 0.5", + "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX). This includes high current AVX 512-bit instructions." }, { - "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all request types (including speculative)", - "MetricExpr": "1e3 * (OFFCORE_REQUESTS.ALL_DATA_RD - OFFCORE_REQUESTS.DEMAND_DATA_RD + L2_RQSTS.ALL_DEMAND_MISS + L2_RQSTS.SWPF_MISS) / Instructions", - "MetricGroup": "CacheMisses;Mem;Offcore", - "MetricName": "L2MPKI_All" + "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", + "MetricExpr": "tma_retiring * tma_info_slots / cpu@UOPS_RETIRED.SLOTS\\,cmask\\=1@", + "MetricGroup": "Pipeline;Ret", + "MetricName": "tma_info_retire" }, { - "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2MPKI_Load" + "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", + "MetricExpr": "TOPDOWN.SLOTS", + "MetricGroup": "TmaL1;tma_L1_group", + "MetricName": "tma_info_slots" }, { - "BriefDescription": "L2 cache hits per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2HPKI_Load" + "BriefDescription": "Fraction of Physical Core issue-slots utilized by this Logical Processor", + "MetricExpr": "(tma_info_slots / (TOPDOWN.SLOTS / 2) if #SMT_on else 1)", + "MetricGroup": "SMT;TmaL1;tma_L1_group", + "MetricName": "tma_info_slots_utilization" }, { - "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L3MPKI" + "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", + "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_DISTRIBUTED if #SMT_on else 0)", + "MetricGroup": "SMT", + "MetricName": "tma_info_smt_2t_utilization" }, { - "BriefDescription": "Fill Buffer (FB) hits per kilo instructions for retired demand loads (L1D misses that merge into ongoing miss-handling entries)", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "FB_HPKI" + "BriefDescription": "Socket actual clocks when any core is active on that socket", + "MetricExpr": "UNC_CLOCK.SOCKET", + "MetricGroup": "SoC", + "MetricName": "tma_info_socket_clks" }, { - "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", - "MetricConstraint": "NO_NMI_WATCHDOG", - "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_PENDING + DTLB_STORE_MISSES.WALK_PENDING) / (2 * CORE_CLKS)", + "BriefDescription": "STLB (2nd level TLB) data store speculative misses per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricGroup": "Mem;MemoryTLB", - "MetricName": "Page_Walks_Utilization" + "MetricName": "tma_info_store_stlb_mpki" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW" + "BriefDescription": "Average Frequency Utilization relative nominal frequency", + "MetricExpr": "tma_info_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_turbo_utilization" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW" + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "tma_retiring * tma_info_slots / INST_RETIRED.ANY", + "MetricGroup": "Pipeline;Ret;Retire", + "MetricName": "tma_info_uoppi", + "MetricThreshold": "tma_info_uoppi > 1.05" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "tma_retiring * tma_info_slots / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW", + "MetricName": "tma_info_uptb", + "MetricThreshold": "tma_info_uptb < 7.5" }, { - "BriefDescription": "Average per-core data access bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "L3_Cache_Access_BW" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses", + "MetricExpr": "ICACHE_64B.IFTAG_STALL / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_itlb_misses", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "L1D_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW_1T" + "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", + "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_MISS) / tma_info_clks, 0)", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group", + "MetricName": "tma_l1_bound", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB_HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "L2_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW_1T" + "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + L1D_PEND_MISS.FB_FULL_PERIODS) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_clks)", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l2_bound", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "L3_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW_1T" + "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", + "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STALLS_L3_MISS) / tma_info_clks", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l3_bound", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "L3_Cache_Access_BW", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "L3_Cache_Access_BW_1T" + "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", + "MetricExpr": "9 * tma_info_average_frequency * MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group", + "MetricName": "tma_l3_hit_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tma_info_memory_latency, tma_mem_latency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average CPU Utilization", - "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", - "MetricGroup": "HPC;Summary", - "MetricName": "CPU_Utilization" + "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", + "MetricExpr": "ILD_STALL.LCP / tma_info_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_lcp", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_dsb_misses, tma_info_iptb", + "ScaleUnit": "100%" }, { - "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", - "MetricExpr": "Turbo_Utilization * TSC / 1e9 / duration_time", - "MetricGroup": "Power;Summary", - "MetricName": "Average_Frequency" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", + "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_light_operations", + "MetricThreshold": "tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", + "ScaleUnit": "100%" }, { - "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE) + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1e9 / duration_time", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "GFLOPs", - "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine." + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations", + "MetricExpr": "UOPS_DISPATCHED.PORT_2_3 / (2 * tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_load_op_utilization", + "MetricThreshold": "tma_load_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations. Sample with: UOPS_DISPATCHED.PORT_2_3", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average Frequency Utilization relative nominal frequency", - "MetricExpr": "CLKS / CPU_CLK_UNHALTED.REF_TSC", - "MetricGroup": "Power", - "MetricName": "Turbo_Utilization" + "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) DTLB was missed by load accesses, that later on hit in second-level TLB (STLB)", + "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group", + "MetricName": "tma_load_stlb_hit", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0", - "MetricExpr": "CORE_POWER.LVL0_TURBO_LICENSE / CORE_CLKS", - "MetricGroup": "Power", - "MetricName": "Power_License0_Utilization", - "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0. This includes non-AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes." + "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by load accesses, performing a hardware page walk", + "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group", + "MetricName": "tma_load_stlb_miss", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1", - "MetricExpr": "CORE_POWER.LVL1_TURBO_LICENSE / CORE_CLKS", - "MetricGroup": "Power", - "MetricName": "Power_License1_Utilization", - "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1. This includes high current AVX 256-bit instructions as well as low current AVX 512-bit instructions." + "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS.ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10 * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO))) / tma_info_clks", + "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group", + "MetricName": "tma_lock_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOADS_PS. Related metrics: tma_store_latency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX)", - "MetricExpr": "CORE_POWER.LVL2_TURBO_LICENSE / CORE_CLKS", - "MetricGroup": "Power", - "MetricName": "Power_License2_Utilization", - "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX). This includes high current AVX 512-bit instructions." + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to LSD (Loop Stream Detector) unit", + "MetricExpr": "(LSD.CYCLES_ACTIVE - LSD.CYCLES_OK) / tma_info_core_clks / 2", + "MetricGroup": "FetchBW;LSD;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_lsd", + "MetricThreshold": "tma_lsd > 0.15 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 5 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to LSD (Loop Stream Detector) unit. LSD typically does well sustaining Uop supply. However; in some rare cases; optimal uop-delivery could not be reached for small loops whose size (in terms of number of uops) does not suit well the LSD structure.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", - "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_DISTRIBUTED if #SMT_on else 0)", - "MetricGroup": "SMT", - "MetricName": "SMT_2T_Utilization" + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", + "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)", + "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "OS", - "MetricName": "Kernel_Utilization" + "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=4@) / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW", + "MetricName": "tma_mem_bandwidth", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_info_memory_bandwidth, tma_sq_full", + "ScaleUnit": "100%" }, { - "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", - "MetricGroup": "OS", - "MetricName": "Kernel_CPI" + "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_clks - tma_mem_bandwidth", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat", + "MetricName": "tma_mem_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_info_memory_latency, tma_l3_hit_latency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", - "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_REQUESTS.ALL) / 1e6 / duration_time / 1e3", - "MetricGroup": "HPC;Mem;MemoryBW;SoC", - "MetricName": "DRAM_BW_Use" + "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", + "MetricExpr": "(CYCLE_ACTIVITY.STALLS_MEM_ANY + EXE_ACTIVITY.BOUND_ON_STORES) / (CYCLE_ACTIVITY.STALLS_TOTAL + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL) + EXE_ACTIVITY.BOUND_ON_STORES) * tma_backend_bound", + "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", + "ScaleUnit": "100%" }, { - "BriefDescription": "Socket actual clocks when any core is active on that socket", - "MetricExpr": "UNC_CLOCK.SOCKET", - "MetricGroup": "SoC", - "MetricName": "Socket_CLKS" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations -- uops for memory load or store accesses.", + "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_RETIRED.ANY", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_memory_operations", + "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operations > 0.6", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", - "MetricGroup": "Branches;OS", - "MetricName": "IpFarBranch" + "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", + "MetricExpr": "tma_retiring * tma_info_slots / UOPS_ISSUED.ANY * IDQ.MS_UOPS / tma_info_slots", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS", + "MetricName": "tma_microcode_sequencer", + "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches", + "ScaleUnit": "100%" }, { - "BriefDescription": "Uncore frequency per die [GHZ]", - "MetricExpr": "Socket_CLKS / #num_dies / duration_time / 1e9", - "MetricGroup": "SoC", - "MetricName": "UNCORE_FREQ" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage", + "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_clks", + "MetricGroup": "BadSpec;BrMispredicts;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueBM", + "MetricName": "tma_mispredicts_resteers", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_branch_mispredicts, tma_info_branch_misprediction_cost, tma_info_mispredictions", + "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per core", - "MetricExpr": "cstate_core@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Core_Residency", + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", + "MetricExpr": "(IDQ.MITE_CYCLES_ANY - IDQ.MITE_CYCLES_OK) / tma_info_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_mite", + "MetricThreshold": "tma_mite > 0.1 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 5 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sample with: FRONTEND_RETIRED.ANY_DSB_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "C7 residency percent per core", - "MetricExpr": "cstate_core@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Core_Residency", + "BriefDescription": "This metric represents fraction of cycles where (only) 4 uops were delivered by the MITE pipeline", + "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=4@ - cpu@IDQ.MITE_UOPS\\,cmask\\=5@) / tma_info_clks", + "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_mite_group", + "MetricName": "tma_mite_4wide", + "MetricThreshold": "tma_mite_4wide > 0.05 & (tma_mite > 0.1 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 5 > 0.35))", "ScaleUnit": "100%" }, { - "BriefDescription": "C2 residency percent per package", - "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C2_Pkg_Residency", + "BriefDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued", + "MetricExpr": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY", + "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utilized_0_group", + "MetricName": "tma_mixing_vectors", + "MetricThreshold": "tma_mixing_vectors > 0.05", + "PublicDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued. Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic. Related metrics: tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "C3 residency percent per package", - "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Pkg_Residency", + "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", + "MetricExpr": "3 * IDQ.MS_SWITCHES / tma_info_clks", + "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per package", - "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Pkg_Residency", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_retiring * tma_info_slots)", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_nop_instructions", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions. Compilers often use NOPs for certain address alignments - e.g. start address of a function or loop body. Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, { - "BriefDescription": "C7 residency percent per package", - "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Pkg_Residency", + "BriefDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes", + "MetricExpr": "max(0, tma_light_operations - (tma_fp_arith + tma_memory_operations + tma_branch_instructions + tma_nop_instructions))", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_other_light_ops", + "MetricThreshold": "tma_other_light_ops > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes. May undercount due to FMA double counting", "ScaleUnit": "100%" }, { - "BriefDescription": "C8 residency percent per package", - "MetricExpr": "cstate_pkg@c8\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C8_Pkg_Residency", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch)", + "MetricExpr": "UOPS_DISPATCHED.PORT_0 / tma_info_core_clks", + "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_0", + "MetricThreshold": "tma_port_0 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "C9 residency percent per package", - "MetricExpr": "cstate_pkg@c9\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C9_Pkg_Residency", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU)", + "MetricExpr": "UOPS_DISPATCHED.PORT_1 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_1", + "MetricThreshold": "tma_port_1 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "C10 residency percent per package", - "MetricExpr": "cstate_pkg@c10\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C10_Pkg_Residency", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU)", + "MetricExpr": "UOPS_DISPATCHED.PORT_5 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_5", + "MetricThreshold": "tma_port_5 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU)", + "MetricExpr": "UOPS_DISPATCHED.PORT_6 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_6", + "MetricThreshold": "tma_port_6 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU). Sample with: UOPS_DISPATCHED.PORT_6. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", + "MetricExpr": "((cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=0x80@ + tma_serializing_operation * (CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY) + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / tma_info_clks if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_clks)", + "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_ports_utilization", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=0x80@ / tma_info_clks + tma_serializing_operation * (CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY) / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_0", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_1", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related metrics: tma_l1_bound", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Sample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_3m", + "MetricThreshold": "tma_ports_utilized_3m > 0.7 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Sample with: UOPS_EXECUTED.CYCLES_GE_3", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1", + "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations", + "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL5;tma_L5_group;tma_issueSO;tma_ports_utilized_0_group", + "MetricName": "tma_serializing_operation", + "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations. Instructions like CPUID; WRMSR or LFENCE serialize the out-of-order execution which may limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metrics: tma_ms_switches", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions", + "MetricExpr": "140 * MISC_RETIRED.PAUSE_INST / tma_info_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_serializing_operation_group", + "MetricName": "tma_slow_pause", + "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_operation > 0.1 & (tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions. Sample with: MISC_RETIRED.PAUSE_INST", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary", + "MetricExpr": "tma_info_load_miss_real_latency * LD_BLOCKS.NO_SR / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_split_loads", + "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents rate of split store accesses", + "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group", + "MetricName": "tma_split_stores", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_4", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", + "MetricExpr": "L1D_PEND_MISS.L2_STALL / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group", + "MetricName": "tma_sq_full", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_info_memory_bandwidth, tma_mem_bandwidth", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", + "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES_PS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", + "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_store_fwd_blk", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", + "MetricExpr": "(L2_RQSTS.RFO_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_clks", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group", + "MetricName": "tma_store_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations", + "MetricExpr": "(UOPS_DISPATCHED.PORT_4_9 + UOPS_DISPATCHED.PORT_7_8) / (4 * tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_store_op_utilization", + "MetricThreshold": "tma_store_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations. Sample with: UOPS_DISPATCHED.PORT_7_8", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of cycles where the TLB was missed by store accesses, hitting in the second-level TLB (STLB)", + "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", + "MetricName": "tma_store_stlb_hit", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk", + "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_clks", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", + "MetricName": "tma_store_stlb_miss", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores", + "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueSmSt;tma_store_bound_group", + "MetricName": "tma_streaming_stores", + "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should Streaming stores be a bottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma_fb_full", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears", + "MetricExpr": "10 * BACLEARS.ANY / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group", + "MetricName": "tma_unknown_branches", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (e.g. first time the branch is fetched or hitting BTB capacity limit). Sample with: BACLEARS.ANY", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric serves as an approximation of legacy x87 usage", + "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD", + "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", + "MetricName": "tma_x87_use", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Percentage of cycles in aborted transactions.", + "MetricExpr": "max(cpu@cycles\\-t@ - cpu@cycles\\-ct@, 0) / cycles", + "MetricGroup": "transaction", + "MetricName": "tsx_aborted_cycles", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Number of cycles within a transaction divided by the number of elisions.", + "MetricExpr": "cpu@cycles\\-t@ / cpu@el\\-start@", + "MetricGroup": "transaction", + "MetricName": "tsx_cycles_per_elision", + "ScaleUnit": "1cycles / elision" + }, + { + "BriefDescription": "Number of cycles within a transaction divided by the number of transactions.", + "MetricExpr": "cpu@cycles\\-t@ / cpu@tx\\-start@", + "MetricGroup": "transaction", + "MetricName": "tsx_cycles_per_transaction", + "ScaleUnit": "1cycles / transaction" + }, + { + "BriefDescription": "Percentage of cycles within a transaction region.", + "MetricExpr": "cpu@cycles\\-t@ / cycles", + "MetricGroup": "transaction", + "MetricName": "tsx_transactional_cycles", "ScaleUnit": "100%" } ] diff --git a/tools/perf/pmu-events/arch/x86/icelake/pipeline.json b/tools/perf/pmu-events/arch/x86/icelake/pipeline.json index 3b31a842a0b1..154fee4b60fb 100644 --- a/tools/perf/pmu-events/arch/x86/icelake/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/icelake/pipeline.json @@ -158,6 +158,15 @@ "SampleAfterValue": "50021", "UMask": "0x20" }, + { + "BriefDescription": "This event counts the number of mispredicted ret instructions retired. Non PEBS", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.RET", + "PEBS": "1", + "PublicDescription": "This is a non-precise version (that is, does not use PEBS) of the event that counts mispredicted return instructions retired.", + "SampleAfterValue": "50021", + "UMask": "0x8" + }, { "BriefDescription": "Cycle counts are evenly distributed between active threads in the Core.", "EventCode": "0xec", @@ -375,6 +384,16 @@ "SampleAfterValue": "2000003", "UMask": "0x3" }, + { + "BriefDescription": "Clears speculative count", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0x0D", + "EventName": "INT_MISC.CLEARS_COUNT", + "PublicDescription": "Counts the number of speculative clears due to any type of branch misprediction or machine clears", + "SampleAfterValue": "500009", + "UMask": "0x1" + }, { "BriefDescription": "Counts cycles after recovery from a branch misprediction or machine clear till the first uop is issued from the resteered path.", "EventCode": "0x0d", @@ -747,7 +766,7 @@ "BriefDescription": "Uops inserted at issue-stage in order to preserve upper bits of vector registers.", "EventCode": "0x0e", "EventName": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH", - "PublicDescription": "Counts the number of Blend Uops issued by the Resource Allocation Table (RAT) to the reservation station (RS) in order to preserve upper bits of vector registers. Starting with the Skylake microarchitecture, these Blend uops are needed since every Intel SSE instruction executed in Dirty Upper State needs to preserve bits 128-255 of the destination register. For more information, refer to Mixing Intel AVX and Intel SSE Code section of the Optimization Guide.", + "PublicDescription": "Counts the number of Blend Uops issued by the Resource Allocation Table (RAT) to the reservation station (RS) in order to preserve upper bits of vector registers. Starting with the Skylake microarchitecture, these Blend uops are needed since every Intel SSE instruction executed in Dirty Upper State needs to preserve bits 128-255 of the destination register. For more information, refer to 'Mixing Intel AVX and Intel SSE Code' section of the Optimization Guide.", "SampleAfterValue": "100003", "UMask": "0x2" }, @@ -775,7 +794,7 @@ "EventCode": "0xc2", "EventName": "UOPS_RETIRED.TOTAL_CYCLES", "Invert": "1", - "PublicDescription": "Counts the number of cycles using always true condition (uops_ret &lt; 16) applied to non PEBS uops retired event.", + "PublicDescription": "Counts the number of cycles using always true condition (uops_ret < 16) applied to non PEBS uops retired event.", "SampleAfterValue": "1000003", "UMask": "0x2" } diff --git a/tools/perf/pmu-events/arch/x86/icelake/uncore-other.json b/tools/perf/pmu-events/arch/x86/icelake/uncore-other.json index f7aff8818f46..b27d95b2c857 100644 --- a/tools/perf/pmu-events/arch/x86/icelake/uncore-other.json +++ b/tools/perf/pmu-events/arch/x86/icelake/uncore-other.json @@ -7,6 +7,54 @@ "UMask": "0x1", "Unit": "ARB" }, + { + "BriefDescription": "Each cycle counts number of any coherent request at memory controller that were issued by any core. This event is not supported on ICL products but is supported on RKL products.", + "EventCode": "0x85", + "EventName": "UNC_ARB_DAT_OCCUPANCY.ALL", + "PerPkg": "1", + "UMask": "0x1", + "Unit": "ARB" + }, + { + "BriefDescription": "Each cycle counts number of coherent reads pending on data return from memory controller that were issued by any core. This event is not supported on ICL products but is supported on RKL products.", + "EventCode": "0x85", + "EventName": "UNC_ARB_DAT_OCCUPANCY.RD", + "PerPkg": "1", + "UMask": "0x2", + "Unit": "ARB" + }, + { + "BriefDescription": "Each cycle count number of 'valid' coherent Data Read entries . Such entry is defined as valid when it is allocated till deallocation. Doesn't include prefetches. This event is not supported on ICL products but is supported on RKL products.", + "EventCode": "0x80", + "EventName": "UNC_ARB_REQ_TRK_OCCUPANCY.DRD", + "PerPkg": "1", + "UMask": "0x2", + "Unit": "ARB" + }, + { + "BriefDescription": "Number of all coherent Data Read entries. Doesn't include prefetches", + "EventCode": "0x81", + "EventName": "UNC_ARB_REQ_TRK_REQUEST.DRD", + "PerPkg": "1", + "UMask": "0x2", + "Unit": "ARB" + }, + { + "BriefDescription": "Each cycle counts number of all outgoing valid entries in ReqTrk. Such entry is defined as valid from its allocation in ReqTrk till deallocation. Accounts for Coherent and non-coherent traffic. This event is not supported on ICL products but is supported on RKL products.", + "EventCode": "0x80", + "EventName": "UNC_ARB_TRK_OCCUPANCY.ALL", + "PerPkg": "1", + "UMask": "0x1", + "Unit": "ARB" + }, + { + "BriefDescription": "Each cycle count number of 'valid' coherent Data Read entries . Such entry is defined as valid when it is allocated till deallocation. Doesn't include prefetches. This event is not supported on ICL products but is supported on RKL products.", + "EventCode": "0x80", + "EventName": "UNC_ARB_TRK_OCCUPANCY.RD", + "PerPkg": "1", + "UMask": "0x2", + "Unit": "ARB" + }, { "BriefDescription": "Total number of all outgoing entries allocated. Accounts for Coherent and non-coherent traffic.", "EventCode": "0x81", @@ -15,6 +63,14 @@ "UMask": "0x1", "Unit": "ARB" }, + { + "BriefDescription": "Number of all coherent Data Read entries. Doesn't include prefetches. This event is not supported on ICL products but is supported on RKL products.", + "EventCode": "0x81", + "EventName": "UNC_ARB_TRK_REQUESTS.RD", + "PerPkg": "1", + "UMask": "0x2", + "Unit": "ARB" + }, { "BriefDescription": "UNC_CLOCK.SOCKET", "EventCode": "0xff", diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv index 1677ec22e2e3..b258702e0666 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -12,7 +12,7 @@ GenuineIntel-6-7A,v1.01,goldmontplus,core GenuineIntel-6-A[DE],v1.00,graniterapids,core GenuineIntel-6-(3C|45|46),v32,haswell,core GenuineIntel-6-3F,v26,haswellx,core -GenuineIntel-6-(7D|7E|A7),v1.15,icelake,core +GenuineIntel-6-(7D|7E|A7),v1.17,icelake,core GenuineIntel-6-6[AC],v1.17,icelakex,core GenuineIntel-6-3A,v23,ivybridge,core GenuineIntel-6-3E,v22,ivytown,core From patchwork Sun Feb 19 09:28:17 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59110 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp772744wrn; Sun, 19 Feb 2023 01:34:10 -0800 (PST) X-Google-Smtp-Source: AK7set8B/KnW2/jyHgTUtBM8sJUc5rp+YjzeDiJhZ0zOCnCSxcZink4MPz9v5GnXMx8r7jN4gDkx X-Received: by 2002:a05:6a21:7896:b0:bf:58d1:ce9a with SMTP id bf22-20020a056a21789600b000bf58d1ce9amr2980130pzc.25.1676799249838; Sun, 19 Feb 2023 01:34:09 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799249; cv=none; d=google.com; s=arc-20160816; b=zgLS94aNtDsC6U7tvHyDdPX26PWZSYAAedmS9MEJPtf17RcsB1FolLKVuYN3WWRF3a K6IC5P7BMPQEG1reMN3JsRIScpJq4RVejUwrTYK7STCxEGNxH9mQFPdEnuaUM7T48XlG P1YU2fBuGGmJVA0/pXxDoE+mAYYok7oNOETVQ6hLZVVE9GM2WINZGht4BchzApIEkUjE KgLNQjFU1jcBxk5QHlRJwnMG8j5owWiepYNIU0dNIKA+5p3zMSrM2YNQw1c0DC+DoD5z d8XprN8QEJ3szyBdSqNphR26rU+iVJ4G/6ZBnaRlK+NOQ4T7Pn9ZK8yvNyxaivc7Q04u /9sg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:from:subject :references:mime-version:message-id:in-reply-to:date:dkim-signature; bh=ZU8AvWHavtdwGreU7MuZtH0D9Juq/l3Nf0liB6jRXGM=; b=mFgFMxO2jRGmG9ZLileEpGvroE1LelsOSP7vc6PejZOkMcQkVagM458xzzc/6J1ycs uRluGhGQFrcVRgnmsTJ28hgP+MaQkEPEuauH127nDO9/FzjggVOlSAqW7IOSCH05ejG6 NWOkHpLRu+fwTR4GEYWEHpiuYDGjeyb2BP8uR6fG6iTpwbxA0NbKJDdatV6w8Htpbjw7 uVnkBas2m4J4h9tBrLHkeroqyboUv/cH0yykWw7YezkQcCxFkapgk11uD7jAxWnl2vte gSE4K9K5DFRSymkaxXB+cfHl0NcPJVXz1AgUBl612GatGVyHn3X7rfj3RSke6xUDuI0C uC1w== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=dzoLCtaM; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id t19-20020a63f353000000b004fba07a634dsi10746622pgj.66.2023.02.19.01.33.57; Sun, 19 Feb 2023 01:34:09 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=dzoLCtaM; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229946AbjBSJdC (ORCPT + 99 others); Sun, 19 Feb 2023 04:33:02 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55120 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229942AbjBSJcu (ORCPT ); Sun, 19 Feb 2023 04:32:50 -0500 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6C286126DE for ; Sun, 19 Feb 2023 01:32:02 -0800 (PST) Received: by mail-yb1-xb4a.google.com with SMTP id 124-20020a250482000000b0090f2c84a6a4so2102467ybe.13 for ; Sun, 19 Feb 2023 01:32:02 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:from:to:cc:subject:date :message-id:reply-to; bh=ZU8AvWHavtdwGreU7MuZtH0D9Juq/l3Nf0liB6jRXGM=; b=dzoLCtaMP4WQl6//af8RTLdgDO2fnUVCNgB4cFfXsDCkaptPRsq4wDaY6Lkad+jnfJ hghqUg56dTjY2d2AOm2g4lgfwaIdSVAM0yb2jQp93ouGsXvdxXsXJOwQEdzCxfK8a/gG 5U0mcWOWuXsQM7NpjQ+VwERt5NYwdUhw9gULI4v7Qj+BnsEhsjS3Z0b6oXpAwaOdUp+Z K+cs9r9gXnkHPSyruE8vRWRFgGDIjQYybIKB0lKm4U9hGks5seOUwyR9riBAicUPApNM OAp0Ne0nWV6Bd3M/0+iJNMXgHypbW/kLZ4viJ23duDDimI3YcHzzA8IRliMxe7VVD73T ZqfQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=ZU8AvWHavtdwGreU7MuZtH0D9Juq/l3Nf0liB6jRXGM=; b=4ZO7CgOtMVRdc7Haf+loyZmhglyWGPUXTaecSzeNGlLUGWQtv4vqmiyflBl8cfivnN p0NrldicbJh1og/64DDmUv8cIHeLM42BdPQb+NpN4lYDIkGo3groPmhb/gB1jmcLIRQe 5rGdu2Nf/6bOVigJCKsDkhIKEAqQgWkPxzwh03wYgSNWqFWZ2/J+xh15z3ECT/zuIZ4p ZDzsFxAkGcqJaJZRWdDZAMXRGsm4quousDvTuHmy+ZUMyr2M6MHf46Mo09b5qXF7lFh7 5IpwCcBD5pqSXRnAvxHbAJyME0K0pxLzT35rs48dzdTexyrHttRpkzMF8UHouLTpQkfG cORA== X-Gm-Message-State: AO0yUKVPd9g9KG0Sj8nwj4fSs/rsH6SZ7/Wi5Aa9D1qOpo34/8jRCXEJ ZkRXqNc3JgUPJkoV16CeRiKVfk/CFGYx X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a05:6902:13c6:b0:855:fdcb:446d with SMTP id y6-20020a05690213c600b00855fdcb446dmr187894ybu.6.1676799106181; Sun, 19 Feb 2023 01:31:46 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:17 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-21-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 20/51] perf vendor events intel: Refresh icelakex metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758251450271817620?= X-GMAIL-MSGID: =?utf-8?q?1758251450271817620?= Update the icelakex metrics to TMA version 4.5. Generation was done using https://github.com/intel/perfmon. Notable changes are TMA info metrics are renamed from their node name to be lower case and prefixed by tma_info_, MetricThreshold expressions are added, "Sample with" documentation is added to many TMA metrics, smi_cost and transaction metric groups are added replicating existing hard coded metrics in stat-shadow. Signed-off-by: Ian Rogers --- .../arch/x86/icelakex/icx-metrics.json | 2153 +++++++++-------- .../arch/x86/icelakex/uncore-memory.json | 2 +- .../arch/x86/icelakex/uncore-other.json | 4 +- tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- 4 files changed, 1100 insertions(+), 1061 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json index 22b2a97d0ff8..8109088a4df7 100644 --- a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json +++ b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json @@ -1,1529 +1,1568 @@ [ { - "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks", - "MetricExpr": "100 * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", - "MetricGroup": "Bad;BadSpec;BrMispredicts", - "MetricName": "Mispredictions" - }, - { - "BriefDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks", - "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full))) + tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk))", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "Memory_Bandwidth" - }, - { - "BriefDescription": "Total pipeline cost of Memory Latency related bottlenecks (external memory and off-core caches)", - "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound))", - "MetricGroup": "Mem;MemoryLat;Offcore", - "MetricName": "Memory_Latency" - }, - { - "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)", - "MetricExpr": "100 * tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming_stores)))", - "MetricGroup": "Mem;MemoryTLB;Offcore", - "MetricName": "Memory_Data_TLBs" - }, - { - "BriefDescription": "Total pipeline cost of branch related instructions (used for program control-flow including function calls)", - "MetricExpr": "100 * ((BR_INST_RETIRED.COND + 3 * BR_INST_RETIRED.NEAR_CALL + (BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL)) / SLOTS)", - "MetricGroup": "Ret", - "MetricName": "Branching_Overhead" - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", - "MetricGroup": "BigFoot;Fed;Frontend;IcMiss;MemoryTLB", - "MetricName": "Big_Code" - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_frontend_bound - tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) - Big_Code", - "MetricGroup": "Fed;FetchBW;Frontend", - "MetricName": "Instruction_Fetch_BW" - }, - { - "BriefDescription": "Instructions Per Cycle (per Logical Processor)", - "MetricExpr": "INST_RETIRED.ANY / CLKS", - "MetricGroup": "Ret;Summary", - "MetricName": "IPC" - }, - { - "BriefDescription": "Uops Per Instruction", - "MetricExpr": "tma_retiring * SLOTS / INST_RETIRED.ANY", - "MetricGroup": "Pipeline;Ret;Retire", - "MetricName": "UPI" - }, - { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "tma_retiring * SLOTS / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW", - "MetricName": "UpTB" - }, - { - "BriefDescription": "Cycles Per Instruction (per Logical Processor)", - "MetricExpr": "1 / IPC", - "MetricGroup": "Mem;Pipeline", - "MetricName": "CPI" - }, - { - "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "Pipeline", - "MetricName": "CLKS" - }, - { - "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "TOPDOWN.SLOTS", - "MetricGroup": "tma_L1_group", - "MetricName": "SLOTS" - }, - { - "BriefDescription": "Fraction of Physical Core issue-slots utilized by this Logical Processor", - "MetricExpr": "(SLOTS / (TOPDOWN.SLOTS / 2) if #SMT_on else 1)", - "MetricGroup": "SMT;tma_L1_group", - "MetricName": "Slots_Utilization" - }, - { - "BriefDescription": "The ratio of Executed- by Issued-Uops", - "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", - "MetricGroup": "Cor;Pipeline", - "MetricName": "Execute_per_Issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage." - }, - { - "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", - "MetricExpr": "INST_RETIRED.ANY / CORE_CLKS", - "MetricGroup": "Ret;SMT;tma_L1_group", - "MetricName": "CoreIPC" - }, - { - "BriefDescription": "Floating Point Operations Per Cycle", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE) + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / CORE_CLKS", - "MetricGroup": "Flops;Ret", - "MetricName": "FLOPc" - }, - { - "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)) / (2 * CORE_CLKS)", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "FP_Arith_Utilization", - "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)." - }, - { - "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else UOPS_EXECUTED.CORE_CYCLES_GE_1)", - "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", - "MetricName": "ILP" - }, - { - "BriefDescription": "Probability of Core Bound bottleneck hidden by SMT-profiling artifacts", - "MetricExpr": "((1 - tma_core_bound / tma_ports_utilization if tma_core_bound < tma_ports_utilization else 1) if SMT_2T_Utilization > 0.5 else 0)", - "MetricGroup": "Cor;SMT", - "MetricName": "Core_Bound_Likely" - }, - { - "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", - "MetricExpr": "CPU_CLK_UNHALTED.DISTRIBUTED", - "MetricGroup": "SMT", - "MetricName": "CORE_CLKS" - }, - { - "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", - "MetricGroup": "InsType", - "MetricName": "IpLoad" - }, - { - "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", - "MetricGroup": "InsType", - "MetricName": "IpStore" - }, - { - "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Branches;Fed;InsType", - "MetricName": "IpBranch" - }, - { - "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "IpCall" + "BriefDescription": "C1 residency percent per core", + "MetricExpr": "cstate_core@c1\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C1_Core_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO", - "MetricName": "IpTB" + "BriefDescription": "C2 residency percent per package", + "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C2_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Branch instructions per taken branch. ", - "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "BpTkBranch" + "BriefDescription": "C6 residency percent per core", + "MetricExpr": "cstate_core@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Core_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE) + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)", - "MetricGroup": "Flops;InsType", - "MetricName": "IpFLOP" + "BriefDescription": "C6 residency percent per package", + "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE))", - "MetricGroup": "Flops;InsType", - "MetricName": "IpArith", - "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). May undercount due to FMA double counting. Approximated prior to BDW." + "BriefDescription": "Uncore frequency per die [GHZ]", + "MetricExpr": "tma_info_socket_clks / #num_dies / duration_time / 1e9", + "MetricGroup": "SoC", + "MetricName": "UNCORE_FREQ" }, { - "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE", - "MetricGroup": "Flops;FpScalar;InsType", - "MetricName": "IpArith_Scalar_SP", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "Percentage of cycles spent in System Management Interrupts.", + "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)", + "MetricGroup": "smi", + "MetricName": "smi_cycles", + "MetricThreshold": "smi_cycles > 0.1", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE", - "MetricGroup": "Flops;FpScalar;InsType", - "MetricName": "IpArith_Scalar_DP", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "Number of SMI interrupts.", + "MetricExpr": "msr@smi@", + "MetricGroup": "smi", + "MetricName": "smi_num", + "ScaleUnit": "1SMI#" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX128", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset", + "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_4k_aliasing", + "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX256", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", + "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + UOPS_DISPATCHED.PORT_5 + UOPS_DISPATCHED.PORT_6) / (4 * tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_alu_op_utilization", + "MetricThreshold": "tma_alu_op_utilization > 0.6", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX512", - "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", + "MetricExpr": "100 * ASSISTS.ANY / tma_info_slots", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_assists", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: ASSISTS.ANY", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Software prefetch instruction (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umask\\=0xF@", - "MetricGroup": "Prefetches", - "MetricName": "IpSWPF" + "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", + "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_backend_bound", + "MetricThreshold": "tma_backend_bound > 0.2", + "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Total number of retired Instructions Sample with: INST_RETIRED.PREC_DIST", - "MetricExpr": "INST_RETIRED.ANY", - "MetricGroup": "Summary;tma_L1_group", - "MetricName": "Instructions" + "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", + "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_bad_speculation", + "MetricThreshold": "tma_bad_speculation > 0.15", + "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", - "MetricExpr": "tma_retiring * SLOTS / cpu@UOPS_RETIRED.SLOTS\\,cmask\\=1@", - "MetricGroup": "Pipeline;Ret", - "MetricName": "Retire" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions.", + "MetricExpr": "tma_light_operations * BR_INST_RETIRED.ALL_BRANCHES / (tma_retiring * tma_info_slots)", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_branch_instructions", + "MetricThreshold": "tma_branch_instructions > 0.1 & tma_light_operations > 0.6", + "ScaleUnit": "100%" }, { - "BriefDescription": "", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@", - "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", - "MetricName": "Execute" + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", + "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation", + "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM", + "MetricName": "tma_branch_mispredicts", + "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_info_branch_misprediction_cost, tma_info_mispredictions, tma_mispredicts_resteers", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average number of Uops issued by front-end when it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=1@", - "MetricGroup": "Fed;FetchBW", - "MetricName": "Fetch_UpC" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", + "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_clks + tma_unknown_branches", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_branch_resteers", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", - "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", - "MetricGroup": "DSB;Fed;FetchBW", - "MetricName": "DSB_Coverage" + "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", + "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_cisc", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details.", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=1\\,edge@", - "MetricGroup": "DSBmiss", - "MetricName": "DSB_Switch_Cost" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears", + "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_clks", + "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC", + "MetricName": "tma_clears_resteers", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches", + "ScaleUnit": "100%" }, { - "BriefDescription": "Total penalty related to DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck.", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tma_mite))", - "MetricGroup": "DSBmiss;Fed", - "MetricName": "DSB_Misses" + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(44 * tma_info_average_frequency * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + 43.5 * tma_info_average_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_contested_accesses", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "Number of Instructions per non-speculative DSB miss (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", - "MetricGroup": "DSBmiss;Fed", - "MetricName": "IpDSB_Miss_Ret" + "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", + "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)", + "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", + "ScaleUnit": "100%" }, { - "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BadSpec;BrMispredicts", - "MetricName": "IpMispredict" + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "43.5 * tma_info_average_frequency * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_data_sharing", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)", - "MetricExpr": "(tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) * SLOTS / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BrMispredicts", - "MetricName": "Branch_Misprediction_Cost" + "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=2@) / tma_info_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0;tma_mite_group", + "MetricName": "tma_decoder0_alone", + "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 5 > 0.35))", + "PublicDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder. Related metrics: tma_few_uops_instructions", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of branches that are non-taken conditionals", - "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches;CodeGen;PGO", - "MetricName": "Cond_NT" + "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", + "MetricExpr": "ARITH.DIVIDER_ACTIVE / tma_info_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_divider", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_ACTIVE", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of branches that are taken conditionals", - "MetricExpr": "BR_INST_RETIRED.COND_TAKEN / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches;CodeGen;PGO", - "MetricName": "Cond_TK" + "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_clks + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_clks - tma_l2_bound - tma_pmm_bound", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of branches that are CALL or RET", - "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_RETURN) / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches", - "MetricName": "CallRet" + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", + "MetricExpr": "(IDQ.DSB_CYCLES_ANY - IDQ.DSB_CYCLES_OK) / tma_info_core_clks / 2", + "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_dsb", + "MetricThreshold": "tma_dsb > 0.15 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 5 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of branches that are unconditional (direct or indirect) jumps", - "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches", - "MetricName": "Jump" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_clks", + "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_dsb_switches", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_dsb_misses, tma_info_iptb, tma_lcp", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of branches of other types (not individually covered by other metrics in Info.Branches group)", - "MetricExpr": "1 - (Cond_NT + Cond_TK + CallRet + Jump)", - "MetricGroup": "Bad;Branches", - "MetricName": "Other_Branches" + "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", + "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group", + "MetricName": "tma_dtlb_load", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store, tma_info_memory_data_tlbs", + "ScaleUnit": "100%" }, { - "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", - "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_RETIRED.L1_MISS + MEM_LOAD_RETIRED.FB_HIT)", - "MetricGroup": "Mem;MemoryBound;MemoryLat", - "MetricName": "Load_Miss_Real_Latency" + "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", + "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group", + "MetricName": "tma_dtlb_store", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, tma_info_memory_data_tlbs", + "ScaleUnit": "100%" }, { - "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)", - "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES", - "MetricGroup": "Mem;MemoryBW;MemoryBound", - "MetricName": "MLP" + "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing", + "MetricExpr": "48 * tma_info_average_frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group", + "MetricName": "tma_false_sharing", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L1MPKI" + "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", + "MetricExpr": "L1D_PEND_MISS.FB_FULL / tma_info_clks", + "MetricGroup": "MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricName": "tma_fb_full", + "MetricThreshold": "tma_fb_full > 0.3", + "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_dram_bw_use, tma_info_memory_bandwidth, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", + "ScaleUnit": "100%" }, { - "BriefDescription": "L1 cache true misses per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L1MPKI_Load" + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", + "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)", + "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 5 > 0.35", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_dsb_coverage, tma_info_dsb_misses, tma_info_iptb, tma_lcp", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", - "MetricGroup": "Backend;CacheMisses;Mem", - "MetricName": "L2MPKI" + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", + "MetricExpr": "(5 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE - INT_MISC.UOP_DROPPING) / tma_info_slots", + "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all request types (including speculative)", - "MetricExpr": "1e3 * (OFFCORE_REQUESTS.ALL_DATA_RD - OFFCORE_REQUESTS.DEMAND_DATA_RD + L2_RQSTS.ALL_DEMAND_MISS + L2_RQSTS.SWPF_MISS) / Instructions", - "MetricGroup": "CacheMisses;Mem;Offcore", - "MetricName": "L2MPKI_All" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops", + "MetricExpr": "tma_heavy_operations - tma_microcode_sequencer", + "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueD0", + "MetricName": "tma_few_uops_instructions", + "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2MPKI_Load" + "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", + "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_fp_arith", + "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache hits per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2HPKI_Load" + "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired", + "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", + "MetricName": "tma_fp_scalar", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L3MPKI" + "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths", + "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xfc@ / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", + "MetricName": "tma_fp_vector", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fill Buffer (FB) hits per kilo instructions for retired demand loads (L1D misses that merge into ongoing miss-handling entries)", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "FB_HPKI" + "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", + "MetricName": "tma_fp_vector_128b", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", - "MetricConstraint": "NO_NMI_WATCHDOG", - "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_PENDING + DTLB_STORE_MISSES.WALK_PENDING) / (2 * CORE_CLKS)", - "MetricGroup": "Mem;MemoryTLB", - "MetricName": "Page_Walks_Utilization" + "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", + "MetricName": "tma_fp_vector_256b", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW" + "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", + "MetricName": "tma_fp_vector_512b", + "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW" + "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", + "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_slots", + "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences", + "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECODED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=1@) / IDQ.MITE_UOPS", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_heavy_operations", + "MetricThreshold": "tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data access bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "L3_Cache_Access_BW" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses", + "MetricExpr": "ICACHE_16B.IFDATA_STALL / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_icache_misses", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Rate of silent evictions from the L2 cache per Kilo instruction where the evicted lines are dropped (no writeback to L3 or memory)", - "MetricExpr": "1e3 * L2_LINES_OUT.SILENT / Instructions", - "MetricGroup": "L2Evicts;Mem;Server", - "MetricName": "L2_Evictions_Silent_PKI" + "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", + "MetricExpr": "tma_info_turbo_utilization * TSC / 1e9 / duration_time", + "MetricGroup": "Power;Summary", + "MetricName": "tma_info_average_frequency" }, { - "BriefDescription": "Rate of non silent evictions from the L2 cache per Kilo instruction", - "MetricExpr": "1e3 * L2_LINES_OUT.NON_SILENT / Instructions", - "MetricGroup": "L2Evicts;Mem;Server", - "MetricName": "L2_Evictions_NonSilent_PKI" + "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", + "MetricGroup": "BigFoot;Fed;Frontend;IcMiss;MemoryTLB;tma_issueBC", + "MetricName": "tma_info_big_code", + "MetricThreshold": "tma_info_big_code > 20", + "PublicDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses). Related metrics: tma_info_branching_overhead" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "L1D_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW_1T" + "BriefDescription": "Branch instructions per taken branch.", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_bptkbranch" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "L2_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW_1T" + "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)", + "MetricExpr": "(tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) * tma_info_slots / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_branch_misprediction_cost", + "PublicDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_branch_mispredicts, tma_info_mispredictions, tma_mispredicts_resteers" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "L3_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW_1T" + "BriefDescription": "Total pipeline cost of branch related instructions (used for program control-flow including function calls)", + "MetricExpr": "100 * ((BR_INST_RETIRED.COND + 3 * BR_INST_RETIRED.NEAR_CALL + (BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL)) / tma_info_slots)", + "MetricGroup": "Ret;tma_issueBC", + "MetricName": "tma_info_branching_overhead", + "MetricThreshold": "tma_info_branching_overhead > 10", + "PublicDescription": "Total pipeline cost of branch related instructions (used for program control-flow including function calls). Related metrics: tma_info_big_code" }, { - "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "L3_Cache_Access_BW", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "L3_Cache_Access_BW_1T" + "BriefDescription": "Fraction of branches that are CALL or RET", + "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_RETURN) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_callret" }, { - "BriefDescription": "Average CPU Utilization", - "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", - "MetricGroup": "HPC;Summary", - "MetricName": "CPU_Utilization" + "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "Pipeline", + "MetricName": "tma_info_clks" }, { - "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", - "MetricExpr": "Turbo_Utilization * TSC / 1e9 / duration_time", - "MetricGroup": "Power;Summary", - "MetricName": "Average_Frequency" + "BriefDescription": "STLB (2nd level TLB) code speculative misses per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", + "MetricGroup": "Fed;MemoryTLB", + "MetricName": "tma_info_code_stlb_mpki" }, { - "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE) + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1e9 / duration_time", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "GFLOPs", - "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine." + "BriefDescription": "Fraction of branches that are non-taken conditionals", + "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_cond_nt" }, { - "BriefDescription": "Average Frequency Utilization relative nominal frequency", - "MetricExpr": "CLKS / CPU_CLK_UNHALTED.REF_TSC", - "MetricGroup": "Power", - "MetricName": "Turbo_Utilization" + "BriefDescription": "Fraction of branches that are taken conditionals", + "MetricExpr": "BR_INST_RETIRED.COND_TAKEN / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_cond_tk" }, { - "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0", - "MetricExpr": "CORE_POWER.LVL0_TURBO_LICENSE / CORE_CLKS", - "MetricGroup": "Power", - "MetricName": "Power_License0_Utilization", - "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0. This includes non-AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes." + "BriefDescription": "Probability of Core Bound bottleneck hidden by SMT-profiling artifacts", + "MetricExpr": "(100 * (1 - tma_core_bound / tma_ports_utilization if tma_core_bound < tma_ports_utilization else 1) if tma_info_smt_2t_utilization > 0.5 else 0)", + "MetricGroup": "Cor;SMT", + "MetricName": "tma_info_core_bound_likely", + "MetricThreshold": "tma_info_core_bound_likely > 0.5" }, { - "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1", - "MetricExpr": "CORE_POWER.LVL1_TURBO_LICENSE / CORE_CLKS", - "MetricGroup": "Power", - "MetricName": "Power_License1_Utilization", - "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1. This includes high current AVX 256-bit instructions as well as low current AVX 512-bit instructions." + "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", + "MetricExpr": "CPU_CLK_UNHALTED.DISTRIBUTED", + "MetricGroup": "SMT", + "MetricName": "tma_info_core_clks" }, { - "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX)", - "MetricExpr": "CORE_POWER.LVL2_TURBO_LICENSE / CORE_CLKS", - "MetricGroup": "Power", - "MetricName": "Power_License2_Utilization", - "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX). This includes high current AVX 512-bit instructions." + "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_core_clks", + "MetricGroup": "Ret;SMT;TmaL1;tma_L1_group", + "MetricName": "tma_info_coreipc" }, { - "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", - "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_DISTRIBUTED if #SMT_on else 0)", - "MetricGroup": "SMT", - "MetricName": "SMT_2T_Utilization" + "BriefDescription": "Cycles Per Instruction (per Logical Processor)", + "MetricExpr": "1 / tma_info_ipc", + "MetricGroup": "Mem;Pipeline", + "MetricName": "tma_info_cpi" }, { - "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "OS", - "MetricName": "Kernel_Utilization" + "BriefDescription": "Average CPU Utilization", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": "HPC;Summary", + "MetricName": "tma_info_cpu_utilization" }, { - "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", - "MetricGroup": "OS", - "MetricName": "Kernel_CPI" + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD / OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_data_l2_mlp" }, { "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e9 / duration_time", - "MetricGroup": "HPC;Mem;MemoryBW;SoC", - "MetricName": "DRAM_BW_Use" + "MetricGroup": "HPC;Mem;MemoryBW;SoC;tma_issueBW", + "MetricName": "tma_info_dram_bw_use", + "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_memory_bandwidth, tma_mem_bandwidth, tma_sq_full" }, { - "BriefDescription": "Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_INSERTS.IA_MISS_DRD) / (Socket_CLKS / duration_time)", - "MetricGroup": "Mem;MemoryLat;SoC", - "MetricName": "MEM_Read_Latency" + "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_dsb_coverage", + "MetricThreshold": "tma_info_dsb_coverage < 0.7 & tma_info_ipc / 5 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_misses, tma_info_iptb, tma_lcp" }, { - "BriefDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches", - "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD@thresh\\=1@", - "MetricGroup": "Mem;MemoryBW;SoC", - "MetricName": "MEM_Parallel_Reads" + "BriefDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tma_mite))", + "MetricGroup": "DSBmiss;Fed;tma_issueFB", + "MetricName": "tma_info_dsb_misses", + "MetricThreshold": "tma_info_dsb_misses > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_iptb, tma_lcp" }, { - "BriefDescription": "Average latency of data read request to external 3D X-Point memory [in nanoseconds]. Accounts for demand loads and L1/L2 data-read prefetches", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_PMM / UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PMM) / cha_0@event\\=0x0@", - "MetricGroup": "Mem;MemoryLat;Server;SoC", - "MetricName": "MEM_PMM_Read_Latency" + "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details.", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=1\\,edge@", + "MetricGroup": "DSBmiss", + "MetricName": "tma_info_dsb_switch_cost" }, { - "BriefDescription": "Average latency of data read request to external DRAM memory [in nanoseconds]. Accounts for demand loads and L1/L2 data-read prefetches", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / UNC_CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / cha_0@event\\=0x0@", - "MetricGroup": "Mem;MemoryLat;Server;SoC", - "MetricName": "MEM_DRAM_Read_Latency" + "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-thread", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@", + "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", + "MetricName": "tma_info_execute" }, { - "BriefDescription": "Average 3DXP Memory Bandwidth Use for reads [GB / sec]", - "MetricExpr": "64 * UNC_M_PMM_RPQ_INSERTS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW;Server;SoC", - "MetricName": "PMM_Read_BW" + "BriefDescription": "The ratio of Executed- by Issued-Uops", + "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", + "MetricGroup": "Cor;Pipeline", + "MetricName": "tma_info_execute_per_issue", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage." }, { - "BriefDescription": "Average 3DXP Memory Bandwidth Use for Writes [GB / sec]", - "MetricExpr": "64 * UNC_M_PMM_WPQ_INSERTS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW;Server;SoC", - "MetricName": "PMM_Write_BW" + "BriefDescription": "Fill Buffer (FB) hits per kilo instructions for retired demand loads (L1D misses that merge into ongoing miss-handling entries)", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_fb_hpki" }, { - "BriefDescription": "Average IO (network or disk) Bandwidth Use for Writes [GB / sec]", - "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e9 / duration_time", - "MetricGroup": "IoBW;Mem;Server;SoC", - "MetricName": "IO_Write_BW" + "BriefDescription": "Average number of Uops issued by front-end when it issued something", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=1@", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_fetch_upc" }, { - "BriefDescription": "Average IO (network or disk) Bandwidth Use for Reads [GB / sec]", - "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_HIT_ITOM + UNC_CHA_TOR_INSERTS.IO_MISS_ITOM + UNC_CHA_TOR_INSERTS.IO_HIT_ITOMCACHENEAR + UNC_CHA_TOR_INSERTS.IO_MISS_ITOMCACHENEAR) * 64 / 1e9 / duration_time", - "MetricGroup": "IoBW;Mem;Server;SoC", - "MetricName": "IO_Read_BW" + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * cpu@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE\\,umask\\=0x18@ + 8 * cpu@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE\\,umask\\=0x60@ + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / tma_info_core_clks", + "MetricGroup": "Flops;Ret", + "MetricName": "tma_info_flopc" }, { - "BriefDescription": "Socket actual clocks when any core is active on that socket", - "MetricExpr": "cha_0@event\\=0x0@", - "MetricGroup": "SoC", - "MetricName": "Socket_CLKS" + "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xfc@) / (2 * tma_info_core_clks)", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_fp_arith_utilization", + "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)." }, { - "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", - "MetricGroup": "Branches;OS", - "MetricName": "IpFarBranch" + "BriefDescription": "Giga Floating Point Operations Per Second", + "MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * cpu@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE\\,umask\\=0x18@ + 8 * cpu@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE\\,umask\\=0x60@ + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1e9 / duration_time", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_gflops", + "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine." }, { - "BriefDescription": "Uncore frequency per die [GHZ]", - "MetricExpr": "Socket_CLKS / #num_dies / duration_time / 1e9", - "MetricGroup": "SoC", - "MetricName": "UNCORE_FREQ" + "BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", + "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", + "MetricName": "tma_info_ic_misses", + "MetricThreshold": "tma_info_ic_misses > 5", + "PublicDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck. Related metrics: " }, { - "BriefDescription": "Percentage of time spent in the active CPU power state C0", - "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", - "MetricName": "cpu_utilization", - "ScaleUnit": "100%" + "BriefDescription": "Average Latency for L1 instruction cache misses", + "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=1\\,edge@", + "MetricGroup": "Fed;FetchLat;IcMiss", + "MetricName": "tma_info_icache_miss_latency" + }, + { + "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else UOPS_EXECUTED.CORE_CYCLES_GE_1)", + "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", + "MetricName": "tma_info_ilp" }, { - "BriefDescription": "CPU operating frequency (in GHz)", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / duration_time", - "MetricName": "cpu_operating_frequency", - "ScaleUnit": "1GHz" + "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks", + "MetricExpr": "100 * (tma_frontend_bound - tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) - tma_info_big_code", + "MetricGroup": "Fed;FetchBW;Frontend", + "MetricName": "tma_info_instruction_fetch_bw", + "MetricThreshold": "tma_info_instruction_fetch_bw > 20" }, { - "BriefDescription": "Cycles per instruction retired; indicating how much time each executed instruction took; in units of cycles.", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY", - "MetricName": "cpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Summary;TmaL1;tma_L1_group", + "MetricName": "tma_info_instructions", + "PublicDescription": "Total number of retired Instructions. Sample with: INST_RETIRED.PREC_DIST" }, { - "BriefDescription": "The ratio of number of completed memory load instructions to the total number completed instructions", - "MetricExpr": "MEM_INST_RETIRED.ALL_LOADS / INST_RETIRED.ANY", - "MetricName": "loads_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Average IO (network or disk) Bandwidth Use for Reads [GB / sec]", + "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_HIT_ITOM + UNC_CHA_TOR_INSERTS.IO_MISS_ITOM + UNC_CHA_TOR_INSERTS.IO_HIT_ITOMCACHENEAR + UNC_CHA_TOR_INSERTS.IO_MISS_ITOMCACHENEAR) * 64 / 1e9 / duration_time", + "MetricGroup": "IoBW;Mem;Server;SoC", + "MetricName": "tma_info_io_read_bw" }, { - "BriefDescription": "The ratio of number of completed memory store instructions to the total number completed instructions", - "MetricExpr": "MEM_INST_RETIRED.ALL_STORES / INST_RETIRED.ANY", - "MetricName": "stores_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Average IO (network or disk) Bandwidth Use for Writes [GB / sec]", + "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e9 / duration_time", + "MetricGroup": "IoBW;Mem;Server;SoC", + "MetricName": "tma_info_io_write_bw" }, { - "BriefDescription": "Ratio of number of requests missing L1 data cache (includes data+rfo w/ prefetches) to the total number of completed instructions", - "MetricExpr": "L1D.REPLACEMENT / INST_RETIRED.ANY", - "MetricName": "l1d_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xfc@)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_iparith", + "MetricThreshold": "tma_info_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). May undercount due to FMA double counting. Approximated prior to BDW." }, { - "BriefDescription": "Ratio of number of demand load requests hitting in L1 data cache to the total number of completed instructions ", - "MetricExpr": "MEM_LOAD_RETIRED.L1_HIT / INST_RETIRED.ANY", - "MetricName": "l1d_demand_data_read_hits_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx128", + "MetricThreshold": "tma_info_iparith_avx128 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "Ratio of number of code read requests missing in L1 instruction cache (includes prefetches) to the total number of completed instructions", - "MetricExpr": "L2_RQSTS.ALL_CODE_RD / INST_RETIRED.ANY", - "MetricName": "l1_i_code_read_misses_with_prefetches_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx256", + "MetricThreshold": "tma_info_iparith_avx256 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "Ratio of number of completed demand load requests hitting in L2 cache to the total number of completed instructions ", - "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT / INST_RETIRED.ANY", - "MetricName": "l2_demand_data_read_hits_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx512", + "MetricThreshold": "tma_info_iparith_avx512 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "Ratio of number of requests missing L2 cache (includes code+data+rfo w/ prefetches) to the total number of completed instructions", - "MetricExpr": "L2_LINES_IN.ALL / INST_RETIRED.ANY", - "MetricName": "l2_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_iparith_scalar_dp", + "MetricThreshold": "tma_info_iparith_scalar_dp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "Ratio of number of completed data read request missing L2 cache to the total number of completed instructions", - "MetricExpr": "MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", - "MetricName": "l2_demand_data_read_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_iparith_scalar_sp", + "MetricThreshold": "tma_info_iparith_scalar_sp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "Ratio of number of code read request missing L2 cache to the total number of completed instructions", - "MetricExpr": "L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", - "MetricName": "l2_demand_code_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Branches;Fed;InsType", + "MetricName": "tma_info_ipbranch", + "MetricThreshold": "tma_info_ipbranch < 8" }, { - "BriefDescription": "Ratio of number of data read requests missing last level core cache (includes demand w/ prefetches) to the total number of completed instructions", - "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_LLCPREFDATA + UNC_CHA_TOR_INSERTS.IA_MISS_DRD + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF) / INST_RETIRED.ANY", - "MetricName": "llc_data_read_mpi_demand_plus_prefetch", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions Per Cycle (per Logical Processor)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_clks", + "MetricGroup": "Ret;Summary", + "MetricName": "tma_info_ipc" }, { - "BriefDescription": "Ratio of number of code read requests missing last level core cache (includes demand w/ prefetches) to the total number of completed instructions", - "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_CRD + UNC_CHA_TOR_INSERTS.IA_MISS_CRD_PREF) / INST_RETIRED.ANY", - "MetricName": "llc_code_read_mpi_demand_plus_prefetch", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_ipcall", + "MetricThreshold": "tma_info_ipcall < 200" }, { - "BriefDescription": "Average latency of a last level cache (LLC) demand data read miss (read memory access) in nano seconds", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_INSERTS.IA_MISS_DRD) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD) * #num_packages)) * duration_time", - "MetricName": "llc_demand_data_read_miss_latency", - "ScaleUnit": "1ns" + "BriefDescription": "Instructions per non-speculative DSB miss (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", + "MetricGroup": "DSBmiss;Fed", + "MetricName": "tma_info_ipdsb_miss_ret", + "MetricThreshold": "tma_info_ipdsb_miss_ret < 50" }, { - "BriefDescription": "Average latency of a last level cache (LLC) demand data read miss (read memory access) addressed to local memory in nano seconds", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_LOCAL / UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_LOCAL) * #num_packages)) * duration_time", - "MetricName": "llc_demand_data_read_miss_latency_for_local_requests", - "ScaleUnit": "1ns" + "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", + "MetricGroup": "Branches;OS", + "MetricName": "tma_info_ipfarbranch", + "MetricThreshold": "tma_info_ipfarbranch < 1e6" }, { - "BriefDescription": "Average latency of a last level cache (LLC) demand data read miss (read memory access) addressed to remote memory in nano seconds", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_REMOTE / UNC_CHA_TOR_INSERTS.IA_MISS_DRD_REMOTE) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_REMOTE) * #num_packages)) * duration_time", - "MetricName": "llc_demand_data_read_miss_latency_for_remote_requests", - "ScaleUnit": "1ns" + "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * cpu@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE\\,umask\\=0x18@ + 8 * cpu@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE\\,umask\\=0x60@ + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_ipflop", + "MetricThreshold": "tma_info_ipflop < 10" }, { - "BriefDescription": "Average latency of a last level cache (LLC) demand data read miss (read memory access) addressed to Intel(R) Optane(TM) Persistent Memory(PMEM) in nano seconds", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_PMM / UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PMM) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_PMM) * #num_packages)) * duration_time", - "MetricName": "llc_demand_data_read_miss_to_pmem_latency", - "ScaleUnit": "1ns" + "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipload", + "MetricThreshold": "tma_info_ipload < 3" }, { - "BriefDescription": "Average latency of a last level cache (LLC) demand data read miss (read memory access) addressed to DRAM in nano seconds", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / UNC_CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR) * #num_packages)) * duration_time", - "MetricName": "llc_demand_data_read_miss_to_dram_latency", - "ScaleUnit": "1ns" + "BriefDescription": "Instructions per retired mispredicts for conditional non-taken branches (lower number means higher occurrence rate).", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_cond_ntaken", + "MetricThreshold": "tma_info_ipmisp_cond_ntaken < 200" }, { - "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions", - "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", - "MetricName": "itlb_2nd_level_mpi", - "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the ITLB (Instruction TLB) and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per retired mispredicts for conditional taken branches (lower number means higher occurrence rate).", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_cond_taken", + "MetricThreshold": "tma_info_ipmisp_cond_taken < 200" }, { - "BriefDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions", - "MetricExpr": "ITLB_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY", - "MetricName": "itlb_2nd_level_large_page_mpi", - "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the Instruction Translation Lookaside Buffer (ITLB) and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_indirect", + "MetricThreshold": "tma_info_ipmisp_indirect < 1e3" }, { - "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions", - "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", - "MetricName": "dtlb_2nd_level_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per retired mispredicts for return branches (lower number means higher occurrence rate).", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_ret", + "MetricThreshold": "tma_info_ipmisp_ret < 500" }, { - "BriefDescription": "Ratio of number of completed page walks (for 2 megabyte page sizes) caused by demand data loads to the total number of completed instructions", - "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY", - "MetricName": "dtlb_2nd_level_2mb_large_page_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the Data Translation Lookaside Buffer (DTLB) and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BadSpec;BrMispredicts", + "MetricName": "tma_info_ipmispredict", + "MetricThreshold": "tma_info_ipmispredict < 200" }, { - "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions", - "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", - "MetricName": "dtlb_2nd_level_store_mpi", - "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipstore", + "MetricThreshold": "tma_info_ipstore < 8" }, { - "BriefDescription": "Memory read that miss the last level cache (LLC) addressed to local DRAM as a percentage of total memory read accesses, does not include LLC prefetches.", - "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_LOCAL) / (UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_REMOTE)", - "MetricName": "numa_reads_addressed_to_local_dram", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per Software prefetch instruction (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umask\\=0xF@", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_ipswpf", + "MetricThreshold": "tma_info_ipswpf < 100" }, { - "BriefDescription": "Memory reads that miss the last level cache (LLC) addressed to remote DRAM as a percentage of total memory read accesses, does not include LLC prefetches.", - "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_DRD_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_REMOTE) / (UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_REMOTE)", - "MetricName": "numa_reads_addressed_to_remote_dram", - "ScaleUnit": "100%" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", + "MetricName": "tma_info_iptb", + "MetricThreshold": "tma_info_iptb < 11", + "PublicDescription": "Instruction per taken branch. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_dsb_misses, tma_lcp" }, { - "BriefDescription": "Uncore operating frequency in GHz", - "MetricExpr": "UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_CLOCKTICKS) * #num_packages) / 1e9 / duration_time", - "MetricName": "uncore_frequency", - "ScaleUnit": "1GHz" + "BriefDescription": "Instructions per speculative Unknown Branch Misprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_instructions / BACLEARS.ANY", + "MetricGroup": "Fed", + "MetricName": "tma_info_ipunknown_branch" }, { - "BriefDescription": "Intel(R) Ultra Path Interconnect (UPI) data transmit bandwidth (MB/sec)", - "MetricExpr": "UNC_UPI_TxL_FLITS.ALL_DATA * 7.111111111111111 / 1e6 / duration_time", - "MetricName": "upi_data_transmit_bw", - "ScaleUnit": "1MB/s" + "BriefDescription": "Fraction of branches that are unconditional (direct or indirect) jumps", + "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_jump" }, { - "BriefDescription": "DDR memory read bandwidth (MB/sec)", - "MetricExpr": "UNC_M_CAS_COUNT.RD * 64 / 1e6 / duration_time", - "MetricName": "memory_bandwidth_read", - "ScaleUnit": "1MB/s" + "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_cpi" }, { - "BriefDescription": "DDR memory write bandwidth (MB/sec)", - "MetricExpr": "UNC_M_CAS_COUNT.WR * 64 / 1e6 / duration_time", - "MetricName": "memory_bandwidth_write", - "ScaleUnit": "1MB/s" + "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_utilization", + "MetricThreshold": "tma_info_kernel_utilization > 0.05" }, { - "BriefDescription": "DDR memory bandwidth (MB/sec)", - "MetricExpr": "(UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) * 64 / 1e6 / duration_time", - "MetricName": "memory_bandwidth_total", - "ScaleUnit": "1MB/s" + "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw" }, { - "BriefDescription": "Intel(R) Optane(TM) Persistent Memory(PMEM) memory read bandwidth (MB/sec)", - "MetricExpr": "UNC_M_PMM_RPQ_INSERTS * 64 / 1e6 / duration_time", - "MetricName": "pmem_memory_bandwidth_read", - "ScaleUnit": "1MB/s" + "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "tma_info_l1d_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw_1t" }, { - "BriefDescription": "Intel(R) Optane(TM) Persistent Memory(PMEM) memory write bandwidth (MB/sec)", - "MetricExpr": "UNC_M_PMM_WPQ_INSERTS * 64 / 1e6 / duration_time", - "MetricName": "pmem_memory_bandwidth_write", - "ScaleUnit": "1MB/s" + "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l1mpki" }, { - "BriefDescription": "Intel(R) Optane(TM) Persistent Memory(PMEM) memory bandwidth (MB/sec)", - "MetricExpr": "(UNC_M_PMM_RPQ_INSERTS + UNC_M_PMM_WPQ_INSERTS) * 64 / 1e6 / duration_time", - "MetricName": "pmem_memory_bandwidth_total", - "ScaleUnit": "1MB/s" + "BriefDescription": "L1 cache true misses per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l1mpki_load" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the CPU.", - "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_HIT_PCIRDCUR + UNC_CHA_TOR_INSERTS.IO_MISS_PCIRDCUR) * 64 / 1e6 / duration_time", - "MetricName": "io_bandwidth_disk_or_network_writes", - "ScaleUnit": "1MB/s" + "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the CPU.", - "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_HIT_ITOM + UNC_CHA_TOR_INSERTS.IO_MISS_ITOM + UNC_CHA_TOR_INSERTS.IO_HIT_ITOMCACHENEAR + UNC_CHA_TOR_INSERTS.IO_MISS_ITOMCACHENEAR) * 64 / 1e6 / duration_time", - "MetricName": "io_bandwidth_disk_or_network_reads", - "ScaleUnit": "1MB/s" + "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "tma_info_l2_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw_1t" }, { - "BriefDescription": "Uops delivered from decoded instruction cache (decoded stream buffer or DSB) as a percent of total uops delivered to Instruction Decode Queue", - "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS + LSD.UOPS)", - "MetricName": "percent_uops_delivered_from_decoded_icache", - "ScaleUnit": "100%" + "BriefDescription": "Rate of non silent evictions from the L2 cache per Kilo instruction", + "MetricExpr": "1e3 * L2_LINES_OUT.NON_SILENT / tma_info_instructions", + "MetricGroup": "L2Evicts;Mem;Server", + "MetricName": "tma_info_l2_evictions_nonsilent_pki" }, { - "BriefDescription": "Uops delivered from legacy decode pipeline (Micro-instruction Translation Engine or MITE) as a percent of total uops delivered to Instruction Decode Queue", - "MetricExpr": "IDQ.MITE_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS + LSD.UOPS)", - "MetricName": "percent_uops_delivered_from_legacy_decode_pipeline", - "ScaleUnit": "100%" + "BriefDescription": "Rate of silent evictions from the L2 cache per Kilo instruction where the evicted lines are dropped (no writeback to L3 or memory)", + "MetricExpr": "1e3 * L2_LINES_OUT.SILENT / tma_info_instructions", + "MetricGroup": "L2Evicts;Mem;Server", + "MetricName": "tma_info_l2_evictions_silent_pki" }, { - "BriefDescription": "Uops delivered from microcode sequencer (MS) as a percent of total uops delivered to Instruction Decode Queue", - "MetricExpr": "IDQ.MS_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS + LSD.UOPS)", - "MetricName": "percent_uops_delivered_from_microcode_sequencer", - "ScaleUnit": "100%" + "BriefDescription": "L2 cache hits per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2hpki_load" }, { - "BriefDescription": "Bandwidth (MB/sec) of read requests that miss the last level cache (LLC) and go to local memory.", - "MetricExpr": "UNC_CHA_REQUESTS.READS_LOCAL * 64 / 1e6 / duration_time", - "MetricName": "llc_miss_local_memory_bandwidth_read", - "ScaleUnit": "1MB/s" + "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheMisses;Mem", + "MetricName": "tma_info_l2mpki" }, { - "BriefDescription": "Bandwidth (MB/sec) of write requests that miss the last level cache (LLC) and go to local memory.", - "MetricExpr": "UNC_CHA_REQUESTS.WRITES_LOCAL * 64 / 1e6 / duration_time", - "MetricName": "llc_miss_local_memory_bandwidth_write", - "ScaleUnit": "1MB/s" + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all request types (including speculative)", + "MetricExpr": "1e3 * (OFFCORE_REQUESTS.ALL_DATA_RD - OFFCORE_REQUESTS.DEMAND_DATA_RD + L2_RQSTS.ALL_DEMAND_MISS + L2_RQSTS.SWPF_MISS) / tma_info_instructions", + "MetricGroup": "CacheMisses;Mem;Offcore", + "MetricName": "tma_info_l2mpki_all" }, { - "BriefDescription": "Bandwidth (MB/sec) of read requests that miss the last level cache (LLC) and go to remote memory.", - "MetricExpr": "UNC_CHA_REQUESTS.READS_REMOTE * 64 / 1e6 / duration_time", - "MetricName": "llc_miss_remote_memory_bandwidth_read", - "ScaleUnit": "1MB/s" + "BriefDescription": "L2 cache true code cacheline misses per kilo instruction", + "MetricExpr": "1e3 * FRONTEND_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_l2mpki_code" }, { - "BriefDescription": "Bandwidth (MB/sec) of write requests that miss the last level cache (LLC) and go to remote memory.", - "MetricExpr": "UNC_CHA_REQUESTS.WRITES_REMOTE * 64 / 1e6 / duration_time", - "MetricName": "llc_miss_remote_memory_bandwidth_write", - "ScaleUnit": "1MB/s" + "BriefDescription": "L2 cache speculative code cacheline misses per kilo instruction", + "MetricExpr": "1e3 * L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_l2mpki_code_all" }, { - "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", - "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / slots", - "MetricGroup": "PGO;TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_frontend_bound", - "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Machine_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound.", - "ScaleUnit": "100%" + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2mpki_load" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", - "MetricExpr": "(5 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE - INT_MISC.UOP_DROPPING) / slots", - "MetricGroup": "Frontend;TopdownL2;tma_L2_group;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_latency", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period.", - "ScaleUnit": "100%" + "BriefDescription": "Average per-core data access bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_l3_cache_access_bw" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses.", - "MetricExpr": "ICACHE_16B.IFDATA_STALL / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_icache_misses", - "ScaleUnit": "100%" + "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "tma_info_l3_cache_access_bw", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_l3_cache_access_bw_1t" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses.", - "MetricExpr": "ICACHE_64B.IFTAG_STALL / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_itlb_misses", - "ScaleUnit": "100%" + "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", - "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / CPU_CLK_UNHALTED.THREAD + tma_unknown_branches", - "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_branch_resteers", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings.", - "ScaleUnit": "100%" + "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "tma_info_l3_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw_1t" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. ", - "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL4;tma_L4_group;tma_branch_resteers_group", - "MetricName": "tma_mispredicts_resteers", - "ScaleUnit": "100%" + "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l3mpki" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. ", - "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group", - "MetricName": "tma_clears_resteers", - "ScaleUnit": "100%" + "BriefDescription": "Average Latency for L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_load_l2_miss_latency" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears", - "MetricExpr": "10 * BACLEARS.ANY / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BigFoot;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group", - "MetricName": "tma_unknown_branches", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (First fetch or hitting BPU capacity limit).", - "ScaleUnit": "100%" + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=1@", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_load_l2_mlp" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_dsb_switches", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty.", - "ScaleUnit": "100%" + "BriefDescription": "Average Latency for L3 cache miss demand Loads", + "MetricExpr": "cpu@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,umask\\=0x10@ / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_load_l3_miss_latency" }, { - "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", - "MetricExpr": "ILD_STALL.LCP / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_lcp", - "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs.", - "ScaleUnit": "100%" + "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", + "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_RETIRED.L1_MISS + MEM_LOAD_RETIRED.FB_HIT)", + "MetricGroup": "Mem;MemoryBound;MemoryLat", + "MetricName": "tma_info_load_miss_real_latency" }, { - "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", - "MetricExpr": "3 * IDQ.MS_SWITCHES / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_ms_switches", - "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals.", - "ScaleUnit": "100%" + "BriefDescription": "STLB (2nd level TLB) data load speculative misses per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_load_stlb_mpki" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", - "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)", - "MetricGroup": "FetchBW;Frontend;TopdownL2;tma_L2_group;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_bandwidth", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend.", - "ScaleUnit": "100%" + "BriefDescription": "Average latency of data read request to external DRAM memory [in nanoseconds]", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / UNC_CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / cha_0@event\\=0x0@", + "MetricGroup": "Mem;MemoryLat;Server;SoC", + "MetricName": "tma_info_mem_dram_read_latency", + "PublicDescription": "Average latency of data read request to external DRAM memory [in nanoseconds]. Accounts for demand loads and L1/L2 data-read prefetches" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", - "MetricExpr": "(IDQ.MITE_CYCLES_ANY - IDQ.MITE_CYCLES_OK) / CPU_CLK_UNHALTED.DISTRIBUTED / 2", - "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", - "MetricName": "tma_mite", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", - "ScaleUnit": "100%" + "BriefDescription": "Average number of parallel data read requests to external memory", + "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD@thresh\\=1@", + "MetricGroup": "Mem;MemoryBW;SoC", + "MetricName": "tma_info_mem_parallel_reads", + "PublicDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches" }, { - "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=0x1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=0x2@) / CPU_CLK_UNHALTED.DISTRIBUTED", - "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_mite_group", - "MetricName": "tma_decoder0_alone", - "ScaleUnit": "100%" + "BriefDescription": "Average latency of data read request to external 3D X-Point memory [in nanoseconds]", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_PMM / UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PMM) / cha_0@event\\=0x0@", + "MetricGroup": "Mem;MemoryLat;Server;SoC", + "MetricName": "tma_info_mem_pmm_read_latency", + "PublicDescription": "Average latency of data read request to external 3D X-Point memory [in nanoseconds]. Accounts for demand loads and L1/L2 data-read prefetches" }, { - "BriefDescription": "This metric represents fraction of cycles where (only) 4 uops were delivered by the MITE pipeline", - "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=0x4@ - cpu@IDQ.MITE_UOPS\\,cmask\\=0x5@) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_mite_group", - "MetricName": "tma_mite_4wide", - "ScaleUnit": "100%" + "BriefDescription": "Average latency of data read request to external memory (in nanoseconds)", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_INSERTS.IA_MISS_DRD) / (tma_info_socket_clks / duration_time)", + "MetricGroup": "Mem;MemoryLat;SoC", + "MetricName": "tma_info_mem_read_latency", + "PublicDescription": "Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches. ([RKL+]memory-controller only)" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", - "MetricExpr": "(IDQ.DSB_CYCLES_ANY - IDQ.DSB_CYCLES_OK) / CPU_CLK_UNHALTED.DISTRIBUTED / 2", - "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", - "MetricName": "tma_dsb", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks", + "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full))) + tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk))", + "MetricGroup": "Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_info_memory_bandwidth", + "MetricThreshold": "tma_info_memory_bandwidth > 20", + "PublicDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_mem_bandwidth, tma_sq_full" }, { - "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", - "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound)), 0)", - "MetricGroup": "TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_bad_speculation", - "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)", + "MetricExpr": "100 * tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming_stores)))", + "MetricGroup": "Mem;MemoryTLB;Offcore;tma_issueTLB", + "MetricName": "tma_info_memory_data_tlbs", + "MetricThreshold": "tma_info_memory_data_tlbs > 20", + "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", - "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL2;tma_L2_group;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_branch_mispredicts", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path.", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of Memory Latency related bottlenecks (external memory and off-core caches)", + "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound))", + "MetricGroup": "Mem;MemoryLat;Offcore;tma_issueLat", + "MetricName": "tma_info_memory_latency", + "MetricThreshold": "tma_info_memory_latency > 20", + "PublicDescription": "Total pipeline cost of Memory Latency related bottlenecks (external memory and off-core caches). Related metrics: tma_l3_hit_latency, tma_mem_latency" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", - "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)", - "MetricGroup": "BadSpec;MachineClears;TopdownL2;tma_L2_group;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_machine_clears", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes.", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks", + "MetricExpr": "100 * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_mispredictions", + "MetricThreshold": "tma_info_mispredictions > 20", + "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_branch_misprediction_cost, tma_mispredicts_resteers" }, { - "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", - "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=0x1\\,edge\\=0x1@ / slots", - "MetricGroup": "TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_backend_bound", - "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound.", - "ScaleUnit": "100%" + "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss", + "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES", + "MetricGroup": "Mem;MemoryBW;MemoryBound", + "MetricName": "tma_info_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)" }, { - "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", - "MetricExpr": "(CYCLE_ACTIVITY.STALLS_MEM_ANY + EXE_ACTIVITY.BOUND_ON_STORES) / (CYCLE_ACTIVITY.STALLS_TOTAL + (EXE_ACTIVITY.1_PORTS_UTIL + topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * EXE_ACTIVITY.2_PORTS_UTIL) + EXE_ACTIVITY.BOUND_ON_STORES) * tma_backend_bound", - "MetricGroup": "Backend;TopdownL2;tma_L2_group;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_memory_bound", - "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of branches of other types (not individually covered by other metrics in Info.Branches group)", + "MetricExpr": "1 - (tma_info_cond_nt + tma_info_cond_tk + tma_info_callret + tma_info_jump)", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_other_branches" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", - "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_MISS) / CPU_CLK_UNHALTED.THREAD, 0)", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", - "MetricName": "tma_l1_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache.", - "ScaleUnit": "100%" + "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", + "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_PENDING + DTLB_STORE_MISSES.WALK_PENDING) / (2 * tma_info_core_clks)", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_page_walks_utilization", + "MetricThreshold": "tma_info_page_walks_utilization > 0.5" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=0x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_dtlb_load", - "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss.", - "ScaleUnit": "100%" + "BriefDescription": "Average 3DXP Memory Bandwidth Use for reads [GB / sec]", + "MetricExpr": "64 * UNC_M_PMM_RPQ_INSERTS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW;Server;SoC", + "MetricName": "tma_info_pmm_read_bw" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) DTLB was missed by load accesses, that later on hit in second-level TLB (STLB)", - "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", - "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group", - "MetricName": "tma_load_stlb_hit", - "ScaleUnit": "100%" + "BriefDescription": "Average 3DXP Memory Bandwidth Use for Writes [GB / sec]", + "MetricExpr": "64 * UNC_M_PMM_WPQ_INSERTS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW;Server;SoC", + "MetricName": "tma_info_pmm_write_bw" }, { - "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by load accesses, performing a hardware page walk", - "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group", - "MetricName": "tma_load_stlb_miss", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0", + "MetricExpr": "CORE_POWER.LVL0_TURBO_LICENSE / tma_info_core_clks", + "MetricGroup": "Power", + "MetricName": "tma_info_power_license0_utilization", + "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0. This includes non-AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes." }, { - "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", - "MetricExpr": "min(13 * LD_BLOCKS.STORE_FORWARD / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_store_fwd_blk", - "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1", + "MetricExpr": "CORE_POWER.LVL1_TURBO_LICENSE / tma_info_core_clks", + "MetricGroup": "Power", + "MetricName": "tma_info_power_license1_utilization", + "MetricThreshold": "tma_info_power_license1_utilization > 0.5", + "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1. This includes high current AVX 256-bit instructions as well as low current AVX 512-bit instructions." }, { - "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", - "MetricExpr": "min((16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS.ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10 * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO))) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_lock_latency", - "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them.", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX)", + "MetricExpr": "CORE_POWER.LVL2_TURBO_LICENSE / tma_info_core_clks", + "MetricGroup": "Power", + "MetricName": "tma_info_power_license2_utilization", + "MetricThreshold": "tma_info_power_license2_utilization > 0.5", + "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX). This includes high current AVX 512-bit instructions." }, { - "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. ", - "MetricExpr": "min(Load_Miss_Real_Latency * LD_BLOCKS.NO_SR / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_split_loads", - "ScaleUnit": "100%" + "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", + "MetricExpr": "tma_retiring * tma_info_slots / cpu@UOPS_RETIRED.SLOTS\\,cmask\\=1@", + "MetricGroup": "Pipeline;Ret", + "MetricName": "tma_info_retire" }, { - "BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset", - "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_4k_aliasing", - "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", - "ScaleUnit": "100%" + "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", + "MetricExpr": "TOPDOWN.SLOTS", + "MetricGroup": "TmaL1;tma_L1_group", + "MetricName": "tma_info_slots" }, { - "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", - "MetricExpr": "L1D_PEND_MISS.FB_FULL / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryBW;TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_fb_full", - "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory).", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of Physical Core issue-slots utilized by this Logical Processor", + "MetricExpr": "(tma_info_slots / (TOPDOWN.SLOTS / 2) if #SMT_on else 1)", + "MetricGroup": "SMT;TmaL1;tma_L1_group", + "MetricName": "tma_info_slots_utilization" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads", - "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + L1D_PEND_MISS.FB_FULL_PERIODS) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / CPU_CLK_UNHALTED.THREAD)", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", - "MetricName": "tma_l2_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance.", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", + "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_DISTRIBUTED if #SMT_on else 0)", + "MetricGroup": "SMT", + "MetricName": "tma_info_smt_2t_utilization" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", - "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STALLS_L3_MISS) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", - "MetricName": "tma_l3_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance.", - "ScaleUnit": "100%" + "BriefDescription": "Socket actual clocks when any core is active on that socket", + "MetricExpr": "cha_0@event\\=0x0@", + "MetricGroup": "SoC", + "MetricName": "tma_info_socket_clks" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", - "MetricExpr": "min(((48 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 4 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + (47.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 4 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_l3_bound_group", - "MetricName": "tma_contested_accesses", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing.", - "ScaleUnit": "100%" + "BriefDescription": "STLB (2nd level TLB) data store speculative misses per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_store_stlb_mpki" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", - "MetricExpr": "min((47.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 4 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Offcore;Snoop;TopdownL4;tma_L4_group;tma_l3_bound_group", - "MetricName": "tma_data_sharing", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance.", - "ScaleUnit": "100%" + "BriefDescription": "Average Frequency Utilization relative nominal frequency", + "MetricExpr": "tma_info_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_turbo_utilization" }, { - "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", - "MetricExpr": "min((23 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 4 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l3_bound_group", - "MetricName": "tma_l3_hit_latency", - "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings.", - "ScaleUnit": "100%" + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "tma_retiring * tma_info_slots / INST_RETIRED.ANY", + "MetricGroup": "Pipeline;Ret;Retire", + "MetricName": "tma_info_uoppi", + "MetricThreshold": "tma_info_uoppi > 1.05" }, { - "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", - "MetricExpr": "L1D_PEND_MISS.L2_STALL / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_l3_bound_group", - "MetricName": "tma_sq_full", - "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). The Super Queue is used for requests to access the L2 cache or to go out to the Uncore.", - "ScaleUnit": "100%" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "tma_retiring * tma_info_slots / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW", + "MetricName": "tma_info_uptb", + "MetricThreshold": "tma_info_uptb < 7.5" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", - "MetricExpr": "min(CYCLE_ACTIVITY.STALLS_L3_MISS / CPU_CLK_UNHALTED.THREAD + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / CPU_CLK_UNHALTED.THREAD - tma_l2_bound - min(((1 - (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))) / (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + (25 * (MEM_LOAD_RETIRED.LOCAL_PMM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 33 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))))) * (CYCLE_ACTIVITY.STALLS_L3_MISS / CPU_CLK_UNHALTED.THREAD + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / CPU_CLK_UNHALTED.THREAD - tma_l2_bound) if 1e6 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM + MEM_LOAD_RETIRED.LOCAL_PMM) > MEM_LOAD_RETIRED.L1_MISS else 0), 1), 1)", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", - "MetricName": "tma_dram_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance.", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses", + "MetricExpr": "ICACHE_64B.IFTAG_STALL / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_itlb_misses", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=0x4@) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group", - "MetricName": "tma_mem_bandwidth", - "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that).", + "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", + "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_MISS) / tma_info_clks, 0)", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group", + "MetricName": "tma_l1_bound", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB_HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / CPU_CLK_UNHALTED.THREAD - tma_mem_bandwidth", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group", - "MetricName": "tma_mem_latency", - "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that).", + "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + L1D_PEND_MISS.FB_FULL_PERIODS) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_clks)", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l2_bound", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory", - "MetricExpr": "min((66.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 23 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_group", - "MetricName": "tma_local_dram", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory. Caching will improve the latency and increase performance.", + "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", + "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STALLS_L3_MISS) / tma_info_clks", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l3_bound", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory", - "MetricExpr": "min((131 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 23 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latency_group", - "MetricName": "tma_remote_dram", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory. This is caused often due to non-optimal NUMA allocations. #link to NUMA article", + "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", + "MetricExpr": "19 * tma_info_average_frequency * MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group", + "MetricName": "tma_l3_hit_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tma_info_memory_latency, tma_mem_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues", - "MetricExpr": "min(((120 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 23 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM + (120 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 23 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latency_group", - "MetricName": "tma_remote_cache", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues. This is caused often due to non-optimal NUMA allocations. #link to NUMA article", + "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", + "MetricExpr": "ILD_STALL.LCP / tma_info_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_lcp", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_dsb_misses, tma_info_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates (based on idle latencies) how often the CPU was stalled on accesses to external 3D-Xpoint (Crystal Ridge, a.k.a", - "MetricExpr": "min(((1 - (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))) / (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + (25 * (MEM_LOAD_RETIRED.LOCAL_PMM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 33 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))))) * (CYCLE_ACTIVITY.STALLS_L3_MISS / CPU_CLK_UNHALTED.THREAD + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / CPU_CLK_UNHALTED.THREAD - tma_l2_bound) if 1e6 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM + MEM_LOAD_RETIRED.LOCAL_PMM) > MEM_LOAD_RETIRED.L1_MISS else 0), 1)", - "MetricGroup": "MemoryBound;Server;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", - "MetricName": "tma_pmm_bound", - "PublicDescription": "This metric roughly estimates (based on idle latencies) how often the CPU was stalled on accesses to external 3D-Xpoint (Crystal Ridge, a.k.a. IXP) memory by loads, PMM stands for Persistent Memory Module. ", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", + "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_light_operations", + "MetricThreshold": "tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", - "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", - "MetricName": "tma_store_bound", - "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck.", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations", + "MetricExpr": "UOPS_DISPATCHED.PORT_2_3 / (2 * tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_load_op_utilization", + "MetricThreshold": "tma_load_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations. Sample with: UOPS_DISPATCHED.PORT_2_3", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", - "MetricExpr": "(L2_RQSTS.RFO_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_store_bound_group", - "MetricName": "tma_store_latency", - "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full)", + "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) DTLB was missed by load accesses, that later on hit in second-level TLB (STLB)", + "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group", + "MetricName": "tma_load_stlb_hit", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing", - "MetricExpr": "min(48 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_store_bound_group", - "MetricName": "tma_false_sharing", - "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. ", + "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by load accesses, performing a hardware page walk", + "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group", + "MetricName": "tma_load_stlb_miss", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents rate of split store accesses", - "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / CPU_CLK_UNHALTED.DISTRIBUTED", - "MetricGroup": "TopdownL4;tma_L4_group;tma_store_bound_group", - "MetricName": "tma_split_stores", - "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity.", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory", + "MetricExpr": "43.5 * tma_info_average_frequency * MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_group", + "MetricName": "tma_local_dram", + "MetricThreshold": "tma_local_dram > 0.1 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory. Caching will improve the latency and increase performance. Sample with: MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores", - "MetricExpr": "min(9 * OCR.STREAMING_WR.ANY_RESPONSE / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_store_bound_group", - "MetricName": "tma_streaming_stores", - "PublicDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should Streaming stores be a bottleneck.", + "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS.ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10 * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO))) / tma_info_clks", + "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group", + "MetricName": "tma_lock_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOADS_PS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", - "MetricExpr": "min((7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=0x1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / CPU_CLK_UNHALTED.DISTRIBUTED, 1)", - "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_store_bound_group", - "MetricName": "tma_dtlb_store", - "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data.", + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", + "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)", + "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the TLB was missed by store accesses, hitting in the second-level TLB (STLB)", - "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", - "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", - "MetricName": "tma_store_stlb_hit", + "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=4@) / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW", + "MetricName": "tma_mem_bandwidth", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_info_memory_bandwidth, tma_sq_full", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk", - "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / CPU_CLK_UNHALTED.DISTRIBUTED", - "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", - "MetricName": "tma_store_stlb_miss", + "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_clks - tma_mem_bandwidth", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat", + "MetricName": "tma_mem_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_info_memory_latency, tma_l3_hit_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", - "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)", - "MetricGroup": "Backend;Compute;TopdownL2;tma_L2_group;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_core_bound", - "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", + "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", + "MetricExpr": "(CYCLE_ACTIVITY.STALLS_MEM_ANY + EXE_ACTIVITY.BOUND_ON_STORES) / (CYCLE_ACTIVITY.STALLS_TOTAL + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL) + EXE_ACTIVITY.BOUND_ON_STORES) * tma_backend_bound", + "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", - "MetricExpr": "ARITH.DIVIDER_ACTIVE / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", - "MetricName": "tma_divider", - "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication.", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations -- uops for memory load or store accesses.", + "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_RETIRED.ANY", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_memory_operations", + "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operations > 0.6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", - "MetricExpr": "((cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=0x80@ + (EXE_ACTIVITY.1_PORTS_UTIL + topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * EXE_ACTIVITY.2_PORTS_UTIL)) / CPU_CLK_UNHALTED.THREAD if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * EXE_ACTIVITY.2_PORTS_UTIL) / CPU_CLK_UNHALTED.THREAD) + 0 * slots", - "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group", - "MetricName": "tma_ports_utilization", - "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", + "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", + "MetricExpr": "tma_retiring * tma_info_slots / UOPS_ISSUED.ANY * IDQ.MS_UOPS / tma_info_slots", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS", + "MetricName": "tma_microcode_sequencer", + "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=0x80@ / CPU_CLK_UNHALTED.THREAD + tma_serializing_operation * (CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_0", - "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage", + "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_clks", + "MetricGroup": "BadSpec;BrMispredicts;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueBM", + "MetricName": "tma_mispredicts_resteers", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_branch_mispredicts, tma_info_branch_misprediction_cost, tma_info_mispredictions", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations", - "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_0_group", - "MetricName": "tma_serializing_operation", - "PublicDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations. Instructions like CPUID; WRMSR or LFENCE serialize the out-of-order execution which may limit performance.", + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", + "MetricExpr": "(IDQ.MITE_CYCLES_ANY - IDQ.MITE_CYCLES_OK) / tma_info_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_mite", + "MetricThreshold": "tma_mite > 0.1 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 5 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sample with: FRONTEND_RETIRED.ANY_DSB_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions.", - "MetricExpr": "37 * MISC_RETIRED.PAUSE_INST / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "TopdownL6;tma_L6_group;tma_serializing_operation_group", - "MetricName": "tma_slow_pause", + "BriefDescription": "This metric represents fraction of cycles where (only) 4 uops were delivered by the MITE pipeline", + "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=4@ - cpu@IDQ.MITE_UOPS\\,cmask\\=5@) / tma_info_clks", + "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_mite_group", + "MetricName": "tma_mite_4wide", + "MetricThreshold": "tma_mite_4wide > 0.05 & (tma_mite > 0.1 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 5 > 0.35))", "ScaleUnit": "100%" }, { "BriefDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD * UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY, 1)", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_0_group", + "MetricExpr": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY", + "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utilized_0_group", "MetricName": "tma_mixing_vectors", - "PublicDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued. Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic.", + "MetricThreshold": "tma_mixing_vectors > 0.05", + "PublicDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued. Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic. Related metrics: tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_1", - "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful.", + "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", + "MetricExpr": "3 * IDQ.MS_SWITCHES / tma_info_clks", + "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_2", - "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop.", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_retiring * tma_info_slots)", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_nop_instructions", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions. Compilers often use NOPs for certain address alignments - e.g. start address of a function or loop body. Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise).", - "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_3m", + "BriefDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes", + "MetricExpr": "max(0, tma_light_operations - (tma_fp_arith + tma_memory_operations + tma_branch_instructions + tma_nop_instructions))", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_other_light_ops", + "MetricThreshold": "tma_other_light_ops > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes. May undercount due to FMA double counting", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", - "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + UOPS_DISPATCHED.PORT_5 + UOPS_DISPATCHED.PORT_6) / (4 * CPU_CLK_UNHALTED.DISTRIBUTED)", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", - "MetricName": "tma_alu_op_utilization", + "BriefDescription": "This metric roughly estimates (based on idle latencies) how often the CPU was stalled on accesses to external 3D-Xpoint (Crystal Ridge, a.k.a", + "MetricExpr": "((1 - (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))) / (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + (25 * (MEM_LOAD_RETIRED.LOCAL_PMM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 33 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))))) * (CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_clks + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_clks - tma_l2_bound) if 1e6 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM + MEM_LOAD_RETIRED.LOCAL_PMM) > MEM_LOAD_RETIRED.L1_MISS else 0)", + "MetricGroup": "MemoryBound;Server;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_pmm_bound", + "MetricThreshold": "tma_pmm_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric roughly estimates (based on idle latencies) how often the CPU was stalled on accesses to external 3D-Xpoint (Crystal Ridge, a.k.a. IXP) memory by loads, PMM stands for Persistent Memory Module.", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch)", - "MetricExpr": "UOPS_DISPATCHED.PORT_0 / CPU_CLK_UNHALTED.DISTRIBUTED", - "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilization_group", + "MetricExpr": "UOPS_DISPATCHED.PORT_0 / tma_info_core_clks", + "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", "MetricName": "tma_port_0", + "MetricThreshold": "tma_port_0 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU)", - "MetricExpr": "UOPS_DISPATCHED.PORT_1 / CPU_CLK_UNHALTED.DISTRIBUTED", - "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group", + "MetricExpr": "UOPS_DISPATCHED.PORT_1 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", "MetricName": "tma_port_1", + "MetricThreshold": "tma_port_1 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU)", - "MetricExpr": "UOPS_DISPATCHED.PORT_5 / CPU_CLK_UNHALTED.DISTRIBUTED", - "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group", + "MetricExpr": "UOPS_DISPATCHED.PORT_5 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", "MetricName": "tma_port_5", + "MetricThreshold": "tma_port_5 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU)", - "MetricExpr": "UOPS_DISPATCHED.PORT_6 / CPU_CLK_UNHALTED.DISTRIBUTED", - "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group", + "MetricExpr": "UOPS_DISPATCHED.PORT_6 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", "MetricName": "tma_port_6", + "MetricThreshold": "tma_port_6 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU). Sample with: UOPS_DISPATCHED.PORT_6. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations", - "MetricExpr": "UOPS_DISPATCHED.PORT_2_3 / (2 * CPU_CLK_UNHALTED.DISTRIBUTED)", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", - "MetricName": "tma_load_op_utilization", + "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", + "MetricExpr": "((cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=0x80@ + tma_serializing_operation * (CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY) + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / tma_info_clks if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_clks)", + "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_ports_utilization", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations", - "MetricExpr": "(UOPS_DISPATCHED.PORT_4_9 + UOPS_DISPATCHED.PORT_7_8) / (4 * CPU_CLK_UNHALTED.DISTRIBUTED)", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", - "MetricName": "tma_store_op_utilization", + "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=0x80@ / tma_info_clks + tma_serializing_operation * (CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY) / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_0", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", "ScaleUnit": "100%" }, { - "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", - "MetricGroup": "TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_retiring", - "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. ", + "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_1", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", - "MetricExpr": "max(0, topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - tma_heavy_operations)", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_L2_group;tma_retiring_group", - "MetricName": "tma_light_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved.", + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Sample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD + tma_fp_scalar + tma_fp_vector", - "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group", - "MetricName": "tma_fp_arith", - "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_3m", + "MetricThreshold": "tma_ports_utilized_3m > 0.7 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Sample with: UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric serves as an approximation of legacy x87 usage", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD + 0 * slots", - "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", - "MetricName": "tma_x87_use", - "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues", + "MetricExpr": "(97 * tma_info_average_frequency * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM + 97 * tma_info_average_frequency * MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_issueSyncxn;tma_mem_latency_group", + "MetricName": "tma_remote_cache", + "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues. This is caused often due to non-optimal NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM_PS;MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD_PS. Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE) / (topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * slots)", - "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group", - "MetricName": "tma_fp_scalar", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting.", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory", + "MetricExpr": "108 * tma_info_average_frequency * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latency_group", + "MetricName": "tma_remote_dram", + "MetricThreshold": "tma_remote_dram > 0.1 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory. This is caused often due to non-optimal NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths", - "MetricExpr": "min((FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / (topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * slots), 1)", - "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group", - "MetricName": "tma_fp_vector", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting.", + "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1", + "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors", - "MetricExpr": "min((FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / (topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * slots), 1)", - "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group", - "MetricName": "tma_fp_vector_128b", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting.", + "BriefDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations", + "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL5;tma_L5_group;tma_issueSO;tma_ports_utilized_0_group", + "MetricName": "tma_serializing_operation", + "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations. Instructions like CPUID; WRMSR or LFENCE serialize the out-of-order execution which may limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metrics: tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors", - "MetricExpr": "min((FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / (topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * slots), 1)", - "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group", - "MetricName": "tma_fp_vector_256b", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting.", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions", + "MetricExpr": "37 * MISC_RETIRED.PAUSE_INST / tma_info_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_serializing_operation_group", + "MetricName": "tma_slow_pause", + "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_operation > 0.1 & (tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions. Sample with: MISC_RETIRED.PAUSE_INST", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors", - "MetricExpr": "min((FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / (topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * slots), 1)", - "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group", - "MetricName": "tma_fp_vector_512b", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors. May overcount due to FMA double counting.", + "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary", + "MetricExpr": "tma_info_load_miss_real_latency * LD_BLOCKS.NO_SR / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_split_loads", + "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations -- uops for memory load or store accesses.", - "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_RETIRED.ANY", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", - "MetricName": "tma_memory_operations", + "BriefDescription": "This metric represents rate of split store accesses", + "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group", + "MetricName": "tma_split_stores", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_4", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions.", - "MetricExpr": "tma_light_operations * BR_INST_RETIRED.ALL_BRANCHES / (topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * slots)", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", - "MetricName": "tma_branch_instructions", + "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", + "MetricExpr": "L1D_PEND_MISS.L2_STALL / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group", + "MetricName": "tma_sq_full", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_info_memory_bandwidth, tma_mem_bandwidth", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions", - "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * slots)", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", - "MetricName": "tma_nop_instructions", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions. Compilers often use NOPs for certain address alignments - e.g. start address of a function or loop body.", + "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", + "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes. May undercount due to FMA double counting", - "MetricExpr": "max(0, tma_light_operations - (tma_fp_arith + tma_memory_operations + tma_branch_instructions + tma_nop_instructions))", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", - "MetricName": "tma_other_light_ops", + "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", + "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_store_fwd_blk", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences", - "MetricExpr": "tma_microcode_sequencer + topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * (UOPS_DECODED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=0x1@) / IDQ.MITE_UOPS", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_L2_group;tma_retiring_group", - "MetricName": "tma_heavy_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences. This highly-correlates with the uop length of these instructions/sequences.", + "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", + "MetricExpr": "(L2_RQSTS.RFO_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_clks", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group", + "MetricName": "tma_store_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops", - "MetricExpr": "tma_heavy_operations - tma_microcode_sequencer", - "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group", - "MetricName": "tma_few_uops_instructions", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number of uops in such instructions.", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations", + "MetricExpr": "(UOPS_DISPATCHED.PORT_4_9 + UOPS_DISPATCHED.PORT_7_8) / (4 * tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_store_op_utilization", + "MetricThreshold": "tma_store_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations. Sample with: UOPS_DISPATCHED.PORT_7_8", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * slots / UOPS_ISSUED.ANY * IDQ.MS_UOPS / slots", - "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group", - "MetricName": "tma_microcode_sequencer", - "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.", + "BriefDescription": "This metric roughly estimates the fraction of cycles where the TLB was missed by store accesses, hitting in the second-level TLB (STLB)", + "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", + "MetricName": "tma_store_stlb_hit", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", - "MetricExpr": "min(100 * ASSISTS.ANY / slots, 1)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", - "MetricName": "tma_assists", - "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases.", + "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk", + "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_clks", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", + "MetricName": "tma_store_stlb_miss", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", - "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", - "MetricName": "tma_cisc", - "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", + "BriefDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores", + "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueSmSt;tma_store_bound_group", + "MetricName": "tma_streaming_stores", + "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should Streaming stores be a bottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma_fb_full", "ScaleUnit": "100%" }, { - "BriefDescription": "C1 residency percent per core", - "MetricExpr": "cstate_core@c1\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C1_Core_Residency", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears", + "MetricExpr": "10 * BACLEARS.ANY / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group", + "MetricName": "tma_unknown_branches", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (e.g. first time the branch is fetched or hitting BTB capacity limit). Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per core", - "MetricExpr": "cstate_core@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Core_Residency", + "BriefDescription": "This metric serves as an approximation of legacy x87 usage", + "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD", + "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", + "MetricName": "tma_x87_use", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", "ScaleUnit": "100%" }, { - "BriefDescription": "C2 residency percent per package", - "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C2_Pkg_Residency", + "BriefDescription": "Percentage of cycles in aborted transactions.", + "MetricExpr": "max(cpu@cycles\\-t@ - cpu@cycles\\-ct@, 0) / cycles", + "MetricGroup": "transaction", + "MetricName": "tsx_aborted_cycles", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per package", - "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Pkg_Residency", + "BriefDescription": "Number of cycles within a transaction divided by the number of elisions.", + "MetricExpr": "cpu@cycles\\-t@ / cpu@el\\-start@", + "MetricGroup": "transaction", + "MetricName": "tsx_cycles_per_elision", + "ScaleUnit": "1cycles / elision" + }, + { + "BriefDescription": "Number of cycles within a transaction divided by the number of transactions.", + "MetricExpr": "cpu@cycles\\-t@ / cpu@tx\\-start@", + "MetricGroup": "transaction", + "MetricName": "tsx_cycles_per_transaction", + "ScaleUnit": "1cycles / transaction" + }, + { + "BriefDescription": "Percentage of cycles within a transaction region.", + "MetricExpr": "cpu@cycles\\-t@ / cycles", + "MetricGroup": "transaction", + "MetricName": "tsx_transactional_cycles", "ScaleUnit": "100%" } ] diff --git a/tools/perf/pmu-events/arch/x86/icelakex/uncore-memory.json b/tools/perf/pmu-events/arch/x86/icelakex/uncore-memory.json index 0d495ae53f3d..66bb4538c6f2 100644 --- a/tools/perf/pmu-events/arch/x86/icelakex/uncore-memory.json +++ b/tools/perf/pmu-events/arch/x86/icelakex/uncore-memory.json @@ -538,7 +538,7 @@ "EventCode": "0x02", "EventName": "UNC_M_PRE_COUNT.PGT", "PerPkg": "1", - "PublicDescription": "DRAM Precharge commands. : Precharge due to page table : Counts the number of DRAM Precharge commands sent on this channel. : Prechages from Page Table", + "PublicDescription": "DRAM Precharge commands. : Precharge due to page table : Counts the number of DRAM Precharge commands sent on this channel. : Precharges from Page Table", "UMask": "0x10", "Unit": "iMC" }, diff --git a/tools/perf/pmu-events/arch/x86/icelakex/uncore-other.json b/tools/perf/pmu-events/arch/x86/icelakex/uncore-other.json index 8c09d1358849..b1d29877c141 100644 --- a/tools/perf/pmu-events/arch/x86/icelakex/uncore-other.json +++ b/tools/perf/pmu-events/arch/x86/icelakex/uncore-other.json @@ -1296,7 +1296,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Counts the number of Allocate/Update to HitMe Cache : Deallocate HtiME$ on Reads without RspFwdI*", + "BriefDescription": "Counts the number of Allocate/Update to HitMe Cache : Deallocate HitME$ on Reads without RspFwdI*", "EventCode": "0x61", "EventName": "UNC_CHA_HITME_UPDATE.DEALLOCATE", "PerPkg": "1", @@ -14927,7 +14927,7 @@ "EventCode": "0x0C", "EventName": "UNC_I_TxS_REQUEST_OCCUPANCY", "PerPkg": "1", - "PublicDescription": "Outbound Request Queue Occupancy : Accumultes the number of outstanding outbound requests from the IRP to the switch (towards the devices). This can be used in conjuection with the allocations event in order to calculate average latency of outbound requests.", + "PublicDescription": "Outbound Request Queue Occupancy : Accumulates the number of outstanding outbound requests from the IRP to the switch (towards the devices). This can be used in conjunction with the allocations event in order to calculate average latency of outbound requests.", "Unit": "IRP" }, { diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv index b258702e0666..77af9d0bf6d4 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -13,7 +13,7 @@ GenuineIntel-6-A[DE],v1.00,graniterapids,core GenuineIntel-6-(3C|45|46),v32,haswell,core GenuineIntel-6-3F,v26,haswellx,core GenuineIntel-6-(7D|7E|A7),v1.17,icelake,core -GenuineIntel-6-6[AC],v1.17,icelakex,core +GenuineIntel-6-6[AC],v1.18,icelakex,core GenuineIntel-6-3A,v23,ivybridge,core GenuineIntel-6-3E,v22,ivytown,core GenuineIntel-6-2D,v21,jaketown,core From patchwork Sun Feb 19 09:28:18 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59111 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp772813wrn; Sun, 19 Feb 2023 01:34:24 -0800 (PST) X-Google-Smtp-Source: AK7set90NqTvuWwOFtHc1s1NSkLcZSvJt5fMOC9VuHEYH6FF7o6A2X1nfYygKxiHUlWam4iolURT X-Received: by 2002:a17:902:ecd0:b0:19b:2332:18cb with SMTP id a16-20020a170902ecd000b0019b233218cbmr797197plh.1.1676799263981; Sun, 19 Feb 2023 01:34:23 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799263; cv=none; d=google.com; s=arc-20160816; b=KJAWM9G83HjWrv5cFyqNDg6EMkF3ubTZeoxEBn94Hpy4kNan2DOsrbrByxKPOrUsa+ uPSPkIoR8eb6P2kRFpc0Za/x385cuEWUe2qJUS+keMLb/YR19vl9bdEQRoTgz9kmKx9W YTIoVO4cLE0gs994QM9cddYa5QdQBROtNzCXGvpS3ALycf1F/C+I0GWEyg+Va4Qs76K1 bhRxUSjJJyz75vfKQCuMaKHg6EGz2gSpJuyKVKvJz2TTHj7WAra9/902mvOJ9Q1Me4zH u3HQAtfrSsk+8MCeFoG1VM6bq8JoFqspWov9dseuJEbUAQ9PoFPxizntVJI6gbCEm6nn LA/A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:from:subject :references:mime-version:message-id:in-reply-to:date:dkim-signature; bh=stRh8ApqMxBPWgS9RfTDCEhdMBqe8tV+YtqG9YKcSyU=; b=B2iTbpolGb9JeZ7/h7UobHhVo0pMQmfXhHCbwqmFANdaF+eC+Ov/Nn8ovv+D2zMz9q kW31WUmTihDZEP9+D+2tBDEef5sA542KntWLpGXOD2vioKX/IssiCR5RbXLjGedzBZFh qRuo9L2TpLDoj9GCal0YnRuEhGkBv78d29qD4xliph3jq5GZurYAQ282lhjG6AP+PZTf UZ6/LwvnmwKxmgXDTmKBZ2PIPjyGNFtmuo5YFZxlDqE748s5qaIJ84fHa7vpheKK+WCv B8Npl3BJ/w4szF3LT4TvnEsI8DMqUx6C8I5gBjjZn+OuLwwerommk+VYKrAjuXp6xDlQ G6wA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=GlIBOCme; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id ix14-20020a170902f80e00b00194819b340dsi5341704plb.351.2023.02.19.01.34.11; Sun, 19 Feb 2023 01:34:23 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=GlIBOCme; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229954AbjBSJdb (ORCPT + 99 others); Sun, 19 Feb 2023 04:33:31 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55590 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229994AbjBSJdL (ORCPT ); Sun, 19 Feb 2023 04:33:11 -0500 Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com [IPv6:2607:f8b0:4864:20::1149]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3FE28EC46 for ; Sun, 19 Feb 2023 01:32:16 -0800 (PST) Received: by mail-yw1-x1149.google.com with SMTP id 00721157ae682-53686d8ce27so20544657b3.5 for ; Sun, 19 Feb 2023 01:32:16 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:from:to:cc:subject:date :message-id:reply-to; bh=stRh8ApqMxBPWgS9RfTDCEhdMBqe8tV+YtqG9YKcSyU=; b=GlIBOCmeCUkTpgViwUrh1O5mbQZWUVAJZbknA/T2w/0w0AI0iMilDoIPxmK3xTFFDb 91EjMa3FHy8d98EpUEUlR5bAdBN1Q5PqHisBMdvAXOspLIXQLNl3Uv6vQ7dtooMWXq8v VL7a1/W4jiEFiMpQub8afEpW8BFZ2DOMR2GBnC8vuTJxHXt1U8SWT8Ohuh9Cv6KDKOd+ mHiUggr5Ea5WJpl43gVAO/yTI5HyoVTUXrXO+FxOZcQA6GU3ARCNgWUC3JMeVvHMySq3 per07APuaIkAppF/FkIM7pCGVYnw84e3UTv9YqNN/hnalbaFkxgnNemEDcZr68F6M/FI oRSA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=stRh8ApqMxBPWgS9RfTDCEhdMBqe8tV+YtqG9YKcSyU=; b=2y7UGumJIOcKAuMGan/+YY+x4BjkHlNpAw21Az3pe/SuCA2ZvGRI6D06Ynw6uR8GL0 9kUuHBL0XMk1gMrRLRyQZcW1yThcgMZcqThSpZPotpUIIA5MoTHvlhMmjrOdusMoQ5PI kDreRzPV29qZ7r0Dzypmyo0HYLKtQNeSmgBFdHwJBFXqtu1NGDgNtIphY8iw5zplNbB3 msBfJ7u2KyOANhgEVsz7mUlDR1w2GoXY+M39QdHmseFOxGO+50+COnqR9RIocotLa6+p Y8FB5L90RXLgAuP1OAY/T5lMlmNX+n67KR8cl1DF98zp9c1M4DAYuGlGCpIwkQnXLx5+ LvfA== X-Gm-Message-State: AO0yUKXPgwvztthneFoUL9NujiORag2RktFoJy467IclUAJEEALle+ci ksXvAczJmvReqIt96pij1FO5jlgtGitL X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a5b:904:0:b0:90b:5f44:f310 with SMTP id a4-20020a5b0904000000b0090b5f44f310mr2061892ybq.134.1676799113718; Sun, 19 Feb 2023 01:31:53 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:18 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-22-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 21/51] perf vendor events intel: Refresh ivybridge metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758251464969312178?= X-GMAIL-MSGID: =?utf-8?q?1758251464969312178?= Update the ivybridge metrics to TMA version 4.5. Generation was done using https://github.com/intel/perfmon. Notable changes are TMA info metrics are renamed from their node name to be lower case and prefixed by tma_info_, MetricThreshold expressions are added, and the smi_cost metric group is added replicating existing hard coded metrics in stat-shadow. Signed-off-by: Ian Rogers --- .../arch/x86/ivybridge/ivb-metrics.json | 1270 +++++++++-------- 1 file changed, 712 insertions(+), 558 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/ivybridge/ivb-metrics.json b/tools/perf/pmu-events/arch/x86/ivybridge/ivb-metrics.json index 88980c1a3a64..5247f69c13b6 100644 --- a/tools/perf/pmu-events/arch/x86/ivybridge/ivb-metrics.json +++ b/tools/perf/pmu-events/arch/x86/ivybridge/ivb-metrics.json @@ -1,842 +1,996 @@ [ { - "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", - "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / SLOTS", - "MetricGroup": "PGO;TopdownL1;tma_L1_group", - "MetricName": "tma_frontend_bound", - "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Machine_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound.", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", - "MetricExpr": "4 * min(CPU_CLK_UNHALTED.THREAD, IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE) / SLOTS", - "MetricGroup": "Frontend;TopdownL2;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_latency", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: RS_EVENTS.EMPTY_END", + "BriefDescription": "C2 residency percent per package", + "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C2_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses.", - "MetricExpr": "ICACHE.IFETCH_STALL / CLKS - tma_itlb_misses", - "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_icache_misses", + "BriefDescription": "C3 residency percent per core", + "MetricExpr": "cstate_core@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses", - "MetricExpr": "(12 * ITLB_MISSES.STLB_HIT + ITLB_MISSES.WALK_DURATION) / CLKS", - "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_itlb_misses", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: ITLB_MISSES.WALK_COMPLETED", + "BriefDescription": "C3 residency percent per package", + "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", - "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY) / CLKS", - "MetricGroup": "FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_branch_resteers", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "BriefDescription": "C6 residency percent per core", + "MetricExpr": "cstate_core@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / CLKS", - "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_dsb_switches", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty.", + "BriefDescription": "C6 residency percent per package", + "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", - "MetricExpr": "ILD_STALL.LCP / CLKS", - "MetricGroup": "FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_lcp", - "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs.", + "BriefDescription": "C7 residency percent per core", + "MetricExpr": "cstate_core@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", - "MetricExpr": "3 * IDQ.MS_SWITCHES / CLKS", - "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_ms_switches", - "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES", + "BriefDescription": "C7 residency percent per package", + "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", - "MetricExpr": "tma_frontend_bound - tma_fetch_latency", - "MetricGroup": "FetchBW;Frontend;TopdownL2;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_bandwidth", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend.", - "ScaleUnit": "100%" + "BriefDescription": "Uncore frequency per die [GHZ]", + "MetricExpr": "tma_info_socket_clks / #num_dies / duration_time / 1e9", + "MetricGroup": "SoC", + "MetricName": "UNCORE_FREQ" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", - "MetricExpr": "(IDQ.ALL_MITE_CYCLES_ANY_UOPS - IDQ.ALL_MITE_CYCLES_4_UOPS) / CORE_CLKS / 2", - "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_fetch_bandwidth_group", - "MetricName": "tma_mite", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", + "BriefDescription": "Percentage of cycles spent in System Management Interrupts.", + "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)", + "MetricGroup": "smi", + "MetricName": "smi_cycles", + "MetricThreshold": "smi_cycles > 0.1", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", - "MetricExpr": "(IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4_UOPS) / CORE_CLKS / 2", - "MetricGroup": "DSB;FetchBW;TopdownL3;tma_fetch_bandwidth_group", - "MetricName": "tma_dsb", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", - "ScaleUnit": "100%" + "BriefDescription": "Number of SMI interrupts.", + "MetricExpr": "msr@smi@", + "MetricGroup": "smi", + "MetricName": "smi_num", + "ScaleUnit": "1SMI#" }, { - "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", - "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group", - "MetricName": "tma_bad_speculation", - "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", + "BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset", + "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_4k_aliasing", + "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", - "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL2;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_branch_mispredicts", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5) / (3 * tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_alu_op_utilization", + "MetricThreshold": "tma_alu_op_utilization > 0.6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", - "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts", - "MetricGroup": "BadSpec;MachineClears;TopdownL2;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_machine_clears", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT", + "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", + "MetricExpr": "100 * OTHER_ASSISTS.ANY_WB_ASSIST / tma_info_slots", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_assists", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.ANY", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "1 - (tma_frontend_bound + tma_bad_speculation + tma_retiring)", - "MetricGroup": "TopdownL1;tma_L1_group", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", + "MetricThreshold": "tma_backend_bound > 0.2", "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", - "MetricExpr": "(min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING) + RESOURCE_STALLS.SB) / (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if IPC > 1.8 else (UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC - RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else RESOURCE_STALLS.SB)) * tma_backend_bound", - "MetricGroup": "Backend;TopdownL2;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_memory_bound", - "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", - "MetricExpr": "max((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING) - CYCLE_ACTIVITY.STALLS_L1D_PENDING) / CLKS, 0)", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l1_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT_PS;MEM_LOAD_UOPS_RETIRED.HIT_LFB_PS", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "(7 * DTLB_LOAD_MISSES.STLB_HIT + DTLB_LOAD_MISSES.WALK_DURATION) / CLKS", - "MetricGroup": "MemoryTLB;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_dtlb_load", - "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", - "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_store_fwd_blk", - "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", - "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO) / CLKS", - "MetricGroup": "Offcore;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_lock_latency", - "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOADS_PS", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary", - "MetricExpr": "13 * LD_BLOCKS.NO_SR / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_split_loads", - "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS_PS", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset", - "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_4k_aliasing", - "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", + "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", + "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_bad_speculation", + "MetricThreshold": "tma_bad_speculation > 0.15", + "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", - "MetricExpr": "Load_Miss_Real_Latency * cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=1@ / CLKS", - "MetricGroup": "MemoryBW;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_fb_full", - "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory).", + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation", + "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM", + "MetricName": "tma_branch_mispredicts", + "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_info_branch_misprediction_cost, tma_mispredicts_resteers", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads", - "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_PENDING - CYCLE_ACTIVITY.STALLS_L2_PENDING) / CLKS", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l2_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L2_HIT_PS", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", + "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY) / tma_info_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_branch_resteers", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", - "MetricExpr": "MEM_LOAD_UOPS_RETIRED.LLC_HIT / (MEM_LOAD_UOPS_RETIRED.LLC_HIT + 7 * MEM_LOAD_UOPS_RETIRED.LLC_MISS) * CYCLE_ACTIVITY.STALLS_L2_PENDING / CLKS", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l3_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", + "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_cisc", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", - "MetricExpr": "(60 * (MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.LLC_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.LLC_MISS))) + 43 * (MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.LLC_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.LLC_MISS)))) / CLKS", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_l3_bound_group", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(60 * (MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.LLC_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.LLC_MISS))) + 43 * (MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.LLC_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.LLC_MISS)))) / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", - "MetricExpr": "43 * (MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.LLC_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.LLC_MISS))) / CLKS", - "MetricGroup": "Offcore;Snoop;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_data_sharing", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT_PS", + "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_backend_bound - tma_memory_bound", + "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", - "MetricExpr": "29 * (MEM_LOAD_UOPS_RETIRED.LLC_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.LLC_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.LLC_MISS))) / CLKS", - "MetricGroup": "MemoryLat;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_l3_hit_latency", - "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "43 * (MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.LLC_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.LLC_MISS))) / tma_info_clks", + "MetricGroup": "Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_data_sharing", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", - "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on else OFFCORE_REQUESTS_BUFFER.SQ_FULL) / CORE_CLKS", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_sq_full", - "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). The Super Queue is used for requests to access the L2 cache or to go out to the Uncore.", + "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", + "MetricExpr": "ARITH.FPU_DIV_ACTIVE / tma_info_core_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_divider", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_UOPS", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", - "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.LLC_HIT / (MEM_LOAD_UOPS_RETIRED.LLC_HIT + 7 * MEM_LOAD_UOPS_RETIRED.LLC_MISS)) * CYCLE_ACTIVITY.STALLS_L2_PENDING / CLKS", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", + "MetricConstraint": "NO_GROUP_EVENTS_SMT", + "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.LLC_HIT / (MEM_LOAD_UOPS_RETIRED.LLC_HIT + 7 * MEM_LOAD_UOPS_RETIRED.LLC_MISS)) * CYCLE_ACTIVITY.STALLS_L2_PENDING / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_MISS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=6@) / CLKS", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_dram_bound_group", - "MetricName": "tma_mem_bandwidth", - "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that).", + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", + "MetricExpr": "(IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4_UOPS) / tma_info_core_clks / 2", + "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_dsb", + "MetricThreshold": "tma_dsb > 0.15 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / CLKS - tma_mem_bandwidth", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_dram_bound_group", - "MetricName": "tma_mem_latency", - "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that).", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_clks", + "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_dsb_switches", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Related metrics: tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_iptb, tma_lcp", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", - "MetricExpr": "RESOURCE_STALLS.SB / CLKS", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_store_bound", - "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_UOPS_RETIRED.ALL_STORES_PS", + "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", + "MetricExpr": "(7 * DTLB_LOAD_MISSES.STLB_HIT + DTLB_LOAD_MISSES.WALK_DURATION) / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group", + "MetricName": "tma_dtlb_load", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", - "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / CLKS", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_store_bound_group", - "MetricName": "tma_store_latency", - "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full)", + "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", + "MetricExpr": "(7 * DTLB_STORE_MISSES.STLB_HIT + DTLB_STORE_MISSES.WALK_DURATION) / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group", + "MetricName": "tma_dtlb_store", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_UOPS_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing", - "MetricExpr": "60 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_HIT.HITM_OTHER_CORE / CLKS", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_store_bound_group", + "MetricExpr": "60 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_HIT.HITM_OTHER_CORE / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents rate of split store accesses", - "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / CORE_CLKS", - "MetricGroup": "TopdownL4;tma_store_bound_group", - "MetricName": "tma_split_stores", - "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_UOPS_RETIRED.SPLIT_STORES_PS", + "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_info_load_miss_real_latency * cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=1@ / tma_info_clks", + "MetricGroup": "MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricName": "tma_fb_full", + "MetricThreshold": "tma_fb_full > 0.3", + "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", - "MetricExpr": "(7 * DTLB_STORE_MISSES.STLB_HIT + DTLB_STORE_MISSES.WALK_DURATION) / CLKS", - "MetricGroup": "MemoryTLB;TopdownL4;tma_store_bound_group", - "MetricName": "tma_dtlb_store", - "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_UOPS_RETIRED.STLB_MISS_STORES_PS", + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", + "MetricExpr": "tma_frontend_bound - tma_fetch_latency", + "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Related metrics: tma_dsb_switches, tma_info_dsb_coverage, tma_info_iptb, tma_lcp", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", - "MetricExpr": "tma_backend_bound - tma_memory_bound", - "MetricGroup": "Backend;Compute;TopdownL2;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_core_bound", - "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", + "MetricExpr": "4 * min(CPU_CLK_UNHALTED.THREAD, IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE) / tma_info_slots", + "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: RS_EVENTS.EMPTY_END", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", - "MetricExpr": "ARITH.FPU_DIV_ACTIVE / CORE_CLKS", - "MetricGroup": "TopdownL3;tma_core_bound_group", - "MetricName": "tma_divider", - "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_UOPS", + "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", + "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_fp_arith", + "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", - "MetricExpr": "((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if IPC > 1.8 else (UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC - RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else RESOURCE_STALLS.SB)) - RESOURCE_STALLS.SB - min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING)) / CLKS", - "MetricGroup": "PortsUtil;TopdownL3;tma_core_bound_group", - "MetricName": "tma_ports_utilization", - "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", + "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired", + "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE) / UOPS_EXECUTED.THREAD", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", + "MetricName": "tma_fp_scalar", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\,cmask\\=1@ / 2 if #SMT_on else (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) - RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) / CORE_CLKS)", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_0", - "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", + "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths", + "MetricExpr": "(FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + FP_COMP_OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) / UOPS_EXECUTED.THREAD", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", + "MetricName": "tma_fp_vector", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) / CORE_CLKS)", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_1", - "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful.", + "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", + "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / tma_info_slots", + "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / CORE_CLKS)", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_2", - "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop.", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences", + "MetricExpr": "tma_microcode_sequencer", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_heavy_operations", + "MetricThreshold": "tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise).", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / CORE_CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_3m", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses.", + "MetricExpr": "ICACHE.IFETCH_STALL / tma_info_clks - tma_itlb_misses", + "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_icache_misses", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", - "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5) / (3 * CORE_CLKS)", - "MetricGroup": "TopdownL5;tma_ports_utilized_3m_group", - "MetricName": "tma_alu_op_utilization", - "ScaleUnit": "100%" + "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", + "MetricExpr": "tma_info_turbo_utilization * TSC / 1e9 / duration_time", + "MetricGroup": "Power;Summary", + "MetricName": "tma_info_average_frequency" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch) Sample with: UOPS_DISPATCHED_PORT.PORT_0", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_0 / CORE_CLKS", - "MetricGroup": "Compute;TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_0", - "ScaleUnit": "100%" + "BriefDescription": "Branch instructions per taken branch.", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_bptkbranch" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU) Sample with: UOPS_DISPATCHED_PORT.PORT_1", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_1 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_1", - "ScaleUnit": "100%" + "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "Pipeline", + "MetricName": "tma_info_clks" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU) Sample with: UOPS_DISPATCHED.PORT_5", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_5 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_5", - "ScaleUnit": "100%" + "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_clks))", + "MetricGroup": "SMT", + "MetricName": "tma_info_core_clks" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations Sample with: UOPS_DISPATCHED.PORT_2_3", - "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_2 + UOPS_DISPATCHED_PORT.PORT_3 - UOPS_DISPATCHED_PORT.PORT_4) / (2 * CORE_CLKS)", - "MetricGroup": "TopdownL5;tma_ports_utilized_3m_group", - "MetricName": "tma_load_op_utilization", - "ScaleUnit": "100%" + "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_core_clks", + "MetricGroup": "Ret;SMT;TmaL1;tma_L1_group", + "MetricName": "tma_info_coreipc" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [ICL+] Loads) Sample with: UOPS_DISPATCHED_PORT.PORT_2", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_2 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_load_op_utilization_group", - "MetricName": "tma_port_2", - "ScaleUnit": "100%" + "BriefDescription": "Cycles Per Instruction (per Logical Processor)", + "MetricExpr": "1 / tma_info_ipc", + "MetricGroup": "Mem;Pipeline", + "MetricName": "tma_info_cpi" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [ICL+] Loads) Sample with: UOPS_DISPATCHED_PORT.PORT_3", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_3 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_load_op_utilization_group", - "MetricName": "tma_port_3", - "ScaleUnit": "100%" + "BriefDescription": "Average CPU Utilization", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": "HPC;Summary", + "MetricName": "tma_info_cpu_utilization" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations", - "MetricExpr": "tma_port_4", - "MetricGroup": "TopdownL5;tma_ports_utilized_3m_group", - "MetricName": "tma_store_op_utilization", - "ScaleUnit": "100%" + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD / OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_data_l2_mlp" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data) Sample with: UOPS_DISPATCHED_PORT.PORT_4", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_4 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_store_op_utilization_group", - "MetricName": "tma_port_4", - "ScaleUnit": "100%" + "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", + "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_REQUESTS.ALL) / 1e6 / duration_time / 1e3", + "MetricGroup": "HPC;Mem;MemoryBW;SoC;tma_issueBW", + "MetricName": "tma_info_dram_bw_use", + "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_fb_full, tma_mem_bandwidth, tma_sq_full" }, { - "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group", - "MetricName": "tma_retiring", - "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.RETIRE_SLOTS", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", - "MetricExpr": "tma_retiring - tma_heavy_operations", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_retiring_group", - "MetricName": "tma_light_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_dsb_coverage", + "MetricThreshold": "tma_info_dsb_coverage < 0.7 & tma_info_ipc / 4 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_iptb, tma_lcp" }, { - "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", - "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", - "MetricGroup": "HPC;TopdownL3;tma_light_operations_group", - "MetricName": "tma_fp_arith", - "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", - "ScaleUnit": "100%" + "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-thread", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@", + "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", + "MetricName": "tma_info_execute" }, { - "BriefDescription": "This metric serves as an approximation of legacy x87 usage", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS * FP_COMP_OPS_EXE.X87 / UOPS_EXECUTED.THREAD", - "MetricGroup": "Compute;TopdownL4;tma_fp_arith_group", - "MetricName": "tma_x87_use", - "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", - "ScaleUnit": "100%" + "BriefDescription": "The ratio of Executed- by Issued-Uops", + "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", + "MetricGroup": "Cor;Pipeline", + "MetricName": "tma_info_execute_per_issue", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage." }, { - "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired", - "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE) / UOPS_EXECUTED.THREAD", - "MetricGroup": "Compute;Flops;TopdownL4;tma_fp_arith_group", - "MetricName": "tma_fp_scalar", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting.", - "ScaleUnit": "100%" + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE + 2 * FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + 4 * (FP_COMP_OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) + 8 * SIMD_FP_256.PACKED_SINGLE) / tma_info_core_clks", + "MetricGroup": "Flops;Ret", + "MetricName": "tma_info_flopc" }, { - "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths", - "MetricExpr": "(FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + FP_COMP_OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) / UOPS_EXECUTED.THREAD", - "MetricGroup": "Compute;Flops;TopdownL4;tma_fp_arith_group", - "MetricName": "tma_fp_vector", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting.", - "ScaleUnit": "100%" + "BriefDescription": "Giga Floating Point Operations Per Second", + "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE + 2 * FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + 4 * (FP_COMP_OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) + 8 * SIMD_FP_256.PACKED_SINGLE) / 1e9 / duration_time", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_gflops", + "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine." }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences", - "MetricExpr": "tma_microcode_sequencer", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_retiring_group", - "MetricName": "tma_heavy_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences. This highly-correlates with the uop length of these instructions/sequences.", - "ScaleUnit": "100%" + "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", + "MetricExpr": "UOPS_EXECUTED.THREAD / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)", + "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", + "MetricName": "tma_info_ilp" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / SLOTS", - "MetricGroup": "MicroSeq;TopdownL3;tma_heavy_operations_group", - "MetricName": "tma_microcode_sequencer", - "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS", - "ScaleUnit": "100%" + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Summary;TmaL1;tma_L1_group", + "MetricName": "tma_info_instructions", + "PublicDescription": "Total number of retired Instructions. Sample with: INST_RETIRED.PREC_DIST" }, { - "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", - "MetricExpr": "100 * OTHER_ASSISTS.ANY_WB_ASSIST / SLOTS", - "MetricGroup": "TopdownL4;tma_microcode_sequencer_group", - "MetricName": "tma_assists", - "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.ANY", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)", + "MetricExpr": "1 / (tma_fp_scalar + tma_fp_vector)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_iparith", + "MetricThreshold": "tma_info_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). May undercount due to FMA double counting. Approximated prior to BDW." }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", - "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", - "MetricGroup": "TopdownL4;tma_microcode_sequencer_group", - "MetricName": "tma_cisc", - "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Branches;Fed;InsType", + "MetricName": "tma_info_ipbranch", + "MetricThreshold": "tma_info_ipbranch < 8" }, { "BriefDescription": "Instructions Per Cycle (per Logical Processor)", - "MetricExpr": "INST_RETIRED.ANY / CLKS", + "MetricExpr": "INST_RETIRED.ANY / tma_info_clks", "MetricGroup": "Ret;Summary", - "MetricName": "IPC" + "MetricName": "tma_info_ipc" }, { - "BriefDescription": "Uops Per Instruction", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY", - "MetricGroup": "Pipeline;Ret;Retire", - "MetricName": "UPI" + "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_ipcall", + "MetricThreshold": "tma_info_ipcall < 200" }, { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW", - "MetricName": "UpTB" + "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", + "MetricGroup": "Branches;OS", + "MetricName": "tma_info_ipfarbranch", + "MetricThreshold": "tma_info_ipfarbranch < 1e6" }, { - "BriefDescription": "Cycles Per Instruction (per Logical Processor)", - "MetricExpr": "1 / IPC", - "MetricGroup": "Mem;Pipeline", - "MetricName": "CPI" + "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_LOADS", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipload", + "MetricThreshold": "tma_info_ipload < 3" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "Pipeline", - "MetricName": "CLKS" + "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).", + "MetricExpr": "tma_info_instructions / (UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * cpu@BR_MISP_EXEC.ALL_BRANCHES\\,umask\\=0xE4@)", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_indirect", + "MetricThreshold": "tma_info_ipmisp_indirect < 1e3" }, { - "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "4 * CORE_CLKS", - "MetricGroup": "tma_L1_group", - "MetricName": "SLOTS" + "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BadSpec;BrMispredicts", + "MetricName": "tma_info_ipmispredict", + "MetricThreshold": "tma_info_ipmispredict < 200" }, { - "BriefDescription": "The ratio of Executed- by Issued-Uops", - "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", - "MetricGroup": "Cor;Pipeline", - "MetricName": "Execute_per_Issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage." + "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_STORES", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipstore", + "MetricThreshold": "tma_info_ipstore < 8" }, { - "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", - "MetricExpr": "INST_RETIRED.ANY / CORE_CLKS", - "MetricGroup": "Ret;SMT;tma_L1_group", - "MetricName": "CoreIPC" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", + "MetricName": "tma_info_iptb", + "MetricThreshold": "tma_info_iptb < 9", + "PublicDescription": "Instruction per taken branch. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_lcp" }, { - "BriefDescription": "Floating Point Operations Per Cycle", - "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE + 2 * FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + 4 * (FP_COMP_OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) + 8 * SIMD_FP_256.PACKED_SINGLE) / CORE_CLKS", - "MetricGroup": "Flops;Ret", - "MetricName": "FLOPc" + "BriefDescription": "Instructions per speculative Unknown Branch Misprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_instructions / BACLEARS.ANY", + "MetricGroup": "Fed", + "MetricName": "tma_info_ipunknown_branch" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)", - "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", - "MetricName": "ILP" + "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_cpi" }, { - "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else CLKS))", - "MetricGroup": "SMT", - "MetricName": "CORE_CLKS" + "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_utilization", + "MetricThreshold": "tma_info_kernel_utilization > 0.05" }, { - "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_LOADS", - "MetricGroup": "InsType", - "MetricName": "IpLoad" + "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw" }, { - "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_STORES", - "MetricGroup": "InsType", - "MetricName": "IpStore" + "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "tma_info_l1d_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw_1t" }, { - "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Branches;Fed;InsType", - "MetricName": "IpBranch" + "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l1mpki" }, { - "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "IpCall" + "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw" }, { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO", - "MetricName": "IpTB" + "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "tma_info_l2_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw_1t" }, { - "BriefDescription": "Branch instructions per taken branch. ", - "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "BpTkBranch" + "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheMisses;Mem", + "MetricName": "tma_info_l2mpki" }, { - "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)", - "MetricExpr": "1 / (tma_fp_scalar + tma_fp_vector)", - "MetricGroup": "Flops;InsType", - "MetricName": "IpArith", - "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). May undercount due to FMA double counting. Approximated prior to BDW." + "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "0", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_l3_cache_access_bw_1t" }, { - "BriefDescription": "Total number of retired Instructions Sample with: INST_RETIRED.PREC_DIST", - "MetricExpr": "INST_RETIRED.ANY", - "MetricGroup": "Summary;tma_L1_group", - "MetricName": "Instructions" + "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw" }, { - "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=1@", - "MetricGroup": "Pipeline;Ret", - "MetricName": "Retire" + "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "tma_info_l3_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw_1t" }, { - "BriefDescription": "", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@", - "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", - "MetricName": "Execute" + "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.LLC_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l3mpki" }, { - "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", - "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", - "MetricGroup": "DSB;Fed;FetchBW", - "MetricName": "DSB_Coverage" + "BriefDescription": "Average Latency for L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_load_l2_miss_latency" }, { - "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BadSpec;BrMispredicts", - "MetricName": "IpMispredict" + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_load_l2_mlp" }, { "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", + "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_UOPS_RETIRED.L1_MISS + MEM_LOAD_UOPS_RETIRED.HIT_LFB)", "MetricGroup": "Mem;MemoryBound;MemoryLat", - "MetricName": "Load_Miss_Real_Latency" + "MetricName": "tma_info_load_miss_real_latency" + }, + { + "BriefDescription": "Average number of parallel requests to external memory", + "MetricExpr": "UNC_ARB_TRK_OCCUPANCY.ALL / UNC_ARB_TRK_OCCUPANCY.CYCLES_WITH_ANY_REQUEST", + "MetricGroup": "Mem;SoC", + "MetricName": "tma_info_mem_parallel_requests", + "PublicDescription": "Average number of parallel requests to external memory. Accounts for all requests" }, { - "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)", + "BriefDescription": "Average latency of all requests to external memory (in Uncore cycles)", + "MetricExpr": "UNC_ARB_TRK_OCCUPANCY.ALL / UNC_ARB_TRK_REQUESTS.ALL", + "MetricGroup": "Mem;SoC", + "MetricName": "tma_info_mem_request_latency" + }, + { + "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss", + "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES", "MetricGroup": "Mem;MemoryBW;MemoryBound", - "MetricName": "MLP" + "MetricName": "tma_info_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)" }, { - "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L1_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L1MPKI" + "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", + "MetricExpr": "(ITLB_MISSES.WALK_DURATION + DTLB_LOAD_MISSES.WALK_DURATION + DTLB_STORE_MISSES.WALK_DURATION) / tma_info_core_clks", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_page_walks_utilization", + "MetricThreshold": "tma_info_page_walks_utilization > 0.5" }, { - "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L2_MISS / INST_RETIRED.ANY", - "MetricGroup": "Backend;CacheMisses;Mem", - "MetricName": "L2MPKI" + "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=1@", + "MetricGroup": "Pipeline;Ret", + "MetricName": "tma_info_retire" }, { - "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.LLC_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L3MPKI" + "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", + "MetricExpr": "4 * tma_info_core_clks", + "MetricGroup": "TmaL1;tma_L1_group", + "MetricName": "tma_info_slots" }, { - "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", - "MetricConstraint": "NO_NMI_WATCHDOG", - "MetricExpr": "(ITLB_MISSES.WALK_DURATION + DTLB_LOAD_MISSES.WALK_DURATION + DTLB_STORE_MISSES.WALK_DURATION) / CORE_CLKS", - "MetricGroup": "Mem;MemoryTLB", - "MetricName": "Page_Walks_Utilization" + "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", + "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", + "MetricGroup": "SMT", + "MetricName": "tma_info_smt_2t_utilization" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW" + "BriefDescription": "Socket actual clocks when any core is active on that socket", + "MetricExpr": "UNC_CLOCK.SOCKET", + "MetricGroup": "SoC", + "MetricName": "tma_info_socket_clks" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW" + "BriefDescription": "Average Frequency Utilization relative nominal frequency", + "MetricExpr": "tma_info_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_turbo_utilization" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW" + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY", + "MetricGroup": "Pipeline;Ret;Retire", + "MetricName": "tma_info_uoppi", + "MetricThreshold": "tma_info_uoppi > 1.05" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "L1D_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW_1T" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW", + "MetricName": "tma_info_uptb", + "MetricThreshold": "tma_info_uptb < 6" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "L2_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW_1T" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses", + "MetricExpr": "(12 * ITLB_MISSES.STLB_HIT + ITLB_MISSES.WALK_DURATION) / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_itlb_misses", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: ITLB_MISSES.WALK_COMPLETED", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "L3_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW_1T" + "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", + "MetricExpr": "max((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING) - CYCLE_ACTIVITY.STALLS_L1D_PENDING) / tma_info_clks, 0)", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group", + "MetricName": "tma_l1_bound", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT_PS;MEM_LOAD_UOPS_RETIRED.HIT_LFB_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "0", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "L3_Cache_Access_BW_1T" + "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads", + "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_PENDING - CYCLE_ACTIVITY.STALLS_L2_PENDING) / tma_info_clks", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l2_bound", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L2_HIT_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average CPU Utilization", - "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", - "MetricGroup": "HPC;Summary", - "MetricName": "CPU_Utilization" + "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", + "MetricConstraint": "NO_GROUP_EVENTS_SMT", + "MetricExpr": "MEM_LOAD_UOPS_RETIRED.LLC_HIT / (MEM_LOAD_UOPS_RETIRED.LLC_HIT + 7 * MEM_LOAD_UOPS_RETIRED.LLC_MISS) * CYCLE_ACTIVITY.STALLS_L2_PENDING / tma_info_clks", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l3_bound", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", - "MetricExpr": "Turbo_Utilization * TSC / 1e9 / duration_time", - "MetricGroup": "Power;Summary", - "MetricName": "Average_Frequency" + "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "29 * (MEM_LOAD_UOPS_RETIRED.LLC_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.LLC_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.LLC_MISS))) / tma_info_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group", + "MetricName": "tma_l3_hit_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS. Related metrics: tma_mem_latency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE + 2 * FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + 4 * (FP_COMP_OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) + 8 * SIMD_FP_256.PACKED_SINGLE) / 1e9 / duration_time", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "GFLOPs", - "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine." + "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", + "MetricExpr": "ILD_STALL.LCP / tma_info_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_lcp", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_iptb", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average Frequency Utilization relative nominal frequency", - "MetricExpr": "CLKS / CPU_CLK_UNHALTED.REF_TSC", - "MetricGroup": "Power", - "MetricName": "Turbo_Utilization" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", + "MetricExpr": "tma_retiring - tma_heavy_operations", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_light_operations", + "MetricThreshold": "tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", - "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", - "MetricGroup": "SMT", - "MetricName": "SMT_2T_Utilization" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_2 + UOPS_DISPATCHED_PORT.PORT_3 - UOPS_DISPATCHED_PORT.PORT_4) / (2 * tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_load_op_utilization", + "MetricThreshold": "tma_load_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations. Sample with: UOPS_DISPATCHED.PORT_2_3", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "OS", - "MetricName": "Kernel_Utilization" + "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO) / tma_info_clks", + "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group", + "MetricName": "tma_lock_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOADS_PS. Related metrics: tma_store_latency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", - "MetricGroup": "OS", - "MetricName": "Kernel_CPI" + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts", + "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", - "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_REQUESTS.ALL) / 1e6 / duration_time / 1e3", - "MetricGroup": "HPC;Mem;MemoryBW;SoC", - "MetricName": "DRAM_BW_Use" + "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=6@) / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW", + "MetricName": "tma_mem_bandwidth", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_sq_full", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average latency of all requests to external memory (in Uncore cycles)", - "MetricExpr": "MEM_Parallel_Requests", - "MetricGroup": "Mem;SoC", - "MetricName": "MEM_Request_Latency" + "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_clks - tma_mem_bandwidth", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat", + "MetricName": "tma_mem_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_l3_hit_latency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average number of parallel requests to external memory. Accounts for all requests", - "MetricExpr": "UNC_ARB_TRK_OCCUPANCY.ALL / UNC_ARB_TRK_REQUESTS.ALL", - "MetricGroup": "Mem;SoC", - "MetricName": "MEM_Parallel_Requests" + "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING) + RESOURCE_STALLS.SB) / (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - (UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if tma_info_ipc > 1.8 else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) * tma_backend_bound", + "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", + "ScaleUnit": "100%" }, { - "BriefDescription": "Socket actual clocks when any core is active on that socket", - "MetricExpr": "UNC_CLOCK.SOCKET", - "MetricGroup": "SoC", - "MetricName": "Socket_CLKS" + "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / tma_info_slots", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS", + "MetricName": "tma_microcode_sequencer", + "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", - "MetricGroup": "Branches;OS", - "MetricName": "IpFarBranch" + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", + "MetricExpr": "(IDQ.ALL_MITE_CYCLES_ANY_UOPS - IDQ.ALL_MITE_CYCLES_4_UOPS) / tma_info_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_mite", + "MetricThreshold": "tma_mite > 0.1 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Uncore frequency per die [GHZ]", - "MetricExpr": "Socket_CLKS / #num_dies / duration_time / 1e9", - "MetricGroup": "SoC", - "MetricName": "UNCORE_FREQ" + "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", + "MetricExpr": "3 * IDQ.MS_SWITCHES / tma_info_clks", + "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", + "ScaleUnit": "100%" }, { - "BriefDescription": "C3 residency percent per core", - "MetricExpr": "cstate_core@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Core_Residency", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_0 / tma_info_core_clks", + "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_0", + "MetricThreshold": "tma_port_0 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per core", - "MetricExpr": "cstate_core@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Core_Residency", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_1 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_1", + "MetricThreshold": "tma_port_1 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_512b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "C7 residency percent per core", - "MetricExpr": "cstate_core@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Core_Residency", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [ICL+] Loads)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_2 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_group", + "MetricName": "tma_port_2", + "MetricThreshold": "tma_port_2 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [ICL+] Loads). Sample with: UOPS_DISPATCHED_PORT.PORT_2", "ScaleUnit": "100%" }, { - "BriefDescription": "C2 residency percent per package", - "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C2_Pkg_Residency", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [ICL+] Loads)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_3 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_group", + "MetricName": "tma_port_3", + "MetricThreshold": "tma_port_3 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [ICL+] Loads). Sample with: UOPS_DISPATCHED_PORT.PORT_3", "ScaleUnit": "100%" }, { - "BriefDescription": "C3 residency percent per package", - "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Pkg_Residency", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data)", + "MetricExpr": "tma_store_op_utilization", + "MetricGroup": "TopdownL6;tma_L6_group;tma_issueSpSt;tma_store_op_utilization_group", + "MetricName": "tma_port_4", + "MetricThreshold": "tma_port_4 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data). Sample with: UOPS_DISPATCHED_PORT.PORT_4. Related metrics: tma_split_stores", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per package", - "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Pkg_Residency", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_5 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_5", + "MetricThreshold": "tma_port_5 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "C7 residency percent per package", - "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Pkg_Residency", + "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - (UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if tma_info_ipc > 1.8 else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB - RESOURCE_STALLS.SB - min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING)) / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_ports_utilization", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\,cmask\\=1@ / 2 if #SMT_on else (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0)) / tma_info_core_clks)", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_0", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) / tma_info_core_clks)", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_1", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Related metrics: tma_l1_bound", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_clks)", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise).", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_3m", + "MetricThreshold": "tma_ports_utilized_3m > 0.7 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1", + "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.RETIRE_SLOTS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "13 * LD_BLOCKS.NO_SR / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_split_loads", + "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS_PS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents rate of split store accesses", + "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / tma_info_core_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group", + "MetricName": "tma_split_stores", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_UOPS_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_4", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", + "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on else OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group", + "MetricName": "tma_sq_full", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_mem_bandwidth", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", + "MetricExpr": "RESOURCE_STALLS.SB / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_UOPS_RETIRED.ALL_STORES_PS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", + "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_store_fwd_blk", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_clks", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group", + "MetricName": "tma_store_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_4 / tma_info_core_clks", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_store_op_utilization", + "MetricThreshold": "tma_store_op_utilization > 0.6", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric serves as an approximation of legacy x87 usage", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS * FP_COMP_OPS_EXE.X87 / UOPS_EXECUTED.THREAD", + "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", + "MetricName": "tma_x87_use", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", "ScaleUnit": "100%" } ] From patchwork Sun Feb 19 09:28:19 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59115 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp773649wrn; Sun, 19 Feb 2023 01:37:52 -0800 (PST) X-Google-Smtp-Source: AK7set+FK3S1mnCxwSpGYrK+GxXXG0Zdwa21n7IxLEcaZY565kl7iJo2sWj6cYgeQtXt9Xd8tgD5 X-Received: by 2002:a17:90b:3b4a:b0:233:d5a8:16 with SMTP id ot10-20020a17090b3b4a00b00233d5a80016mr2198780pjb.0.1676799472439; Sun, 19 Feb 2023 01:37:52 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799472; cv=none; d=google.com; s=arc-20160816; b=D79WSQWCN5HoYolLTN+o9VFOZOfJSUZFbgFKQMHt9hLMPVTij1xoX6fEzsW66NLgSI u36xyf68V+tiE38kE4RSTjEnoWoWcfl8W81QIsub+ol4/+kGuBBjzkR1CvIq6mtlguQY S3BMupeovk4Adc+5yOmAeIBlt3dMR2KahW/g5kqoS4wqx4H26+N/BUMdAODtE7xQj2i0 Qoy0zXMmHMPLnKFEiWD9CQYAH4v4gOlQjk2tLSLYMHzkgcy3APpkW/JaA2mDMk0q+hO0 GZ5LrWsE9qG8yp2DklvTqMa7LjJrKv9BNT2QE0LTQXV/YN7rl0ruvyAVv6p4Ai2n9N5K Zovw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:from:subject :references:mime-version:message-id:in-reply-to:date:dkim-signature; bh=8LHsGfnmt6qYrack3gCp6lFG25TULdtBc4gz63x2LdI=; b=j/Nto3Pg1jgkOMFO/FTdzv/PNkszj3pbE75X7/4pUtQZtnAcRlZg64OTRxqebcqwER 3L6qFEV0LK0FEdNo2ZvmZTzyuRZehx3QyoCamJqZ295cf1u/tlATregx/xWBeTAg/Lue 7sKxHbEVbEkuu6NMfPhLfsWFHBs1MSB5Jb2EfQoLq5mtqfXABR3ijUfi4UrfLGjHeDQA TBKubCIcagHPe+xHSQUFM2K1WJL+uyn+5wPUaiOGXRiS0VyZJsYUzfv9SAxM4/mTQPcH NLI0P96ediktNBhyayYM4R/X6FjV79j2H27bzQBaQs3AsO0Zo8tgneA2Jkcu3Jiq00B5 AVZw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=EJvBW4p+; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id a3-20020a17090aa50300b00233b32fdf1csi13865775pjq.77.2023.02.19.01.37.40; Sun, 19 Feb 2023 01:37:52 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=EJvBW4p+; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229937AbjBSJdg (ORCPT + 99 others); Sun, 19 Feb 2023 04:33:36 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:54948 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229884AbjBSJdd (ORCPT ); Sun, 19 Feb 2023 04:33:33 -0500 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BB67E12071 for ; Sun, 19 Feb 2023 01:32:34 -0800 (PST) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-5368988d828so16004577b3.19 for ; Sun, 19 Feb 2023 01:32:34 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:from:to:cc:subject:date :message-id:reply-to; bh=8LHsGfnmt6qYrack3gCp6lFG25TULdtBc4gz63x2LdI=; b=EJvBW4p+SFjnBzgue4/x6znEJDTvb83/64IopRB7FinD79V+Fb4cOHIjvjOT/5Qm6w WhfoxFK6YqpE74IuhwH334eo4/GdyfEdR8H6QRE+t6woO0+vejNSKWdNXRe/zbGD3aUN o5Nm4Ls2qUsw3U/MfnJ++e/l01cbXWvS54dcGdIWRvsrGUwdcN4r12GLhkiZ1djAQJkc PGkxww4ph/TCHOSei6uHBCTlgX26IY4qGoO3ARP442x6TLmPmXSdflJ/hwojUf0ZY0Zx jGOiIGes+vtOM2tNIc6HZvdlyV0o7ptHIddV51Go3XVq/FjiQiZZF2ogaBLmCR7M4N2+ IVfA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=8LHsGfnmt6qYrack3gCp6lFG25TULdtBc4gz63x2LdI=; b=5Vq0KAG3Vk2O+/1MAAOGiyVkhaT6802XMJJZtPWfo5aDXjkPQ/Axno5S/Zw6Mxx8BG JXoIle3rvCo1+58Ntil0llNEjiooH17z375YKxxnTI2G+437BlrSyCWGPKIQDMHsK2iE /5qkzGMHVF+mCplBSmCbt+kBwISxcQ+ZHiC218IjYOww7qsjpvBVEylarQ56pIIGnaYO 4Ys2P4Odx6wW3ZV4+FWYjEnxgtt0iRhbeo0798++SYpRd48wmVFxQAm2D82K+Zv9b4IB a74EwklAU8vMyNFS7kaiKRmn35NfY6SMc2OOYxJwWi2lZM5Pc3Vy3nvtEOTgjZUDVpeG f1pA== X-Gm-Message-State: AO0yUKU+WJKbHbJQbNwWONBz1OfMJL4zol0KDQ2B/CIft7HAqncBxcJM eF6G0BW7ENbvhWi0kODS5OpSHV71TXY8 X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a25:e909:0:b0:803:19fa:2c20 with SMTP id n9-20020a25e909000000b0080319fa2c20mr120391ybd.207.1676799122365; Sun, 19 Feb 2023 01:32:02 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:19 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-23-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 22/51] perf vendor events intel: Refresh ivytown metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758251683470359082?= X-GMAIL-MSGID: =?utf-8?q?1758251683470359082?= Update the ivytown metrics to TMA version 4.5. Generation was done using https://github.com/intel/perfmon. Notable changes are TMA info metrics are renamed from their node name to be lower case and prefixed by tma_info_, MetricThreshold expressions are added, and the smi_cost metric group is added replicating existing hard coded metrics in stat-shadow. Signed-off-by: Ian Rogers --- .../arch/x86/ivytown/ivt-metrics.json | 1311 +++++++++-------- 1 file changed, 736 insertions(+), 575 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/ivytown/ivt-metrics.json b/tools/perf/pmu-events/arch/x86/ivytown/ivt-metrics.json index 80444bc4e66e..89469b10fa30 100644 --- a/tools/perf/pmu-events/arch/x86/ivytown/ivt-metrics.json +++ b/tools/perf/pmu-events/arch/x86/ivytown/ivt-metrics.json @@ -1,866 +1,1027 @@ [ { - "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", - "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / SLOTS", - "MetricGroup": "PGO;TopdownL1;tma_L1_group", - "MetricName": "tma_frontend_bound", - "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Machine_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound.", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", - "MetricExpr": "4 * min(CPU_CLK_UNHALTED.THREAD, IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE) / SLOTS", - "MetricGroup": "Frontend;TopdownL2;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_latency", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: RS_EVENTS.EMPTY_END", + "BriefDescription": "C2 residency percent per package", + "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C2_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses.", - "MetricExpr": "ICACHE.IFETCH_STALL / CLKS - tma_itlb_misses", - "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_icache_misses", + "BriefDescription": "C3 residency percent per core", + "MetricExpr": "cstate_core@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses", - "MetricExpr": "(12 * ITLB_MISSES.STLB_HIT + ITLB_MISSES.WALK_DURATION) / CLKS", - "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_itlb_misses", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: ITLB_MISSES.WALK_COMPLETED", + "BriefDescription": "C3 residency percent per package", + "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", - "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY) / CLKS", - "MetricGroup": "FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_branch_resteers", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "BriefDescription": "C6 residency percent per core", + "MetricExpr": "cstate_core@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / CLKS", - "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_dsb_switches", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty.", + "BriefDescription": "C6 residency percent per package", + "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", - "MetricExpr": "ILD_STALL.LCP / CLKS", - "MetricGroup": "FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_lcp", - "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs.", + "BriefDescription": "C7 residency percent per core", + "MetricExpr": "cstate_core@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", - "MetricExpr": "3 * IDQ.MS_SWITCHES / CLKS", - "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_ms_switches", - "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES", + "BriefDescription": "C7 residency percent per package", + "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", - "MetricExpr": "tma_frontend_bound - tma_fetch_latency", - "MetricGroup": "FetchBW;Frontend;TopdownL2;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_bandwidth", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend.", - "ScaleUnit": "100%" + "BriefDescription": "Uncore frequency per die [GHZ]", + "MetricExpr": "tma_info_socket_clks / #num_dies / duration_time / 1e9", + "MetricGroup": "SoC", + "MetricName": "UNCORE_FREQ" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", - "MetricExpr": "(IDQ.ALL_MITE_CYCLES_ANY_UOPS - IDQ.ALL_MITE_CYCLES_4_UOPS) / CORE_CLKS / 2", - "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_fetch_bandwidth_group", - "MetricName": "tma_mite", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", + "BriefDescription": "Percentage of cycles spent in System Management Interrupts.", + "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)", + "MetricGroup": "smi", + "MetricName": "smi_cycles", + "MetricThreshold": "smi_cycles > 0.1", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", - "MetricExpr": "(IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4_UOPS) / CORE_CLKS / 2", - "MetricGroup": "DSB;FetchBW;TopdownL3;tma_fetch_bandwidth_group", - "MetricName": "tma_dsb", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", - "ScaleUnit": "100%" + "BriefDescription": "Number of SMI interrupts.", + "MetricExpr": "msr@smi@", + "MetricGroup": "smi", + "MetricName": "smi_num", + "ScaleUnit": "1SMI#" }, { - "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", - "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group", - "MetricName": "tma_bad_speculation", - "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", + "BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset", + "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_4k_aliasing", + "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", - "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL2;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_branch_mispredicts", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5) / (3 * tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_alu_op_utilization", + "MetricThreshold": "tma_alu_op_utilization > 0.6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", - "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts", - "MetricGroup": "BadSpec;MachineClears;TopdownL2;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_machine_clears", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT", + "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", + "MetricExpr": "100 * OTHER_ASSISTS.ANY_WB_ASSIST / tma_info_slots", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_assists", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.ANY", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "1 - (tma_frontend_bound + tma_bad_speculation + tma_retiring)", - "MetricGroup": "TopdownL1;tma_L1_group", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", + "MetricThreshold": "tma_backend_bound > 0.2", "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", - "MetricExpr": "(min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING) + RESOURCE_STALLS.SB) / (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if IPC > 1.8 else (UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC - RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else RESOURCE_STALLS.SB)) * tma_backend_bound", - "MetricGroup": "Backend;TopdownL2;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_memory_bound", - "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", + "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", + "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_bad_speculation", + "MetricThreshold": "tma_bad_speculation > 0.15", + "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", - "MetricExpr": "max((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING) - CYCLE_ACTIVITY.STALLS_L1D_PENDING) / CLKS, 0)", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l1_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT_PS;MEM_LOAD_UOPS_RETIRED.HIT_LFB_PS", + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation", + "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM", + "MetricName": "tma_branch_mispredicts", + "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_info_branch_misprediction_cost, tma_mispredicts_resteers", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "(7 * DTLB_LOAD_MISSES.STLB_HIT + DTLB_LOAD_MISSES.WALK_DURATION) / CLKS", - "MetricGroup": "MemoryTLB;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_dtlb_load", - "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", + "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY) / tma_info_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_branch_resteers", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", - "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_store_fwd_blk", - "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", + "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_cisc", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", - "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO) / CLKS", - "MetricGroup": "Offcore;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_lock_latency", - "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOADS_PS", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(60 * (MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.LLC_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_FWD))) + 43 * (MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.LLC_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_FWD)))) / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_contested_accesses", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary", - "MetricExpr": "13 * LD_BLOCKS.NO_SR / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_split_loads", - "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS_PS", + "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_backend_bound - tma_memory_bound", + "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset", - "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_4k_aliasing", - "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "43 * (MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.LLC_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_FWD))) / tma_info_clks", + "MetricGroup": "Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_data_sharing", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", - "MetricExpr": "Load_Miss_Real_Latency * cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=1@ / CLKS", - "MetricGroup": "MemoryBW;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_fb_full", - "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory).", + "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", + "MetricExpr": "ARITH.FPU_DIV_ACTIVE / tma_info_core_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_divider", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_UOPS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads", - "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_PENDING - CYCLE_ACTIVITY.STALLS_L2_PENDING) / CLKS", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l2_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L2_HIT_PS", + "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", + "MetricConstraint": "NO_GROUP_EVENTS_SMT", + "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.LLC_HIT / (MEM_LOAD_UOPS_RETIRED.LLC_HIT + 7 * MEM_LOAD_UOPS_RETIRED.LLC_MISS)) * CYCLE_ACTIVITY.STALLS_L2_PENDING / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_MISS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", - "MetricExpr": "MEM_LOAD_UOPS_RETIRED.LLC_HIT / (MEM_LOAD_UOPS_RETIRED.LLC_HIT + 7 * MEM_LOAD_UOPS_RETIRED.LLC_MISS) * CYCLE_ACTIVITY.STALLS_L2_PENDING / CLKS", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l3_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", + "MetricExpr": "(IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4_UOPS) / tma_info_core_clks / 2", + "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_dsb", + "MetricThreshold": "tma_dsb > 0.15 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", - "MetricExpr": "(60 * (MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.LLC_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_FWD))) + 43 * (MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.LLC_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_FWD)))) / CLKS", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_contested_accesses", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_clks", + "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_dsb_switches", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Related metrics: tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_iptb, tma_lcp", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", - "MetricExpr": "43 * (MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.LLC_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_FWD))) / CLKS", - "MetricGroup": "Offcore;Snoop;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_data_sharing", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT_PS", + "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", + "MetricExpr": "(7 * DTLB_LOAD_MISSES.STLB_HIT + DTLB_LOAD_MISSES.WALK_DURATION) / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group", + "MetricName": "tma_dtlb_load", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", - "MetricExpr": "41 * (MEM_LOAD_UOPS_RETIRED.LLC_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.LLC_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_FWD))) / CLKS", - "MetricGroup": "MemoryLat;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_l3_hit_latency", - "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", + "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", + "MetricExpr": "(7 * DTLB_STORE_MISSES.STLB_HIT + DTLB_STORE_MISSES.WALK_DURATION) / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group", + "MetricName": "tma_dtlb_store", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_UOPS_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", - "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on else OFFCORE_REQUESTS_BUFFER.SQ_FULL) / CORE_CLKS", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_sq_full", - "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). The Super Queue is used for requests to access the L2 cache or to go out to the Uncore.", + "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing", + "MetricExpr": "(200 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_MISS.REMOTE_HITM + 60 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_HIT.HITM_OTHER_CORE) / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group", + "MetricName": "tma_false_sharing", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", - "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.LLC_HIT / (MEM_LOAD_UOPS_RETIRED.LLC_HIT + 7 * MEM_LOAD_UOPS_RETIRED.LLC_MISS)) * CYCLE_ACTIVITY.STALLS_L2_PENDING / CLKS", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_dram_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_MISS_PS", + "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_info_load_miss_real_latency * cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=1@ / tma_info_clks", + "MetricGroup": "MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricName": "tma_fb_full", + "MetricThreshold": "tma_fb_full > 0.3", + "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=6@) / CLKS", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_dram_bound_group", - "MetricName": "tma_mem_bandwidth", - "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that).", + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", + "MetricExpr": "tma_frontend_bound - tma_fetch_latency", + "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Related metrics: tma_dsb_switches, tma_info_dsb_coverage, tma_info_iptb, tma_lcp", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / CLKS - tma_mem_bandwidth", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_dram_bound_group", - "MetricName": "tma_mem_latency", - "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that).", + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", + "MetricExpr": "4 * min(CPU_CLK_UNHALTED.THREAD, IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE) / tma_info_slots", + "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: RS_EVENTS.EMPTY_END", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory", - "MetricExpr": "200 * (MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.LLC_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_FWD))) / CLKS", - "MetricGroup": "Server;TopdownL5;tma_mem_latency_group", - "MetricName": "tma_local_dram", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory. Caching will improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM_PS", + "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", + "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_fp_arith", + "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory", - "MetricExpr": "310 * (MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.LLC_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_FWD))) / CLKS", - "MetricGroup": "Server;Snoop;TopdownL5;tma_mem_latency_group", - "MetricName": "tma_remote_dram", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory. This is caused often due to non-optimal NUMA allocations. #link to NUMA article Sample with: MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM_PS", + "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired", + "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE) / UOPS_EXECUTED.THREAD", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", + "MetricName": "tma_fp_scalar", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues", - "MetricExpr": "(200 * (MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.LLC_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_FWD))) + 180 * (MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.LLC_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_FWD)))) / CLKS", - "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_mem_latency_group", - "MetricName": "tma_remote_cache", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues. This is caused often due to non-optimal NUMA allocations. #link to NUMA article Sample with: MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM_PS;MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD_PS", + "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths", + "MetricExpr": "(FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + FP_COMP_OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) / UOPS_EXECUTED.THREAD", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", + "MetricName": "tma_fp_vector", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", - "MetricExpr": "RESOURCE_STALLS.SB / CLKS", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_store_bound", - "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_UOPS_RETIRED.ALL_STORES_PS", + "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", + "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / tma_info_slots", + "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", - "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / CLKS", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_store_bound_group", - "MetricName": "tma_store_latency", - "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full)", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences", + "MetricExpr": "tma_microcode_sequencer", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_heavy_operations", + "MetricThreshold": "tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing", - "MetricExpr": "(200 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_MISS.REMOTE_HITM + 60 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_HIT.HITM_OTHER_CORE) / CLKS", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_store_bound_group", - "MetricName": "tma_false_sharing", - "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses.", + "MetricExpr": "ICACHE.IFETCH_STALL / tma_info_clks - tma_itlb_misses", + "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_icache_misses", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents rate of split store accesses", - "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / CORE_CLKS", - "MetricGroup": "TopdownL4;tma_store_bound_group", - "MetricName": "tma_split_stores", - "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_UOPS_RETIRED.SPLIT_STORES_PS", - "ScaleUnit": "100%" + "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", + "MetricExpr": "tma_info_turbo_utilization * TSC / 1e9 / duration_time", + "MetricGroup": "Power;Summary", + "MetricName": "tma_info_average_frequency" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", - "MetricExpr": "(7 * DTLB_STORE_MISSES.STLB_HIT + DTLB_STORE_MISSES.WALK_DURATION) / CLKS", - "MetricGroup": "MemoryTLB;TopdownL4;tma_store_bound_group", - "MetricName": "tma_dtlb_store", - "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_UOPS_RETIRED.STLB_MISS_STORES_PS", - "ScaleUnit": "100%" + "BriefDescription": "Branch instructions per taken branch.", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_bptkbranch" }, { - "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", - "MetricExpr": "tma_backend_bound - tma_memory_bound", - "MetricGroup": "Backend;Compute;TopdownL2;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_core_bound", - "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", - "ScaleUnit": "100%" + "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "Pipeline", + "MetricName": "tma_info_clks" }, { - "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", - "MetricExpr": "ARITH.FPU_DIV_ACTIVE / CORE_CLKS", - "MetricGroup": "TopdownL3;tma_core_bound_group", - "MetricName": "tma_divider", - "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_UOPS", - "ScaleUnit": "100%" + "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_clks))", + "MetricGroup": "SMT", + "MetricName": "tma_info_core_clks" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", - "MetricExpr": "((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if IPC > 1.8 else (UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC - RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else RESOURCE_STALLS.SB)) - RESOURCE_STALLS.SB - min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING)) / CLKS", - "MetricGroup": "PortsUtil;TopdownL3;tma_core_bound_group", - "MetricName": "tma_ports_utilization", - "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_core_clks", + "MetricGroup": "Ret;SMT;TmaL1;tma_L1_group", + "MetricName": "tma_info_coreipc" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\,cmask\\=1@ / 2 if #SMT_on else (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) - RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) / CORE_CLKS)", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_0", - "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", - "ScaleUnit": "100%" + "BriefDescription": "Cycles Per Instruction (per Logical Processor)", + "MetricExpr": "1 / tma_info_ipc", + "MetricGroup": "Mem;Pipeline", + "MetricName": "tma_info_cpi" }, { - "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) / CORE_CLKS)", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_1", - "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful.", - "ScaleUnit": "100%" + "BriefDescription": "Average CPU Utilization", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": "HPC;Summary", + "MetricName": "tma_info_cpu_utilization" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / CORE_CLKS)", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_2", - "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop.", - "ScaleUnit": "100%" + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD / OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_data_l2_mlp" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise).", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / CORE_CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_3m", - "ScaleUnit": "100%" + "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", + "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e9 / duration_time", + "MetricGroup": "HPC;Mem;MemoryBW;SoC;tma_issueBW", + "MetricName": "tma_info_dram_bw_use", + "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_fb_full, tma_mem_bandwidth, tma_sq_full" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", - "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5) / (3 * CORE_CLKS)", - "MetricGroup": "TopdownL5;tma_ports_utilized_3m_group", - "MetricName": "tma_alu_op_utilization", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_dsb_coverage", + "MetricThreshold": "tma_info_dsb_coverage < 0.7 & tma_info_ipc / 4 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_iptb, tma_lcp" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch) Sample with: UOPS_DISPATCHED_PORT.PORT_0", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_0 / CORE_CLKS", - "MetricGroup": "Compute;TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_0", - "ScaleUnit": "100%" + "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-thread", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@", + "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", + "MetricName": "tma_info_execute" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU) Sample with: UOPS_DISPATCHED_PORT.PORT_1", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_1 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_1", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU) Sample with: UOPS_DISPATCHED.PORT_5", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_5 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_5", - "ScaleUnit": "100%" + "BriefDescription": "The ratio of Executed- by Issued-Uops", + "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", + "MetricGroup": "Cor;Pipeline", + "MetricName": "tma_info_execute_per_issue", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage." }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations Sample with: UOPS_DISPATCHED.PORT_2_3", - "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_2 + UOPS_DISPATCHED_PORT.PORT_3 - UOPS_DISPATCHED_PORT.PORT_4) / (2 * CORE_CLKS)", - "MetricGroup": "TopdownL5;tma_ports_utilized_3m_group", - "MetricName": "tma_load_op_utilization", - "ScaleUnit": "100%" + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE + 2 * FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + 4 * (FP_COMP_OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) + 8 * SIMD_FP_256.PACKED_SINGLE) / tma_info_core_clks", + "MetricGroup": "Flops;Ret", + "MetricName": "tma_info_flopc" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [ICL+] Loads) Sample with: UOPS_DISPATCHED_PORT.PORT_2", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_2 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_load_op_utilization_group", - "MetricName": "tma_port_2", - "ScaleUnit": "100%" + "BriefDescription": "Giga Floating Point Operations Per Second", + "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE + 2 * FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + 4 * (FP_COMP_OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) + 8 * SIMD_FP_256.PACKED_SINGLE) / 1e9 / duration_time", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_gflops", + "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine." }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [ICL+] Loads) Sample with: UOPS_DISPATCHED_PORT.PORT_3", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_3 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_load_op_utilization_group", - "MetricName": "tma_port_3", - "ScaleUnit": "100%" + "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", + "MetricExpr": "UOPS_EXECUTED.THREAD / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)", + "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", + "MetricName": "tma_info_ilp" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations", - "MetricExpr": "tma_port_4", - "MetricGroup": "TopdownL5;tma_ports_utilized_3m_group", - "MetricName": "tma_store_op_utilization", - "ScaleUnit": "100%" + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Summary;TmaL1;tma_L1_group", + "MetricName": "tma_info_instructions", + "PublicDescription": "Total number of retired Instructions. Sample with: INST_RETIRED.PREC_DIST" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data) Sample with: UOPS_DISPATCHED_PORT.PORT_4", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_4 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_store_op_utilization_group", - "MetricName": "tma_port_4", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)", + "MetricExpr": "1 / (tma_fp_scalar + tma_fp_vector)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_iparith", + "MetricThreshold": "tma_info_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). May undercount due to FMA double counting. Approximated prior to BDW." }, { - "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group", - "MetricName": "tma_retiring", - "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.RETIRE_SLOTS", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Branches;Fed;InsType", + "MetricName": "tma_info_ipbranch", + "MetricThreshold": "tma_info_ipbranch < 8" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", - "MetricExpr": "tma_retiring - tma_heavy_operations", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_retiring_group", - "MetricName": "tma_light_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", - "ScaleUnit": "100%" + "BriefDescription": "Instructions Per Cycle (per Logical Processor)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_clks", + "MetricGroup": "Ret;Summary", + "MetricName": "tma_info_ipc" }, { - "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", - "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", - "MetricGroup": "HPC;TopdownL3;tma_light_operations_group", - "MetricName": "tma_fp_arith", - "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_ipcall", + "MetricThreshold": "tma_info_ipcall < 200" }, { - "BriefDescription": "This metric serves as an approximation of legacy x87 usage", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS * FP_COMP_OPS_EXE.X87 / UOPS_EXECUTED.THREAD", - "MetricGroup": "Compute;TopdownL4;tma_fp_arith_group", - "MetricName": "tma_x87_use", - "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", + "MetricGroup": "Branches;OS", + "MetricName": "tma_info_ipfarbranch", + "MetricThreshold": "tma_info_ipfarbranch < 1e6" }, { - "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired", - "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE) / UOPS_EXECUTED.THREAD", - "MetricGroup": "Compute;Flops;TopdownL4;tma_fp_arith_group", - "MetricName": "tma_fp_scalar", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_LOADS", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipload", + "MetricThreshold": "tma_info_ipload < 3" }, { - "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths", - "MetricExpr": "(FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + FP_COMP_OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) / UOPS_EXECUTED.THREAD", - "MetricGroup": "Compute;Flops;TopdownL4;tma_fp_arith_group", - "MetricName": "tma_fp_vector", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).", + "MetricExpr": "tma_info_instructions / (UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * cpu@BR_MISP_EXEC.ALL_BRANCHES\\,umask\\=0xE4@)", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_indirect", + "MetricThreshold": "tma_info_ipmisp_indirect < 1e3" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences", - "MetricExpr": "tma_microcode_sequencer", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_retiring_group", - "MetricName": "tma_heavy_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences. This highly-correlates with the uop length of these instructions/sequences.", - "ScaleUnit": "100%" + "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BadSpec;BrMispredicts", + "MetricName": "tma_info_ipmispredict", + "MetricThreshold": "tma_info_ipmispredict < 200" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / SLOTS", - "MetricGroup": "MicroSeq;TopdownL3;tma_heavy_operations_group", - "MetricName": "tma_microcode_sequencer", - "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_STORES", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipstore", + "MetricThreshold": "tma_info_ipstore < 8" }, { - "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", - "MetricExpr": "100 * OTHER_ASSISTS.ANY_WB_ASSIST / SLOTS", - "MetricGroup": "TopdownL4;tma_microcode_sequencer_group", - "MetricName": "tma_assists", - "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.ANY", - "ScaleUnit": "100%" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", + "MetricName": "tma_info_iptb", + "MetricThreshold": "tma_info_iptb < 9", + "PublicDescription": "Instruction per taken branch. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_lcp" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", - "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", - "MetricGroup": "TopdownL4;tma_microcode_sequencer_group", - "MetricName": "tma_cisc", - "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per speculative Unknown Branch Misprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_instructions / BACLEARS.ANY", + "MetricGroup": "Fed", + "MetricName": "tma_info_ipunknown_branch" }, { - "BriefDescription": "Instructions Per Cycle (per Logical Processor)", - "MetricExpr": "INST_RETIRED.ANY / CLKS", - "MetricGroup": "Ret;Summary", - "MetricName": "IPC" + "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_cpi" }, { - "BriefDescription": "Uops Per Instruction", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY", - "MetricGroup": "Pipeline;Ret;Retire", - "MetricName": "UPI" + "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_utilization", + "MetricThreshold": "tma_info_kernel_utilization > 0.05" }, { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW", - "MetricName": "UpTB" + "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw" }, { - "BriefDescription": "Cycles Per Instruction (per Logical Processor)", - "MetricExpr": "1 / IPC", - "MetricGroup": "Mem;Pipeline", - "MetricName": "CPI" + "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "tma_info_l1d_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw_1t" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "Pipeline", - "MetricName": "CLKS" + "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l1mpki" }, { - "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "4 * CORE_CLKS", - "MetricGroup": "tma_L1_group", - "MetricName": "SLOTS" + "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw" }, { - "BriefDescription": "The ratio of Executed- by Issued-Uops", - "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", - "MetricGroup": "Cor;Pipeline", - "MetricName": "Execute_per_Issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage." + "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "tma_info_l2_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw_1t" }, { - "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", - "MetricExpr": "INST_RETIRED.ANY / CORE_CLKS", - "MetricGroup": "Ret;SMT;tma_L1_group", - "MetricName": "CoreIPC" + "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheMisses;Mem", + "MetricName": "tma_info_l2mpki" }, { - "BriefDescription": "Floating Point Operations Per Cycle", - "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE + 2 * FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + 4 * (FP_COMP_OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) + 8 * SIMD_FP_256.PACKED_SINGLE) / CORE_CLKS", - "MetricGroup": "Flops;Ret", - "MetricName": "FLOPc" + "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "0", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_l3_cache_access_bw_1t" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)", - "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", - "MetricName": "ILP" + "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw" }, { - "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else CLKS))", - "MetricGroup": "SMT", - "MetricName": "CORE_CLKS" + "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "tma_info_l3_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw_1t" }, { - "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_LOADS", - "MetricGroup": "InsType", - "MetricName": "IpLoad" + "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.LLC_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l3mpki" }, { - "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_STORES", - "MetricGroup": "InsType", - "MetricName": "IpStore" + "BriefDescription": "Average Latency for L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_load_l2_miss_latency" }, { - "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Branches;Fed;InsType", - "MetricName": "IpBranch" + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_load_l2_mlp" }, { - "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "IpCall" + "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_UOPS_RETIRED.L1_MISS + MEM_LOAD_UOPS_RETIRED.HIT_LFB)", + "MetricGroup": "Mem;MemoryBound;MemoryLat", + "MetricName": "tma_info_load_miss_real_latency" }, { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO", - "MetricName": "IpTB" + "BriefDescription": "Average number of parallel data read requests to external memory", + "MetricExpr": "UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=0x182@ / UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=0x182\\,thresh\\=1@", + "MetricGroup": "Mem;MemoryBW;SoC", + "MetricName": "tma_info_mem_parallel_reads", + "PublicDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches" }, { - "BriefDescription": "Branch instructions per taken branch. ", - "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "BpTkBranch" + "BriefDescription": "Average latency of data read request to external memory (in nanoseconds)", + "MetricExpr": "1e9 * (UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=0x182@ / UNC_C_TOR_INSERTS.MISS_OPCODE@filter_opc\\=0x182@) / (tma_info_socket_clks / duration_time)", + "MetricGroup": "Mem;MemoryLat;SoC", + "MetricName": "tma_info_mem_read_latency", + "PublicDescription": "Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches. ([RKL+]memory-controller only)" }, { - "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)", - "MetricExpr": "1 / (tma_fp_scalar + tma_fp_vector)", - "MetricGroup": "Flops;InsType", - "MetricName": "IpArith", - "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). May undercount due to FMA double counting. Approximated prior to BDW." + "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES", + "MetricGroup": "Mem;MemoryBW;MemoryBound", + "MetricName": "tma_info_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)" }, { - "BriefDescription": "Total number of retired Instructions Sample with: INST_RETIRED.PREC_DIST", - "MetricExpr": "INST_RETIRED.ANY", - "MetricGroup": "Summary;tma_L1_group", - "MetricName": "Instructions" + "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", + "MetricExpr": "(ITLB_MISSES.WALK_DURATION + DTLB_LOAD_MISSES.WALK_DURATION + DTLB_STORE_MISSES.WALK_DURATION) / tma_info_core_clks", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_page_walks_utilization", + "MetricThreshold": "tma_info_page_walks_utilization > 0.5" }, { "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=1@", "MetricGroup": "Pipeline;Ret", - "MetricName": "Retire" + "MetricName": "tma_info_retire" }, { - "BriefDescription": "", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@", - "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", - "MetricName": "Execute" + "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", + "MetricExpr": "4 * tma_info_core_clks", + "MetricGroup": "TmaL1;tma_L1_group", + "MetricName": "tma_info_slots" }, { - "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", - "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", - "MetricGroup": "DSB;Fed;FetchBW", - "MetricName": "DSB_Coverage" + "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", + "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", + "MetricGroup": "SMT", + "MetricName": "tma_info_smt_2t_utilization" }, { - "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BadSpec;BrMispredicts", - "MetricName": "IpMispredict" + "BriefDescription": "Socket actual clocks when any core is active on that socket", + "MetricExpr": "cbox_0@event\\=0x0@", + "MetricGroup": "SoC", + "MetricName": "tma_info_socket_clks" }, { - "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", - "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_UOPS_RETIRED.L1_MISS + MEM_LOAD_UOPS_RETIRED.HIT_LFB)", - "MetricGroup": "Mem;MemoryBound;MemoryLat", - "MetricName": "Load_Miss_Real_Latency" + "BriefDescription": "Average Frequency Utilization relative nominal frequency", + "MetricExpr": "tma_info_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_turbo_utilization" }, { - "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)", - "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES", - "MetricGroup": "Mem;MemoryBW;MemoryBound", - "MetricName": "MLP" + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY", + "MetricGroup": "Pipeline;Ret;Retire", + "MetricName": "tma_info_uoppi", + "MetricThreshold": "tma_info_uoppi > 1.05" }, { - "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L1_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L1MPKI" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW", + "MetricName": "tma_info_uptb", + "MetricThreshold": "tma_info_uptb < 6" }, { - "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.L2_MISS / INST_RETIRED.ANY", - "MetricGroup": "Backend;CacheMisses;Mem", - "MetricName": "L2MPKI" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses", + "MetricExpr": "(12 * ITLB_MISSES.STLB_HIT + ITLB_MISSES.WALK_DURATION) / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_itlb_misses", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: ITLB_MISSES.WALK_COMPLETED", + "ScaleUnit": "100%" }, { - "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_UOPS_RETIRED.LLC_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L3MPKI" + "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", + "MetricExpr": "max((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING) - CYCLE_ACTIVITY.STALLS_L1D_PENDING) / tma_info_clks, 0)", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group", + "MetricName": "tma_l1_bound", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT_PS;MEM_LOAD_UOPS_RETIRED.HIT_LFB_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "ScaleUnit": "100%" }, { - "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", - "MetricConstraint": "NO_NMI_WATCHDOG", - "MetricExpr": "(ITLB_MISSES.WALK_DURATION + DTLB_LOAD_MISSES.WALK_DURATION + DTLB_STORE_MISSES.WALK_DURATION) / CORE_CLKS", - "MetricGroup": "Mem;MemoryTLB", - "MetricName": "Page_Walks_Utilization" + "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads", + "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_PENDING - CYCLE_ACTIVITY.STALLS_L2_PENDING) / tma_info_clks", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l2_bound", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L2_HIT_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW" + "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", + "MetricConstraint": "NO_GROUP_EVENTS_SMT", + "MetricExpr": "MEM_LOAD_UOPS_RETIRED.LLC_HIT / (MEM_LOAD_UOPS_RETIRED.LLC_HIT + 7 * MEM_LOAD_UOPS_RETIRED.LLC_MISS) * CYCLE_ACTIVITY.STALLS_L2_PENDING / tma_info_clks", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l3_bound", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW" + "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "41 * (MEM_LOAD_UOPS_RETIRED.LLC_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.LLC_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_FWD))) / tma_info_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group", + "MetricName": "tma_l3_hit_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS. Related metrics: tma_mem_latency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW" + "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", + "MetricExpr": "ILD_STALL.LCP / tma_info_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_lcp", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_iptb", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "L1D_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW_1T" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", + "MetricExpr": "tma_retiring - tma_heavy_operations", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_light_operations", + "MetricThreshold": "tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "L2_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW_1T" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_2 + UOPS_DISPATCHED_PORT.PORT_3 - UOPS_DISPATCHED_PORT.PORT_4) / (2 * tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_load_op_utilization", + "MetricThreshold": "tma_load_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations. Sample with: UOPS_DISPATCHED.PORT_2_3", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "L3_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW_1T" + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "200 * (MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.LLC_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_FWD))) / tma_info_clks", + "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_group", + "MetricName": "tma_local_dram", + "MetricThreshold": "tma_local_dram > 0.1 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory. Caching will improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "0", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "L3_Cache_Access_BW_1T" + "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO) / tma_info_clks", + "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group", + "MetricName": "tma_lock_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOADS_PS. Related metrics: tma_store_latency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average CPU Utilization", - "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", - "MetricGroup": "HPC;Summary", - "MetricName": "CPU_Utilization" + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts", + "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", - "MetricExpr": "Turbo_Utilization * TSC / 1e9 / duration_time", - "MetricGroup": "Power;Summary", - "MetricName": "Average_Frequency" + "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=6@) / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW", + "MetricName": "tma_mem_bandwidth", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_sq_full", + "ScaleUnit": "100%" }, { - "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE + 2 * FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + 4 * (FP_COMP_OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) + 8 * SIMD_FP_256.PACKED_SINGLE) / 1e9 / duration_time", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "GFLOPs", - "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine." + "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_clks - tma_mem_bandwidth", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat", + "MetricName": "tma_mem_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_l3_hit_latency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average Frequency Utilization relative nominal frequency", - "MetricExpr": "CLKS / CPU_CLK_UNHALTED.REF_TSC", - "MetricGroup": "Power", - "MetricName": "Turbo_Utilization" + "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING) + RESOURCE_STALLS.SB) / (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - (UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if tma_info_ipc > 1.8 else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) * tma_backend_bound", + "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", - "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", - "MetricGroup": "SMT", - "MetricName": "SMT_2T_Utilization" + "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / tma_info_slots", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS", + "MetricName": "tma_microcode_sequencer", + "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "OS", - "MetricName": "Kernel_Utilization" + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", + "MetricExpr": "(IDQ.ALL_MITE_CYCLES_ANY_UOPS - IDQ.ALL_MITE_CYCLES_4_UOPS) / tma_info_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_mite", + "MetricThreshold": "tma_mite > 0.1 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", - "MetricGroup": "OS", - "MetricName": "Kernel_CPI" + "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", + "MetricExpr": "3 * IDQ.MS_SWITCHES / tma_info_clks", + "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", - "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e9 / duration_time", - "MetricGroup": "HPC;Mem;MemoryBW;SoC", - "MetricName": "DRAM_BW_Use" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_0 / tma_info_core_clks", + "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_0", + "MetricThreshold": "tma_port_0 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches", - "MetricExpr": "1e9 * (UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=0x182@ / UNC_C_TOR_INSERTS.MISS_OPCODE@filter_opc\\=0x182@) / (Socket_CLKS / duration_time)", - "MetricGroup": "Mem;MemoryLat;SoC", - "MetricName": "MEM_Read_Latency" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_1 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_1", + "MetricThreshold": "tma_port_1 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_512b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches", - "MetricExpr": "UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=0x182@ / UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=0x182\\,thresh\\=1@", - "MetricGroup": "Mem;MemoryBW;SoC", - "MetricName": "MEM_Parallel_Reads" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [ICL+] Loads)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_2 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_group", + "MetricName": "tma_port_2", + "MetricThreshold": "tma_port_2 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [ICL+] Loads). Sample with: UOPS_DISPATCHED_PORT.PORT_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Socket actual clocks when any core is active on that socket", - "MetricExpr": "cbox_0@event\\=0x0@", - "MetricGroup": "SoC", - "MetricName": "Socket_CLKS" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [ICL+] Loads)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_3 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_group", + "MetricName": "tma_port_3", + "MetricThreshold": "tma_port_3 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [ICL+] Loads). Sample with: UOPS_DISPATCHED_PORT.PORT_3", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", - "MetricGroup": "Branches;OS", - "MetricName": "IpFarBranch" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data)", + "MetricExpr": "tma_store_op_utilization", + "MetricGroup": "TopdownL6;tma_L6_group;tma_issueSpSt;tma_store_op_utilization_group", + "MetricName": "tma_port_4", + "MetricThreshold": "tma_port_4 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data). Sample with: UOPS_DISPATCHED_PORT.PORT_4. Related metrics: tma_split_stores", + "ScaleUnit": "100%" }, { - "BriefDescription": "Uncore frequency per die [GHZ]", - "MetricExpr": "Socket_CLKS / #num_dies / duration_time / 1e9", - "MetricGroup": "SoC", - "MetricName": "UNCORE_FREQ" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_5 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_5", + "MetricThreshold": "tma_port_5 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "C3 residency percent per core", - "MetricExpr": "cstate_core@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Core_Residency", + "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - (UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if tma_info_ipc > 1.8 else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB - RESOURCE_STALLS.SB - min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING)) / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_ports_utilization", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per core", - "MetricExpr": "cstate_core@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Core_Residency", + "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\,cmask\\=1@ / 2 if #SMT_on else (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0)) / tma_info_core_clks)", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_0", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", "ScaleUnit": "100%" }, { - "BriefDescription": "C7 residency percent per core", - "MetricExpr": "cstate_core@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Core_Residency", + "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) / tma_info_core_clks)", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_1", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, { - "BriefDescription": "C2 residency percent per package", - "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C2_Pkg_Residency", + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_clks)", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, { - "BriefDescription": "C3 residency percent per package", - "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Pkg_Residency", + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise).", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_3m", + "MetricThreshold": "tma_ports_utilized_3m > 0.7 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per package", - "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Pkg_Residency", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(200 * (MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.LLC_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_FWD))) + 180 * (MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.LLC_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_FWD)))) / tma_info_clks", + "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_issueSyncxn;tma_mem_latency_group", + "MetricName": "tma_remote_cache", + "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues. This is caused often due to non-optimal NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM_PS;MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD_PS. Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { - "BriefDescription": "C7 residency percent per package", - "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Pkg_Residency", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "310 * (MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.LLC_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_FWD))) / tma_info_clks", + "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latency_group", + "MetricName": "tma_remote_dram", + "MetricThreshold": "tma_remote_dram > 0.1 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory. This is caused often due to non-optimal NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM_PS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1", + "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.RETIRE_SLOTS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "13 * LD_BLOCKS.NO_SR / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_split_loads", + "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS_PS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents rate of split store accesses", + "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / tma_info_core_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group", + "MetricName": "tma_split_stores", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_UOPS_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_4", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", + "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on else OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group", + "MetricName": "tma_sq_full", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_mem_bandwidth", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", + "MetricExpr": "RESOURCE_STALLS.SB / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_UOPS_RETIRED.ALL_STORES_PS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", + "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_store_fwd_blk", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_clks", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group", + "MetricName": "tma_store_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_4 / tma_info_core_clks", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_store_op_utilization", + "MetricThreshold": "tma_store_op_utilization > 0.6", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric serves as an approximation of legacy x87 usage", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS * FP_COMP_OPS_EXE.X87 / UOPS_EXECUTED.THREAD", + "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", + "MetricName": "tma_x87_use", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", "ScaleUnit": "100%" } ] From patchwork Sun Feb 19 09:28:20 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59114 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp773594wrn; Sun, 19 Feb 2023 01:37:43 -0800 (PST) X-Google-Smtp-Source: AK7set8EPDObpvMLsOXNLbPtXMso/N1Eqo8J1Ukm4SCfv9GJMapMbfKx4nLA1zlxJKbgx2SwlpAB X-Received: by 2002:a62:15c3:0:b0:5a9:1616:b607 with SMTP id 186-20020a6215c3000000b005a91616b607mr1043775pfv.22.1676799463213; Sun, 19 Feb 2023 01:37:43 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799463; cv=none; d=google.com; s=arc-20160816; b=q3JepCG1+b4+hQo6S4qi/Vu88diCuLcMSfJh6Xi3Q9LNmhfiY4Uvod3yxCf3eQSOy/ Zwl/4BknYP3L3wY9yn6nxUWifXjGkCJNr7sknWZAR99dQkLc1WkQuPsCkyKjsX21+kxc Wr60fVQRHcU21MKqTFhZbBRg/tAJV5UcYENabdHh2MBoqWXBJNAvldJzhMxdGowaVhjc INMSoJb0TzoWKnw2pVxR1xqEZGe++ouybXz5j21TS084mdO+wsFfiI9FIyDu7lVIj0FT bwZKhXwzzNY3Ebx+EvA6cotOx7slaEUhr+ifUm3k3/SXJMs0MM0E2AEPnuP1YiksP+H1 a3AA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:from:subject :references:mime-version:message-id:in-reply-to:date:dkim-signature; bh=lZa444TGelSxfyFGIASgMZVRbr1mUgy5fMfkXk2Hphs=; b=B9cvp9tUS8Htd/59itilEioPx1zDK2T6+tSwd7ESXZ0wa40StPGMN5eloezV8xY5UD AxuWqwgSp0QdQ9da9RY5J0pbTOV7V2I8tMFsnSqbWDpl3SMBX2LEe9J3Ln+K25OnUeAF zSyDP7KIsq43EilSt5nFRSIeOiuNW0gExvAWdw391oxahzhFVdu9qoAAfzL7AALzdGIw OaTDBza/zbZ1ISG7br0YL++2RJpWPLWEH8F6WP12gDWULz5IMzazwlomVWyuFjXMXvPD 7wJVhMLamaty0WD9Zm+d261gjikH+hrdHlADs9vQs1xKztpCPTlZsyBvqMvN3I6ZChm9 TnqQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=s8MwD3Xc; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id y129-20020a62ce87000000b005a8f259220asi10333373pfg.66.2023.02.19.01.37.30; Sun, 19 Feb 2023 01:37:43 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=s8MwD3Xc; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229907AbjBSJdy (ORCPT + 99 others); Sun, 19 Feb 2023 04:33:54 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56288 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229967AbjBSJdu (ORCPT ); Sun, 19 Feb 2023 04:33:50 -0500 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 05BD21206E for ; Sun, 19 Feb 2023 01:32:40 -0800 (PST) Received: by mail-yb1-xb4a.google.com with SMTP id 144-20020a250596000000b008fa1d22bd55so1988436ybf.21 for ; Sun, 19 Feb 2023 01:32:39 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:from:to:cc:subject:date :message-id:reply-to; bh=lZa444TGelSxfyFGIASgMZVRbr1mUgy5fMfkXk2Hphs=; b=s8MwD3XcpksdpAQ44IhNk599hvZCE39mG8ERX3KznRk7WbK33EPKtTHl604SaZu22f W8udp7rA36r7w8E0tgNmbNccHgZTRlSSS+ibAhjSQDm2kR0xfrJTCUllsJ5FEoSbew8r 9V/QOqjRnmU5oExwB1uVzUYmEpLoDTPEWHbpNWSMTlZ2njVfC/nC/iTfMw86FYFBBbsA fECKHQYBu+7Bg9fWIuwrxYPgcC/Zy1oh6sofUQQcbNz48pd1v/VYyYc0ygU6eZt/PVIR HB/fT0nE8qNWp9G4+mdhKprjItsI7wAVcMIM2OdJfg42ox/dekM3ajN+0++EeFXX7G9d coAA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=lZa444TGelSxfyFGIASgMZVRbr1mUgy5fMfkXk2Hphs=; b=m6c9n5DSsSr1qhyJk2CcTn6L3M5NfBnJBQqaGba+LhpQ/2P9AZB6c+/jJC6E0PxI6I 510MMMslSiRYkGW+9JkU69PteR7le6tllPRxE4gl2vFwKOGjICsOKN/uG4FGEVg+IO/U EWRgNWz6g6SanQT06k6wTOMvQqLBk/dsqqL/OldN8sAZkxC18j+sfAQiYUtmCHAkQRsC w28w4CVJu10xmHwGaPrG7XLn3ZetORx2o74qh5gV6yGn3KeF5gCT7a9f4RQuns+litq7 ZGRWRN//W22SpjcerCum7MsuQMZRQ5Jwx3GnP5bAuEHg+TMeBktiJA49Fykto6toDOHF KL5A== X-Gm-Message-State: AO0yUKUYTC2F636PS8TEKtMqB9YMZDP/qw3Pnskzuc2WGEe4K3a6RxW7 aXimRosnLA1E0o816TTSDlM11YK7dz2s X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a05:6902:11cd:b0:90d:a6c0:6870 with SMTP id n13-20020a05690211cd00b0090da6c06870mr390938ybu.2.1676799129953; Sun, 19 Feb 2023 01:32:09 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:20 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-24-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 23/51] perf vendor events intel: Refresh jaketown events From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758251673842745356?= X-GMAIL-MSGID: =?utf-8?q?1758251673842745356?= Update the jaketown events from 21 to 22. Generation was done using https://github.com/intel/perfmon. Notable changes are improved event descriptions, TMA metrics are updated to version 4.5, TMA info metrics are renamed from their node name to be lower case and prefixed by tma_info_, MetricThreshold expressions are added, and the smi_cost metric group is added replicating existing hard coded metrics in stat-shadow. Signed-off-by: Ian Rogers --- .../pmu-events/arch/x86/jaketown/cache.json | 6 +- .../arch/x86/jaketown/floating-point.json | 2 +- .../arch/x86/jaketown/frontend.json | 12 +- .../arch/x86/jaketown/jkt-metrics.json | 602 ++++++++++-------- .../arch/x86/jaketown/pipeline.json | 2 +- .../arch/x86/jaketown/uncore-cache.json | 22 +- .../x86/jaketown/uncore-interconnect.json | 74 +-- .../arch/x86/jaketown/uncore-memory.json | 4 +- .../arch/x86/jaketown/uncore-other.json | 22 +- .../arch/x86/jaketown/uncore-power.json | 8 +- tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- 11 files changed, 409 insertions(+), 347 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/jaketown/cache.json b/tools/perf/pmu-events/arch/x86/jaketown/cache.json index f1271039b6b2..b9769d39940c 100644 --- a/tools/perf/pmu-events/arch/x86/jaketown/cache.json +++ b/tools/perf/pmu-events/arch/x86/jaketown/cache.json @@ -37,7 +37,7 @@ "UMask": "0x5" }, { - "BriefDescription": "Cycles a demand request was blocked due to Fill Buffers inavailability.", + "BriefDescription": "Cycles a demand request was blocked due to Fill Buffers unavailability.", "CounterMask": "1", "EventCode": "0x48", "EventName": "L1D_PEND_MISS.FB_FULL", @@ -45,7 +45,7 @@ "UMask": "0x2" }, { - "BriefDescription": "L1D miss oustandings duration in cycles.", + "BriefDescription": "L1D miss outstanding duration in cycles.", "EventCode": "0x48", "EventName": "L1D_PEND_MISS.PENDING", "SampleAfterValue": "2000003", @@ -500,7 +500,7 @@ "UMask": "0x8" }, { - "BriefDescription": "Cacheable and noncachaeble code read requests.", + "BriefDescription": "Cacheable and non-cacheable code read requests.", "EventCode": "0xB0", "EventName": "OFFCORE_REQUESTS.DEMAND_CODE_RD", "SampleAfterValue": "100003", diff --git a/tools/perf/pmu-events/arch/x86/jaketown/floating-point.json b/tools/perf/pmu-events/arch/x86/jaketown/floating-point.json index 8c2a246adef9..79e8f403c426 100644 --- a/tools/perf/pmu-events/arch/x86/jaketown/floating-point.json +++ b/tools/perf/pmu-events/arch/x86/jaketown/floating-point.json @@ -64,7 +64,7 @@ "UMask": "0x20" }, { - "BriefDescription": "Number of FP Computational Uops Executed this cycle. The number of FADD, FSUB, FCOM, FMULs, integer MULsand IMULs, FDIVs, FPREMs, FSQRTS, integer DIVs, and IDIVs. This event does not distinguish an FADD used in the middle of a transcendental flow from a s.", + "BriefDescription": "Number of FP Computational Uops Executed this cycle. The number of FADD, FSUB, FCOM, FMULs, integer MULs and IMULs, FDIVs, FPREMs, FSQRTS, integer DIVs, and IDIVs. This event does not distinguish an FADD used in the middle of a transcendental flow from a s.", "EventCode": "0x10", "EventName": "FP_COMP_OPS_EXE.X87", "SampleAfterValue": "2000003", diff --git a/tools/perf/pmu-events/arch/x86/jaketown/frontend.json b/tools/perf/pmu-events/arch/x86/jaketown/frontend.json index 3f4fc3481112..754ee2749485 100644 --- a/tools/perf/pmu-events/arch/x86/jaketown/frontend.json +++ b/tools/perf/pmu-events/arch/x86/jaketown/frontend.json @@ -134,7 +134,7 @@ "UMask": "0x4" }, { - "BriefDescription": "Cycles when uops are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequenser (MS) is busy.", + "BriefDescription": "Cycles when uops are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy.", "CounterMask": "1", "EventCode": "0x79", "EventName": "IDQ.MS_CYCLES", @@ -143,7 +143,7 @@ "UMask": "0x30" }, { - "BriefDescription": "Cycles when uops initiated by Decode Stream Buffer (DSB) are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequenser (MS) is busy.", + "BriefDescription": "Cycles when uops initiated by Decode Stream Buffer (DSB) are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy.", "CounterMask": "1", "EventCode": "0x79", "EventName": "IDQ.MS_DSB_CYCLES", @@ -151,7 +151,7 @@ "UMask": "0x10" }, { - "BriefDescription": "Deliveries to Instruction Decode Queue (IDQ) initiated by Decode Stream Buffer (DSB) while Microcode Sequenser (MS) is busy.", + "BriefDescription": "Deliveries to Instruction Decode Queue (IDQ) initiated by Decode Stream Buffer (DSB) while Microcode Sequencer (MS) is busy.", "CounterMask": "1", "EdgeDetect": "1", "EventCode": "0x79", @@ -160,14 +160,14 @@ "UMask": "0x10" }, { - "BriefDescription": "Uops initiated by Decode Stream Buffer (DSB) that are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequenser (MS) is busy.", + "BriefDescription": "Uops initiated by Decode Stream Buffer (DSB) that are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy.", "EventCode": "0x79", "EventName": "IDQ.MS_DSB_UOPS", "SampleAfterValue": "2000003", "UMask": "0x10" }, { - "BriefDescription": "Uops initiated by MITE and delivered to Instruction Decode Queue (IDQ) while Microcode Sequenser (MS) is busy.", + "BriefDescription": "Uops initiated by MITE and delivered to Instruction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy.", "EventCode": "0x79", "EventName": "IDQ.MS_MITE_UOPS", "SampleAfterValue": "2000003", @@ -183,7 +183,7 @@ "UMask": "0x30" }, { - "BriefDescription": "Uops delivered to Instruction Decode Queue (IDQ) while Microcode Sequenser (MS) is busy.", + "BriefDescription": "Uops delivered to Instruction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy.", "EventCode": "0x79", "EventName": "IDQ.MS_UOPS", "SampleAfterValue": "2000003", diff --git a/tools/perf/pmu-events/arch/x86/jaketown/jkt-metrics.json b/tools/perf/pmu-events/arch/x86/jaketown/jkt-metrics.json index cb1420df3768..e8f4e5c01c9f 100644 --- a/tools/perf/pmu-events/arch/x86/jaketown/jkt-metrics.json +++ b/tools/perf/pmu-events/arch/x86/jaketown/jkt-metrics.json @@ -1,449 +1,511 @@ [ { - "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", - "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / SLOTS", - "MetricGroup": "PGO;TopdownL1;tma_L1_group", - "MetricName": "tma_frontend_bound", - "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Machine_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound.", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", - "MetricExpr": "4 * min(CPU_CLK_UNHALTED.THREAD, IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE) / SLOTS", - "MetricGroup": "Frontend;TopdownL2;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_latency", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: RS_EVENTS.EMPTY_END", + "BriefDescription": "C2 residency percent per package", + "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C2_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses", - "MetricExpr": "(12 * ITLB_MISSES.STLB_HIT + ITLB_MISSES.WALK_DURATION) / CLKS", - "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_itlb_misses", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: ITLB_MISSES.WALK_COMPLETED", + "BriefDescription": "C3 residency percent per core", + "MetricExpr": "cstate_core@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", - "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY) / CLKS", - "MetricGroup": "FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_branch_resteers", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "BriefDescription": "C3 residency percent per package", + "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / CLKS", - "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_dsb_switches", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty.", + "BriefDescription": "C6 residency percent per core", + "MetricExpr": "cstate_core@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", - "MetricExpr": "ILD_STALL.LCP / CLKS", - "MetricGroup": "FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_lcp", - "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs.", + "BriefDescription": "C6 residency percent per package", + "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", - "MetricExpr": "3 * IDQ.MS_SWITCHES / CLKS", - "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_ms_switches", - "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES", + "BriefDescription": "C7 residency percent per core", + "MetricExpr": "cstate_core@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", - "MetricExpr": "tma_frontend_bound - tma_fetch_latency", - "MetricGroup": "FetchBW;Frontend;TopdownL2;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_bandwidth", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend.", + "BriefDescription": "C7 residency percent per package", + "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", - "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group", - "MetricName": "tma_bad_speculation", - "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", - "ScaleUnit": "100%" + "BriefDescription": "Uncore frequency per die [GHZ]", + "MetricExpr": "tma_info_socket_clks / #num_dies / duration_time / 1e9", + "MetricGroup": "SoC", + "MetricName": "UNCORE_FREQ" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", - "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL2;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_branch_mispredicts", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "BriefDescription": "Percentage of cycles spent in System Management Interrupts.", + "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)", + "MetricGroup": "smi", + "MetricName": "smi_cycles", + "MetricThreshold": "smi_cycles > 0.1", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", - "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts", - "MetricGroup": "BadSpec;MachineClears;TopdownL2;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_machine_clears", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT", - "ScaleUnit": "100%" + "BriefDescription": "Number of SMI interrupts.", + "MetricExpr": "msr@smi@", + "MetricGroup": "smi", + "MetricName": "smi_num", + "ScaleUnit": "1SMI#" }, { "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "1 - (tma_frontend_bound + tma_bad_speculation + tma_retiring)", - "MetricGroup": "TopdownL1;tma_L1_group", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", + "MetricThreshold": "tma_backend_bound > 0.2", "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", - "MetricExpr": "(min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_L1D_PENDING) + RESOURCE_STALLS.SB) / (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_DISPATCH) + cpu@UOPS_DISPATCHED.THREAD\\,cmask\\=1@ - cpu@UOPS_DISPATCHED.THREAD\\,cmask\\=3@ if IPC > 1.8 else (cpu@UOPS_DISPATCHED.THREAD\\,cmask\\=2@ - RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else RESOURCE_STALLS.SB)) * tma_backend_bound", - "MetricGroup": "Backend;TopdownL2;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_memory_bound", - "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "(7 * DTLB_LOAD_MISSES.STLB_HIT + DTLB_LOAD_MISSES.WALK_DURATION) / CLKS", - "MetricGroup": "MemoryTLB;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_dtlb_load", - "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", - "MetricExpr": "MEM_LOAD_UOPS_RETIRED.LLC_HIT / (MEM_LOAD_UOPS_RETIRED.LLC_HIT + 7 * MEM_LOAD_UOPS_RETIRED.LLC_MISS) * CYCLE_ACTIVITY.STALLS_L2_PENDING / CLKS", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l3_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", - "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.LLC_HIT / (MEM_LOAD_UOPS_RETIRED.LLC_HIT + 7 * MEM_LOAD_UOPS_RETIRED.LLC_MISS)) * CYCLE_ACTIVITY.STALLS_L2_PENDING / CLKS", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_dram_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_MISS_PS", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=6@) / CLKS", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_dram_bound_group", - "MetricName": "tma_mem_bandwidth", - "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that).", + "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", + "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_bad_speculation", + "MetricThreshold": "tma_bad_speculation > 0.15", + "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / CLKS - tma_mem_bandwidth", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_dram_bound_group", - "MetricName": "tma_mem_latency", - "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that).", + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation", + "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM", + "MetricName": "tma_branch_mispredicts", + "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_info_branch_misprediction_cost, tma_mispredicts_resteers", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", - "MetricExpr": "RESOURCE_STALLS.SB / CLKS", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_store_bound", - "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_UOPS_RETIRED.ALL_STORES_PS", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", + "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY) / tma_info_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_branch_resteers", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "tma_backend_bound - tma_memory_bound", - "MetricGroup": "Backend;Compute;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", - "MetricExpr": "ARITH.FPU_DIV_ACTIVE / CORE_CLKS", - "MetricGroup": "TopdownL3;tma_core_bound_group", + "MetricExpr": "ARITH.FPU_DIV_ACTIVE / tma_info_core_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_UOPS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", - "MetricExpr": "((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_DISPATCH) + cpu@UOPS_DISPATCHED.THREAD\\,cmask\\=1@ - cpu@UOPS_DISPATCHED.THREAD\\,cmask\\=3@ if IPC > 1.8 else (cpu@UOPS_DISPATCHED.THREAD\\,cmask\\=2@ - RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else RESOURCE_STALLS.SB)) - RESOURCE_STALLS.SB - min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_L1D_PENDING)) / CLKS", - "MetricGroup": "PortsUtil;TopdownL3;tma_core_bound_group", - "MetricName": "tma_ports_utilization", - "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", + "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", + "MetricConstraint": "NO_GROUP_EVENTS_SMT", + "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.LLC_HIT / (MEM_LOAD_UOPS_RETIRED.LLC_HIT + 7 * MEM_LOAD_UOPS_RETIRED.LLC_MISS)) * CYCLE_ACTIVITY.STALLS_L2_PENDING / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_MISS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group", - "MetricName": "tma_retiring", - "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.RETIRE_SLOTS", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_clks", + "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_dsb_switches", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Related metrics: tma_fetch_bandwidth, tma_info_dsb_coverage, tma_lcp", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", - "MetricExpr": "tma_retiring - tma_heavy_operations", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_retiring_group", - "MetricName": "tma_light_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", + "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", + "MetricExpr": "(7 * DTLB_LOAD_MISSES.STLB_HIT + DTLB_LOAD_MISSES.WALK_DURATION) / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group", + "MetricName": "tma_dtlb_load", + "MetricThreshold": "tma_dtlb_load > 0.1", + "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", + "MetricExpr": "tma_frontend_bound - tma_fetch_latency", + "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Related metrics: tma_dsb_switches, tma_info_dsb_coverage, tma_lcp", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", + "MetricExpr": "4 * min(CPU_CLK_UNHALTED.THREAD, IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE) / tma_info_slots", + "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: RS_EVENTS.EMPTY_END", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", - "MetricGroup": "HPC;TopdownL3;tma_light_operations_group", + "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group", "MetricName": "tma_fp_arith", + "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6", "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", "ScaleUnit": "100%" }, - { - "BriefDescription": "This metric serves as an approximation of legacy x87 usage", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS * FP_COMP_OPS_EXE.X87 / UOPS_DISPATCHED.THREAD", - "MetricGroup": "Compute;TopdownL4;tma_fp_arith_group", - "MetricName": "tma_x87_use", - "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", - "ScaleUnit": "100%" - }, { "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired", "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE) / UOPS_DISPATCHED.THREAD", - "MetricGroup": "Compute;Flops;TopdownL4;tma_fp_arith_group", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", "MetricName": "tma_fp_scalar", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting.", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_512b, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths", "MetricExpr": "(FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + FP_COMP_OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) / UOPS_DISPATCHED.THREAD", - "MetricGroup": "Compute;Flops;TopdownL4;tma_fp_arith_group", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", "MetricName": "tma_fp_vector", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting.", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_512b, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", + "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / tma_info_slots", + "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences", "MetricExpr": "tma_microcode_sequencer", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", "MetricName": "tma_heavy_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences. This highly-correlates with the uop length of these instructions/sequences.", + "MetricThreshold": "tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / SLOTS", - "MetricGroup": "MicroSeq;TopdownL3;tma_heavy_operations_group", - "MetricName": "tma_microcode_sequencer", - "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS", - "ScaleUnit": "100%" + "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", + "MetricExpr": "tma_info_turbo_utilization * TSC / 1e9 / duration_time", + "MetricGroup": "Power;Summary", + "MetricName": "tma_info_average_frequency" }, { - "BriefDescription": "Instructions Per Cycle (per Logical Processor)", - "MetricExpr": "INST_RETIRED.ANY / CLKS", - "MetricGroup": "Ret;Summary", - "MetricName": "IPC" + "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "Pipeline", + "MetricName": "tma_info_clks" }, { - "BriefDescription": "Uops Per Instruction", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY", - "MetricGroup": "Pipeline;Ret;Retire", - "MetricName": "UPI" + "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_clks))", + "MetricGroup": "SMT", + "MetricName": "tma_info_core_clks" + }, + { + "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_core_clks", + "MetricGroup": "Ret;SMT;TmaL1;tma_L1_group", + "MetricName": "tma_info_coreipc" }, { "BriefDescription": "Cycles Per Instruction (per Logical Processor)", - "MetricExpr": "1 / IPC", + "MetricExpr": "1 / tma_info_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "CPI" + "MetricName": "tma_info_cpi" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "Pipeline", - "MetricName": "CLKS" + "BriefDescription": "Average CPU Utilization", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": "HPC;Summary", + "MetricName": "tma_info_cpu_utilization" }, { - "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "4 * CORE_CLKS", - "MetricGroup": "tma_L1_group", - "MetricName": "SLOTS" + "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", + "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e9 / duration_time", + "MetricGroup": "HPC;Mem;MemoryBW;SoC;tma_issueBW", + "MetricName": "tma_info_dram_bw_use", + "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_mem_bandwidth" + }, + { + "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_dsb_coverage", + "MetricThreshold": "tma_info_dsb_coverage < 0.7 & tma_info_ipc / 4 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_lcp" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_DISPATCHED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", - "MetricName": "Execute_per_Issue", + "MetricName": "tma_info_execute_per_issue", "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage." }, - { - "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", - "MetricExpr": "INST_RETIRED.ANY / CORE_CLKS", - "MetricGroup": "Ret;SMT;tma_L1_group", - "MetricName": "CoreIPC" - }, { "BriefDescription": "Floating Point Operations Per Cycle", - "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE + 2 * FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + 4 * (FP_COMP_OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) + 8 * SIMD_FP_256.PACKED_SINGLE) / CORE_CLKS", + "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE + 2 * FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + 4 * (FP_COMP_OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) + 8 * SIMD_FP_256.PACKED_SINGLE) / tma_info_core_clks", "MetricGroup": "Flops;Ret", - "MetricName": "FLOPc" + "MetricName": "tma_info_flopc" + }, + { + "BriefDescription": "Giga Floating Point Operations Per Second", + "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE + 2 * FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + 4 * (FP_COMP_OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) + 8 * SIMD_FP_256.PACKED_SINGLE) / 1e9 / duration_time", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_gflops", + "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine." }, { "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", "MetricExpr": "UOPS_DISPATCHED.THREAD / (cpu@UOPS_DISPATCHED.CORE\\,cmask\\=1@ / 2 if #SMT_on else cpu@UOPS_DISPATCHED.CORE\\,cmask\\=1@)", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", - "MetricName": "ILP" + "MetricName": "tma_info_ilp" }, { - "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else CLKS))", - "MetricGroup": "SMT", - "MetricName": "CORE_CLKS" + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Summary;TmaL1;tma_L1_group", + "MetricName": "tma_info_instructions", + "PublicDescription": "Total number of retired Instructions. Sample with: INST_RETIRED.PREC_DIST" }, { - "BriefDescription": "Total number of retired Instructions Sample with: INST_RETIRED.PREC_DIST", - "MetricExpr": "INST_RETIRED.ANY", - "MetricGroup": "Summary;tma_L1_group", - "MetricName": "Instructions" + "BriefDescription": "Instructions Per Cycle (per Logical Processor)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_clks", + "MetricGroup": "Ret;Summary", + "MetricName": "tma_info_ipc" }, { - "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=1@", - "MetricGroup": "Pipeline;Ret", - "MetricName": "Retire" + "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", + "MetricGroup": "Branches;OS", + "MetricName": "tma_info_ipfarbranch", + "MetricThreshold": "tma_info_ipfarbranch < 1e6" }, { - "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", - "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", - "MetricGroup": "DSB;Fed;FetchBW", - "MetricName": "DSB_Coverage" + "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_cpi" }, { - "BriefDescription": "Average CPU Utilization", - "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", - "MetricGroup": "HPC;Summary", - "MetricName": "CPU_Utilization" + "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_utilization", + "MetricThreshold": "tma_info_kernel_utilization > 0.05" }, { - "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", - "MetricExpr": "Turbo_Utilization * TSC / 1e9 / duration_time", - "MetricGroup": "Power;Summary", - "MetricName": "Average_Frequency" + "BriefDescription": "Average number of parallel data read requests to external memory", + "MetricExpr": "UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=0x182@ / UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=0x182\\,thresh\\=1@", + "MetricGroup": "Mem;MemoryBW;SoC", + "MetricName": "tma_info_mem_parallel_reads", + "PublicDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches" }, { - "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE + 2 * FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + 4 * (FP_COMP_OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) + 8 * SIMD_FP_256.PACKED_SINGLE) / 1e9 / duration_time", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "GFLOPs", - "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine." + "BriefDescription": "Average latency of data read request to external memory (in nanoseconds)", + "MetricExpr": "1e9 * (UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=0x182@ / UNC_C_TOR_INSERTS.MISS_OPCODE@filter_opc\\=0x182@) / (tma_info_socket_clks / duration_time)", + "MetricGroup": "Mem;MemoryLat;SoC", + "MetricName": "tma_info_mem_read_latency", + "PublicDescription": "Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches. ([RKL+]memory-controller only)" }, { - "BriefDescription": "Average Frequency Utilization relative nominal frequency", - "MetricExpr": "CLKS / CPU_CLK_UNHALTED.REF_TSC", - "MetricGroup": "Power", - "MetricName": "Turbo_Utilization" + "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=1@", + "MetricGroup": "Pipeline;Ret", + "MetricName": "tma_info_retire" + }, + { + "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", + "MetricExpr": "4 * tma_info_core_clks", + "MetricGroup": "TmaL1;tma_L1_group", + "MetricName": "tma_info_slots" }, { "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", "MetricGroup": "SMT", - "MetricName": "SMT_2T_Utilization" + "MetricName": "tma_info_smt_2t_utilization" }, { - "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "OS", - "MetricName": "Kernel_Utilization" + "BriefDescription": "Socket actual clocks when any core is active on that socket", + "MetricExpr": "cbox_0@event\\=0x0@", + "MetricGroup": "SoC", + "MetricName": "tma_info_socket_clks" }, { - "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", - "MetricGroup": "OS", - "MetricName": "Kernel_CPI" + "BriefDescription": "Average Frequency Utilization relative nominal frequency", + "MetricExpr": "tma_info_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_turbo_utilization" }, { - "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", - "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e9 / duration_time", - "MetricGroup": "HPC;Mem;MemoryBW;SoC", - "MetricName": "DRAM_BW_Use" + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY", + "MetricGroup": "Pipeline;Ret;Retire", + "MetricName": "tma_info_uoppi", + "MetricThreshold": "tma_info_uoppi > 1.05" }, { - "BriefDescription": "Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches", - "MetricExpr": "1e9 * (UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=0x182@ / UNC_C_TOR_INSERTS.MISS_OPCODE@filter_opc\\=0x182@) / (Socket_CLKS / duration_time)", - "MetricGroup": "Mem;MemoryLat;SoC", - "MetricName": "MEM_Read_Latency" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses", + "MetricExpr": "(12 * ITLB_MISSES.STLB_HIT + ITLB_MISSES.WALK_DURATION) / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_itlb_misses", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: ITLB_MISSES.WALK_COMPLETED", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches", - "MetricExpr": "UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=0x182@ / UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=0x182\\,thresh\\=1@", - "MetricGroup": "Mem;MemoryBW;SoC", - "MetricName": "MEM_Parallel_Reads" + "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", + "MetricConstraint": "NO_GROUP_EVENTS_SMT", + "MetricExpr": "MEM_LOAD_UOPS_RETIRED.LLC_HIT / (MEM_LOAD_UOPS_RETIRED.LLC_HIT + 7 * MEM_LOAD_UOPS_RETIRED.LLC_MISS) * CYCLE_ACTIVITY.STALLS_L2_PENDING / tma_info_clks", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l3_bound", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Socket actual clocks when any core is active on that socket", - "MetricExpr": "cbox_0@event\\=0x0@", - "MetricGroup": "SoC", - "MetricName": "Socket_CLKS" + "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", + "MetricExpr": "ILD_STALL.LCP / tma_info_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_lcp", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", - "MetricGroup": "Branches;OS", - "MetricName": "IpFarBranch" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", + "MetricExpr": "tma_retiring - tma_heavy_operations", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_light_operations", + "MetricThreshold": "tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", + "ScaleUnit": "100%" }, { - "BriefDescription": "Uncore frequency per die [GHZ]", - "MetricExpr": "Socket_CLKS / #num_dies / duration_time / 1e9", - "MetricGroup": "SoC", - "MetricName": "UNCORE_FREQ" + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts", + "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "C3 residency percent per core", - "MetricExpr": "cstate_core@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Core_Residency", + "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=6@) / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW", + "MetricName": "tma_mem_bandwidth", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_info_dram_bw_use", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per core", - "MetricExpr": "cstate_core@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Core_Residency", + "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_clks - tma_mem_bandwidth", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat", + "MetricName": "tma_mem_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: ", "ScaleUnit": "100%" }, { - "BriefDescription": "C7 residency percent per core", - "MetricExpr": "cstate_core@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Core_Residency", + "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_L1D_PENDING) + RESOURCE_STALLS.SB) / (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_DISPATCH) + cpu@UOPS_DISPATCHED.THREAD\\,cmask\\=1@ - (cpu@UOPS_DISPATCHED.THREAD\\,cmask\\=3@ if tma_info_ipc > 1.8 else cpu@UOPS_DISPATCHED.THREAD\\,cmask\\=2@) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) * tma_backend_bound", + "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", "ScaleUnit": "100%" }, { - "BriefDescription": "C2 residency percent per package", - "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C2_Pkg_Residency", + "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / tma_info_slots", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS", + "MetricName": "tma_microcode_sequencer", + "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "C3 residency percent per package", - "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Pkg_Residency", + "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", + "MetricExpr": "3 * IDQ.MS_SWITCHES / tma_info_clks", + "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per package", - "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Pkg_Residency", + "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_DISPATCH) + cpu@UOPS_DISPATCHED.THREAD\\,cmask\\=1@ - (cpu@UOPS_DISPATCHED.THREAD\\,cmask\\=3@ if tma_info_ipc > 1.8 else cpu@UOPS_DISPATCHED.THREAD\\,cmask\\=2@) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB - RESOURCE_STALLS.SB - min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_L1D_PENDING)) / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_ports_utilization", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, { - "BriefDescription": "C7 residency percent per package", - "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Pkg_Residency", + "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1", + "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.RETIRE_SLOTS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", + "MetricExpr": "RESOURCE_STALLS.SB / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_UOPS_RETIRED.ALL_STORES_PS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric serves as an approximation of legacy x87 usage", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS * FP_COMP_OPS_EXE.X87 / UOPS_DISPATCHED.THREAD", + "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", + "MetricName": "tma_x87_use", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", "ScaleUnit": "100%" } ] diff --git a/tools/perf/pmu-events/arch/x86/jaketown/pipeline.json b/tools/perf/pmu-events/arch/x86/jaketown/pipeline.json index 11d41ce8c922..85c04fe7632a 100644 --- a/tools/perf/pmu-events/arch/x86/jaketown/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/jaketown/pipeline.json @@ -501,7 +501,7 @@ "BriefDescription": "Cases when loads get true Block-on-Store blocking code preventing store forwarding.", "EventCode": "0x03", "EventName": "LD_BLOCKS.STORE_FORWARD", - "PublicDescription": "This event counts loads that followed a store to the same address, where the data could not be forwarded inside the pipeline from the store to the load. The most common reason why store forwarding would be blocked is when a load's address range overlaps with a preceeding smaller uncompleted store. See the table of not supported store forwards in the Intel? 64 and IA-32 Architectures Optimization Reference Manual. The penalty for blocked store forwarding is that the load must wait for the store to complete before it can be issued.", + "PublicDescription": "This event counts loads that followed a store to the same address, where the data could not be forwarded inside the pipeline from the store to the load. The most common reason why store forwarding would be blocked is when a load's address range overlaps with a preceding smaller uncompleted store. See the table of not supported store forwards in the Intel? 64 and IA-32 Architectures Optimization Reference Manual. The penalty for blocked store forwarding is that the load must wait for the store to complete before it can be issued.", "SampleAfterValue": "100003", "UMask": "0x2" }, diff --git a/tools/perf/pmu-events/arch/x86/jaketown/uncore-cache.json b/tools/perf/pmu-events/arch/x86/jaketown/uncore-cache.json index b9e68f9f33ea..47830ca5c682 100644 --- a/tools/perf/pmu-events/arch/x86/jaketown/uncore-cache.json +++ b/tools/perf/pmu-events/arch/x86/jaketown/uncore-cache.json @@ -572,7 +572,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.EVICTION", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select 'MISS_OPC_MATCH' and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).", + "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select 'MISS_OPC_MATCH' and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).", "UMask": "0x4", "Unit": "CBO" }, @@ -581,7 +581,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.MISS_ALL", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select 'MISS_OPC_MATCH' and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).", + "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select 'MISS_OPC_MATCH' and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).", "UMask": "0xa", "Unit": "CBO" }, @@ -590,7 +590,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.MISS_OPCODE", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select 'MISS_OPC_MATCH' and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).", + "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select 'MISS_OPC_MATCH' and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).", "UMask": "0x3", "Unit": "CBO" }, @@ -599,7 +599,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.NID_ALL", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select 'MISS_OPC_MATCH' and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).", + "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select 'MISS_OPC_MATCH' and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).", "UMask": "0x48", "Unit": "CBO" }, @@ -608,7 +608,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.NID_EVICTION", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select 'MISS_OPC_MATCH' and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).", + "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select 'MISS_OPC_MATCH' and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).", "UMask": "0x44", "Unit": "CBO" }, @@ -617,7 +617,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.NID_MISS_ALL", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select 'MISS_OPC_MATCH' and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).", + "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select 'MISS_OPC_MATCH' and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).", "UMask": "0x4a", "Unit": "CBO" }, @@ -626,7 +626,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.NID_MISS_OPCODE", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select 'MISS_OPC_MATCH' and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).", + "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select 'MISS_OPC_MATCH' and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).", "UMask": "0x43", "Unit": "CBO" }, @@ -635,7 +635,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.NID_OPCODE", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select 'MISS_OPC_MATCH' and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).", + "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select 'MISS_OPC_MATCH' and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).", "UMask": "0x41", "Unit": "CBO" }, @@ -644,7 +644,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.NID_WB", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select 'MISS_OPC_MATCH' and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).", + "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select 'MISS_OPC_MATCH' and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).", "UMask": "0x50", "Unit": "CBO" }, @@ -653,7 +653,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.OPCODE", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select 'MISS_OPC_MATCH' and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).", + "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select 'MISS_OPC_MATCH' and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).", "UMask": "0x1", "Unit": "CBO" }, @@ -662,7 +662,7 @@ "EventCode": "0x35", "EventName": "UNC_C_TOR_INSERTS.WB", "PerPkg": "1", - "PublicDescription": "Counts the number of entries successfuly inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select 'MISS_OPC_MATCH' and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).", + "PublicDescription": "Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. There are a number of subevent 'filters' but only a subset of the subevent combinations are valid. Subevents that require an opcode or NID match require the Cn_MSR_PMON_BOX_FILTER.{opc, nid} field to be set. If, for example, one wanted to count DRD Local Misses, one should select 'MISS_OPC_MATCH' and set Cn_MSR_PMON_BOX_FILTER.opc to DRD (0x182).", "UMask": "0x10", "Unit": "CBO" }, diff --git a/tools/perf/pmu-events/arch/x86/jaketown/uncore-interconnect.json b/tools/perf/pmu-events/arch/x86/jaketown/uncore-interconnect.json index 1c2cf94889a1..b16bb649225d 100644 --- a/tools/perf/pmu-events/arch/x86/jaketown/uncore-interconnect.json +++ b/tools/perf/pmu-events/arch/x86/jaketown/uncore-interconnect.json @@ -20,7 +20,7 @@ "EventCode": "0x13", "EventName": "UNC_Q_DIRECT2CORE.FAILURE_CREDITS", "PerPkg": "1", - "PublicDescription": "Counts the number of DRS packets that we attempted to do direct2core on. There are 4 mutually exlusive filters. Filter [0] can be used to get successful spawns, while [1:3] provide the different failure cases. Note that this does not count packets that are not candidates for Direct2Core. The only candidates for Direct2Core are DRS packets destined for Cbos.", + "PublicDescription": "Counts the number of DRS packets that we attempted to do direct2core on. There are 4 mutually exclusive filters. Filter [0] can be used to get successful spawns, while [1:3] provide the different failure cases. Note that this does not count packets that are not candidates for Direct2Core. The only candidates for Direct2Core are DRS packets destined for Cbos.", "UMask": "0x2", "Unit": "QPI LL" }, @@ -29,7 +29,7 @@ "EventCode": "0x13", "EventName": "UNC_Q_DIRECT2CORE.FAILURE_CREDITS_RBT", "PerPkg": "1", - "PublicDescription": "Counts the number of DRS packets that we attempted to do direct2core on. There are 4 mutually exlusive filters. Filter [0] can be used to get successful spawns, while [1:3] provide the different failure cases. Note that this does not count packets that are not candidates for Direct2Core. The only candidates for Direct2Core are DRS packets destined for Cbos.", + "PublicDescription": "Counts the number of DRS packets that we attempted to do direct2core on. There are 4 mutually exclusive filters. Filter [0] can be used to get successful spawns, while [1:3] provide the different failure cases. Note that this does not count packets that are not candidates for Direct2Core. The only candidates for Direct2Core are DRS packets destined for Cbos.", "UMask": "0x8", "Unit": "QPI LL" }, @@ -38,7 +38,7 @@ "EventCode": "0x13", "EventName": "UNC_Q_DIRECT2CORE.FAILURE_RBT", "PerPkg": "1", - "PublicDescription": "Counts the number of DRS packets that we attempted to do direct2core on. There are 4 mutually exlusive filters. Filter [0] can be used to get successful spawns, while [1:3] provide the different failure cases. Note that this does not count packets that are not candidates for Direct2Core. The only candidates for Direct2Core are DRS packets destined for Cbos.", + "PublicDescription": "Counts the number of DRS packets that we attempted to do direct2core on. There are 4 mutually exclusive filters. Filter [0] can be used to get successful spawns, while [1:3] provide the different failure cases. Note that this does not count packets that are not candidates for Direct2Core. The only candidates for Direct2Core are DRS packets destined for Cbos.", "UMask": "0x4", "Unit": "QPI LL" }, @@ -47,7 +47,7 @@ "EventCode": "0x13", "EventName": "UNC_Q_DIRECT2CORE.SUCCESS", "PerPkg": "1", - "PublicDescription": "Counts the number of DRS packets that we attempted to do direct2core on. There are 4 mutually exlusive filters. Filter [0] can be used to get successful spawns, while [1:3] provide the different failure cases. Note that this does not count packets that are not candidates for Direct2Core. The only candidates for Direct2Core are DRS packets destined for Cbos.", + "PublicDescription": "Counts the number of DRS packets that we attempted to do direct2core on. There are 4 mutually exclusive filters. Filter [0] can be used to get successful spawns, while [1:3] provide the different failure cases. Note that this does not count packets that are not candidates for Direct2Core. The only candidates for Direct2Core are DRS packets destined for Cbos.", "UMask": "0x1", "Unit": "QPI LL" }, @@ -80,7 +80,7 @@ "EventCode": "0x9", "EventName": "UNC_Q_RxL_BYPASSED", "PerPkg": "1", - "PublicDescription": "Counts the number of times that an incoming flit was able to bypass the flit buffer and pass directly across the BGF and into the Egress. This is a latency optimization, and should generally be the common case. If this value is less than the number of flits transfered, it implies that there was queueing getting onto the ring, and thus the transactions saw higher latency.", + "PublicDescription": "Counts the number of times that an incoming flit was able to bypass the flit buffer and pass directly across the BGF and into the Egress. This is a latency optimization, and should generally be the common case. If this value is less than the number of flits transferred, it implies that there was queueing getting onto the ring, and thus the transactions saw higher latency.", "Unit": "QPI LL" }, { @@ -176,7 +176,7 @@ "EventCode": "0x1", "EventName": "UNC_Q_RxL_FLITS_G0.DATA", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. It includes filters for Idle, protocol, and Data Flits. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time (for L0) or 4B instead of 8B for L0p.", + "PublicDescription": "Counts the number of flits received from the QPI Link. It includes filters for Idle, protocol, and Data Flits. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time (for L0) or 4B instead of 8B for L0p.", "UMask": "0x2", "Unit": "QPI LL" }, @@ -185,7 +185,7 @@ "EventCode": "0x1", "EventName": "UNC_Q_RxL_FLITS_G0.IDLE", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. It includes filters for Idle, protocol, and Data Flits. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time (for L0) or 4B instead of 8B for L0p.", + "PublicDescription": "Counts the number of flits received from the QPI Link. It includes filters for Idle, protocol, and Data Flits. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time (for L0) or 4B instead of 8B for L0p.", "UMask": "0x1", "Unit": "QPI LL" }, @@ -194,7 +194,7 @@ "EventCode": "0x1", "EventName": "UNC_Q_RxL_FLITS_G0.NON_DATA", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. It includes filters for Idle, protocol, and Data Flits. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time (for L0) or 4B instead of 8B for L0p.", + "PublicDescription": "Counts the number of flits received from the QPI Link. It includes filters for Idle, protocol, and Data Flits. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time (for L0) or 4B instead of 8B for L0p.", "UMask": "0x4", "Unit": "QPI LL" }, @@ -203,7 +203,7 @@ "EventCode": "0x2", "EventName": "UNC_Q_RxL_FLITS_G1.DRS", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", + "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", "UMask": "0x18", "Unit": "QPI LL" }, @@ -212,7 +212,7 @@ "EventCode": "0x2", "EventName": "UNC_Q_RxL_FLITS_G1.DRS_DATA", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", + "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", "UMask": "0x8", "Unit": "QPI LL" }, @@ -221,7 +221,7 @@ "EventCode": "0x2", "EventName": "UNC_Q_RxL_FLITS_G1.DRS_NONDATA", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", + "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", "UMask": "0x10", "Unit": "QPI LL" }, @@ -230,7 +230,7 @@ "EventCode": "0x2", "EventName": "UNC_Q_RxL_FLITS_G1.HOM", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", + "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", "UMask": "0x6", "Unit": "QPI LL" }, @@ -239,7 +239,7 @@ "EventCode": "0x2", "EventName": "UNC_Q_RxL_FLITS_G1.HOM_NONREQ", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", + "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", "UMask": "0x4", "Unit": "QPI LL" }, @@ -248,7 +248,7 @@ "EventCode": "0x2", "EventName": "UNC_Q_RxL_FLITS_G1.HOM_REQ", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", + "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", "UMask": "0x2", "Unit": "QPI LL" }, @@ -257,7 +257,7 @@ "EventCode": "0x2", "EventName": "UNC_Q_RxL_FLITS_G1.SNP", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", + "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", "UMask": "0x1", "Unit": "QPI LL" }, @@ -266,7 +266,7 @@ "EventCode": "0x3", "EventName": "UNC_Q_RxL_FLITS_G2.NCB", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", + "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", "UMask": "0xc", "Unit": "QPI LL" }, @@ -275,7 +275,7 @@ "EventCode": "0x3", "EventName": "UNC_Q_RxL_FLITS_G2.NCB_DATA", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", + "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", "UMask": "0x4", "Unit": "QPI LL" }, @@ -284,7 +284,7 @@ "EventCode": "0x3", "EventName": "UNC_Q_RxL_FLITS_G2.NCB_NONDATA", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", + "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", "UMask": "0x8", "Unit": "QPI LL" }, @@ -293,7 +293,7 @@ "EventCode": "0x3", "EventName": "UNC_Q_RxL_FLITS_G2.NCS", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", + "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", "UMask": "0x10", "Unit": "QPI LL" }, @@ -302,7 +302,7 @@ "EventCode": "0x3", "EventName": "UNC_Q_RxL_FLITS_G2.NDR_AD", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", + "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", "UMask": "0x1", "Unit": "QPI LL" }, @@ -311,7 +311,7 @@ "EventCode": "0x3", "EventName": "UNC_Q_RxL_FLITS_G2.NDR_AK", "PerPkg": "1", - "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", + "PublicDescription": "Counts the number of flits received from the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", "UMask": "0x2", "Unit": "QPI LL" }, @@ -553,7 +553,7 @@ "BriefDescription": "Flits Transferred - Group 0; Data Tx Flits", "EventName": "UNC_Q_TxL_FLITS_G0.DATA", "PerPkg": "1", - "PublicDescription": "Counts the number of flits transmitted across the QPI Link. It includes filters for Idle, protocol, and Data Flits. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time (for L0) or 4B instead of 8B for L0p.", + "PublicDescription": "Counts the number of flits transmitted across the QPI Link. It includes filters for Idle, protocol, and Data Flits. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time (for L0) or 4B instead of 8B for L0p.", "UMask": "0x2", "Unit": "QPI LL" }, @@ -561,7 +561,7 @@ "BriefDescription": "Flits Transferred - Group 0; Idle and Null Flits", "EventName": "UNC_Q_TxL_FLITS_G0.IDLE", "PerPkg": "1", - "PublicDescription": "Counts the number of flits transmitted across the QPI Link. It includes filters for Idle, protocol, and Data Flits. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time (for L0) or 4B instead of 8B for L0p.", + "PublicDescription": "Counts the number of flits transmitted across the QPI Link. It includes filters for Idle, protocol, and Data Flits. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time (for L0) or 4B instead of 8B for L0p.", "UMask": "0x1", "Unit": "QPI LL" }, @@ -569,7 +569,7 @@ "BriefDescription": "Flits Transferred - Group 0; Non-Data protocol Tx Flits", "EventName": "UNC_Q_TxL_FLITS_G0.NON_DATA", "PerPkg": "1", - "PublicDescription": "Counts the number of flits transmitted across the QPI Link. It includes filters for Idle, protocol, and Data Flits. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time (for L0) or 4B instead of 8B for L0p.", + "PublicDescription": "Counts the number of flits transmitted across the QPI Link. It includes filters for Idle, protocol, and Data Flits. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time (for L0) or 4B instead of 8B for L0p.", "UMask": "0x4", "Unit": "QPI LL" }, @@ -577,7 +577,7 @@ "BriefDescription": "Flits Transferred - Group 1; DRS Flits (both Header and Data)", "EventName": "UNC_Q_TxL_FLITS_G1.DRS", "PerPkg": "1", - "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", + "PublicDescription": "Counts the number of flits transmitted across the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", "UMask": "0x18", "Unit": "QPI LL" }, @@ -585,7 +585,7 @@ "BriefDescription": "Flits Transferred - Group 1; DRS Data Flits", "EventName": "UNC_Q_TxL_FLITS_G1.DRS_DATA", "PerPkg": "1", - "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", + "PublicDescription": "Counts the number of flits transmitted across the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", "UMask": "0x8", "Unit": "QPI LL" }, @@ -593,7 +593,7 @@ "BriefDescription": "Flits Transferred - Group 1; DRS Header Flits", "EventName": "UNC_Q_TxL_FLITS_G1.DRS_NONDATA", "PerPkg": "1", - "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", + "PublicDescription": "Counts the number of flits transmitted across the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", "UMask": "0x10", "Unit": "QPI LL" }, @@ -601,7 +601,7 @@ "BriefDescription": "Flits Transferred - Group 1; HOM Flits", "EventName": "UNC_Q_TxL_FLITS_G1.HOM", "PerPkg": "1", - "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", + "PublicDescription": "Counts the number of flits transmitted across the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", "UMask": "0x6", "Unit": "QPI LL" }, @@ -609,7 +609,7 @@ "BriefDescription": "Flits Transferred - Group 1; HOM Non-Request Flits", "EventName": "UNC_Q_TxL_FLITS_G1.HOM_NONREQ", "PerPkg": "1", - "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", + "PublicDescription": "Counts the number of flits transmitted across the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", "UMask": "0x4", "Unit": "QPI LL" }, @@ -617,7 +617,7 @@ "BriefDescription": "Flits Transferred - Group 1; HOM Request Flits", "EventName": "UNC_Q_TxL_FLITS_G1.HOM_REQ", "PerPkg": "1", - "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", + "PublicDescription": "Counts the number of flits transmitted across the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", "UMask": "0x2", "Unit": "QPI LL" }, @@ -625,7 +625,7 @@ "BriefDescription": "Flits Transferred - Group 1; SNP Flits", "EventName": "UNC_Q_TxL_FLITS_G1.SNP", "PerPkg": "1", - "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", + "PublicDescription": "Counts the number of flits transmitted across the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for SNP, HOM, and DRS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", "UMask": "0x1", "Unit": "QPI LL" }, @@ -634,7 +634,7 @@ "EventCode": "0x1", "EventName": "UNC_Q_TxL_FLITS_G2.NCB", "PerPkg": "1", - "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", + "PublicDescription": "Counts the number of flits transmitted across the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", "UMask": "0xc", "Unit": "QPI LL" }, @@ -643,7 +643,7 @@ "EventCode": "0x1", "EventName": "UNC_Q_TxL_FLITS_G2.NCB_DATA", "PerPkg": "1", - "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", + "PublicDescription": "Counts the number of flits transmitted across the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", "UMask": "0x4", "Unit": "QPI LL" }, @@ -652,7 +652,7 @@ "EventCode": "0x1", "EventName": "UNC_Q_TxL_FLITS_G2.NCB_NONDATA", "PerPkg": "1", - "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", + "PublicDescription": "Counts the number of flits transmitted across the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", "UMask": "0x8", "Unit": "QPI LL" }, @@ -661,7 +661,7 @@ "EventCode": "0x1", "EventName": "UNC_Q_TxL_FLITS_G2.NCS", "PerPkg": "1", - "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", + "PublicDescription": "Counts the number of flits transmitted across the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", "UMask": "0x10", "Unit": "QPI LL" }, @@ -670,7 +670,7 @@ "EventCode": "0x1", "EventName": "UNC_Q_TxL_FLITS_G2.NDR_AD", "PerPkg": "1", - "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", + "PublicDescription": "Counts the number of flits transmitted across the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", "UMask": "0x1", "Unit": "QPI LL" }, @@ -679,7 +679,7 @@ "EventCode": "0x1", "EventName": "UNC_Q_TxL_FLITS_G2.NDR_AK", "PerPkg": "1", - "PublicDescription": "Counts the number of flits trasmitted across the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transfering a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", + "PublicDescription": "Counts the number of flits transmitted across the QPI Link. This is one of three 'groups' that allow us to track flits. It includes filters for NDR, NCB, and NCS message classes. Each 'flit' is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four 'fits', each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI 'speed' (for example, 8.0 GT/s), the 'transfers' here refer to 'fits'. Therefore, in L0, the system will transfer 1 'flit' at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as 'data' bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual 'data' and an additional 16 bits of other information. To calculate 'data' bandwidth, one should therefore do: data flits * 8B / time.", "UMask": "0x2", "Unit": "QPI LL" }, diff --git a/tools/perf/pmu-events/arch/x86/jaketown/uncore-memory.json b/tools/perf/pmu-events/arch/x86/jaketown/uncore-memory.json index 2faf0dc6675d..6dcc9415a462 100644 --- a/tools/perf/pmu-events/arch/x86/jaketown/uncore-memory.json +++ b/tools/perf/pmu-events/arch/x86/jaketown/uncore-memory.json @@ -101,7 +101,7 @@ "EventCode": "0x9", "EventName": "UNC_M_ECC_CORRECTABLE_ERRORS", "PerPkg": "1", - "PublicDescription": "Counts the number of ECC errors detected and corrected by the iMC on this channel. This counter is only useful with ECC DRAM devices. This count will increment one time for each correction regardless of the number of bits corrected. The iMC can correct up to 4 bit errors in independent channel mode and 8 bit erros in lockstep mode.", + "PublicDescription": "Counts the number of ECC errors detected and corrected by the iMC on this channel. This counter is only useful with ECC DRAM devices. This count will increment one time for each correction regardless of the number of bits corrected. The iMC can correct up to 4 bit errors in independent channel mode and 8 bit errors in lockstep mode.", "Unit": "iMC" }, { @@ -413,7 +413,7 @@ "EventCode": "0x81", "EventName": "UNC_M_WPQ_OCCUPANCY", "PerPkg": "1", - "PublicDescription": "Accumulates the occupancies of the Write Pending Queue each cycle. This can then be used to calculate both the average queue occupancy (in conjunction with the number of cycles not empty) and the average latency (in conjunction with the number of allocations). The WPQ is used to schedule write out to the memory controller and to track the writes. Requests allocate into the WPQ soon after they enter the memory controller, and need credits for an entry in this buffer before being sent from the HA to the iMC. They deallocate after being issued to DRAM. Write requests themselves are able to complete (from the perspective of the rest of the system) as soon they have 'posted' to the iMC. This is not to be confused with actually performing the write to DRAM. Therefore, the average latency for this queue is actually not useful for deconstruction intermediate write latencies. So, we provide filtering based on if the request has posted or not. By using the 'not posted' filter, we can track how long writes spent in the iMC before completions were sent to the HA. The 'posted' filter, on the other hand, provides information about how much queueing is actually happenning in the iMC for writes before they are actually issued to memory. High average occupancies will generally coincide with high write major mode counts.", + "PublicDescription": "Accumulates the occupancies of the Write Pending Queue each cycle. This can then be used to calculate both the average queue occupancy (in conjunction with the number of cycles not empty) and the average latency (in conjunction with the number of allocations). The WPQ is used to schedule write out to the memory controller and to track the writes. Requests allocate into the WPQ soon after they enter the memory controller, and need credits for an entry in this buffer before being sent from the HA to the iMC. They deallocate after being issued to DRAM. Write requests themselves are able to complete (from the perspective of the rest of the system) as soon they have 'posted' to the iMC. This is not to be confused with actually performing the write to DRAM. Therefore, the average latency for this queue is actually not useful for deconstruction intermediate write latencies. So, we provide filtering based on if the request has posted or not. By using the 'not posted' filter, we can track how long writes spent in the iMC before completions were sent to the HA. The 'posted' filter, on the other hand, provides information about how much queueing is actually happening in the iMC for writes before they are actually issued to memory. High average occupancies will generally coincide with high write major mode counts.", "Unit": "iMC" }, { diff --git a/tools/perf/pmu-events/arch/x86/jaketown/uncore-other.json b/tools/perf/pmu-events/arch/x86/jaketown/uncore-other.json index 51a9a4e81046..ca727c0e1865 100644 --- a/tools/perf/pmu-events/arch/x86/jaketown/uncore-other.json +++ b/tools/perf/pmu-events/arch/x86/jaketown/uncore-other.json @@ -284,7 +284,7 @@ "EventCode": "0xD", "EventName": "UNC_I_TxR_REQUEST_OCCUPANCY", "PerPkg": "1", - "PublicDescription": "Accumultes the number of outstanding outbound requests from the IRP to the switch (towards the devices). This can be used in conjuection with the allocations event in order to calculate average latency of outbound requests.", + "PublicDescription": "Accumulates the number of outstanding outbound requests from the IRP to the switch (towards the devices). This can be used in conjunction with the allocations event in order to calculate average latency of outbound requests.", "Unit": "IRP" }, { @@ -630,7 +630,7 @@ "EventCode": "0x20", "EventName": "UNC_R3_IIO_CREDITS_ACQUIRED.DRS", "PerPkg": "1", - "PublicDescription": "Counts the number of times the NCS/NCB/DRS credit is acquried in the QPI for sending messages on BL to the IIO. There is one credit for each of these three message classes (three credits total). NCS is used for reads to PCIe space, NCB is used for transfering data without coherency, and DRS is used for transfering data with coherency (cachable PCI transactions). This event can only track one message class at a time.", + "PublicDescription": "Counts the number of times the NCS/NCB/DRS credit is acquired in the QPI for sending messages on BL to the IIO. There is one credit for each of these three message classes (three credits total). NCS is used for reads to PCIe space, NCB is used for transferring data without coherency, and DRS is used for transferring data with coherency (cacheable PCI transactions). This event can only track one message class at a time.", "UMask": "0x8", "Unit": "R3QPI" }, @@ -639,7 +639,7 @@ "EventCode": "0x20", "EventName": "UNC_R3_IIO_CREDITS_ACQUIRED.NCB", "PerPkg": "1", - "PublicDescription": "Counts the number of times the NCS/NCB/DRS credit is acquried in the QPI for sending messages on BL to the IIO. There is one credit for each of these three message classes (three credits total). NCS is used for reads to PCIe space, NCB is used for transfering data without coherency, and DRS is used for transfering data with coherency (cachable PCI transactions). This event can only track one message class at a time.", + "PublicDescription": "Counts the number of times the NCS/NCB/DRS credit is acquired in the QPI for sending messages on BL to the IIO. There is one credit for each of these three message classes (three credits total). NCS is used for reads to PCIe space, NCB is used for transferring data without coherency, and DRS is used for transferring data with coherency (cacheable PCI transactions). This event can only track one message class at a time.", "UMask": "0x10", "Unit": "R3QPI" }, @@ -648,7 +648,7 @@ "EventCode": "0x20", "EventName": "UNC_R3_IIO_CREDITS_ACQUIRED.NCS", "PerPkg": "1", - "PublicDescription": "Counts the number of times the NCS/NCB/DRS credit is acquried in the QPI for sending messages on BL to the IIO. There is one credit for each of these three message classes (three credits total). NCS is used for reads to PCIe space, NCB is used for transfering data without coherency, and DRS is used for transfering data with coherency (cachable PCI transactions). This event can only track one message class at a time.", + "PublicDescription": "Counts the number of times the NCS/NCB/DRS credit is acquired in the QPI for sending messages on BL to the IIO. There is one credit for each of these three message classes (three credits total). NCS is used for reads to PCIe space, NCB is used for transferring data without coherency, and DRS is used for transferring data with coherency (cacheable PCI transactions). This event can only track one message class at a time.", "UMask": "0x20", "Unit": "R3QPI" }, @@ -657,7 +657,7 @@ "EventCode": "0x21", "EventName": "UNC_R3_IIO_CREDITS_REJECT.DRS", "PerPkg": "1", - "PublicDescription": "Counts the number of times that a request attempted to acquire an NCS/NCB/DRS credit in the QPI for sending messages on BL to the IIO but was rejected because no credit was available. There is one credit for each of these three message classes (three credits total). NCS is used for reads to PCIe space, NCB is used for transfering data without coherency, and DRS is used for transfering data with coherency (cachable PCI transactions). This event can only track one message class at a time.", + "PublicDescription": "Counts the number of times that a request attempted to acquire an NCS/NCB/DRS credit in the QPI for sending messages on BL to the IIO but was rejected because no credit was available. There is one credit for each of these three message classes (three credits total). NCS is used for reads to PCIe space, NCB is used for transferring data without coherency, and DRS is used for transferring data with coherency (cacheable PCI transactions). This event can only track one message class at a time.", "UMask": "0x8", "Unit": "R3QPI" }, @@ -666,7 +666,7 @@ "EventCode": "0x21", "EventName": "UNC_R3_IIO_CREDITS_REJECT.NCB", "PerPkg": "1", - "PublicDescription": "Counts the number of times that a request attempted to acquire an NCS/NCB/DRS credit in the QPI for sending messages on BL to the IIO but was rejected because no credit was available. There is one credit for each of these three message classes (three credits total). NCS is used for reads to PCIe space, NCB is used for transfering data without coherency, and DRS is used for transfering data with coherency (cachable PCI transactions). This event can only track one message class at a time.", + "PublicDescription": "Counts the number of times that a request attempted to acquire an NCS/NCB/DRS credit in the QPI for sending messages on BL to the IIO but was rejected because no credit was available. There is one credit for each of these three message classes (three credits total). NCS is used for reads to PCIe space, NCB is used for transferring data without coherency, and DRS is used for transferring data with coherency (cacheable PCI transactions). This event can only track one message class at a time.", "UMask": "0x10", "Unit": "R3QPI" }, @@ -675,7 +675,7 @@ "EventCode": "0x21", "EventName": "UNC_R3_IIO_CREDITS_REJECT.NCS", "PerPkg": "1", - "PublicDescription": "Counts the number of times that a request attempted to acquire an NCS/NCB/DRS credit in the QPI for sending messages on BL to the IIO but was rejected because no credit was available. There is one credit for each of these three message classes (three credits total). NCS is used for reads to PCIe space, NCB is used for transfering data without coherency, and DRS is used for transfering data with coherency (cachable PCI transactions). This event can only track one message class at a time.", + "PublicDescription": "Counts the number of times that a request attempted to acquire an NCS/NCB/DRS credit in the QPI for sending messages on BL to the IIO but was rejected because no credit was available. There is one credit for each of these three message classes (three credits total). NCS is used for reads to PCIe space, NCB is used for transferring data without coherency, and DRS is used for transferring data with coherency (cacheable PCI transactions). This event can only track one message class at a time.", "UMask": "0x20", "Unit": "R3QPI" }, @@ -684,7 +684,7 @@ "EventCode": "0x22", "EventName": "UNC_R3_IIO_CREDITS_USED.DRS", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when the NCS/NCB/DRS credit is in use in the QPI for sending messages on BL to the IIO. There is one credit for each of these three message classes (three credits total). NCS is used for reads to PCIe space, NCB is used for transfering data without coherency, and DRS is used for transfering data with coherency (cachable PCI transactions). This event can only track one message class at a time.", + "PublicDescription": "Counts the number of cycles when the NCS/NCB/DRS credit is in use in the QPI for sending messages on BL to the IIO. There is one credit for each of these three message classes (three credits total). NCS is used for reads to PCIe space, NCB is used for transferring data without coherency, and DRS is used for transferring data with coherency (cacheable PCI transactions). This event can only track one message class at a time.", "UMask": "0x8", "Unit": "R3QPI" }, @@ -693,7 +693,7 @@ "EventCode": "0x22", "EventName": "UNC_R3_IIO_CREDITS_USED.NCB", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when the NCS/NCB/DRS credit is in use in the QPI for sending messages on BL to the IIO. There is one credit for each of these three message classes (three credits total). NCS is used for reads to PCIe space, NCB is used for transfering data without coherency, and DRS is used for transfering data with coherency (cachable PCI transactions). This event can only track one message class at a time.", + "PublicDescription": "Counts the number of cycles when the NCS/NCB/DRS credit is in use in the QPI for sending messages on BL to the IIO. There is one credit for each of these three message classes (three credits total). NCS is used for reads to PCIe space, NCB is used for transferring data without coherency, and DRS is used for transferring data with coherency (cacheable PCI transactions). This event can only track one message class at a time.", "UMask": "0x10", "Unit": "R3QPI" }, @@ -702,7 +702,7 @@ "EventCode": "0x22", "EventName": "UNC_R3_IIO_CREDITS_USED.NCS", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when the NCS/NCB/DRS credit is in use in the QPI for sending messages on BL to the IIO. There is one credit for each of these three message classes (three credits total). NCS is used for reads to PCIe space, NCB is used for transfering data without coherency, and DRS is used for transfering data with coherency (cachable PCI transactions). This event can only track one message class at a time.", + "PublicDescription": "Counts the number of cycles when the NCS/NCB/DRS credit is in use in the QPI for sending messages on BL to the IIO. There is one credit for each of these three message classes (three credits total). NCS is used for reads to PCIe space, NCB is used for transferring data without coherency, and DRS is used for transferring data with coherency (cacheable PCI transactions). This event can only track one message class at a time.", "UMask": "0x20", "Unit": "R3QPI" }, @@ -1107,7 +1107,7 @@ "EventCode": "0x33", "EventName": "UNC_R3_VNA_CREDITS_ACQUIRED", "PerPkg": "1", - "PublicDescription": "Number of QPI VNA Credit acquisitions. This event can be used in conjunction with the VNA In-Use Accumulator to calculate the average lifetime of a credit holder. VNA credits are used by all message classes in order to communicate across QPI. If a packet is unable to acquire credits, it will then attempt to use credts from the VN0 pool. Note that a single packet may require multiple flit buffers (i.e. when data is being transfered). Therefore, this event will increment by the number of credits acquired in each cycle. Filtering based on message class is not provided. One can count the number of packets transfered in a given message class using an qfclk event.", + "PublicDescription": "Number of QPI VNA Credit acquisitions. This event can be used in conjunction with the VNA In-Use Accumulator to calculate the average lifetime of a credit holder. VNA credits are used by all message classes in order to communicate across QPI. If a packet is unable to acquire credits, it will then attempt to use credits from the VN0 pool. Note that a single packet may require multiple flit buffers (i.e. when data is being transferred). Therefore, this event will increment by the number of credits acquired in each cycle. Filtering based on message class is not provided. One can count the number of packets transferred in a given message class using an qfclk event.", "Unit": "R3QPI" }, { diff --git a/tools/perf/pmu-events/arch/x86/jaketown/uncore-power.json b/tools/perf/pmu-events/arch/x86/jaketown/uncore-power.json index 638aa8a35cdb..b3ee5d741015 100644 --- a/tools/perf/pmu-events/arch/x86/jaketown/uncore-power.json +++ b/tools/perf/pmu-events/arch/x86/jaketown/uncore-power.json @@ -234,7 +234,7 @@ "EventCode": "0x80", "EventName": "UNC_P_POWER_STATE_OCCUPANCY.CORES_C0", "PerPkg": "1", - "PublicDescription": "This is an occupancy event that tracks the number of cores that are in C0. It can be used by itself to get the average number of cores in C0, with threshholding to generate histograms, or with other PCU events and occupancy triggering to capture other details.", + "PublicDescription": "This is an occupancy event that tracks the number of cores that are in C0. It can be used by itself to get the average number of cores in C0, with thresholding to generate histograms, or with other PCU events and occupancy triggering to capture other details.", "Unit": "PCU" }, { @@ -242,7 +242,7 @@ "EventCode": "0x80", "EventName": "UNC_P_POWER_STATE_OCCUPANCY.CORES_C3", "PerPkg": "1", - "PublicDescription": "This is an occupancy event that tracks the number of cores that are in C0. It can be used by itself to get the average number of cores in C0, with threshholding to generate histograms, or with other PCU events and occupancy triggering to capture other details.", + "PublicDescription": "This is an occupancy event that tracks the number of cores that are in C0. It can be used by itself to get the average number of cores in C0, with thresholding to generate histograms, or with other PCU events and occupancy triggering to capture other details.", "Unit": "PCU" }, { @@ -250,7 +250,7 @@ "EventCode": "0x80", "EventName": "UNC_P_POWER_STATE_OCCUPANCY.CORES_C6", "PerPkg": "1", - "PublicDescription": "This is an occupancy event that tracks the number of cores that are in C0. It can be used by itself to get the average number of cores in C0, with threshholding to generate histograms, or with other PCU events and occupancy triggering to capture other details.", + "PublicDescription": "This is an occupancy event that tracks the number of cores that are in C0. It can be used by itself to get the average number of cores in C0, with thresholding to generate histograms, or with other PCU events and occupancy triggering to capture other details.", "Unit": "PCU" }, { @@ -266,7 +266,7 @@ "EventCode": "0x9", "EventName": "UNC_P_PROCHOT_INTERNAL_CYCLES", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles that we are in Interal PROCHOT mode. This mode is triggered when a sensor on the die determines that we are too hot and must throttle to avoid damaging the chip.", + "PublicDescription": "Counts the number of cycles that we are in Internal PROCHOT mode. This mode is triggered when a sensor on the die determines that we are too hot and must throttle to avoid damaging the chip.", "Unit": "PCU" }, { diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv index 77af9d0bf6d4..afe811f154d7 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -16,7 +16,7 @@ GenuineIntel-6-(7D|7E|A7),v1.17,icelake,core GenuineIntel-6-6[AC],v1.18,icelakex,core GenuineIntel-6-3A,v23,ivybridge,core GenuineIntel-6-3E,v22,ivytown,core -GenuineIntel-6-2D,v21,jaketown,core +GenuineIntel-6-2D,v22,jaketown,core GenuineIntel-6-(57|85),v9,knightslanding,core GenuineIntel-6-A[AC],v1.00,meteorlake,core GenuineIntel-6-1[AEF],v3,nehalemep,core From patchwork Sun Feb 19 09:28:21 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59116 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp774792wrn; Sun, 19 Feb 2023 01:42:13 -0800 (PST) X-Google-Smtp-Source: AK7set9tVMqoTt733LH0gje6Nfx+SCfUqsVzCT8BgQyKyNN7qEmzgNVdbmfld6aYUpQLQOjs2lh8 X-Received: by 2002:a17:907:6d03:b0:8b3:6da3:c788 with SMTP id sa3-20020a1709076d0300b008b36da3c788mr11466922ejc.51.1676799733630; Sun, 19 Feb 2023 01:42:13 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799733; cv=none; d=google.com; s=arc-20160816; b=PNghXlXBbgRsgIksztqYniumsR1bbfYeDH77Dy0nTPL5tI68OyQz9eFgKqNslKnq1Q Vs6TpmKCZrWfrx8Z7HHILueyjNU4cii66QCsCXVLdrNmLTam/c0AT5/JarMGpiyUCDQ0 u/LlhxvNnxCs9cVc48v4354O0qwnC1T5Z4Ru1rSAd9ilqJ+CQ9+MLYXYt10O+53l0ZTE mDrTaW2fMuT1v9QhGJfEjRl5BdjiBNbgwPQMNp/29Z5xgDi6PvSpf6F403J6IUW3UHx0 uyLf1+tvwuwJqSsPhFJPFBvXES+3bjCaFRZ9nxUymuhHE2TjxxNQwb7q4ESJ+/xcjuvN X0Ag== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:from:subject :references:mime-version:message-id:in-reply-to:date:dkim-signature; bh=RhC8RlQQDEUumThBVP3BFgOcI9ow4i6TnnBtpNZj6g8=; b=HvCFmWO7i5ZTJeuby63d1H/YONaIPNzlNxYXpL7rqLwrfIaQWM/fb7y3K4udGk/p0W qYaz8RZcx7bdXYewN+w8WXysFgiV4O6FGRMeBeD2JCYsW28AmdrN4WAPix9obQzAI1Pw +riCeltPrScNTKuhHbPFloU/X7z1h57JN7WxKLTkADESVF345VfWDgnd3Kl5kZ8FYzRA RzuWgzaBWf4kkGSLOI9u5KD8q+Qe0VTXbkWbMuwdSgZm5tYAJK4T7Lxc/KruRaMWaun2 OO78eBC3Nnq/v4UURhNyUnF/RbzrMD/bSh90MhmcRJdoMo3c5ru0z4ABi0fJVMF62Mx0 sezg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=NLiqNqqN; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id j16-20020a170906535000b008b17dfa77fdsi10598489ejo.127.2023.02.19.01.41.48; Sun, 19 Feb 2023 01:42:13 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=NLiqNqqN; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229973AbjBSJeL (ORCPT + 99 others); Sun, 19 Feb 2023 04:34:11 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56558 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229963AbjBSJeH (ORCPT ); Sun, 19 Feb 2023 04:34:07 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CC1AE4C08 for ; Sun, 19 Feb 2023 01:32:53 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id h204-20020a256cd5000000b00953ffdfbe1aso2263058ybc.23 for ; Sun, 19 Feb 2023 01:32:53 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:from:to:cc:subject:date :message-id:reply-to; bh=RhC8RlQQDEUumThBVP3BFgOcI9ow4i6TnnBtpNZj6g8=; b=NLiqNqqN2UXiBVa/kIcJY7ovAkmY5RZlMuWs3LqQooTQ7jmKpeovntgrAcJjCTFJQF 8ZxOPeAEWluzKYrDb8Lmmn1QQQFGp4RYZibqZ+XdgcNY0Gulj9egKCnUDB+FDBCU+/wz IfFux3pX1h7GnQDz0Fo2rtQF2oZZZP8uHh5lBaCMiDjZC6mNUHh65NbvekbRAUq0cqfY qcEN8ZtWjf03eBBDuJ+kAv/mw1xZ3+ayXqspuvzFRRTGhyDeiQl2vPediInZj2DuK6qc P0O7xGQZjAtedCZCSp7qfhJmF0TwRjs/EkdhrrQbDwOekwD/j8XAOujIiT4dybqYIaFX GuUA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=RhC8RlQQDEUumThBVP3BFgOcI9ow4i6TnnBtpNZj6g8=; b=HGnGo1DjXE1UK+jiqBn09duiFXE1f+XO5yPQI7OtCFrwScACBakMieVi2s7+HvrPM5 ayASAfANQrX4mILtj76+8zN78z3brycwEFKu1hNGpJ7UIIQN0U6mXJVxpqpNRHP2IRnE SqTp0mcHdfqdgjFAWHvxM44zPxQ+VrU4e1GU46grC1NHL+1O9o8s+aA4SZl7TI+/G0ES Mdo6BlhYf9sdALzFiooL9KwLYCthfnHXm82YgCMHPfzAoHhzO3fZ2XEt7XD2b7M8U2gw CHpOwmnJGiBAH39dq4xUEAWQTo7J2+eM7LQQS19L1iPep6ulId8GGRuyHvzWoJBj0YXp oDrg== X-Gm-Message-State: AO0yUKXDMmiaEM/FEoVn3IHZ7MpzsigFBi1y58/iazBN28bgcuuXuboA Mf1u/M/b1+sWENTy2dplBY5BCqEjLrge X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a05:6902:291:b0:9a0:1d7b:707b with SMTP id v17-20020a056902029100b009a01d7b707bmr49063ybh.4.1676799137260; Sun, 19 Feb 2023 01:32:17 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:21 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-25-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 24/51] perf vendor events intel: Refresh knightslanding events From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758251957164275786?= X-GMAIL-MSGID: =?utf-8?q?1758251957164275786?= Update the knightslanding events from 9 to 10. Generation was done using https://github.com/intel/perfmon. The most notable change is in corrections to event descriptions. Signed-off-by: Ian Rogers --- .../arch/x86/knightslanding/cache.json | 94 +++++++++---------- .../arch/x86/knightslanding/pipeline.json | 8 +- .../arch/x86/knightslanding/uncore-other.json | 8 +- tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- 4 files changed, 56 insertions(+), 56 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/knightslanding/cache.json b/tools/perf/pmu-events/arch/x86/knightslanding/cache.json index 01aea3d2832e..d9876cb06b08 100644 --- a/tools/perf/pmu-events/arch/x86/knightslanding/cache.json +++ b/tools/perf/pmu-events/arch/x86/knightslanding/cache.json @@ -6,7 +6,7 @@ "SampleAfterValue": "200003" }, { - "BriefDescription": "Counts the number of core cycles the fetch stalls because of an icache miss. This is a cummulative count of core cycles the fetch stalled for all icache misses.", + "BriefDescription": "Counts the number of core cycles the fetch stalls because of an icache miss. This is a cumulative count of core cycles the fetch stalled for all icache misses.", "EventCode": "0x86", "EventName": "FETCH_STALL.ICACHE_FILL_PENDING_CYCLES", "PublicDescription": "This event counts the number of core cycles the fetch stalls because of an icache miss. This is a cumulative count of cycles the NIP stalled for all icache misses.", @@ -28,7 +28,7 @@ "UMask": "0x4f" }, { - "BriefDescription": "Counts the number of MEC requests from the L2Q that reference a cache line (cacheable requests) exlcuding SW prefetches filling only to L2 cache and L1 evictions (automatically exlcudes L2HWP, UC, WC) that were rejected - Multiple repeated rejects should be counted multiple times", + "BriefDescription": "Counts the number of MEC requests from the L2Q that reference a cache line (cacheable requests) excluding SW prefetches filling only to L2 cache and L1 evictions (automatically exlcudes L2HWP, UC, WC) that were rejected - Multiple repeated rejects should be counted multiple times", "EventCode": "0x30", "EventName": "L2_REQUESTS_REJECT.ALL", "SampleAfterValue": "200003" @@ -108,7 +108,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts Demand code reads and prefetch code read requests that accounts for reponses from snoop request hit with data forwarded from it Far(not in the same quadrant as the request)-other tile L2 in E/F/M state. Valid only in SNC4 Cluster mode.", + "BriefDescription": "Counts Demand code reads and prefetch code read requests that accounts for responses from snoop request hit with data forwarded from it Far(not in the same quadrant as the request)-other tile L2 in E/F/M state. Valid only in SNC4 Cluster mode.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.ANY_CODE_RD.L2_HIT_FAR_TILE", "MSRIndex": "0x1a6,0x1a7", @@ -135,7 +135,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts Demand code reads and prefetch code read requests that accounts for reponses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", + "BriefDescription": "Counts Demand code reads and prefetch code read requests that accounts for responses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.ANY_CODE_RD.L2_HIT_NEAR_TILE", "MSRIndex": "0x1a6,0x1a7", @@ -198,7 +198,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts Demand code reads and prefetch code read requests that are outstanding, per weighted cycle, from the time of the request to when any response is received. The oustanding response should be programmed only on PMC0.", + "BriefDescription": "Counts Demand code reads and prefetch code read requests that are outstanding, per weighted cycle, from the time of the request to when any response is received. The outstanding response should be programmed only on PMC0.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.ANY_CODE_RD.OUTSTANDING", "MSRIndex": "0x1a6", @@ -216,7 +216,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts Demand cacheable data and L1 prefetch data read requests that accounts for reponses from snoop request hit with data forwarded from it Far(not in the same quadrant as the request)-other tile L2 in E/F/M state. Valid only in SNC4 Cluster mode.", + "BriefDescription": "Counts Demand cacheable data and L1 prefetch data read requests that accounts for responses from snoop request hit with data forwarded from it Far(not in the same quadrant as the request)-other tile L2 in E/F/M state. Valid only in SNC4 Cluster mode.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.ANY_DATA_RD.L2_HIT_FAR_TILE", "MSRIndex": "0x1a6,0x1a7", @@ -243,7 +243,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts Demand cacheable data and L1 prefetch data read requests that accounts for reponses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", + "BriefDescription": "Counts Demand cacheable data and L1 prefetch data read requests that accounts for responses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.ANY_DATA_RD.L2_HIT_NEAR_TILE", "MSRIndex": "0x1a6,0x1a7", @@ -306,7 +306,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts Demand cacheable data and L1 prefetch data read requests that are outstanding, per weighted cycle, from the time of the request to when any response is received. The oustanding response should be programmed only on PMC0.", + "BriefDescription": "Counts Demand cacheable data and L1 prefetch data read requests that are outstanding, per weighted cycle, from the time of the request to when any response is received. The outstanding response should be programmed only on PMC0.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.ANY_DATA_RD.OUTSTANDING", "MSRIndex": "0x1a6", @@ -324,7 +324,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts any Prefetch requests that accounts for reponses from snoop request hit with data forwarded from it Far(not in the same quadrant as the request)-other tile L2 in E/F/M state. Valid only in SNC4 Cluster mode.", + "BriefDescription": "Counts any Prefetch requests that accounts for responses from snoop request hit with data forwarded from it Far(not in the same quadrant as the request)-other tile L2 in E/F/M state. Valid only in SNC4 Cluster mode.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.ANY_PF_L2.L2_HIT_FAR_TILE", "MSRIndex": "0x1a6,0x1a7", @@ -351,7 +351,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts any Prefetch requests that accounts for reponses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", + "BriefDescription": "Counts any Prefetch requests that accounts for responses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.ANY_PF_L2.L2_HIT_NEAR_TILE", "MSRIndex": "0x1a6,0x1a7", @@ -405,7 +405,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts any Prefetch requests that are outstanding, per weighted cycle, from the time of the request to when any response is received. The oustanding response should be programmed only on PMC0.", + "BriefDescription": "Counts any Prefetch requests that are outstanding, per weighted cycle, from the time of the request to when any response is received. The outstanding response should be programmed only on PMC0.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.ANY_PF_L2.OUTSTANDING", "MSRIndex": "0x1a6", @@ -423,7 +423,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts any Read request that accounts for reponses from snoop request hit with data forwarded from it Far(not in the same quadrant as the request)-other tile L2 in E/F/M state. Valid only in SNC4 Cluster mode.", + "BriefDescription": "Counts any Read request that accounts for responses from snoop request hit with data forwarded from it Far(not in the same quadrant as the request)-other tile L2 in E/F/M state. Valid only in SNC4 Cluster mode.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.ANY_READ.L2_HIT_FAR_TILE", "MSRIndex": "0x1a6,0x1a7", @@ -450,7 +450,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts any Read request that accounts for reponses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", + "BriefDescription": "Counts any Read request that accounts for responses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.ANY_READ.L2_HIT_NEAR_TILE", "MSRIndex": "0x1a6,0x1a7", @@ -513,7 +513,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts any Read request that are outstanding, per weighted cycle, from the time of the request to when any response is received. The oustanding response should be programmed only on PMC0.", + "BriefDescription": "Counts any Read request that are outstanding, per weighted cycle, from the time of the request to when any response is received. The outstanding response should be programmed only on PMC0.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.ANY_READ.OUTSTANDING", "MSRIndex": "0x1a6", @@ -531,7 +531,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts any request that accounts for reponses from snoop request hit with data forwarded from it Far(not in the same quadrant as the request)-other tile L2 in E/F/M state. Valid only in SNC4 Cluster mode.", + "BriefDescription": "Counts any request that accounts for responses from snoop request hit with data forwarded from it Far(not in the same quadrant as the request)-other tile L2 in E/F/M state. Valid only in SNC4 Cluster mode.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.ANY_REQUEST.L2_HIT_FAR_TILE", "MSRIndex": "0x1a6,0x1a7", @@ -558,7 +558,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts any request that accounts for reponses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", + "BriefDescription": "Counts any request that accounts for responses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.ANY_REQUEST.L2_HIT_NEAR_TILE", "MSRIndex": "0x1a6,0x1a7", @@ -621,7 +621,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts any request that are outstanding, per weighted cycle, from the time of the request to when any response is received. The oustanding response should be programmed only on PMC0.", + "BriefDescription": "Counts any request that are outstanding, per weighted cycle, from the time of the request to when any response is received. The outstanding response should be programmed only on PMC0.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.ANY_REQUEST.OUTSTANDING", "MSRIndex": "0x1a6", @@ -639,7 +639,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts Demand cacheable data write requests that accounts for reponses from snoop request hit with data forwarded from it Far(not in the same quadrant as the request)-other tile L2 in E/F/M state. Valid only in SNC4 Cluster mode.", + "BriefDescription": "Counts Demand cacheable data write requests that accounts for responses from snoop request hit with data forwarded from it Far(not in the same quadrant as the request)-other tile L2 in E/F/M state. Valid only in SNC4 Cluster mode.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.ANY_RFO.L2_HIT_FAR_TILE", "MSRIndex": "0x1a6,0x1a7", @@ -666,7 +666,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts Demand cacheable data write requests that accounts for reponses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", + "BriefDescription": "Counts Demand cacheable data write requests that accounts for responses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.ANY_RFO.L2_HIT_NEAR_TILE", "MSRIndex": "0x1a6,0x1a7", @@ -729,7 +729,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts Demand cacheable data write requests that are outstanding, per weighted cycle, from the time of the request to when any response is received. The oustanding response should be programmed only on PMC0.", + "BriefDescription": "Counts Demand cacheable data write requests that are outstanding, per weighted cycle, from the time of the request to when any response is received. The outstanding response should be programmed only on PMC0.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.ANY_RFO.OUTSTANDING", "MSRIndex": "0x1a6", @@ -747,7 +747,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts Bus locks and split lock requests that accounts for reponses from snoop request hit with data forwarded from it Far(not in the same quadrant as the request)-other tile L2 in E/F/M state. Valid only in SNC4 Cluster mode.", + "BriefDescription": "Counts Bus locks and split lock requests that accounts for responses from snoop request hit with data forwarded from it Far(not in the same quadrant as the request)-other tile L2 in E/F/M state. Valid only in SNC4 Cluster mode.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.BUS_LOCKS.L2_HIT_FAR_TILE", "MSRIndex": "0x1a6,0x1a7", @@ -774,7 +774,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts Bus locks and split lock requests that accounts for reponses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", + "BriefDescription": "Counts Bus locks and split lock requests that accounts for responses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.BUS_LOCKS.L2_HIT_NEAR_TILE", "MSRIndex": "0x1a6,0x1a7", @@ -837,7 +837,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts Bus locks and split lock requests that are outstanding, per weighted cycle, from the time of the request to when any response is received. The oustanding response should be programmed only on PMC0.", + "BriefDescription": "Counts Bus locks and split lock requests that are outstanding, per weighted cycle, from the time of the request to when any response is received. The outstanding response should be programmed only on PMC0.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.BUS_LOCKS.OUTSTANDING", "MSRIndex": "0x1a6", @@ -855,7 +855,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts demand code reads and prefetch code reads that accounts for reponses from snoop request hit with data forwarded from it Far(not in the same quadrant as the request)-other tile L2 in E/F/M state. Valid only in SNC4 Cluster mode.", + "BriefDescription": "Counts demand code reads and prefetch code reads that accounts for responses from snoop request hit with data forwarded from it Far(not in the same quadrant as the request)-other tile L2 in E/F/M state. Valid only in SNC4 Cluster mode.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.DEMAND_CODE_RD.L2_HIT_FAR_TILE", "MSRIndex": "0x1a6,0x1a7", @@ -882,7 +882,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts demand code reads and prefetch code reads that accounts for reponses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", + "BriefDescription": "Counts demand code reads and prefetch code reads that accounts for responses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.DEMAND_CODE_RD.L2_HIT_NEAR_TILE", "MSRIndex": "0x1a6,0x1a7", @@ -945,7 +945,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts demand code reads and prefetch code reads that are outstanding, per weighted cycle, from the time of the request to when any response is received. The oustanding response should be programmed only on PMC0.", + "BriefDescription": "Counts demand code reads and prefetch code reads that are outstanding, per weighted cycle, from the time of the request to when any response is received. The outstanding response should be programmed only on PMC0.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.DEMAND_CODE_RD.OUTSTANDING", "MSRIndex": "0x1a6", @@ -1035,7 +1035,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts demand cacheable data and L1 prefetch data reads that are outstanding, per weighted cycle, from the time of the request to when any response is received. The oustanding response should be programmed only on PMC0.", + "BriefDescription": "Counts demand cacheable data and L1 prefetch data reads that are outstanding, per weighted cycle, from the time of the request to when any response is received. The outstanding response should be programmed only on PMC0.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.DEMAND_DATA_RD.OUTSTANDING", "MSRIndex": "0x1a6", @@ -1053,7 +1053,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts Demand cacheable data writes that accounts for reponses from snoop request hit with data forwarded from it Far(not in the same quadrant as the request)-other tile L2 in E/F/M state. Valid only in SNC4 Cluster mode.", + "BriefDescription": "Counts Demand cacheable data writes that accounts for responses from snoop request hit with data forwarded from it Far(not in the same quadrant as the request)-other tile L2 in E/F/M state. Valid only in SNC4 Cluster mode.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.DEMAND_RFO.L2_HIT_FAR_TILE", "MSRIndex": "0x1a6,0x1a7", @@ -1080,7 +1080,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts Demand cacheable data writes that accounts for reponses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", + "BriefDescription": "Counts Demand cacheable data writes that accounts for responses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.DEMAND_RFO.L2_HIT_NEAR_TILE", "MSRIndex": "0x1a6,0x1a7", @@ -1143,7 +1143,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts Demand cacheable data writes that are outstanding, per weighted cycle, from the time of the request to when any response is received. The oustanding response should be programmed only on PMC0.", + "BriefDescription": "Counts Demand cacheable data writes that are outstanding, per weighted cycle, from the time of the request to when any response is received. The outstanding response should be programmed only on PMC0.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.DEMAND_RFO.OUTSTANDING", "MSRIndex": "0x1a6", @@ -1170,7 +1170,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts Partial reads (UC or WC and is valid only for Outstanding response type). that accounts for reponses from snoop request hit with data forwarded from it Far(not in the same quadrant as the request)-other tile L2 in E/F/M state. Valid only in SNC4 Cluster mode.", + "BriefDescription": "Counts Partial reads (UC or WC and is valid only for Outstanding response type). that accounts for responses from snoop request hit with data forwarded from it Far(not in the same quadrant as the request)-other tile L2 in E/F/M state. Valid only in SNC4 Cluster mode.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.PARTIAL_READS.L2_HIT_FAR_TILE", "MSRIndex": "0x1a6,0x1a7", @@ -1197,7 +1197,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts Partial reads (UC or WC and is valid only for Outstanding response type). that accounts for reponses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", + "BriefDescription": "Counts Partial reads (UC or WC and is valid only for Outstanding response type). that accounts for responses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.PARTIAL_READS.L2_HIT_NEAR_TILE", "MSRIndex": "0x1a6,0x1a7", @@ -1260,7 +1260,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts Partial reads (UC or WC and is valid only for Outstanding response type). that are outstanding, per weighted cycle, from the time of the request to when any response is received. The oustanding response should be programmed only on PMC0.", + "BriefDescription": "Counts Partial reads (UC or WC and is valid only for Outstanding response type). that are outstanding, per weighted cycle, from the time of the request to when any response is received. The outstanding response should be programmed only on PMC0.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.PARTIAL_READS.OUTSTANDING", "MSRIndex": "0x1a6", @@ -1287,7 +1287,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts Partial writes (UC or WT or WP and should be programmed on PMC1) that accounts for reponses from snoop request hit with data forwarded from it Far(not in the same quadrant as the request)-other tile L2 in E/F/M state. Valid only in SNC4 Cluster mode.", + "BriefDescription": "Counts Partial writes (UC or WT or WP and should be programmed on PMC1) that accounts for responses from snoop request hit with data forwarded from it Far(not in the same quadrant as the request)-other tile L2 in E/F/M state. Valid only in SNC4 Cluster mode.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.PARTIAL_WRITES.L2_HIT_FAR_TILE", "MSRIndex": "0x1a7", @@ -1314,7 +1314,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts Partial writes (UC or WT or WP and should be programmed on PMC1) that accounts for reponses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", + "BriefDescription": "Counts Partial writes (UC or WT or WP and should be programmed on PMC1) that accounts for responses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.PARTIAL_WRITES.L2_HIT_NEAR_TILE", "MSRIndex": "0x1a7", @@ -1386,7 +1386,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts L1 data HW prefetches that accounts for reponses from snoop request hit with data forwarded from it Far(not in the same quadrant as the request)-other tile L2 in E/F/M state. Valid only in SNC4 Cluster mode.", + "BriefDescription": "Counts L1 data HW prefetches that accounts for responses from snoop request hit with data forwarded from it Far(not in the same quadrant as the request)-other tile L2 in E/F/M state. Valid only in SNC4 Cluster mode.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.PF_L1_DATA_RD.L2_HIT_FAR_TILE", "MSRIndex": "0x1a6,0x1a7", @@ -1413,7 +1413,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts L1 data HW prefetches that accounts for reponses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", + "BriefDescription": "Counts L1 data HW prefetches that accounts for responses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.PF_L1_DATA_RD.L2_HIT_NEAR_TILE", "MSRIndex": "0x1a6,0x1a7", @@ -1476,7 +1476,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts L1 data HW prefetches that are outstanding, per weighted cycle, from the time of the request to when any response is received. The oustanding response should be programmed only on PMC0.", + "BriefDescription": "Counts L1 data HW prefetches that are outstanding, per weighted cycle, from the time of the request to when any response is received. The outstanding response should be programmed only on PMC0.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.PF_L1_DATA_RD.OUTSTANDING", "MSRIndex": "0x1a6", @@ -1494,7 +1494,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts L2 code HW prefetches that accounts for reponses from snoop request hit with data forwarded from it Far(not in the same quadrant as the request)-other tile L2 in E/F/M state. Valid only in SNC4 Cluster mode.", + "BriefDescription": "Counts L2 code HW prefetches that accounts for responses from snoop request hit with data forwarded from it Far(not in the same quadrant as the request)-other tile L2 in E/F/M state. Valid only in SNC4 Cluster mode.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.PF_L2_CODE_RD.L2_HIT_FAR_TILE", "MSRIndex": "0x1a6,0x1a7", @@ -1521,7 +1521,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts L2 code HW prefetches that accounts for reponses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", + "BriefDescription": "Counts L2 code HW prefetches that accounts for responses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.PF_L2_CODE_RD.L2_HIT_NEAR_TILE", "MSRIndex": "0x1a6,0x1a7", @@ -1566,7 +1566,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts L2 code HW prefetches that are outstanding, per weighted cycle, from the time of the request to when any response is received. The oustanding response should be programmed only on PMC0.", + "BriefDescription": "Counts L2 code HW prefetches that are outstanding, per weighted cycle, from the time of the request to when any response is received. The outstanding response should be programmed only on PMC0.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.PF_L2_CODE_RD.OUTSTANDING", "MSRIndex": "0x1a6", @@ -1602,7 +1602,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts L2 data RFO prefetches (includes PREFETCHW instruction) that accounts for reponses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", + "BriefDescription": "Counts L2 data RFO prefetches (includes PREFETCHW instruction) that accounts for responses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.PF_L2_RFO.L2_HIT_NEAR_TILE", "MSRIndex": "0x1a6,0x1a7", @@ -1683,7 +1683,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts Software Prefetches that accounts for reponses from snoop request hit with data forwarded from it Far(not in the same quadrant as the request)-other tile L2 in E/F/M state. Valid only in SNC4 Cluster mode.", + "BriefDescription": "Counts Software Prefetches that accounts for responses from snoop request hit with data forwarded from it Far(not in the same quadrant as the request)-other tile L2 in E/F/M state. Valid only in SNC4 Cluster mode.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.PF_SOFTWARE.L2_HIT_FAR_TILE", "MSRIndex": "0x1a6,0x1a7", @@ -1710,7 +1710,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts Software Prefetches that accounts for reponses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", + "BriefDescription": "Counts Software Prefetches that accounts for responses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.PF_SOFTWARE.L2_HIT_NEAR_TILE", "MSRIndex": "0x1a6,0x1a7", @@ -1773,7 +1773,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts Software Prefetches that are outstanding, per weighted cycle, from the time of the request to when any response is received. The oustanding response should be programmed only on PMC0.", + "BriefDescription": "Counts Software Prefetches that are outstanding, per weighted cycle, from the time of the request to when any response is received. The outstanding response should be programmed only on PMC0.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.PF_SOFTWARE.OUTSTANDING", "MSRIndex": "0x1a6", @@ -1818,7 +1818,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts UC code reads (valid only for Outstanding response type) that accounts for reponses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", + "BriefDescription": "Counts UC code reads (valid only for Outstanding response type) that accounts for responses from snoop request hit with data forwarded from its Near-other tile L2 in E/F/M state", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.UC_CODE_READS.L2_HIT_NEAR_TILE", "MSRIndex": "0x1a6,0x1a7", @@ -1881,7 +1881,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts UC code reads (valid only for Outstanding response type) that are outstanding, per weighted cycle, from the time of the request to when any response is received. The oustanding response should be programmed only on PMC0.", + "BriefDescription": "Counts UC code reads (valid only for Outstanding response type) that are outstanding, per weighted cycle, from the time of the request to when any response is received. The outstanding response should be programmed only on PMC0.", "EventCode": "0xB7", "EventName": "OFFCORE_RESPONSE.UC_CODE_READS.OUTSTANDING", "MSRIndex": "0x1a6", diff --git a/tools/perf/pmu-events/arch/x86/knightslanding/pipeline.json b/tools/perf/pmu-events/arch/x86/knightslanding/pipeline.json index 1b803fa38641..3dc532107ead 100644 --- a/tools/perf/pmu-events/arch/x86/knightslanding/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/knightslanding/pipeline.json @@ -254,14 +254,14 @@ "UMask": "0x80" }, { - "BriefDescription": "Counts the number of occurences a retired load gets blocked because its address overlaps with a store whose data is not ready", + "BriefDescription": "Counts the number of occurrences a retired load gets blocked because its address overlaps with a store whose data is not ready", "EventCode": "0x03", "EventName": "RECYCLEQ.LD_BLOCK_STD_NOTREADY", "SampleAfterValue": "200003", "UMask": "0x2" }, { - "BriefDescription": "Counts the number of occurences a retired load gets blocked because its address partially overlaps with a store", + "BriefDescription": "Counts the number of occurrences a retired load gets blocked because its address partially overlaps with a store", "Data_LA": "1", "EventCode": "0x03", "EventName": "RECYCLEQ.LD_BLOCK_ST_FORWARD", @@ -270,7 +270,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts the number of occurences a retired load that is a cache line split. Each split should be counted only once.", + "BriefDescription": "Counts the number of occurrences a retired load that is a cache line split. Each split should be counted only once.", "Data_LA": "1", "EventCode": "0x03", "EventName": "RECYCLEQ.LD_SPLITS", @@ -293,7 +293,7 @@ "UMask": "0x20" }, { - "BriefDescription": "Counts the number of occurences a retired store that is a cache line split. Each split should be counted only once.", + "BriefDescription": "Counts the number of occurrences a retired store that is a cache line split. Each split should be counted only once.", "EventCode": "0x03", "EventName": "RECYCLEQ.ST_SPLITS", "PublicDescription": "This event counts the number of retired store that experienced a cache line boundary split(Precise Event). Note that each spilt should be counted only once.", diff --git a/tools/perf/pmu-events/arch/x86/knightslanding/uncore-other.json b/tools/perf/pmu-events/arch/x86/knightslanding/uncore-other.json index 3abd9c3fdc48..491cb37ddab0 100644 --- a/tools/perf/pmu-events/arch/x86/knightslanding/uncore-other.json +++ b/tools/perf/pmu-events/arch/x86/knightslanding/uncore-other.json @@ -1084,7 +1084,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Cache Lookups. Counts the number of times the LLC was accessed. Writeback transactions from L2 to the LLC This includes all write transactions -- both Cachable and UC.", + "BriefDescription": "Cache Lookups. Counts the number of times the LLC was accessed. Writeback transactions from L2 to the LLC This includes all write transactions -- both Cacheable and UC.", "EventCode": "0x37", "EventName": "UNC_H_CACHE_LINES_VICTIMIZED.E_STATE", "PerPkg": "1", @@ -1843,7 +1843,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Counts cycles source throttling is adderted - horizontal", + "BriefDescription": "Counts cycles source throttling is asserted - horizontal", "EventCode": "0xA5", "EventName": "UNC_H_FAST_ASSERTED.HORZ", "PerPkg": "1", @@ -1851,7 +1851,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Counts cycles source throttling is adderted - vertical", + "BriefDescription": "Counts cycles source throttling is asserted - vertical", "EventCode": "0xA5", "EventName": "UNC_H_FAST_ASSERTED.VERT", "PerPkg": "1", @@ -2929,7 +2929,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Cache Lookups. Counts the number of times the LLC was accessed. Writeback transactions from L2 to the LLC This includes all write transactions -- both Cachable and UC.", + "BriefDescription": "Cache Lookups. Counts the number of times the LLC was accessed. Writeback transactions from L2 to the LLC This includes all write transactions -- both Cacheable and UC.", "EventCode": "0x34", "EventName": "UNC_H_SF_LOOKUP.WRITE", "PerPkg": "1", diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv index afe811f154d7..41bd13baa265 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -17,7 +17,7 @@ GenuineIntel-6-6[AC],v1.18,icelakex,core GenuineIntel-6-3A,v23,ivybridge,core GenuineIntel-6-3E,v22,ivytown,core GenuineIntel-6-2D,v22,jaketown,core -GenuineIntel-6-(57|85),v9,knightslanding,core +GenuineIntel-6-(57|85),v10,knightslanding,core GenuineIntel-6-A[AC],v1.00,meteorlake,core GenuineIntel-6-1[AEF],v3,nehalemep,core GenuineIntel-6-2E,v3,nehalemex,core From patchwork Sun Feb 19 09:28:22 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59117 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp774986wrn; Sun, 19 Feb 2023 01:42:58 -0800 (PST) X-Google-Smtp-Source: AK7set8ghysHxHLN5wC2VbpwiZ2hVN/d03iRZf54b44iQxqkAz2vu3ldJBHMCxsfewTEEjj16mTk X-Received: by 2002:a17:906:718f:b0:8ae:e2d0:e998 with SMTP id h15-20020a170906718f00b008aee2d0e998mr7520231ejk.31.1676799778401; Sun, 19 Feb 2023 01:42:58 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799778; cv=none; d=google.com; s=arc-20160816; b=zmSlTM/zyp4Ti9uJ0CvTV1ElKJh5JXbouLk9jtCUpmnHB1v2Eud5twz/rIa72pk+yQ mD8uBZtCeYVSH0DZxz2eN0oBMYVUCzhJRRMY09MVDN9mXc7kqWeRpmYZXvww05Qh47ZF VFO2wxnN+GByhqPeE3gwudgltKR46yOOO+eB4TgtaQL4wVuw1M0uwJIXxL1JpAIeoCo3 ErruH7mzUqKGFDKzVkWnaZdTABNxl/b+cVjdYWmLiIa4T7Angguj3ZKJzaWXdCtJyumA EdiQzHT+EkjncNgUK7ObG0h9KSQ+t44aW6m/adGExvP1mpdjpSmDA3hO4X+3fMpiefGE w+gg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:from:subject :references:mime-version:message-id:in-reply-to:date:dkim-signature; bh=+hc9b9b/BO/VXC2/NW+XDZ+ErI1SMduZPX3EY93EMfk=; b=T7GXG+fP3o/+/vJfVytguyQYM2HupTprouSijYbCA2WUwCB/DMhQ0B5M3I6w94+Be5 Tc/HnN6kRG+wjqOtThsteeBZlFh4A//HicMmBTxvXmZYt8BeD7GXiezhHCsjTacKQvCK bkctLg+Ok3f51/7x31jOYloAPN4NUHsPAnqsnp+flYlpaX125Z574XCzet/gjA0rVB1f JH3OwtuD2JUbe/ylxA6i9fvI7RUqlJUmUOWbJ+E4j6uApGOdhCnFTyhrllJGFug/TTwn MQHXCRiLQKT4JXwFWLLFos93Yte//fF4I2V8NBkssUCCdpiEdhLUMw58m0UweloVGNgF XnrA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=kC1sVcdX; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id v24-20020a17090610d800b008ba6b182877si6838916ejv.698.2023.02.19.01.42.35; Sun, 19 Feb 2023 01:42:58 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=kC1sVcdX; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229993AbjBSJe1 (ORCPT + 99 others); Sun, 19 Feb 2023 04:34:27 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56800 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229963AbjBSJeX (ORCPT ); Sun, 19 Feb 2023 04:34:23 -0500 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5227012599 for ; Sun, 19 Feb 2023 01:33:10 -0800 (PST) Received: by mail-yb1-xb4a.google.com with SMTP id 5-20020a250105000000b009802c10698cso866900ybb.22 for ; Sun, 19 Feb 2023 01:33:10 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:from:to:cc:subject:date :message-id:reply-to; bh=+hc9b9b/BO/VXC2/NW+XDZ+ErI1SMduZPX3EY93EMfk=; b=kC1sVcdXsB0uUQ1w+SkmrhM8B96rLLbj15fdHOc3sch9t/qU63JwZZ0xgD+MyUjX6g BHbuLumrVLMgR6q2CrqYl/GnJ+9PwkOfkhHz5aKbsx2UZmwo/QlxBGTmP6TzcmMeAxXW MCR0JOjbsu5D3j3qdtPmzW0UKf/ha+WqUihiNA019ww5tnxT/yBm6wlrQFsfZ1kBu0rb w3d/iJ6+x6kB/ATkJle3cMaRU2/ibHAGo/vlwjIzsNSACYRTXu5w55jU/m7D9MOVi5Iv YOOuzdPcEc4PD2bjlYYLXLKVNFehxBd2oOjZMaT2D0EHvN1IegU/faMPLAERWJSuWIAC UwvA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=+hc9b9b/BO/VXC2/NW+XDZ+ErI1SMduZPX3EY93EMfk=; b=k849Hs2AD9J+qBPGVtQJs4klsaz1QmsaAZWqxvafgrAqcvH4OMDQGg0Ah7XWKp+tnd V4Y38+7EMrhxu2VUmvSuiTETv05X6fQ6DhN1PA4UPTuBnsQ+CSk18pla+4/UfOkHxnWe LOADfTlO/hRerOufQQ2NbNtNN2a7dtUA4AMaWLSemny0mG5x+zdyQLxN7FlJsIgSwyZN 7QeLqv1aUd0TT0WW/AYJs0/NDumH3kW9JPKfdA8F934HwsiigULa/OfZF+s39hbhA3h/ 4/hdLZvgjy95Be8XAPaLDTtKD7Ts2QGZXYTJZF5PzyxHMEecdElmT8/O6evv09/a556j J4NQ== X-Gm-Message-State: AO0yUKWe4MVnRIss8ld+K7J7gFRozAYyutZeoN8XvpImwF1liHygPL0q 4BVIo8OZuWkwJtB7TZLhT7OtGmaHjSD1 X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a05:6902:4e1:b0:855:fd7e:b84e with SMTP id w1-20020a05690204e100b00855fd7eb84emr3360ybs.10.1676799147190; Sun, 19 Feb 2023 01:32:27 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:22 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-26-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 25/51] perf vendor events intel: Refresh sandybridge events From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758252004802840699?= X-GMAIL-MSGID: =?utf-8?q?1758252004802840699?= Update the sandybridge events from 17 to 18. Generation was done using https://github.com/intel/perfmon. Notable changes are improved event descriptions, TMA metrics are updated to version 4.5, TMA info metrics are renamed from their node name to be lower case and prefixed by tma_info_, MetricThreshold expressions are added, and the smi_cost metric group is added replicating existing hard coded metrics in stat-shadow. Signed-off-by: Ian Rogers --- tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- .../arch/x86/sandybridge/cache.json | 8 +- .../arch/x86/sandybridge/floating-point.json | 2 +- .../arch/x86/sandybridge/frontend.json | 12 +- .../arch/x86/sandybridge/pipeline.json | 2 +- .../arch/x86/sandybridge/snb-metrics.json | 601 ++++++++++-------- 6 files changed, 344 insertions(+), 283 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv index 41bd13baa265..02715cbe4fd8 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -21,7 +21,7 @@ GenuineIntel-6-(57|85),v10,knightslanding,core GenuineIntel-6-A[AC],v1.00,meteorlake,core GenuineIntel-6-1[AEF],v3,nehalemep,core GenuineIntel-6-2E,v3,nehalemex,core -GenuineIntel-6-2A,v17,sandybridge,core +GenuineIntel-6-2A,v18,sandybridge,core GenuineIntel-6-(8F|CF),v1.09,sapphirerapids,core GenuineIntel-6-(37|4A|4C|4D|5A),v14,silvermont,core GenuineIntel-6-(4E|5E|8E|9E|A5|A6),v53,skylake,core diff --git a/tools/perf/pmu-events/arch/x86/sandybridge/cache.json b/tools/perf/pmu-events/arch/x86/sandybridge/cache.json index 65696ea2a581..4e5572ee7dfe 100644 --- a/tools/perf/pmu-events/arch/x86/sandybridge/cache.json +++ b/tools/perf/pmu-events/arch/x86/sandybridge/cache.json @@ -37,7 +37,7 @@ "UMask": "0x5" }, { - "BriefDescription": "Cycles a demand request was blocked due to Fill Buffers inavailability.", + "BriefDescription": "Cycles a demand request was blocked due to Fill Buffers unavailability.", "CounterMask": "1", "EventCode": "0x48", "EventName": "L1D_PEND_MISS.FB_FULL", @@ -45,7 +45,7 @@ "UMask": "0x2" }, { - "BriefDescription": "L1D miss oustandings duration in cycles.", + "BriefDescription": "L1D miss outstanding duration in cycles.", "EventCode": "0x48", "EventName": "L1D_PEND_MISS.PENDING", "SampleAfterValue": "2000003", @@ -493,7 +493,7 @@ "UMask": "0x8" }, { - "BriefDescription": "Cacheable and noncachaeble code read requests.", + "BriefDescription": "Cacheable and noncacheable code read requests.", "EventCode": "0xB0", "EventName": "OFFCORE_REQUESTS.DEMAND_CODE_RD", "SampleAfterValue": "100003", @@ -898,7 +898,7 @@ "UMask": "0x1" }, { - "BriefDescription": "COREWB & ANY_RESPONSE", + "BriefDescription": "OFFCORE_RESPONSE.COREWB.ANY_RESPONSE", "EventCode": "0xB7, 0xBB", "EventName": "OFFCORE_RESPONSE.COREWB.ANY_RESPONSE", "MSRIndex": "0x1a6,0x1a7", diff --git a/tools/perf/pmu-events/arch/x86/sandybridge/floating-point.json b/tools/perf/pmu-events/arch/x86/sandybridge/floating-point.json index 8c2a246adef9..79e8f403c426 100644 --- a/tools/perf/pmu-events/arch/x86/sandybridge/floating-point.json +++ b/tools/perf/pmu-events/arch/x86/sandybridge/floating-point.json @@ -64,7 +64,7 @@ "UMask": "0x20" }, { - "BriefDescription": "Number of FP Computational Uops Executed this cycle. The number of FADD, FSUB, FCOM, FMULs, integer MULsand IMULs, FDIVs, FPREMs, FSQRTS, integer DIVs, and IDIVs. This event does not distinguish an FADD used in the middle of a transcendental flow from a s.", + "BriefDescription": "Number of FP Computational Uops Executed this cycle. The number of FADD, FSUB, FCOM, FMULs, integer MULs and IMULs, FDIVs, FPREMs, FSQRTS, integer DIVs, and IDIVs. This event does not distinguish an FADD used in the middle of a transcendental flow from a s.", "EventCode": "0x10", "EventName": "FP_COMP_OPS_EXE.X87", "SampleAfterValue": "2000003", diff --git a/tools/perf/pmu-events/arch/x86/sandybridge/frontend.json b/tools/perf/pmu-events/arch/x86/sandybridge/frontend.json index 69ab8d215f84..700716b42f1a 100644 --- a/tools/perf/pmu-events/arch/x86/sandybridge/frontend.json +++ b/tools/perf/pmu-events/arch/x86/sandybridge/frontend.json @@ -134,7 +134,7 @@ "UMask": "0x4" }, { - "BriefDescription": "Cycles when uops are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequenser (MS) is busy.", + "BriefDescription": "Cycles when uops are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy.", "CounterMask": "1", "EventCode": "0x79", "EventName": "IDQ.MS_CYCLES", @@ -143,7 +143,7 @@ "UMask": "0x30" }, { - "BriefDescription": "Cycles when uops initiated by Decode Stream Buffer (DSB) are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequenser (MS) is busy.", + "BriefDescription": "Cycles when uops initiated by Decode Stream Buffer (DSB) are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy.", "CounterMask": "1", "EventCode": "0x79", "EventName": "IDQ.MS_DSB_CYCLES", @@ -151,7 +151,7 @@ "UMask": "0x10" }, { - "BriefDescription": "Deliveries to Instruction Decode Queue (IDQ) initiated by Decode Stream Buffer (DSB) while Microcode Sequenser (MS) is busy.", + "BriefDescription": "Deliveries to Instruction Decode Queue (IDQ) initiated by Decode Stream Buffer (DSB) while Microcode Sequencer (MS) is busy.", "CounterMask": "1", "EdgeDetect": "1", "EventCode": "0x79", @@ -160,14 +160,14 @@ "UMask": "0x10" }, { - "BriefDescription": "Uops initiated by Decode Stream Buffer (DSB) that are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequenser (MS) is busy.", + "BriefDescription": "Uops initiated by Decode Stream Buffer (DSB) that are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy.", "EventCode": "0x79", "EventName": "IDQ.MS_DSB_UOPS", "SampleAfterValue": "2000003", "UMask": "0x10" }, { - "BriefDescription": "Uops initiated by MITE and delivered to Instruction Decode Queue (IDQ) while Microcode Sequenser (MS) is busy.", + "BriefDescription": "Uops initiated by MITE and delivered to Instruction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy.", "EventCode": "0x79", "EventName": "IDQ.MS_MITE_UOPS", "SampleAfterValue": "2000003", @@ -183,7 +183,7 @@ "UMask": "0x30" }, { - "BriefDescription": "Uops delivered to Instruction Decode Queue (IDQ) while Microcode Sequenser (MS) is busy.", + "BriefDescription": "Uops delivered to Instruction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy.", "EventCode": "0x79", "EventName": "IDQ.MS_UOPS", "SampleAfterValue": "2000003", diff --git a/tools/perf/pmu-events/arch/x86/sandybridge/pipeline.json b/tools/perf/pmu-events/arch/x86/sandybridge/pipeline.json index 53ab5993e8b0..54454e5e262c 100644 --- a/tools/perf/pmu-events/arch/x86/sandybridge/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/sandybridge/pipeline.json @@ -509,7 +509,7 @@ "BriefDescription": "Cases when loads get true Block-on-Store blocking code preventing store forwarding.", "EventCode": "0x03", "EventName": "LD_BLOCKS.STORE_FORWARD", - "PublicDescription": "This event counts loads that followed a store to the same address, where the data could not be forwarded inside the pipeline from the store to the load. The most common reason why store forwarding would be blocked is when a load's address range overlaps with a preceeding smaller uncompleted store. See the table of not supported store forwards in the Intel(R) 64 and IA-32 Architectures Optimization Reference Manual. The penalty for blocked store forwarding is that the load must wait for the store to complete before it can be issued.", + "PublicDescription": "This event counts loads that followed a store to the same address, where the data could not be forwarded inside the pipeline from the store to the load. The most common reason why store forwarding would be blocked is when a load's address range overlaps with a preceding smaller uncompleted store. See the table of not supported store forwards in the Intel(R) 64 and IA-32 Architectures Optimization Reference Manual. The penalty for blocked store forwarding is that the load must wait for the store to complete before it can be issued.", "SampleAfterValue": "100003", "UMask": "0x2" }, diff --git a/tools/perf/pmu-events/arch/x86/sandybridge/snb-metrics.json b/tools/perf/pmu-events/arch/x86/sandybridge/snb-metrics.json index a7b3c835b03d..4a99fe515f4b 100644 --- a/tools/perf/pmu-events/arch/x86/sandybridge/snb-metrics.json +++ b/tools/perf/pmu-events/arch/x86/sandybridge/snb-metrics.json @@ -1,449 +1,510 @@ [ { - "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", - "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / SLOTS", - "MetricGroup": "PGO;TopdownL1;tma_L1_group", - "MetricName": "tma_frontend_bound", - "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Machine_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound.", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", - "MetricExpr": "4 * min(CPU_CLK_UNHALTED.THREAD, IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE) / SLOTS", - "MetricGroup": "Frontend;TopdownL2;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_latency", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: RS_EVENTS.EMPTY_END", + "BriefDescription": "C2 residency percent per package", + "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C2_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses", - "MetricExpr": "(12 * ITLB_MISSES.STLB_HIT + ITLB_MISSES.WALK_DURATION) / CLKS", - "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_itlb_misses", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: ITLB_MISSES.WALK_COMPLETED", + "BriefDescription": "C3 residency percent per core", + "MetricExpr": "cstate_core@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", - "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY) / CLKS", - "MetricGroup": "FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_branch_resteers", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "BriefDescription": "C3 residency percent per package", + "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / CLKS", - "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_dsb_switches", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty.", + "BriefDescription": "C6 residency percent per core", + "MetricExpr": "cstate_core@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", - "MetricExpr": "ILD_STALL.LCP / CLKS", - "MetricGroup": "FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_lcp", - "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs.", + "BriefDescription": "C6 residency percent per package", + "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", - "MetricExpr": "3 * IDQ.MS_SWITCHES / CLKS", - "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_ms_switches", - "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES", + "BriefDescription": "C7 residency percent per core", + "MetricExpr": "cstate_core@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", - "MetricExpr": "tma_frontend_bound - tma_fetch_latency", - "MetricGroup": "FetchBW;Frontend;TopdownL2;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_bandwidth", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend.", + "BriefDescription": "C7 residency percent per package", + "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", - "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group", - "MetricName": "tma_bad_speculation", - "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", - "ScaleUnit": "100%" + "BriefDescription": "Uncore frequency per die [GHZ]", + "MetricExpr": "tma_info_socket_clks / #num_dies / duration_time / 1e9", + "MetricGroup": "SoC", + "MetricName": "UNCORE_FREQ" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", - "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL2;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_branch_mispredicts", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "BriefDescription": "Percentage of cycles spent in System Management Interrupts.", + "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)", + "MetricGroup": "smi", + "MetricName": "smi_cycles", + "MetricThreshold": "smi_cycles > 0.1", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", - "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts", - "MetricGroup": "BadSpec;MachineClears;TopdownL2;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_machine_clears", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT", - "ScaleUnit": "100%" + "BriefDescription": "Number of SMI interrupts.", + "MetricExpr": "msr@smi@", + "MetricGroup": "smi", + "MetricName": "smi_num", + "ScaleUnit": "1SMI#" }, { "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "1 - (tma_frontend_bound + tma_bad_speculation + tma_retiring)", - "MetricGroup": "TopdownL1;tma_L1_group", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", + "MetricThreshold": "tma_backend_bound > 0.2", "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", - "MetricExpr": "(min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_L1D_PENDING) + RESOURCE_STALLS.SB) / (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_DISPATCH) + cpu@UOPS_DISPATCHED.THREAD\\,cmask\\=1@ - cpu@UOPS_DISPATCHED.THREAD\\,cmask\\=3@ if IPC > 1.8 else (cpu@UOPS_DISPATCHED.THREAD\\,cmask\\=2@ - RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else RESOURCE_STALLS.SB)) * tma_backend_bound", - "MetricGroup": "Backend;TopdownL2;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_memory_bound", - "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "(7 * DTLB_LOAD_MISSES.STLB_HIT + DTLB_LOAD_MISSES.WALK_DURATION) / CLKS", - "MetricGroup": "MemoryTLB;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_dtlb_load", - "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", - "MetricExpr": "MEM_LOAD_UOPS_RETIRED.LLC_HIT / (MEM_LOAD_UOPS_RETIRED.LLC_HIT + 7 * MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS) * CYCLE_ACTIVITY.STALLS_L2_PENDING / CLKS", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l3_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", - "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.LLC_HIT / (MEM_LOAD_UOPS_RETIRED.LLC_HIT + 7 * MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS)) * CYCLE_ACTIVITY.STALLS_L2_PENDING / CLKS", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_dram_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_MISS_PS", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=6@) / CLKS", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_dram_bound_group", - "MetricName": "tma_mem_bandwidth", - "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that).", + "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", + "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_bad_speculation", + "MetricThreshold": "tma_bad_speculation > 0.15", + "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / CLKS - tma_mem_bandwidth", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_dram_bound_group", - "MetricName": "tma_mem_latency", - "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that).", + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation", + "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM", + "MetricName": "tma_branch_mispredicts", + "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_info_branch_misprediction_cost, tma_mispredicts_resteers", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", - "MetricExpr": "RESOURCE_STALLS.SB / CLKS", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_store_bound", - "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_UOPS_RETIRED.ALL_STORES_PS", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", + "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY) / tma_info_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_branch_resteers", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "tma_backend_bound - tma_memory_bound", - "MetricGroup": "Backend;Compute;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", - "MetricExpr": "ARITH.FPU_DIV_ACTIVE / CORE_CLKS", - "MetricGroup": "TopdownL3;tma_core_bound_group", + "MetricExpr": "ARITH.FPU_DIV_ACTIVE / tma_info_core_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_UOPS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", - "MetricExpr": "((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_DISPATCH) + cpu@UOPS_DISPATCHED.THREAD\\,cmask\\=1@ - cpu@UOPS_DISPATCHED.THREAD\\,cmask\\=3@ if IPC > 1.8 else (cpu@UOPS_DISPATCHED.THREAD\\,cmask\\=2@ - RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else RESOURCE_STALLS.SB)) - RESOURCE_STALLS.SB - min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_L1D_PENDING)) / CLKS", - "MetricGroup": "PortsUtil;TopdownL3;tma_core_bound_group", - "MetricName": "tma_ports_utilization", - "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", + "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", + "MetricConstraint": "NO_GROUP_EVENTS_SMT", + "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.LLC_HIT / (MEM_LOAD_UOPS_RETIRED.LLC_HIT + 7 * MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS)) * CYCLE_ACTIVITY.STALLS_L2_PENDING / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_MISS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group", - "MetricName": "tma_retiring", - "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.RETIRE_SLOTS", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_clks", + "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_dsb_switches", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Related metrics: tma_fetch_bandwidth, tma_info_dsb_coverage, tma_lcp", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", - "MetricExpr": "tma_retiring - tma_heavy_operations", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_retiring_group", - "MetricName": "tma_light_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", + "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", + "MetricExpr": "(7 * DTLB_LOAD_MISSES.STLB_HIT + DTLB_LOAD_MISSES.WALK_DURATION) / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group", + "MetricName": "tma_dtlb_load", + "MetricThreshold": "tma_dtlb_load > 0.1", + "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", + "MetricExpr": "tma_frontend_bound - tma_fetch_latency", + "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Related metrics: tma_dsb_switches, tma_info_dsb_coverage, tma_lcp", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", + "MetricExpr": "4 * min(CPU_CLK_UNHALTED.THREAD, IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE) / tma_info_slots", + "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: RS_EVENTS.EMPTY_END", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", - "MetricGroup": "HPC;TopdownL3;tma_light_operations_group", + "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group", "MetricName": "tma_fp_arith", + "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6", "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", "ScaleUnit": "100%" }, - { - "BriefDescription": "This metric serves as an approximation of legacy x87 usage", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS * FP_COMP_OPS_EXE.X87 / UOPS_DISPATCHED.THREAD", - "MetricGroup": "Compute;TopdownL4;tma_fp_arith_group", - "MetricName": "tma_x87_use", - "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", - "ScaleUnit": "100%" - }, { "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired", "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE) / UOPS_DISPATCHED.THREAD", - "MetricGroup": "Compute;Flops;TopdownL4;tma_fp_arith_group", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", "MetricName": "tma_fp_scalar", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting.", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_512b, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths", "MetricExpr": "(FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + FP_COMP_OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) / UOPS_DISPATCHED.THREAD", - "MetricGroup": "Compute;Flops;TopdownL4;tma_fp_arith_group", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", "MetricName": "tma_fp_vector", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting.", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_512b, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", + "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / tma_info_slots", + "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences", "MetricExpr": "tma_microcode_sequencer", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", "MetricName": "tma_heavy_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences. This highly-correlates with the uop length of these instructions/sequences.", + "MetricThreshold": "tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / SLOTS", - "MetricGroup": "MicroSeq;TopdownL3;tma_heavy_operations_group", - "MetricName": "tma_microcode_sequencer", - "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS", - "ScaleUnit": "100%" + "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", + "MetricExpr": "tma_info_turbo_utilization * TSC / 1e9 / duration_time", + "MetricGroup": "Power;Summary", + "MetricName": "tma_info_average_frequency" }, { - "BriefDescription": "Instructions Per Cycle (per Logical Processor)", - "MetricExpr": "INST_RETIRED.ANY / CLKS", - "MetricGroup": "Ret;Summary", - "MetricName": "IPC" + "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "Pipeline", + "MetricName": "tma_info_clks" }, { - "BriefDescription": "Uops Per Instruction", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY", - "MetricGroup": "Pipeline;Ret;Retire", - "MetricName": "UPI" + "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_clks))", + "MetricGroup": "SMT", + "MetricName": "tma_info_core_clks" + }, + { + "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_core_clks", + "MetricGroup": "Ret;SMT;TmaL1;tma_L1_group", + "MetricName": "tma_info_coreipc" }, { "BriefDescription": "Cycles Per Instruction (per Logical Processor)", - "MetricExpr": "1 / IPC", + "MetricExpr": "1 / tma_info_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "CPI" + "MetricName": "tma_info_cpi" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "Pipeline", - "MetricName": "CLKS" + "BriefDescription": "Average CPU Utilization", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": "HPC;Summary", + "MetricName": "tma_info_cpu_utilization" }, { - "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "4 * CORE_CLKS", - "MetricGroup": "tma_L1_group", - "MetricName": "SLOTS" + "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", + "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_REQUESTS.ALL) / 1e6 / duration_time / 1e3", + "MetricGroup": "HPC;Mem;MemoryBW;SoC;tma_issueBW", + "MetricName": "tma_info_dram_bw_use", + "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_mem_bandwidth" + }, + { + "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_dsb_coverage", + "MetricThreshold": "tma_info_dsb_coverage < 0.7 & tma_info_ipc / 4 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_lcp" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_DISPATCHED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", - "MetricName": "Execute_per_Issue", + "MetricName": "tma_info_execute_per_issue", "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage." }, - { - "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", - "MetricExpr": "INST_RETIRED.ANY / CORE_CLKS", - "MetricGroup": "Ret;SMT;tma_L1_group", - "MetricName": "CoreIPC" - }, { "BriefDescription": "Floating Point Operations Per Cycle", - "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE + 2 * FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + 4 * (FP_COMP_OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) + 8 * SIMD_FP_256.PACKED_SINGLE) / CORE_CLKS", + "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE + 2 * FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + 4 * (FP_COMP_OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) + 8 * SIMD_FP_256.PACKED_SINGLE) / tma_info_core_clks", "MetricGroup": "Flops;Ret", - "MetricName": "FLOPc" + "MetricName": "tma_info_flopc" + }, + { + "BriefDescription": "Giga Floating Point Operations Per Second", + "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE + 2 * FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + 4 * (FP_COMP_OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) + 8 * SIMD_FP_256.PACKED_SINGLE) / 1e9 / duration_time", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_gflops", + "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine." }, { "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", "MetricExpr": "UOPS_DISPATCHED.THREAD / (cpu@UOPS_DISPATCHED.CORE\\,cmask\\=1@ / 2 if #SMT_on else cpu@UOPS_DISPATCHED.CORE\\,cmask\\=1@)", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", - "MetricName": "ILP" + "MetricName": "tma_info_ilp" }, { - "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else CLKS))", - "MetricGroup": "SMT", - "MetricName": "CORE_CLKS" + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Summary;TmaL1;tma_L1_group", + "MetricName": "tma_info_instructions", + "PublicDescription": "Total number of retired Instructions. Sample with: INST_RETIRED.PREC_DIST" }, { - "BriefDescription": "Total number of retired Instructions Sample with: INST_RETIRED.PREC_DIST", - "MetricExpr": "INST_RETIRED.ANY", - "MetricGroup": "Summary;tma_L1_group", - "MetricName": "Instructions" + "BriefDescription": "Instructions Per Cycle (per Logical Processor)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_clks", + "MetricGroup": "Ret;Summary", + "MetricName": "tma_info_ipc" }, { - "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=1@", - "MetricGroup": "Pipeline;Ret", - "MetricName": "Retire" + "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", + "MetricGroup": "Branches;OS", + "MetricName": "tma_info_ipfarbranch", + "MetricThreshold": "tma_info_ipfarbranch < 1e6" }, { - "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", - "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", - "MetricGroup": "DSB;Fed;FetchBW", - "MetricName": "DSB_Coverage" + "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_cpi" }, { - "BriefDescription": "Average CPU Utilization", - "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", - "MetricGroup": "HPC;Summary", - "MetricName": "CPU_Utilization" + "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_utilization", + "MetricThreshold": "tma_info_kernel_utilization > 0.05" }, { - "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", - "MetricExpr": "Turbo_Utilization * TSC / 1e9 / duration_time", - "MetricGroup": "Power;Summary", - "MetricName": "Average_Frequency" + "BriefDescription": "Average number of parallel requests to external memory", + "MetricExpr": "UNC_ARB_TRK_OCCUPANCY.ALL / UNC_ARB_TRK_OCCUPANCY.CYCLES_WITH_ANY_REQUEST", + "MetricGroup": "Mem;SoC", + "MetricName": "tma_info_mem_parallel_requests", + "PublicDescription": "Average number of parallel requests to external memory. Accounts for all requests" }, { - "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE + 2 * FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + 4 * (FP_COMP_OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) + 8 * SIMD_FP_256.PACKED_SINGLE) / 1e9 / duration_time", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "GFLOPs", - "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine." + "BriefDescription": "Average latency of all requests to external memory (in Uncore cycles)", + "MetricExpr": "UNC_ARB_TRK_OCCUPANCY.ALL / UNC_ARB_TRK_REQUESTS.ALL", + "MetricGroup": "Mem;SoC", + "MetricName": "tma_info_mem_request_latency" }, { - "BriefDescription": "Average Frequency Utilization relative nominal frequency", - "MetricExpr": "CLKS / CPU_CLK_UNHALTED.REF_TSC", - "MetricGroup": "Power", - "MetricName": "Turbo_Utilization" + "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=1@", + "MetricGroup": "Pipeline;Ret", + "MetricName": "tma_info_retire" + }, + { + "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", + "MetricExpr": "4 * tma_info_core_clks", + "MetricGroup": "TmaL1;tma_L1_group", + "MetricName": "tma_info_slots" }, { "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", "MetricGroup": "SMT", - "MetricName": "SMT_2T_Utilization" + "MetricName": "tma_info_smt_2t_utilization" }, { - "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "OS", - "MetricName": "Kernel_Utilization" + "BriefDescription": "Socket actual clocks when any core is active on that socket", + "MetricExpr": "UNC_CLOCK.SOCKET", + "MetricGroup": "SoC", + "MetricName": "tma_info_socket_clks" }, { - "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", - "MetricGroup": "OS", - "MetricName": "Kernel_CPI" + "BriefDescription": "Average Frequency Utilization relative nominal frequency", + "MetricExpr": "tma_info_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_turbo_utilization" }, { - "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", - "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_REQUESTS.ALL) / 1e6 / duration_time / 1e3", - "MetricGroup": "HPC;Mem;MemoryBW;SoC", - "MetricName": "DRAM_BW_Use" + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY", + "MetricGroup": "Pipeline;Ret;Retire", + "MetricName": "tma_info_uoppi", + "MetricThreshold": "tma_info_uoppi > 1.05" }, { - "BriefDescription": "Average latency of all requests to external memory (in Uncore cycles)", - "MetricExpr": "MEM_Parallel_Requests", - "MetricGroup": "Mem;SoC", - "MetricName": "MEM_Request_Latency" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses", + "MetricExpr": "(12 * ITLB_MISSES.STLB_HIT + ITLB_MISSES.WALK_DURATION) / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_itlb_misses", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: ITLB_MISSES.WALK_COMPLETED", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average number of parallel requests to external memory. Accounts for all requests", - "MetricExpr": "UNC_ARB_TRK_OCCUPANCY.ALL / UNC_ARB_TRK_REQUESTS.ALL", - "MetricGroup": "Mem;SoC", - "MetricName": "MEM_Parallel_Requests" + "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", + "MetricConstraint": "NO_GROUP_EVENTS_SMT", + "MetricExpr": "MEM_LOAD_UOPS_RETIRED.LLC_HIT / (MEM_LOAD_UOPS_RETIRED.LLC_HIT + 7 * MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS) * CYCLE_ACTIVITY.STALLS_L2_PENDING / tma_info_clks", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l3_bound", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Socket actual clocks when any core is active on that socket", - "MetricExpr": "UNC_CLOCK.SOCKET", - "MetricGroup": "SoC", - "MetricName": "Socket_CLKS" + "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", + "MetricExpr": "ILD_STALL.LCP / tma_info_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_lcp", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", - "MetricGroup": "Branches;OS", - "MetricName": "IpFarBranch" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", + "MetricExpr": "tma_retiring - tma_heavy_operations", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_light_operations", + "MetricThreshold": "tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", + "ScaleUnit": "100%" }, { - "BriefDescription": "Uncore frequency per die [GHZ]", - "MetricExpr": "Socket_CLKS / #num_dies / duration_time / 1e9", - "MetricGroup": "SoC", - "MetricName": "UNCORE_FREQ" + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts", + "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "C3 residency percent per core", - "MetricExpr": "cstate_core@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Core_Residency", + "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=6@) / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW", + "MetricName": "tma_mem_bandwidth", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_info_dram_bw_use", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per core", - "MetricExpr": "cstate_core@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Core_Residency", + "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_clks - tma_mem_bandwidth", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat", + "MetricName": "tma_mem_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: ", "ScaleUnit": "100%" }, { - "BriefDescription": "C7 residency percent per core", - "MetricExpr": "cstate_core@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Core_Residency", + "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_L1D_PENDING) + RESOURCE_STALLS.SB) / (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_DISPATCH) + cpu@UOPS_DISPATCHED.THREAD\\,cmask\\=1@ - (cpu@UOPS_DISPATCHED.THREAD\\,cmask\\=3@ if tma_info_ipc > 1.8 else cpu@UOPS_DISPATCHED.THREAD\\,cmask\\=2@) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) * tma_backend_bound", + "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", "ScaleUnit": "100%" }, { - "BriefDescription": "C2 residency percent per package", - "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C2_Pkg_Residency", + "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / tma_info_slots", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS", + "MetricName": "tma_microcode_sequencer", + "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "C3 residency percent per package", - "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Pkg_Residency", + "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", + "MetricExpr": "3 * IDQ.MS_SWITCHES / tma_info_clks", + "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per package", - "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Pkg_Residency", + "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_DISPATCH) + cpu@UOPS_DISPATCHED.THREAD\\,cmask\\=1@ - (cpu@UOPS_DISPATCHED.THREAD\\,cmask\\=3@ if tma_info_ipc > 1.8 else cpu@UOPS_DISPATCHED.THREAD\\,cmask\\=2@) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB - RESOURCE_STALLS.SB - min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_L1D_PENDING)) / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_ports_utilization", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, { - "BriefDescription": "C7 residency percent per package", - "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Pkg_Residency", + "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1", + "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.RETIRE_SLOTS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", + "MetricExpr": "RESOURCE_STALLS.SB / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_UOPS_RETIRED.ALL_STORES_PS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric serves as an approximation of legacy x87 usage", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS * FP_COMP_OPS_EXE.X87 / UOPS_DISPATCHED.THREAD", + "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", + "MetricName": "tma_x87_use", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", "ScaleUnit": "100%" } ] From patchwork Sun Feb 19 09:28:23 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59113 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp773400wrn; Sun, 19 Feb 2023 01:36:56 -0800 (PST) X-Google-Smtp-Source: AK7set97KX6yA4c7VuqnWYjrv090YvMY0HZ5vBSjDt3nC2Y/O/sT6cwhR7RsJ/fzSQsSKwGPGEA0 X-Received: by 2002:a17:902:c408:b0:19a:996c:5c2e with SMTP id k8-20020a170902c40800b0019a996c5c2emr2368673plk.49.1676799415617; Sun, 19 Feb 2023 01:36:55 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799415; cv=none; d=google.com; s=arc-20160816; b=fPlt9MbZTQoBppYfrFnNWLt8LyFOUZIa2MZjwZ4PyBqjsXnhKzzgGfySsXauweG3Rb IF3i45PBMgBXmZCwbVYiWJItSPuPs/4q6KUOugIet++Y8y+yJ7b7fo97Mdkxkp2t4YOz r+g4gq6rFjFLCbwxO5bJ7R//mJd729TWbIjvq7EUqW3QFzwrn8GWlz3HUwTEgbCWzHmP 88hiy/OBWfBSEXW8Mvzy4FxZ4GJ5nmPxhWOnjX1ipEpnCZSTSEqN7fU8xK5dDObC3g+n HWuV6gp1/uwcuLu/CQT2lVBpygeo6zxMWluJbMKBmuQouyep/Q2uNdeA0rI1gGMsGulZ AaCQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:from:subject :references:mime-version:message-id:in-reply-to:date:dkim-signature; bh=0WNtF2PwUjqgHiS8vMpkTdXaQVv2moJc+3rqhuSsfuE=; b=Eg1eMc1NlScVD2l91jlcYT2kk+qtr0z5HRbGFRB6RSIa46Fuv4CnBCi3tUUVVov3Ja yhJa4QpD941K9GmyaDdtvgKXLMYtzHtOgr8Niq9Sow3HyjIM/TnNfWliKOg+y2biZgyf EKxOxoc2ASLsFNiY6BZK0kXG5o3Qj8CJfN4mtUukz9MezJ31COkba0FYU99fxHN7wvg1 024P+WRmLDA8DprxqXIeIozZwvqIPj17TpvGX9PuZKTh78RljcJfLB4R/vMvGWRWluju SlHbnsyIJkwV6CaMd9CD16UttRjoT4fAGgcPG76JcRP7fbgJtiNXWicoyI9Hwr07zwPx vaoQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=dLmrnaUG; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id b5-20020a170902e94500b00188fce355c9si5635603pll.42.2023.02.19.01.36.43; Sun, 19 Feb 2023 01:36:55 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=dLmrnaUG; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229523AbjBSJeu (ORCPT + 99 others); Sun, 19 Feb 2023 04:34:50 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57140 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230004AbjBSJem (ORCPT ); Sun, 19 Feb 2023 04:34:42 -0500 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D3EE593FB for ; Sun, 19 Feb 2023 01:33:26 -0800 (PST) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-5368974a1d7so16316217b3.20 for ; Sun, 19 Feb 2023 01:33:26 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:from:to:cc:subject:date :message-id:reply-to; bh=0WNtF2PwUjqgHiS8vMpkTdXaQVv2moJc+3rqhuSsfuE=; b=dLmrnaUGG3X1ERysDp0IRq/7zE8C6S2VBu+Y8Ifmmpg2eJgYsyXl7bisBwPs1DiTGl nMkXCAYn45Q51IOgYicn5/fRQnhwZ2AgrKX+gWHEDx/d9/PBofNNUcsNFmRKfZpJHljn fuXw2+ebxY1lKf1nliswtJRcwcEECKRiIhJT7Z4fap/sKtQKTuR+8tUN9NYyOH3/tzlf Vqr55zqW6Icy9VUi2nJ/WJT9wMuVBXSuanWyzxOzS5cpQJXuYVSGFZ7vSeG6uNArjFPu 2FhgxiTS4WX9/iZ8CTax+HQ7kb2837/YMdGze5ZXRJmdU8A7C7jaIkRn06uOxk2YHkSq U+Dg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=0WNtF2PwUjqgHiS8vMpkTdXaQVv2moJc+3rqhuSsfuE=; b=UX4VDXv3f/M6MQvyzBuwUuYQ/wyFvH21nOs6WqFPaPJSPCjk2fuoKeeLEhInM/xSsm nH9FxbmowTVIh3uBTMalhTpkJcGm2PvX09eXUcngQrbvZzB6Cayyt6ZCwUs90PwUU9v/ 7kOwE/Agv7jQu0ZLXSqco//ByT602KE70LAdy1cBCV6PR2nU+ed5ApE7HPMRqDuqNKno fWSjRHXb3wscM0dnfFvAZia823d45aDIs1FbWUJgRqX1fUPz85peSAlJyh2fCJPnLH2x cY7aNrHwyBENx8WCmQ468Ih4SG5Xn3h1PbruDGQ8/Wyi7amjlWeoa0KIzg3CanQJR9X9 RqQg== X-Gm-Message-State: AO0yUKWtpowKPjrPe7XTB+CdSlmaQign2U3mYAu1ETvGDrvVZLnbORKh 3EOTExJv9D8ZWCc1sAhgspKQlY2KRyz6 X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a25:9b85:0:b0:901:73ee:7044 with SMTP id v5-20020a259b85000000b0090173ee7044mr1642835ybo.458.1676799156002; Sun, 19 Feb 2023 01:32:36 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:23 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-27-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 26/51] perf vendor events intel: Refresh sapphirerapids events From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758251623894724885?= X-GMAIL-MSGID: =?utf-8?q?1758251623894724885?= Update the sapphirerapids events from 1.09 to 1.11. Generation was done using https://github.com/intel/perfmon. Notable changes are new events and event descriptions, TMA metrics are updated to version 4.5, TMA info metrics are renamed from their node name to be lower case and prefixed by tma_info_, MetricThreshold expressions are added, smi_cost and transaction metric groups are added replicating existing hard coded metrics in stat-shadow. Signed-off-by: Ian Rogers --- tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- .../arch/x86/sapphirerapids/cache.json | 24 +- .../x86/sapphirerapids/floating-point.json | 32 + .../arch/x86/sapphirerapids/frontend.json | 8 + .../arch/x86/sapphirerapids/pipeline.json | 19 +- .../arch/x86/sapphirerapids/spr-metrics.json | 2283 +++++++++-------- .../arch/x86/sapphirerapids/uncore-other.json | 60 + 7 files changed, 1304 insertions(+), 1124 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv index 02715cbe4fd8..1f611a7dbdda 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -22,7 +22,7 @@ GenuineIntel-6-A[AC],v1.00,meteorlake,core GenuineIntel-6-1[AEF],v3,nehalemep,core GenuineIntel-6-2E,v3,nehalemex,core GenuineIntel-6-2A,v18,sandybridge,core -GenuineIntel-6-(8F|CF),v1.09,sapphirerapids,core +GenuineIntel-6-(8F|CF),v1.11,sapphirerapids,core GenuineIntel-6-(37|4A|4C|4D|5A),v14,silvermont,core GenuineIntel-6-(4E|5E|8E|9E|A5|A6),v53,skylake,core GenuineIntel-6-55-[01234],v1.28,skylakex,core diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/cache.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/cache.json index 92a605ecac6e..9606e76b98d6 100644 --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/cache.json +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/cache.json @@ -97,18 +97,18 @@ "UMask": "0x4" }, { - "BriefDescription": "All accesses to L2 cache[This event is alias to L2_RQSTS.REFERENCES]", + "BriefDescription": "All accesses to L2 cache [This event is alias to L2_RQSTS.REFERENCES]", "EventCode": "0x24", "EventName": "L2_REQUEST.ALL", - "PublicDescription": "Counts all requests that were hit or true misses in L2 cache. True-miss excludes misses that were merged with ongoing L2 misses.[This event is alias to L2_RQSTS.REFERENCES]", + "PublicDescription": "Counts all requests that were hit or true misses in L2 cache. True-miss excludes misses that were merged with ongoing L2 misses. [This event is alias to L2_RQSTS.REFERENCES]", "SampleAfterValue": "200003", "UMask": "0xff" }, { - "BriefDescription": "Read requests with true-miss in L2 cache.[This event is alias to L2_RQSTS.MISS]", + "BriefDescription": "Read requests with true-miss in L2 cache. [This event is alias to L2_RQSTS.MISS]", "EventCode": "0x24", "EventName": "L2_REQUEST.MISS", - "PublicDescription": "Counts read requests of any type with true-miss in the L2 cache. True-miss excludes L2 misses that were merged with ongoing L2 misses.[This event is alias to L2_RQSTS.MISS]", + "PublicDescription": "Counts read requests of any type with true-miss in the L2 cache. True-miss excludes L2 misses that were merged with ongoing L2 misses. [This event is alias to L2_RQSTS.MISS]", "SampleAfterValue": "200003", "UMask": "0x3f" }, @@ -199,18 +199,18 @@ "UMask": "0x30" }, { - "BriefDescription": "Read requests with true-miss in L2 cache.[This event is alias to L2_REQUEST.MISS]", + "BriefDescription": "Read requests with true-miss in L2 cache. [This event is alias to L2_REQUEST.MISS]", "EventCode": "0x24", "EventName": "L2_RQSTS.MISS", - "PublicDescription": "Counts read requests of any type with true-miss in the L2 cache. True-miss excludes L2 misses that were merged with ongoing L2 misses.[This event is alias to L2_REQUEST.MISS]", + "PublicDescription": "Counts read requests of any type with true-miss in the L2 cache. True-miss excludes L2 misses that were merged with ongoing L2 misses. [This event is alias to L2_REQUEST.MISS]", "SampleAfterValue": "200003", "UMask": "0x3f" }, { - "BriefDescription": "All accesses to L2 cache[This event is alias to L2_REQUEST.ALL]", + "BriefDescription": "All accesses to L2 cache [This event is alias to L2_REQUEST.ALL]", "EventCode": "0x24", "EventName": "L2_RQSTS.REFERENCES", - "PublicDescription": "Counts all requests that were hit or true misses in L2 cache. True-miss excludes misses that were merged with ongoing L2 misses.[This event is alias to L2_REQUEST.ALL]", + "PublicDescription": "Counts all requests that were hit or true misses in L2 cache. True-miss excludes misses that were merged with ongoing L2 misses. [This event is alias to L2_REQUEST.ALL]", "SampleAfterValue": "200003", "UMask": "0xff" }, @@ -863,6 +863,14 @@ "SampleAfterValue": "1000003", "UMask": "0x1" }, + { + "BriefDescription": "Counts bus locks, accounts for cache line split locks and UC locks.", + "EventCode": "0x2c", + "EventName": "SQ_MISC.BUS_LOCK", + "PublicDescription": "Counts the more expensive bus lock needed to enforce cache coherency for certain memory accesses that need to be done atomically. Can be created by issuing an atomic instruction (via the LOCK prefix) which causes a cache line split or accesses uncacheable memory.", + "SampleAfterValue": "100003", + "UMask": "0x10" + }, { "BriefDescription": "Number of PREFETCHNTA instructions executed.", "EventCode": "0x40", diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/floating-point.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/floating-point.json index 01baea3df562..4a9d211e9d4f 100644 --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/floating-point.json +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/floating-point.json @@ -75,6 +75,14 @@ "SampleAfterValue": "100003", "UMask": "0x20" }, + { + "BriefDescription": "Number of SSE/AVX computational 128-bit packed single and 256-bit packed double precision FP instructions retired; some instructions will count twice as noted below. Each count represents 2 or/and 4 computation operations, 1 for each element. Applies to SSE* and AVX* packed single precision and packed double precision FP instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB count twice as they perform 2 calculations per element.", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.4_FLOPS", + "PublicDescription": "Number of SSE/AVX computational 128-bit packed single precision and 256-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 2 or/and 4 computation operations, one for each element. Applies to SSE* and AVX* packed single precision floating-point and packed double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.", + "SampleAfterValue": "100003", + "UMask": "0x18" + }, { "BriefDescription": "Counts number of SSE/AVX computational 512-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 8 computation operations, one for each element. Applies to SSE* and AVX* packed double precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT14 RCP14 FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.", "EventCode": "0xc7", @@ -91,6 +99,22 @@ "SampleAfterValue": "100003", "UMask": "0x80" }, + { + "BriefDescription": "Number of SSE/AVX computational 256-bit packed single precision and 512-bit packed double precision FP instructions retired; some instructions will count twice as noted below. Each count represents 8 computation operations, 1 for each element. Applies to SSE* and AVX* packed single precision and double precision FP instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT RSQRT RSQRT14 RCP RCP14 DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB count twice as they perform 2 calculations per element.", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.8_FLOPS", + "PublicDescription": "Number of SSE/AVX computational 256-bit packed single precision and 512-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 8 computation operations, one for each element. Applies to SSE* and AVX* packed single precision and double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT RSQRT RSQRT14 RCP RCP14 DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.", + "SampleAfterValue": "100003", + "UMask": "0x60" + }, + { + "BriefDescription": "Number of SSE/AVX computational scalar floating-point instructions retired; some instructions will count twice as noted below. Applies to SSE* and AVX* scalar, double and single precision floating-point: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14 RANGE SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform multiple calculations per element.", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.SCALAR", + "PublicDescription": "Number of SSE/AVX computational scalar single precision and double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 1 computational operation. Applies to SSE* and AVX* scalar single precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT RCP FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.", + "SampleAfterValue": "1000003", + "UMask": "0x3" + }, { "BriefDescription": "Counts number of SSE/AVX computational scalar double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 1 computational operation. Applies to SSE* and AVX* scalar double precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.", "EventCode": "0xc7", @@ -107,6 +131,14 @@ "SampleAfterValue": "100003", "UMask": "0x2" }, + { + "BriefDescription": "Number of any Vector retired FP arithmetic instructions", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.VECTOR", + "PublicDescription": "Number of any Vector retired FP arithmetic instructions. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.", + "SampleAfterValue": "1000003", + "UMask": "0xfc" + }, { "BriefDescription": "FP_ARITH_INST_RETIRED2.128B_PACKED_HALF", "EventCode": "0xcf", diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/frontend.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/frontend.json index 2c7c617f27ed..860a415e5e79 100644 --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/frontend.json +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/frontend.json @@ -1,4 +1,12 @@ [ + { + "BriefDescription": "Clears due to Unknown Branches.", + "EventCode": "0x60", + "EventName": "BACLEARS.ANY", + "PublicDescription": "Number of times the front-end is resteered when it finds a branch instruction in a fetch line. This is called Unknown Branch which occurs for the first time a branch instruction is fetched or when the branch is not tracked by the BPU (Branch Prediction Unit) anymore.", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Stalls caused by changing prefix length of the instruction.", "EventCode": "0x87", diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/pipeline.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/pipeline.json index ceb14181ebc8..40e52357ade1 100644 --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/pipeline.json @@ -1,15 +1,17 @@ [ { - "BriefDescription": "AMX_OPS_RETIRED.BF16", + "BriefDescription": "AMX retired arithmetic BF16 operations.", "EventCode": "0xce", "EventName": "AMX_OPS_RETIRED.BF16", + "PublicDescription": "Number of AMX-based retired arithmetic bfloat16 (BF16) floating-point operations. Counts TDPBF16PS FP instructions. SW to use operation multiplier of 4", "SampleAfterValue": "1000003", "UMask": "0x2" }, { - "BriefDescription": "AMX_OPS_RETIRED.INT8", + "BriefDescription": "AMX retired arithmetic integer 8-bit operations.", "EventCode": "0xce", "EventName": "AMX_OPS_RETIRED.INT8", + "PublicDescription": "Number of AMX-based retired arithmetic integer operations of 8-bit width source operands. Counts TDPB[SS,UU,US,SU]D instructions. SW should use operation multiplier of 8.", "SampleAfterValue": "1000003", "UMask": "0x1" }, @@ -461,12 +463,23 @@ "UMask": "0x1" }, { - "BriefDescription": "INST_RETIRED.REP_ITERATION", + "BriefDescription": "Iterations of Repeat string retired instructions.", "EventCode": "0xc0", "EventName": "INST_RETIRED.REP_ITERATION", + "PublicDescription": "Number of iterations of Repeat (REP) string retired instructions such as MOVS, CMPS, and SCAS. Each has a byte, word, and doubleword version and string instructions can be repeated using a repetition prefix, REP, that allows their architectural execution to be repeated a number of times as specified by the RCX register. Note the number of iterations is implementation-dependent.", "SampleAfterValue": "2000003", "UMask": "0x8" }, + { + "BriefDescription": "Clears speculative count", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0xad", + "EventName": "INT_MISC.CLEARS_COUNT", + "PublicDescription": "Counts the number of speculative clears due to any type of branch misprediction or machine clears", + "SampleAfterValue": "500009", + "UMask": "0x1" + }, { "BriefDescription": "Counts cycles after recovery from a branch misprediction or machine clear till the first uop is issued from the resteered path.", "EventCode": "0xad", diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json index ce18fc458e37..149cc4c07fb5 100644 --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json @@ -1,1616 +1,1675 @@ [ { - "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks", - "MetricExpr": "100 * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", - "MetricGroup": "Bad;BadSpec;BrMispredicts", - "MetricName": "Mispredictions" - }, - { - "BriefDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks", - "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full))) + tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_fb_full / (tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk))", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "Memory_Bandwidth" + "BriefDescription": "C1 residency percent per core", + "MetricExpr": "cstate_core@c1\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C1_Core_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Total pipeline cost of Memory Latency related bottlenecks (external memory and off-core caches)", - "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound))", - "MetricGroup": "Mem;MemoryLat;Offcore", - "MetricName": "Memory_Latency" + "BriefDescription": "C2 residency percent per package", + "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C2_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)", - "MetricExpr": "100 * tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming_stores)))", - "MetricGroup": "Mem;MemoryTLB;Offcore", - "MetricName": "Memory_Data_TLBs" + "BriefDescription": "C6 residency percent per core", + "MetricExpr": "cstate_core@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Core_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Total pipeline cost of branch related instructions (used for program control-flow including function calls)", - "MetricExpr": "100 * ((BR_INST_RETIRED.COND + 3 * BR_INST_RETIRED.NEAR_CALL + (BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL)) / SLOTS)", - "MetricGroup": "Ret", - "MetricName": "Branching_Overhead" + "BriefDescription": "C6 residency percent per package", + "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", - "MetricGroup": "BigFoot;Fed;Frontend;IcMiss;MemoryTLB", - "MetricName": "Big_Code" + "BriefDescription": "Uncore frequency per die [GHZ]", + "MetricExpr": "tma_info_socket_clks / #num_dies / duration_time / 1e9", + "MetricGroup": "SoC", + "MetricName": "UNCORE_FREQ" }, { - "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_frontend_bound - tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) - Big_Code", - "MetricGroup": "Fed;FetchBW;Frontend", - "MetricName": "Instruction_Fetch_BW" + "BriefDescription": "Percentage of cycles spent in System Management Interrupts.", + "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)", + "MetricGroup": "smi", + "MetricName": "smi_cycles", + "MetricThreshold": "smi_cycles > 0.1", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions Per Cycle (per Logical Processor)", - "MetricExpr": "INST_RETIRED.ANY / CLKS", - "MetricGroup": "Ret;Summary", - "MetricName": "IPC" + "BriefDescription": "Number of SMI interrupts.", + "MetricExpr": "msr@smi@", + "MetricGroup": "smi", + "MetricName": "smi_num", + "ScaleUnit": "1SMI#" }, { - "BriefDescription": "Uops Per Instruction", - "MetricExpr": "tma_retiring * SLOTS / INST_RETIRED.ANY", - "MetricGroup": "Pipeline;Ret;Retire", - "MetricName": "UPI" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", + "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + UOPS_DISPATCHED.PORT_5_11 + UOPS_DISPATCHED.PORT_6) / (5 * tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_alu_op_utilization", + "MetricThreshold": "tma_alu_op_utilization > 0.6", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "tma_retiring * SLOTS / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW", - "MetricName": "UpTB" + "BriefDescription": "This metric estimates fraction of cycles where the Advanced Matrix Extensions (AMX) execution engine was busy with tile (arithmetic) operations", + "MetricExpr": "EXE.AMX_BUSY / tma_info_core_clks", + "MetricGroup": "Compute;HPC;Server;TopdownL5;tma_L5_group;tma_ports_utilized_0_group", + "MetricName": "tma_amx_busy", + "MetricThreshold": "tma_amx_busy > 0.5 & (tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)))", + "ScaleUnit": "100%" }, { - "BriefDescription": "Cycles Per Instruction (per Logical Processor)", - "MetricExpr": "1 / IPC", - "MetricGroup": "Mem;Pipeline", - "MetricName": "CPI" + "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", + "MetricExpr": "100 * cpu@ASSISTS.ANY\\,umask\\=0x1B@ / tma_info_slots", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_assists", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: ASSISTS.ANY", + "ScaleUnit": "100%" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "Pipeline", - "MetricName": "CLKS" + "BriefDescription": "This metric estimates fraction of slots the CPU retired uops as a result of handing SSE to AVX* or AVX* to SSE transition Assists.", + "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / tma_info_slots", + "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", + "MetricName": "tma_avx_assists", + "MetricThreshold": "tma_avx_assists > 0.1", + "ScaleUnit": "100%" }, { - "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "TOPDOWN.SLOTS", - "MetricGroup": "tma_L1_group", - "MetricName": "SLOTS" + "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", + "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_backend_bound", + "MetricThreshold": "tma_backend_bound > 0.2", + "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of Physical Core issue-slots utilized by this Logical Processor", - "MetricExpr": "(SLOTS / (TOPDOWN.SLOTS / 2) if #SMT_on else 1)", - "MetricGroup": "SMT;tma_L1_group", - "MetricName": "Slots_Utilization" + "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", + "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_bad_speculation", + "MetricThreshold": "tma_bad_speculation > 0.15", + "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", + "ScaleUnit": "100%" }, { - "BriefDescription": "The ratio of Executed- by Issued-Uops", - "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", - "MetricGroup": "Cor;Pipeline", - "MetricName": "Execute_per_Issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage." + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", + "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_slots", + "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM", + "MetricName": "tma_branch_mispredicts", + "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_info_branch_misprediction_cost, tma_info_mispredictions, tma_mispredicts_resteers", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", - "MetricExpr": "INST_RETIRED.ANY / CORE_CLKS", - "MetricGroup": "Ret;SMT;tma_L1_group", - "MetricName": "CoreIPC" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", + "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_clks + tma_unknown_branches", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_branch_resteers", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "ScaleUnit": "100%" }, { - "BriefDescription": "Floating Point Operations Per Cycle", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + FP_ARITH_INST_RETIRED2.SCALAR_HALF + 2 * (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED2.COMPLEX_SCALAR_HALF) + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED2.128B_PACKED_HALF + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE) + 16 * (FP_ARITH_INST_RETIRED2.256B_PACKED_HALF + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) + 32 * FP_ARITH_INST_RETIRED2.512B_PACKED_HALF + 4 * AMX_OPS_RETIRED.BF16) / CORE_CLKS", - "MetricGroup": "Flops;Ret", - "MetricName": "FLOPc" + "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", + "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_cisc", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources. Sample with: FRONTEND_RETIRED.MS_FLOWS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)", - "MetricExpr": "(FP_ARITH_DISPATCHED.PORT_0 + FP_ARITH_DISPATCHED.PORT_1 + FP_ARITH_DISPATCHED.PORT_5) / (2 * CORE_CLKS)", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "FP_Arith_Utilization", - "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)." + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears", + "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_clks", + "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC", + "MetricName": "tma_clears_resteers", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else UOPS_EXECUTED.CORE_CYCLES_GE_1)", - "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", - "MetricName": "ILP" + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(76 * tma_info_average_frequency * (MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + 75.5 * tma_info_average_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_contested_accesses", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "Probability of Core Bound bottleneck hidden by SMT-profiling artifacts", - "MetricExpr": "((1 - tma_core_bound / tma_ports_utilization if tma_core_bound < tma_ports_utilization else 1) if SMT_2T_Utilization > 0.5 else 0)", - "MetricGroup": "Cor;SMT", - "MetricName": "Core_Bound_Likely" + "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", + "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)", + "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", + "ScaleUnit": "100%" }, { - "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", - "MetricExpr": "CPU_CLK_UNHALTED.DISTRIBUTED", - "MetricGroup": "SMT", - "MetricName": "CORE_CLKS" + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "75.5 * tma_info_average_frequency * (MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD + MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_data_sharing", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", - "MetricGroup": "InsType", - "MetricName": "IpLoad" + "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=2@) / tma_info_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0;tma_mite_group", + "MetricName": "tma_decoder0_alone", + "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 6 > 0.35))", + "PublicDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder. Related metrics: tma_few_uops_instructions", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", - "MetricGroup": "InsType", - "MetricName": "IpStore" + "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", + "MetricExpr": "ARITH.DIV_ACTIVE / tma_info_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_divider", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_ACTIVE", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Branches;Fed;InsType", - "MetricName": "IpBranch" + "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "MEMORY_ACTIVITY.STALLS_L3_MISS / tma_info_clks - tma_pmm_bound", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "IpCall" + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", + "MetricExpr": "(IDQ.DSB_CYCLES_ANY - IDQ.DSB_CYCLES_OK) / tma_info_core_clks / 2", + "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_dsb", + "MetricThreshold": "tma_dsb > 0.15 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 6 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO", - "MetricName": "IpTB" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_clks", + "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_dsb_switches", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_dsb_misses, tma_info_iptb, tma_lcp", + "ScaleUnit": "100%" }, { - "BriefDescription": "Branch instructions per taken branch. ", - "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "BpTkBranch" + "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", + "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group", + "MetricName": "tma_dtlb_load", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store, tma_info_memory_data_tlbs", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + FP_ARITH_INST_RETIRED2.SCALAR_HALF + 2 * (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED2.COMPLEX_SCALAR_HALF) + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED2.128B_PACKED_HALF + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE) + 16 * (FP_ARITH_INST_RETIRED2.256B_PACKED_HALF + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) + 32 * FP_ARITH_INST_RETIRED2.512B_PACKED_HALF + 4 * AMX_OPS_RETIRED.BF16)", - "MetricGroup": "Flops;InsType", - "MetricName": "IpFLOP" + "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", + "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group", + "MetricName": "tma_dtlb_store", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, tma_info_memory_data_tlbs", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + FP_ARITH_INST_RETIRED2.SCALAR + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.VECTOR))", - "MetricGroup": "Flops;InsType", - "MetricName": "IpArith", - "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). May undercount due to FMA double counting. Approximated prior to BDW." + "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing", + "MetricExpr": "80 * tma_info_average_frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group", + "MetricName": "tma_false_sharing", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE", - "MetricGroup": "Flops;FpScalar;InsType", - "MetricName": "IpArith_Scalar_SP", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", + "MetricExpr": "L1D_PEND_MISS.FB_FULL / tma_info_clks", + "MetricGroup": "MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricName": "tma_fb_full", + "MetricThreshold": "tma_fb_full > 0.3", + "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_dram_bw_use, tma_info_memory_bandwidth, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE", - "MetricGroup": "Flops;FpScalar;InsType", - "MetricName": "IpArith_Scalar_DP", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", + "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)", + "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 6 > 0.35", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_dsb_coverage, tma_info_dsb_misses, tma_info_iptb, tma_lcp", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.128B_PACKED_HALF)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX128", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", + "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_slots", + "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.256B_PACKED_HALF)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX256", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops", + "MetricExpr": "max(0, tma_heavy_operations - tma_microcode_sequencer)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueD0", + "MetricName": "tma_few_uops_instructions", + "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.512B_PACKED_HALF)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX512", - "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "This metric approximates arithmetic floating-point (FP) matrix uops fraction the CPU has retired (aggregated across all supported FP datatypes in AMX engine)", + "MetricExpr": "cpu@AMX_OPS_RETIRED.BF16\\,cmask\\=1@ / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;Flops;HPC;Pipeline;Server;TopdownL4;tma_L4_group;tma_fp_arith_group", + "MetricName": "tma_fp_amx", + "MetricThreshold": "tma_fp_amx > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) matrix uops fraction the CPU has retired (aggregated across all supported FP datatypes in AMX engine). Refer to AMX_Busy and GFLOPs metrics for actual AMX utilization and FP performance, resp.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic AMX operation (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / AMX_OPS_RETIRED.BF16", - "MetricGroup": "Flops;FpVector;InsType;Server", - "MetricName": "IpArith_AMX_F16", - "PublicDescription": "Instructions per FP Arithmetic AMX operation (lower number means higher occurrence rate). Operations factored per matrices' sizes of the AMX instructions." + "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector + tma_fp_amx", + "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_fp_arith", + "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Integer Arithmetic AMX operation (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / AMX_OPS_RETIRED.INT8", - "MetricGroup": "InsType;IntVector;Server", - "MetricName": "IpArith_AMX_Int8", - "PublicDescription": "Instructions per Integer Arithmetic AMX operation (lower number means higher occurrence rate). Operations factored per matrices' sizes of the AMX instructions." + "BriefDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists", + "MetricExpr": "30 * ASSISTS.FP / tma_info_slots", + "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", + "MetricName": "tma_fp_assists", + "MetricThreshold": "tma_fp_assists > 0.1", + "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals).", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Software prefetch instruction (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umask\\=0xF@", - "MetricGroup": "Prefetches", - "MetricName": "IpSWPF" + "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired", + "MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + FP_ARITH_INST_RETIRED2.SCALAR) / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", + "MetricName": "tma_fp_scalar", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Total number of retired Instructions Sample with: INST_RETIRED.PREC_DIST", - "MetricExpr": "INST_RETIRED.ANY", - "MetricGroup": "Summary;tma_L1_group", - "MetricName": "Instructions" + "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths", + "MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0x3c@ + FP_ARITH_INST_RETIRED2.VECTOR) / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", + "MetricName": "tma_fp_vector", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", - "MetricExpr": "tma_retiring * SLOTS / cpu@UOPS_RETIRED.SLOTS\\,cmask\\=1@", - "MetricGroup": "Pipeline;Ret", - "MetricName": "Retire" + "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.128B_PACKED_HALF) / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", + "MetricName": "tma_fp_vector_128b", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Estimated fraction of retirement-cycles dealing with repeat instructions", - "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.SLOTS\\,cmask\\=1@", - "MetricGroup": "Pipeline;Ret", - "MetricName": "Strings_Cycles" + "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.256B_PACKED_HALF) / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", + "MetricName": "tma_fp_vector_256b", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per a microcode Assist invocation. See Assists tree node for details (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / cpu@ASSISTS.ANY\\,umask\\=0x1B@", - "MetricGroup": "Pipeline;Ret;Retire", - "MetricName": "IpAssist" + "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.512B_PACKED_HALF) / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", + "MetricName": "tma_fp_vector_512b", + "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@", - "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", - "MetricName": "Execute" + "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", + "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_slots", + "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average number of Uops issued by front-end when it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=1@", - "MetricGroup": "Fed;FetchBW", - "MetricName": "Fetch_UpC" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (tma_retiring * tma_info_slots)", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_fused_instructions", + "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions. The instruction pairs of CMP+JCC or DEC+JCC are commonly used examples.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", - "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", - "MetricGroup": "DSB;Fed;FetchBW", - "MetricName": "DSB_Coverage" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences", + "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_slots", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_heavy_operations", + "MetricThreshold": "tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. Sample with: UOPS_RETIRED.HEAVY", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details.", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=1\\,edge@", - "MetricGroup": "DSBmiss", - "MetricName": "DSB_Switch_Cost" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses", + "MetricExpr": "ICACHE_DATA.STALLS / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_icache_misses", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Total penalty related to DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck.", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tma_mite))", - "MetricGroup": "DSBmiss;Fed", - "MetricName": "DSB_Misses" + "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", + "MetricExpr": "tma_info_turbo_utilization * TSC / 1e9 / duration_time", + "MetricGroup": "Power;Summary", + "MetricName": "tma_info_average_frequency" }, { - "BriefDescription": "Number of Instructions per non-speculative DSB miss (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", - "MetricGroup": "DSBmiss;Fed", - "MetricName": "IpDSB_Miss_Ret" + "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", + "MetricGroup": "BigFoot;Fed;Frontend;IcMiss;MemoryTLB;tma_issueBC", + "MetricName": "tma_info_big_code", + "MetricThreshold": "tma_info_big_code > 20", + "PublicDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses). Related metrics: tma_info_branching_overhead" }, { - "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BadSpec;BrMispredicts", - "MetricName": "IpMispredict" + "BriefDescription": "Branch instructions per taken branch.", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_bptkbranch" }, { "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)", - "MetricExpr": "(tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) * SLOTS / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BrMispredicts", - "MetricName": "Branch_Misprediction_Cost" - }, - { - "BriefDescription": "Fraction of branches that are non-taken conditionals", - "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches;CodeGen;PGO", - "MetricName": "Cond_NT" - }, - { - "BriefDescription": "Fraction of branches that are taken conditionals", - "MetricExpr": "BR_INST_RETIRED.COND_TAKEN / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches;CodeGen;PGO", - "MetricName": "Cond_TK" - }, - { - "BriefDescription": "Fraction of branches that are CALL or RET", - "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_RETURN) / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches", - "MetricName": "CallRet" - }, - { - "BriefDescription": "Fraction of branches that are unconditional (direct or indirect) jumps", - "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches", - "MetricName": "Jump" - }, - { - "BriefDescription": "Fraction of branches of other types (not individually covered by other metrics in Info.Branches group)", - "MetricExpr": "1 - (Cond_NT + Cond_TK + CallRet + Jump)", - "MetricGroup": "Bad;Branches", - "MetricName": "Other_Branches" - }, - { - "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", - "MetricExpr": "L1D_PEND_MISS.PENDING / MEM_LOAD_COMPLETED.L1_MISS_ANY", - "MetricGroup": "Mem;MemoryBound;MemoryLat", - "MetricName": "Load_Miss_Real_Latency" - }, - { - "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)", - "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES", - "MetricGroup": "Mem;MemoryBW;MemoryBound", - "MetricName": "MLP" - }, - { - "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L1MPKI" - }, - { - "BriefDescription": "L1 cache true misses per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L1MPKI_Load" - }, - { - "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", - "MetricGroup": "Backend;CacheMisses;Mem", - "MetricName": "L2MPKI" - }, - { - "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all request types (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem;Offcore", - "MetricName": "L2MPKI_All" - }, - { - "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2MPKI_Load" - }, - { - "BriefDescription": "L2 cache hits per kilo instruction for all request types (including speculative)", - "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2HPKI_All" - }, - { - "BriefDescription": "L2 cache hits per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2HPKI_Load" + "MetricExpr": "(tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) * tma_info_slots / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_branch_misprediction_cost", + "PublicDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_branch_mispredicts, tma_info_mispredictions, tma_mispredicts_resteers" }, { - "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L3MPKI" - }, - { - "BriefDescription": "Fill Buffer (FB) hits per kilo instructions for retired demand loads (L1D misses that merge into ongoing miss-handling entries)", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "FB_HPKI" - }, - { - "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", - "MetricConstraint": "NO_NMI_WATCHDOG", - "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_PENDING + DTLB_STORE_MISSES.WALK_PENDING) / (4 * CORE_CLKS)", - "MetricGroup": "Mem;MemoryTLB", - "MetricName": "Page_Walks_Utilization" - }, - { - "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW" + "BriefDescription": "Total pipeline cost of branch related instructions (used for program control-flow including function calls)", + "MetricExpr": "100 * ((BR_INST_RETIRED.COND + 3 * BR_INST_RETIRED.NEAR_CALL + (BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL)) / tma_info_slots)", + "MetricGroup": "Ret;tma_issueBC", + "MetricName": "tma_info_branching_overhead", + "MetricThreshold": "tma_info_branching_overhead > 10", + "PublicDescription": "Total pipeline cost of branch related instructions (used for program control-flow including function calls). Related metrics: tma_info_big_code" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW" + "BriefDescription": "Fraction of branches that are CALL or RET", + "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_RETURN) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_callret" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW" + "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "Pipeline", + "MetricName": "tma_info_clks" }, { - "BriefDescription": "Average per-core data access bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "L3_Cache_Access_BW" + "BriefDescription": "STLB (2nd level TLB) code speculative misses per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", + "MetricGroup": "Fed;MemoryTLB", + "MetricName": "tma_info_code_stlb_mpki" }, { - "BriefDescription": "Rate of silent evictions from the L2 cache per Kilo instruction where the evicted lines are dropped (no writeback to L3 or memory)", - "MetricExpr": "1e3 * L2_LINES_OUT.SILENT / Instructions", - "MetricGroup": "L2Evicts;Mem;Server", - "MetricName": "L2_Evictions_Silent_PKI" + "BriefDescription": "Fraction of branches that are non-taken conditionals", + "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_cond_nt" }, { - "BriefDescription": "Rate of non silent evictions from the L2 cache per Kilo instruction", - "MetricExpr": "1e3 * L2_LINES_OUT.NON_SILENT / Instructions", - "MetricGroup": "L2Evicts;Mem;Server", - "MetricName": "L2_Evictions_NonSilent_PKI" + "BriefDescription": "Fraction of branches that are taken conditionals", + "MetricExpr": "BR_INST_RETIRED.COND_TAKEN / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_cond_tk" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "L1D_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW_1T" + "BriefDescription": "Probability of Core Bound bottleneck hidden by SMT-profiling artifacts", + "MetricExpr": "(100 * (1 - tma_core_bound / tma_ports_utilization if tma_core_bound < tma_ports_utilization else 1) if tma_info_smt_2t_utilization > 0.5 else 0)", + "MetricGroup": "Cor;SMT", + "MetricName": "tma_info_core_bound_likely", + "MetricThreshold": "tma_info_core_bound_likely > 0.5" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "L2_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW_1T" + "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", + "MetricExpr": "CPU_CLK_UNHALTED.DISTRIBUTED", + "MetricGroup": "SMT", + "MetricName": "tma_info_core_clks" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "L3_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW_1T" + "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_core_clks", + "MetricGroup": "Ret;SMT;TmaL1;tma_L1_group", + "MetricName": "tma_info_coreipc" }, { - "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "L3_Cache_Access_BW", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "L3_Cache_Access_BW_1T" + "BriefDescription": "Cycles Per Instruction (per Logical Processor)", + "MetricExpr": "1 / tma_info_ipc", + "MetricGroup": "Mem;Pipeline", + "MetricName": "tma_info_cpi" }, { "BriefDescription": "Average CPU Utilization", "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", "MetricGroup": "HPC;Summary", - "MetricName": "CPU_Utilization" - }, - { - "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", - "MetricExpr": "Turbo_Utilization * TSC / 1e9 / duration_time", - "MetricGroup": "Power;Summary", - "MetricName": "Average_Frequency" - }, - { - "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + FP_ARITH_INST_RETIRED2.SCALAR_HALF + 2 * (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED2.COMPLEX_SCALAR_HALF) + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED2.128B_PACKED_HALF + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE) + 16 * (FP_ARITH_INST_RETIRED2.256B_PACKED_HALF + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) + 32 * FP_ARITH_INST_RETIRED2.512B_PACKED_HALF + 4 * AMX_OPS_RETIRED.BF16) / 1e9 / duration_time", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "GFLOPs", - "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine." + "MetricName": "tma_info_cpu_utilization" }, { - "BriefDescription": "Tera Integer (matrix) Operations Per Second", - "MetricExpr": "8 * AMX_OPS_RETIRED.INT8 / 1e12 / duration_time", - "MetricGroup": "Cor;HPC;IntVector;Server", - "MetricName": "TIOPS" + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD / OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_data_l2_mlp" }, { - "BriefDescription": "Average Frequency Utilization relative nominal frequency", - "MetricExpr": "CLKS / CPU_CLK_UNHALTED.REF_TSC", - "MetricGroup": "Power", - "MetricName": "Turbo_Utilization" + "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", + "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e9 / duration_time", + "MetricGroup": "HPC;Mem;MemoryBW;SoC;tma_issueBW", + "MetricName": "tma_info_dram_bw_use", + "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_memory_bandwidth, tma_mem_bandwidth, tma_sq_full" }, { - "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", - "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_DISTRIBUTED if #SMT_on else 0)", - "MetricGroup": "SMT", - "MetricName": "SMT_2T_Utilization" + "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_dsb_coverage", + "MetricThreshold": "tma_info_dsb_coverage < 0.7 & tma_info_ipc / 6 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_misses, tma_info_iptb, tma_lcp" }, { - "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "OS", - "MetricName": "Kernel_Utilization" + "BriefDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tma_mite))", + "MetricGroup": "DSBmiss;Fed;tma_issueFB", + "MetricName": "tma_info_dsb_misses", + "MetricThreshold": "tma_info_dsb_misses > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_iptb, tma_lcp" }, { - "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", - "MetricGroup": "OS", - "MetricName": "Kernel_CPI" + "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details.", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=1\\,edge@", + "MetricGroup": "DSBmiss", + "MetricName": "tma_info_dsb_switch_cost" }, { - "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", - "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e9 / duration_time", - "MetricGroup": "HPC;Mem;MemoryBW;SoC", - "MetricName": "DRAM_BW_Use" + "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-thread", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@", + "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", + "MetricName": "tma_info_execute" }, { - "BriefDescription": "Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_INSERTS.IA_MISS_DRD) / (Socket_CLKS / duration_time)", - "MetricGroup": "Mem;MemoryLat;SoC", - "MetricName": "MEM_Read_Latency" + "BriefDescription": "The ratio of Executed- by Issued-Uops", + "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", + "MetricGroup": "Cor;Pipeline", + "MetricName": "tma_info_execute_per_issue", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage." }, { - "BriefDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches", - "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD@thresh\\=1@", - "MetricGroup": "Mem;MemoryBW;SoC", - "MetricName": "MEM_Parallel_Reads" + "BriefDescription": "Fill Buffer (FB) hits per kilo instructions for retired demand loads (L1D misses that merge into ongoing miss-handling entries)", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_fb_hpki" }, { - "BriefDescription": "Average latency of data read request to external 3D X-Point memory [in nanoseconds]. Accounts for demand loads and L1/L2 data-read prefetches", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_PMM / UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PMM) / uncore_cha_0@event\\=0x1@", - "MetricGroup": "Mem;MemoryLat;Server;SoC", - "MetricName": "MEM_PMM_Read_Latency" + "BriefDescription": "Average number of Uops issued by front-end when it issued something", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=1@", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_fetch_upc" }, { - "BriefDescription": "Average latency of data read request to external DRAM memory [in nanoseconds]. Accounts for demand loads and L1/L2 data-read prefetches", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / UNC_CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / uncore_cha_0@event\\=0x1@", - "MetricGroup": "Mem;MemoryLat;Server;SoC", - "MetricName": "MEM_DRAM_Read_Latency" + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + FP_ARITH_INST_RETIRED2.SCALAR_HALF + 2 * (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED2.COMPLEX_SCALAR_HALF) + 4 * cpu@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE\\,umask\\=0x18@ + 8 * (FP_ARITH_INST_RETIRED2.128B_PACKED_HALF + cpu@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE\\,umask\\=0x60@) + 16 * (FP_ARITH_INST_RETIRED2.256B_PACKED_HALF + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) + 32 * FP_ARITH_INST_RETIRED2.512B_PACKED_HALF + 4 * AMX_OPS_RETIRED.BF16", + "MetricGroup": "Flops;Ret", + "MetricName": "tma_info_flopc" }, { - "BriefDescription": "Average 3DXP Memory Bandwidth Use for reads [GB / sec]", - "MetricExpr": "64 * UNC_M_PMM_RPQ_INSERTS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW;Server;SoC", - "MetricName": "PMM_Read_BW" + "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(FP_ARITH_DISPATCHED.PORT_0 + FP_ARITH_DISPATCHED.PORT_1 + FP_ARITH_DISPATCHED.PORT_5) / (2 * tma_info_core_clks)", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_fp_arith_utilization", + "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)." }, { - "BriefDescription": "Average 3DXP Memory Bandwidth Use for Writes [GB / sec]", - "MetricExpr": "64 * UNC_M_PMM_WPQ_INSERTS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW;Server;SoC", - "MetricName": "PMM_Write_BW" + "BriefDescription": "Giga Floating Point Operations Per Second", + "MetricExpr": "tma_info_flopc / duration_time", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_gflops", + "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine." }, { - "BriefDescription": "Average IO (network or disk) Bandwidth Use for Writes [GB / sec]", - "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e9 / duration_time", - "MetricGroup": "IoBW;Mem;Server;SoC", - "MetricName": "IO_Write_BW" + "BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", + "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", + "MetricName": "tma_info_ic_misses", + "MetricThreshold": "tma_info_ic_misses > 5", + "PublicDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck. Related metrics: " }, { - "BriefDescription": "Socket actual clocks when any core is active on that socket", - "MetricExpr": "uncore_cha_0@event\\=0x1@", - "MetricGroup": "SoC", - "MetricName": "Socket_CLKS" + "BriefDescription": "Average Latency for L1 instruction cache misses", + "MetricExpr": "ICACHE_DATA.STALLS / cpu@ICACHE_DATA.STALLS\\,cmask\\=1\\,edge@", + "MetricGroup": "Fed;FetchLat;IcMiss", + "MetricName": "tma_info_icache_miss_latency" }, { - "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", - "MetricGroup": "Branches;OS", - "MetricName": "IpFarBranch" + "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else UOPS_EXECUTED.CORE_CYCLES_GE_1)", + "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", + "MetricName": "tma_info_ilp" }, { - "BriefDescription": "Uncore frequency per die [GHZ]", - "MetricExpr": "Socket_CLKS / #num_dies / duration_time / 1e9", - "MetricGroup": "SoC", - "MetricName": "UNCORE_FREQ" + "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks", + "MetricExpr": "100 * (tma_frontend_bound - tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) - tma_info_big_code", + "MetricGroup": "Fed;FetchBW;Frontend", + "MetricName": "tma_info_instruction_fetch_bw", + "MetricThreshold": "tma_info_instruction_fetch_bw > 20" }, { - "BriefDescription": "Percentage of time spent in the active CPU power state C0", - "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", - "MetricName": "cpu_utilization", - "ScaleUnit": "100%" + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Summary;TmaL1;tma_L1_group", + "MetricName": "tma_info_instructions", + "PublicDescription": "Total number of retired Instructions. Sample with: INST_RETIRED.PREC_DIST" }, { - "BriefDescription": "CPU operating frequency (in GHz)", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / duration_time", - "MetricName": "cpu_operating_frequency", - "ScaleUnit": "1GHz" + "BriefDescription": "Average IO (network or disk) Bandwidth Use for Writes [GB / sec]", + "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e9 / duration_time", + "MetricGroup": "IoBW;Mem;Server;SoC", + "MetricName": "tma_info_io_write_bw" }, { - "BriefDescription": "Cycles per instruction retired; indicating how much time each executed instruction took; in units of cycles.", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY", - "MetricName": "cpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + FP_ARITH_INST_RETIRED2.SCALAR + (cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0x3c@ + FP_ARITH_INST_RETIRED2.VECTOR))", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_iparith", + "MetricThreshold": "tma_info_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). May undercount due to FMA double counting. Approximated prior to BDW." }, { - "BriefDescription": "The ratio of number of completed memory load instructions to the total number completed instructions", - "MetricExpr": "MEM_INST_RETIRED.ALL_LOADS / INST_RETIRED.ANY", - "MetricName": "loads_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic AMX operation (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / AMX_OPS_RETIRED.BF16", + "MetricGroup": "Flops;FpVector;InsType;Server", + "MetricName": "tma_info_iparith_amx_f16", + "MetricThreshold": "tma_info_iparith_amx_f16 < 10", + "PublicDescription": "Instructions per FP Arithmetic AMX operation (lower number means higher occurrence rate). Operations factored per matrices' sizes of the AMX instructions." }, { - "BriefDescription": "The ratio of number of completed memory store instructions to the total number completed instructions", - "MetricExpr": "MEM_INST_RETIRED.ALL_STORES / INST_RETIRED.ANY", - "MetricName": "stores_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per Integer Arithmetic AMX operation (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / AMX_OPS_RETIRED.INT8", + "MetricGroup": "InsType;IntVector;Server", + "MetricName": "tma_info_iparith_amx_int8", + "MetricThreshold": "tma_info_iparith_amx_int8 < 10", + "PublicDescription": "Instructions per Integer Arithmetic AMX operation (lower number means higher occurrence rate). Operations factored per matrices' sizes of the AMX instructions." }, { - "BriefDescription": "Ratio of number of requests missing L1 data cache (includes data+rfo w/ prefetches) to the total number of completed instructions", - "MetricExpr": "L1D.REPLACEMENT / INST_RETIRED.ANY", - "MetricName": "l1d_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.128B_PACKED_HALF)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx128", + "MetricThreshold": "tma_info_iparith_avx128 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "Ratio of number of demand load requests hitting in L1 data cache to the total number of completed instructions ", - "MetricExpr": "MEM_LOAD_RETIRED.L1_HIT / INST_RETIRED.ANY", - "MetricName": "l1d_demand_data_read_hits_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.256B_PACKED_HALF)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx256", + "MetricThreshold": "tma_info_iparith_avx256 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "Ratio of number of code read requests missing in L1 instruction cache (includes prefetches) to the total number of completed instructions", - "MetricExpr": "L2_RQSTS.ALL_CODE_RD / INST_RETIRED.ANY", - "MetricName": "l1_i_code_read_misses_with_prefetches_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.512B_PACKED_HALF)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx512", + "MetricThreshold": "tma_info_iparith_avx512 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "Ratio of number of completed demand load requests hitting in L2 cache to the total number of completed instructions ", - "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT / INST_RETIRED.ANY", - "MetricName": "l2_demand_data_read_hits_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_iparith_scalar_dp", + "MetricThreshold": "tma_info_iparith_scalar_dp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "Ratio of number of requests missing L2 cache (includes code+data+rfo w/ prefetches) to the total number of completed instructions", - "MetricExpr": "L2_LINES_IN.ALL / INST_RETIRED.ANY", - "MetricName": "l2_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_iparith_scalar_sp", + "MetricThreshold": "tma_info_iparith_scalar_sp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "Ratio of number of completed data read request missing L2 cache to the total number of completed instructions", - "MetricExpr": "MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", - "MetricName": "l2_demand_data_read_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per a microcode Assist invocation", + "MetricExpr": "INST_RETIRED.ANY / cpu@ASSISTS.ANY\\,umask\\=0x1B@", + "MetricGroup": "Pipeline;Ret;Retire", + "MetricName": "tma_info_ipassist", + "MetricThreshold": "tma_info_ipassist < 100e3", + "PublicDescription": "Instructions per a microcode Assist invocation. See Assists tree node for details (lower number means higher occurrence rate)" }, { - "BriefDescription": "Ratio of number of code read request missing L2 cache to the total number of completed instructions", - "MetricExpr": "L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", - "MetricName": "l2_demand_code_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Branches;Fed;InsType", + "MetricName": "tma_info_ipbranch", + "MetricThreshold": "tma_info_ipbranch < 8" }, { - "BriefDescription": "Ratio of number of data read requests missing last level core cache (includes demand w/ prefetches) to the total number of completed instructions", - "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_LLCPREFDATA + UNC_CHA_TOR_INSERTS.IA_MISS_DRD + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF) / INST_RETIRED.ANY", - "MetricName": "llc_data_read_mpi_demand_plus_prefetch", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions Per Cycle (per Logical Processor)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_clks", + "MetricGroup": "Ret;Summary", + "MetricName": "tma_info_ipc" }, { - "BriefDescription": "Ratio of number of code read requests missing last level core cache (includes demand w/ prefetches) to the total number of completed instructions", - "MetricExpr": "UNC_CHA_TOR_INSERTS.IA_MISS_CRD / INST_RETIRED.ANY", - "MetricName": "llc_code_read_mpi_demand_plus_prefetch", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_ipcall", + "MetricThreshold": "tma_info_ipcall < 200" }, { - "BriefDescription": "Average latency of a last level cache (LLC) demand data read miss (read memory access) in nano seconds", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_INSERTS.IA_MISS_DRD) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD) * #num_packages)) * duration_time", - "MetricName": "llc_demand_data_read_miss_latency", - "ScaleUnit": "1ns" + "BriefDescription": "Instructions per non-speculative DSB miss (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", + "MetricGroup": "DSBmiss;Fed", + "MetricName": "tma_info_ipdsb_miss_ret", + "MetricThreshold": "tma_info_ipdsb_miss_ret < 50" }, { - "BriefDescription": "Average latency of a last level cache (LLC) demand data read miss (read memory access) addressed to local memory in nano seconds", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_LOCAL / UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_LOCAL) * #num_packages)) * duration_time", - "MetricName": "llc_demand_data_read_miss_latency_for_local_requests", - "ScaleUnit": "1ns" + "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", + "MetricGroup": "Branches;OS", + "MetricName": "tma_info_ipfarbranch", + "MetricThreshold": "tma_info_ipfarbranch < 1e6" }, { - "BriefDescription": "Average latency of a last level cache (LLC) demand data read miss (read memory access) addressed to remote memory in nano seconds", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_REMOTE / UNC_CHA_TOR_INSERTS.IA_MISS_DRD_REMOTE) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_REMOTE) * #num_packages)) * duration_time", - "MetricName": "llc_demand_data_read_miss_latency_for_remote_requests", - "ScaleUnit": "1ns" + "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_flopc", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_ipflop", + "MetricThreshold": "tma_info_ipflop < 10" }, { - "BriefDescription": "Average latency of a last level cache (LLC) demand data read miss (read memory access) addressed to Intel(R) Optane(TM) Persistent Memory(PMEM) in nano seconds", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_PMM / UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PMM) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_PMM) * #num_packages)) * duration_time", - "MetricName": "llc_demand_data_read_miss_to_pmem_latency", - "ScaleUnit": "1ns" + "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipload", + "MetricThreshold": "tma_info_ipload < 3" }, { - "BriefDescription": "Average latency of a last level cache (LLC) demand data read miss (read memory access) addressed to DRAM in nano seconds", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / UNC_CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR) * #num_packages)) * duration_time", - "MetricName": "llc_demand_data_read_miss_to_dram_latency", - "ScaleUnit": "1ns" + "BriefDescription": "Instructions per retired mispredicts for conditional non-taken branches (lower number means higher occurrence rate).", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_cond_ntaken", + "MetricThreshold": "tma_info_ipmisp_cond_ntaken < 200" }, { - "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions", - "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", - "MetricName": "itlb_2nd_level_mpi", - "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the ITLB (Instruction TLB) and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per retired mispredicts for conditional taken branches (lower number means higher occurrence rate).", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_cond_taken", + "MetricThreshold": "tma_info_ipmisp_cond_taken < 200" }, { - "BriefDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions", - "MetricExpr": "ITLB_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY", - "MetricName": "itlb_2nd_level_large_page_mpi", - "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the Instruction Translation Lookaside Buffer (ITLB) and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_indirect", + "MetricThreshold": "tma_info_ipmisp_indirect < 1e3" }, { - "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions", - "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", - "MetricName": "dtlb_2nd_level_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per retired mispredicts for return branches (lower number means higher occurrence rate).", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_ret", + "MetricThreshold": "tma_info_ipmisp_ret < 500" }, { - "BriefDescription": "Ratio of number of completed page walks (for 2 megabyte page sizes) caused by demand data loads to the total number of completed instructions", - "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY", - "MetricName": "dtlb_2nd_level_2mb_large_page_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the Data Translation Lookaside Buffer (DTLB) and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BadSpec;BrMispredicts", + "MetricName": "tma_info_ipmispredict", + "MetricThreshold": "tma_info_ipmispredict < 200" }, { - "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions", - "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", - "MetricName": "dtlb_2nd_level_store_mpi", - "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipstore", + "MetricThreshold": "tma_info_ipstore < 8" }, { - "BriefDescription": "Memory read that miss the last level cache (LLC) addressed to local DRAM as a percentage of total memory read accesses, does not include LLC prefetches.", - "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_LOCAL) / (UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_REMOTE)", - "MetricName": "numa_reads_addressed_to_local_dram", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per Software prefetch instruction (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umask\\=0xF@", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_ipswpf", + "MetricThreshold": "tma_info_ipswpf < 100" }, { - "BriefDescription": "Memory reads that miss the last level cache (LLC) addressed to remote DRAM as a percentage of total memory read accesses, does not include LLC prefetches.", - "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_DRD_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_REMOTE) / (UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_REMOTE)", - "MetricName": "numa_reads_addressed_to_remote_dram", - "ScaleUnit": "100%" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", + "MetricName": "tma_info_iptb", + "MetricThreshold": "tma_info_iptb < 13", + "PublicDescription": "Instruction per taken branch. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_dsb_misses, tma_lcp" }, { - "BriefDescription": "Uncore operating frequency in GHz", - "MetricExpr": "UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_CLOCKTICKS) * #num_packages) / 1e9 / duration_time", - "MetricName": "uncore_frequency", - "ScaleUnit": "1GHz" + "BriefDescription": "Instructions per speculative Unknown Branch Misprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_instructions / BACLEARS.ANY", + "MetricGroup": "Fed", + "MetricName": "tma_info_ipunknown_branch" }, { - "BriefDescription": "Intel(R) Ultra Path Interconnect (UPI) data transmit bandwidth (MB/sec)", - "MetricExpr": "UNC_UPI_TxL_FLITS.ALL_DATA * 7.111111111111111 / 1e6 / duration_time", - "MetricName": "upi_data_transmit_bw", - "ScaleUnit": "1MB/s" + "BriefDescription": "Fraction of branches that are unconditional (direct or indirect) jumps", + "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_jump" }, { - "BriefDescription": "DDR memory read bandwidth (MB/sec)", - "MetricExpr": "UNC_M_CAS_COUNT.RD * 64 / 1e6 / duration_time", - "MetricName": "memory_bandwidth_read", - "ScaleUnit": "1MB/s" + "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_cpi" }, { - "BriefDescription": "DDR memory write bandwidth (MB/sec)", - "MetricExpr": "UNC_M_CAS_COUNT.WR * 64 / 1e6 / duration_time", - "MetricName": "memory_bandwidth_write", - "ScaleUnit": "1MB/s" + "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_utilization", + "MetricThreshold": "tma_info_kernel_utilization > 0.05" }, { - "BriefDescription": "DDR memory bandwidth (MB/sec)", - "MetricExpr": "(UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) * 64 / 1e6 / duration_time", - "MetricName": "memory_bandwidth_total", - "ScaleUnit": "1MB/s" + "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw" }, { - "BriefDescription": "Intel(R) Optane(TM) Persistent Memory(PMEM) memory read bandwidth (MB/sec)", - "MetricExpr": "UNC_M_PMM_RPQ_INSERTS * 64 / 1e6 / duration_time", - "MetricName": "pmem_memory_bandwidth_read", - "ScaleUnit": "1MB/s" + "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "tma_info_l1d_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw_1t" }, { - "BriefDescription": "Intel(R) Optane(TM) Persistent Memory(PMEM) memory write bandwidth (MB/sec)", - "MetricExpr": "UNC_M_PMM_WPQ_INSERTS * 64 / 1e6 / duration_time", - "MetricName": "pmem_memory_bandwidth_write", - "ScaleUnit": "1MB/s" + "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l1mpki" }, { - "BriefDescription": "Intel(R) Optane(TM) Persistent Memory(PMEM) memory bandwidth (MB/sec)", - "MetricExpr": "(UNC_M_PMM_RPQ_INSERTS + UNC_M_PMM_WPQ_INSERTS) * 64 / 1e6 / duration_time", - "MetricName": "pmem_memory_bandwidth_total", - "ScaleUnit": "1MB/s" + "BriefDescription": "L1 cache true misses per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l1mpki_load" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the CPU.", - "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e6 / duration_time", - "MetricName": "io_bandwidth_disk_or_network_writes", - "ScaleUnit": "1MB/s" + "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the CPU.", - "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM + UNC_CHA_TOR_INSERTS.IO_ITOMCACHENEAR) * 64 / 1e6 / duration_time", - "MetricName": "io_bandwidth_disk_or_network_reads", - "ScaleUnit": "1MB/s" + "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "tma_info_l2_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw_1t" }, { - "BriefDescription": "Uops delivered from decoded instruction cache (decoded stream buffer or DSB) as a percent of total uops delivered to Instruction Decode Queue", - "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS + LSD.UOPS)", - "MetricName": "percent_uops_delivered_from_decoded_icache", - "ScaleUnit": "100%" + "BriefDescription": "Rate of non silent evictions from the L2 cache per Kilo instruction", + "MetricExpr": "1e3 * L2_LINES_OUT.NON_SILENT / tma_info_instructions", + "MetricGroup": "L2Evicts;Mem;Server", + "MetricName": "tma_info_l2_evictions_nonsilent_pki" }, { - "BriefDescription": "Uops delivered from legacy decode pipeline (Micro-instruction Translation Engine or MITE) as a percent of total uops delivered to Instruction Decode Queue", - "MetricExpr": "IDQ.MITE_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS + LSD.UOPS)", - "MetricName": "percent_uops_delivered_from_legacy_decode_pipeline", - "ScaleUnit": "100%" + "BriefDescription": "Rate of silent evictions from the L2 cache per Kilo instruction where the evicted lines are dropped (no writeback to L3 or memory)", + "MetricExpr": "1e3 * L2_LINES_OUT.SILENT / tma_info_instructions", + "MetricGroup": "L2Evicts;Mem;Server", + "MetricName": "tma_info_l2_evictions_silent_pki" }, { - "BriefDescription": "Uops delivered from microcode sequencer (MS) as a percent of total uops delivered to Instruction Decode Queue", - "MetricExpr": "IDQ.MS_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS + LSD.UOPS)", - "MetricName": "percent_uops_delivered_from_microcode_sequencer", - "ScaleUnit": "100%" + "BriefDescription": "L2 cache hits per kilo instruction for all request types (including speculative)", + "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2hpki_all" }, { - "BriefDescription": "Bandwidth (MB/sec) of read requests that miss the last level cache (LLC) and go to local memory.", - "MetricExpr": "UNC_CHA_REQUESTS.READS_LOCAL * 64 / 1e6 / duration_time", - "MetricName": "llc_miss_local_memory_bandwidth_read", - "ScaleUnit": "1MB/s" + "BriefDescription": "L2 cache hits per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2hpki_load" }, { - "BriefDescription": "Bandwidth (MB/sec) of write requests that miss the last level cache (LLC) and go to local memory.", - "MetricExpr": "UNC_CHA_REQUESTS.WRITES_LOCAL * 64 / 1e6 / duration_time", - "MetricName": "llc_miss_local_memory_bandwidth_write", - "ScaleUnit": "1MB/s" + "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheMisses;Mem", + "MetricName": "tma_info_l2mpki" }, { - "BriefDescription": "Bandwidth (MB/sec) of read requests that miss the last level cache (LLC) and go to remote memory.", - "MetricExpr": "UNC_CHA_REQUESTS.READS_REMOTE * 64 / 1e6 / duration_time", - "MetricName": "llc_miss_remote_memory_bandwidth_read", - "ScaleUnit": "1MB/s" + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all request types (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem;Offcore", + "MetricName": "tma_info_l2mpki_all" }, { - "BriefDescription": "Bandwidth (MB/sec) of write requests that miss the last level cache (LLC) and go to remote memory.", - "MetricExpr": "UNC_CHA_REQUESTS.WRITES_REMOTE * 64 / 1e6 / duration_time", - "MetricName": "llc_miss_remote_memory_bandwidth_write", - "ScaleUnit": "1MB/s" + "BriefDescription": "L2 cache true code cacheline misses per kilo instruction", + "MetricExpr": "1e3 * FRONTEND_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_l2mpki_code" }, { - "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", - "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / slots", - "MetricGroup": "PGO;TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_frontend_bound", - "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Machine_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound.", - "ScaleUnit": "100%" + "BriefDescription": "L2 cache speculative code cacheline misses per kilo instruction", + "MetricExpr": "1e3 * L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_l2mpki_code_all" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", - "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / slots", - "MetricGroup": "Frontend;TopdownL2;tma_L2_group;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_latency", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period.", - "ScaleUnit": "100%" + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2mpki_load" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses.", - "MetricExpr": "ICACHE_DATA.STALLS / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_icache_misses", - "ScaleUnit": "100%" + "BriefDescription": "Average per-core data access bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_l3_cache_access_bw" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses.", - "MetricExpr": "ICACHE_TAG.STALLS / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_itlb_misses", - "ScaleUnit": "100%" + "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "tma_info_l3_cache_access_bw", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_l3_cache_access_bw_1t" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", - "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / CPU_CLK_UNHALTED.THREAD + tma_unknown_branches", - "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_branch_resteers", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings.", - "ScaleUnit": "100%" + "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. ", - "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) / max(1 - (tma_frontend_bound + topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound)), 0) * INT_MISC.CLEAR_RESTEER_CYCLES / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL4;tma_L4_group;tma_branch_resteers_group", - "MetricName": "tma_mispredicts_resteers", - "ScaleUnit": "100%" + "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "tma_info_l3_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw_1t" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. ", - "MetricExpr": "(1 - topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) / max(1 - (tma_frontend_bound + topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound)), 0)) * INT_MISC.CLEAR_RESTEER_CYCLES / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group", - "MetricName": "tma_clears_resteers", - "ScaleUnit": "100%" + "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l3mpki" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears", - "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BigFoot;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group", - "MetricName": "tma_unknown_branches", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (First fetch or hitting BPU capacity limit).", - "ScaleUnit": "100%" + "BriefDescription": "Average Latency for L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_load_l2_miss_latency" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_dsb_switches", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty.", - "ScaleUnit": "100%" + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=1@", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_load_l2_mlp" }, { - "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", - "MetricExpr": "DECODE.LCP / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_lcp", - "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs.", - "ScaleUnit": "100%" + "BriefDescription": "Average Latency for L3 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAND_DATA_RD / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_load_l3_miss_latency" }, { - "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", - "MetricExpr": "3 * IDQ.MS_SWITCHES / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_ms_switches", - "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals.", - "ScaleUnit": "100%" + "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", + "MetricExpr": "L1D_PEND_MISS.PENDING / MEM_LOAD_COMPLETED.L1_MISS_ANY", + "MetricGroup": "Mem;MemoryBound;MemoryLat", + "MetricName": "tma_info_load_miss_real_latency" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", - "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)", - "MetricGroup": "FetchBW;Frontend;TopdownL2;tma_L2_group;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_bandwidth", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend.", - "ScaleUnit": "100%" + "BriefDescription": "STLB (2nd level TLB) data load speculative misses per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_load_stlb_mpki" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", - "MetricExpr": "(IDQ.MITE_CYCLES_ANY - IDQ.MITE_CYCLES_OK) / CPU_CLK_UNHALTED.DISTRIBUTED / 2", - "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", - "MetricName": "tma_mite", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", - "ScaleUnit": "100%" + "BriefDescription": "Average latency of data read request to external DRAM memory [in nanoseconds]", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / UNC_CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / uncore_cha_0@event\\=0x1@", + "MetricGroup": "Mem;MemoryLat;Server;SoC", + "MetricName": "tma_info_mem_dram_read_latency", + "PublicDescription": "Average latency of data read request to external DRAM memory [in nanoseconds]. Accounts for demand loads and L1/L2 data-read prefetches" }, { - "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=0x1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=0x2@) / CPU_CLK_UNHALTED.DISTRIBUTED", - "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_mite_group", - "MetricName": "tma_decoder0_alone", - "ScaleUnit": "100%" + "BriefDescription": "Average number of parallel data read requests to external memory", + "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD@thresh\\=1@", + "MetricGroup": "Mem;MemoryBW;SoC", + "MetricName": "tma_info_mem_parallel_reads", + "PublicDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", - "MetricExpr": "(IDQ.DSB_CYCLES_ANY - IDQ.DSB_CYCLES_OK) / CPU_CLK_UNHALTED.DISTRIBUTED / 2", - "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", - "MetricName": "tma_dsb", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", - "ScaleUnit": "100%" + "BriefDescription": "Average latency of data read request to external 3D X-Point memory [in nanoseconds]", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_PMM / UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PMM) / uncore_cha_0@event\\=0x1@", + "MetricGroup": "Mem;MemoryLat;Server;SoC", + "MetricName": "tma_info_mem_pmm_read_latency", + "PublicDescription": "Average latency of data read request to external 3D X-Point memory [in nanoseconds]. Accounts for demand loads and L1/L2 data-read prefetches" }, { - "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", - "MetricExpr": "max(1 - (tma_frontend_bound + topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound)), 0)", - "MetricGroup": "TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_bad_speculation", - "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", - "ScaleUnit": "100%" + "BriefDescription": "Average latency of data read request to external memory (in nanoseconds)", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_INSERTS.IA_MISS_DRD) / (tma_info_socket_clks / duration_time)", + "MetricGroup": "Mem;MemoryLat;SoC", + "MetricName": "tma_info_mem_read_latency", + "PublicDescription": "Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches. ([RKL+]memory-controller only)" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", - "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL2;tma_L2_group;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_branch_mispredicts", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path.", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks", + "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full))) + tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_fb_full / (tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk))", + "MetricGroup": "Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_info_memory_bandwidth", + "MetricThreshold": "tma_info_memory_bandwidth > 20", + "PublicDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_mem_bandwidth, tma_sq_full" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", - "MetricExpr": "max(0, tma_bad_speculation - topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound))", - "MetricGroup": "BadSpec;MachineClears;TopdownL2;tma_L2_group;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_machine_clears", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes.", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)", + "MetricExpr": "100 * tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming_stores)))", + "MetricGroup": "Mem;MemoryTLB;Offcore;tma_issueTLB", + "MetricName": "tma_info_memory_data_tlbs", + "MetricThreshold": "tma_info_memory_data_tlbs > 20", + "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store" }, { - "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", - "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", - "MetricGroup": "TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_backend_bound", - "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound.", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of Memory Latency related bottlenecks (external memory and off-core caches)", + "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound))", + "MetricGroup": "Mem;MemoryLat;Offcore;tma_issueLat", + "MetricName": "tma_info_memory_latency", + "MetricThreshold": "tma_info_memory_latency > 20", + "PublicDescription": "Total pipeline cost of Memory Latency related bottlenecks (external memory and off-core caches). Related metrics: tma_l3_hit_latency, tma_mem_latency" }, { - "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", - "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", - "MetricGroup": "Backend;TopdownL2;tma_L2_group;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_memory_bound", - "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks", + "MetricExpr": "100 * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_mispredictions", + "MetricThreshold": "tma_info_mispredictions > 20", + "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_branch_misprediction_cost, tma_mispredicts_resteers" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", - "MetricExpr": "max((EXE_ACTIVITY.BOUND_ON_LOADS - MEMORY_ACTIVITY.STALLS_L1D_MISS) / CPU_CLK_UNHALTED.THREAD, 0)", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", - "MetricName": "tma_l1_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache.", - "ScaleUnit": "100%" + "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss", + "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES", + "MetricGroup": "Mem;MemoryBW;MemoryBound", + "MetricName": "tma_info_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=0x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_dtlb_load", - "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss.", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of branches of other types (not individually covered by other metrics in Info.Branches group)", + "MetricExpr": "1 - (tma_info_cond_nt + tma_info_cond_tk + tma_info_callret + tma_info_jump)", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_other_branches" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) DTLB was missed by load accesses, that later on hit in second-level TLB (STLB)", - "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", - "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group", - "MetricName": "tma_load_stlb_hit", - "ScaleUnit": "100%" + "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", + "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_PENDING + DTLB_STORE_MISSES.WALK_PENDING) / (4 * tma_info_core_clks)", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_page_walks_utilization", + "MetricThreshold": "tma_info_page_walks_utilization > 0.5" }, { - "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by load accesses, performing a hardware page walk", - "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group", - "MetricName": "tma_load_stlb_miss", - "ScaleUnit": "100%" + "BriefDescription": "Average 3DXP Memory Bandwidth Use for reads [GB / sec]", + "MetricExpr": "64 * UNC_M_PMM_RPQ_INSERTS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW;Server;SoC", + "MetricName": "tma_info_pmm_read_bw" }, { - "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", - "MetricExpr": "min(13 * LD_BLOCKS.STORE_FORWARD / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_store_fwd_blk", - "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", - "ScaleUnit": "100%" + "BriefDescription": "Average 3DXP Memory Bandwidth Use for Writes [GB / sec]", + "MetricExpr": "64 * UNC_M_PMM_WPQ_INSERTS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW;Server;SoC", + "MetricName": "tma_info_pmm_write_bw" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", - "MetricExpr": "min((16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS.ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10 * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO))) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_lock_latency", - "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them.", - "ScaleUnit": "100%" + "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", + "MetricExpr": "tma_retiring * tma_info_slots / cpu@UOPS_RETIRED.SLOTS\\,cmask\\=1@", + "MetricGroup": "Pipeline;Ret", + "MetricName": "tma_info_retire" }, { - "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. ", - "MetricExpr": "min(Load_Miss_Real_Latency * LD_BLOCKS.NO_SR / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_split_loads", - "ScaleUnit": "100%" + "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", + "MetricExpr": "TOPDOWN.SLOTS", + "MetricGroup": "TmaL1;tma_L1_group", + "MetricName": "tma_info_slots" }, { - "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", - "MetricExpr": "L1D_PEND_MISS.FB_FULL / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryBW;TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_fb_full", - "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory).", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of Physical Core issue-slots utilized by this Logical Processor", + "MetricExpr": "(tma_info_slots / (TOPDOWN.SLOTS / 2) if #SMT_on else 1)", + "MetricGroup": "SMT;TmaL1;tma_L1_group", + "MetricName": "tma_info_slots_utilization" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads", - "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L1D_MISS - MEMORY_ACTIVITY.STALLS_L2_MISS) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", - "MetricName": "tma_l2_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance.", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", + "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_DISTRIBUTED if #SMT_on else 0)", + "MetricGroup": "SMT", + "MetricName": "tma_info_smt_2t_utilization" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", - "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L2_MISS - MEMORY_ACTIVITY.STALLS_L3_MISS) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", - "MetricName": "tma_l3_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance.", - "ScaleUnit": "100%" + "BriefDescription": "Socket actual clocks when any core is active on that socket", + "MetricExpr": "uncore_cha_0@event\\=0x1@", + "MetricGroup": "SoC", + "MetricName": "tma_info_socket_clks" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", - "MetricExpr": "min(((28 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 3 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + (27 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 3 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_l3_bound_group", - "MetricName": "tma_contested_accesses", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing.", - "ScaleUnit": "100%" + "BriefDescription": "STLB (2nd level TLB) data store speculative misses per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_store_stlb_mpki" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", - "MetricExpr": "min((27 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 3 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD + MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Offcore;Snoop;TopdownL4;tma_L4_group;tma_l3_bound_group", - "MetricName": "tma_data_sharing", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance.", - "ScaleUnit": "100%" + "BriefDescription": "Estimated fraction of retirement-cycles dealing with repeat instructions", + "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.SLOTS\\,cmask\\=1@", + "MetricGroup": "Pipeline;Ret", + "MetricName": "tma_info_strings_cycles", + "MetricThreshold": "tma_info_strings_cycles > 0.1" }, { - "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", - "MetricExpr": "min((12 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 3 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l3_bound_group", - "MetricName": "tma_l3_hit_latency", - "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings.", - "ScaleUnit": "100%" + "BriefDescription": "Tera Integer (matrix) Operations Per Second", + "MetricExpr": "8 * AMX_OPS_RETIRED.INT8 / 1e12 / duration_time", + "MetricGroup": "Cor;HPC;IntVector;Server", + "MetricName": "tma_info_tiops" }, { - "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", - "MetricExpr": "(XQ.FULL_CYCLES + L1D_PEND_MISS.L2_STALLS) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_l3_bound_group", - "MetricName": "tma_sq_full", - "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). The Super Queue is used for requests to access the L2 cache or to go out to the Uncore.", - "ScaleUnit": "100%" + "BriefDescription": "Average Frequency Utilization relative nominal frequency", + "MetricExpr": "tma_info_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_turbo_utilization" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", - "MetricExpr": "min(MEMORY_ACTIVITY.STALLS_L3_MISS / CPU_CLK_UNHALTED.THREAD - tma_pmm_bound, 1)", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", - "MetricName": "tma_dram_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance.", - "ScaleUnit": "100%" + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "tma_retiring * tma_info_slots / INST_RETIRED.ANY", + "MetricGroup": "Pipeline;Ret;Retire", + "MetricName": "tma_info_uoppi", + "MetricThreshold": "tma_info_uoppi > 1.05" }, { - "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=0x4@) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group", - "MetricName": "tma_mem_bandwidth", - "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that).", - "ScaleUnit": "100%" + "BriefDescription": "Cross-socket Ultra Path Interconnect (UPI) data transmit bandwidth for data only [MB / sec]", + "MetricExpr": "UNC_UPI_TxL_FLITS.ALL_DATA * 64 / 9 / 1e6", + "MetricGroup": "Server;SoC", + "MetricName": "tma_info_upi_data_transmit_bw" }, { - "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to memory bandwidth Allocation feature (RDT's memory bandwidth throttling).", - "MetricExpr": "INT_MISC.MBA_STALLS / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryBW;Offcore;Server;TopdownL5;tma_L5_group;tma_mem_bandwidth_group", - "MetricName": "tma_mba_stalls", - "ScaleUnit": "100%" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "tma_retiring * tma_info_slots / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW", + "MetricName": "tma_info_uptb", + "MetricThreshold": "tma_info_uptb < 9" }, { - "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / CPU_CLK_UNHALTED.THREAD - tma_mem_bandwidth", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group", - "MetricName": "tma_mem_latency", - "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that).", + "BriefDescription": "This metric approximates arithmetic Integer (Int) matrix uops fraction the CPU has retired (aggregated across all supported Int datatypes in AMX engine)", + "MetricExpr": "cpu@AMX_OPS_RETIRED.INT8\\,cmask\\=1@ / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;HPC;IntVector;Pipeline;Server;TopdownL4;tma_L4_group;tma_int_operations_group", + "MetricName": "tma_int_amx", + "MetricThreshold": "tma_int_amx > 0.1 & (tma_int_operations > 0.1 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic Integer (Int) matrix uops fraction the CPU has retired (aggregated across all supported Int datatypes in AMX engine). Refer to AMX_Busy and TIOPs metrics for actual AMX utilization and Int performance, resp.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory", - "MetricExpr": "min((66.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 12 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_group", - "MetricName": "tma_local_dram", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory. Caching will improve the latency and increase performance.", + "BriefDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_int_vector_128b + tma_int_vector_256b + tma_shuffles + tma_int_amx", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_int_operations", + "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired). Vector/Matrix Int operations and shuffles are counted. Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory", - "MetricExpr": "min((131 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 12 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latency_group", - "MetricName": "tma_remote_dram", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory. This is caused often due to non-optimal NUMA allocations. #link to NUMA article", + "BriefDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired", + "MetricExpr": "(INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128) / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;tma_int_operations_group;tma_issue2P", + "MetricName": "tma_int_vector_128b", + "MetricThreshold": "tma_int_vector_128b > 0.1 & (tma_int_operations > 0.1 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues", - "MetricExpr": "min(((120 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 12 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM + (120 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 12 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latency_group", - "MetricName": "tma_remote_cache", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues. This is caused often due to non-optimal NUMA allocations. #link to NUMA article", + "BriefDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired", + "MetricExpr": "(INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 + INT_VEC_RETIRED.VNNI_256) / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;tma_int_operations_group;tma_issue2P", + "MetricName": "tma_int_vector_256b", + "MetricThreshold": "tma_int_vector_256b > 0.1 & (tma_int_operations > 0.1 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates (based on idle latencies) how often the CPU was stalled on accesses to external 3D-Xpoint (Crystal Ridge, a.k.a", - "MetricExpr": "min(((1 - (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))) / (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + (25 * (MEM_LOAD_RETIRED.LOCAL_PMM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 33 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))))) * (MEMORY_ACTIVITY.STALLS_L3_MISS / CPU_CLK_UNHALTED.THREAD) if 1e6 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM + MEM_LOAD_RETIRED.LOCAL_PMM) > MEM_LOAD_RETIRED.L1_MISS else 0), 1)", - "MetricGroup": "MemoryBound;Server;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", - "MetricName": "tma_pmm_bound", - "PublicDescription": "This metric roughly estimates (based on idle latencies) how often the CPU was stalled on accesses to external 3D-Xpoint (Crystal Ridge, a.k.a. IXP) memory by loads, PMM stands for Persistent Memory Module. ", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses", + "MetricExpr": "ICACHE_TAG.STALLS / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_itlb_misses", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", - "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", - "MetricName": "tma_store_bound", - "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck.", + "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", + "MetricExpr": "max((EXE_ACTIVITY.BOUND_ON_LOADS - MEMORY_ACTIVITY.STALLS_L1D_MISS) / tma_info_clks, 0)", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group", + "MetricName": "tma_l1_bound", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB_HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", - "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_store_bound_group", - "MetricName": "tma_store_latency", - "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full)", + "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L1D_MISS - MEMORY_ACTIVITY.STALLS_L2_MISS) / tma_info_clks", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l2_bound", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing", - "MetricExpr": "min(28 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_store_bound_group", - "MetricName": "tma_false_sharing", - "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. ", + "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", + "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L2_MISS - MEMORY_ACTIVITY.STALLS_L3_MISS) / tma_info_clks", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l3_bound", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents rate of split store accesses", - "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / CPU_CLK_UNHALTED.DISTRIBUTED", - "MetricGroup": "TopdownL4;tma_L4_group;tma_store_bound_group", - "MetricName": "tma_split_stores", - "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity.", + "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", + "MetricExpr": "33 * tma_info_average_frequency * MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group", + "MetricName": "tma_l3_hit_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tma_info_memory_latency, tma_mem_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores", - "MetricExpr": "min(9 * OCR.STREAMING_WR.ANY_RESPONSE / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_store_bound_group", - "MetricName": "tma_streaming_stores", - "PublicDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should Streaming stores be a bottleneck.", + "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", + "MetricExpr": "DECODE.LCP / tma_info_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_lcp", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_dsb_misses, tma_info_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", - "MetricExpr": "min((7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=0x1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / CPU_CLK_UNHALTED.DISTRIBUTED, 1)", - "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_store_bound_group", - "MetricName": "tma_dtlb_store", - "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data.", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", + "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_light_operations", + "MetricThreshold": "tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the TLB was missed by store accesses, hitting in the second-level TLB (STLB)", - "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", - "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", - "MetricName": "tma_store_stlb_hit", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations", + "MetricExpr": "UOPS_DISPATCHED.PORT_2_3_10 / (3 * tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_load_op_utilization", + "MetricThreshold": "tma_load_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations. Sample with: UOPS_DISPATCHED.PORT_2_3_10", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk", - "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / CPU_CLK_UNHALTED.DISTRIBUTED", - "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", - "MetricName": "tma_store_stlb_miss", + "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) DTLB was missed by load accesses, that later on hit in second-level TLB (STLB)", + "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group", + "MetricName": "tma_load_stlb_hit", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", - "MetricExpr": "max(0, topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound)) + 0 * slots", - "MetricGroup": "Backend;Compute;TopdownL2;tma_L2_group;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_core_bound", - "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", + }, + { + "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by load accesses, performing a hardware page walk", + "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group", + "MetricName": "tma_load_stlb_miss", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", - "MetricExpr": "ARITH.DIVIDER_ACTIVE / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", - "MetricName": "tma_divider", - "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication.", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory", + "MetricExpr": "71 * tma_info_average_frequency * MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_group", + "MetricName": "tma_local_dram", + "MetricThreshold": "tma_local_dram > 0.1 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory. Caching will improve the latency and increase performance. Sample with: MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", - "MetricExpr": "((cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=0x80@ + (EXE_ACTIVITY.1_PORTS_UTIL + topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * cpu@EXE_ACTIVITY.2_PORTS_UTIL\\,umask\\=0xc@)) / CPU_CLK_UNHALTED.THREAD if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * cpu@EXE_ACTIVITY.2_PORTS_UTIL\\,umask\\=0xc@) / CPU_CLK_UNHALTED.THREAD) + 0 * slots", - "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group", - "MetricName": "tma_ports_utilization", - "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", + "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS.ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10 * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO))) / tma_info_clks", + "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group", + "MetricName": "tma_lock_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOADS_PS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=0x80@ / CPU_CLK_UNHALTED.THREAD + tma_serializing_operation * (CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_0", - "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", + "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)", + "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations", - "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_0_group", - "MetricName": "tma_serializing_operation", - "PublicDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations. Instructions like CPUID; WRMSR or LFENCE serialize the out-of-order execution which may limit performance.", + "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to memory bandwidth Allocation feature (RDT's memory bandwidth throttling).", + "MetricExpr": "INT_MISC.MBA_STALLS / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;Server;TopdownL5;tma_L5_group;tma_mem_bandwidth_group", + "MetricName": "tma_mba_stalls", + "MetricThreshold": "tma_mba_stalls > 0.1 & (tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions.", - "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "TopdownL6;tma_L6_group;tma_serializing_operation_group", - "MetricName": "tma_slow_pause", + "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=4@) / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW", + "MetricName": "tma_mem_bandwidth", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_info_memory_bandwidth, tma_sq_full", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to LFENCE Instructions.", - "MetricExpr": "min(13 * MISC2_RETIRED.LFENCE / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "TopdownL6;tma_L6_group;tma_serializing_operation_group", - "MetricName": "tma_memory_fence", + "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_clks - tma_mem_bandwidth", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat", + "MetricName": "tma_mem_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_info_memory_latency, tma_l3_hit_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued", - "MetricExpr": "min(160 * ASSISTS.SSE_AVX_MIX / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_0_group", - "MetricName": "tma_mixing_vectors", - "PublicDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued. Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic.", + "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", + "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_slots", + "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles where the Advanced Matrix Extensions (AMX) execution engine was busy with tile (arithmetic) operations", - "MetricExpr": "EXE.AMX_BUSY / CPU_CLK_UNHALTED.DISTRIBUTED", - "MetricGroup": "Compute;HPC;Server;TopdownL5;tma_L5_group;tma_ports_utilized_0_group", - "MetricName": "tma_amx_busy", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to LFENCE Instructions.", + "MetricExpr": "13 * MISC2_RETIRED.LFENCE / tma_info_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_serializing_operation_group", + "MetricName": "tma_memory_fence", + "MetricThreshold": "tma_memory_fence > 0.05 & (tma_serializing_operation > 0.1 & (tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_1", - "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful.", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations -- uops for memory load or store accesses.", + "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_retiring * tma_info_slots)", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_memory_operations", + "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operations > 0.6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_2", - "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop.", + "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", + "MetricExpr": "UOPS_RETIRED.MS / tma_info_slots", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS", + "MetricName": "tma_microcode_sequencer", + "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise).", - "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_3m", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage", + "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_clks", + "MetricGroup": "BadSpec;BrMispredicts;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueBM", + "MetricName": "tma_mispredicts_resteers", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_branch_mispredicts, tma_info_branch_misprediction_cost, tma_info_mispredictions", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", - "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + UOPS_DISPATCHED.PORT_5_11 + UOPS_DISPATCHED.PORT_6) / (5 * CPU_CLK_UNHALTED.DISTRIBUTED)", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", - "MetricName": "tma_alu_op_utilization", + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", + "MetricExpr": "(IDQ.MITE_CYCLES_ANY - IDQ.MITE_CYCLES_OK) / tma_info_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_mite", + "MetricThreshold": "tma_mite > 0.1 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 6 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sample with: FRONTEND_RETIRED.ANY_DSB_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch)", - "MetricExpr": "UOPS_DISPATCHED.PORT_0 / CPU_CLK_UNHALTED.DISTRIBUTED", - "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilization_group", - "MetricName": "tma_port_0", + "BriefDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued", + "MetricExpr": "160 * ASSISTS.SSE_AVX_MIX / tma_info_clks", + "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utilized_0_group", + "MetricName": "tma_mixing_vectors", + "MetricThreshold": "tma_mixing_vectors > 0.05", + "PublicDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued. Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic. Related metrics: tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU)", - "MetricExpr": "UOPS_DISPATCHED.PORT_1 / CPU_CLK_UNHALTED.DISTRIBUTED", - "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group", - "MetricName": "tma_port_1", + "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", + "MetricExpr": "3 * cpu@UOPS_RETIRED.MS\\,cmask\\=1\\,edge@ / (tma_retiring * tma_info_slots / UOPS_ISSUED.ANY) / tma_info_clks", + "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: FRONTEND_RETIRED.MS_FLOWS. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU)", - "MetricExpr": "UOPS_DISPATCHED.PORT_6 / CPU_CLK_UNHALTED.DISTRIBUTED", - "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group", - "MetricName": "tma_port_6", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused", + "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHES - INST_RETIRED.MACRO_FUSED) / (tma_retiring * tma_info_slots)", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_non_fused_branches", + "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused. Non-conditional branches like direct JMP or CALL would count here. Can be used to examine fusible conditional jumps that were not fused.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations", - "MetricExpr": "UOPS_DISPATCHED.PORT_2_3_10 / (3 * CPU_CLK_UNHALTED.DISTRIBUTED)", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", - "MetricName": "tma_load_op_utilization", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_retiring * tma_info_slots)", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_nop_instructions", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions. Compilers often use NOPs for certain address alignments - e.g. start address of a function or loop body. Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations", - "MetricExpr": "(UOPS_DISPATCHED.PORT_4_9 + UOPS_DISPATCHED.PORT_7_8) / (4 * CPU_CLK_UNHALTED.DISTRIBUTED)", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", - "MetricName": "tma_store_op_utilization", + "BriefDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes", + "MetricExpr": "max(0, tma_light_operations - (tma_fp_arith + tma_int_operations + tma_memory_operations + tma_fused_instructions + tma_non_fused_branches + tma_nop_instructions))", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_other_light_ops", + "MetricThreshold": "tma_other_light_ops > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes. May undercount due to FMA double counting", "ScaleUnit": "100%" }, { - "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", - "MetricGroup": "TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_retiring", - "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. ", + "BriefDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Page Faults", + "MetricExpr": "99 * ASSISTS.PAGE_FAULT / tma_info_slots", + "MetricGroup": "TopdownL5;tma_L5_group;tma_assists_group", + "MetricName": "tma_page_faults", + "MetricThreshold": "tma_page_faults > 0.05", + "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Page Faults. A Page Fault may apply on first application access to a memory page. Note operating system handling of page faults accounts for the majority of its cost.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", - "MetricExpr": "max(0, topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound)) + 0 * slots", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_L2_group;tma_retiring_group", - "MetricName": "tma_light_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved.", + "BriefDescription": "This metric roughly estimates (based on idle latencies) how often the CPU was stalled on accesses to external 3D-Xpoint (Crystal Ridge, a.k.a", + "MetricExpr": "((1 - (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))) / (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + (25 * (MEM_LOAD_RETIRED.LOCAL_PMM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 33 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))))) * (MEMORY_ACTIVITY.STALLS_L3_MISS / tma_info_clks) if 1e6 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM + MEM_LOAD_RETIRED.LOCAL_PMM) > MEM_LOAD_RETIRED.L1_MISS else 0)", + "MetricGroup": "MemoryBound;Server;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_pmm_bound", + "MetricThreshold": "tma_pmm_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric roughly estimates (based on idle latencies) how often the CPU was stalled on accesses to external 3D-Xpoint (Crystal Ridge, a.k.a. IXP) memory by loads, PMM stands for Persistent Memory Module.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD + tma_fp_scalar + tma_fp_vector + tma_fp_amx", - "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group", - "MetricName": "tma_fp_arith", - "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch)", + "MetricExpr": "UOPS_DISPATCHED.PORT_0 / tma_info_core_clks", + "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_0", + "MetricThreshold": "tma_port_0 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric serves as an approximation of legacy x87 usage", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD + 0 * slots", - "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", - "MetricName": "tma_x87_use", - "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU)", + "MetricExpr": "UOPS_DISPATCHED.PORT_1 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_1", + "MetricThreshold": "tma_port_1 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + FP_ARITH_INST_RETIRED2.SCALAR) / (topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * slots)", - "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group", - "MetricName": "tma_fp_scalar", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting.", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU)", + "MetricExpr": "UOPS_DISPATCHED.PORT_6 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_6", + "MetricThreshold": "tma_port_6 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU). Sample with: UOPS_DISPATCHED.PORT_6. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths", - "MetricExpr": "min((FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.VECTOR) / (topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * slots), 1)", - "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group", - "MetricName": "tma_fp_vector", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting.", + "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", + "MetricExpr": "((cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=0x80@ + tma_serializing_operation * (CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS) + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * cpu@EXE_ACTIVITY.2_PORTS_UTIL\\,umask\\=0xc@)) / tma_info_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * cpu@EXE_ACTIVITY.2_PORTS_UTIL\\,umask\\=0xc@) / tma_info_clks)", + "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_ports_utilization", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors", - "MetricExpr": "min((FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.128B_PACKED_HALF) / (topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * slots), 1)", - "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group", - "MetricName": "tma_fp_vector_128b", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting.", + "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=0x80@ / tma_info_clks + tma_serializing_operation * (CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS) / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_0", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors", - "MetricExpr": "min((FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.256B_PACKED_HALF) / (topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * slots), 1)", - "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group", - "MetricName": "tma_fp_vector_256b", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting.", + "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_1", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors", - "MetricExpr": "min((FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.512B_PACKED_HALF) / (topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * slots), 1)", - "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group", - "MetricName": "tma_fp_vector_512b", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors. May overcount due to FMA double counting.", + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Sample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic floating-point (FP) matrix uops fraction the CPU has retired (aggregated across all supported FP datatypes in AMX engine)", - "MetricExpr": "cpu@AMX_OPS_RETIRED.BF16\\,cmask\\=0x1@ / (topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * slots)", - "MetricGroup": "Compute;Flops;HPC;Pipeline;Server;TopdownL4;tma_L4_group;tma_fp_arith_group", - "MetricName": "tma_fp_amx", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) matrix uops fraction the CPU has retired (aggregated across all supported FP datatypes in AMX engine). Refer to AMX_Busy and GFLOPs metrics for actual AMX utilization and FP performance, resp.", + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_3m", + "MetricThreshold": "tma_ports_utilized_3m > 0.7 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Sample with: UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired)", - "MetricExpr": "tma_int_vector_128b + tma_int_vector_256b + tma_shuffles", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", - "MetricName": "tma_int_operations", - "PublicDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired). Vector/Matrix Int operations and shuffles are counted. Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain.", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues", + "MetricExpr": "(135.5 * tma_info_average_frequency * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM + 135.5 * tma_info_average_frequency * MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_issueSyncxn;tma_mem_latency_group", + "MetricName": "tma_remote_cache", + "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues. This is caused often due to non-optimal NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM_PS;MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD_PS. Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired.", - "MetricExpr": "(INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128) / (topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * slots)", - "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;tma_int_operations_group", - "MetricName": "tma_int_vector_128b", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory", + "MetricExpr": "149 * tma_info_average_frequency * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latency_group", + "MetricName": "tma_remote_dram", + "MetricThreshold": "tma_remote_dram > 0.1 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory. This is caused often due to non-optimal NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired.", - "MetricExpr": "(INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 + INT_VEC_RETIRED.VNNI_256) / (topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * slots)", - "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;tma_int_operations_group", - "MetricName": "tma_int_vector_256b", + "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1", + "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic Integer (Int) matrix uops fraction the CPU has retired (aggregated across all supported Int datatypes in AMX engine)", - "MetricExpr": "cpu@AMX_OPS_RETIRED.INT8\\,cmask\\=0x1@ / (topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * slots)", - "MetricGroup": "Compute;HPC;IntVector;Pipeline;Server;TopdownL4;tma_L4_group;tma_int_operations_group", - "MetricName": "tma_int_amx", - "PublicDescription": "This metric approximates arithmetic Integer (Int) matrix uops fraction the CPU has retired (aggregated across all supported Int datatypes in AMX engine). Refer to AMX_Busy and TIOPs metrics for actual AMX utilization and Int performance, resp.", + "BriefDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations", + "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL5;tma_L5_group;tma_issueSO;tma_ports_utilized_0_group", + "MetricName": "tma_serializing_operation", + "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations. Instructions like CPUID; WRMSR or LFENCE serialize the out-of-order execution which may limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metrics: tma_ms_switches", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents Shuffle (cross \"vector lane\" data transfers) uops fraction the CPU has retired.", - "MetricExpr": "INT_VEC_RETIRED.SHUFFLES / (topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * slots)", + "MetricExpr": "INT_VEC_RETIRED.SHUFFLES / (tma_retiring * tma_info_slots)", "MetricGroup": "HPC;Pipeline;TopdownL4;tma_L4_group;tma_int_operations_group", "MetricName": "tma_shuffles", + "MetricThreshold": "tma_shuffles > 0.1 & (tma_int_operations > 0.1 & tma_light_operations > 0.6)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations -- uops for memory load or store accesses.", - "MetricExpr": "max(0, topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound)) * MEM_UOP_RETIRED.ANY / (topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * slots)", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", - "MetricName": "tma_memory_operations", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions", + "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / tma_info_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_serializing_operation_group", + "MetricName": "tma_slow_pause", + "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_operation > 0.1 & (tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.PAUSE_INST", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions", - "MetricExpr": "max(0, topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound)) * INST_RETIRED.MACRO_FUSED / (topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * slots)", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", - "MetricName": "tma_fused_instructions", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions. The instruction pairs of CMP+JCC or DEC+JCC are commonly used examples.", + "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary", + "MetricExpr": "tma_info_load_miss_real_latency * LD_BLOCKS.NO_SR / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_split_loads", + "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused", - "MetricExpr": "max(0, topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound)) * (BR_INST_RETIRED.ALL_BRANCHES - INST_RETIRED.MACRO_FUSED) / (topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * slots)", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", - "MetricName": "tma_non_fused_branches", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused. Non-conditional branches like direct JMP or CALL would count here. Can be used to examine fusible conditional jumps that were not fused.", + "BriefDescription": "This metric represents rate of split store accesses", + "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group", + "MetricName": "tma_split_stores", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_4", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions", - "MetricExpr": "max(0, topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound)) * INST_RETIRED.NOP / (topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * slots)", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", - "MetricName": "tma_nop_instructions", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions. Compilers often use NOPs for certain address alignments - e.g. start address of a function or loop body.", + "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", + "MetricExpr": "(XQ.FULL_CYCLES + L1D_PEND_MISS.L2_STALLS) / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group", + "MetricName": "tma_sq_full", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_info_memory_bandwidth, tma_mem_bandwidth", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes. May undercount due to FMA double counting", - "MetricExpr": "max(0, max(0, topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound)) - (tma_fp_arith + tma_int_operations + tma_memory_operations + tma_fused_instructions + tma_non_fused_branches + tma_nop_instructions))", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", - "MetricName": "tma_other_light_ops", + "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", + "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences", - "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_L2_group;tma_retiring_group", - "MetricName": "tma_heavy_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences. This highly-correlates with the uop length of these instructions/sequences.", + "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", + "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_store_fwd_blk", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops", - "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - tma_microcode_sequencer", - "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group", - "MetricName": "tma_few_uops_instructions", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number of uops in such instructions.", + "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", + "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_clks", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group", + "MetricName": "tma_store_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", - "MetricExpr": "UOPS_RETIRED.MS / slots", - "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group", - "MetricName": "tma_microcode_sequencer", - "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations", + "MetricExpr": "(UOPS_DISPATCHED.PORT_4_9 + UOPS_DISPATCHED.PORT_7_8) / (4 * tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_store_op_utilization", + "MetricThreshold": "tma_store_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations. Sample with: UOPS_DISPATCHED.PORT_7_8", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", - "MetricExpr": "min(100 * cpu@ASSISTS.ANY\\,umask\\=0x1B@ / slots, 1)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", - "MetricName": "tma_assists", - "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases.", + "BriefDescription": "This metric roughly estimates the fraction of cycles where the TLB was missed by store accesses, hitting in the second-level TLB (STLB)", + "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", + "MetricName": "tma_store_stlb_hit", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Page Faults", - "MetricExpr": "99 * ASSISTS.PAGE_FAULT / slots", - "MetricGroup": "TopdownL5;tma_L5_group;tma_assists_group", - "MetricName": "tma_page_faults", - "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Page Faults. A Page Fault may apply on first application access to a memory page. Note operating system handling of page faults accounts for the majority of its cost.", + "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk", + "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_clks", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", + "MetricName": "tma_store_stlb_miss", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists", - "MetricExpr": "30 * ASSISTS.FP / slots", - "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", - "MetricName": "tma_fp_assists", - "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called denormals).", + "BriefDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores", + "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueSmSt;tma_store_bound_group", + "MetricName": "tma_streaming_stores", + "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should Streaming stores be a bottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma_fb_full", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the CPU retired uops as a result of handing SSE to AVX* or AVX* to SSE transition Assists. ", - "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / slots", - "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", - "MetricName": "tma_avx_assists", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group", + "MetricName": "tma_unknown_branches", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (e.g. first time the branch is fetched or hitting BTB capacity limit). Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", - "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", - "MetricName": "tma_cisc", - "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", + "BriefDescription": "This metric serves as an approximation of legacy x87 usage", + "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD", + "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", + "MetricName": "tma_x87_use", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", "ScaleUnit": "100%" }, { - "BriefDescription": "C1 residency percent per core", - "MetricExpr": "cstate_core@c1\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C1_Core_Residency", + "BriefDescription": "Percentage of cycles in aborted transactions.", + "MetricExpr": "max(cpu@cycles\\-t@ - cpu@cycles\\-ct@, 0) / cycles", + "MetricGroup": "transaction", + "MetricName": "tsx_aborted_cycles", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per core", - "MetricExpr": "cstate_core@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Core_Residency", - "ScaleUnit": "100%" + "BriefDescription": "Number of cycles within a transaction divided by the number of elisions.", + "MetricExpr": "cpu@cycles\\-t@ / cpu@el\\-start@", + "MetricGroup": "transaction", + "MetricName": "tsx_cycles_per_elision", + "ScaleUnit": "1cycles / elision" }, { - "BriefDescription": "C2 residency percent per package", - "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C2_Pkg_Residency", - "ScaleUnit": "100%" + "BriefDescription": "Number of cycles within a transaction divided by the number of transactions.", + "MetricExpr": "cpu@cycles\\-t@ / cpu@tx\\-start@", + "MetricGroup": "transaction", + "MetricName": "tsx_cycles_per_transaction", + "ScaleUnit": "1cycles / transaction" }, { - "BriefDescription": "C6 residency percent per package", - "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Pkg_Residency", + "BriefDescription": "Percentage of cycles within a transaction region.", + "MetricExpr": "cpu@cycles\\-t@ / cycles", + "MetricGroup": "transaction", + "MetricName": "tsx_transactional_cycles", "ScaleUnit": "100%" } ] diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/uncore-other.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/uncore-other.json index fd253e3276df..11c037e8291d 100644 --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/uncore-other.json +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/uncore-other.json @@ -2832,6 +2832,16 @@ "UMask": "0x80", "Unit": "IIO" }, + { + "BriefDescription": "Read request for 4 bytes made by IIO Part0-7 to Memory", + "EventCode": "0x83", + "EventName": "UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.ALL_PARTS", + "FCMask": "0x07", + "PerPkg": "1", + "PortMask": "0x00ff", + "UMask": "0x4", + "Unit": "IIO" + }, { "BriefDescription": "Read request for 4 bytes made by IIO Part0 to Memory", "EventCode": "0x83", @@ -2920,6 +2930,16 @@ "UMask": "0x4", "Unit": "IIO" }, + { + "BriefDescription": "Write request of 4 bytes made by IIO Part0-7 to Memory", + "EventCode": "0x83", + "EventName": "UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.ALL_PARTS", + "FCMask": "0x07", + "PerPkg": "1", + "PortMask": "0x00ff", + "UMask": "0x1", + "Unit": "IIO" + }, { "BriefDescription": "Write request of 4 bytes made by IIO Part0 to Memory", "EventCode": "0x83", @@ -4038,6 +4058,46 @@ "PublicDescription": "FlowQ Generated Prefetch : Count cases where FlowQ causes spawn of Prefetch to iMC/SMI3 target", "Unit": "M3UPI" }, + { + "BriefDescription": "All CAS commands issued", + "EventCode": "0x05", + "EventName": "UNC_MCHBM_CAS_COUNT.ALL", + "PerPkg": "1", + "UMask": "0xff", + "Unit": "MCHBM" + }, + { + "BriefDescription": "Read CAS commands issued (regular and underfill)", + "EventCode": "0x05", + "EventName": "UNC_MCHBM_CAS_COUNT.RD", + "PerPkg": "1", + "UMask": "0xcf", + "Unit": "MCHBM" + }, + { + "BriefDescription": "Regular read CAS commands issued (does not include underfills)", + "EventCode": "0x05", + "EventName": "UNC_MCHBM_CAS_COUNT.RD_REG", + "PerPkg": "1", + "UMask": "0xc1", + "Unit": "MCHBM" + }, + { + "BriefDescription": "Underfill read CAS commands issued", + "EventCode": "0x05", + "EventName": "UNC_MCHBM_CAS_COUNT.RD_UNDERFILL", + "PerPkg": "1", + "UMask": "0xc4", + "Unit": "MCHBM" + }, + { + "BriefDescription": "Write CAS commands issued", + "EventCode": "0x05", + "EventName": "UNC_MCHBM_CAS_COUNT.WR", + "PerPkg": "1", + "UMask": "0xf0", + "Unit": "MCHBM" + }, { "BriefDescription": "UPI Clockticks", "EventCode": "0x01", From patchwork Sun Feb 19 09:28:24 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59112 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp773291wrn; Sun, 19 Feb 2023 01:36:26 -0800 (PST) X-Google-Smtp-Source: AK7set9IIAa09WxYzUa41OruecyrxLN3q3Et5TMXCy05JFWCsNwJLZkyNSw/6cv420Zirmc8HIcK X-Received: by 2002:a17:90b:4c4c:b0:236:8e07:4c6d with SMTP id np12-20020a17090b4c4c00b002368e074c6dmr2407198pjb.7.1676799386242; Sun, 19 Feb 2023 01:36:26 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799386; cv=none; d=google.com; s=arc-20160816; b=vDgFSgTWS+hsqonPztlGQxc/Pk0CPUDF6qcAgRBnDa+eTDBQ8wWKSBrIJjWmnZEBKg wf5WbHsQMh9APpCZJSADYM/RA7t28KwuKnqiXG26obJgSvyThWX1syp136yUd+Ckc3U0 SIROtgietnlfeV4LBZymUx6qCEoR8HmNOFTfsWTeoNLXVPXkxy6LWXY5CWT+X1CMYPGd +YchZGuVN8iXO7aj4BLmAi8ZJXKBAr/mIlqgOjIS7jjEqqbXXTTPN/f8GkuRkjYOfklr lhYqGfJR4lCD0XWkgDY4xIhi8yGxLrumt482pb6vI1c2HHD5BYzVOvF1Q7LOkQsWo4yx KEuA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:from:subject :references:mime-version:message-id:in-reply-to:date:dkim-signature; bh=HHfJUUC7WFoksqzMwO/wB71MjmrmmBDk4U3wZWtbf08=; b=qTgt+sWEch/rSQ3WJBVpFmn32ZSzoYTDCic3rOTxRzOUf0t5+HKHTalTT6xRC7Ou71 bRnw7SYrriJRnoMtIGHJBeS3B7oE+TfJu/ulCECPeRxmuulcjcILLJCo3vxnj4JEPTVi BeooSdDF7YsPkhl/oiFHRbhugjFyS1YoLWNyJCmMu4AUOd9vX1iubgyDL1olOhP5Elkd WUosUL84hC9Rla/m8G/xbtjkZqaSND1SOjqUH+L1A9RYiJTNbvNQ6XvxA4iJHOlhexxw d8ZohpApHeu4NntG/fGMBMyKAarbBIOz9fK2SWw/mQ3lTROq9EtsommSmL3kVEShXAuE oJJg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=fX59j52x; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id c8-20020a17090abf0800b00234a8e1f2efsi10009042pjs.182.2023.02.19.01.36.12; Sun, 19 Feb 2023 01:36:26 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=fX59j52x; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229960AbjBSJd5 (ORCPT + 99 others); Sun, 19 Feb 2023 04:33:57 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56298 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229969AbjBSJdu (ORCPT ); Sun, 19 Feb 2023 04:33:50 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 521C21259E for ; Sun, 19 Feb 2023 01:32:45 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id f33-20020a25b0a1000000b009433a21be0dso2107037ybj.19 for ; Sun, 19 Feb 2023 01:32:45 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:from:to:cc:subject:date :message-id:reply-to; bh=HHfJUUC7WFoksqzMwO/wB71MjmrmmBDk4U3wZWtbf08=; b=fX59j52xFTzLYLu6bfL/D7xhUQY699O8G2CBhOY2b+WQeh/CNRkmVbk6izOLD09kRS ElhU4Qo79C2zelyYISHlIQ06bx7oywbUO5YVV+td+ogHhr5ySpZA39YLt/jD3Rgl4Hru NSl1oSHcdzoPDzg7iF8Lms4PQ0/pfTnWuTfK8wSI6NYne98qpyTFNmdRCtV6neg55tJh Ro7kutyBaTvkQVQM9r2EZf+9Ichm6O4JTDhrgZK0xRKCqHyknhKGMchi+O2i5EdVLhjY xtmOJEGgVHLUBujQkjjQT/JAK8oWzaiy/iTO9dpZZUZDbqFaUd9bzxnQJrq9uVzKKKb3 CXiQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=HHfJUUC7WFoksqzMwO/wB71MjmrmmBDk4U3wZWtbf08=; b=ab5hjDgrxUSS2DL/EsGHsrIK3qFVlNyw/gB47LyHu5pePIAYSTLr2VI2oHlwvACc8U 3HABmbyoLcWcp2QS9eZ0+tePlLr+9CO4+Im2rVYRgmfKQ10UcMp8aNCvmEolFxoy2NNV 9ci/xCrCLdO6dvtx0t2KUqZLefUvFZzQhYTuNTKgIFt9+wMKhhz19yuWHuXDY44NqepM Vvlf2SsaLLOTwWtvpSzBzMh5i2+vt0YV4XjrOGv83SIl8Dw8fzIwlB5EjjYCHRCUPyqn TTTYS092RC61S7jYrLCk6KIEJsQNovSEn4U0oRzb2Hu2G9m5waZobruiNov9pnzxWzHO bvqw== X-Gm-Message-State: AO0yUKUskbPSG1MHQSb2mSBlJfmeiw78ewl/AbzFzPdEqhMcQdJS5Aem Mu7l97fYD3Bq+HUJjOaK7qzx+YVGiJi/ X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a25:8312:0:b0:889:8b19:faec with SMTP id s18-20020a258312000000b008898b19faecmr515910ybk.196.1676799165017; Sun, 19 Feb 2023 01:32:45 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:24 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-28-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 27/51] perf vendor events intel: Refresh silvermont events From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758251593489404164?= X-GMAIL-MSGID: =?utf-8?q?1758251593489404164?= Update the silvermont events from 14 to 15. Generation was done using https://github.com/intel/perfmon. The most notable change is in corrections to event descriptions. Signed-off-by: Ian Rogers --- tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- tools/perf/pmu-events/arch/x86/silvermont/frontend.json | 2 +- tools/perf/pmu-events/arch/x86/silvermont/pipeline.json | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv index 1f611a7dbdda..d1d40d0f2b2c 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -23,7 +23,7 @@ GenuineIntel-6-1[AEF],v3,nehalemep,core GenuineIntel-6-2E,v3,nehalemex,core GenuineIntel-6-2A,v18,sandybridge,core GenuineIntel-6-(8F|CF),v1.11,sapphirerapids,core -GenuineIntel-6-(37|4A|4C|4D|5A),v14,silvermont,core +GenuineIntel-6-(37|4A|4C|4D|5A),v15,silvermont,core GenuineIntel-6-(4E|5E|8E|9E|A5|A6),v53,skylake,core GenuineIntel-6-55-[01234],v1.28,skylakex,core GenuineIntel-6-86,v1.20,snowridgex,core diff --git a/tools/perf/pmu-events/arch/x86/silvermont/frontend.json b/tools/perf/pmu-events/arch/x86/silvermont/frontend.json index c35da10f7133..cd6ed3f59e26 100644 --- a/tools/perf/pmu-events/arch/x86/silvermont/frontend.json +++ b/tools/perf/pmu-events/arch/x86/silvermont/frontend.json @@ -11,7 +11,7 @@ "BriefDescription": "Counts the number of JCC baclears", "EventCode": "0xE6", "EventName": "BACLEARS.COND", - "PublicDescription": "The BACLEARS event counts the number of times the front end is resteered, mainly when the Branch Prediction Unit cannot provide a correct prediction and this is corrected by the Branch Address Calculator at the front end. The BACLEARS.COND event counts the number of JCC (Jump on Condtional Code) baclears.", + "PublicDescription": "The BACLEARS event counts the number of times the front end is resteered, mainly when the Branch Prediction Unit cannot provide a correct prediction and this is corrected by the Branch Address Calculator at the front end. The BACLEARS.COND event counts the number of JCC (Jump on Conditional Code) baclears.", "SampleAfterValue": "200003", "UMask": "0x10" }, diff --git a/tools/perf/pmu-events/arch/x86/silvermont/pipeline.json b/tools/perf/pmu-events/arch/x86/silvermont/pipeline.json index 59f6116a7eae..2d4214bf9e39 100644 --- a/tools/perf/pmu-events/arch/x86/silvermont/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/silvermont/pipeline.json @@ -228,7 +228,7 @@ "BriefDescription": "Counts the number of cycles when no uops are allocated, the IQ is empty, and no other condition is blocking allocation.", "EventCode": "0xCA", "EventName": "NO_ALLOC_CYCLES.NOT_DELIVERED", - "PublicDescription": "The NO_ALLOC_CYCLES.NOT_DELIVERED event is used to measure front-end inefficiencies, i.e. when front-end of the machine is not delivering micro-ops to the back-end and the back-end is not stalled. This event can be used to identify if the machine is truly front-end bound. When this event occurs, it is an indication that the front-end of the machine is operating at less than its theoretical peak performance. Background: We can think of the processor pipeline as being divided into 2 broader parts: Front-end and Back-end. Front-end is responsible for fetching the instruction, decoding into micro-ops (uops) in machine understandable format and putting them into a micro-op queue to be consumed by back end. The back-end then takes these micro-ops, allocates the required resources. When all resources are ready, micro-ops are executed. If the back-end is not ready to accept micro-ops from the front-end, then we do not want to count these as front-end bottlenecks. However, whenever we have bottlenecks in the back-end, we will have allocation unit stalls and eventually forcing the front-end to wait until the back-end is ready to receive more UOPS. This event counts the cycles only when back-end is requesting more uops and front-end is not able to provide them. Some examples of conditions that cause front-end efficiencies are: Icache misses, ITLB misses, and decoder restrictions that limit the the front-end bandwidth.", + "PublicDescription": "The NO_ALLOC_CYCLES.NOT_DELIVERED event is used to measure front-end inefficiencies, i.e. when front-end of the machine is not delivering micro-ops to the back-end and the back-end is not stalled. This event can be used to identify if the machine is truly front-end bound. When this event occurs, it is an indication that the front-end of the machine is operating at less than its theoretical peak performance. Background: We can think of the processor pipeline as being divided into 2 broader parts: Front-end and Back-end. Front-end is responsible for fetching the instruction, decoding into micro-ops (uops) in machine understandable format and putting them into a micro-op queue to be consumed by back end. The back-end then takes these micro-ops, allocates the required resources. When all resources are ready, micro-ops are executed. If the back-end is not ready to accept micro-ops from the front-end, then we do not want to count these as front-end bottlenecks. However, whenever we have bottlenecks in the back-end, we will have allocation unit stalls and eventually forcing the front-end to wait until the back-end is ready to receive more UOPS. This event counts the cycles only when back-end is requesting more uops and front-end is not able to provide them. Some examples of conditions that cause front-end efficiencies are: Icache misses, ITLB misses, and decoder restrictions that limit the front-end bandwidth.", "SampleAfterValue": "200003", "UMask": "0x50" }, From patchwork Sun Feb 19 09:28:25 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59118 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp775421wrn; Sun, 19 Feb 2023 01:44:34 -0800 (PST) X-Google-Smtp-Source: AK7set+ioa7nGfLTU3YraAu7Y6BleNFW8fp0647i8o+3u1IOh3A5JRAMDckZmAfE5cC9J4YqHIWy X-Received: by 2002:a17:906:40cd:b0:87b:59d9:5a03 with SMTP id a13-20020a17090640cd00b0087b59d95a03mr5807572ejk.36.1676799874376; Sun, 19 Feb 2023 01:44:34 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799874; cv=none; d=google.com; s=arc-20160816; b=gG0bJrSps8K7+OI119b0TLGZ4ZEQm/2Wd4V5I1fod3vmkMkh71J8hYgTBZTLQgoOM0 mwP2rkWSH/BC0uzjjj7B/HV5TQ0uW/966CYPNEZAVYn9nRpLdUzJhFS/TiLn48bqltk6 zOKN0lxPPgYDGdVXqe6gseW/zgRBR1GsCO4ljqLfKmQeP7pGSuuOrE+KEi8c7Nbfvma2 Inm4POa68EmV/3EcakFUeViEJ9Eahq4hnLR8pzGJ3U7cy9ugXpj9nAHnFjcduk9AshYq ifnuTxMnrSEanko/H/b7Ztcd2f+8NDq/r8gm5TgbCf26g/8753TX75XlqX2yBabsqwFJ sc5w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:from:subject :references:mime-version:message-id:in-reply-to:date:dkim-signature; bh=F/0qgvRor7Fd89QQg2ZUNUJmGuxhLlMY6OwK/gh//u8=; b=wzxo/T5Iq9v6XDH1QD5ELYBZWdLH2XujtmeCL6lPH7yCxe2g6XQZRrWnkUNkFhwU1T V5HAjlJWrxGWgawKSLyyia9xEu2oY5avno87BsjOcgb83h5vLqGkWqlhuunYyX5rK/py s5ARs3fS3ofC81N6zBVmlclSzye4dTE8LGmjH0SYF/PfOiN/IkcYGQDkSi5KEbt5wouK 1wxGFeZ4gtJ2VJWqgx0su0zvXb4xIIwQucjzlvnXm8xD5bNflPrkPIhdUiaRKluFKJe6 eClV7k3FEwYkRBkIoDIskAwDrmCFEjbOUH01aPZLE5n2nUVa0btOksaPTZZCztCcLEms Tw9w== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=V13QA01x; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id up33-20020a170907cca100b008bd2a964b93si4939459ejc.300.2023.02.19.01.44.10; Sun, 19 Feb 2023 01:44:34 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=V13QA01x; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229992AbjBSJfm (ORCPT + 99 others); Sun, 19 Feb 2023 04:35:42 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58172 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230005AbjBSJfi (ORCPT ); Sun, 19 Feb 2023 04:35:38 -0500 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0297012876 for ; Sun, 19 Feb 2023 01:34:33 -0800 (PST) Received: by mail-yb1-xb4a.google.com with SMTP id a9-20020a25bac9000000b00827819f87e5so78071ybk.0 for ; Sun, 19 Feb 2023 01:34:33 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:from:to:cc:subject:date :message-id:reply-to; bh=F/0qgvRor7Fd89QQg2ZUNUJmGuxhLlMY6OwK/gh//u8=; b=V13QA01xcklzXoWzygQQCE9aPro+/LGSOTTmV4N6bqPfqLCHGBjEANiVZYZqpo+Yuu 424yMpT0bRbEDloS5nGWAnDB83UGHL1BNe+P2FXxr1ZgtUurIIRjFPQMxKz0JCqSZa4I A9POY6qUBvl2aUg6EQU5x5Gg6jKOXtqmh1Bouiwt5V70g2sLY/R6Nyu4tkPax4TyZfVf mITECx6YkWFOFpo/2Jqx6DMpaeCICHpI65h0m258aoYykVYQyRbT4Shedv5B005tkb8w bWrL/cwWAZ5LKoUmvoNoYPF9BmTcu6rVuB+ZfLnadsU1Hk4yr8TwNsM5/5yjAPUzk5K1 XXrQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=F/0qgvRor7Fd89QQg2ZUNUJmGuxhLlMY6OwK/gh//u8=; b=jVIdE6FN5vL1Rq+D6B1AxrnpkVkGUJpVB3xU3+vz2YssLJAQMIql4YTazfkPc0eA0Z 9lwEA3yWUYux2W7hFyjryKOTFqePfcGFcp1XVjEjxwp9BbsZz0k4rk7PCGylAXA4BBfw NiqnJeRelNS88lRL8ftAbPrqillb5KlqJe3B66k3a0590m9lEuFmOC8jAkS8fJ6N313f x2XW5XIAKY0VUApcqsnjNcrAPt3h7r3T/vmb036mx2ulEpORnbdFluCAFkxubAVnFEBO ro2VYvCCx7Hf6QyBBc0KeqOE6gQtJXBC8JUr3CPPbgmgHgPwEeu5BWCzt6CPBmxLc/d5 sE3g== X-Gm-Message-State: AO0yUKV1ha6lesC+nCeBLyqoMREnJfhXTlZLXchU1hNl8o8+chy93hT9 XIqpmsm7xCXmnZ3Oz4dh6Ii3v3Hl9eTE X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a81:4994:0:b0:52e:c82f:b775 with SMTP id w142-20020a814994000000b0052ec82fb775mr23274ywa.426.1676799173766; Sun, 19 Feb 2023 01:32:53 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:25 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-29-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 28/51] perf vendor events intel: Refresh skylake events From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758252104888194486?= X-GMAIL-MSGID: =?utf-8?q?1758252104888194486?= Update the skylake events from 53 to 54. Generation was done using https://github.com/intel/perfmon. Notable changes are updated events and event descriptions, TMA metrics are updated to version 4.5, TMA info metrics are renamed from their node name to be lower case and prefixed by tma_info_ and MetricThreshold expressions are added, smi_cost and transaction metric groups are added replicating existing hard coded metrics in stat-shadow. Signed-off-by: Ian Rogers --- tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- .../pmu-events/arch/x86/skylake/cache.json | 25 +- .../pmu-events/arch/x86/skylake/frontend.json | 8 +- .../pmu-events/arch/x86/skylake/other.json | 1 + .../pmu-events/arch/x86/skylake/pipeline.json | 16 + .../arch/x86/skylake/skl-metrics.json | 1877 ++++++++++------- .../arch/x86/skylake/uncore-other.json | 1 + 7 files changed, 1115 insertions(+), 815 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv index d1d40d0f2b2c..22aa63f90f89 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -24,7 +24,7 @@ GenuineIntel-6-2E,v3,nehalemex,core GenuineIntel-6-2A,v18,sandybridge,core GenuineIntel-6-(8F|CF),v1.11,sapphirerapids,core GenuineIntel-6-(37|4A|4C|4D|5A),v15,silvermont,core -GenuineIntel-6-(4E|5E|8E|9E|A5|A6),v53,skylake,core +GenuineIntel-6-(4E|5E|8E|9E|A5|A6),v54,skylake,core GenuineIntel-6-55-[01234],v1.28,skylakex,core GenuineIntel-6-86,v1.20,snowridgex,core GenuineIntel-6-8[CD],v1.08,tigerlake,core diff --git a/tools/perf/pmu-events/arch/x86/skylake/cache.json b/tools/perf/pmu-events/arch/x86/skylake/cache.json index 1538ddb5752f..0080ac27b899 100644 --- a/tools/perf/pmu-events/arch/x86/skylake/cache.json +++ b/tools/perf/pmu-events/arch/x86/skylake/cache.json @@ -72,6 +72,7 @@ }, { "BriefDescription": "This event is deprecated. Refer to new event L2_LINES_OUT.USELESS_HWPF", + "Deprecated": "1", "EventCode": "0xF2", "EventName": "L2_LINES_OUT.USELESS_PREF", "SampleAfterValue": "200003", @@ -232,20 +233,22 @@ "UMask": "0x4f" }, { - "BriefDescription": "All retired load instructions.", + "BriefDescription": "Retired load instructions.", "Data_LA": "1", "EventCode": "0xD0", "EventName": "MEM_INST_RETIRED.ALL_LOADS", "PEBS": "1", + "PublicDescription": "Counts all retired load instructions. This event accounts for SW prefetch instructions of PREFETCHNTA or PREFETCHT0/1/2 or PREFETCHW.", "SampleAfterValue": "2000003", "UMask": "0x81" }, { - "BriefDescription": "All retired store instructions.", + "BriefDescription": "Retired store instructions.", "Data_LA": "1", "EventCode": "0xD0", "EventName": "MEM_INST_RETIRED.ALL_STORES", "PEBS": "1", + "PublicDescription": "Counts all retired store instructions.", "SampleAfterValue": "2000003", "UMask": "0x82" }, @@ -443,7 +446,7 @@ "UMask": "0x80" }, { - "BriefDescription": "Cacheable and noncachaeble code read requests", + "BriefDescription": "Cacheable and non-cacheable code read requests", "EventCode": "0xB0", "EventName": "OFFCORE_REQUESTS.DEMAND_CODE_RD", "PublicDescription": "Counts both cacheable and non-cacheable code read requests.", @@ -551,15 +554,7 @@ "UMask": "0x4" }, { - "BriefDescription": "Offcore response can be programmed only with a specific pair of event select and counter MSR, and with specific event codes and predefine mask bit value in a dedicated MSR to specify attributes of the offcore transaction", - "EventCode": "0xB7, 0xBB", - "EventName": "OFFCORE_RESPONSE", - "PublicDescription": "Offcore response can be programmed only with a specific pair of event select and counter MSR, and with specific event codes and predefine mask bit value in a dedicated MSR to specify attributes of the offcore transaction.", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all demand code readshave any response type.", + "BriefDescription": "Counts all demand code reads have any response type.", "EventCode": "0xB7, 0xBB", "EventName": "OFFCORE_RESPONSE.DEMAND_CODE_RD.ANY_RESPONSE", "MSRIndex": "0x1a6,0x1a7", @@ -946,7 +941,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts demand data readshave any response type.", + "BriefDescription": "Counts demand data reads have any response type.", "EventCode": "0xB7, 0xBB", "EventName": "OFFCORE_RESPONSE.DEMAND_DATA_RD.ANY_RESPONSE", "MSRIndex": "0x1a6,0x1a7", @@ -1333,7 +1328,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts all demand data writes (RFOs)have any response type.", + "BriefDescription": "Counts all demand data writes (RFOs) have any response type.", "EventCode": "0xB7, 0xBB", "EventName": "OFFCORE_RESPONSE.DEMAND_RFO.ANY_RESPONSE", "MSRIndex": "0x1a6,0x1a7", @@ -1720,7 +1715,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts any other requestshave any response type.", + "BriefDescription": "Counts any other requests have any response type.", "EventCode": "0xB7, 0xBB", "EventName": "OFFCORE_RESPONSE.OTHER.ANY_RESPONSE", "MSRIndex": "0x1a6,0x1a7", diff --git a/tools/perf/pmu-events/arch/x86/skylake/frontend.json b/tools/perf/pmu-events/arch/x86/skylake/frontend.json index 13ccf50db43d..04f08e4d2402 100644 --- a/tools/perf/pmu-events/arch/x86/skylake/frontend.json +++ b/tools/perf/pmu-events/arch/x86/skylake/frontend.json @@ -322,7 +322,7 @@ "UMask": "0x4" }, { - "BriefDescription": "Cycles when uops are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequenser (MS) is busy", + "BriefDescription": "Cycles when uops are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy", "CounterMask": "1", "EventCode": "0x79", "EventName": "IDQ.MS_CYCLES", @@ -331,7 +331,7 @@ "UMask": "0x30" }, { - "BriefDescription": "Cycles when uops initiated by Decode Stream Buffer (DSB) are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequenser (MS) is busy", + "BriefDescription": "Cycles when uops initiated by Decode Stream Buffer (DSB) are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy", "CounterMask": "1", "EventCode": "0x79", "EventName": "IDQ.MS_DSB_CYCLES", @@ -340,7 +340,7 @@ "UMask": "0x10" }, { - "BriefDescription": "Uops initiated by MITE and delivered to Instruction Decode Queue (IDQ) while Microcode Sequenser (MS) is busy", + "BriefDescription": "Uops initiated by MITE and delivered to Instruction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy", "EventCode": "0x79", "EventName": "IDQ.MS_MITE_UOPS", "PublicDescription": "Counts the number of uops initiated by MITE and delivered to Instruction Decode Queue (IDQ) while the Microcode Sequencer (MS) is busy. Counting includes uops that may 'bypass' the IDQ.", @@ -358,7 +358,7 @@ "UMask": "0x30" }, { - "BriefDescription": "Uops delivered to Instruction Decode Queue (IDQ) while Microcode Sequenser (MS) is busy", + "BriefDescription": "Uops delivered to Instruction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy", "EventCode": "0x79", "EventName": "IDQ.MS_UOPS", "PublicDescription": "Counts the total number of uops delivered by the Microcode Sequencer (MS). Any instruction over 4 uops will be delivered by the MS. Some instructions such as transcendentals may additionally generate uops from the MS.", diff --git a/tools/perf/pmu-events/arch/x86/skylake/other.json b/tools/perf/pmu-events/arch/x86/skylake/other.json index 9f3a9dffb807..d75d53279b4e 100644 --- a/tools/perf/pmu-events/arch/x86/skylake/other.json +++ b/tools/perf/pmu-events/arch/x86/skylake/other.json @@ -8,6 +8,7 @@ "UMask": "0x1" }, { + "BriefDescription": "MEMORY_DISAMBIGUATION.HISTORY_RESET", "EventCode": "0x09", "EventName": "MEMORY_DISAMBIGUATION.HISTORY_RESET", "SampleAfterValue": "2000003", diff --git a/tools/perf/pmu-events/arch/x86/skylake/pipeline.json b/tools/perf/pmu-events/arch/x86/skylake/pipeline.json index cf35a535c2f6..2c827d806554 100644 --- a/tools/perf/pmu-events/arch/x86/skylake/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/skylake/pipeline.json @@ -93,6 +93,22 @@ "SampleAfterValue": "400009", "UMask": "0x10" }, + { + "BriefDescription": "Speculative and retired mispredicted macro conditional branches", + "EventCode": "0x89", + "EventName": "BR_MISP_EXEC.ALL_BRANCHES", + "PublicDescription": "This event counts both taken and not taken speculative and retired mispredicted branch instructions.", + "SampleAfterValue": "200003", + "UMask": "0xff" + }, + { + "BriefDescription": "Speculative mispredicted indirect branches", + "EventCode": "0x89", + "EventName": "BR_MISP_EXEC.INDIRECT", + "PublicDescription": "Counts speculatively miss-predicted indirect branches at execution time. Counts for indirect near CALL or JMP instructions (RET excluded).", + "SampleAfterValue": "200003", + "UMask": "0xe4" + }, { "BriefDescription": "All mispredicted macro branch instructions retired.", "EventCode": "0xC5", diff --git a/tools/perf/pmu-events/arch/x86/skylake/skl-metrics.json b/tools/perf/pmu-events/arch/x86/skylake/skl-metrics.json index 972d3744c2c8..a6d212b349f5 100644 --- a/tools/perf/pmu-events/arch/x86/skylake/skl-metrics.json +++ b/tools/perf/pmu-events/arch/x86/skylake/skl-metrics.json @@ -1,1197 +1,1484 @@ [ { - "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", - "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / SLOTS", - "MetricGroup": "PGO;TopdownL1;tma_L1_group", - "MetricName": "tma_frontend_bound", - "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Machine_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", - "MetricExpr": "4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / SLOTS", - "MetricGroup": "Frontend;TopdownL2;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_latency", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", + "BriefDescription": "C2 residency percent per package", + "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C2_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses", - "MetricExpr": "(ICACHE_16B.IFDATA_STALL + 2 * cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=1\\,edge@) / CLKS", - "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_icache_misses", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", + "BriefDescription": "C3 residency percent per core", + "MetricExpr": "cstate_core@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses", - "MetricExpr": "ICACHE_64B.IFTAG_STALL / CLKS", - "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_itlb_misses", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", + "BriefDescription": "C3 residency percent per package", + "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", - "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / CLKS + tma_unknown_branches", - "MetricGroup": "FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_branch_resteers", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "BriefDescription": "C6 residency percent per core", + "MetricExpr": "cstate_core@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage", - "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / CLKS", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL4;tma_branch_resteers_group", - "MetricName": "tma_mispredicts_resteers", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES", + "BriefDescription": "C6 residency percent per package", + "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears", - "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / CLKS", - "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_branch_resteers_group", - "MetricName": "tma_clears_resteers", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES", + "BriefDescription": "C7 residency percent per core", + "MetricExpr": "cstate_core@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears", - "MetricExpr": "9 * BACLEARS.ANY / CLKS", - "MetricGroup": "BigFoot;FetchLat;TopdownL4;tma_branch_resteers_group", - "MetricName": "tma_unknown_branches", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (First fetch or hitting BPU capacity limit). Sample with: BACLEARS.ANY", + "BriefDescription": "C7 residency percent per package", + "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / CLKS", - "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_dsb_switches", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS", - "ScaleUnit": "100%" + "BriefDescription": "Uncore frequency per die [GHZ]", + "MetricExpr": "tma_info_socket_clks / #num_dies / duration_time / 1e9", + "MetricGroup": "SoC", + "MetricName": "UNCORE_FREQ" }, { - "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", - "MetricExpr": "ILD_STALL.LCP / CLKS", - "MetricGroup": "FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_lcp", - "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs.", + "BriefDescription": "Percentage of cycles spent in System Management Interrupts.", + "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)", + "MetricGroup": "smi", + "MetricName": "smi_cycles", + "MetricThreshold": "smi_cycles > 0.1", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", - "MetricExpr": "2 * IDQ.MS_SWITCHES / CLKS", - "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_ms_switches", - "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES", - "ScaleUnit": "100%" + "BriefDescription": "Number of SMI interrupts.", + "MetricExpr": "msr@smi@", + "MetricGroup": "smi", + "MetricName": "smi_num", + "ScaleUnit": "1SMI#" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", - "MetricExpr": "tma_frontend_bound - tma_fetch_latency", - "MetricGroup": "FetchBW;Frontend;TopdownL2;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_bandwidth", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS", + "BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset", + "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_4k_aliasing", + "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", - "MetricExpr": "(IDQ.ALL_MITE_CYCLES_ANY_UOPS - IDQ.ALL_MITE_CYCLES_4_UOPS) / CORE_CLKS / 2", - "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_fetch_bandwidth_group", - "MetricName": "tma_mite", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sample with: FRONTEND_RETIRED.ANY_DSB_MISS", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", + "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_info_slots", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_alu_op_utilization", + "MetricThreshold": "tma_alu_op_utilization > 0.6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=2@) / CORE_CLKS", - "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_mite_group", - "MetricName": "tma_decoder0_alone", + "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", + "MetricExpr": "100 * (FP_ASSIST.ANY + OTHER_ASSISTS.ANY) / tma_info_slots", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_assists", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.ANY", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", - "MetricExpr": "(IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4_UOPS) / CORE_CLKS / 2", - "MetricGroup": "DSB;FetchBW;TopdownL3;tma_fetch_bandwidth_group", - "MetricName": "tma_dsb", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", + "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", + "MetricExpr": "1 - tma_frontend_bound - (UOPS_ISSUED.ANY + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_backend_bound", + "MetricThreshold": "tma_backend_bound > 0.2", + "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound.", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", - "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group", + "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_bad_speculation", + "MetricThreshold": "tma_bad_speculation > 0.15", "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", + "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL2;tma_L2_group;tma_bad_speculation_group", + "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM", "MetricName": "tma_branch_mispredicts", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_info_branch_misprediction_cost, tma_info_mispredictions, tma_mispredicts_resteers", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", - "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts", - "MetricGroup": "BadSpec;MachineClears;TopdownL2;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_machine_clears", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", + "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_clks + tma_unknown_branches", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_branch_resteers", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", "ScaleUnit": "100%" }, { - "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", - "MetricExpr": "1 - tma_frontend_bound - (UOPS_ISSUED.ANY + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group", - "MetricName": "tma_backend_bound", - "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound.", + "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", + "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_cisc", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", - "MetricExpr": "(CYCLE_ACTIVITY.STALLS_MEM_ANY + EXE_ACTIVITY.BOUND_ON_STORES) / (CYCLE_ACTIVITY.STALLS_TOTAL + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL) + EXE_ACTIVITY.BOUND_ON_STORES) * tma_backend_bound", - "MetricGroup": "Backend;TopdownL2;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_memory_bound", - "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears", + "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_clks", + "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC", + "MetricName": "tma_clears_resteers", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", - "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_MISS) / CLKS, 0)", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l1_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB_HIT_PS", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(18.5 * tma_info_average_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM + 16.5 * tma_info_average_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_contested_accesses", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "min(9 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / CLKS", - "MetricGroup": "MemoryTLB;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_dtlb_load", - "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS", + "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_backend_bound - tma_memory_bound", + "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) DTLB was missed by load accesses, that later on hit in second-level TLB (STLB)", - "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", - "MetricGroup": "MemoryTLB;TopdownL5;tma_dtlb_load_group", - "MetricName": "tma_load_stlb_hit", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "16.5 * tma_info_average_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_data_sharing", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by load accesses, performing a hardware page walk", - "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / CLKS", - "MetricGroup": "MemoryTLB;TopdownL5;tma_dtlb_load_group", - "MetricName": "tma_load_stlb_miss", + "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=2@) / tma_info_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0;tma_mite_group", + "MetricName": "tma_decoder0_alone", + "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35))", + "PublicDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder. Related metrics: tma_few_uops_instructions", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", - "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_store_fwd_blk", - "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", + "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", + "MetricExpr": "ARITH.DIVIDER_ACTIVE / tma_info_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_divider", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", - "MetricExpr": "(12 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS.ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (9 * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO))) / CLKS", - "MetricGroup": "Offcore;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_lock_latency", - "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOADS_PS", + "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_clks + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_clks - tma_l2_bound", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary", - "MetricExpr": "Load_Miss_Real_Latency * LD_BLOCKS.NO_SR / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_split_loads", - "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", + "MetricExpr": "(IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4_UOPS) / tma_info_core_clks / 2", + "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_dsb", + "MetricThreshold": "tma_dsb > 0.15 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset", - "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_4k_aliasing", - "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_clks", + "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_dsb_switches", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_dsb_misses, tma_info_iptb, tma_lcp", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", - "MetricExpr": "Load_Miss_Real_Latency * cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=1@ / CLKS", - "MetricGroup": "MemoryBW;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_fb_full", - "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory).", + "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "min(9 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group", + "MetricName": "tma_dtlb_load", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store, tma_info_memory_data_tlbs", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads", - "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=1@) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / CLKS)", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l2_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT_PS", + "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", + "MetricExpr": "(9 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group", + "MetricName": "tma_dtlb_store", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, tma_info_memory_data_tlbs", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", - "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STALLS_L3_MISS) / CLKS", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l3_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "22 * tma_info_average_frequency * OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group", + "MetricName": "tma_false_sharing", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", - "MetricExpr": "(18.5 * Average_Frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM + 16.5 * Average_Frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CLKS", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_contested_accesses", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS", + "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "tma_info_load_miss_real_latency * cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=1@ / tma_info_clks", + "MetricGroup": "MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricName": "tma_fb_full", + "MetricThreshold": "tma_fb_full > 0.3", + "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_dram_bw_use, tma_info_memory_bandwidth, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", - "MetricExpr": "16.5 * Average_Frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CLKS", - "MetricGroup": "Offcore;Snoop;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_data_sharing", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT_PS", + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", + "MetricExpr": "tma_frontend_bound - tma_fetch_latency", + "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_dsb_coverage, tma_info_dsb_misses, tma_info_iptb, tma_lcp", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", - "MetricExpr": "6.5 * Average_Frequency * MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CLKS", - "MetricGroup": "MemoryLat;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_l3_hit_latency", - "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", + "MetricExpr": "4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / tma_info_slots", + "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", - "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on else OFFCORE_REQUESTS_BUFFER.SQ_FULL) / CORE_CLKS", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_sq_full", - "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). The Super Queue is used for requests to access the L2 cache or to go out to the Uncore.", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops", + "MetricExpr": "tma_heavy_operations - tma_microcode_sequencer", + "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueD0", + "MetricName": "tma_few_uops_instructions", + "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", - "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / CLKS + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / CLKS - tma_l2_bound", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_dram_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS_PS", + "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", + "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_fp_arith", + "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=4@) / CLKS", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_dram_bound_group", - "MetricName": "tma_mem_bandwidth", - "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that).", + "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired", + "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", + "MetricName": "tma_fp_scalar", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / CLKS - tma_mem_bandwidth", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_dram_bound_group", - "MetricName": "tma_mem_latency", - "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that).", + "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0x3c@ / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", + "MetricName": "tma_fp_vector", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", - "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / CLKS", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_store_bound", - "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES_PS", + "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", + "MetricName": "tma_fp_vector_128b", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", - "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / CLKS", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_store_bound_group", - "MetricName": "tma_store_latency", - "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full)", + "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", + "MetricName": "tma_fp_vector_256b", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing", - "MetricExpr": "22 * Average_Frequency * OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM / CLKS", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_store_bound_group", - "MetricName": "tma_false_sharing", - "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM", + "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", + "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / tma_info_slots", + "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents rate of split store accesses", - "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / CORE_CLKS", - "MetricGroup": "TopdownL4;tma_store_bound_group", - "MetricName": "tma_split_stores", - "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions", + "MetricExpr": "tma_light_operations * UOPS_RETIRED.MACRO_FUSED / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_fused_instructions", + "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions. The instruction pairs of CMP+JCC or DEC+JCC are commonly used examples.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", - "MetricExpr": "(9 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / CORE_CLKS", - "MetricGroup": "MemoryTLB;TopdownL4;tma_store_bound_group", - "MetricName": "tma_dtlb_store", - "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences", + "MetricExpr": "(UOPS_RETIRED.RETIRE_SLOTS + UOPS_RETIRED.MACRO_FUSED - INST_RETIRED.ANY) / tma_info_slots", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_heavy_operations", + "MetricThreshold": "tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the TLB was missed by store accesses, hitting in the second-level TLB (STLB)", - "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", - "MetricGroup": "MemoryTLB;TopdownL5;tma_dtlb_store_group", - "MetricName": "tma_store_stlb_hit", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses", + "MetricExpr": "(ICACHE_16B.IFDATA_STALL + 2 * cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=1\\,edge@) / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_icache_misses", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk", - "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / CORE_CLKS", - "MetricGroup": "MemoryTLB;TopdownL5;tma_dtlb_store_group", - "MetricName": "tma_store_stlb_miss", - "ScaleUnit": "100%" + "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", + "MetricExpr": "tma_info_turbo_utilization * TSC / 1e9 / duration_time", + "MetricGroup": "Power;Summary", + "MetricName": "tma_info_average_frequency" }, { - "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", - "MetricExpr": "tma_backend_bound - tma_memory_bound", - "MetricGroup": "Backend;Compute;TopdownL2;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_core_bound", - "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", + "MetricGroup": "BigFoot;Fed;Frontend;IcMiss;MemoryTLB;tma_issueBC", + "MetricName": "tma_info_big_code", + "MetricThreshold": "tma_info_big_code > 20", + "PublicDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses). Related metrics: tma_info_branching_overhead" }, { - "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", - "MetricExpr": "ARITH.DIVIDER_ACTIVE / CLKS", - "MetricGroup": "TopdownL3;tma_core_bound_group", - "MetricName": "tma_divider", - "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_ACTIVE", - "ScaleUnit": "100%" + "BriefDescription": "Branch instructions per taken branch.", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_bptkbranch" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", - "MetricExpr": "((EXE_ACTIVITY.EXE_BOUND_0_PORTS + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / CLKS if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL) / CLKS)", - "MetricGroup": "PortsUtil;TopdownL3;tma_core_bound_group", - "MetricName": "tma_ports_utilization", - "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", - "ScaleUnit": "100%" + "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)", + "MetricExpr": "(tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) * tma_info_slots / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_branch_misprediction_cost", + "PublicDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_branch_mispredicts, tma_info_mispredictions, tma_mispredicts_resteers" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "(UOPS_EXECUTED.CORE_CYCLES_NONE / 2 if #SMT_on else CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY) / CORE_CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_0", - "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of branch related instructions (used for program control-flow including function calls)", + "MetricExpr": "100 * ((BR_INST_RETIRED.CONDITIONAL + 3 * BR_INST_RETIRED.NEAR_CALL + (BR_INST_RETIRED.NEAR_TAKEN - (BR_INST_RETIRED.CONDITIONAL - BR_INST_RETIRED.NOT_TAKEN) - 2 * BR_INST_RETIRED.NEAR_CALL)) / tma_info_slots)", + "MetricGroup": "Ret;tma_issueBC", + "MetricName": "tma_info_branching_overhead", + "MetricThreshold": "tma_info_branching_overhead > 10", + "PublicDescription": "Total pipeline cost of branch related instructions (used for program control-flow including function calls). Related metrics: tma_info_big_code" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations", - "MetricExpr": "PARTIAL_RAT_STALLS.SCOREBOARD / CLKS", - "MetricGroup": "TopdownL5;tma_ports_utilized_0_group", - "MetricName": "tma_serializing_operation", - "PublicDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations. Instructions like CPUID; WRMSR or LFENCE serialize the out-of-order execution which may limit performance. Sample with: PARTIAL_RAT_STALLS.SCOREBOARD", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of branches that are CALL or RET", + "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_RETURN) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_callret" }, { - "BriefDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued", - "MetricExpr": "CLKS * UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY", - "MetricGroup": "TopdownL5;tma_ports_utilized_0_group", - "MetricName": "tma_mixing_vectors", - "PublicDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued. Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic.", - "ScaleUnit": "100%" + "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "Pipeline", + "MetricName": "tma_info_clks" }, { - "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_1 - UOPS_EXECUTED.CORE_CYCLES_GE_2) / 2 if #SMT_on else EXE_ACTIVITY.1_PORTS_UTIL) / CORE_CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_1", - "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful.", - "ScaleUnit": "100%" + "BriefDescription": "STLB (2nd level TLB) code speculative misses per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", + "MetricGroup": "Fed;MemoryTLB", + "MetricName": "tma_info_code_stlb_mpki" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_2 - UOPS_EXECUTED.CORE_CYCLES_GE_3) / 2 if #SMT_on else EXE_ACTIVITY.2_PORTS_UTIL) / CORE_CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_2", - "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop.", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of branches that are non-taken conditionals", + "MetricExpr": "BR_INST_RETIRED.NOT_TAKEN / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_cond_nt" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise).", - "MetricExpr": "(UOPS_EXECUTED.CORE_CYCLES_GE_3 / 2 if #SMT_on else UOPS_EXECUTED.CORE_CYCLES_GE_3) / CORE_CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_3m", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of branches that are taken conditionals", + "MetricExpr": "(BR_INST_RETIRED.CONDITIONAL - BR_INST_RETIRED.NOT_TAKEN) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_cond_tk" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", - "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / SLOTS", - "MetricGroup": "TopdownL5;tma_ports_utilized_3m_group", - "MetricName": "tma_alu_op_utilization", - "ScaleUnit": "100%" + "BriefDescription": "Probability of Core Bound bottleneck hidden by SMT-profiling artifacts", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(100 * (1 - tma_core_bound / tma_ports_utilization if tma_core_bound < tma_ports_utilization else 1) if tma_info_smt_2t_utilization > 0.5 else 0)", + "MetricGroup": "Cor;SMT", + "MetricName": "tma_info_core_bound_likely", + "MetricThreshold": "tma_info_core_bound_likely > 0.5" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch) Sample with: UOPS_DISPATCHED_PORT.PORT_0", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_0 / CORE_CLKS", - "MetricGroup": "Compute;TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_0", - "ScaleUnit": "100%" + "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_clks))", + "MetricGroup": "SMT", + "MetricName": "tma_info_core_clks" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU) Sample with: UOPS_DISPATCHED_PORT.PORT_1", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_1 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_1", - "ScaleUnit": "100%" + "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_core_clks", + "MetricGroup": "Ret;SMT;TmaL1;tma_L1_group", + "MetricName": "tma_info_coreipc" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU) Sample with: UOPS_DISPATCHED.PORT_5", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_5 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_5", - "ScaleUnit": "100%" + "BriefDescription": "Cycles Per Instruction (per Logical Processor)", + "MetricExpr": "1 / tma_info_ipc", + "MetricGroup": "Mem;Pipeline", + "MetricName": "tma_info_cpi" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU) Sample with: UOPS_DISPATCHED_PORT.PORT_6", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_6 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_6", - "ScaleUnit": "100%" + "BriefDescription": "Average CPU Utilization", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": "HPC;Summary", + "MetricName": "tma_info_cpu_utilization" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations Sample with: UOPS_DISPATCHED.PORT_2_3", - "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_2 + UOPS_DISPATCHED_PORT.PORT_3 + UOPS_DISPATCHED_PORT.PORT_7 - UOPS_DISPATCHED_PORT.PORT_4) / (2 * CORE_CLKS)", - "MetricGroup": "TopdownL5;tma_ports_utilized_3m_group", - "MetricName": "tma_load_op_utilization", - "ScaleUnit": "100%" + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD / OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_data_l2_mlp" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [ICL+] Loads) Sample with: UOPS_DISPATCHED_PORT.PORT_2", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_2 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_load_op_utilization_group", - "MetricName": "tma_port_2", - "ScaleUnit": "100%" + "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", + "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_REQUESTS.ALL) / 1e6 / duration_time / 1e3", + "MetricGroup": "HPC;Mem;MemoryBW;SoC;tma_issueBW", + "MetricName": "tma_info_dram_bw_use", + "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_memory_bandwidth, tma_mem_bandwidth, tma_sq_full" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [ICL+] Loads) Sample with: UOPS_DISPATCHED_PORT.PORT_3", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_3 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_load_op_utilization_group", - "MetricName": "tma_port_3", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_dsb_coverage", + "MetricThreshold": "tma_info_dsb_coverage < 0.7 & tma_info_ipc / 4 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_misses, tma_info_iptb, tma_lcp" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations", - "MetricExpr": "tma_port_4", - "MetricGroup": "TopdownL5;tma_ports_utilized_3m_group", - "MetricName": "tma_store_op_utilization", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tma_mite))", + "MetricGroup": "DSBmiss;Fed;tma_issueFB", + "MetricName": "tma_info_dsb_misses", + "MetricThreshold": "tma_info_dsb_misses > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_iptb, tma_lcp" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data) Sample with: UOPS_DISPATCHED_PORT.PORT_4", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_4 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_store_op_utilization_group", - "MetricName": "tma_port_4", - "ScaleUnit": "100%" + "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details.", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / DSB2MITE_SWITCHES.COUNT", + "MetricGroup": "DSBmiss", + "MetricName": "tma_info_dsb_switch_cost" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 7 ([HSW+]simple Store-address) Sample with: UOPS_DISPATCHED_PORT.PORT_7", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_7 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_store_op_utilization_group", - "MetricName": "tma_port_7", - "ScaleUnit": "100%" + "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-thread", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@", + "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", + "MetricName": "tma_info_execute" }, { - "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group", - "MetricName": "tma_retiring", - "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.RETIRE_SLOTS", - "ScaleUnit": "100%" + "BriefDescription": "The ratio of Executed- by Issued-Uops", + "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", + "MetricGroup": "Cor;Pipeline", + "MetricName": "tma_info_execute_per_issue", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage." }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", - "MetricExpr": "tma_retiring - tma_heavy_operations", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_retiring_group", - "MetricName": "tma_light_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", - "ScaleUnit": "100%" + "BriefDescription": "Fill Buffer (FB) hits per kilo instructions for retired demand loads (L1D misses that merge into ongoing miss-handling entries)", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_fb_hpki" }, { - "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", - "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", - "MetricGroup": "HPC;TopdownL3;tma_light_operations_group", - "MetricName": "tma_fp_arith", - "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", - "ScaleUnit": "100%" + "BriefDescription": "Average number of Uops issued by front-end when it issued something", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=1@", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_fetch_upc" }, { - "BriefDescription": "This metric serves as an approximation of legacy x87 usage", - "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD", - "MetricGroup": "Compute;TopdownL4;tma_fp_arith_group", - "MetricName": "tma_x87_use", - "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", - "ScaleUnit": "100%" + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / tma_info_core_clks", + "MetricGroup": "Flops;Ret", + "MetricName": "tma_info_flopc" }, { - "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE) / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Compute;Flops;TopdownL4;tma_fp_arith_group", - "MetricName": "tma_fp_scalar", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting.", - "ScaleUnit": "100%" + "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0x3c@) / (2 * tma_info_core_clks)", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_fp_arith_utilization", + "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)." }, { - "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths", - "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Compute;Flops;TopdownL4;tma_fp_arith_group", - "MetricName": "tma_fp_vector", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting.", - "ScaleUnit": "100%" + "BriefDescription": "Giga Floating Point Operations Per Second", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / duration_time", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_gflops", + "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine." }, { - "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors", - "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Compute;Flops;TopdownL5;tma_fp_vector_group", - "MetricName": "tma_fp_vector_128b", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting.", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", + "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", + "MetricName": "tma_info_ic_misses", + "MetricThreshold": "tma_info_ic_misses > 5", + "PublicDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck. Related metrics: " }, { - "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors", - "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Compute;Flops;TopdownL5;tma_fp_vector_group", - "MetricName": "tma_fp_vector_256b", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting.", - "ScaleUnit": "100%" + "BriefDescription": "Average Latency for L1 instruction cache misses", + "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=1\\,edge@ + 2", + "MetricGroup": "Fed;FetchLat;IcMiss", + "MetricName": "tma_info_icache_miss_latency" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations -- uops for memory load or store accesses.", - "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_RETIRED.ANY", - "MetricGroup": "Pipeline;TopdownL3;tma_light_operations_group", - "MetricName": "tma_memory_operations", - "ScaleUnit": "100%" + "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else UOPS_EXECUTED.CORE_CYCLES_GE_1)", + "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", + "MetricName": "tma_info_ilp" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions", - "MetricExpr": "tma_light_operations * UOPS_RETIRED.MACRO_FUSED / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Pipeline;TopdownL3;tma_light_operations_group", - "MetricName": "tma_fused_instructions", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions. The instruction pairs of CMP+JCC or DEC+JCC are commonly used examples.", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * (tma_frontend_bound - tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) - tma_info_big_code", + "MetricGroup": "Fed;FetchBW;Frontend", + "MetricName": "tma_info_instruction_fetch_bw", + "MetricThreshold": "tma_info_instruction_fetch_bw > 20" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused", - "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHES - UOPS_RETIRED.MACRO_FUSED) / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Pipeline;TopdownL3;tma_light_operations_group", - "MetricName": "tma_non_fused_branches", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused. Non-conditional branches like direct JMP or CALL would count here. Can be used to examine fusible conditional jumps that were not fused.", - "ScaleUnit": "100%" + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Summary;TmaL1;tma_L1_group", + "MetricName": "tma_info_instructions", + "PublicDescription": "Total number of retired Instructions. Sample with: INST_RETIRED.PREC_DIST" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions", - "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Pipeline;TopdownL3;tma_light_operations_group", - "MetricName": "tma_nop_instructions", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions. Compilers often use NOPs for certain address alignments - e.g. start address of a function or loop body. Sample with: INST_RETIRED.NOP", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "INST_RETIRED.ANY / (cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0x3c@)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_iparith", + "MetricThreshold": "tma_info_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). May undercount due to FMA double counting. Approximated prior to BDW." }, { - "BriefDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes. May undercount due to FMA double counting", - "MetricExpr": "max(0, tma_light_operations - (tma_fp_arith + tma_memory_operations + tma_fused_instructions + tma_non_fused_branches + tma_nop_instructions))", - "MetricGroup": "Pipeline;TopdownL3;tma_light_operations_group", - "MetricName": "tma_other_light_ops", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx128", + "MetricThreshold": "tma_info_iparith_avx128 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences", - "MetricExpr": "(UOPS_RETIRED.RETIRE_SLOTS + UOPS_RETIRED.MACRO_FUSED - INST_RETIRED.ANY) / SLOTS", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_retiring_group", - "MetricName": "tma_heavy_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences. This highly-correlates with the uop length of these instructions/sequences.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx256", + "MetricThreshold": "tma_info_iparith_avx256 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops", - "MetricExpr": "tma_heavy_operations - tma_microcode_sequencer", - "MetricGroup": "TopdownL3;tma_heavy_operations_group", - "MetricName": "tma_few_uops_instructions", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number of uops in such instructions.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_iparith_scalar_dp", + "MetricThreshold": "tma_info_iparith_scalar_dp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / SLOTS", - "MetricGroup": "MicroSeq;TopdownL3;tma_heavy_operations_group", - "MetricName": "tma_microcode_sequencer", - "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_iparith_scalar_sp", + "MetricThreshold": "tma_info_iparith_scalar_sp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", - "MetricExpr": "100 * (FP_ASSIST.ANY + OTHER_ASSISTS.ANY) / SLOTS", - "MetricGroup": "TopdownL4;tma_microcode_sequencer_group", - "MetricName": "tma_assists", - "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.ANY", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Branches;Fed;InsType", + "MetricName": "tma_info_ipbranch", + "MetricThreshold": "tma_info_ipbranch < 8" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", - "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", - "MetricGroup": "TopdownL4;tma_microcode_sequencer_group", - "MetricName": "tma_cisc", - "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions Per Cycle (per Logical Processor)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_clks", + "MetricGroup": "Ret;Summary", + "MetricName": "tma_info_ipc" }, { - "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks", - "MetricExpr": "100 * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", - "MetricGroup": "Bad;BadSpec;BrMispredicts", - "MetricName": "Mispredictions" + "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_ipcall", + "MetricThreshold": "tma_info_ipcall < 200" }, { - "BriefDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks", - "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full))) + tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk))", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "Memory_Bandwidth" + "BriefDescription": "Instructions per non-speculative DSB miss (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", + "MetricGroup": "DSBmiss;Fed", + "MetricName": "tma_info_ipdsb_miss_ret", + "MetricThreshold": "tma_info_ipdsb_miss_ret < 50" }, { - "BriefDescription": "Total pipeline cost of Memory Latency related bottlenecks (external memory and off-core caches)", - "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound))", - "MetricGroup": "Mem;MemoryLat;Offcore", - "MetricName": "Memory_Latency" + "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", + "MetricGroup": "Branches;OS", + "MetricName": "tma_info_ipfarbranch", + "MetricThreshold": "tma_info_ipfarbranch < 1e6" }, { - "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)", - "MetricExpr": "100 * tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency)))", - "MetricGroup": "Mem;MemoryTLB;Offcore", - "MetricName": "Memory_Data_TLBs" + "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_ipflop", + "MetricThreshold": "tma_info_ipflop < 10" }, { - "BriefDescription": "Total pipeline cost of branch related instructions (used for program control-flow including function calls)", - "MetricExpr": "100 * ((BR_INST_RETIRED.CONDITIONAL + 3 * BR_INST_RETIRED.NEAR_CALL + (BR_INST_RETIRED.NEAR_TAKEN - (BR_INST_RETIRED.CONDITIONAL - BR_INST_RETIRED.NOT_TAKEN) - 2 * BR_INST_RETIRED.NEAR_CALL)) / SLOTS)", - "MetricGroup": "Ret", - "MetricName": "Branching_Overhead" + "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipload", + "MetricThreshold": "tma_info_ipload < 3" }, { - "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", - "MetricGroup": "BigFoot;Fed;Frontend;IcMiss;MemoryTLB", - "MetricName": "Big_Code" + "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).", + "MetricExpr": "tma_info_instructions / (UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * cpu@BR_MISP_EXEC.ALL_BRANCHES\\,umask\\=0xE4@)", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_indirect", + "MetricThreshold": "tma_info_ipmisp_indirect < 1e3" }, { - "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_frontend_bound - tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) - Big_Code", - "MetricGroup": "Fed;FetchBW;Frontend", - "MetricName": "Instruction_Fetch_BW" + "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BadSpec;BrMispredicts", + "MetricName": "tma_info_ipmispredict", + "MetricThreshold": "tma_info_ipmispredict < 200" }, { - "BriefDescription": "Instructions Per Cycle (per Logical Processor)", - "MetricExpr": "INST_RETIRED.ANY / CLKS", - "MetricGroup": "Ret;Summary", - "MetricName": "IPC" + "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipstore", + "MetricThreshold": "tma_info_ipstore < 8" }, { - "BriefDescription": "Uops Per Instruction", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY", - "MetricGroup": "Pipeline;Ret;Retire", - "MetricName": "UPI" + "BriefDescription": "Instructions per Software prefetch instruction (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umask\\=0xF@", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_ipswpf", + "MetricThreshold": "tma_info_ipswpf < 100" }, { "BriefDescription": "Instruction per taken branch", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW", - "MetricName": "UpTB" + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", + "MetricName": "tma_info_iptb", + "MetricThreshold": "tma_info_iptb < 9", + "PublicDescription": "Instruction per taken branch. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_dsb_misses, tma_lcp" }, { - "BriefDescription": "Cycles Per Instruction (per Logical Processor)", - "MetricExpr": "1 / IPC", - "MetricGroup": "Mem;Pipeline", - "MetricName": "CPI" + "BriefDescription": "Instructions per speculative Unknown Branch Misprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_instructions / BACLEARS.ANY", + "MetricGroup": "Fed", + "MetricName": "tma_info_ipunknown_branch" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "Pipeline", - "MetricName": "CLKS" + "BriefDescription": "Fraction of branches that are unconditional (direct or indirect) jumps", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - (BR_INST_RETIRED.CONDITIONAL - BR_INST_RETIRED.NOT_TAKEN) - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_jump" }, { - "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "4 * CORE_CLKS", - "MetricGroup": "tma_L1_group", - "MetricName": "SLOTS" + "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_cpi" }, { - "BriefDescription": "The ratio of Executed- by Issued-Uops", - "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", - "MetricGroup": "Cor;Pipeline", - "MetricName": "Execute_per_Issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage." + "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_utilization", + "MetricThreshold": "tma_info_kernel_utilization > 0.05" }, { - "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", - "MetricExpr": "INST_RETIRED.ANY / CORE_CLKS", - "MetricGroup": "Ret;SMT;tma_L1_group", - "MetricName": "CoreIPC" + "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw" }, { - "BriefDescription": "Floating Point Operations Per Cycle", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / CORE_CLKS", - "MetricGroup": "Flops;Ret", - "MetricName": "FLOPc" + "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "tma_info_l1d_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw_1t" }, { - "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)) / (2 * CORE_CLKS)", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "FP_Arith_Utilization", - "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)." + "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l1mpki" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else UOPS_EXECUTED.CORE_CYCLES_GE_1)", - "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", - "MetricName": "ILP" + "BriefDescription": "L1 cache true misses per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l1mpki_load" }, { - "BriefDescription": "Probability of Core Bound bottleneck hidden by SMT-profiling artifacts", - "MetricExpr": "((1 - tma_core_bound / tma_ports_utilization if tma_core_bound < tma_ports_utilization else 1) if SMT_2T_Utilization > 0.5 else 0)", - "MetricGroup": "Cor;SMT", - "MetricName": "Core_Bound_Likely" + "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw" }, { - "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else CLKS))", - "MetricGroup": "SMT", - "MetricName": "CORE_CLKS" + "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "tma_info_l2_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw_1t" }, { - "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", - "MetricGroup": "InsType", - "MetricName": "IpLoad" + "BriefDescription": "L2 cache hits per kilo instruction for all request types (including speculative)", + "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2hpki_all" }, { - "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", - "MetricGroup": "InsType", - "MetricName": "IpStore" + "BriefDescription": "L2 cache hits per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2hpki_load" }, { - "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Branches;Fed;InsType", - "MetricName": "IpBranch" + "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheMisses;Mem", + "MetricName": "tma_info_l2mpki" }, { - "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "IpCall" + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all request types (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem;Offcore", + "MetricName": "tma_info_l2mpki_all" }, { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO", - "MetricName": "IpTB" + "BriefDescription": "L2 cache true code cacheline misses per kilo instruction", + "MetricExpr": "1e3 * FRONTEND_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_l2mpki_code" }, { - "BriefDescription": "Branch instructions per taken branch. ", - "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "BpTkBranch" + "BriefDescription": "L2 cache speculative code cacheline misses per kilo instruction", + "MetricExpr": "1e3 * L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_l2mpki_code_all" }, { - "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", - "MetricGroup": "Flops;InsType", - "MetricName": "IpFLOP" + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2mpki_load" }, { - "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE))", - "MetricGroup": "Flops;InsType", - "MetricName": "IpArith", - "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). May undercount due to FMA double counting. Approximated prior to BDW." + "BriefDescription": "Average per-core data access bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_l3_cache_access_bw" }, { - "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE", - "MetricGroup": "Flops;FpScalar;InsType", - "MetricName": "IpArith_Scalar_SP", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "tma_info_l3_cache_access_bw", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_l3_cache_access_bw_1t" }, { - "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE", - "MetricGroup": "Flops;FpScalar;InsType", - "MetricName": "IpArith_Scalar_DP", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX128", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "tma_info_l3_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw_1t" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX256", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l3mpki" }, { - "BriefDescription": "Instructions per Software prefetch instruction (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umask\\=0xF@", - "MetricGroup": "Prefetches", - "MetricName": "IpSWPF" + "BriefDescription": "Average Latency for L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_load_l2_miss_latency" }, { - "BriefDescription": "Total number of retired Instructions Sample with: INST_RETIRED.PREC_DIST", - "MetricExpr": "INST_RETIRED.ANY", - "MetricGroup": "Summary;tma_L1_group", - "MetricName": "Instructions" + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_load_l2_mlp" + }, + { + "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", + "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_RETIRED.L1_MISS + MEM_LOAD_RETIRED.FB_HIT)", + "MetricGroup": "Mem;MemoryBound;MemoryLat", + "MetricName": "tma_info_load_miss_real_latency" + }, + { + "BriefDescription": "STLB (2nd level TLB) data load speculative misses per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_load_stlb_mpki" + }, + { + "BriefDescription": "Average number of parallel data read requests to external memory", + "MetricExpr": "UNC_ARB_TRK_OCCUPANCY.DATA_READ / UNC_ARB_TRK_OCCUPANCY.DATA_READ@thresh\\=1@", + "MetricGroup": "Mem;MemoryBW;SoC", + "MetricName": "tma_info_mem_parallel_reads", + "PublicDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches" + }, + { + "BriefDescription": "Average number of parallel requests to external memory", + "MetricExpr": "UNC_ARB_TRK_OCCUPANCY.ALL / UNC_ARB_TRK_OCCUPANCY.CYCLES_WITH_ANY_REQUEST", + "MetricGroup": "Mem;SoC", + "MetricName": "tma_info_mem_parallel_requests", + "PublicDescription": "Average number of parallel requests to external memory. Accounts for all requests" + }, + { + "BriefDescription": "Average latency of data read request to external memory (in nanoseconds)", + "MetricExpr": "1e9 * (UNC_ARB_TRK_OCCUPANCY.DATA_READ / UNC_ARB_TRK_REQUESTS.DATA_READ) / (tma_info_socket_clks / duration_time)", + "MetricGroup": "Mem;MemoryLat;SoC", + "MetricName": "tma_info_mem_read_latency", + "PublicDescription": "Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches. ([RKL+]memory-controller only)" + }, + { + "BriefDescription": "Average latency of all requests to external memory (in Uncore cycles)", + "MetricExpr": "UNC_ARB_TRK_OCCUPANCY.ALL / UNC_ARB_TRK_REQUESTS.ALL", + "MetricGroup": "Mem;SoC", + "MetricName": "tma_info_mem_request_latency" + }, + { + "BriefDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full))) + tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk))", + "MetricGroup": "Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_info_memory_bandwidth", + "MetricThreshold": "tma_info_memory_bandwidth > 20", + "PublicDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_mem_bandwidth, tma_sq_full" + }, + { + "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency)))", + "MetricGroup": "Mem;MemoryTLB;Offcore;tma_issueTLB", + "MetricName": "tma_info_memory_data_tlbs", + "MetricThreshold": "tma_info_memory_data_tlbs > 20", + "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store" + }, + { + "BriefDescription": "Total pipeline cost of Memory Latency related bottlenecks (external memory and off-core caches)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound))", + "MetricGroup": "Mem;MemoryLat;Offcore;tma_issueLat", + "MetricName": "tma_info_memory_latency", + "MetricThreshold": "tma_info_memory_latency > 20", + "PublicDescription": "Total pipeline cost of Memory Latency related bottlenecks (external memory and off-core caches). Related metrics: tma_l3_hit_latency, tma_mem_latency" + }, + { + "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_mispredictions", + "MetricThreshold": "tma_info_mispredictions > 20", + "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_branch_misprediction_cost, tma_mispredicts_resteers" + }, + { + "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss", + "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES", + "MetricGroup": "Mem;MemoryBW;MemoryBound", + "MetricName": "tma_info_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)" + }, + { + "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_PENDING + DTLB_STORE_MISSES.WALK_PENDING + EPT.WALK_PENDING) / (2 * tma_info_core_clks)", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_page_walks_utilization", + "MetricThreshold": "tma_info_page_walks_utilization > 0.5" }, { "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=1@", "MetricGroup": "Pipeline;Ret", - "MetricName": "Retire" + "MetricName": "tma_info_retire" }, { - "BriefDescription": "", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@", - "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", - "MetricName": "Execute" + "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", + "MetricExpr": "4 * tma_info_core_clks", + "MetricGroup": "TmaL1;tma_L1_group", + "MetricName": "tma_info_slots" }, { - "BriefDescription": "Average number of Uops issued by front-end when it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=1@", - "MetricGroup": "Fed;FetchBW", - "MetricName": "Fetch_UpC" + "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", + "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", + "MetricGroup": "SMT", + "MetricName": "tma_info_smt_2t_utilization" }, { - "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", - "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", - "MetricGroup": "DSB;Fed;FetchBW", - "MetricName": "DSB_Coverage" + "BriefDescription": "Socket actual clocks when any core is active on that socket", + "MetricExpr": "UNC_CLOCK.SOCKET", + "MetricGroup": "SoC", + "MetricName": "tma_info_socket_clks" }, { - "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details.", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / DSB2MITE_SWITCHES.COUNT", - "MetricGroup": "DSBmiss", - "MetricName": "DSB_Switch_Cost" + "BriefDescription": "STLB (2nd level TLB) data store speculative misses per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_store_stlb_mpki" }, { - "BriefDescription": "Total penalty related to DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck.", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tma_mite))", - "MetricGroup": "DSBmiss;Fed", - "MetricName": "DSB_Misses" + "BriefDescription": "Average Frequency Utilization relative nominal frequency", + "MetricExpr": "tma_info_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_turbo_utilization" }, { - "BriefDescription": "Number of Instructions per non-speculative DSB miss (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", - "MetricGroup": "DSBmiss;Fed", - "MetricName": "IpDSB_Miss_Ret" + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY", + "MetricGroup": "Pipeline;Ret;Retire", + "MetricName": "tma_info_uoppi", + "MetricThreshold": "tma_info_uoppi > 1.05" }, { - "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BadSpec;BrMispredicts", - "MetricName": "IpMispredict" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW", + "MetricName": "tma_info_uptb", + "MetricThreshold": "tma_info_uptb < 6" }, { - "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)", - "MetricExpr": "(tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) * SLOTS / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BrMispredicts", - "MetricName": "Branch_Misprediction_Cost" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses", + "MetricExpr": "ICACHE_64B.IFTAG_STALL / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_itlb_misses", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of branches that are non-taken conditionals", - "MetricExpr": "BR_INST_RETIRED.NOT_TAKEN / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches;CodeGen;PGO", - "MetricName": "Cond_NT" + "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", + "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_MISS) / tma_info_clks, 0)", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group", + "MetricName": "tma_l1_bound", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB_HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of branches that are taken conditionals", - "MetricExpr": "(BR_INST_RETIRED.CONDITIONAL - BR_INST_RETIRED.NOT_TAKEN) / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches;CodeGen;PGO", - "MetricName": "Cond_TK" + "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=1@) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_clks)", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l2_bound", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of branches that are CALL or RET", - "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_RETURN) / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches", - "MetricName": "CallRet" + "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", + "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STALLS_L3_MISS) / tma_info_clks", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l3_bound", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of branches that are unconditional (direct or indirect) jumps", - "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - (BR_INST_RETIRED.CONDITIONAL - BR_INST_RETIRED.NOT_TAKEN) - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches", - "MetricName": "Jump" + "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", + "MetricExpr": "6.5 * tma_info_average_frequency * MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group", + "MetricName": "tma_l3_hit_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tma_info_memory_latency, tma_mem_latency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", - "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_RETIRED.L1_MISS + MEM_LOAD_RETIRED.FB_HIT)", - "MetricGroup": "Mem;MemoryBound;MemoryLat", - "MetricName": "Load_Miss_Real_Latency" + "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", + "MetricExpr": "ILD_STALL.LCP / tma_info_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_lcp", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_dsb_misses, tma_info_iptb", + "ScaleUnit": "100%" }, { - "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)", - "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES", - "MetricGroup": "Mem;MemoryBW;MemoryBound", - "MetricName": "MLP" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", + "MetricExpr": "tma_retiring - tma_heavy_operations", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_light_operations", + "MetricThreshold": "tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", + "ScaleUnit": "100%" }, { - "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L1MPKI" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations", + "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_2 + UOPS_DISPATCHED_PORT.PORT_3 + UOPS_DISPATCHED_PORT.PORT_7 - UOPS_DISPATCHED_PORT.PORT_4) / (2 * tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_load_op_utilization", + "MetricThreshold": "tma_load_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations. Sample with: UOPS_DISPATCHED.PORT_2_3", + "ScaleUnit": "100%" }, { - "BriefDescription": "L1 cache true misses per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L1MPKI_Load" + "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) DTLB was missed by load accesses, that later on hit in second-level TLB (STLB)", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group", + "MetricName": "tma_load_stlb_hit", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", - "MetricGroup": "Backend;CacheMisses;Mem", - "MetricName": "L2MPKI" + "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by load accesses, performing a hardware page walk", + "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group", + "MetricName": "tma_load_stlb_miss", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all request types (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem;Offcore", - "MetricName": "L2MPKI_All" + "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", + "MetricExpr": "(12 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS.ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (9 * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO))) / tma_info_clks", + "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group", + "MetricName": "tma_lock_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOADS_PS. Related metrics: tma_store_latency", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2MPKI_Load" + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts", + "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache hits per kilo instruction for all request types (including speculative)", - "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2HPKI_All" + "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=4@) / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW", + "MetricName": "tma_mem_bandwidth", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_info_memory_bandwidth, tma_sq_full", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache hits per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2HPKI_Load" + "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_clks - tma_mem_bandwidth", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat", + "MetricName": "tma_mem_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_info_memory_latency, tma_l3_hit_latency", + "ScaleUnit": "100%" }, { - "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L3MPKI" + "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(CYCLE_ACTIVITY.STALLS_MEM_ANY + EXE_ACTIVITY.BOUND_ON_STORES) / (CYCLE_ACTIVITY.STALLS_TOTAL + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL) + EXE_ACTIVITY.BOUND_ON_STORES) * tma_backend_bound", + "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fill Buffer (FB) hits per kilo instructions for retired demand loads (L1D misses that merge into ongoing miss-handling entries)", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "FB_HPKI" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations -- uops for memory load or store accesses.", + "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_RETIRED.ANY", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_memory_operations", + "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operations > 0.6", + "ScaleUnit": "100%" }, { - "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", - "MetricConstraint": "NO_NMI_WATCHDOG", - "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_PENDING + DTLB_STORE_MISSES.WALK_PENDING + EPT.WALK_PENDING) / (2 * CORE_CLKS)", - "MetricGroup": "Mem;MemoryTLB", - "MetricName": "Page_Walks_Utilization" + "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / tma_info_slots", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS", + "MetricName": "tma_microcode_sequencer", + "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage", + "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_clks", + "MetricGroup": "BadSpec;BrMispredicts;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueBM", + "MetricName": "tma_mispredicts_resteers", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_branch_mispredicts, tma_info_branch_misprediction_cost, tma_info_mispredictions", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW" + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", + "MetricExpr": "(IDQ.ALL_MITE_CYCLES_ANY_UOPS - IDQ.ALL_MITE_CYCLES_4_UOPS) / tma_info_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_mite", + "MetricThreshold": "tma_mite > 0.1 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sample with: FRONTEND_RETIRED.ANY_DSB_MISS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW" + "BriefDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued", + "MetricExpr": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY", + "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utilized_0_group", + "MetricName": "tma_mixing_vectors", + "MetricThreshold": "tma_mixing_vectors > 0.05", + "PublicDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued. Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic. Related metrics: tma_ms_switches", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data access bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "L3_Cache_Access_BW" + "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", + "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_clks", + "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "L1D_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW_1T" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused", + "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHES - UOPS_RETIRED.MACRO_FUSED) / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_non_fused_branches", + "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused. Non-conditional branches like direct JMP or CALL would count here. Can be used to examine fusible conditional jumps that were not fused.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "L2_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW_1T" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_nop_instructions", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions. Compilers often use NOPs for certain address alignments - e.g. start address of a function or loop body. Sample with: INST_RETIRED.NOP", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "L3_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW_1T" + "BriefDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes", + "MetricExpr": "max(0, tma_light_operations - (tma_fp_arith + tma_memory_operations + tma_fused_instructions + tma_non_fused_branches + tma_nop_instructions))", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_other_light_ops", + "MetricThreshold": "tma_other_light_ops > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes. May undercount due to FMA double counting", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "L3_Cache_Access_BW", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "L3_Cache_Access_BW_1T" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_0 / tma_info_core_clks", + "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_0", + "MetricThreshold": "tma_port_0 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average CPU Utilization", - "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", - "MetricGroup": "HPC;Summary", - "MetricName": "CPU_Utilization" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_1 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_1", + "MetricThreshold": "tma_port_1 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", - "MetricExpr": "Turbo_Utilization * TSC / 1e9 / duration_time", - "MetricGroup": "Power;Summary", - "MetricName": "Average_Frequency" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [ICL+] Loads)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_2 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_group", + "MetricName": "tma_port_2", + "MetricThreshold": "tma_port_2 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [ICL+] Loads). Sample with: UOPS_DISPATCHED_PORT.PORT_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / duration_time", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "GFLOPs", - "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine." + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [ICL+] Loads)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_3 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_group", + "MetricName": "tma_port_3", + "MetricThreshold": "tma_port_3 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [ICL+] Loads). Sample with: UOPS_DISPATCHED_PORT.PORT_3", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average Frequency Utilization relative nominal frequency", - "MetricExpr": "CLKS / CPU_CLK_UNHALTED.REF_TSC", - "MetricGroup": "Power", - "MetricName": "Turbo_Utilization" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data)", + "MetricExpr": "tma_store_op_utilization", + "MetricGroup": "TopdownL6;tma_L6_group;tma_issueSpSt;tma_store_op_utilization_group", + "MetricName": "tma_port_4", + "MetricThreshold": "tma_port_4 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data). Sample with: UOPS_DISPATCHED_PORT.PORT_4. Related metrics: tma_split_stores", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", - "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", - "MetricGroup": "SMT", - "MetricName": "SMT_2T_Utilization" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_5 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_5", + "MetricThreshold": "tma_port_5 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "OS", - "MetricName": "Kernel_Utilization" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_6 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_6", + "MetricThreshold": "tma_port_6 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_6. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", - "MetricGroup": "OS", - "MetricName": "Kernel_CPI" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 7 ([HSW+]simple Store-address)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_7 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_store_op_utilization_group", + "MetricName": "tma_port_7", + "MetricThreshold": "tma_port_7 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 7 ([HSW+]simple Store-address). Sample with: UOPS_DISPATCHED_PORT.PORT_7", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", - "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_REQUESTS.ALL) / 1e6 / duration_time / 1e3", - "MetricGroup": "HPC;Mem;MemoryBW;SoC", - "MetricName": "DRAM_BW_Use" + "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", + "MetricExpr": "((EXE_ACTIVITY.EXE_BOUND_0_PORTS + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / tma_info_clks if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_clks)", + "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_ports_utilization", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average latency of all requests to external memory (in Uncore cycles)", - "MetricExpr": "MEM_Parallel_Requests", - "MetricGroup": "Mem;SoC", - "MetricName": "MEM_Request_Latency" + "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "(UOPS_EXECUTED.CORE_CYCLES_NONE / 2 if #SMT_on else CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY) / tma_info_core_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_0", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average number of parallel requests to external memory. Accounts for all requests", - "MetricExpr": "UNC_ARB_TRK_OCCUPANCY.ALL / UNC_ARB_TRK_REQUESTS.ALL", - "MetricGroup": "Mem;SoC", - "MetricName": "MEM_Parallel_Requests" + "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_1 - UOPS_EXECUTED.CORE_CYCLES_GE_2) / 2 if #SMT_on else EXE_ACTIVITY.1_PORTS_UTIL) / tma_info_core_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_1", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Related metrics: tma_l1_bound", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches", - "MetricExpr": "1e9 * (UNC_ARB_TRK_OCCUPANCY.DATA_READ / UNC_ARB_TRK_REQUESTS.DATA_READ) / (Socket_CLKS / duration_time)", - "MetricGroup": "Mem;MemoryLat;SoC", - "MetricName": "MEM_Read_Latency" + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_2 - UOPS_EXECUTED.CORE_CYCLES_GE_3) / 2 if #SMT_on else EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_core_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches", - "MetricExpr": "UNC_ARB_TRK_OCCUPANCY.DATA_READ / UNC_ARB_TRK_OCCUPANCY.DATA_READ@thresh\\=1@", - "MetricGroup": "Mem;MemoryBW;SoC", - "MetricName": "MEM_Parallel_Reads" + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise).", + "MetricExpr": "(UOPS_EXECUTED.CORE_CYCLES_GE_3 / 2 if #SMT_on else UOPS_EXECUTED.CORE_CYCLES_GE_3) / tma_info_core_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_3m", + "MetricThreshold": "tma_ports_utilized_3m > 0.7 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "ScaleUnit": "100%" }, { - "BriefDescription": "Socket actual clocks when any core is active on that socket", - "MetricExpr": "UNC_CLOCK.SOCKET", - "MetricGroup": "SoC", - "MetricName": "Socket_CLKS" + "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1", + "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.RETIRE_SLOTS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", - "MetricGroup": "Branches;OS", - "MetricName": "IpFarBranch" + "BriefDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations", + "MetricExpr": "PARTIAL_RAT_STALLS.SCOREBOARD / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL5;tma_L5_group;tma_issueSO;tma_ports_utilized_0_group", + "MetricName": "tma_serializing_operation", + "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations. Instructions like CPUID; WRMSR or LFENCE serialize the out-of-order execution which may limit performance. Sample with: PARTIAL_RAT_STALLS.SCOREBOARD. Related metrics: tma_ms_switches", + "ScaleUnit": "100%" }, { - "BriefDescription": "Uncore frequency per die [GHZ]", - "MetricExpr": "Socket_CLKS / #num_dies / duration_time / 1e9", - "MetricGroup": "SoC", - "MetricName": "UNCORE_FREQ" + "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "tma_info_load_miss_real_latency * LD_BLOCKS.NO_SR / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_split_loads", + "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "C3 residency percent per core", - "MetricExpr": "cstate_core@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Core_Residency", + "BriefDescription": "This metric represents rate of split store accesses", + "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group", + "MetricName": "tma_split_stores", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_4", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per core", - "MetricExpr": "cstate_core@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Core_Residency", + "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", + "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on else OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group", + "MetricName": "tma_sq_full", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_info_memory_bandwidth, tma_mem_bandwidth", "ScaleUnit": "100%" }, { - "BriefDescription": "C7 residency percent per core", - "MetricExpr": "cstate_core@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Core_Residency", + "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", + "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "C2 residency percent per package", - "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C2_Pkg_Residency", + "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", + "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_store_fwd_blk", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", "ScaleUnit": "100%" }, { - "BriefDescription": "C3 residency percent per package", - "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Pkg_Residency", + "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_clks", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group", + "MetricName": "tma_store_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per package", - "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Pkg_Residency", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_4 / tma_info_core_clks", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_store_op_utilization", + "MetricThreshold": "tma_store_op_utilization > 0.6", "ScaleUnit": "100%" }, { - "BriefDescription": "C7 residency percent per package", - "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Pkg_Residency", + "BriefDescription": "This metric roughly estimates the fraction of cycles where the TLB was missed by store accesses, hitting in the second-level TLB (STLB)", + "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", + "MetricName": "tma_store_stlb_hit", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk", + "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_clks", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", + "MetricName": "tma_store_stlb_miss", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears", + "MetricExpr": "9 * BACLEARS.ANY / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group", + "MetricName": "tma_unknown_branches", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (e.g. first time the branch is fetched or hitting BTB capacity limit). Sample with: BACLEARS.ANY", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric serves as an approximation of legacy x87 usage", + "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD", + "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", + "MetricName": "tma_x87_use", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Percentage of cycles in aborted transactions.", + "MetricExpr": "max(cpu@cycles\\-t@ - cpu@cycles\\-ct@, 0) / cycles", + "MetricGroup": "transaction", + "MetricName": "tsx_aborted_cycles", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Number of cycles within a transaction divided by the number of elisions.", + "MetricExpr": "cpu@cycles\\-t@ / cpu@el\\-start@", + "MetricGroup": "transaction", + "MetricName": "tsx_cycles_per_elision", + "ScaleUnit": "1cycles / elision" + }, + { + "BriefDescription": "Number of cycles within a transaction divided by the number of transactions.", + "MetricExpr": "cpu@cycles\\-t@ / cpu@tx\\-start@", + "MetricGroup": "transaction", + "MetricName": "tsx_cycles_per_transaction", + "ScaleUnit": "1cycles / transaction" + }, + { + "BriefDescription": "Percentage of cycles within a transaction region.", + "MetricExpr": "cpu@cycles\\-t@ / cycles", + "MetricGroup": "transaction", + "MetricName": "tsx_transactional_cycles", "ScaleUnit": "100%" } ] diff --git a/tools/perf/pmu-events/arch/x86/skylake/uncore-other.json b/tools/perf/pmu-events/arch/x86/skylake/uncore-other.json index e6d4cd625597..ef804df3f41e 100644 --- a/tools/perf/pmu-events/arch/x86/skylake/uncore-other.json +++ b/tools/perf/pmu-events/arch/x86/skylake/uncore-other.json @@ -33,6 +33,7 @@ "Unit": "ARB" }, { + "BriefDescription": "UNC_ARB_TRK_REQUESTS.ALL", "EventCode": "0x81", "EventName": "UNC_ARB_TRK_REQUESTS.ALL", "PerPkg": "1", From patchwork Sun Feb 19 09:28:26 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59119 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp775616wrn; Sun, 19 Feb 2023 01:45:28 -0800 (PST) X-Google-Smtp-Source: AK7set8AOr0y8zXrf5DmAXbwmU14vHRmGkVHf+Ws18ydUTqMBUFZ+u7bv9SC9cqzJ1KbQV+0JwKw X-Received: by 2002:a17:90b:1c07:b0:234:656d:235a with SMTP id oc7-20020a17090b1c0700b00234656d235amr831152pjb.43.1676799928195; Sun, 19 Feb 2023 01:45:28 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799928; cv=none; d=google.com; s=arc-20160816; b=rjXjuRXEe4S7p9Bf/yY7HxkTOwdIZc4AfMR4m487yX0KOPRVmy/3XLuPV0gBKHkU4v cxe6ws0aD+osE9KohLNqCq5qdL3999SXDm1kR0SG8tmrYnlmiq232BzYeBGVQnODritr FumP+PkUhaSzXof1+H0MzCHhtsCPCbYZVmF0+WtLVy0pjsNpT/qSum1jnpxeErLPjE4H 58Iv0yfF4WXnlwdBhjFRis2c3MOASMQmrnQkYVn6A+CkXLrLOOyODVi/QW37jQVPfTib 9f/GotHXboqZafTynpN4SHLrwxe2a1OA/UsvbxscQzffy+gQUG1LGtOPS7GOEhdXimx4 4k9Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:from:subject :references:mime-version:message-id:in-reply-to:date:dkim-signature; bh=QV7/hXxh57QEI+TG4VdKL7Xbqu6ReB4RGYaSf0jfwrI=; b=poT7Clx5mCqv4MuGaOklMBUZApTofQNL4em4WDkH1jMOPYuDJ3GHvHVyzIK9ob7ALc BNL1mVYZMDX86wDEtVGN/jH3DWXE/WwMcdTjs/Nq/g/YrDFmwfzV9WGCglFmPLPtkboZ XuCurP1M9o1uLSIklVS/yGcHn6IOj1TsF5nZM8G7NIh3JGNZf/tBRhgu4d8yXv3fHT4u b94wqqIa4/yrgL4dEfDJWobYM0U+4k7ahx/+Kgzx0uv15Ne47j7HOCq2IDiZBBGKUZ0E KzX65N6tTLf1BX0l1NArRc6gKEBph3xiUHMM6G8oP5nu1hctIACq4if83dsLCtG1D/wC /u1A== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=GyQeWbmp; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id a3-20020a17090acb8300b002299b06dca9si574072pju.83.2023.02.19.01.45.15; Sun, 19 Feb 2023 01:45:28 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=GyQeWbmp; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229814AbjBSJfs (ORCPT + 99 others); Sun, 19 Feb 2023 04:35:48 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58224 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230009AbjBSJfo (ORCPT ); Sun, 19 Feb 2023 04:35:44 -0500 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A9C1C2117 for ; Sun, 19 Feb 2023 01:34:44 -0800 (PST) Received: by mail-yb1-xb4a.google.com with SMTP id s78-20020a257751000000b008f257b16d71so18293ybc.15 for ; Sun, 19 Feb 2023 01:34:44 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:from:to:cc:subject:date :message-id:reply-to; bh=QV7/hXxh57QEI+TG4VdKL7Xbqu6ReB4RGYaSf0jfwrI=; b=GyQeWbmpCRRpYOvE5GRwkfqKNhRLK+5VTunsVkkiurNpBoFhGIp1BYS5ENUoVXUZ0P QBm0ldLBjVwk5DYM2Ny9UiUNaLSXJtBW4S4R9zFHaTxLbJiXHfIkkYNChDmKFvWkU65U nD1/eK5ik+lK2DNfH9gl0ujay6xkUoe5sJ0uWLzBvGY6T0aAJaHoYCgiuER8ICCVO7jd k+B//ccKWGNCLOcEdqQyl4i5rLEoRm+CNbp1VIOJxrL5LLeHzf3xkLOwh3LAmDJWtJa0 WrQ17Z+mQueFsofPcKeTDxBe/qYplGvvawbodFjlVV8YGWINAKFkBlCyFw/55xftM4hF E0YA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=QV7/hXxh57QEI+TG4VdKL7Xbqu6ReB4RGYaSf0jfwrI=; b=xTIpvcLY73VV42PPD8zhbWgUENJGJHeV/U6OEesXD8q1ukqY2XY1/RCvrhW4rVf6Ua WAbnBfI3QveOhxXZEOBi1CriVudbA8sbw7zZcG/1AvMO2IeosHL3uB1sKob04QS5k3Kl UkGuMjyrPPaqlUZ/wdI4ETIS2BOqJnIo0GJUKz0w4HX8xJmJQhI66ioSz4sfShxgDKTr gMn7i+/mJtec1K5bIet9xs2WiDSJKVCXjy5wx9i6IUkVas7fxNwSKcAFrXgxvnSuvuuu XMnwM6EbbNzWbEOrrZGZV5EuyaN7qezch1wl9BILNRoh5MoJEnRZcidqx5RWz4GybpE1 y3VA== X-Gm-Message-State: AO0yUKW8ao1o1DmgFSE9uFIYxCZcFPKG2bvvOz0EMCOqw6U5tCRZzDkg 5Hknmm4DBqJ+rbX3ORPBUAEKDZrzkZG7 X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a25:9e8c:0:b0:90c:e04d:deca with SMTP id p12-20020a259e8c000000b0090ce04ddecamr17677ybq.11.1676799182169; Sun, 19 Feb 2023 01:33:02 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:26 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-30-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 29/51] perf vendor events intel: Refresh skylakex metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758252161466542551?= X-GMAIL-MSGID: =?utf-8?q?1758252161466542551?= Update the skylakex events from 1.28 to 1.29 and metrics to TMA version 4.5. Generation was done using https://github.com/intel/perfmon. Notable changes are TMA info metrics are renamed from their node name to be lower case and prefixed by tma_info_, MetricThreshold expressions are added, "Sample with" documentation is added to many TMA metrics, smi_cost and transaction metric groups are added replicating existing hard coded metrics in stat-shadow. Signed-off-by: Ian Rogers --- tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- .../pmu-events/arch/x86/skylakex/cache.json | 8 +- .../arch/x86/skylakex/frontend.json | 8 +- .../arch/x86/skylakex/pipeline.json | 16 + .../arch/x86/skylakex/skx-metrics.json | 2097 +++++++++-------- .../arch/x86/skylakex/uncore-memory.json | 2 +- .../arch/x86/skylakex/uncore-other.json | 96 +- .../arch/x86/skylakex/uncore-power.json | 6 +- 8 files changed, 1167 insertions(+), 1068 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv index 22aa63f90f89..07f0b4758a83 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -25,7 +25,7 @@ GenuineIntel-6-2A,v18,sandybridge,core GenuineIntel-6-(8F|CF),v1.11,sapphirerapids,core GenuineIntel-6-(37|4A|4C|4D|5A),v15,silvermont,core GenuineIntel-6-(4E|5E|8E|9E|A5|A6),v54,skylake,core -GenuineIntel-6-55-[01234],v1.28,skylakex,core +GenuineIntel-6-55-[01234],v1.29,skylakex,core GenuineIntel-6-86,v1.20,snowridgex,core GenuineIntel-6-8[CD],v1.08,tigerlake,core GenuineIntel-6-2C,v3,westmereep-dp,core diff --git a/tools/perf/pmu-events/arch/x86/skylakex/cache.json b/tools/perf/pmu-events/arch/x86/skylakex/cache.json index 92da692795e7..d28d8822a51a 100644 --- a/tools/perf/pmu-events/arch/x86/skylakex/cache.json +++ b/tools/perf/pmu-events/arch/x86/skylakex/cache.json @@ -234,20 +234,22 @@ "UMask": "0x4f" }, { - "BriefDescription": "All retired load instructions.", + "BriefDescription": "Retired load instructions.", "Data_LA": "1", "EventCode": "0xD0", "EventName": "MEM_INST_RETIRED.ALL_LOADS", "PEBS": "1", + "PublicDescription": "Counts all retired load instructions. This event accounts for SW prefetch instructions of PREFETCHNTA or PREFETCHT0/1/2 or PREFETCHW.", "SampleAfterValue": "2000003", "UMask": "0x81" }, { - "BriefDescription": "All retired store instructions.", + "BriefDescription": "Retired store instructions.", "Data_LA": "1", "EventCode": "0xD0", "EventName": "MEM_INST_RETIRED.ALL_STORES", "PEBS": "1", + "PublicDescription": "Counts all retired store instructions.", "SampleAfterValue": "2000003", "UMask": "0x82" }, @@ -484,7 +486,7 @@ "UMask": "0x80" }, { - "BriefDescription": "Cacheable and noncachaeble code read requests", + "BriefDescription": "Cacheable and non-cacheable code read requests", "EventCode": "0xB0", "EventName": "OFFCORE_REQUESTS.DEMAND_CODE_RD", "PublicDescription": "Counts both cacheable and non-cacheable code read requests.", diff --git a/tools/perf/pmu-events/arch/x86/skylakex/frontend.json b/tools/perf/pmu-events/arch/x86/skylakex/frontend.json index 13ccf50db43d..04f08e4d2402 100644 --- a/tools/perf/pmu-events/arch/x86/skylakex/frontend.json +++ b/tools/perf/pmu-events/arch/x86/skylakex/frontend.json @@ -322,7 +322,7 @@ "UMask": "0x4" }, { - "BriefDescription": "Cycles when uops are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequenser (MS) is busy", + "BriefDescription": "Cycles when uops are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy", "CounterMask": "1", "EventCode": "0x79", "EventName": "IDQ.MS_CYCLES", @@ -331,7 +331,7 @@ "UMask": "0x30" }, { - "BriefDescription": "Cycles when uops initiated by Decode Stream Buffer (DSB) are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequenser (MS) is busy", + "BriefDescription": "Cycles when uops initiated by Decode Stream Buffer (DSB) are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy", "CounterMask": "1", "EventCode": "0x79", "EventName": "IDQ.MS_DSB_CYCLES", @@ -340,7 +340,7 @@ "UMask": "0x10" }, { - "BriefDescription": "Uops initiated by MITE and delivered to Instruction Decode Queue (IDQ) while Microcode Sequenser (MS) is busy", + "BriefDescription": "Uops initiated by MITE and delivered to Instruction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy", "EventCode": "0x79", "EventName": "IDQ.MS_MITE_UOPS", "PublicDescription": "Counts the number of uops initiated by MITE and delivered to Instruction Decode Queue (IDQ) while the Microcode Sequencer (MS) is busy. Counting includes uops that may 'bypass' the IDQ.", @@ -358,7 +358,7 @@ "UMask": "0x30" }, { - "BriefDescription": "Uops delivered to Instruction Decode Queue (IDQ) while Microcode Sequenser (MS) is busy", + "BriefDescription": "Uops delivered to Instruction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy", "EventCode": "0x79", "EventName": "IDQ.MS_UOPS", "PublicDescription": "Counts the total number of uops delivered by the Microcode Sequencer (MS). Any instruction over 4 uops will be delivered by the MS. Some instructions such as transcendentals may additionally generate uops from the MS.", diff --git a/tools/perf/pmu-events/arch/x86/skylakex/pipeline.json b/tools/perf/pmu-events/arch/x86/skylakex/pipeline.json index 64e1fe351333..0f06e314fe36 100644 --- a/tools/perf/pmu-events/arch/x86/skylakex/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/skylakex/pipeline.json @@ -93,6 +93,22 @@ "SampleAfterValue": "400009", "UMask": "0x10" }, + { + "BriefDescription": "Speculative and retired mispredicted macro conditional branches", + "EventCode": "0x89", + "EventName": "BR_MISP_EXEC.ALL_BRANCHES", + "PublicDescription": "This event counts both taken and not taken speculative and retired mispredicted branch instructions.", + "SampleAfterValue": "200003", + "UMask": "0xff" + }, + { + "BriefDescription": "Speculative mispredicted indirect branches", + "EventCode": "0x89", + "EventName": "BR_MISP_EXEC.INDIRECT", + "PublicDescription": "Counts speculatively miss-predicted indirect branches at execution time. Counts for indirect near CALL or JMP instructions (RET excluded).", + "SampleAfterValue": "200003", + "UMask": "0xe4" + }, { "BriefDescription": "All mispredicted macro branch instructions retired.", "EventCode": "0xC5", diff --git a/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json b/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json index 1f8d60cce3ce..fa2f7f126a30 100644 --- a/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json +++ b/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json @@ -1,1497 +1,1570 @@ [ { - "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks", - "MetricExpr": "100 * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", - "MetricGroup": "Bad;BadSpec;BrMispredicts", - "MetricName": "Mispredictions" - }, - { - "BriefDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks", - "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full))) + tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk))", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "Memory_Bandwidth" - }, - { - "BriefDescription": "Total pipeline cost of Memory Latency related bottlenecks (external memory and off-core caches)", - "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound))", - "MetricGroup": "Mem;MemoryLat;Offcore", - "MetricName": "Memory_Latency" - }, - { - "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)", - "MetricExpr": "100 * tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency)))", - "MetricGroup": "Mem;MemoryTLB;Offcore", - "MetricName": "Memory_Data_TLBs" - }, - { - "BriefDescription": "Total pipeline cost of branch related instructions (used for program control-flow including function calls)", - "MetricExpr": "100 * ((BR_INST_RETIRED.CONDITIONAL + 3 * BR_INST_RETIRED.NEAR_CALL + (BR_INST_RETIRED.NEAR_TAKEN - (BR_INST_RETIRED.CONDITIONAL - BR_INST_RETIRED.NOT_TAKEN) - 2 * BR_INST_RETIRED.NEAR_CALL)) / SLOTS)", - "MetricGroup": "Ret", - "MetricName": "Branching_Overhead" - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", - "MetricGroup": "BigFoot;Fed;Frontend;IcMiss;MemoryTLB", - "MetricName": "Big_Code" - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_frontend_bound - tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) - Big_Code", - "MetricGroup": "Fed;FetchBW;Frontend", - "MetricName": "Instruction_Fetch_BW" - }, - { - "BriefDescription": "Instructions Per Cycle (per Logical Processor)", - "MetricExpr": "INST_RETIRED.ANY / CLKS", - "MetricGroup": "Ret;Summary", - "MetricName": "IPC" - }, - { - "BriefDescription": "Uops Per Instruction", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY", - "MetricGroup": "Pipeline;Ret;Retire", - "MetricName": "UPI" - }, - { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW", - "MetricName": "UpTB" + "BriefDescription": "C2 residency percent per package", + "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C2_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Cycles Per Instruction (per Logical Processor)", - "MetricExpr": "1 / IPC", - "MetricGroup": "Mem;Pipeline", - "MetricName": "CPI" + "BriefDescription": "C3 residency percent per core", + "MetricExpr": "cstate_core@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Core_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "Pipeline", - "MetricName": "CLKS" + "BriefDescription": "C3 residency percent per package", + "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "4 * CORE_CLKS", - "MetricGroup": "tma_L1_group", - "MetricName": "SLOTS" + "BriefDescription": "C6 residency percent per core", + "MetricExpr": "cstate_core@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Core_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "The ratio of Executed- by Issued-Uops", - "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", - "MetricGroup": "Cor;Pipeline", - "MetricName": "Execute_per_Issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage." + "BriefDescription": "C6 residency percent per package", + "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", - "MetricExpr": "INST_RETIRED.ANY / CORE_CLKS", - "MetricGroup": "Ret;SMT;tma_L1_group", - "MetricName": "CoreIPC" + "BriefDescription": "C7 residency percent per core", + "MetricExpr": "cstate_core@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Core_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Floating Point Operations Per Cycle", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE) + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / CORE_CLKS", - "MetricGroup": "Flops;Ret", - "MetricName": "FLOPc" + "BriefDescription": "C7 residency percent per package", + "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)) / (2 * CORE_CLKS)", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "FP_Arith_Utilization", - "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)." + "BriefDescription": "Uncore frequency per die [GHZ]", + "MetricExpr": "tma_info_socket_clks / #num_dies / duration_time / 1e9", + "MetricGroup": "SoC", + "MetricName": "UNCORE_FREQ" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else UOPS_EXECUTED.CORE_CYCLES_GE_1)", - "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", - "MetricName": "ILP" + "BriefDescription": "Percentage of cycles spent in System Management Interrupts.", + "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)", + "MetricGroup": "smi", + "MetricName": "smi_cycles", + "MetricThreshold": "smi_cycles > 0.1", + "ScaleUnit": "100%" }, { - "BriefDescription": "Probability of Core Bound bottleneck hidden by SMT-profiling artifacts", - "MetricExpr": "((1 - tma_core_bound / tma_ports_utilization if tma_core_bound < tma_ports_utilization else 1) if SMT_2T_Utilization > 0.5 else 0)", - "MetricGroup": "Cor;SMT", - "MetricName": "Core_Bound_Likely" + "BriefDescription": "Number of SMI interrupts.", + "MetricExpr": "msr@smi@", + "MetricGroup": "smi", + "MetricName": "smi_num", + "ScaleUnit": "1SMI#" }, { - "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else CLKS))", - "MetricGroup": "SMT", - "MetricName": "CORE_CLKS" + "BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset", + "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_4k_aliasing", + "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", - "MetricGroup": "InsType", - "MetricName": "IpLoad" + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", + "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_info_slots", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_alu_op_utilization", + "MetricThreshold": "tma_alu_op_utilization > 0.6", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", - "MetricGroup": "InsType", - "MetricName": "IpStore" + "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", + "MetricExpr": "100 * (FP_ASSIST.ANY + OTHER_ASSISTS.ANY) / tma_info_slots", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_assists", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.ANY", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Branches;Fed;InsType", - "MetricName": "IpBranch" + "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", + "MetricExpr": "1 - tma_frontend_bound - (UOPS_ISSUED.ANY + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_backend_bound", + "MetricThreshold": "tma_backend_bound > 0.2", + "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "IpCall" + "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", + "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_bad_speculation", + "MetricThreshold": "tma_bad_speculation > 0.15", + "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO", - "MetricName": "IpTB" + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation", + "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM", + "MetricName": "tma_branch_mispredicts", + "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_info_branch_misprediction_cost, tma_info_mispredictions, tma_mispredicts_resteers", + "ScaleUnit": "100%" }, { - "BriefDescription": "Branch instructions per taken branch. ", - "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "BpTkBranch" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", + "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_clks + tma_unknown_branches", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_branch_resteers", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE) + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)", - "MetricGroup": "Flops;InsType", - "MetricName": "IpFLOP" + "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", + "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_cisc", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE))", - "MetricGroup": "Flops;InsType", - "MetricName": "IpArith", - "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). May undercount due to FMA double counting. Approximated prior to BDW." + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears", + "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_clks", + "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC", + "MetricName": "tma_clears_resteers", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE", - "MetricGroup": "Flops;FpScalar;InsType", - "MetricName": "IpArith_Scalar_SP", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(44 * tma_info_average_frequency * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE / (OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE + OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + 44 * tma_info_average_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_contested_accesses", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE", - "MetricGroup": "Flops;FpScalar;InsType", - "MetricName": "IpArith_Scalar_DP", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_backend_bound - tma_memory_bound", + "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX128", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "44 * tma_info_average_frequency * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (1 - OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE / (OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE + OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_data_sharing", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX256", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=2@) / tma_info_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0;tma_mite_group", + "MetricName": "tma_decoder0_alone", + "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35))", + "PublicDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder. Related metrics: tma_few_uops_instructions", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX512", - "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", + "MetricExpr": "ARITH.DIVIDER_ACTIVE / tma_info_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_divider", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_ACTIVE", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Software prefetch instruction (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umask\\=0xF@", - "MetricGroup": "Prefetches", - "MetricName": "IpSWPF" + "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_clks + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_clks - tma_l2_bound", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Total number of retired Instructions Sample with: INST_RETIRED.PREC_DIST", - "MetricExpr": "INST_RETIRED.ANY", - "MetricGroup": "Summary;tma_L1_group", - "MetricName": "Instructions" + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", + "MetricExpr": "(IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4_UOPS) / tma_info_core_clks / 2", + "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_dsb", + "MetricThreshold": "tma_dsb > 0.15 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=1@", - "MetricGroup": "Pipeline;Ret", - "MetricName": "Retire" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_clks", + "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_dsb_switches", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_dsb_misses, tma_info_iptb, tma_lcp", + "ScaleUnit": "100%" }, { - "BriefDescription": "", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@", - "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", - "MetricName": "Execute" + "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "min(9 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group", + "MetricName": "tma_dtlb_load", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store, tma_info_memory_data_tlbs", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average number of Uops issued by front-end when it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=1@", - "MetricGroup": "Fed;FetchBW", - "MetricName": "Fetch_UpC" + "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", + "MetricExpr": "(9 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group", + "MetricName": "tma_dtlb_store", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, tma_info_memory_data_tlbs", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", - "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", - "MetricGroup": "DSB;Fed;FetchBW", - "MetricName": "DSB_Coverage" + "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(110 * tma_info_average_frequency * (OFFCORE_RESPONSE.DEMAND_RFO.L3_MISS.REMOTE_HITM + OFFCORE_RESPONSE.PF_L2_RFO.L3_MISS.REMOTE_HITM) + 47.5 * tma_info_average_frequency * (OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.HITM_OTHER_CORE + OFFCORE_RESPONSE.PF_L2_RFO.L3_HIT.HITM_OTHER_CORE)) / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group", + "MetricName": "tma_false_sharing", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details.", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / DSB2MITE_SWITCHES.COUNT", - "MetricGroup": "DSBmiss", - "MetricName": "DSB_Switch_Cost" + "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "tma_info_load_miss_real_latency * cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=1@ / tma_info_clks", + "MetricGroup": "MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricName": "tma_fb_full", + "MetricThreshold": "tma_fb_full > 0.3", + "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_dram_bw_use, tma_info_memory_bandwidth, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", + "ScaleUnit": "100%" }, { - "BriefDescription": "Total penalty related to DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck.", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tma_mite))", - "MetricGroup": "DSBmiss;Fed", - "MetricName": "DSB_Misses" + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", + "MetricExpr": "tma_frontend_bound - tma_fetch_latency", + "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_dsb_coverage, tma_info_dsb_misses, tma_info_iptb, tma_lcp", + "ScaleUnit": "100%" }, { - "BriefDescription": "Number of Instructions per non-speculative DSB miss (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", - "MetricGroup": "DSBmiss;Fed", - "MetricName": "IpDSB_Miss_Ret" + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", + "MetricExpr": "4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / tma_info_slots", + "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BadSpec;BrMispredicts", - "MetricName": "IpMispredict" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops", + "MetricExpr": "tma_heavy_operations - tma_microcode_sequencer", + "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueD0", + "MetricName": "tma_few_uops_instructions", + "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone", + "ScaleUnit": "100%" }, { - "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)", - "MetricExpr": "(tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) * SLOTS / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BrMispredicts", - "MetricName": "Branch_Misprediction_Cost" + "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", + "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_fp_arith", + "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of branches that are non-taken conditionals", - "MetricExpr": "BR_INST_RETIRED.NOT_TAKEN / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches;CodeGen;PGO", - "MetricName": "Cond_NT" + "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired", + "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", + "MetricName": "tma_fp_scalar", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of branches that are taken conditionals", - "MetricExpr": "(BR_INST_RETIRED.CONDITIONAL - BR_INST_RETIRED.NOT_TAKEN) / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches;CodeGen;PGO", - "MetricName": "Cond_TK" + "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xfc@ / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", + "MetricName": "tma_fp_vector", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of branches that are CALL or RET", - "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_RETURN) / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches", - "MetricName": "CallRet" + "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", + "MetricName": "tma_fp_vector_128b", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of branches that are unconditional (direct or indirect) jumps", - "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - (BR_INST_RETIRED.CONDITIONAL - BR_INST_RETIRED.NOT_TAKEN) - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches", - "MetricName": "Jump" + "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", + "MetricName": "tma_fp_vector_256b", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", - "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_RETIRED.L1_MISS + MEM_LOAD_RETIRED.FB_HIT)", - "MetricGroup": "Mem;MemoryBound;MemoryLat", - "MetricName": "Load_Miss_Real_Latency" + "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", + "MetricName": "tma_fp_vector_512b", + "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)", - "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES", - "MetricGroup": "Mem;MemoryBW;MemoryBound", - "MetricName": "MLP" + "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", + "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / tma_info_slots", + "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L1MPKI" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions", + "MetricExpr": "tma_light_operations * UOPS_RETIRED.MACRO_FUSED / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_fused_instructions", + "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions. The instruction pairs of CMP+JCC or DEC+JCC are commonly used examples.", + "ScaleUnit": "100%" }, { - "BriefDescription": "L1 cache true misses per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L1MPKI_Load" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences", + "MetricExpr": "(UOPS_RETIRED.RETIRE_SLOTS + UOPS_RETIRED.MACRO_FUSED - INST_RETIRED.ANY) / tma_info_slots", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_heavy_operations", + "MetricThreshold": "tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", - "MetricGroup": "Backend;CacheMisses;Mem", - "MetricName": "L2MPKI" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses", + "MetricExpr": "(ICACHE_16B.IFDATA_STALL + 2 * cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=1\\,edge@) / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_icache_misses", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all request types (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem;Offcore", - "MetricName": "L2MPKI_All" + "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", + "MetricExpr": "tma_info_turbo_utilization * TSC / 1e9 / duration_time", + "MetricGroup": "Power;Summary", + "MetricName": "tma_info_average_frequency" }, { - "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2MPKI_Load" + "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", + "MetricGroup": "BigFoot;Fed;Frontend;IcMiss;MemoryTLB;tma_issueBC", + "MetricName": "tma_info_big_code", + "MetricThreshold": "tma_info_big_code > 20", + "PublicDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses). Related metrics: tma_info_branching_overhead" }, { - "BriefDescription": "L2 cache hits per kilo instruction for all request types (including speculative)", - "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2HPKI_All" + "BriefDescription": "Branch instructions per taken branch.", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_bptkbranch" }, { - "BriefDescription": "L2 cache hits per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2HPKI_Load" + "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)", + "MetricExpr": "(tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) * tma_info_slots / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_branch_misprediction_cost", + "PublicDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_branch_mispredicts, tma_info_mispredictions, tma_mispredicts_resteers" }, { - "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L3MPKI" + "BriefDescription": "Total pipeline cost of branch related instructions (used for program control-flow including function calls)", + "MetricExpr": "100 * ((BR_INST_RETIRED.CONDITIONAL + 3 * BR_INST_RETIRED.NEAR_CALL + (BR_INST_RETIRED.NEAR_TAKEN - (BR_INST_RETIRED.CONDITIONAL - BR_INST_RETIRED.NOT_TAKEN) - 2 * BR_INST_RETIRED.NEAR_CALL)) / tma_info_slots)", + "MetricGroup": "Ret;tma_issueBC", + "MetricName": "tma_info_branching_overhead", + "MetricThreshold": "tma_info_branching_overhead > 10", + "PublicDescription": "Total pipeline cost of branch related instructions (used for program control-flow including function calls). Related metrics: tma_info_big_code" }, { - "BriefDescription": "Fill Buffer (FB) hits per kilo instructions for retired demand loads (L1D misses that merge into ongoing miss-handling entries)", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "FB_HPKI" + "BriefDescription": "Fraction of branches that are CALL or RET", + "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_RETURN) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_callret" }, { - "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", - "MetricConstraint": "NO_NMI_WATCHDOG", - "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_PENDING + DTLB_STORE_MISSES.WALK_PENDING + EPT.WALK_PENDING) / (2 * CORE_CLKS)", - "MetricGroup": "Mem;MemoryTLB", - "MetricName": "Page_Walks_Utilization" + "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "Pipeline", + "MetricName": "tma_info_clks" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW" + "BriefDescription": "STLB (2nd level TLB) code speculative misses per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", + "MetricGroup": "Fed;MemoryTLB", + "MetricName": "tma_info_code_stlb_mpki" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW" + "BriefDescription": "Fraction of branches that are non-taken conditionals", + "MetricExpr": "BR_INST_RETIRED.NOT_TAKEN / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_cond_nt" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW" + "BriefDescription": "Fraction of branches that are taken conditionals", + "MetricExpr": "(BR_INST_RETIRED.CONDITIONAL - BR_INST_RETIRED.NOT_TAKEN) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_cond_tk" }, { - "BriefDescription": "Average per-core data access bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "L3_Cache_Access_BW" + "BriefDescription": "Probability of Core Bound bottleneck hidden by SMT-profiling artifacts", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(100 * (1 - tma_core_bound / tma_ports_utilization if tma_core_bound < tma_ports_utilization else 1) if tma_info_smt_2t_utilization > 0.5 else 0)", + "MetricGroup": "Cor;SMT", + "MetricName": "tma_info_core_bound_likely", + "MetricThreshold": "tma_info_core_bound_likely > 0.5" }, { - "BriefDescription": "Rate of silent evictions from the L2 cache per Kilo instruction where the evicted lines are dropped (no writeback to L3 or memory)", - "MetricExpr": "1e3 * L2_LINES_OUT.SILENT / Instructions", - "MetricGroup": "L2Evicts;Mem;Server", - "MetricName": "L2_Evictions_Silent_PKI" + "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_clks))", + "MetricGroup": "SMT", + "MetricName": "tma_info_core_clks" }, { - "BriefDescription": "Rate of non silent evictions from the L2 cache per Kilo instruction", - "MetricExpr": "1e3 * L2_LINES_OUT.NON_SILENT / Instructions", - "MetricGroup": "L2Evicts;Mem;Server", - "MetricName": "L2_Evictions_NonSilent_PKI" + "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_core_clks", + "MetricGroup": "Ret;SMT;TmaL1;tma_L1_group", + "MetricName": "tma_info_coreipc" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "L1D_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW_1T" + "BriefDescription": "Cycles Per Instruction (per Logical Processor)", + "MetricExpr": "1 / tma_info_ipc", + "MetricGroup": "Mem;Pipeline", + "MetricName": "tma_info_cpi" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "L2_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW_1T" + "BriefDescription": "Average CPU Utilization", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": "HPC;Summary", + "MetricName": "tma_info_cpu_utilization" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "L3_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW_1T" + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD / OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_data_l2_mlp" }, { - "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "L3_Cache_Access_BW", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "L3_Cache_Access_BW_1T" + "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", + "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e9 / duration_time", + "MetricGroup": "HPC;Mem;MemoryBW;SoC;tma_issueBW", + "MetricName": "tma_info_dram_bw_use", + "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_memory_bandwidth, tma_mem_bandwidth, tma_sq_full" }, { - "BriefDescription": "Average CPU Utilization", - "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", - "MetricGroup": "HPC;Summary", - "MetricName": "CPU_Utilization" + "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_dsb_coverage", + "MetricThreshold": "tma_info_dsb_coverage < 0.7 & tma_info_ipc / 4 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_misses, tma_info_iptb, tma_lcp" }, { - "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", - "MetricExpr": "Turbo_Utilization * TSC / 1e9 / duration_time", - "MetricGroup": "Power;Summary", - "MetricName": "Average_Frequency" + "BriefDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tma_mite))", + "MetricGroup": "DSBmiss;Fed;tma_issueFB", + "MetricName": "tma_info_dsb_misses", + "MetricThreshold": "tma_info_dsb_misses > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_iptb, tma_lcp" }, { - "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE) + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1e9 / duration_time", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "GFLOPs", - "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine." + "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details.", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / DSB2MITE_SWITCHES.COUNT", + "MetricGroup": "DSBmiss", + "MetricName": "tma_info_dsb_switch_cost" }, { - "BriefDescription": "Average Frequency Utilization relative nominal frequency", - "MetricExpr": "CLKS / CPU_CLK_UNHALTED.REF_TSC", - "MetricGroup": "Power", - "MetricName": "Turbo_Utilization" + "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-thread", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@", + "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", + "MetricName": "tma_info_execute" }, { - "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0", - "MetricExpr": "(CORE_POWER.LVL0_TURBO_LICENSE / 2 / CORE_CLKS if #SMT_on else CORE_POWER.LVL0_TURBO_LICENSE / CORE_CLKS)", - "MetricGroup": "Power", - "MetricName": "Power_License0_Utilization", - "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0. This includes non-AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes." + "BriefDescription": "The ratio of Executed- by Issued-Uops", + "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", + "MetricGroup": "Cor;Pipeline", + "MetricName": "tma_info_execute_per_issue", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage." }, { - "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1", - "MetricExpr": "(CORE_POWER.LVL1_TURBO_LICENSE / 2 / CORE_CLKS if #SMT_on else CORE_POWER.LVL1_TURBO_LICENSE / CORE_CLKS)", - "MetricGroup": "Power", - "MetricName": "Power_License1_Utilization", - "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1. This includes high current AVX 256-bit instructions as well as low current AVX 512-bit instructions." + "BriefDescription": "Fill Buffer (FB) hits per kilo instructions for retired demand loads (L1D misses that merge into ongoing miss-handling entries)", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_fb_hpki" }, { - "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX)", - "MetricExpr": "(CORE_POWER.LVL2_TURBO_LICENSE / 2 / CORE_CLKS if #SMT_on else CORE_POWER.LVL2_TURBO_LICENSE / CORE_CLKS)", - "MetricGroup": "Power", - "MetricName": "Power_License2_Utilization", - "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX). This includes high current AVX 512-bit instructions." + "BriefDescription": "Average number of Uops issued by front-end when it issued something", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=1@", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_fetch_upc" }, { - "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", - "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", - "MetricGroup": "SMT", - "MetricName": "SMT_2T_Utilization" + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE) + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / tma_info_core_clks", + "MetricGroup": "Flops;Ret", + "MetricName": "tma_info_flopc" }, { - "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "OS", - "MetricName": "Kernel_Utilization" + "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xfc@) / (2 * tma_info_core_clks)", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_fp_arith_utilization", + "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)." }, { - "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", - "MetricGroup": "OS", - "MetricName": "Kernel_CPI" + "BriefDescription": "Giga Floating Point Operations Per Second", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE) + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1e9 / duration_time", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_gflops", + "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine." }, { - "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", - "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e9 / duration_time", - "MetricGroup": "HPC;Mem;MemoryBW;SoC", - "MetricName": "DRAM_BW_Use" + "BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", + "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", + "MetricName": "tma_info_ic_misses", + "MetricThreshold": "tma_info_ic_misses > 5", + "PublicDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck. Related metrics: " }, { - "BriefDescription": "Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_INSERTS.IA_MISS_DRD) / (Socket_CLKS / duration_time)", - "MetricGroup": "Mem;MemoryLat;SoC", - "MetricName": "MEM_Read_Latency" + "BriefDescription": "Average Latency for L1 instruction cache misses", + "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=1\\,edge@ + 2", + "MetricGroup": "Fed;FetchLat;IcMiss", + "MetricName": "tma_info_icache_miss_latency" }, { - "BriefDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches", - "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD@thresh\\=1@", - "MetricGroup": "Mem;MemoryBW;SoC", - "MetricName": "MEM_Parallel_Reads" + "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else UOPS_EXECUTED.CORE_CYCLES_GE_1)", + "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", + "MetricName": "tma_info_ilp" }, { - "BriefDescription": "Average latency of data read request to external DRAM memory [in nanoseconds]. Accounts for demand loads and L1/L2 data-read prefetches", - "MetricExpr": "1e9 * (UNC_M_RPQ_OCCUPANCY / UNC_M_RPQ_INSERTS) / imc_0@event\\=0x0@", - "MetricGroup": "Mem;MemoryLat;Server;SoC", - "MetricName": "MEM_DRAM_Read_Latency" + "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * (tma_frontend_bound - tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) - tma_info_big_code", + "MetricGroup": "Fed;FetchBW;Frontend", + "MetricName": "tma_info_instruction_fetch_bw", + "MetricThreshold": "tma_info_instruction_fetch_bw > 20" }, { - "BriefDescription": "Average IO (network or disk) Bandwidth Use for Writes [GB / sec]", - "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART0 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART2 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART3) * 4 / 1e9 / duration_time", - "MetricGroup": "IoBW;Mem;Server;SoC", - "MetricName": "IO_Write_BW" + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Summary;TmaL1;tma_L1_group", + "MetricName": "tma_info_instructions", + "PublicDescription": "Total number of retired Instructions. Sample with: INST_RETIRED.PREC_DIST" }, { "BriefDescription": "Average IO (network or disk) Bandwidth Use for Reads [GB / sec]", "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART0 + UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART2 + UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART3) * 4 / 1e9 / duration_time", "MetricGroup": "IoBW;Mem;Server;SoC", - "MetricName": "IO_Read_BW" + "MetricName": "tma_info_io_read_bw" }, { - "BriefDescription": "Socket actual clocks when any core is active on that socket", - "MetricExpr": "cha_0@event\\=0x0@", - "MetricGroup": "SoC", - "MetricName": "Socket_CLKS" + "BriefDescription": "Average IO (network or disk) Bandwidth Use for Writes [GB / sec]", + "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART0 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART2 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART3) * 4 / 1e9 / duration_time", + "MetricGroup": "IoBW;Mem;Server;SoC", + "MetricName": "tma_info_io_write_bw" }, { - "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", - "MetricGroup": "Branches;OS", - "MetricName": "IpFarBranch" + "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "INST_RETIRED.ANY / (cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xfc@)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_iparith", + "MetricThreshold": "tma_info_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). May undercount due to FMA double counting. Approximated prior to BDW." }, { - "BriefDescription": "Uncore frequency per die [GHZ]", - "MetricExpr": "Socket_CLKS / #num_dies / duration_time / 1e9", - "MetricGroup": "SoC", - "MetricName": "UNCORE_FREQ" + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx128", + "MetricThreshold": "tma_info_iparith_avx128 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "Percentage of time spent in the active CPU power state C0", - "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", - "MetricName": "cpu_utilization", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx256", + "MetricThreshold": "tma_info_iparith_avx256 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "CPU operating frequency (in GHz)", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / duration_time", - "MetricName": "cpu_operating_frequency", - "ScaleUnit": "1GHz" + "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx512", + "MetricThreshold": "tma_info_iparith_avx512 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "Cycles per instruction retired; indicating how much time each executed instruction took; in units of cycles.", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY", - "MetricName": "cpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_iparith_scalar_dp", + "MetricThreshold": "tma_info_iparith_scalar_dp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "The ratio of number of completed memory load instructions to the total number completed instructions", - "MetricExpr": "MEM_INST_RETIRED.ALL_LOADS / INST_RETIRED.ANY", - "MetricName": "loads_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_iparith_scalar_sp", + "MetricThreshold": "tma_info_iparith_scalar_sp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "The ratio of number of completed memory store instructions to the total number completed instructions", - "MetricExpr": "MEM_INST_RETIRED.ALL_STORES / INST_RETIRED.ANY", - "MetricName": "stores_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Branches;Fed;InsType", + "MetricName": "tma_info_ipbranch", + "MetricThreshold": "tma_info_ipbranch < 8" }, { - "BriefDescription": "Ratio of number of requests missing L1 data cache (includes data+rfo w/ prefetches) to the total number of completed instructions", - "MetricExpr": "L1D.REPLACEMENT / INST_RETIRED.ANY", - "MetricName": "l1d_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions Per Cycle (per Logical Processor)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_clks", + "MetricGroup": "Ret;Summary", + "MetricName": "tma_info_ipc" }, { - "BriefDescription": "Ratio of number of demand load requests hitting in L1 data cache to the total number of completed instructions ", - "MetricExpr": "MEM_LOAD_RETIRED.L1_HIT / INST_RETIRED.ANY", - "MetricName": "l1d_demand_data_read_hits_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_ipcall", + "MetricThreshold": "tma_info_ipcall < 200" }, { - "BriefDescription": "Ratio of number of code read requests missing in L1 instruction cache (includes prefetches) to the total number of completed instructions", - "MetricExpr": "L2_RQSTS.ALL_CODE_RD / INST_RETIRED.ANY", - "MetricName": "l1_i_code_read_misses_with_prefetches_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per non-speculative DSB miss (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", + "MetricGroup": "DSBmiss;Fed", + "MetricName": "tma_info_ipdsb_miss_ret", + "MetricThreshold": "tma_info_ipdsb_miss_ret < 50" }, { - "BriefDescription": "Ratio of number of completed demand load requests hitting in L2 cache to the total number of completed instructions ", - "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT / INST_RETIRED.ANY", - "MetricName": "l2_demand_data_read_hits_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", + "MetricGroup": "Branches;OS", + "MetricName": "tma_info_ipfarbranch", + "MetricThreshold": "tma_info_ipfarbranch < 1e6" }, { - "BriefDescription": "Ratio of number of requests missing L2 cache (includes code+data+rfo w/ prefetches) to the total number of completed instructions", - "MetricExpr": "L2_LINES_IN.ALL / INST_RETIRED.ANY", - "MetricName": "l2_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE) + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_ipflop", + "MetricThreshold": "tma_info_ipflop < 10" }, { - "BriefDescription": "Ratio of number of completed data read request missing L2 cache to the total number of completed instructions", - "MetricExpr": "MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", - "MetricName": "l2_demand_data_read_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipload", + "MetricThreshold": "tma_info_ipload < 3" }, { - "BriefDescription": "Ratio of number of code read request missing L2 cache to the total number of completed instructions", - "MetricExpr": "L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", - "MetricName": "l2_demand_code_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).", + "MetricExpr": "tma_info_instructions / (UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * cpu@BR_MISP_EXEC.ALL_BRANCHES\\,umask\\=0xE4@)", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_indirect", + "MetricThreshold": "tma_info_ipmisp_indirect < 1e3" }, { - "BriefDescription": "Ratio of number of data read requests missing last level core cache (includes demand w/ prefetches) to the total number of completed instructions", - "MetricExpr": "cha@unc_cha_tor_inserts.ia_miss\\,config1\\=0x12D4043300000000@ / INST_RETIRED.ANY", - "MetricName": "llc_data_read_mpi_demand_plus_prefetch", - "ScaleUnit": "1per_instr" + "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BadSpec;BrMispredicts", + "MetricName": "tma_info_ipmispredict", + "MetricThreshold": "tma_info_ipmispredict < 200" }, { - "BriefDescription": "Ratio of number of code read requests missing last level core cache (includes demand w/ prefetches) to the total number of completed instructions", - "MetricExpr": "cha@unc_cha_tor_inserts.ia_miss\\,config1\\=0x12CC023300000000@ / INST_RETIRED.ANY", - "MetricName": "llc_code_read_mpi_demand_plus_prefetch", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipstore", + "MetricThreshold": "tma_info_ipstore < 8" }, { - "BriefDescription": "Average latency of a last level cache (LLC) demand and prefetch data read miss (read memory access) in nano seconds", - "MetricExpr": "1e9 * (cha@unc_cha_tor_occupancy.ia_miss\\,config1\\=0x4043300000000@ / cha@unc_cha_tor_inserts.ia_miss\\,config1\\=0x4043300000000@) / (UNC_CHA_CLOCKTICKS / (#num_cores / #num_packages * #num_packages)) * duration_time", - "MetricName": "llc_data_read_demand_plus_prefetch_miss_latency", - "ScaleUnit": "1ns" + "BriefDescription": "Instructions per Software prefetch instruction (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umask\\=0xF@", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_ipswpf", + "MetricThreshold": "tma_info_ipswpf < 100" }, { - "BriefDescription": "Average latency of a last level cache (LLC) demand and prefetch data read miss (read memory access) addressed to local memory in nano seconds", - "MetricExpr": "1e9 * (cha@unc_cha_tor_occupancy.ia_miss\\,config1\\=0x4043200000000@ / cha@unc_cha_tor_inserts.ia_miss\\,config1\\=0x4043200000000@) / (UNC_CHA_CLOCKTICKS / (#num_cores / #num_packages * #num_packages)) * duration_time", - "MetricName": "llc_data_read_demand_plus_prefetch_miss_latency_for_local_requests", - "ScaleUnit": "1ns" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", + "MetricName": "tma_info_iptb", + "MetricThreshold": "tma_info_iptb < 9", + "PublicDescription": "Instruction per taken branch. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_dsb_misses, tma_lcp" }, { - "BriefDescription": "Average latency of a last level cache (LLC) demand and prefetch data read miss (read memory access) addressed to remote memory in nano seconds", - "MetricExpr": "1e9 * (cha@unc_cha_tor_occupancy.ia_miss\\,config1\\=0x4043100000000@ / cha@unc_cha_tor_inserts.ia_miss\\,config1\\=0x4043100000000@) / (UNC_CHA_CLOCKTICKS / (#num_cores / #num_packages * #num_packages)) * duration_time", - "MetricName": "llc_data_read_demand_plus_prefetch_miss_latency_for_remote_requests", - "ScaleUnit": "1ns" + "BriefDescription": "Instructions per speculative Unknown Branch Misprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_instructions / BACLEARS.ANY", + "MetricGroup": "Fed", + "MetricName": "tma_info_ipunknown_branch" }, { - "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions", - "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", - "MetricName": "itlb_mpi", - "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the ITLB (Instruction TLB) and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Fraction of branches that are unconditional (direct or indirect) jumps", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - (BR_INST_RETIRED.CONDITIONAL - BR_INST_RETIRED.NOT_TAKEN) - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_jump" }, { - "BriefDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions", - "MetricExpr": "ITLB_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY", - "MetricName": "itlb_large_page_mpi", - "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the Instruction Translation Lookaside Buffer (ITLB) and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_cpi" }, { - "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions", - "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", - "MetricName": "dtlb_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_utilization", + "MetricThreshold": "tma_info_kernel_utilization > 0.05" }, { - "BriefDescription": "Ratio of number of completed page walks (for 2 megabyte page sizes) caused by demand data loads to the total number of completed instructions", - "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY", - "MetricName": "dtlb_2mb_large_page_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the Data Translation Lookaside Buffer (DTLB) and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw" }, { - "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions", - "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", - "MetricName": "dtlb_store_mpi", - "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "tma_info_l1d_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw_1t" }, { - "BriefDescription": "Memory read that miss the last level cache (LLC) addressed to local DRAM as a percentage of total memory read accesses, does not include LLC prefetches.", - "MetricExpr": "cha@unc_cha_tor_inserts.ia_miss\\,config1\\=0x4043200000000@ / (cha@unc_cha_tor_inserts.ia_miss\\,config1\\=0x4043200000000@ + cha@unc_cha_tor_inserts.ia_miss\\,config1\\=0x4043100000000@)", - "MetricName": "numa_reads_addressed_to_local_dram", - "ScaleUnit": "100%" + "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l1mpki" }, { - "BriefDescription": "Memory reads that miss the last level cache (LLC) addressed to remote DRAM as a percentage of total memory read accesses, does not include LLC prefetches.", - "MetricExpr": "cha@unc_cha_tor_inserts.ia_miss\\,config1\\=0x4043100000000@ / (cha@unc_cha_tor_inserts.ia_miss\\,config1\\=0x4043200000000@ + cha@unc_cha_tor_inserts.ia_miss\\,config1\\=0x4043100000000@)", - "MetricName": "numa_reads_addressed_to_remote_dram", - "ScaleUnit": "100%" + "BriefDescription": "L1 cache true misses per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l1mpki_load" }, { - "BriefDescription": "Uncore operating frequency in GHz", - "MetricExpr": "UNC_CHA_CLOCKTICKS / (#num_cores / #num_packages * #num_packages) / 1e9 / duration_time", - "MetricName": "uncore_frequency", - "ScaleUnit": "1GHz" + "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw" }, { - "BriefDescription": "Intel(R) Ultra Path Interconnect (UPI) data transmit bandwidth (MB/sec)", - "MetricExpr": "UNC_UPI_TxL_FLITS.ALL_DATA * 7.111111111111111 / 1e6 / duration_time", - "MetricName": "upi_data_transmit_bw", - "ScaleUnit": "1MB/s" + "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "tma_info_l2_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw_1t" }, { - "BriefDescription": "DDR memory read bandwidth (MB/sec)", - "MetricExpr": "UNC_M_CAS_COUNT.RD * 64 / 1e6 / duration_time", - "MetricName": "memory_bandwidth_read", - "ScaleUnit": "1MB/s" + "BriefDescription": "Rate of non silent evictions from the L2 cache per Kilo instruction", + "MetricExpr": "1e3 * L2_LINES_OUT.NON_SILENT / tma_info_instructions", + "MetricGroup": "L2Evicts;Mem;Server", + "MetricName": "tma_info_l2_evictions_nonsilent_pki" }, { - "BriefDescription": "DDR memory write bandwidth (MB/sec)", - "MetricExpr": "UNC_M_CAS_COUNT.WR * 64 / 1e6 / duration_time", - "MetricName": "memory_bandwidth_write", - "ScaleUnit": "1MB/s" + "BriefDescription": "Rate of silent evictions from the L2 cache per Kilo instruction where the evicted lines are dropped (no writeback to L3 or memory)", + "MetricExpr": "1e3 * L2_LINES_OUT.SILENT / tma_info_instructions", + "MetricGroup": "L2Evicts;Mem;Server", + "MetricName": "tma_info_l2_evictions_silent_pki" }, { - "BriefDescription": "DDR memory bandwidth (MB/sec)", - "MetricExpr": "(UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) * 64 / 1e6 / duration_time", - "MetricName": "memory_bandwidth_total", - "ScaleUnit": "1MB/s" + "BriefDescription": "L2 cache hits per kilo instruction for all request types (including speculative)", + "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2hpki_all" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the CPU.", - "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART0 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART2 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART3) * 4 / 1e6 / duration_time", - "MetricName": "io_bandwidth_disk_or_network_writes", - "ScaleUnit": "1MB/s" + "BriefDescription": "L2 cache hits per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2hpki_load" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the CPU.", - "MetricExpr": "(UNC_IIO_PAYLOAD_BYTES_IN.MEM_WRITE.PART0 + UNC_IIO_PAYLOAD_BYTES_IN.MEM_WRITE.PART1 + UNC_IIO_PAYLOAD_BYTES_IN.MEM_WRITE.PART2 + UNC_IIO_PAYLOAD_BYTES_IN.MEM_WRITE.PART3) * 4 / 1e6 / duration_time", - "MetricName": "io_bandwidth_disk_or_network_reads", - "ScaleUnit": "1MB/s" + "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheMisses;Mem", + "MetricName": "tma_info_l2mpki" }, { - "BriefDescription": "Uops delivered from decoded instruction cache (decoded stream buffer or DSB) as a percent of total uops delivered to Instruction Decode Queue", - "MetricExpr": "IDQ.DSB_UOPS / UOPS_ISSUED.ANY", - "MetricName": "percent_uops_delivered_from_decoded_icache", - "ScaleUnit": "100%" + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all request types (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem;Offcore", + "MetricName": "tma_info_l2mpki_all" }, { - "BriefDescription": "Uops delivered from legacy decode pipeline (Micro-instruction Translation Engine or MITE) as a percent of total uops delivered to Instruction Decode Queue", - "MetricExpr": "IDQ.MITE_UOPS / UOPS_ISSUED.ANY", - "MetricName": "percent_uops_delivered_from_legacy_decode_pipeline", - "ScaleUnit": "100%" + "BriefDescription": "L2 cache true code cacheline misses per kilo instruction", + "MetricExpr": "1e3 * FRONTEND_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_l2mpki_code" }, { - "BriefDescription": "Uops delivered from microcode sequencer (MS) as a percent of total uops delivered to Instruction Decode Queue", - "MetricExpr": "IDQ.MS_UOPS / UOPS_ISSUED.ANY", - "MetricName": "percent_uops_delivered_from_microcode_sequencer", - "ScaleUnit": "100%" + "BriefDescription": "L2 cache speculative code cacheline misses per kilo instruction", + "MetricExpr": "1e3 * L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_l2mpki_code_all" }, { - "BriefDescription": "Bandwidth (MB/sec) of read requests that miss the last level cache (LLC) and go to local memory.", - "MetricExpr": "UNC_CHA_REQUESTS.READS_LOCAL * 64 / 1e6 / duration_time", - "MetricName": "llc_miss_local_memory_bandwidth_read", - "ScaleUnit": "1MB/s" + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2mpki_load" }, { - "BriefDescription": "Bandwidth (MB/sec) of write requests that miss the last level cache (LLC) and go to local memory.", - "MetricExpr": "UNC_CHA_REQUESTS.WRITES_LOCAL * 64 / 1e6 / duration_time", - "MetricName": "llc_miss_local_memory_bandwidth_write", - "ScaleUnit": "1MB/s" + "BriefDescription": "Average per-core data access bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_l3_cache_access_bw" }, { - "BriefDescription": "Bandwidth (MB/sec) of read requests that miss the last level cache (LLC) and go to remote memory.", - "MetricExpr": "UNC_CHA_REQUESTS.READS_REMOTE * 64 / 1e6 / duration_time", - "MetricName": "llc_miss_remote_memory_bandwidth_read", - "ScaleUnit": "1MB/s" + "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "tma_info_l3_cache_access_bw", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_l3_cache_access_bw_1t" }, { - "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", - "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / SLOTS", - "MetricGroup": "PGO;TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_frontend_bound", - "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Machine_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound.", - "ScaleUnit": "100%" + "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", - "MetricExpr": "4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / SLOTS", - "MetricGroup": "Frontend;TopdownL2;tma_L2_group;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_latency", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period.", - "ScaleUnit": "100%" + "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "tma_info_l3_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw_1t" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses.", - "MetricExpr": "(ICACHE_16B.IFDATA_STALL + 2 * cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=0x1\\,edge\\=0x1@) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_icache_misses", - "ScaleUnit": "100%" + "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l3mpki" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses.", - "MetricExpr": "ICACHE_64B.IFTAG_STALL / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_itlb_misses", - "ScaleUnit": "100%" + "BriefDescription": "Average Latency for L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_load_l2_miss_latency" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", - "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / CPU_CLK_UNHALTED.THREAD + tma_unknown_branches", - "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_branch_resteers", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings.", - "ScaleUnit": "100%" + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_load_l2_mlp" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. ", - "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL4;tma_L4_group;tma_branch_resteers_group", - "MetricName": "tma_mispredicts_resteers", - "ScaleUnit": "100%" + "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", + "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_RETIRED.L1_MISS + MEM_LOAD_RETIRED.FB_HIT)", + "MetricGroup": "Mem;MemoryBound;MemoryLat", + "MetricName": "tma_info_load_miss_real_latency" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. ", - "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group", - "MetricName": "tma_clears_resteers", - "ScaleUnit": "100%" + "BriefDescription": "STLB (2nd level TLB) data load speculative misses per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_load_stlb_mpki" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears", - "MetricExpr": "9 * BACLEARS.ANY / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BigFoot;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group", - "MetricName": "tma_unknown_branches", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (First fetch or hitting BPU capacity limit).", - "ScaleUnit": "100%" + "BriefDescription": "Average latency of data read request to external DRAM memory [in nanoseconds]", + "MetricExpr": "1e9 * (UNC_M_RPQ_OCCUPANCY / UNC_M_RPQ_INSERTS) / imc_0@event\\=0x0@", + "MetricGroup": "Mem;MemoryLat;Server;SoC", + "MetricName": "tma_info_mem_dram_read_latency", + "PublicDescription": "Average latency of data read request to external DRAM memory [in nanoseconds]. Accounts for demand loads and L1/L2 data-read prefetches" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_dsb_switches", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty.", - "ScaleUnit": "100%" + "BriefDescription": "Average number of parallel data read requests to external memory", + "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD@thresh\\=1@", + "MetricGroup": "Mem;MemoryBW;SoC", + "MetricName": "tma_info_mem_parallel_reads", + "PublicDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches" }, { - "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", - "MetricExpr": "ILD_STALL.LCP / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_lcp", - "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs.", - "ScaleUnit": "100%" + "BriefDescription": "Average latency of data read request to external memory (in nanoseconds)", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_INSERTS.IA_MISS_DRD) / (tma_info_socket_clks / duration_time)", + "MetricGroup": "Mem;MemoryLat;SoC", + "MetricName": "tma_info_mem_read_latency", + "PublicDescription": "Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches. ([RKL+]memory-controller only)" }, { - "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", - "MetricExpr": "2 * IDQ.MS_SWITCHES / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group", - "MetricName": "tma_ms_switches", - "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals.", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full))) + tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk))", + "MetricGroup": "Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_info_memory_bandwidth", + "MetricThreshold": "tma_info_memory_bandwidth > 20", + "PublicDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_mem_bandwidth, tma_sq_full" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", - "MetricExpr": "tma_frontend_bound - tma_fetch_latency", - "MetricGroup": "FetchBW;Frontend;TopdownL2;tma_L2_group;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_bandwidth", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend.", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency)))", + "MetricGroup": "Mem;MemoryTLB;Offcore;tma_issueTLB", + "MetricName": "tma_info_memory_data_tlbs", + "MetricThreshold": "tma_info_memory_data_tlbs > 20", + "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", - "MetricExpr": "(IDQ.ALL_MITE_CYCLES_ANY_UOPS - IDQ.ALL_MITE_CYCLES_4_UOPS) / CORE_CLKS / 2", - "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", - "MetricName": "tma_mite", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of Memory Latency related bottlenecks (external memory and off-core caches)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound))", + "MetricGroup": "Mem;MemoryLat;Offcore;tma_issueLat", + "MetricName": "tma_info_memory_latency", + "MetricThreshold": "tma_info_memory_latency > 20", + "PublicDescription": "Total pipeline cost of Memory Latency related bottlenecks (external memory and off-core caches). Related metrics: tma_l3_hit_latency, tma_mem_latency" }, { - "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=0x1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=0x2@) / CORE_CLKS", - "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_mite_group", - "MetricName": "tma_decoder0_alone", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_mispredictions", + "MetricThreshold": "tma_info_mispredictions > 20", + "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_branch_misprediction_cost, tma_mispredicts_resteers" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", - "MetricExpr": "(IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4_UOPS) / CORE_CLKS / 2", - "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", - "MetricName": "tma_dsb", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", - "ScaleUnit": "100%" + "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss", + "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES", + "MetricGroup": "Mem;MemoryBW;MemoryBound", + "MetricName": "tma_info_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)" }, { - "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", - "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_bad_speculation", - "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", - "ScaleUnit": "100%" + "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_PENDING + DTLB_STORE_MISSES.WALK_PENDING + EPT.WALK_PENDING) / (2 * tma_info_core_clks)", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_page_walks_utilization", + "MetricThreshold": "tma_info_page_walks_utilization > 0.5" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", - "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL2;tma_L2_group;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_branch_mispredicts", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path.", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0", + "MetricExpr": "(CORE_POWER.LVL0_TURBO_LICENSE / 2 / tma_info_core_clks if #SMT_on else CORE_POWER.LVL0_TURBO_LICENSE / tma_info_core_clks)", + "MetricGroup": "Power", + "MetricName": "tma_info_power_license0_utilization", + "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0. This includes non-AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes." }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", - "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts", - "MetricGroup": "BadSpec;MachineClears;TopdownL2;tma_L2_group;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_machine_clears", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes.", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1", + "MetricExpr": "(CORE_POWER.LVL1_TURBO_LICENSE / 2 / tma_info_core_clks if #SMT_on else CORE_POWER.LVL1_TURBO_LICENSE / tma_info_core_clks)", + "MetricGroup": "Power", + "MetricName": "tma_info_power_license1_utilization", + "MetricThreshold": "tma_info_power_license1_utilization > 0.5", + "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1. This includes high current AVX 256-bit instructions as well as low current AVX 512-bit instructions." }, { - "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", - "MetricExpr": "1 - tma_frontend_bound - (UOPS_ISSUED.ANY + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_backend_bound", - "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound.", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX)", + "MetricExpr": "(CORE_POWER.LVL2_TURBO_LICENSE / 2 / tma_info_core_clks if #SMT_on else CORE_POWER.LVL2_TURBO_LICENSE / tma_info_core_clks)", + "MetricGroup": "Power", + "MetricName": "tma_info_power_license2_utilization", + "MetricThreshold": "tma_info_power_license2_utilization > 0.5", + "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX). This includes high current AVX 512-bit instructions." }, { - "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", - "MetricExpr": "(CYCLE_ACTIVITY.STALLS_MEM_ANY + EXE_ACTIVITY.BOUND_ON_STORES) / (CYCLE_ACTIVITY.STALLS_TOTAL + (EXE_ACTIVITY.1_PORTS_UTIL + UOPS_RETIRED.RETIRE_SLOTS / SLOTS * EXE_ACTIVITY.2_PORTS_UTIL) + EXE_ACTIVITY.BOUND_ON_STORES) * tma_backend_bound", - "MetricGroup": "Backend;TopdownL2;tma_L2_group;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_memory_bound", - "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", - "ScaleUnit": "100%" + "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=1@", + "MetricGroup": "Pipeline;Ret", + "MetricName": "tma_info_retire" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", - "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_MISS) / CPU_CLK_UNHALTED.THREAD, 0)", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", - "MetricName": "tma_l1_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache.", - "ScaleUnit": "100%" + "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", + "MetricExpr": "4 * tma_info_core_clks", + "MetricGroup": "TmaL1;tma_L1_group", + "MetricName": "tma_info_slots" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "min(9 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=0x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_dtlb_load", - "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss.", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", + "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", + "MetricGroup": "SMT", + "MetricName": "tma_info_smt_2t_utilization" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) DTLB was missed by load accesses, that later on hit in second-level TLB (STLB)", - "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", - "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group", - "MetricName": "tma_load_stlb_hit", - "ScaleUnit": "100%" + "BriefDescription": "Socket actual clocks when any core is active on that socket", + "MetricExpr": "cha_0@event\\=0x0@", + "MetricGroup": "SoC", + "MetricName": "tma_info_socket_clks" }, { - "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by load accesses, performing a hardware page walk", - "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group", - "MetricName": "tma_load_stlb_miss", - "ScaleUnit": "100%" + "BriefDescription": "STLB (2nd level TLB) data store speculative misses per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_store_stlb_mpki" }, { - "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", - "MetricExpr": "min(13 * LD_BLOCKS.STORE_FORWARD / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_store_fwd_blk", - "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", - "ScaleUnit": "100%" + "BriefDescription": "Average Frequency Utilization relative nominal frequency", + "MetricExpr": "tma_info_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_turbo_utilization" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", - "MetricExpr": "min((12 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS.ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (11 * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO))) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_lock_latency", - "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them.", - "ScaleUnit": "100%" + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY", + "MetricGroup": "Pipeline;Ret;Retire", + "MetricName": "tma_info_uoppi", + "MetricThreshold": "tma_info_uoppi > 1.05" }, { - "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. ", - "MetricExpr": "min(Load_Miss_Real_Latency * LD_BLOCKS.NO_SR / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_split_loads", - "ScaleUnit": "100%" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW", + "MetricName": "tma_info_uptb", + "MetricThreshold": "tma_info_uptb < 6" }, { - "BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset", - "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_4k_aliasing", - "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses", + "MetricExpr": "ICACHE_64B.IFTAG_STALL / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_itlb_misses", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", - "MetricExpr": "Load_Miss_Real_Latency * cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=0x1@ / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryBW;TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_fb_full", - "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory).", + "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", + "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_MISS) / tma_info_clks, 0)", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group", + "MetricName": "tma_l1_bound", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB_HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads", - "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=0x1@) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / CPU_CLK_UNHALTED.THREAD)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=1@) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_clks)", "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance.", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT_PS", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", - "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STALLS_L3_MISS) / CPU_CLK_UNHALTED.THREAD", + "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STALLS_L3_MISS) / tma_info_clks", "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance.", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", - "MetricExpr": "min(((47.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 3.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE / (OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE + OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + (47.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 3.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_l3_bound_group", - "MetricName": "tma_contested_accesses", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing.", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", - "MetricExpr": "min((47.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 3.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (1 - OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE / (OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE + OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Offcore;Snoop;TopdownL4;tma_L4_group;tma_l3_bound_group", - "MetricName": "tma_data_sharing", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance.", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", - "MetricExpr": "min((20.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 3.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l3_bound_group", + "MetricExpr": "17 * tma_info_average_frequency * MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings.", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", - "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on else OFFCORE_REQUESTS_BUFFER.SQ_FULL) / CORE_CLKS", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_l3_bound_group", - "MetricName": "tma_sq_full", - "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). The Super Queue is used for requests to access the L2 cache or to go out to the Uncore.", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", - "MetricExpr": "min(CYCLE_ACTIVITY.STALLS_L3_MISS / CPU_CLK_UNHALTED.THREAD + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / CPU_CLK_UNHALTED.THREAD - tma_l2_bound, 1)", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", - "MetricName": "tma_dram_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance.", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=0x4@) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group", - "MetricName": "tma_mem_bandwidth", - "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that).", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tma_info_memory_latency, tma_mem_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / CPU_CLK_UNHALTED.THREAD - tma_mem_bandwidth", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group", - "MetricName": "tma_mem_latency", - "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that).", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory", - "MetricExpr": "min((80 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 20.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_group", - "MetricName": "tma_local_dram", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory. Caching will improve the latency and increase performance.", + "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", + "MetricExpr": "ILD_STALL.LCP / tma_info_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_lcp", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_dsb_misses, tma_info_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory", - "MetricExpr": "min((147.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 20.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latency_group", - "MetricName": "tma_remote_dram", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory. This is caused often due to non-optimal NUMA allocations. #link to NUMA article", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", + "MetricExpr": "tma_retiring - tma_heavy_operations", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_light_operations", + "MetricThreshold": "tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues", - "MetricExpr": "min(((110 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 20.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM + (110 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 20.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latency_group", - "MetricName": "tma_remote_cache", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues. This is caused often due to non-optimal NUMA allocations. #link to NUMA article", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations", + "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_2 + UOPS_DISPATCHED_PORT.PORT_3 + UOPS_DISPATCHED_PORT.PORT_7 - UOPS_DISPATCHED_PORT.PORT_4) / (2 * tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_load_op_utilization", + "MetricThreshold": "tma_load_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations. Sample with: UOPS_DISPATCHED.PORT_2_3", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", - "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", - "MetricName": "tma_store_bound", - "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck.", + "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) DTLB was missed by load accesses, that later on hit in second-level TLB (STLB)", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group", + "MetricName": "tma_load_stlb_hit", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", - "MetricExpr": "(L2_RQSTS.RFO_HIT * 11 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_store_bound_group", - "MetricName": "tma_store_latency", - "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full)", + "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by load accesses, performing a hardware page walk", + "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group", + "MetricName": "tma_load_stlb_miss", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing", - "MetricExpr": "min((110 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) * (OFFCORE_RESPONSE.DEMAND_RFO.L3_MISS.REMOTE_HITM + OFFCORE_RESPONSE.PF_L2_RFO.L3_MISS.REMOTE_HITM) + 47.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) * (OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.HITM_OTHER_CORE + OFFCORE_RESPONSE.PF_L2_RFO.L3_HIT.HITM_OTHER_CORE)) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_store_bound_group", - "MetricName": "tma_false_sharing", - "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. ", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory", + "MetricExpr": "59.5 * tma_info_average_frequency * MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_group", + "MetricName": "tma_local_dram", + "MetricThreshold": "tma_local_dram > 0.1 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory. Caching will improve the latency and increase performance. Sample with: MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents rate of split store accesses", - "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / CORE_CLKS", - "MetricGroup": "TopdownL4;tma_L4_group;tma_store_bound_group", - "MetricName": "tma_split_stores", - "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity.", + "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", + "MetricExpr": "(12 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS.ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (11 * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO))) / tma_info_clks", + "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group", + "MetricName": "tma_lock_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOADS_PS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", - "MetricExpr": "min((9 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=0x1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / CORE_CLKS, 1)", - "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_store_bound_group", - "MetricName": "tma_dtlb_store", - "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data.", + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts", + "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the TLB was missed by store accesses, hitting in the second-level TLB (STLB)", - "MetricExpr": "tma_dtlb_store - DTLB_STORE_MISSES.WALK_ACTIVE / CORE_CLKS", - "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", - "MetricName": "tma_store_stlb_hit", + "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=4@) / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW", + "MetricName": "tma_mem_bandwidth", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_info_memory_bandwidth, tma_sq_full", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk", - "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / CORE_CLKS", - "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", - "MetricName": "tma_store_stlb_miss", + "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_clks - tma_mem_bandwidth", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat", + "MetricName": "tma_mem_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_info_memory_latency, tma_l3_hit_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", - "MetricExpr": "tma_backend_bound - tma_memory_bound", - "MetricGroup": "Backend;Compute;TopdownL2;tma_L2_group;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_core_bound", - "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", + "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(CYCLE_ACTIVITY.STALLS_MEM_ANY + EXE_ACTIVITY.BOUND_ON_STORES) / (CYCLE_ACTIVITY.STALLS_TOTAL + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL) + EXE_ACTIVITY.BOUND_ON_STORES) * tma_backend_bound", + "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", - "MetricExpr": "ARITH.DIVIDER_ACTIVE / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", - "MetricName": "tma_divider", - "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication.", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations -- uops for memory load or store accesses.", + "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_RETIRED.ANY", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_memory_operations", + "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operations > 0.6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", - "MetricExpr": "((EXE_ACTIVITY.EXE_BOUND_0_PORTS + (EXE_ACTIVITY.1_PORTS_UTIL + UOPS_RETIRED.RETIRE_SLOTS / SLOTS * EXE_ACTIVITY.2_PORTS_UTIL)) / CPU_CLK_UNHALTED.THREAD if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + UOPS_RETIRED.RETIRE_SLOTS / SLOTS * EXE_ACTIVITY.2_PORTS_UTIL) / CPU_CLK_UNHALTED.THREAD)", - "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group", - "MetricName": "tma_ports_utilization", - "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", + "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / tma_info_slots", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS", + "MetricName": "tma_microcode_sequencer", + "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "(UOPS_EXECUTED.CORE_CYCLES_NONE / 2 if #SMT_on else CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY) / CORE_CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_0", - "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage", + "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_clks", + "MetricGroup": "BadSpec;BrMispredicts;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueBM", + "MetricName": "tma_mispredicts_resteers", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_branch_mispredicts, tma_info_branch_misprediction_cost, tma_info_mispredictions", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations", - "MetricExpr": "PARTIAL_RAT_STALLS.SCOREBOARD / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_0_group", - "MetricName": "tma_serializing_operation", - "PublicDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations. Instructions like CPUID; WRMSR or LFENCE serialize the out-of-order execution which may limit performance.", + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", + "MetricExpr": "(IDQ.ALL_MITE_CYCLES_ANY_UOPS - IDQ.ALL_MITE_CYCLES_4_UOPS) / tma_info_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_mite", + "MetricThreshold": "tma_mite > 0.1 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sample with: FRONTEND_RETIRED.ANY_DSB_MISS", "ScaleUnit": "100%" }, { "BriefDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD * UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY, 1)", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_0_group", + "MetricExpr": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY", + "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utilized_0_group", "MetricName": "tma_mixing_vectors", - "PublicDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued. Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic.", + "MetricThreshold": "tma_mixing_vectors > 0.05", + "PublicDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued. Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic. Related metrics: tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_1 - UOPS_EXECUTED.CORE_CYCLES_GE_2) / 2 if #SMT_on else EXE_ACTIVITY.1_PORTS_UTIL) / CORE_CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_1", - "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful.", + "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", + "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_clks", + "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_2 - UOPS_EXECUTED.CORE_CYCLES_GE_3) / 2 if #SMT_on else EXE_ACTIVITY.2_PORTS_UTIL) / CORE_CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_2", - "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop.", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused", + "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHES - UOPS_RETIRED.MACRO_FUSED) / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_non_fused_branches", + "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused. Non-conditional branches like direct JMP or CALL would count here. Can be used to examine fusible conditional jumps that were not fused.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise).", - "MetricExpr": "(UOPS_EXECUTED.CORE_CYCLES_GE_3 / 2 if #SMT_on else UOPS_EXECUTED.CORE_CYCLES_GE_3) / CORE_CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_3m", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_nop_instructions", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions. Compilers often use NOPs for certain address alignments - e.g. start address of a function or loop body. Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", - "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / SLOTS", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", - "MetricName": "tma_alu_op_utilization", + "BriefDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes", + "MetricExpr": "max(0, tma_light_operations - (tma_fp_arith + tma_memory_operations + tma_fused_instructions + tma_non_fused_branches + tma_nop_instructions))", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_other_light_ops", + "MetricThreshold": "tma_other_light_ops > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes. May undercount due to FMA double counting", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_0 / CORE_CLKS", - "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilization_group", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_0 / tma_info_core_clks", + "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", "MetricName": "tma_port_0", + "MetricThreshold": "tma_port_0 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_1 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_1 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", "MetricName": "tma_port_1", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_5 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group", - "MetricName": "tma_port_5", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_6 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group", - "MetricName": "tma_port_6", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations", - "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_2 + UOPS_DISPATCHED_PORT.PORT_3 + UOPS_DISPATCHED_PORT.PORT_7 - UOPS_DISPATCHED_PORT.PORT_4) / (2 * CORE_CLKS)", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", - "MetricName": "tma_load_op_utilization", + "MetricThreshold": "tma_port_1 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [ICL+] Loads)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_2 / CORE_CLKS", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_2 / tma_info_core_clks", "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_group", "MetricName": "tma_port_2", + "MetricThreshold": "tma_port_2 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [ICL+] Loads). Sample with: UOPS_DISPATCHED_PORT.PORT_2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [ICL+] Loads)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_3 / CORE_CLKS", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_3 / tma_info_core_clks", "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_group", "MetricName": "tma_port_3", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_4 / CORE_CLKS", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", - "MetricName": "tma_store_op_utilization", + "MetricThreshold": "tma_port_3 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [ICL+] Loads). Sample with: UOPS_DISPATCHED_PORT.PORT_3", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data)", "MetricExpr": "tma_store_op_utilization", - "MetricGroup": "TopdownL6;tma_L6_group;tma_store_op_utilization_group", + "MetricGroup": "TopdownL6;tma_L6_group;tma_issueSpSt;tma_store_op_utilization_group", "MetricName": "tma_port_4", + "MetricThreshold": "tma_port_4 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data). Sample with: UOPS_DISPATCHED_PORT.PORT_4. Related metrics: tma_split_stores", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 7 ([HSW+]simple Store-address)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_7 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_L6_group;tma_store_op_utilization_group", - "MetricName": "tma_port_7", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_5 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_5", + "MetricThreshold": "tma_port_5 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_retiring", - "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. ", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_6 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_6", + "MetricThreshold": "tma_port_6 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_6. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", - "MetricExpr": "tma_retiring - (UOPS_RETIRED.RETIRE_SLOTS + UOPS_RETIRED.MACRO_FUSED - INST_RETIRED.ANY) / SLOTS", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_L2_group;tma_retiring_group", - "MetricName": "tma_light_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved.", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 7 ([HSW+]simple Store-address)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_7 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_store_op_utilization_group", + "MetricName": "tma_port_7", + "MetricThreshold": "tma_port_7 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 7 ([HSW+]simple Store-address). Sample with: UOPS_DISPATCHED_PORT.PORT_7", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", - "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD + tma_fp_scalar + tma_fp_vector", - "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group", - "MetricName": "tma_fp_arith", - "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", + "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", + "MetricExpr": "((EXE_ACTIVITY.EXE_BOUND_0_PORTS + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / tma_info_clks if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_clks)", + "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_ports_utilization", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric serves as an approximation of legacy x87 usage", - "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD", - "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", - "MetricName": "tma_x87_use", - "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", + "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "(UOPS_EXECUTED.CORE_CYCLES_NONE / 2 if #SMT_on else CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY) / tma_info_core_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_0", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE) / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group", - "MetricName": "tma_fp_scalar", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting.", + "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_1 - UOPS_EXECUTED.CORE_CYCLES_GE_2) / 2 if #SMT_on else EXE_ACTIVITY.1_PORTS_UTIL) / tma_info_core_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_1", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths", - "MetricExpr": "min((FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS, 1)", - "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group", - "MetricName": "tma_fp_vector", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting.", + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_2 - UOPS_EXECUTED.CORE_CYCLES_GE_3) / 2 if #SMT_on else EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_core_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors", - "MetricExpr": "min((FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS, 1)", - "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group", - "MetricName": "tma_fp_vector_128b", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting.", + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise).", + "MetricExpr": "(UOPS_EXECUTED.CORE_CYCLES_GE_3 / 2 if #SMT_on else UOPS_EXECUTED.CORE_CYCLES_GE_3) / tma_info_core_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_3m", + "MetricThreshold": "tma_ports_utilized_3m > 0.7 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors", - "MetricExpr": "min((FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS, 1)", - "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group", - "MetricName": "tma_fp_vector_256b", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting.", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "(89.5 * tma_info_average_frequency * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM + 89.5 * tma_info_average_frequency * MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_issueSyncxn;tma_mem_latency_group", + "MetricName": "tma_remote_cache", + "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues. This is caused often due to non-optimal NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM_PS;MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD_PS. Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors", - "MetricExpr": "min((FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS, 1)", - "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group", - "MetricName": "tma_fp_vector_512b", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors. May overcount due to FMA double counting.", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory", + "MetricExpr": "127 * tma_info_average_frequency * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latency_group", + "MetricName": "tma_remote_dram", + "MetricThreshold": "tma_remote_dram > 0.1 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory. This is caused often due to non-optimal NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations -- uops for memory load or store accesses.", - "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_RETIRED.ANY", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", - "MetricName": "tma_memory_operations", + "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1", + "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.RETIRE_SLOTS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions", - "MetricExpr": "tma_light_operations * UOPS_RETIRED.MACRO_FUSED / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", - "MetricName": "tma_fused_instructions", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions. The instruction pairs of CMP+JCC or DEC+JCC are commonly used examples.", + "BriefDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations", + "MetricExpr": "PARTIAL_RAT_STALLS.SCOREBOARD / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL5;tma_L5_group;tma_issueSO;tma_ports_utilized_0_group", + "MetricName": "tma_serializing_operation", + "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations. Instructions like CPUID; WRMSR or LFENCE serialize the out-of-order execution which may limit performance. Sample with: PARTIAL_RAT_STALLS.SCOREBOARD. Related metrics: tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused", - "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHES - UOPS_RETIRED.MACRO_FUSED) / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", - "MetricName": "tma_non_fused_branches", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused. Non-conditional branches like direct JMP or CALL would count here. Can be used to examine fusible conditional jumps that were not fused.", + "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "tma_info_load_miss_real_latency * LD_BLOCKS.NO_SR / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_split_loads", + "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions", - "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", - "MetricName": "tma_nop_instructions", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions. Compilers often use NOPs for certain address alignments - e.g. start address of a function or loop body.", + "BriefDescription": "This metric represents rate of split store accesses", + "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group", + "MetricName": "tma_split_stores", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_4", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes. May undercount due to FMA double counting", - "MetricExpr": "max(0, tma_light_operations - (tma_fp_arith + tma_memory_operations + tma_fused_instructions + tma_non_fused_branches + tma_nop_instructions))", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", - "MetricName": "tma_other_light_ops", + "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", + "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on else OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group", + "MetricName": "tma_sq_full", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_info_memory_bandwidth, tma_mem_bandwidth", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences", - "MetricExpr": "(UOPS_RETIRED.RETIRE_SLOTS + UOPS_RETIRED.MACRO_FUSED - INST_RETIRED.ANY) / SLOTS", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_L2_group;tma_retiring_group", - "MetricName": "tma_heavy_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences. This highly-correlates with the uop length of these instructions/sequences.", + "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", + "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops", - "MetricExpr": "tma_heavy_operations - UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / SLOTS", - "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group", - "MetricName": "tma_few_uops_instructions", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number of uops in such instructions.", + "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", + "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_store_fwd_blk", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / SLOTS", - "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group", - "MetricName": "tma_microcode_sequencer", - "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.", + "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "(L2_RQSTS.RFO_HIT * 11 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_clks", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group", + "MetricName": "tma_store_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", - "MetricExpr": "min(100 * (FP_ASSIST.ANY + OTHER_ASSISTS.ANY) / SLOTS, 1)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", - "MetricName": "tma_assists", - "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases.", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_4 / tma_info_core_clks", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_store_op_utilization", + "MetricThreshold": "tma_store_op_utilization > 0.6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", - "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", - "MetricName": "tma_cisc", - "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", + "BriefDescription": "This metric roughly estimates the fraction of cycles where the TLB was missed by store accesses, hitting in the second-level TLB (STLB)", + "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", + "MetricName": "tma_store_stlb_hit", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "C3 residency percent per core", - "MetricExpr": "cstate_core@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Core_Residency", + "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk", + "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_clks", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", + "MetricName": "tma_store_stlb_miss", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per core", - "MetricExpr": "cstate_core@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Core_Residency", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears", + "MetricExpr": "9 * BACLEARS.ANY / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group", + "MetricName": "tma_unknown_branches", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (e.g. first time the branch is fetched or hitting BTB capacity limit). Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, { - "BriefDescription": "C7 residency percent per core", - "MetricExpr": "cstate_core@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Core_Residency", + "BriefDescription": "This metric serves as an approximation of legacy x87 usage", + "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD", + "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", + "MetricName": "tma_x87_use", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", "ScaleUnit": "100%" }, { - "BriefDescription": "C2 residency percent per package", - "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C2_Pkg_Residency", + "BriefDescription": "Percentage of cycles in aborted transactions.", + "MetricExpr": "max(cpu@cycles\\-t@ - cpu@cycles\\-ct@, 0) / cycles", + "MetricGroup": "transaction", + "MetricName": "tsx_aborted_cycles", "ScaleUnit": "100%" }, { - "BriefDescription": "C3 residency percent per package", - "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Pkg_Residency", - "ScaleUnit": "100%" + "BriefDescription": "Number of cycles within a transaction divided by the number of elisions.", + "MetricExpr": "cpu@cycles\\-t@ / cpu@el\\-start@", + "MetricGroup": "transaction", + "MetricName": "tsx_cycles_per_elision", + "ScaleUnit": "1cycles / elision" }, { - "BriefDescription": "C6 residency percent per package", - "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Pkg_Residency", - "ScaleUnit": "100%" + "BriefDescription": "Number of cycles within a transaction divided by the number of transactions.", + "MetricExpr": "cpu@cycles\\-t@ / cpu@tx\\-start@", + "MetricGroup": "transaction", + "MetricName": "tsx_cycles_per_transaction", + "ScaleUnit": "1cycles / transaction" }, { - "BriefDescription": "C7 residency percent per package", - "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Pkg_Residency", + "BriefDescription": "Percentage of cycles within a transaction region.", + "MetricExpr": "cpu@cycles\\-t@ / cycles", + "MetricGroup": "transaction", + "MetricName": "tsx_transactional_cycles", "ScaleUnit": "100%" } ] diff --git a/tools/perf/pmu-events/arch/x86/skylakex/uncore-memory.json b/tools/perf/pmu-events/arch/x86/skylakex/uncore-memory.json index 8784118123b4..e0840c24e7aa 100644 --- a/tools/perf/pmu-events/arch/x86/skylakex/uncore-memory.json +++ b/tools/perf/pmu-events/arch/x86/skylakex/uncore-memory.json @@ -1952,7 +1952,7 @@ "EventCode": "0x81", "EventName": "UNC_M_WPQ_OCCUPANCY", "PerPkg": "1", - "PublicDescription": "Counts the number of entries in the Write Pending Queue (WPQ) at each cycle. This can then be used to calculate both the average queue occupancy (in conjunction with the number of cycles not empty) and the average latency (in conjunction with the number of allocations). The WPQ is used to schedule writes out to the memory controller and to track the requests. Requests allocate into the WPQ soon after they enter the memory controller, and need credits for an entry in this buffer before being sent from the CHA to the iMC (memory controller). They deallocate after being issued to DRAM. Write requests themselves are able to complete (from the perspective of the rest of the system) as soon they have 'posted' to the iMC. This is not to be confused with actually performing the write to DRAM. Therefore, the average latency for this queue is actually not useful for deconstruction intermediate write latencies. So, we provide filtering based on if the request has posted or not. By using the 'not posted' filter, we can track how long writes spent in the iMC before completions were sent to the HA. The 'posted' filter, on the other hand, provides information about how much queueing is actually happenning in the iMC for writes before they are actually issued to memory. High average occupancies will generally coincide with high write major mode counts. Is there a filter of sorts???", + "PublicDescription": "Counts the number of entries in the Write Pending Queue (WPQ) at each cycle. This can then be used to calculate both the average queue occupancy (in conjunction with the number of cycles not empty) and the average latency (in conjunction with the number of allocations). The WPQ is used to schedule writes out to the memory controller and to track the requests. Requests allocate into the WPQ soon after they enter the memory controller, and need credits for an entry in this buffer before being sent from the CHA to the iMC (memory controller). They deallocate after being issued to DRAM. Write requests themselves are able to complete (from the perspective of the rest of the system) as soon they have 'posted' to the iMC. This is not to be confused with actually performing the write to DRAM. Therefore, the average latency for this queue is actually not useful for deconstruction intermediate write latencies. So, we provide filtering based on if the request has posted or not. By using the 'not posted' filter, we can track how long writes spent in the iMC before completions were sent to the HA. The 'posted' filter, on the other hand, provides information about how much queueing is actually happening in the iMC for writes before they are actually issued to memory. High average occupancies will generally coincide with high write major mode counts. Is there a filter of sorts???", "Unit": "iMC" }, { diff --git a/tools/perf/pmu-events/arch/x86/skylakex/uncore-other.json b/tools/perf/pmu-events/arch/x86/skylakex/uncore-other.json index 37003115c6c7..92a4bdcd4bd7 100644 --- a/tools/perf/pmu-events/arch/x86/skylakex/uncore-other.json +++ b/tools/perf/pmu-events/arch/x86/skylakex/uncore-other.json @@ -804,7 +804,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Counts the number of Allocate/Update to HitMe Cache; Deallocate HtiME$ on Reads without RspFwdI*", + "BriefDescription": "Counts the number of Allocate/Update to HitMe Cache; Deallocate HitME$ on Reads without RspFwdI*", "EventCode": "0x61", "EventName": "UNC_CHA_HITME_UPDATE.DEALLOCATE", "PerPkg": "1", @@ -3321,7 +3321,7 @@ "EventCode": "0x5C", "EventName": "UNC_CHA_SNOOP_RESP.RSP_FWD_WB", "PerPkg": "1", - "PublicDescription": "Counts when a transaction with the opcode type Rsp*Fwd*WB Snoop Response was received which indicates the data was written back to its home socket, and the cacheline was forwarded to the requestor socket. This snoop response is only used in >= 4 socket systems. It is used when a snoop HITM's in a remote caching agent and it directly forwards data to a requestor, and simultaneously returns data to its home socket to be written back to memory.", + "PublicDescription": "Counts when a transaction with the opcode type Rsp*Fwd*WB Snoop Response was received which indicates the data was written back to its home socket, and the cacheline was forwarded to the requestor socket. This snoop response is only used in >= 4 socket systems. It is used when a snoop HITM's in a remote caching agent and it directly forwards data to a requestor, and simultaneously returns data to its home socket to be written back to memory.", "UMask": "0x20", "Unit": "CHA" }, @@ -4039,20 +4039,22 @@ "Unit": "CHA" }, { - "BriefDescription": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_CRD", + "BriefDescription": "TOR Occupancy : CRds issued by iA Cores that Hit the LLC", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_CRD", "Filter": "config1=0x40233", "PerPkg": "1", + "PublicDescription": "TOR Occupancy : CRds issued by iA Cores that Hit the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.", "UMask": "0x11", "Unit": "CHA" }, { - "BriefDescription": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_DRD", + "BriefDescription": "TOR Occupancy : DRds issued by iA Cores that Hit the LLC", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_DRD", "Filter": "config1=0x40433", "PerPkg": "1", + "PublicDescription": "TOR Occupancy : DRds issued by iA Cores that Hit the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.", "UMask": "0x11", "Unit": "CHA" }, @@ -4075,20 +4077,22 @@ "Unit": "CHA" }, { - "BriefDescription": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_LlcPrefRFO", + "BriefDescription": "TOR Occupancy : LLCPrefRFO issued by iA Cores that hit the LLC", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_LlcPrefRFO", "Filter": "config1=0x4b033", "PerPkg": "1", + "PublicDescription": "TOR Occupancy : LLCPrefRFO issued by iA Cores that hit the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.", "UMask": "0x11", "Unit": "CHA" }, { - "BriefDescription": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_RFO", + "BriefDescription": "TOR Occupancy : RFOs issued by iA Cores that Hit the LLC", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_RFO", "Filter": "config1=0x40033", "PerPkg": "1", + "PublicDescription": "TOR Occupancy : RFOs issued by iA Cores that Hit the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.", "UMask": "0x11", "Unit": "CHA" }, @@ -4102,20 +4106,22 @@ "Unit": "CHA" }, { - "BriefDescription": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_CRD", + "BriefDescription": "TOR Occupancy : CRds issued by iA Cores that Missed the LLC", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_CRD", "Filter": "config1=0x40233", "PerPkg": "1", + "PublicDescription": "TOR Occupancy : CRds issued by iA Cores that Missed the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.", "UMask": "0x21", "Unit": "CHA" }, { - "BriefDescription": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD", + "BriefDescription": "TOR Occupancy : DRds issued by iA Cores that Missed the LLC", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD", "Filter": "config1=0x40433", "PerPkg": "1", + "PublicDescription": "TOR Occupancy : DRds issued by iA Cores that Missed the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.", "UMask": "0x21", "Unit": "CHA" }, @@ -4138,20 +4144,22 @@ "Unit": "CHA" }, { - "BriefDescription": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_LlcPrefRFO", + "BriefDescription": "TOR Occupancy : LLCPrefRFO issued by iA Cores that missed the LLC", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_LlcPrefRFO", "Filter": "config1=0x4b033", "PerPkg": "1", + "PublicDescription": "TOR Occupancy : LLCPrefRFO issued by iA Cores that missed the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.", "UMask": "0x21", "Unit": "CHA" }, { - "BriefDescription": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_RFO", + "BriefDescription": "TOR Occupancy : RFOs issued by iA Cores that Missed the LLC", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_RFO", "Filter": "config1=0x40033", "PerPkg": "1", + "PublicDescription": "TOR Occupancy : RFOs issued by iA Cores that Missed the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.", "UMask": "0x21", "Unit": "CHA" }, @@ -11909,7 +11917,7 @@ "Unit": "IIO" }, { - "BriefDescription": "UNC_IIO_NOTHING", + "BriefDescription": "Counting disabled", "EventName": "UNC_IIO_NOTHING", "PerPkg": "1", "Unit": "IIO" @@ -15507,7 +15515,7 @@ "EventCode": "0xC", "EventName": "UNC_I_TxS_REQUEST_OCCUPANCY", "PerPkg": "1", - "PublicDescription": "Accumultes the number of outstanding outbound requests from the IRP to the switch (towards the devices). This can be used in conjuection with the allocations event in order to calculate average latency of outbound requests.", + "PublicDescription": "Accumulates the number of outstanding outbound requests from the IRP to the switch (towards the devices). This can be used in conjunction with the allocations event in order to calculate average latency of outbound requests.", "Unit": "IRP" }, { @@ -16013,35 +16021,35 @@ "Unit": "M2M" }, { - "BriefDescription": "Number of reads in which direct to Intel UPI transactions were overridden", + "BriefDescription": "Number of reads in which direct to Intel(R) UPI transactions were overridden", "EventCode": "0x28", "EventName": "UNC_M2M_DIRECT2UPI_NOT_TAKEN_CREDITS", "PerPkg": "1", - "PublicDescription": "Counts reads in which direct to Intel Ultra Path Interconnect (UPI) transactions (which would have bypassed the CHA) were overridden", + "PublicDescription": "Counts reads in which direct to Intel(R) Ultra Path Interconnect (UPI) transactions (which would have bypassed the CHA) were overridden", "Unit": "M2M" }, { - "BriefDescription": "Cycles when direct to Intel UPI was disabled", + "BriefDescription": "Cycles when direct to Intel(R) UPI was disabled", "EventCode": "0x27", "EventName": "UNC_M2M_DIRECT2UPI_NOT_TAKEN_DIRSTATE", "PerPkg": "1", - "PublicDescription": "Counts cycles when the ability to send messages direct to the Intel Ultra Path Interconnect (bypassing the CHA) was disabled", + "PublicDescription": "Counts cycles when the ability to send messages direct to the Intel(R) Ultra Path Interconnect (bypassing the CHA) was disabled", "Unit": "M2M" }, { - "BriefDescription": "Messages sent direct to the Intel UPI", + "BriefDescription": "Messages sent direct to the Intel(R) UPI", "EventCode": "0x26", "EventName": "UNC_M2M_DIRECT2UPI_TAKEN", "PerPkg": "1", - "PublicDescription": "Counts when messages were sent direct to the Intel Ultra Path Interconnect (bypassing the CHA)", + "PublicDescription": "Counts when messages were sent direct to the Intel(R) Ultra Path Interconnect (bypassing the CHA)", "Unit": "M2M" }, { - "BriefDescription": "Number of reads that a message sent direct2 Intel UPI was overridden", + "BriefDescription": "Number of reads that a message sent direct2 Intel(R) UPI was overridden", "EventCode": "0x29", "EventName": "UNC_M2M_DIRECT2UPI_TXN_OVERRIDE", "PerPkg": "1", - "PublicDescription": "Counts when a read message that was sent direct to the Intel Ultra Path Interconnect (bypassing the CHA) was overridden", + "PublicDescription": "Counts when a read message that was sent direct to the Intel(R) Ultra Path Interconnect (bypassing the CHA) was overridden", "Unit": "M2M" }, { @@ -20785,7 +20793,7 @@ "EventCode": "0x61", "EventName": "UNC_M3UPI_RxC_CRD_OCC.FLITS_IN_FIFO", "PerPkg": "1", - "PublicDescription": "Occupancy of m3upi ingress -> upi link layer bgf; packets (flits) in fifo", + "PublicDescription": "Occupancy of m3upi ingress -> upi link layer bgf; packets (flits) in fifo", "UMask": "0x2", "Unit": "M3UPI" }, @@ -20794,7 +20802,7 @@ "EventCode": "0x61", "EventName": "UNC_M3UPI_RxC_CRD_OCC.FLITS_IN_PATH", "PerPkg": "1", - "PublicDescription": "Occupancy of m3upi ingress -> upi link layer bgf; packets (flits) in path (i.e. pipe to fifo or fifo)", + "PublicDescription": "Occupancy of m3upi ingress -> upi link layer bgf; packets (flits) in path (i.e. pipe to fifo or fifo)", "UMask": "0x4", "Unit": "M3UPI" }, @@ -21180,7 +21188,7 @@ "Unit": "M3UPI" }, { - "BriefDescription": "Flit Gen - Header 1; Acumullate", + "BriefDescription": "Flit Gen - Header 1; Accumulate", "EventCode": "0x53", "EventName": "UNC_M3UPI_RxC_FLIT_GEN_HDR1.ACCUM", "PerPkg": "1", @@ -21851,7 +21859,7 @@ "Unit": "M3UPI" }, { - "BriefDescription": "Remote VNA Credits; Level < 1", + "BriefDescription": "Remote VNA Credits; Level < 1", "EventCode": "0x5B", "EventName": "UNC_M3UPI_RxC_VNA_CRD.LT1", "PerPkg": "1", @@ -21860,7 +21868,7 @@ "Unit": "M3UPI" }, { - "BriefDescription": "Remote VNA Credits; Level < 4", + "BriefDescription": "Remote VNA Credits; Level < 4", "EventCode": "0x5B", "EventName": "UNC_M3UPI_RxC_VNA_CRD.LT4", "PerPkg": "1", @@ -21869,7 +21877,7 @@ "Unit": "M3UPI" }, { - "BriefDescription": "Remote VNA Credits; Level < 5", + "BriefDescription": "Remote VNA Credits; Level < 5", "EventCode": "0x5B", "EventName": "UNC_M3UPI_RxC_VNA_CRD.LT5", "PerPkg": "1", @@ -24401,7 +24409,7 @@ "EventCode": "0x29", "EventName": "UNC_M3UPI_UPI_PREFETCH_SPAWN", "PerPkg": "1", - "PublicDescription": "Count cases where flow control queue that sits between the Intel Ultra Path Interconnect (UPI) and the mesh spawns a prefetch to the iMC (Memory Controller)", + "PublicDescription": "Count cases where flow control queue that sits between the Intel(R) Ultra Path Interconnect (UPI) and the mesh spawns a prefetch to the iMC (Memory Controller)", "Unit": "M3UPI" }, { @@ -24756,11 +24764,11 @@ "Unit": "M2M" }, { - "BriefDescription": "Clocks of the Intel Ultra Path Interconnect (UPI)", + "BriefDescription": "Clocks of the Intel(R) Ultra Path Interconnect (UPI)", "EventCode": "0x1", "EventName": "UNC_UPI_CLOCKTICKS", "PerPkg": "1", - "PublicDescription": "Counts clockticks of the fixed frequency clock controlling the Intel Ultra Path Interconnect (UPI). This clock runs at1/8th the 'GT/s' speed of the UPI link. For example, a 9.6GT/s link will have a fixed Frequency of 1.2 Ghz.", + "PublicDescription": "Counts clockticks of the fixed frequency clock controlling the Intel(R) Ultra Path Interconnect (UPI). This clock runs at1/8th the 'GT/s' speed of the UPI link. For example, a 9.6GT/s link will have a fixed Frequency of 1.2 Ghz.", "Unit": "UPI LL" }, { @@ -24782,11 +24790,11 @@ "Unit": "UPI LL" }, { - "BriefDescription": "Data Response packets that go direct to Intel UPI", + "BriefDescription": "Data Response packets that go direct to Intel(R) UPI", "EventCode": "0x12", "EventName": "UNC_UPI_DIRECT_ATTEMPTS.D2U", "PerPkg": "1", - "PublicDescription": "Counts Data Response (DRS) packets that attempted to go direct to Intel Ultra Path Interconnect (UPI) bypassing the CHA .", + "PublicDescription": "Counts Data Response (DRS) packets that attempted to go direct to Intel(R) Ultra Path Interconnect (UPI) bypassing the CHA .", "UMask": "0x2", "Unit": "UPI LL" }, @@ -24855,11 +24863,11 @@ "Unit": "UPI LL" }, { - "BriefDescription": "Cycles Intel UPI is in L1 power mode (shutdown)", + "BriefDescription": "Cycles Intel(R) UPI is in L1 power mode (shutdown)", "EventCode": "0x21", "EventName": "UNC_UPI_L1_POWER_CYCLES", "PerPkg": "1", - "PublicDescription": "Counts cycles when the Intel Ultra Path Interconnect (UPI) is in L1 power mode. L1 is a mode that totally shuts down the UPI link. Link power states are per link and per direction, so for example the Tx direction could be in one state while Rx was in another, this event only coutns when both links are shutdown.", + "PublicDescription": "Counts cycles when the Intel(R) Ultra Path Interconnect (UPI) is in L1 power mode. L1 is a mode that totally shuts down the UPI link. Link power states are per link and per direction, so for example the Tx direction could be in one state while Rx was in another, this event only coutns when both links are shutdown.", "Unit": "UPI LL" }, { @@ -25021,11 +25029,11 @@ "Unit": "UPI LL" }, { - "BriefDescription": "Cycles the Rx of the Intel UPI is in L0p power mode", + "BriefDescription": "Cycles the Rx of the Intel(R) UPI is in L0p power mode", "EventCode": "0x25", "EventName": "UNC_UPI_RxL0P_POWER_CYCLES", "PerPkg": "1", - "PublicDescription": "Counts cycles when the receive side (Rx) of the Intel Ultra Path Interconnect(UPI) is in L0p power mode. L0p is a mode where we disable 60% of the UPI lanes, decreasing our bandwidth in order to save power.", + "PublicDescription": "Counts cycles when the receive side (Rx) of the Intel(R) Ultra Path Interconnect(UPI) is in L0p power mode. L0p is a mode where we disable 60% of the UPI lanes, decreasing our bandwidth in order to save power.", "Unit": "UPI LL" }, { @@ -25218,7 +25226,7 @@ "EventCode": "0x8", "EventName": "UNC_UPI_RxL_CRC_LLR_REQ_TRANSMIT", "PerPkg": "1", - "PublicDescription": "Number of LLR Requests were transmitted. This should generally be <= the number of CRC errors detected. If multiple errors are detected before the Rx side receives a LLC_REQ_ACK from the Tx side, there is no need to send more LLR_REQ_NACKs.", + "PublicDescription": "Number of LLR Requests were transmitted. This should generally be <= the number of CRC errors detected. If multiple errors are detected before the Rx side receives a LLC_REQ_ACK from the Tx side, there is no need to send more LLR_REQ_NACKs.", "Unit": "UPI LL" }, { @@ -25250,7 +25258,7 @@ "EventCode": "0x3", "EventName": "UNC_UPI_RxL_FLITS.ALL_DATA", "PerPkg": "1", - "PublicDescription": "Counts valid data FLITs (80 bit FLow control unITs: 64bits of data) received from any of the 3 Intel Ultra Path Interconnect (UPI) Receive Queue slots on this UPI unit.", + "PublicDescription": "Counts valid data FLITs (80 bit FLow control unITs: 64bits of data) received from any of the 3 Intel(R) Ultra Path Interconnect (UPI) Receive Queue slots on this UPI unit.", "UMask": "0xf", "Unit": "UPI LL" }, @@ -25259,7 +25267,7 @@ "EventCode": "0x3", "EventName": "UNC_UPI_RxL_FLITS.ALL_NULL", "PerPkg": "1", - "PublicDescription": "Counts null FLITs (80 bit FLow control unITs) received from any of the 3 Intel Ultra Path Interconnect (UPI) Receive Queue slots on this UPI unit.", + "PublicDescription": "Counts null FLITs (80 bit FLow control unITs) received from any of the 3 Intel(R) Ultra Path Interconnect (UPI) Receive Queue slots on this UPI unit.", "UMask": "0x27", "Unit": "UPI LL" }, @@ -25583,11 +25591,11 @@ "Unit": "UPI LL" }, { - "BriefDescription": "Cycles in which the Tx of the Intel Ultra Path Interconnect (UPI) is in L0p power mode", + "BriefDescription": "Cycles in which the Tx of the Intel(R) Ultra Path Interconnect (UPI) is in L0p power mode", "EventCode": "0x27", "EventName": "UNC_UPI_TxL0P_POWER_CYCLES", "PerPkg": "1", - "PublicDescription": "Counts cycles when the transmit side (Tx) of the Intel Ultra Path Interconnect(UPI) is in L0p power mode. L0p is a mode where we disable 60% of the UPI lanes, decreasing our bandwidth in order to save power.", + "PublicDescription": "Counts cycles when the transmit side (Tx) of the Intel(R) Ultra Path Interconnect(UPI) is in L0p power mode. L0p is a mode where we disable 60% of the UPI lanes, decreasing our bandwidth in order to save power.", "Unit": "UPI LL" }, { @@ -25759,7 +25767,7 @@ "EventCode": "0x41", "EventName": "UNC_UPI_TxL_BYPASSED", "PerPkg": "1", - "PublicDescription": "Counts incoming FLITs (FLow control unITs) which bypassed the TxL(transmit) FLIT buffer and pass directly out the UPI Link. Generally, when data is transmitted across the Intel Ultra Path Interconnect (UPI), it will bypass the TxQ and pass directly to the link. However, the TxQ will be used in L0p (Low Power) mode and (Link Layer Retry) LLR mode, increasing latency to transfer out to the link.", + "PublicDescription": "Counts incoming FLITs (FLow control unITs) which bypassed the TxL(transmit) FLIT buffer and pass directly out the UPI Link. Generally, when data is transmitted across the Intel(R) Ultra Path Interconnect (UPI), it will bypass the TxQ and pass directly to the link. However, the TxQ will be used in L0p (Low Power) mode and (Link Layer Retry) LLR mode, increasing latency to transfer out to the link.", "Unit": "UPI LL" }, { @@ -25767,7 +25775,7 @@ "EventCode": "0x2", "EventName": "UNC_UPI_TxL_FLITS.ALL_DATA", "PerPkg": "1", - "PublicDescription": "Counts valid data FLITs (80 bit FLow control unITs: 64bits of data) transmitted (TxL) via any of the 3 Intel Ultra Path Interconnect (UPI) slots on this UPI unit.", + "PublicDescription": "Counts valid data FLITs (80 bit FLow control unITs: 64bits of data) transmitted (TxL) via any of the 3 Intel(R) Ultra Path Interconnect (UPI) slots on this UPI unit.", "UMask": "0xf", "Unit": "UPI LL" }, @@ -25776,7 +25784,7 @@ "EventCode": "0x2", "EventName": "UNC_UPI_TxL_FLITS.ALL_NULL", "PerPkg": "1", - "PublicDescription": "Counts null FLITs (80 bit FLow control unITs) transmitted via any of the 3 Intel Ulra Path Interconnect (UPI) slots on this UPI unit.", + "PublicDescription": "Counts null FLITs (80 bit FLow control unITs) transmitted via any of the 3 Intel(R) Ulra Path Interconnect (UPI) slots on this UPI unit.", "UMask": "0x27", "Unit": "UPI LL" }, @@ -26127,7 +26135,7 @@ "EventCode": "0x2", "EventName": "UPI_DATA_BANDWIDTH_TX", "PerPkg": "1", - "PublicDescription": "Counts valid data FLITs (80 bit FLow control unITs: 64bits of data) transmitted (TxL) via any of the 3 Intel Ultra Path Interconnect (UPI) slots on this UPI unit.", + "PublicDescription": "Counts valid data FLITs (80 bit FLow control unITs: 64bits of data) transmitted (TxL) via any of the 3 Intel(R) Ultra Path Interconnect (UPI) slots on this UPI unit.", "ScaleUnit": "7.11E-06Bytes", "UMask": "0xf", "Unit": "UPI LL" diff --git a/tools/perf/pmu-events/arch/x86/skylakex/uncore-power.json b/tools/perf/pmu-events/arch/x86/skylakex/uncore-power.json index 8e21dc3eff16..c6254af7a468 100644 --- a/tools/perf/pmu-events/arch/x86/skylakex/uncore-power.json +++ b/tools/perf/pmu-events/arch/x86/skylakex/uncore-power.json @@ -143,7 +143,7 @@ "EventCode": "0x80", "EventName": "UNC_P_POWER_STATE_OCCUPANCY.CORES_C0", "PerPkg": "1", - "PublicDescription": "This is an occupancy event that tracks the number of cores that are in the chosen C-State. It can be used by itself to get the average number of cores in that C-state with threshholding to generate histograms, or with other PCU events and occupancy triggering to capture other details.", + "PublicDescription": "This is an occupancy event that tracks the number of cores that are in the chosen C-State. It can be used by itself to get the average number of cores in that C-state with thresholding to generate histograms, or with other PCU events and occupancy triggering to capture other details.", "Unit": "PCU" }, { @@ -151,7 +151,7 @@ "EventCode": "0x80", "EventName": "UNC_P_POWER_STATE_OCCUPANCY.CORES_C3", "PerPkg": "1", - "PublicDescription": "This is an occupancy event that tracks the number of cores that are in the chosen C-State. It can be used by itself to get the average number of cores in that C-state with threshholding to generate histograms, or with other PCU events and occupancy triggering to capture other details.", + "PublicDescription": "This is an occupancy event that tracks the number of cores that are in the chosen C-State. It can be used by itself to get the average number of cores in that C-state with thresholding to generate histograms, or with other PCU events and occupancy triggering to capture other details.", "Unit": "PCU" }, { @@ -159,7 +159,7 @@ "EventCode": "0x80", "EventName": "UNC_P_POWER_STATE_OCCUPANCY.CORES_C6", "PerPkg": "1", - "PublicDescription": "This is an occupancy event that tracks the number of cores that are in the chosen C-State. It can be used by itself to get the average number of cores in that C-state with threshholding to generate histograms, or with other PCU events and occupancy triggering to capture other details.", + "PublicDescription": "This is an occupancy event that tracks the number of cores that are in the chosen C-State. It can be used by itself to get the average number of cores in that C-state with thresholding to generate histograms, or with other PCU events and occupancy triggering to capture other details.", "Unit": "PCU" }, { From patchwork Sun Feb 19 09:28:27 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59121 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp775667wrn; Sun, 19 Feb 2023 01:45:38 -0800 (PST) X-Google-Smtp-Source: AK7set+yYwvV3i6Tx6y16SUbgcItZ424qgaAreVkdtgH4H43yDdYcdFR033DR2IjqSzyMnUfElOM X-Received: by 2002:aa7:946f:0:b0:5b4:2:b248 with SMTP id t15-20020aa7946f000000b005b40002b248mr3191978pfq.14.1676799938096; Sun, 19 Feb 2023 01:45:38 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799938; cv=none; d=google.com; s=arc-20160816; b=lIgdA7/8O7hAqa/LK0ZwvWSfu4EL0hVFbbIKDbNkebU+n23NVL35SlsVuVsXK+VGdl 3NDwvvh0cvj1sdSxODNVEAD9x/z7jvhugBtl3Pksrm1yVlSw0YQLFWR2rwsyLsJispVY X/plvaUwjhnEjxvHtpIlqARk+YTtf+EP9RYo90XCqJ3bF0h6AEKZWfshPgXrIhwq5cmK 0nm7+X4M9BUGLSvr2/5ioD+Dtwb6WrEe60ASCo2/+YiRJzuz+Ep1cNsRcBpUkA/zz4+J 0XQVNcKjX15QVMS717588gMYsouc32LVQvHtYvVRrJ/xJ4s0beux4NZPAAvRRMgLpJM6 /wAA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:from:subject :references:mime-version:message-id:in-reply-to:date:dkim-signature; bh=H87tLbqSuA8OyX5pcXyXGCvfR4ey8ROoTkCeYGPz9hI=; b=Bc07zA9XdOy/LbMr0esu5AXkfdJXkQqzQLqnFjhuPt9+uesoUxclnZSlyhDcHnzCKv lfLgB4EP90xOpyls42QnUGaM6AAFJxNJZm4aCwc8N3Ypp3B60O5lP8fFzWjsr48TOAkQ 0rOipFqVnRr8LXybCBKjLxtwjAKPlqnAF9HoDh03dcXfrrYZP+Em0RqQ/f3UtDo2/FRL 066Aec+LLBVdyPVrYW4C/1iGCmIcGlsz2ZlHntdBtrUwt07D09+ktrfEsgFkS9VfmJ9r 53YeDBKJpNXPB8lvKATi72m/R3u1/UqEQqcN5Sd+bxyoPAvPflE7YGdORBEUm4JvY6y4 +68w== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=bxQTXNHy; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id f188-20020a6238c5000000b0056d8f42a69csi10093372pfa.145.2023.02.19.01.45.25; Sun, 19 Feb 2023 01:45:38 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=bxQTXNHy; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230011AbjBSJgL (ORCPT + 99 others); Sun, 19 Feb 2023 04:36:11 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58440 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230010AbjBSJgG (ORCPT ); Sun, 19 Feb 2023 04:36:06 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C2780126EF for ; Sun, 19 Feb 2023 01:35:16 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id 144-20020a250596000000b008fa1d22bd55so1989925ybf.21 for ; Sun, 19 Feb 2023 01:35:16 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:from:to:cc:subject:date :message-id:reply-to; bh=H87tLbqSuA8OyX5pcXyXGCvfR4ey8ROoTkCeYGPz9hI=; b=bxQTXNHyF7U2ExEebKkhryH0DzLJqgQVdGrY5q/z+TtGMWgDqruOFasvJ9SS3QzekG ATOpdLAmOrSs80uAW0S/hp1NbNHXXxlvPXutMzS/C2wUtal2Pj/TxMlOsMATJoXSSnuE gi1HMij5pXc6CUz9B+5fu1dK0HtLkRJUkYlnXQ74RbEaHHRqrsnNI2Dl+b5cbhiERrIk WqauxwXL2Y4oizmkj5nWhyLhIlSejcw0oC0TCzQ1vMajLAFBSMpf1PfqPURKugku4piH wEy/++wUivZY5s50nQ2iiYtOIoyWvc8eZa3Cq2oD+cz677i26CrrBCgNI1WzNMh+mm9r 8hyw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=H87tLbqSuA8OyX5pcXyXGCvfR4ey8ROoTkCeYGPz9hI=; b=T/og9ytVWSCwKRplF6jGzVcnglr1HyxwCJsDXgYJR/PtvyFgivenSeqKsAsG3/UyA/ Us1t4uYVzm9n7ks0/LkZvim+9eGLOoJND/4Hr5mOgnKh7KHMSIxlokeE6XQYieotvHSM O4Q8yzihvRgYCp8Y0YKgqCCJ4gqTdj2NttcA1LpGWcn3M8DSMGOYfVO+788dW9jHkhPI GTomZHFMLchzkWzbIl2qT91iyTSosJpBf3+YrfTJCwMNcfE9AYIPCPoQaGf1Me5uej9D R3U6UrbbzUOpqFnjGG2xH8M6nB78/kPsg2rQMKwjv5OvqWMyFWkL8QJ/SRZsgNHNLP4C IG9w== X-Gm-Message-State: AO0yUKXDxuqNm7n/A2brxto5rl7+X0hip8+XE9fUuPYkwQiwwVejT4J5 kn1sX0MdfqPy2GCG0w7mxCO0rPGyXtN4 X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a05:6902:1281:b0:8bd:4ab5:18f4 with SMTP id i1-20020a056902128100b008bd4ab518f4mr390738ybu.6.1676799190736; Sun, 19 Feb 2023 01:33:10 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:27 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-31-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 30/51] perf vendor events intel: Refresh tigerlake events From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758252171980907717?= X-GMAIL-MSGID: =?utf-8?q?1758252171980907717?= Update the tigerlake events from 1.08 to 1.10. Generation was done using https://github.com/intel/perfmon. Notable changes are new events and event descriptions, TMA metrics are updated to version 4.5, TMA info metrics are renamed from their node name to be lower case and prefixed by tma_info_, MetricThreshold expressions are added, smi_cost and transaction metric groups are added replicating existing hard coded metrics in stat-shadow. Signed-off-by: Ian Rogers --- tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- .../arch/x86/tigerlake/floating-point.json | 31 + .../arch/x86/tigerlake/pipeline.json | 18 + .../arch/x86/tigerlake/tgl-metrics.json | 1942 ++++++++++------- .../arch/x86/tigerlake/uncore-other.json | 28 +- 5 files changed, 1198 insertions(+), 823 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv index 07f0b4758a83..bc2c4e756f44 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -27,7 +27,7 @@ GenuineIntel-6-(37|4A|4C|4D|5A),v15,silvermont,core GenuineIntel-6-(4E|5E|8E|9E|A5|A6),v54,skylake,core GenuineIntel-6-55-[01234],v1.29,skylakex,core GenuineIntel-6-86,v1.20,snowridgex,core -GenuineIntel-6-8[CD],v1.08,tigerlake,core +GenuineIntel-6-8[CD],v1.10,tigerlake,core GenuineIntel-6-2C,v3,westmereep-dp,core GenuineIntel-6-25,v3,westmereep-sp,core GenuineIntel-6-2F,v3,westmereex,core diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/floating-point.json b/tools/perf/pmu-events/arch/x86/tigerlake/floating-point.json index 655342dadac6..63b5b56d1ed0 100644 --- a/tools/perf/pmu-events/arch/x86/tigerlake/floating-point.json +++ b/tools/perf/pmu-events/arch/x86/tigerlake/floating-point.json @@ -39,6 +39,14 @@ "SampleAfterValue": "100003", "UMask": "0x20" }, + { + "BriefDescription": "Number of SSE/AVX computational 128-bit packed single and 256-bit packed double precision FP instructions retired; some instructions will count twice as noted below. Each count represents 2 or/and 4 computation operations, 1 for each element. Applies to SSE* and AVX* packed single precision and packed double precision FP instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB count twice as they perform 2 calculations per element.", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.4_FLOPS", + "PublicDescription": "Number of SSE/AVX computational 128-bit packed single precision and 256-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 2 or/and 4 computation operations, one for each element. Applies to SSE* and AVX* packed single precision floating-point and packed double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.", + "SampleAfterValue": "100003", + "UMask": "0x18" + }, { "BriefDescription": "Counts number of SSE/AVX computational 512-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 8 computation operations, one for each element. Applies to SSE* and AVX* packed double precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT14 RCP14 FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.", "EventCode": "0xc7", @@ -55,6 +63,22 @@ "SampleAfterValue": "100003", "UMask": "0x80" }, + { + "BriefDescription": "Number of SSE/AVX computational 256-bit packed single precision and 512-bit packed double precision FP instructions retired; some instructions will count twice as noted below. Each count represents 8 computation operations, 1 for each element. Applies to SSE* and AVX* packed single precision and double precision FP instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT RSQRT RSQRT14 RCP RCP14 DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB count twice as they perform 2 calculations per element.", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.8_FLOPS", + "PublicDescription": "Number of SSE/AVX computational 256-bit packed single precision and 512-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 8 computation operations, one for each element. Applies to SSE* and AVX* packed single precision and double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT RSQRT RSQRT14 RCP RCP14 DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.", + "SampleAfterValue": "100003", + "UMask": "0x60" + }, + { + "BriefDescription": "Number of SSE/AVX computational scalar floating-point instructions retired; some instructions will count twice as noted below. Applies to SSE* and AVX* scalar, double and single precision floating-point: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14 RANGE SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform multiple calculations per element.", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.SCALAR", + "PublicDescription": "Number of SSE/AVX computational scalar single precision and double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 1 computational operation. Applies to SSE* and AVX* scalar single precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT RCP FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.", + "SampleAfterValue": "1000003", + "UMask": "0x3" + }, { "BriefDescription": "Counts number of SSE/AVX computational scalar double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 1 computational operation. Applies to SSE* and AVX* scalar double precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.", "EventCode": "0xc7", @@ -70,5 +94,12 @@ "PublicDescription": "Number of SSE/AVX computational scalar single precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 1 computational operation. Applies to SSE* and AVX* scalar single precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT RCP FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.", "SampleAfterValue": "100003", "UMask": "0x2" + }, + { + "BriefDescription": "Number of any Vector retired FP arithmetic instructions", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.VECTOR", + "SampleAfterValue": "1000003", + "UMask": "0xfc" } ] diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/pipeline.json b/tools/perf/pmu-events/arch/x86/tigerlake/pipeline.json index 9d43decd75ec..a0aeeb801fd7 100644 --- a/tools/perf/pmu-events/arch/x86/tigerlake/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/tigerlake/pipeline.json @@ -158,6 +158,15 @@ "SampleAfterValue": "50021", "UMask": "0x20" }, + { + "BriefDescription": "This event counts the number of mispredicted ret instructions retired. Non PEBS", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.RET", + "PEBS": "1", + "PublicDescription": "This is a non-precise version (that is, does not use PEBS) of the event that counts mispredicted return instructions retired.", + "SampleAfterValue": "50021", + "UMask": "0x8" + }, { "BriefDescription": "Cycle counts are evenly distributed between active threads in the Core.", "EventCode": "0xec", @@ -383,6 +392,15 @@ "SampleAfterValue": "2000003", "UMask": "0x3" }, + { + "BriefDescription": "Clears speculative count", + "CounterMask": "1", + "EventCode": "0x0d", + "EventName": "INT_MISC.CLEARS_COUNT", + "PublicDescription": "Counts the number of speculative clears due to any type of branch misprediction or machine clears", + "SampleAfterValue": "500009", + "UMask": "0x1" + }, { "BriefDescription": "Counts cycles after recovery from a branch misprediction or machine clear till the first uop is issued from the resteered path.", "EventCode": "0x0d", diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json index 7e22a9127156..4c80d6be6cf1 100644 --- a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json +++ b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json @@ -1,1230 +1,1532 @@ [ { - "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", - "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / SLOTS", - "MetricGroup": "PGO;TopdownL1;tma_L1_group", - "MetricName": "tma_frontend_bound", - "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Machine_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", - "MetricExpr": "(5 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE - INT_MISC.UOP_DROPPING) / SLOTS", - "MetricGroup": "Frontend;TopdownL2;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_latency", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses", - "MetricExpr": "ICACHE_16B.IFDATA_STALL / CLKS", - "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_icache_misses", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", + "BriefDescription": "C10 residency percent per package", + "MetricExpr": "cstate_pkg@c10\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C10_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses", - "MetricExpr": "ICACHE_64B.IFTAG_STALL / CLKS", - "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_itlb_misses", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", + "BriefDescription": "C2 residency percent per package", + "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C2_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", - "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / CLKS + tma_unknown_branches", - "MetricGroup": "FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_branch_resteers", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "BriefDescription": "C3 residency percent per package", + "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage", - "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / CLKS", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL4;tma_branch_resteers_group", - "MetricName": "tma_mispredicts_resteers", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES", + "BriefDescription": "C6 residency percent per core", + "MetricExpr": "cstate_core@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears", - "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / CLKS", - "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_branch_resteers_group", - "MetricName": "tma_clears_resteers", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES", + "BriefDescription": "C6 residency percent per package", + "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears", - "MetricExpr": "10 * BACLEARS.ANY / CLKS", - "MetricGroup": "BigFoot;FetchLat;TopdownL4;tma_branch_resteers_group", - "MetricName": "tma_unknown_branches", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (First fetch or hitting BPU capacity limit). Sample with: BACLEARS.ANY", + "BriefDescription": "C7 residency percent per core", + "MetricExpr": "cstate_core@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Core_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / CLKS", - "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_dsb_switches", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS", + "BriefDescription": "C7 residency percent per package", + "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", - "MetricExpr": "ILD_STALL.LCP / CLKS", - "MetricGroup": "FetchLat;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_lcp", - "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs.", + "BriefDescription": "C8 residency percent per package", + "MetricExpr": "cstate_pkg@c8\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C8_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", - "MetricExpr": "3 * IDQ.MS_SWITCHES / CLKS", - "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_fetch_latency_group", - "MetricName": "tma_ms_switches", - "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES", + "BriefDescription": "C9 residency percent per package", + "MetricExpr": "cstate_pkg@c9\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C9_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", - "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)", - "MetricGroup": "FetchBW;Frontend;TopdownL2;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_fetch_bandwidth", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS", + "BriefDescription": "Percentage of cycles spent in System Management Interrupts.", + "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)", + "MetricGroup": "smi", + "MetricName": "smi_cycles", + "MetricThreshold": "smi_cycles > 0.1", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", - "MetricExpr": "(IDQ.MITE_CYCLES_ANY - IDQ.MITE_CYCLES_OK) / CORE_CLKS / 2", - "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_fetch_bandwidth_group", - "MetricName": "tma_mite", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sample with: FRONTEND_RETIRED.ANY_DSB_MISS", - "ScaleUnit": "100%" + "BriefDescription": "Number of SMI interrupts.", + "MetricExpr": "msr@smi@", + "MetricGroup": "smi", + "MetricName": "smi_num", + "ScaleUnit": "1SMI#" }, { - "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=2@) / CORE_CLKS", - "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_mite_group", - "MetricName": "tma_decoder0_alone", + "BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset", + "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_4k_aliasing", + "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles where (only) 4 uops were delivered by the MITE pipeline", - "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=4@ - cpu@IDQ.MITE_UOPS\\,cmask\\=5@) / CLKS", - "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_mite_group", - "MetricName": "tma_mite_4wide", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", + "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + UOPS_DISPATCHED.PORT_5 + UOPS_DISPATCHED.PORT_6) / (4 * tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_alu_op_utilization", + "MetricThreshold": "tma_alu_op_utilization > 0.6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", - "MetricExpr": "(IDQ.DSB_CYCLES_ANY - IDQ.DSB_CYCLES_OK) / CORE_CLKS / 2", - "MetricGroup": "DSB;FetchBW;TopdownL3;tma_fetch_bandwidth_group", - "MetricName": "tma_dsb", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", + "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", + "MetricExpr": "100 * ASSISTS.ANY / tma_info_slots", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_assists", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: ASSISTS.ANY", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to LSD (Loop Stream Detector) unit", - "MetricExpr": "(LSD.CYCLES_ACTIVE - LSD.CYCLES_OK) / CORE_CLKS / 2", - "MetricGroup": "FetchBW;LSD;TopdownL3;tma_fetch_bandwidth_group", - "MetricName": "tma_lsd", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to LSD (Loop Stream Detector) unit. LSD typically does well sustaining Uop supply. However; in some rare cases; optimal uop-delivery could not be reached for small loops whose size (in terms of number of uops) does not suit well the LSD structure.", + "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", + "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_backend_bound", + "MetricThreshold": "tma_backend_bound > 0.2", + "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)", - "MetricGroup": "TopdownL1;tma_L1_group", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_bad_speculation", + "MetricThreshold": "tma_bad_speculation > 0.15", "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", "ScaleUnit": "100%" }, + { + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions.", + "MetricExpr": "tma_light_operations * BR_INST_RETIRED.ALL_BRANCHES / (tma_retiring * tma_info_slots)", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_branch_instructions", + "MetricThreshold": "tma_branch_instructions > 0.1 & tma_light_operations > 0.6", + "ScaleUnit": "100%" + }, { "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL2;tma_L2_group;tma_bad_speculation_group", + "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM", "MetricName": "tma_branch_mispredicts", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_info_branch_misprediction_cost, tma_info_mispredictions, tma_mispredicts_resteers", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", - "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)", - "MetricGroup": "BadSpec;MachineClears;TopdownL2;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_machine_clears", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", + "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_clks + tma_unknown_branches", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_branch_resteers", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", "ScaleUnit": "100%" }, { - "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", - "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group", - "MetricName": "tma_backend_bound", - "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS", + "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", + "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", + "MetricName": "tma_cisc", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", - "MetricExpr": "(CYCLE_ACTIVITY.STALLS_MEM_ANY + EXE_ACTIVITY.BOUND_ON_STORES) / (CYCLE_ACTIVITY.STALLS_TOTAL + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL) + EXE_ACTIVITY.BOUND_ON_STORES) * tma_backend_bound", - "MetricGroup": "Backend;TopdownL2;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_memory_bound", - "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears", + "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_clks", + "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC", + "MetricName": "tma_clears_resteers", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", - "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_MISS) / CLKS, 0)", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l1_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB_HIT_PS", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(49 * tma_info_average_frequency * (MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + 48 * tma_info_average_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_contested_accesses", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / CLKS", - "MetricGroup": "MemoryTLB;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_dtlb_load", - "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS", + "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", + "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)", + "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) DTLB was missed by load accesses, that later on hit in second-level TLB (STLB)", - "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", - "MetricGroup": "MemoryTLB;TopdownL5;tma_dtlb_load_group", - "MetricName": "tma_load_stlb_hit", + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "48 * tma_info_average_frequency * (MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD + MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_data_sharing", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by load accesses, performing a hardware page walk", - "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / CLKS", - "MetricGroup": "MemoryTLB;TopdownL5;tma_dtlb_load_group", - "MetricName": "tma_load_stlb_miss", + "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=2@) / tma_info_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0;tma_mite_group", + "MetricName": "tma_decoder0_alone", + "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 5 > 0.35))", + "PublicDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder. Related metrics: tma_few_uops_instructions", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", - "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_store_fwd_blk", - "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", + "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", + "MetricExpr": "ARITH.DIVIDER_ACTIVE / tma_info_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_divider", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", - "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS.ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10 * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO))) / CLKS", - "MetricGroup": "Offcore;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_lock_latency", - "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOADS_PS", + "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_clks + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_clks - tma_l2_bound", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary", - "MetricExpr": "Load_Miss_Real_Latency * LD_BLOCKS.NO_SR / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_split_loads", - "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline", + "MetricExpr": "(IDQ.DSB_CYCLES_ANY - IDQ.DSB_CYCLES_OK) / tma_info_core_clks / 2", + "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_dsb", + "MetricThreshold": "tma_dsb > 0.15 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 5 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset", - "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / CLKS", - "MetricGroup": "TopdownL4;tma_l1_bound_group", - "MetricName": "tma_4k_aliasing", - "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_clks", + "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_dsb_switches", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_dsb_misses, tma_info_iptb, tma_lcp", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", - "MetricExpr": "L1D_PEND_MISS.FB_FULL / CLKS", - "MetricGroup": "MemoryBW;TopdownL4;tma_l1_bound_group", - "MetricName": "tma_fb_full", - "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory).", + "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", + "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group", + "MetricName": "tma_dtlb_load", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store, tma_info_memory_data_tlbs", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads", - "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + L1D_PEND_MISS.FB_FULL_PERIODS) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / CLKS)", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l2_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT_PS", + "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", + "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group", + "MetricName": "tma_dtlb_store", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, tma_info_memory_data_tlbs", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", - "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STALLS_L3_MISS) / CLKS", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_l3_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing", + "MetricExpr": "54 * tma_info_average_frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group", + "MetricName": "tma_false_sharing", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", - "MetricExpr": "(49 * Average_Frequency * (MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + 48 * Average_Frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CLKS", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_contested_accesses", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS", + "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed", + "MetricExpr": "L1D_PEND_MISS.FB_FULL / tma_info_clks", + "MetricGroup": "MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricName": "tma_fb_full", + "MetricThreshold": "tma_fb_full > 0.3", + "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_dram_bw_use, tma_info_memory_bandwidth, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", - "MetricExpr": "48 * Average_Frequency * (MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD + MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CLKS", - "MetricGroup": "Offcore;Snoop;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_data_sharing", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD", + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues", + "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)", + "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 5 > 0.35", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_dsb_coverage, tma_info_dsb_misses, tma_info_iptb, tma_lcp", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", - "MetricExpr": "17.5 * Average_Frequency * MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CLKS", - "MetricGroup": "MemoryLat;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_l3_hit_latency", - "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", + "MetricExpr": "(5 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE - INT_MISC.UOP_DROPPING) / tma_info_slots", + "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", - "MetricExpr": "L1D_PEND_MISS.L2_STALL / CLKS", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_l3_bound_group", - "MetricName": "tma_sq_full", - "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). The Super Queue is used for requests to access the L2 cache or to go out to the Uncore.", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops", + "MetricExpr": "tma_heavy_operations - tma_microcode_sequencer", + "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueD0", + "MetricName": "tma_few_uops_instructions", + "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", - "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / CLKS + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / CLKS - tma_l2_bound", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_dram_bound", - "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS_PS", + "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", + "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_fp_arith", + "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=4@) / CLKS", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_dram_bound_group", - "MetricName": "tma_mem_bandwidth", - "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that).", + "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired", + "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", + "MetricName": "tma_fp_scalar", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / CLKS - tma_mem_bandwidth", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_dram_bound_group", - "MetricName": "tma_mem_latency", - "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that).", + "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths", + "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xfc@ / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", + "MetricName": "tma_fp_vector", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", - "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / CLKS", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_memory_bound_group", - "MetricName": "tma_store_bound", - "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES_PS", + "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", + "MetricName": "tma_fp_vector_128b", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", - "MetricExpr": "(L2_RQSTS.RFO_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / CLKS", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_store_bound_group", - "MetricName": "tma_store_latency", - "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full)", + "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", + "MetricName": "tma_fp_vector_256b", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing", - "MetricExpr": "54 * Average_Frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM / CLKS", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_store_bound_group", - "MetricName": "tma_false_sharing", - "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM", + "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", + "MetricName": "tma_fp_vector_512b", + "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents rate of split store accesses", - "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / CORE_CLKS", - "MetricGroup": "TopdownL4;tma_store_bound_group", - "MetricName": "tma_split_stores", - "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS", + "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend", + "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_slots", + "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores", - "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / CLKS", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_store_bound_group", - "MetricName": "tma_streaming_stores", - "PublicDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should Streaming stores be a bottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences", + "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECODED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=1@) / IDQ.MITE_UOPS", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_heavy_operations", + "MetricThreshold": "tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", - "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / CORE_CLKS", - "MetricGroup": "MemoryTLB;TopdownL4;tma_store_bound_group", - "MetricName": "tma_dtlb_store", - "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses", + "MetricExpr": "ICACHE_16B.IFDATA_STALL / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_icache_misses", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of cycles where the TLB was missed by store accesses, hitting in the second-level TLB (STLB)", - "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", - "MetricGroup": "MemoryTLB;TopdownL5;tma_dtlb_store_group", - "MetricName": "tma_store_stlb_hit", - "ScaleUnit": "100%" + "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", + "MetricExpr": "tma_info_turbo_utilization * TSC / 1e9 / duration_time", + "MetricGroup": "Power;Summary", + "MetricName": "tma_info_average_frequency" }, { - "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk", - "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / CORE_CLKS", - "MetricGroup": "MemoryTLB;TopdownL5;tma_dtlb_store_group", - "MetricName": "tma_store_stlb_miss", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", + "MetricGroup": "BigFoot;Fed;Frontend;IcMiss;MemoryTLB;tma_issueBC", + "MetricName": "tma_info_big_code", + "MetricThreshold": "tma_info_big_code > 20", + "PublicDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses). Related metrics: tma_info_branching_overhead" }, { - "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", - "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)", - "MetricGroup": "Backend;Compute;TopdownL2;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_core_bound", - "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", - "ScaleUnit": "100%" + "BriefDescription": "Branch instructions per taken branch.", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_bptkbranch" }, { - "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active", - "MetricExpr": "ARITH.DIVIDER_ACTIVE / CLKS", - "MetricGroup": "TopdownL3;tma_core_bound_group", - "MetricName": "tma_divider", - "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_ACTIVE", - "ScaleUnit": "100%" + "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)", + "MetricExpr": "(tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) * tma_info_slots / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_branch_misprediction_cost", + "PublicDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_branch_mispredicts, tma_info_mispredictions, tma_mispredicts_resteers" }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", - "MetricExpr": "((cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=0x80@ + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / CLKS if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL) / CLKS)", - "MetricGroup": "PortsUtil;TopdownL3;tma_core_bound_group", - "MetricName": "tma_ports_utilization", - "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of branch related instructions (used for program control-flow including function calls)", + "MetricExpr": "100 * ((BR_INST_RETIRED.COND + 3 * BR_INST_RETIRED.NEAR_CALL + (BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL)) / tma_info_slots)", + "MetricGroup": "Ret;tma_issueBC", + "MetricName": "tma_info_branching_overhead", + "MetricThreshold": "tma_info_branching_overhead > 10", + "PublicDescription": "Total pipeline cost of branch related instructions (used for program control-flow including function calls). Related metrics: tma_info_big_code" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=0x80@ / CLKS + tma_serializing_operation * (CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY) / CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_0", - "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of branches that are CALL or RET", + "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_RETURN) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_callret" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations", - "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / CLKS", - "MetricGroup": "TopdownL5;tma_ports_utilized_0_group", - "MetricName": "tma_serializing_operation", - "PublicDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations. Instructions like CPUID; WRMSR or LFENCE serialize the out-of-order execution which may limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD", - "ScaleUnit": "100%" + "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "Pipeline", + "MetricName": "tma_info_clks" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions", - "MetricExpr": "140 * MISC_RETIRED.PAUSE_INST / CLKS", - "MetricGroup": "TopdownL6;tma_serializing_operation_group", - "MetricName": "tma_slow_pause", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions. Sample with: MISC_RETIRED.PAUSE_INST", - "ScaleUnit": "100%" + "BriefDescription": "STLB (2nd level TLB) code speculative misses per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", + "MetricGroup": "Fed;MemoryTLB", + "MetricName": "tma_info_code_stlb_mpki" }, { - "BriefDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued", - "MetricExpr": "CLKS * UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY", - "MetricGroup": "TopdownL5;tma_ports_utilized_0_group", - "MetricName": "tma_mixing_vectors", - "PublicDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued. Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic.", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of branches that are non-taken conditionals", + "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_cond_nt" }, { - "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_1", - "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of branches that are taken conditionals", + "MetricExpr": "BR_INST_RETIRED.COND_TAKEN / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_cond_tk" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_2", - "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Sample with: EXE_ACTIVITY.2_PORTS_UTIL", - "ScaleUnit": "100%" + "BriefDescription": "Probability of Core Bound bottleneck hidden by SMT-profiling artifacts", + "MetricExpr": "(100 * (1 - tma_core_bound / tma_ports_utilization if tma_core_bound < tma_ports_utilization else 1) if tma_info_smt_2t_utilization > 0.5 else 0)", + "MetricGroup": "Cor;SMT", + "MetricName": "tma_info_core_bound_likely", + "MetricThreshold": "tma_info_core_bound_likely > 0.5" }, { - "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_ports_utilization_group", - "MetricName": "tma_ports_utilized_3m", - "PublicDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Sample with: UOPS_EXECUTED.CYCLES_GE_3", - "ScaleUnit": "100%" + "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", + "MetricExpr": "CPU_CLK_UNHALTED.DISTRIBUTED", + "MetricGroup": "SMT", + "MetricName": "tma_info_core_clks" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", - "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + UOPS_DISPATCHED.PORT_5 + UOPS_DISPATCHED.PORT_6) / (4 * CORE_CLKS)", - "MetricGroup": "TopdownL5;tma_ports_utilized_3m_group", - "MetricName": "tma_alu_op_utilization", - "ScaleUnit": "100%" + "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_core_clks", + "MetricGroup": "Ret;SMT;TmaL1;tma_L1_group", + "MetricName": "tma_info_coreipc" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch) Sample with: UOPS_DISPATCHED.PORT_0", - "MetricExpr": "UOPS_DISPATCHED.PORT_0 / CORE_CLKS", - "MetricGroup": "Compute;TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_0", - "ScaleUnit": "100%" + "BriefDescription": "Cycles Per Instruction (per Logical Processor)", + "MetricExpr": "1 / tma_info_ipc", + "MetricGroup": "Mem;Pipeline", + "MetricName": "tma_info_cpi" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU) Sample with: UOPS_DISPATCHED.PORT_1", - "MetricExpr": "UOPS_DISPATCHED.PORT_1 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_1", - "ScaleUnit": "100%" + "BriefDescription": "Average CPU Utilization", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": "HPC;Summary", + "MetricName": "tma_info_cpu_utilization" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU) Sample with: UOPS_DISPATCHED.PORT_5", - "MetricExpr": "UOPS_DISPATCHED.PORT_5 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_5", - "ScaleUnit": "100%" + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD / OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_data_l2_mlp" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU) Sample with: UOPS_DISPATCHED.PORT_6", - "MetricExpr": "UOPS_DISPATCHED.PORT_6 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_alu_op_utilization_group", - "MetricName": "tma_port_6", - "ScaleUnit": "100%" + "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", + "MetricExpr": "64 * (arb@event\\=0x81\\,umask\\=0x1@ + arb@event\\=0x84\\,umask\\=0x1@) / 1e6 / duration_time / 1e3", + "MetricGroup": "HPC;Mem;MemoryBW;SoC;tma_issueBW", + "MetricName": "tma_info_dram_bw_use", + "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_memory_bandwidth, tma_mem_bandwidth, tma_sq_full" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations Sample with: UOPS_DISPATCHED.PORT_2_3", - "MetricExpr": "UOPS_DISPATCHED.PORT_2_3 / (2 * CORE_CLKS)", - "MetricGroup": "TopdownL5;tma_ports_utilized_3m_group", - "MetricName": "tma_load_op_utilization", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_dsb_coverage", + "MetricThreshold": "tma_info_dsb_coverage < 0.7 & tma_info_ipc / 5 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_misses, tma_info_iptb, tma_lcp" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations Sample with: UOPS_DISPATCHED.PORT_7_8", - "MetricExpr": "(UOPS_DISPATCHED.PORT_4_9 + UOPS_DISPATCHED.PORT_7_8) / (4 * CORE_CLKS)", - "MetricGroup": "TopdownL5;tma_ports_utilized_3m_group", - "MetricName": "tma_store_op_utilization", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tma_lsd + tma_mite))", + "MetricGroup": "DSBmiss;Fed;tma_issueFB", + "MetricName": "tma_info_dsb_misses", + "MetricThreshold": "tma_info_dsb_misses > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_iptb, tma_lcp" }, { - "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group", - "MetricName": "tma_retiring", - "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS", - "ScaleUnit": "100%" + "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details.", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=1\\,edge@", + "MetricGroup": "DSBmiss", + "MetricName": "tma_info_dsb_switch_cost" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", - "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_retiring_group", - "MetricName": "tma_light_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", - "ScaleUnit": "100%" + "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-thread", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@", + "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", + "MetricName": "tma_info_execute" }, { - "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", - "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", - "MetricGroup": "HPC;TopdownL3;tma_light_operations_group", - "MetricName": "tma_fp_arith", - "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", - "ScaleUnit": "100%" + "BriefDescription": "The ratio of Executed- by Issued-Uops", + "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", + "MetricGroup": "Cor;Pipeline", + "MetricName": "tma_info_execute_per_issue", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage." }, { - "BriefDescription": "This metric serves as an approximation of legacy x87 usage", - "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD", - "MetricGroup": "Compute;TopdownL4;tma_fp_arith_group", - "MetricName": "tma_x87_use", - "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", - "ScaleUnit": "100%" + "BriefDescription": "Fill Buffer (FB) hits per kilo instructions for retired demand loads (L1D misses that merge into ongoing miss-handling entries)", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_fb_hpki" }, { - "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE) / (tma_retiring * SLOTS)", - "MetricGroup": "Compute;Flops;TopdownL4;tma_fp_arith_group", - "MetricName": "tma_fp_scalar", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting.", - "ScaleUnit": "100%" + "BriefDescription": "Average number of Uops issued by front-end when it issued something", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=1@", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_fetch_upc" }, { - "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths", - "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / (tma_retiring * SLOTS)", - "MetricGroup": "Compute;Flops;TopdownL4;tma_fp_arith_group", - "MetricName": "tma_fp_vector", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting.", - "ScaleUnit": "100%" + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * cpu@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE\\,umask\\=0x18@ + 8 * cpu@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE\\,umask\\=0x60@ + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / tma_info_core_clks", + "MetricGroup": "Flops;Ret", + "MetricName": "tma_info_flopc" }, { - "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors", - "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * SLOTS)", - "MetricGroup": "Compute;Flops;TopdownL5;tma_fp_vector_group", - "MetricName": "tma_fp_vector_128b", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting.", - "ScaleUnit": "100%" + "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xfc@) / (2 * tma_info_core_clks)", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_fp_arith_utilization", + "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)." }, { - "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors", - "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / (tma_retiring * SLOTS)", - "MetricGroup": "Compute;Flops;TopdownL5;tma_fp_vector_group", - "MetricName": "tma_fp_vector_256b", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting.", - "ScaleUnit": "100%" + "BriefDescription": "Giga Floating Point Operations Per Second", + "MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * cpu@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE\\,umask\\=0x18@ + 8 * cpu@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE\\,umask\\=0x60@ + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1e9 / duration_time", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_gflops", + "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine." }, { - "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors", - "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / (tma_retiring * SLOTS)", - "MetricGroup": "Compute;Flops;TopdownL5;tma_fp_vector_group", - "MetricName": "tma_fp_vector_512b", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors. May overcount due to FMA double counting.", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", + "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", + "MetricName": "tma_info_ic_misses", + "MetricThreshold": "tma_info_ic_misses > 5", + "PublicDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck. Related metrics: " }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations -- uops for memory load or store accesses.", - "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_RETIRED.ANY", - "MetricGroup": "Pipeline;TopdownL3;tma_light_operations_group", - "MetricName": "tma_memory_operations", - "ScaleUnit": "100%" + "BriefDescription": "Average Latency for L1 instruction cache misses", + "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=1\\,edge@", + "MetricGroup": "Fed;FetchLat;IcMiss", + "MetricName": "tma_info_icache_miss_latency" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions.", - "MetricExpr": "tma_light_operations * BR_INST_RETIRED.ALL_BRANCHES / (tma_retiring * SLOTS)", - "MetricGroup": "Pipeline;TopdownL3;tma_light_operations_group", - "MetricName": "tma_branch_instructions", - "ScaleUnit": "100%" + "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else UOPS_EXECUTED.CORE_CYCLES_GE_1)", + "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", + "MetricName": "tma_info_ilp" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions", - "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_retiring * SLOTS)", - "MetricGroup": "Pipeline;TopdownL3;tma_light_operations_group", - "MetricName": "tma_nop_instructions", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions. Compilers often use NOPs for certain address alignments - e.g. start address of a function or loop body. Sample with: INST_RETIRED.NOP", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks", + "MetricExpr": "100 * (tma_frontend_bound - tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) - tma_info_big_code", + "MetricGroup": "Fed;FetchBW;Frontend", + "MetricName": "tma_info_instruction_fetch_bw", + "MetricThreshold": "tma_info_instruction_fetch_bw > 20" }, { - "BriefDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes. May undercount due to FMA double counting", - "MetricExpr": "max(0, tma_light_operations - (tma_fp_arith + tma_memory_operations + tma_branch_instructions + tma_nop_instructions))", - "MetricGroup": "Pipeline;TopdownL3;tma_light_operations_group", - "MetricName": "tma_other_light_ops", - "ScaleUnit": "100%" + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Summary;TmaL1;tma_L1_group", + "MetricName": "tma_info_instructions", + "PublicDescription": "Total number of retired Instructions. Sample with: INST_RETIRED.PREC_DIST" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences", - "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECODED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=1@) / IDQ.MITE_UOPS", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_retiring_group", - "MetricName": "tma_heavy_operations", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences. This highly-correlates with the uop length of these instructions/sequences.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xfc@)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_iparith", + "MetricThreshold": "tma_info_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). May undercount due to FMA double counting. Approximated prior to BDW." }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops", - "MetricExpr": "tma_heavy_operations - tma_microcode_sequencer", - "MetricGroup": "TopdownL3;tma_heavy_operations_group", - "MetricName": "tma_few_uops_instructions", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number of uops in such instructions.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx128", + "MetricThreshold": "tma_info_iparith_avx128 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", - "MetricExpr": "tma_retiring * SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / SLOTS", - "MetricGroup": "MicroSeq;TopdownL3;tma_heavy_operations_group", - "MetricName": "tma_microcode_sequencer", - "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx256", + "MetricThreshold": "tma_info_iparith_avx256 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", - "MetricExpr": "100 * ASSISTS.ANY / SLOTS", - "MetricGroup": "TopdownL4;tma_microcode_sequencer_group", - "MetricName": "tma_assists", - "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: ASSISTS.ANY", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx512", + "MetricThreshold": "tma_info_iparith_avx512 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", - "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", - "MetricGroup": "TopdownL4;tma_microcode_sequencer_group", - "MetricName": "tma_cisc", - "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_iparith_scalar_dp", + "MetricThreshold": "tma_info_iparith_scalar_dp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks", - "MetricExpr": "100 * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", - "MetricGroup": "Bad;BadSpec;BrMispredicts", - "MetricName": "Mispredictions" + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_iparith_scalar_sp", + "MetricThreshold": "tma_info_iparith_scalar_sp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." }, { - "BriefDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks", - "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full))) + tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk))", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "Memory_Bandwidth" + "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Branches;Fed;InsType", + "MetricName": "tma_info_ipbranch", + "MetricThreshold": "tma_info_ipbranch < 8" }, { - "BriefDescription": "Total pipeline cost of Memory Latency related bottlenecks (external memory and off-core caches)", - "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound))", - "MetricGroup": "Mem;MemoryLat;Offcore", - "MetricName": "Memory_Latency" + "BriefDescription": "Instructions Per Cycle (per Logical Processor)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_clks", + "MetricGroup": "Ret;Summary", + "MetricName": "tma_info_ipc" }, { - "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)", - "MetricExpr": "100 * tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming_stores)))", - "MetricGroup": "Mem;MemoryTLB;Offcore", - "MetricName": "Memory_Data_TLBs" + "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_ipcall", + "MetricThreshold": "tma_info_ipcall < 200" }, { - "BriefDescription": "Total pipeline cost of branch related instructions (used for program control-flow including function calls)", - "MetricExpr": "100 * ((BR_INST_RETIRED.COND + 3 * BR_INST_RETIRED.NEAR_CALL + (BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL)) / SLOTS)", - "MetricGroup": "Ret", - "MetricName": "Branching_Overhead" + "BriefDescription": "Instructions per non-speculative DSB miss (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", + "MetricGroup": "DSBmiss;Fed", + "MetricName": "tma_info_ipdsb_miss_ret", + "MetricThreshold": "tma_info_ipdsb_miss_ret < 50" }, { - "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", - "MetricGroup": "BigFoot;Fed;Frontend;IcMiss;MemoryTLB", - "MetricName": "Big_Code" + "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", + "MetricGroup": "Branches;OS", + "MetricName": "tma_info_ipfarbranch", + "MetricThreshold": "tma_info_ipfarbranch < 1e6" }, { - "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_frontend_bound - tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) - Big_Code", - "MetricGroup": "Fed;FetchBW;Frontend", - "MetricName": "Instruction_Fetch_BW" + "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * cpu@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE\\,umask\\=0x18@ + 8 * cpu@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE\\,umask\\=0x60@ + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_ipflop", + "MetricThreshold": "tma_info_ipflop < 10" }, { - "BriefDescription": "Instructions Per Cycle (per Logical Processor)", - "MetricExpr": "INST_RETIRED.ANY / CLKS", - "MetricGroup": "Ret;Summary", - "MetricName": "IPC" + "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipload", + "MetricThreshold": "tma_info_ipload < 3" }, { - "BriefDescription": "Uops Per Instruction", - "MetricExpr": "tma_retiring * SLOTS / INST_RETIRED.ANY", - "MetricGroup": "Pipeline;Ret;Retire", - "MetricName": "UPI" + "BriefDescription": "Instructions per retired mispredicts for conditional non-taken branches (lower number means higher occurrence rate).", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_cond_ntaken", + "MetricThreshold": "tma_info_ipmisp_cond_ntaken < 200" }, { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "tma_retiring * SLOTS / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW", - "MetricName": "UpTB" + "BriefDescription": "Instructions per retired mispredicts for conditional taken branches (lower number means higher occurrence rate).", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_cond_taken", + "MetricThreshold": "tma_info_ipmisp_cond_taken < 200" }, { - "BriefDescription": "Cycles Per Instruction (per Logical Processor)", - "MetricExpr": "1 / IPC", - "MetricGroup": "Mem;Pipeline", - "MetricName": "CPI" + "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_indirect", + "MetricThreshold": "tma_info_ipmisp_indirect < 1e3" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "Pipeline", - "MetricName": "CLKS" + "BriefDescription": "Instructions per retired mispredicts for return branches (lower number means higher occurrence rate).", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_ret", + "MetricThreshold": "tma_info_ipmisp_ret < 500" }, { - "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "TOPDOWN.SLOTS", - "MetricGroup": "tma_L1_group", - "MetricName": "SLOTS" + "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BadSpec;BrMispredicts", + "MetricName": "tma_info_ipmispredict", + "MetricThreshold": "tma_info_ipmispredict < 200" }, { - "BriefDescription": "Fraction of Physical Core issue-slots utilized by this Logical Processor", - "MetricExpr": "(SLOTS / (TOPDOWN.SLOTS / 2) if #SMT_on else 1)", - "MetricGroup": "SMT;tma_L1_group", - "MetricName": "Slots_Utilization" + "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipstore", + "MetricThreshold": "tma_info_ipstore < 8" }, { - "BriefDescription": "The ratio of Executed- by Issued-Uops", - "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", - "MetricGroup": "Cor;Pipeline", - "MetricName": "Execute_per_Issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage." + "BriefDescription": "Instructions per Software prefetch instruction (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umask\\=0xF@", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_ipswpf", + "MetricThreshold": "tma_info_ipswpf < 100" }, { - "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)", - "MetricExpr": "INST_RETIRED.ANY / CORE_CLKS", - "MetricGroup": "Ret;SMT;tma_L1_group", - "MetricName": "CoreIPC" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", + "MetricName": "tma_info_iptb", + "MetricThreshold": "tma_info_iptb < 11", + "PublicDescription": "Instruction per taken branch. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_dsb_misses, tma_lcp" }, { - "BriefDescription": "Floating Point Operations Per Cycle", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE) + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / CORE_CLKS", - "MetricGroup": "Flops;Ret", - "MetricName": "FLOPc" - }, - { - "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)) / (2 * CORE_CLKS)", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "FP_Arith_Utilization", - "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)." + "BriefDescription": "Instructions per speculative Unknown Branch Misprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_instructions / BACLEARS.ANY", + "MetricGroup": "Fed", + "MetricName": "tma_info_ipunknown_branch" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else UOPS_EXECUTED.CORE_CYCLES_GE_1)", - "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", - "MetricName": "ILP" + "BriefDescription": "Fraction of branches that are unconditional (direct or indirect) jumps", + "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_jump" }, { - "BriefDescription": "Probability of Core Bound bottleneck hidden by SMT-profiling artifacts", - "MetricExpr": "((1 - tma_core_bound / tma_ports_utilization if tma_core_bound < tma_ports_utilization else 1) if SMT_2T_Utilization > 0.5 else 0)", - "MetricGroup": "Cor;SMT", - "MetricName": "Core_Bound_Likely" + "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_cpi" }, { - "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core", - "MetricExpr": "CPU_CLK_UNHALTED.DISTRIBUTED", - "MetricGroup": "SMT", - "MetricName": "CORE_CLKS" + "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_utilization", + "MetricThreshold": "tma_info_kernel_utilization > 0.05" }, { - "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", - "MetricGroup": "InsType", - "MetricName": "IpLoad" + "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw" }, { - "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", - "MetricGroup": "InsType", - "MetricName": "IpStore" + "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", + "MetricExpr": "tma_info_l1d_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw_1t" }, { - "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Branches;Fed;InsType", - "MetricName": "IpBranch" + "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l1mpki" }, { - "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "IpCall" + "BriefDescription": "L1 cache true misses per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l1mpki_load" }, { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO", - "MetricName": "IpTB" + "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw" }, { - "BriefDescription": "Branch instructions per taken branch. ", - "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "BpTkBranch" + "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", + "MetricExpr": "tma_info_l2_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw_1t" }, { - "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE) + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)", - "MetricGroup": "Flops;InsType", - "MetricName": "IpFLOP" + "BriefDescription": "L2 cache hits per kilo instruction for all request types (including speculative)", + "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2hpki_all" }, { - "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE))", - "MetricGroup": "Flops;InsType", - "MetricName": "IpArith", - "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). May undercount due to FMA double counting. Approximated prior to BDW." + "BriefDescription": "L2 cache hits per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2hpki_load" }, { - "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE", - "MetricGroup": "Flops;FpScalar;InsType", - "MetricName": "IpArith_Scalar_SP", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheMisses;Mem", + "MetricName": "tma_info_l2mpki" }, { - "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE", - "MetricGroup": "Flops;FpScalar;InsType", - "MetricName": "IpArith_Scalar_DP", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all request types (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem;Offcore", + "MetricName": "tma_info_l2mpki_all" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX128", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "L2 cache true code cacheline misses per kilo instruction", + "MetricExpr": "1e3 * FRONTEND_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_l2mpki_code" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX256", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "L2 cache speculative code cacheline misses per kilo instruction", + "MetricExpr": "1e3 * L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_l2mpki_code_all" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX512", - "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting." + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2mpki_load" }, { - "BriefDescription": "Instructions per Software prefetch instruction (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umask\\=0xF@", - "MetricGroup": "Prefetches", - "MetricName": "IpSWPF" + "BriefDescription": "Average per-core data access bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_l3_cache_access_bw" }, { - "BriefDescription": "Total number of retired Instructions Sample with: INST_RETIRED.PREC_DIST", - "MetricExpr": "INST_RETIRED.ANY", - "MetricGroup": "Summary;tma_L1_group", - "MetricName": "Instructions" + "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "tma_info_l3_cache_access_bw", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_l3_cache_access_bw_1t" }, { - "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", - "MetricExpr": "tma_retiring * SLOTS / cpu@UOPS_RETIRED.SLOTS\\,cmask\\=1@", - "MetricGroup": "Pipeline;Ret", - "MetricName": "Retire" + "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw" }, { - "BriefDescription": "", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@", - "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", - "MetricName": "Execute" + "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", + "MetricExpr": "tma_info_l3_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw_1t" }, { - "BriefDescription": "Average number of Uops issued by front-end when it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=1@", - "MetricGroup": "Fed;FetchBW", - "MetricName": "Fetch_UpC" + "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l3mpki" }, { - "BriefDescription": "Fraction of Uops delivered by the LSD (Loop Stream Detector; aka Loop Cache)", - "MetricExpr": "LSD.UOPS / (IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", - "MetricGroup": "Fed;LSD", - "MetricName": "LSD_Coverage" + "BriefDescription": "Average Latency for L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_load_l2_miss_latency" }, { - "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)", - "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)", - "MetricGroup": "DSB;Fed;FetchBW", - "MetricName": "DSB_Coverage" + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=1@", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_load_l2_mlp" }, { - "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details.", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=1\\,edge@", - "MetricGroup": "DSBmiss", - "MetricName": "DSB_Switch_Cost" + "BriefDescription": "Average Latency for L3 cache miss demand Loads", + "MetricExpr": "cpu@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,umask\\=0x10@ / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_load_l3_miss_latency" }, { - "BriefDescription": "Total penalty related to DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck.", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tma_lsd + tma_mite))", - "MetricGroup": "DSBmiss;Fed", - "MetricName": "DSB_Misses" + "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", + "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_RETIRED.L1_MISS + MEM_LOAD_RETIRED.FB_HIT)", + "MetricGroup": "Mem;MemoryBound;MemoryLat", + "MetricName": "tma_info_load_miss_real_latency" }, { - "BriefDescription": "Number of Instructions per non-speculative DSB miss (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", - "MetricGroup": "DSBmiss;Fed", - "MetricName": "IpDSB_Miss_Ret" + "BriefDescription": "STLB (2nd level TLB) data load speculative misses per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_load_stlb_mpki" }, { - "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BadSpec;BrMispredicts", - "MetricName": "IpMispredict" + "BriefDescription": "Fraction of Uops delivered by the LSD (Loop Stream Detector; aka Loop Cache)", + "MetricExpr": "LSD.UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "Fed;LSD", + "MetricName": "tma_info_lsd_coverage" }, { - "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)", - "MetricExpr": "(tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) * SLOTS / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BrMispredicts", - "MetricName": "Branch_Misprediction_Cost" + "BriefDescription": "Average number of parallel data read requests to external memory", + "MetricExpr": "UNC_ARB_DAT_OCCUPANCY.RD / UNC_ARB_DAT_OCCUPANCY.RD@cmask\\=1@", + "MetricGroup": "Mem;MemoryBW;SoC", + "MetricName": "tma_info_mem_parallel_reads", + "PublicDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches" }, { - "BriefDescription": "Fraction of branches that are non-taken conditionals", - "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches;CodeGen;PGO", - "MetricName": "Cond_NT" + "BriefDescription": "Average latency of data read request to external memory (in nanoseconds)", + "MetricExpr": "(UNC_ARB_TRK_OCCUPANCY.RD + UNC_ARB_DAT_OCCUPANCY.RD) / UNC_ARB_TRK_REQUESTS.RD", + "MetricGroup": "Mem;MemoryLat;SoC", + "MetricName": "tma_info_mem_read_latency", + "PublicDescription": "Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches. ([RKL+]memory-controller only)" }, { - "BriefDescription": "Fraction of branches that are taken conditionals", - "MetricExpr": "BR_INST_RETIRED.COND_TAKEN / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches;CodeGen;PGO", - "MetricName": "Cond_TK" + "BriefDescription": "Average latency of all requests to external memory (in Uncore cycles)", + "MetricExpr": "(UNC_ARB_TRK_OCCUPANCY.ALL + UNC_ARB_DAT_OCCUPANCY.RD) / arb@event\\=0x81\\,umask\\=0x1@", + "MetricGroup": "Mem;SoC", + "MetricName": "tma_info_mem_request_latency" }, { - "BriefDescription": "Fraction of branches that are CALL or RET", - "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_RETURN) / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches", - "MetricName": "CallRet" + "BriefDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks", + "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full))) + tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk))", + "MetricGroup": "Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_info_memory_bandwidth", + "MetricThreshold": "tma_info_memory_bandwidth > 20", + "PublicDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_mem_bandwidth, tma_sq_full" }, { - "BriefDescription": "Fraction of branches that are unconditional (direct or indirect) jumps", - "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches", - "MetricName": "Jump" + "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)", + "MetricExpr": "100 * tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming_stores)))", + "MetricGroup": "Mem;MemoryTLB;Offcore;tma_issueTLB", + "MetricName": "tma_info_memory_data_tlbs", + "MetricThreshold": "tma_info_memory_data_tlbs > 20", + "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store" }, { - "BriefDescription": "Fraction of branches of other types (not individually covered by other metrics in Info.Branches group)", - "MetricExpr": "1 - (Cond_NT + Cond_TK + CallRet + Jump)", - "MetricGroup": "Bad;Branches", - "MetricName": "Other_Branches" + "BriefDescription": "Total pipeline cost of Memory Latency related bottlenecks (external memory and off-core caches)", + "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound))", + "MetricGroup": "Mem;MemoryLat;Offcore;tma_issueLat", + "MetricName": "tma_info_memory_latency", + "MetricThreshold": "tma_info_memory_latency > 20", + "PublicDescription": "Total pipeline cost of Memory Latency related bottlenecks (external memory and off-core caches). Related metrics: tma_l3_hit_latency, tma_mem_latency" }, { - "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)", - "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_RETIRED.L1_MISS + MEM_LOAD_RETIRED.FB_HIT)", - "MetricGroup": "Mem;MemoryBound;MemoryLat", - "MetricName": "Load_Miss_Real_Latency" + "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks", + "MetricExpr": "100 * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_mispredictions", + "MetricThreshold": "tma_info_mispredictions > 20", + "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_branch_misprediction_cost, tma_mispredicts_resteers" }, { - "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)", + "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss", "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES", "MetricGroup": "Mem;MemoryBW;MemoryBound", - "MetricName": "MLP" + "MetricName": "tma_info_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)" }, { - "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L1MPKI" + "BriefDescription": "Fraction of branches of other types (not individually covered by other metrics in Info.Branches group)", + "MetricExpr": "1 - (tma_info_cond_nt + tma_info_cond_tk + tma_info_callret + tma_info_jump)", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_other_branches" }, { - "BriefDescription": "L1 cache true misses per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L1MPKI_Load" + "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", + "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_PENDING + DTLB_STORE_MISSES.WALK_PENDING) / (2 * tma_info_core_clks)", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_page_walks_utilization", + "MetricThreshold": "tma_info_page_walks_utilization > 0.5" }, { - "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", - "MetricGroup": "Backend;CacheMisses;Mem", - "MetricName": "L2MPKI" + "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0", + "MetricExpr": "CORE_POWER.LVL0_TURBO_LICENSE / tma_info_core_clks", + "MetricGroup": "Power", + "MetricName": "tma_info_power_license0_utilization", + "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0. This includes non-AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes." }, { - "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all request types (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem;Offcore", - "MetricName": "L2MPKI_All" + "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1", + "MetricExpr": "CORE_POWER.LVL1_TURBO_LICENSE / tma_info_core_clks", + "MetricGroup": "Power", + "MetricName": "tma_info_power_license1_utilization", + "MetricThreshold": "tma_info_power_license1_utilization > 0.5", + "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1. This includes high current AVX 256-bit instructions as well as low current AVX 512-bit instructions." }, { - "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2MPKI_Load" + "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX)", + "MetricExpr": "CORE_POWER.LVL2_TURBO_LICENSE / tma_info_core_clks", + "MetricGroup": "Power", + "MetricName": "tma_info_power_license2_utilization", + "MetricThreshold": "tma_info_power_license2_utilization > 0.5", + "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX). This includes high current AVX 512-bit instructions." }, { - "BriefDescription": "L2 cache hits per kilo instruction for all request types (including speculative)", - "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2HPKI_All" + "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.", + "MetricExpr": "tma_retiring * tma_info_slots / cpu@UOPS_RETIRED.SLOTS\\,cmask\\=1@", + "MetricGroup": "Pipeline;Ret", + "MetricName": "tma_info_retire" }, { - "BriefDescription": "L2 cache hits per kilo instruction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2HPKI_Load" + "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)", + "MetricExpr": "TOPDOWN.SLOTS", + "MetricGroup": "TmaL1;tma_L1_group", + "MetricName": "tma_info_slots" }, { - "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L3MPKI" + "BriefDescription": "Fraction of Physical Core issue-slots utilized by this Logical Processor", + "MetricExpr": "(tma_info_slots / (TOPDOWN.SLOTS / 2) if #SMT_on else 1)", + "MetricGroup": "SMT;TmaL1;tma_L1_group", + "MetricName": "tma_info_slots_utilization" }, { - "BriefDescription": "Fill Buffer (FB) hits per kilo instructions for retired demand loads (L1D misses that merge into ongoing miss-handling entries)", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "FB_HPKI" + "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", + "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_DISTRIBUTED if #SMT_on else 0)", + "MetricGroup": "SMT", + "MetricName": "tma_info_smt_2t_utilization" }, { - "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses", - "MetricConstraint": "NO_NMI_WATCHDOG", - "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_PENDING + DTLB_STORE_MISSES.WALK_PENDING) / (2 * CORE_CLKS)", + "BriefDescription": "STLB (2nd level TLB) data store speculative misses per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricGroup": "Mem;MemoryTLB", - "MetricName": "Page_Walks_Utilization" + "MetricName": "tma_info_store_stlb_mpki" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW" + "BriefDescription": "Average Frequency Utilization relative nominal frequency", + "MetricExpr": "tma_info_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_turbo_utilization" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW" + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "tma_retiring * tma_info_slots / INST_RETIRED.ANY", + "MetricGroup": "Pipeline;Ret;Retire", + "MetricName": "tma_info_uoppi", + "MetricThreshold": "tma_info_uoppi > 1.05" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "tma_retiring * tma_info_slots / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW", + "MetricName": "tma_info_uptb", + "MetricThreshold": "tma_info_uptb < 7.5" }, { - "BriefDescription": "Average per-core data access bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "L3_Cache_Access_BW" + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses", + "MetricExpr": "ICACHE_64B.IFTAG_STALL / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_itlb_misses", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "L1D_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW_1T" + "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache", + "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_MISS) / tma_info_clks, 0)", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group", + "MetricName": "tma_l1_bound", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB_HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "L2_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW_1T" + "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + L1D_PEND_MISS.FB_FULL_PERIODS) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_clks)", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l2_bound", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "L3_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW_1T" + "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core", + "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STALLS_L3_MISS) / tma_info_clks", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l3_bound", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "L3_Cache_Access_BW", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "L3_Cache_Access_BW_1T" + "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", + "MetricExpr": "17.5 * tma_info_average_frequency * MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group", + "MetricName": "tma_l3_hit_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tma_info_memory_latency, tma_mem_latency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average CPU Utilization", - "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", - "MetricGroup": "HPC;Summary", - "MetricName": "CPU_Utilization" + "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)", + "MetricExpr": "ILD_STALL.LCP / tma_info_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", + "MetricName": "tma_lcp", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_dsb_misses, tma_info_iptb", + "ScaleUnit": "100%" }, { - "BriefDescription": "Measured Average Frequency for unhalted processors [GHz]", - "MetricExpr": "Turbo_Utilization * TSC / 1e9 / duration_time", - "MetricGroup": "Power;Summary", - "MetricName": "Average_Frequency" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", + "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_light_operations", + "MetricThreshold": "tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", + "ScaleUnit": "100%" }, { - "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE) + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1e9 / duration_time", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "GFLOPs", - "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine." + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations", + "MetricExpr": "UOPS_DISPATCHED.PORT_2_3 / (2 * tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_load_op_utilization", + "MetricThreshold": "tma_load_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations. Sample with: UOPS_DISPATCHED.PORT_2_3", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average Frequency Utilization relative nominal frequency", - "MetricExpr": "CLKS / CPU_CLK_UNHALTED.REF_TSC", - "MetricGroup": "Power", - "MetricName": "Turbo_Utilization" + "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) DTLB was missed by load accesses, that later on hit in second-level TLB (STLB)", + "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group", + "MetricName": "tma_load_stlb_hit", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0", - "MetricExpr": "CORE_POWER.LVL0_TURBO_LICENSE / CORE_CLKS", - "MetricGroup": "Power", - "MetricName": "Power_License0_Utilization", - "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0. This includes non-AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes." + "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by load accesses, performing a hardware page walk", + "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group", + "MetricName": "tma_load_stlb_miss", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1", - "MetricExpr": "CORE_POWER.LVL1_TURBO_LICENSE / CORE_CLKS", - "MetricGroup": "Power", - "MetricName": "Power_License1_Utilization", - "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1. This includes high current AVX 256-bit instructions as well as low current AVX 512-bit instructions." + "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS.ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10 * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO))) / tma_info_clks", + "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group", + "MetricName": "tma_lock_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOADS_PS. Related metrics: tma_store_latency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX)", - "MetricExpr": "CORE_POWER.LVL2_TURBO_LICENSE / CORE_CLKS", - "MetricGroup": "Power", - "MetricName": "Power_License2_Utilization", - "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX). This includes high current AVX 512-bit instructions." + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to LSD (Loop Stream Detector) unit", + "MetricExpr": "(LSD.CYCLES_ACTIVE - LSD.CYCLES_OK) / tma_info_core_clks / 2", + "MetricGroup": "FetchBW;LSD;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_lsd", + "MetricThreshold": "tma_lsd > 0.15 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 5 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to LSD (Loop Stream Detector) unit. LSD typically does well sustaining Uop supply. However; in some rare cases; optimal uop-delivery could not be reached for small loops whose size (in terms of number of uops) does not suit well the LSD structure.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active", - "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_DISTRIBUTED if #SMT_on else 0)", - "MetricGroup": "SMT", - "MetricName": "SMT_2T_Utilization" + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears", + "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)", + "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "OS", - "MetricName": "Kernel_Utilization" + "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=4@) / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW", + "MetricName": "tma_mem_bandwidth", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory (DRAM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_info_memory_bandwidth, tma_sq_full", + "ScaleUnit": "100%" }, { - "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", - "MetricGroup": "OS", - "MetricName": "Kernel_CPI" + "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_clks - tma_mem_bandwidth", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat", + "MetricName": "tma_mem_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory (DRAM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_info_memory_latency, tma_l3_hit_latency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]", - "MetricExpr": "64 * (arb@event\\=0x81\\,umask\\=0x1@ + arb@event\\=0x84\\,umask\\=0x1@) / 1e6 / duration_time / 1e3", - "MetricGroup": "HPC;Mem;MemoryBW;SoC", - "MetricName": "DRAM_BW_Use" + "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck", + "MetricExpr": "(CYCLE_ACTIVITY.STALLS_MEM_ANY + EXE_ACTIVITY.BOUND_ON_STORES) / (CYCLE_ACTIVITY.STALLS_TOTAL + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL) + EXE_ACTIVITY.BOUND_ON_STORES) * tma_backend_bound", + "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average number of parallel requests to external memory. Accounts for all requests", - "MetricExpr": "UNC_ARB_TRK_OCCUPANCY.ALL / arb@event\\=0x81\\,umask\\=0x1@", - "MetricGroup": "Mem;SoC", - "MetricName": "MEM_Parallel_Requests" + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations -- uops for memory load or store accesses.", + "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_RETIRED.ANY", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_memory_operations", + "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operations > 0.6", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", - "MetricGroup": "Branches;OS", - "MetricName": "IpFarBranch" + "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", + "MetricExpr": "tma_retiring * tma_info_slots / UOPS_ISSUED.ANY * IDQ.MS_UOPS / tma_info_slots", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS", + "MetricName": "tma_microcode_sequencer", + "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches", + "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per core", - "MetricExpr": "cstate_core@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Core_Residency", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage", + "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_clks", + "MetricGroup": "BadSpec;BrMispredicts;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueBM", + "MetricName": "tma_mispredicts_resteers", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_branch_mispredicts, tma_info_branch_misprediction_cost, tma_info_mispredictions", "ScaleUnit": "100%" }, { - "BriefDescription": "C7 residency percent per core", - "MetricExpr": "cstate_core@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Core_Residency", + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", + "MetricExpr": "(IDQ.MITE_CYCLES_ANY - IDQ.MITE_CYCLES_OK) / tma_info_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_mite", + "MetricThreshold": "tma_mite > 0.1 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 5 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sample with: FRONTEND_RETIRED.ANY_DSB_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "C2 residency percent per package", - "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C2_Pkg_Residency", + "BriefDescription": "This metric represents fraction of cycles where (only) 4 uops were delivered by the MITE pipeline", + "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=4@ - cpu@IDQ.MITE_UOPS\\,cmask\\=5@) / tma_info_clks", + "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_mite_group", + "MetricName": "tma_mite_4wide", + "MetricThreshold": "tma_mite_4wide > 0.05 & (tma_mite > 0.1 & (tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 5 > 0.35))", "ScaleUnit": "100%" }, { - "BriefDescription": "C3 residency percent per package", - "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Pkg_Residency", + "BriefDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued", + "MetricExpr": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY", + "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utilized_0_group", + "MetricName": "tma_mixing_vectors", + "MetricThreshold": "tma_mixing_vectors > 0.05", + "PublicDescription": "The Mixing_Vectors metric gives the percentage of injected blend uops out of all uops issued. Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic. Related metrics: tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per package", - "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Pkg_Residency", + "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", + "MetricExpr": "3 * IDQ.MS_SWITCHES / tma_info_clks", + "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { - "BriefDescription": "C7 residency percent per package", - "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Pkg_Residency", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_retiring * tma_info_slots)", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_nop_instructions", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions. Compilers often use NOPs for certain address alignments - e.g. start address of a function or loop body. Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, { - "BriefDescription": "C8 residency percent per package", - "MetricExpr": "cstate_pkg@c8\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C8_Pkg_Residency", + "BriefDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes", + "MetricExpr": "max(0, tma_light_operations - (tma_fp_arith + tma_memory_operations + tma_branch_instructions + tma_nop_instructions))", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_other_light_ops", + "MetricThreshold": "tma_other_light_ops > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes. May undercount due to FMA double counting", "ScaleUnit": "100%" }, { - "BriefDescription": "C9 residency percent per package", - "MetricExpr": "cstate_pkg@c9\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C9_Pkg_Residency", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch)", + "MetricExpr": "UOPS_DISPATCHED.PORT_0 / tma_info_core_clks", + "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_0", + "MetricThreshold": "tma_port_0 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "C10 residency percent per package", - "MetricExpr": "cstate_pkg@c10\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C10_Pkg_Residency", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU)", + "MetricExpr": "UOPS_DISPATCHED.PORT_1 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_1", + "MetricThreshold": "tma_port_1 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU)", + "MetricExpr": "UOPS_DISPATCHED.PORT_5 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_5", + "MetricThreshold": "tma_port_5 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU)", + "MetricExpr": "UOPS_DISPATCHED.PORT_6 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_6", + "MetricThreshold": "tma_port_6 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple ALU). Sample with: UOPS_DISPATCHED.PORT_6. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", + "MetricExpr": "((cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=0x80@ + tma_serializing_operation * (CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY) + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / tma_info_clks if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_clks)", + "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_ports_utilization", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=0x80@ / tma_info_clks + tma_serializing_operation * (CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY) / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_0", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_1", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related metrics: tma_l1_bound", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Sample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_3m", + "MetricThreshold": "tma_ports_utilized_3m > 0.7 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Sample with: UOPS_EXECUTED.CYCLES_GE_3", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1", + "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations", + "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL5;tma_L5_group;tma_issueSO;tma_ports_utilized_0_group", + "MetricName": "tma_serializing_operation", + "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)))", + "PublicDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations. Instructions like CPUID; WRMSR or LFENCE serialize the out-of-order execution which may limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metrics: tma_ms_switches", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions", + "MetricExpr": "140 * MISC_RETIRED.PAUSE_INST / tma_info_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_serializing_operation_group", + "MetricName": "tma_slow_pause", + "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_operation > 0.1 & (tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions. Sample with: MISC_RETIRED.PAUSE_INST", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary", + "MetricExpr": "tma_info_load_miss_real_latency * LD_BLOCKS.NO_SR / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_split_loads", + "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents rate of split store accesses", + "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group", + "MetricName": "tma_split_stores", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_4", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", + "MetricExpr": "L1D_PEND_MISS.L2_STALL / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group", + "MetricName": "tma_sq_full", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_fb_full, tma_info_dram_bw_use, tma_info_memory_bandwidth, tma_mem_bandwidth", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write", + "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES_PS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", + "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_store_fwd_blk", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", + "MetricExpr": "(L2_RQSTS.RFO_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_clks", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group", + "MetricName": "tma_store_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations", + "MetricExpr": "(UOPS_DISPATCHED.PORT_4_9 + UOPS_DISPATCHED.PORT_7_8) / (4 * tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_store_op_utilization", + "MetricThreshold": "tma_store_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations. Sample with: UOPS_DISPATCHED.PORT_7_8", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of cycles where the TLB was missed by store accesses, hitting in the second-level TLB (STLB)", + "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", + "MetricName": "tma_store_stlb_hit", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk", + "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_clks", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", + "MetricName": "tma_store_stlb_miss", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores", + "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueSmSt;tma_store_bound_group", + "MetricName": "tma_streaming_stores", + "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should Streaming stores be a bottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma_fb_full", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears", + "MetricExpr": "10 * BACLEARS.ANY / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group", + "MetricName": "tma_unknown_branches", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (e.g. first time the branch is fetched or hitting BTB capacity limit). Sample with: BACLEARS.ANY", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric serves as an approximation of legacy x87 usage", + "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD", + "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", + "MetricName": "tma_x87_use", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Percentage of cycles in aborted transactions.", + "MetricExpr": "max(cpu@cycles\\-t@ - cpu@cycles\\-ct@, 0) / cycles", + "MetricGroup": "transaction", + "MetricName": "tsx_aborted_cycles", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Number of cycles within a transaction divided by the number of elisions.", + "MetricExpr": "cpu@cycles\\-t@ / cpu@el\\-start@", + "MetricGroup": "transaction", + "MetricName": "tsx_cycles_per_elision", + "ScaleUnit": "1cycles / elision" + }, + { + "BriefDescription": "Number of cycles within a transaction divided by the number of transactions.", + "MetricExpr": "cpu@cycles\\-t@ / cpu@tx\\-start@", + "MetricGroup": "transaction", + "MetricName": "tsx_cycles_per_transaction", + "ScaleUnit": "1cycles / transaction" + }, + { + "BriefDescription": "Percentage of cycles within a transaction region.", + "MetricExpr": "cpu@cycles\\-t@ / cycles", + "MetricGroup": "transaction", + "MetricName": "tsx_transactional_cycles", "ScaleUnit": "100%" } ] diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/uncore-other.json b/tools/perf/pmu-events/arch/x86/tigerlake/uncore-other.json index e2ea5ccfe3bc..a5a254327ae9 100644 --- a/tools/perf/pmu-events/arch/x86/tigerlake/uncore-other.json +++ b/tools/perf/pmu-events/arch/x86/tigerlake/uncore-other.json @@ -24,7 +24,7 @@ "Unit": "ARB" }, { - "BriefDescription": "Number of coherent read requests sent to memory controller that were issued by any core.", + "BriefDescription": "This event is deprecated. Refer to new event UNC_ARB_REQ_TRK_REQUEST.DRD", "EventCode": "0x81", "EventName": "UNC_ARB_DAT_REQUESTS.RD", "PerPkg": "1", @@ -40,7 +40,15 @@ "Unit": "ARB" }, { - "BriefDescription": "This event is deprecated. Refer to new event UNC_ARB_DAT_REQUESTS.RD", + "BriefDescription": "Each cycle count number of 'valid' coherent Data Read entries . Such entry is defined as valid when it is allocated till deallocation. Doesn't include prefetches [This event is alias to UNC_ARB_TRK_OCCUPANCY.RD]", + "EventCode": "0x80", + "EventName": "UNC_ARB_REQ_TRK_OCCUPANCY.DRD", + "PerPkg": "1", + "UMask": "0x2", + "Unit": "ARB" + }, + { + "BriefDescription": "Number of all coherent Data Read entries. Doesn't include prefetches [This event is alias to UNC_ARB_TRK_REQUESTS.RD]", "EventCode": "0x81", "EventName": "UNC_ARB_REQ_TRK_REQUEST.DRD", "PerPkg": "1", @@ -55,6 +63,14 @@ "UMask": "0x1", "Unit": "ARB" }, + { + "BriefDescription": "Each cycle count number of 'valid' coherent Data Read entries . Such entry is defined as valid when it is allocated till deallocation. Doesn't include prefetches [This event is alias to UNC_ARB_REQ_TRK_OCCUPANCY.DRD]", + "EventCode": "0x80", + "EventName": "UNC_ARB_TRK_OCCUPANCY.RD", + "PerPkg": "1", + "UMask": "0x2", + "Unit": "ARB" + }, { "BriefDescription": "UNC_ARB_TRK_REQUESTS.ALL", "EventCode": "0x81", @@ -63,6 +79,14 @@ "UMask": "0x1", "Unit": "ARB" }, + { + "BriefDescription": "Number of all coherent Data Read entries. Doesn't include prefetches [This event is alias to UNC_ARB_REQ_TRK_REQUEST.DRD]", + "EventCode": "0x81", + "EventName": "UNC_ARB_TRK_REQUESTS.RD", + "PerPkg": "1", + "UMask": "0x2", + "Unit": "ARB" + }, { "BriefDescription": "UNC_CLOCK.SOCKET", "EventCode": "0xff", From patchwork Sun Feb 19 09:28:28 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59123 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp775965wrn; Sun, 19 Feb 2023 01:46:48 -0800 (PST) X-Google-Smtp-Source: AK7set+WiR8SyhTXNfxkCf+zI32uNTe9tJR2hkcUP9dkJTgQWQcKgcUsebj0muuvJsCt84TL1/R7 X-Received: by 2002:a17:902:d591:b0:19c:190d:a55e with SMTP id k17-20020a170902d59100b0019c190da55emr4315716plh.65.1676800008043; Sun, 19 Feb 2023 01:46:48 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676800008; cv=none; d=google.com; s=arc-20160816; b=FXt2CxGcr7/vazfmWDRyPV36g84qwzQoSmcPb8S96sba49SeWufk8TQ2Rp7FWnuPG3 roiZqmwV4HScswxP99WHbi4YxDNIqXtlAJRlwwy4iV5jT79XSZhHIsooxSwJA1xDkg+s Y+zBOhyLRvqu45cYzlNF2/w99f44U9Ogh0JC8O08n2EqcMXYy8Zg24WJVoHRVb8/x+0M BPgry0zDifNjXFPYtMvWqHkktfn8ERQ32GshLnj9fTRhNZZlRL2HZVW4ldNTaj7E453i 7AAP/tLceK6OTMzoj7LJsC+/ufbWBEOyLeKPMtl4ttea56TvfjsAqePTcEB/PseeQqmd 3fBQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=IJq6auQeIidM24mfR/8eo3ksFWhAxhvnpDKY7OwBaGQ=; b=lZ21zQtD4QEA2CYuy9e3TwUpIfjL/Fb8wHl3BYfZV/HN3ceSl2ho/fBIuO2hFBQbYM gWlD+tZ1HRPm8wmMQ8EpCtIFGMAcAIY8RuySlIcY2eG8GeTIUe0oEq7VpKplqa35ziKe Ntyk35CPPZOhHK42JyyauFO23yq9SsxRwct2zM2GHhhWQKFqp8uTxSxDeOQaocb0hIvs jPms6QG9aBWtIL9dNcBOKjdaiqAJnQhTLKpm/R477w09AEF11ZpKk6Q07CUDtMFSJkQK cB8zbfIXo5lSxRuvTbyU9VY+jnGbbM2wl18KPtbH7pamRao9XNQliqU/80p8A9AsIFvl qOhg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=I7CnQOuW; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id bc1-20020a170902930100b0019a826d3050si1730213plb.636.2023.02.19.01.46.36; Sun, 19 Feb 2023 01:46:48 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=I7CnQOuW; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230038AbjBSJh0 (ORCPT + 99 others); Sun, 19 Feb 2023 04:37:26 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59694 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229700AbjBSJhY (ORCPT ); Sun, 19 Feb 2023 04:37:24 -0500 Received: from mail-yb1-f202.google.com (mail-yb1-f202.google.com [209.85.219.202]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5CEF412580 for ; Sun, 19 Feb 2023 01:36:33 -0800 (PST) Received: by mail-yb1-f202.google.com with SMTP id k131-20020a256f89000000b0074747131938so222653ybc.12 for ; Sun, 19 Feb 2023 01:36:33 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=IJq6auQeIidM24mfR/8eo3ksFWhAxhvnpDKY7OwBaGQ=; b=I7CnQOuWj57M+eY5KPLUsxlKb0JE3mtO9u5wKbbUpyvJmy+FSZG+GD25OOiBvunIGK Q2dHAsWmeS/qQ2obRvPvqxghBPvmo9xgjx5OjGAKmTc0mkIK9Y/AldxC7enwDlI4MWTt nR/l/y/IJF+yvccRElbcBsPK+/7hGdcJeqKbmIc9JCFFuv9oTrEoI0/ujyZ5AvH92wC/ EWoa3QlfPbYhWYZYHRA9iDFVl9xjc2KK7BL4RV4MYYfzreMAZCNLDg6ZS77C3hgZ8snp OZaDgveoVLXlEl2+8JzeckNGBVk4n0JGZzDd/DNEEguY7CzJddiyOUYqxZVM54HziTNG FPVw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=IJq6auQeIidM24mfR/8eo3ksFWhAxhvnpDKY7OwBaGQ=; b=BI6py0GpJnjVHsl2f/RyQetCwqaf4jNUhad65of9wfugO9C1ZgLTnAzh2A+lGAzx/5 GkJu3sHo+JMHU/NchorAkkOSO6uwD9RLKE79f8993s1LZrW5PaKgF7dhgaxK4EbqpHmo Ce6V+qyjg7J/VquwtxfwFkpHkWhDSgS+diZIeuDIVrMNHnGx6eQ88Z7GrWqcwqbf/vY/ GJKg9L11pyesFlW/MIjAsPS5Ay1Bpi0xkSTK1Lv1wsRTJYyu48NUSTnTAjIvFlPmi9AG gt+inEh8EX85SUzURX603XiZ7fvrQ+PzcsdMMRmU1F/vkgbDB3ClLDusjUy3ErWwcWSc bVew== X-Gm-Message-State: AO0yUKWwKZ2IBCRJo9JwX5qZDhw3F+rIrBuci33GeeLInk6FVh8G1aHH 5YjT4zVi8sFYfWz6O0CNj8kLpqT08rmN X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a81:af07:0:b0:534:2d49:790a with SMTP id n7-20020a81af07000000b005342d49790amr113313ywh.0.1676799199716; Sun, 19 Feb 2023 01:33:19 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:28 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-32-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 31/51] perf vendor events intel: Refresh westmereep-dp events From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758252245306888422?= X-GMAIL-MSGID: =?utf-8?q?1758252245306888422?= Update the westmereep-dp events from 3 to 4. Generation was done using https://github.com/intel/perfmon. The most notable change is in corrections to event descriptions. Signed-off-by: Ian Rogers --- tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- tools/perf/pmu-events/arch/x86/westmereep-dp/cache.json | 2 +- .../perf/pmu-events/arch/x86/westmereep-dp/virtual-memory.json | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv index bc2c4e756f44..1c6eef118e61 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -28,7 +28,7 @@ GenuineIntel-6-(4E|5E|8E|9E|A5|A6),v54,skylake,core GenuineIntel-6-55-[01234],v1.29,skylakex,core GenuineIntel-6-86,v1.20,snowridgex,core GenuineIntel-6-8[CD],v1.10,tigerlake,core -GenuineIntel-6-2C,v3,westmereep-dp,core +GenuineIntel-6-2C,v4,westmereep-dp,core GenuineIntel-6-25,v3,westmereep-sp,core GenuineIntel-6-2F,v3,westmereex,core AuthenticAMD-23-([12][0-9A-F]|[0-9A-F]),v2,amdzen1,core diff --git a/tools/perf/pmu-events/arch/x86/westmereep-dp/cache.json b/tools/perf/pmu-events/arch/x86/westmereep-dp/cache.json index 5c897da3cd6b..4dae735fb636 100644 --- a/tools/perf/pmu-events/arch/x86/westmereep-dp/cache.json +++ b/tools/perf/pmu-events/arch/x86/westmereep-dp/cache.json @@ -182,7 +182,7 @@ "UMask": "0x20" }, { - "BriefDescription": "L2 lines alloacated", + "BriefDescription": "L2 lines allocated", "EventCode": "0xF1", "EventName": "L2_LINES_IN.ANY", "SampleAfterValue": "100000", diff --git a/tools/perf/pmu-events/arch/x86/westmereep-dp/virtual-memory.json b/tools/perf/pmu-events/arch/x86/westmereep-dp/virtual-memory.json index ef635bff1522..f75084309041 100644 --- a/tools/perf/pmu-events/arch/x86/westmereep-dp/virtual-memory.json +++ b/tools/perf/pmu-events/arch/x86/westmereep-dp/virtual-memory.json @@ -56,7 +56,7 @@ "UMask": "0x80" }, { - "BriefDescription": "DTLB misses casued by low part of address", + "BriefDescription": "DTLB misses caused by low part of address", "EventCode": "0x49", "EventName": "DTLB_MISSES.PDE_MISS", "SampleAfterValue": "200000", From patchwork Sun Feb 19 09:28:29 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59134 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp779107wrn; Sun, 19 Feb 2023 02:00:04 -0800 (PST) X-Google-Smtp-Source: AK7set+/VggcW/ioIGPeavBCgf+kB5ILX+x8Lf5tNuabWRLPDbQWv51znxcFjBcZ+gAXKFzx3AFP X-Received: by 2002:a17:90b:1bca:b0:233:be7b:e71c with SMTP id oa10-20020a17090b1bca00b00233be7be71cmr2519438pjb.5.1676800804100; Sun, 19 Feb 2023 02:00:04 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676800804; cv=none; d=google.com; s=arc-20160816; b=YbB8rKRYmkfpIaNFOk5cylJQMyrYnMN9biWUj6pw5+qilwOfLr7YB/KbyOu1nPKUVt ebqd2ImGUnL7R2bsFHtZw0PtGe/zRseWAOKgnJP38kPoFJKQ5tdzj1OP9Zz1YbGJ2VZ1 E5OoSrUkhfMhnW0YLqm+zftXvhdV00RTxmwKfytgZ0GUEw8etsYQSNrQU3vIYNNkl9na Q0B/dEuuMQM1zdH20IwzxX/qQgXFouL5ur0eMI0Abgk1RFKLvfHlIX+NHddpVr+pQyYe plNeM+l8gLx3hk3Pc6PbvIPcxGuqoy9P07NaPOwVJlFBvNaFiFUXAjIz+/kab2FgiwEJ GSLg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=aDHxoNK+iGgerGSxgoF4xDhtMfF1M1KB0gmMhIH6g+o=; b=yo6b46xajGUTns/H+x7YgVLLabdlb2I39qSza7rR72SlkvxiXF1hWpf/9NjIlAU+2f bSNpxtSEwtVe18UlwShtRZ7TWqCZyQWGOyHzz20MPA3EDF/uqqwTFvx0cFwWRc4sldAj TNESEc4Uu8x2GhHMR5n+gOLFh/bYyEtGAnFxJabCIQJjgaKneethagN7csEYgwhHQ3F6 vjy1/K9qWMXqCAZ8//OoU8aqIcqoLanrHCEkqiGY8ln99zuM8WX2/ecVcz0Yiu/mqVro pjuter3qLaGkJFs01aAcjAoNnOpnSmIoQEX5ZK7mCDcTWlmpHkQpnG8O9ocEf2Bc/PDa 0nFw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=szVoKRw2; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id 6-20020a17090a174600b00232dd9ab146si10813482pjm.13.2023.02.19.01.59.51; Sun, 19 Feb 2023 02:00:04 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=szVoKRw2; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230173AbjBSJtL (ORCPT + 99 others); Sun, 19 Feb 2023 04:49:11 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44718 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230061AbjBSJtJ (ORCPT ); Sun, 19 Feb 2023 04:49:09 -0500 Received: from mail-oa1-x4a.google.com (mail-oa1-x4a.google.com [IPv6:2001:4860:4864:20::4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8F542D523 for ; Sun, 19 Feb 2023 01:48:18 -0800 (PST) Received: by mail-oa1-x4a.google.com with SMTP id 586e51a60fabf-171d5b265acso352174fac.2 for ; Sun, 19 Feb 2023 01:48:18 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=aDHxoNK+iGgerGSxgoF4xDhtMfF1M1KB0gmMhIH6g+o=; b=szVoKRw23j5Oh2tOkbAIG0IvtMpMsOrvW9A3e7kcO9it79BlwWVgNPV/uB9d0sO7u/ I5P/KixArxBv49X8CI3tQxLa1twzSrl7TRDwEq1x2y6kljyqBkBCYoIMwLoi+xqN41ew 1YfVyzeiNzV35kAtc0w1HbeMjjubJLR48JfNp8ElXZGSO/WXsj10RUfCtCXN1HnCxZyR j2zQAAH13HqlMas9+N26ILACb+ttaBbF6Jt9XRfpsIyNeIBd6pDcvMVo8sZ+Ar6TSVKI HOY1Q4URcucKu7ff34heQu958Cgzod/PuzJPQJrHwtSrEsUCVeat8sQUeniU8nlRMHHv 3Wkw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=aDHxoNK+iGgerGSxgoF4xDhtMfF1M1KB0gmMhIH6g+o=; b=f44Oqjc16ykTFwqcy7E8lxuhnq0iSP2V5p1s/eW5K9SAsvTOnKSdwNhN9vBT6R73SU ItI3NaIggeQj+nBcr1RPQDStwn4w9te4bKs4PVMZtpm2EZHR8gkvWPkJyd09cvTlSOMK 7bpXzT/mO9kYpbHS5fHPqBGwqFwgycT+aTkJW1LuC+rHcCTz0M5orRQnr8cRsS/mL3Vv 0xDVnOGGT4eXimLvE/ephd6peZQkA0G/MM7Y3stJE2BwdYkM8Y0v1NkPAl6ejRF4e/or IpvbH8XF/zfCWCsHxICL+VrrYHSubyszL09Ikyb9iwbbyCwxtCgpuo25CbdheQTXZZOs Firw== X-Gm-Message-State: AO0yUKW9sjnNTc1PBeLtdn0UtexW2C7vrUUyQsZp+jSivCpyYNvc3EHD j/BjiZrSV+etWGYKDscTvlaL/ZicWrXC X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a05:6902:287:b0:8da:cda:7a50 with SMTP id v7-20020a056902028700b008da0cda7a50mr9710ybh.11.1676799207878; Sun, 19 Feb 2023 01:33:27 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:29 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-33-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 32/51] perf jevents: Add rand support to metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758253079981589779?= X-GMAIL-MSGID: =?utf-8?q?1758253079981589779?= rand (reverse and) is useful in the parsing of metric thresholds. Update the documentation on operator precedence to clarify the simple expression parser and python differences wrt binary/logical operators. Signed-off-by: Ian Rogers --- tools/perf/pmu-events/metric.py | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/tools/perf/pmu-events/metric.py b/tools/perf/pmu-events/metric.py index 77ea6ff98538..8ec0ba884673 100644 --- a/tools/perf/pmu-events/metric.py +++ b/tools/perf/pmu-events/metric.py @@ -44,6 +44,9 @@ class Expression: def __and__(self, other: Union[int, float, 'Expression']) -> 'Operator': return Operator('&', self, other) + def __rand__(self, other: Union[int, float, 'Expression']) -> 'Operator': + return Operator('&', other, self) + def __lt__(self, other: Union[int, float, 'Expression']) -> 'Operator': return Operator('<', self, other) @@ -88,7 +91,10 @@ def _Constify(val: Union[bool, int, float, Expression]) -> Expression: # Simple lookup for operator precedence, used to avoid unnecessary -# brackets. Precedence matches that of python and the simple expression parser. +# brackets. Precedence matches that of the simple expression parser +# but differs from python where comparisons are lower precedence than +# the bitwise &, ^, | but not the logical versions that the expression +# parser doesn't have. _PRECEDENCE = { '|': 0, '^': 1, From patchwork Sun Feb 19 09:28:30 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59136 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp785633wrn; Sun, 19 Feb 2023 02:20:51 -0800 (PST) X-Google-Smtp-Source: AK7set/5l5z8sc9nTjHjHzid+fRRlhKlnA6tV63vo/FeDxJzb/wgTircGx/lWcRwiw/AxjPxz6LK X-Received: by 2002:a17:902:c401:b0:196:4814:2a33 with SMTP id k1-20020a170902c40100b0019648142a33mr3008902plk.42.1676802050787; Sun, 19 Feb 2023 02:20:50 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676802050; cv=none; d=google.com; s=arc-20160816; b=gcrdHzbwnXWY8jsPV+IirDFvsWhcdDKeG9wYl/Q93BmX2eXbKc1dVQdvaixPdqbfII pMJEkDX/+Rs09exqUGlVZkNmDtqzW+9stzHd5FS+sR9dcp92mJtkzXzo7Oc1jMAZgaCP rGyK196ur6deaSJD5zyr84U3E5+E20kYGjVzAfq8csFCmYBuP4fZZVeOnfHPFCrwBY2t crlv5g8mVJjkSv9RYUx57LitCvVtP6aydAdd5qQCh4929tuCeNhuzfYPl7DEPKv22N5L lOIMpdWCt8RyPWs0wWRGtSwTDW6DbmVHeMigm7zKTy7tIOGdQNOPyAI+mJpzp/OAcKmx ir7Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=8oK06fVQN6LXVRqRBpBf6uq7iBLthLemaenQUOot5L0=; b=lagFGo0dLfs2hxkVNfc7+87LiuHvP9XE2hDi2imsciEy+ku6Ypi22YX3JinudyWkwX SFVdQUeteyUihyj4JM2yKZk3QSKaYoz1N7mKllWDrVN7FnzsR0gR7d3uW+DPWeEGOvB+ qUf78y8ntCHYkSht2+JXLhNaY4FaFV+GljzVlRMxKGouOwQ4pDRBLSi//fBHBINSsNYO US1pDUtJOcTwMBTCICt3pSf2qVg7NIbpY06b/mnvdkQaE10Q6slxSPgGcsOhxh6nC3Lc WQ9HX/uZCw9oO0RqAPVrNQEw0PNOA4kw75F1/bI7W7MQWzfC0KUoyXdYNIGDSy6wnYx4 b1aQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b="c/IhxSXd"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id g12-20020a170902c38c00b00199563fff14si5608825plg.410.2023.02.19.02.20.36; Sun, 19 Feb 2023 02:20:50 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b="c/IhxSXd"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230149AbjBSJr1 (ORCPT + 99 others); Sun, 19 Feb 2023 04:47:27 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:42316 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230145AbjBSJrY (ORCPT ); Sun, 19 Feb 2023 04:47:24 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5EC7112BC4 for ; Sun, 19 Feb 2023 01:46:52 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id f33-20020a25b0a1000000b009433a21be0dso2123448ybj.19 for ; Sun, 19 Feb 2023 01:46:52 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=8oK06fVQN6LXVRqRBpBf6uq7iBLthLemaenQUOot5L0=; b=c/IhxSXdMkvdWRTmNwDGCw1H+VDaY0b3orkL+EP+I6StfnTz8H8pSrT5peoX5iGhyL aZJVTGxQcotqY/9XbrXu5W8a7dwuTOKV2N61EKGNWX+G6n+kxuWgzbZDVX14qVNHNX+s fs5RmxJ4NrWIuhyym0Yg7NUOknjeXIWnd1ZkQeNcEc6sxObdx3t563ad/6YllU0XSRVH fEBlJh6xYkiCZVrIz2FJ28m1/wK+9QXfHWnY7uNt0V/Brg7V6nAD5888J6y3CRDbKjZN b08tJOGA3QuQgVGwsV5oJ+JxevsQJNE5g65XNpdn5E/qZEfMghvi8DBRe4DWYkzPbiRl Q1+w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=8oK06fVQN6LXVRqRBpBf6uq7iBLthLemaenQUOot5L0=; b=wdWNjv6xteuc8VFUp/oGNOEQxldgKDUnVW3PvOzsFOr7+NrhnBfvxiXAyCslrEvfEt TSS2zF0Ug3eJSZwdw66IN0VZAH8dsmNgbjZzF9fw18NEP2ee1kYhhC6/UAphCpuhIbTR 7ElVgUv1IU0v0w5pAdBpgiGtNI7GlMa48aJr84JGMY0YiJLL6qEJal3tNKLMnLHdGhjG 3rExR9DFXamJaqgx3aF7dTNXVIqQfFyecMk5A4TZs6J50djjhQ+nvXgG94vst99OYere CK9YzfyzzaMmOYPXlBOyYev+C5EEoDGN3G8dK64AXjFlFPhyQHmmuGObJr56sNbIX5fE /4ow== X-Gm-Message-State: AO0yUKX05eYmhopCtakjyEokzRaTIHsojDLxHN8ofMXh5ajnWbzncIRt +zA5zyC2Ghkl+aoVOEsFURSSoPpHijVq X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a25:9ac4:0:b0:967:e782:c03 with SMTP id t4-20020a259ac4000000b00967e7820c03mr884748ybo.430.1676799216733; Sun, 19 Feb 2023 01:33:36 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:30 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-34-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 33/51] perf jevent: Parse metric thresholds From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758254387265662046?= X-GMAIL-MSGID: =?utf-8?q?1758254387265662046?= Parse the metric threshold and add to the pmu-events.c file. The metric isn't parsed as the parser uses python's parser and will break the operator precedence. Signed-off-by: Ian Rogers --- tools/perf/pmu-events/jevents.py | 7 ++++++- tools/perf/pmu-events/pmu-events.h | 1 + 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/tools/perf/pmu-events/jevents.py b/tools/perf/pmu-events/jevents.py index e82dff3a1228..40b9e626fc15 100755 --- a/tools/perf/pmu-events/jevents.py +++ b/tools/perf/pmu-events/jevents.py @@ -51,7 +51,7 @@ _json_event_attributes = [ # Attributes that are in pmu_metric rather than pmu_event. _json_metric_attributes = [ - 'metric_name', 'metric_group', 'metric_expr', 'desc', + 'metric_name', 'metric_group', 'metric_expr', 'metric_threshold', 'desc', 'long_desc', 'unit', 'compat', 'aggr_mode', 'event_grouping' ] # Attributes that are bools or enum int values, encoded as '0', '1',... @@ -306,6 +306,9 @@ class JsonEvent: self.metric_expr = None if 'MetricExpr' in jd: self.metric_expr = metric.ParsePerfJson(jd['MetricExpr']).Simplify() + # Note, the metric formula for the threshold isn't parsed as the & + # and > have incorrect precedence. + self.metric_threshold = jd.get('MetricThreshold') arch_std = jd.get('ArchStdEvent') if precise and self.desc and '(Precise Event)' not in self.desc: @@ -362,6 +365,8 @@ class JsonEvent: # Convert parsed metric expressions into a string. Slashes # must be doubled in the file. x = x.ToPerfJson().replace('\\', '\\\\') + if metric and x and attr == 'metric_threshold': + x = x.replace('\\', '\\\\') if attr in _json_enum_attributes: s += x if x else '0' else: diff --git a/tools/perf/pmu-events/pmu-events.h b/tools/perf/pmu-events/pmu-events.h index 57a38e3e5c32..b7dff8f1021f 100644 --- a/tools/perf/pmu-events/pmu-events.h +++ b/tools/perf/pmu-events/pmu-events.h @@ -54,6 +54,7 @@ struct pmu_metric { const char *metric_name; const char *metric_group; const char *metric_expr; + const char *metric_threshold; const char *unit; const char *compat; const char *desc; From patchwork Sun Feb 19 09:28:31 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59120 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp775670wrn; Sun, 19 Feb 2023 01:45:39 -0800 (PST) X-Google-Smtp-Source: AK7set+TMz3AugUM6x30ImRBecPPc/TSyHOXdZOWYy7foTQXbhV78WfdIrjJBwfyI4UG3LkbZdqM X-Received: by 2002:a17:903:1246:b0:19a:ea48:7301 with SMTP id u6-20020a170903124600b0019aea487301mr340878plh.66.1676799938765; Sun, 19 Feb 2023 01:45:38 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799938; cv=none; d=google.com; s=arc-20160816; b=F2kEt87AXArN7dfCfQzYjoEth3xh2rELPTYHPjp8aYBUZebWXKyXd97foXiDm4jNKQ n2adGGha6y8EXIaCcn5guX6sAxGgxTajY6txiDRbGsU5v7wBQVMcJBB4PJa9KZOdg80u uv2IjoYsU0qJDu6nW2vZ/o8uVKwJItYI8J/VkhLbdOwGKK2monK0qO4+4oUVz5tqr94q 5a7tX2ITAtqL+zkcfdgYi7r5GEHPEp/102/ZMd8wBC5lwa9TS4iUH2YeN9gHah3x2ZAS 3IiZPX/+HvQyXAK/NBicCfHISY1sK0ZX2gKiYVAvxyYgfdHWVRDt7n8uZE1RDViXvMbN e/RQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=Q1iaZVlYNhiAOA5cs/+UwDX1JeW2DyTfNmupgRycGXY=; b=HNdyPAYmx8y2KemS7MWsfJeQPb/1wdJzgk7WGonjsjRnzIoznzAKBW1DsRpUrrIfFi rVvIyYkrYbqujYZHQo9LnT4F7g5Z43WyASsZRuLpdC9b363Z8pjqJ2iLmBfOKu14ZzPU /eTC61LyIGbki+hm62F0cHpogwNg51NAxz+ZJltlh1NRCfIJmYEhY7KRRs3k0T5QHLS7 P9IXNgKXlTEd1n9zQZnX+dwalGY894Sie2W7MRNzO4g9cUsXgm/rEuOdw9fEMwl9wMOn PwFRC2IF9T/LCUjCY3vw2RLhevG1sWvmkeXsotrvNjLoFm2x3Zkidb5aguhEwXyozmAb tKWQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=GXX1CCWT; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id j3-20020a170903028300b001926cc639ecsi7105171plr.487.2023.02.19.01.45.27; Sun, 19 Feb 2023 01:45:38 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=GXX1CCWT; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229651AbjBSJgl (ORCPT + 99 others); Sun, 19 Feb 2023 04:36:41 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58814 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230022AbjBSJgj (ORCPT ); Sun, 19 Feb 2023 04:36:39 -0500 Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com [IPv6:2607:f8b0:4864:20::1149]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 95BC2126F1 for ; Sun, 19 Feb 2023 01:35:48 -0800 (PST) Received: by mail-yw1-x1149.google.com with SMTP id 00721157ae682-536885323c1so17996037b3.15 for ; Sun, 19 Feb 2023 01:35:48 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=Q1iaZVlYNhiAOA5cs/+UwDX1JeW2DyTfNmupgRycGXY=; b=GXX1CCWTT7DpChHpiAo8vxK/43ddF9gck9mQoSZ2hA08LIJ0U1Cu7/SFYa9EJvJFxe m8KaqG9d0IPv1H+awLg+uCW+JyHffDXi8RxqbbHwMymD1tyKhcsbVu23oOmaI7vI3Ggo A8m02HvwUYluQ/hwtU5heVipo9Wh/I0ByZJ263Ywidlya5sIiWqmTKfXXSjdJB+GukoV TUk+iK3GL6mVBoiR/DsqTD45H27yUEHyxZf4toav9JGdIbqj/1fHLuj6CnEuBf5HvRQU AhDI6PM1KMSvWtTcjBjzp7iRHFEfMo2PY+iby22NhQ0i2XGrdBrVZdFwwu7jGnEkvvA1 P5NQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=Q1iaZVlYNhiAOA5cs/+UwDX1JeW2DyTfNmupgRycGXY=; b=S3yNglxsSfJcMJ1Gioyloib6GxEHGCu7e71UqEARRGGpldMm0niCK/ZXtozrKdKnPV eyypy9n7oprL+uW7beu/x74fVxlo8kY/ssobVAtKpAqTEPdPLVJVTgpsyES9Et3RVCdE 8SgZ3nCKvHvGeQi6R6ZCqmx9nLlLNBl6dNBZA9LO6s9jU6Kc0TaGI8SBOOYauekBnONJ m2J1Y9oSFjqHMRP1MlZ/7eYMLSe6LhuvzE5UN05dv7L2dc2fwzsRvcCS13qOxcZperc2 FH0cU0pz/dzRfo1IatiszHFHAXpu8zqWkND/s0au2NkwWgINv+ktX6idZfKdHATuQNOp rqMg== X-Gm-Message-State: AO0yUKUyv7YY8TEs+Hu8l8bZbNvrHRu2qbu3Fo6idEgl2RekiRJ4Wxkj lIpzCM58sY4PSweTCYrre7qKBkanb2iS X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a0d:ee44:0:b0:52e:e6ed:30a1 with SMTP id x65-20020a0dee44000000b0052ee6ed30a1mr1338109ywe.545.1676799225232; Sun, 19 Feb 2023 01:33:45 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:31 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-35-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 34/51] perf pmu-events: Test parsing metric thresholds with the fake PMU From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758252172545376661?= X-GMAIL-MSGID: =?utf-8?q?1758252172545376661?= Test the correctness of metric thresholds by testing them all with the fake PMU logic. Signed-off-by: Ian Rogers --- tools/perf/tests/pmu-events.c | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/tools/perf/tests/pmu-events.c b/tools/perf/tests/pmu-events.c index 521557c396bc..db2fed0c6993 100644 --- a/tools/perf/tests/pmu-events.c +++ b/tools/perf/tests/pmu-events.c @@ -1021,12 +1021,34 @@ static int test__parsing_fake(struct test_suite *test __maybe_unused, return pmu_for_each_sys_metric(test__parsing_fake_callback, NULL); } +static int test__parsing_threshold_callback(const struct pmu_metric *pm, + const struct pmu_metrics_table *table __maybe_unused, + void *data __maybe_unused) +{ + if (!pm->metric_threshold) + return 0; + return metric_parse_fake(pm->metric_name, pm->metric_threshold); +} + +static int test__parsing_threshold(struct test_suite *test __maybe_unused, + int subtest __maybe_unused) +{ + int err = 0; + + err = pmu_for_each_core_metric(test__parsing_threshold_callback, NULL); + if (err) + return err; + + return pmu_for_each_sys_metric(test__parsing_threshold_callback, NULL); +} + static struct test_case pmu_events_tests[] = { TEST_CASE("PMU event table sanity", pmu_event_table), TEST_CASE("PMU event map aliases", aliases), TEST_CASE_REASON("Parsing of PMU event table metrics", parsing, "some metrics failed"), TEST_CASE("Parsing of PMU event table metrics with fake PMUs", parsing_fake), + TEST_CASE("Parsing of metric thresholds with fake PMUs", parsing_threshold), { .name = NULL, } }; From patchwork Sun Feb 19 09:28:32 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59142 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp811578wrn; Sun, 19 Feb 2023 03:49:29 -0800 (PST) X-Google-Smtp-Source: AK7set+aO+0t2AXme29C1JpemGgrlknscm5rXEE8VnQpkbhULj4srTwBDdphNPL2aU0/L2py5Gc4 X-Received: by 2002:a17:90b:4b02:b0:234:13a3:6e65 with SMTP id lx2-20020a17090b4b0200b0023413a36e65mr3318619pjb.16.1676807369393; Sun, 19 Feb 2023 03:49:29 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676807369; cv=none; d=google.com; s=arc-20160816; b=G1urM7uLjFq8Up5pFIT7/MA5kNeQfeiH5PMuIft01HHMUrvGNnB1ZgliRehwVzNVJL pBhvJJA4YeEAQb0C5q183Kz1t5J3XCqcfS+gJmIQm7eQeIYbgKWm4EslXoEgi2b3a+eG 7k0Bth+Ujc3dkT/DH/F3rSo/8EWGBB8/wxr2mXHSLg+a+ycS9aEKx8lMS/k+ihvwInHY yDEreJhn8LGsawB8Q+CdbJJIf/vOCL7CgOzifzFK45YLN2CyAt2TWzZGLJr7x6f6Qa8t 0BF7w21NbhKexzadA+kRV1GkWIUIHnR/JoyVtk4iJBsHH3uiy7CirCFGPay5A44fbvFj 30Kw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=UBrZSTZ6qIfneIElsj3tLlORrGgvVMiFEvu6xYZfR6w=; b=gbdU9vtb0+32+yd+9h5gjtVRGKXV/erWB+IAE0uDereNagRZ+dLOXa3RK2bKnXw09c 08dLYj5GWuUeDrgTh1q2LZUbxrzEmNMJG81rR0jlcDYYDev2XFpgcVjo7pBs7iVnjFQl LAmhq6JO6lf3sJCjZRpAv+9vOxpeNZe7R5l/crkT0xjoiUyPTHFMOaGyAJHiqPNrVz+D 86ubXLtA8GlMbqiwi+PTCdAHkzB7BpgQd6I6EOd9k/7vn235M0hhmiyk6l/DekPQBOX+ Htz98xApLnWzbXfTqcCWSgMbu6Uo61vcone/ij+UsC9nL8NhlgPYas+vWSRr54ZNg50Z vzFA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=L9hQ5ciP; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id qe10-20020a17090b4f8a00b0021918bc9a47si12229095pjb.174.2023.02.19.03.49.17; Sun, 19 Feb 2023 03:49:29 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=L9hQ5ciP; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230109AbjBSJqE (ORCPT + 99 others); Sun, 19 Feb 2023 04:46:04 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40814 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230108AbjBSJqC (ORCPT ); Sun, 19 Feb 2023 04:46:02 -0500 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CFBC4EC57 for ; Sun, 19 Feb 2023 01:45:23 -0800 (PST) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-53688f5c1b1so16899917b3.2 for ; Sun, 19 Feb 2023 01:45:23 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=UBrZSTZ6qIfneIElsj3tLlORrGgvVMiFEvu6xYZfR6w=; b=L9hQ5ciPXy/yk7YbGTA3vKiBvQF5mmWbEaTwZbD1xCIQmw5pyL8Ot5PeRaEcmzFgyb R2RHR+3J8rqQ+Z4Nmsdh1lP+useHJkLKOQmnzRHA2LJHmXdmBjltVFEL6i5MLOLW+Yg9 qS0UsrjhG3rybTZddh87kURSAmDeFFI0w2fcKGSDjlDx3uWVgso/1COFLcjJvYbW6o+M +Diy/BkS+LGwu1Zt0MdjlHQT7bD1EMY0tbc38xpytZgpCaIzqt1Aip2yIf1mq7qu+kdh Co2aW9v1SuDUXVY4ThmRutlyPHc0KgLBet6MFTfDo6yeBAHvJcKeEbRUGL8BK38NcU/n u4Qg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=UBrZSTZ6qIfneIElsj3tLlORrGgvVMiFEvu6xYZfR6w=; b=ftOSmFnyelkclLSU3M9PgbDzmvU48ViqOknzTb424LV0nQGmRbmVmXQSQpr4Z6NNaT leSNUECkNTH32Kmd0YnlW5+wgoqskxzPlYwTUOMUkf7S/G/ihj8/JvuZGchR/woVVg0O cbR0Lut3zyEm4DMBhP2iPpumjLIxzO7nHrj8Yk5j9O+pzbieZYsYYvq9H1Q8XexTrj8w wO1YaF5QvSo6jbuc8tQsQi797JaWmjxGltAqz36HIX35pz/+xkMtG+GhL3Kxh/LIUzZ9 6YGCfxTANkk7hFLFuWHRTDXC6c+XqWa2i2Nf1WkKlnNszHTdq2V7bzsoFBHBRSWIHB0P wKYA== X-Gm-Message-State: AO0yUKUmPYcSrYgfatX8+780akiJvR2JguK24iHj/lbbOhf/GyDeJsRj 1gIZTRmUvPFRLHoY0iTyJG3XtslWL0xI X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a5b:bcd:0:b0:8ac:a1b7:6fa3 with SMTP id c13-20020a5b0bcd000000b008aca1b76fa3mr1567241ybr.278.1676799233664; Sun, 19 Feb 2023 01:33:53 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:32 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-36-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 35/51] perf list: Support for printing metric thresholds From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758259964058848115?= X-GMAIL-MSGID: =?utf-8?q?1758259964058848115?= Add to json output by default. For regular output, only enable with the --detail flag. Signed-off-by: Ian Rogers --- tools/perf/builtin-list.c | 13 ++++++++++++- tools/perf/util/metricgroup.c | 3 +++ tools/perf/util/print-events.h | 1 + 3 files changed, 16 insertions(+), 1 deletion(-) diff --git a/tools/perf/builtin-list.c b/tools/perf/builtin-list.c index 791f513ae5b4..76e1d31a68ee 100644 --- a/tools/perf/builtin-list.c +++ b/tools/perf/builtin-list.c @@ -168,6 +168,7 @@ static void default_print_metric(void *ps, const char *desc, const char *long_desc, const char *expr, + const char *threshold, const char *unit __maybe_unused) { struct print_state *print_state = ps; @@ -227,6 +228,11 @@ static void default_print_metric(void *ps, wordwrap(expr, 8, pager_get_columns(), 0); printf("]\n"); } + if (threshold && print_state->detailed) { + printf("%*s", 8, "["); + wordwrap(threshold, 8, pager_get_columns(), 0); + printf("]\n"); + } } struct json_print_state { @@ -367,7 +373,7 @@ static void json_print_event(void *ps, const char *pmu_name, const char *topic, static void json_print_metric(void *ps __maybe_unused, const char *group, const char *name, const char *desc, const char *long_desc, const char *expr, - const char *unit) + const char *threshold, const char *unit) { struct json_print_state *print_state = ps; bool need_sep = false; @@ -388,6 +394,11 @@ static void json_print_metric(void *ps __maybe_unused, const char *group, fix_escape_printf(&buf, "%s\t\"MetricExpr\": \"%S\"", need_sep ? ",\n" : "", expr); need_sep = true; } + if (threshold) { + fix_escape_printf(&buf, "%s\t\"MetricThreshold\": \"%S\"", need_sep ? ",\n" : "", + threshold); + need_sep = true; + } if (unit) { fix_escape_printf(&buf, "%s\t\"ScaleUnit\": \"%S\"", need_sep ? ",\n" : "", unit); need_sep = true; diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c index 868fc9c35606..b1d56a73223d 100644 --- a/tools/perf/util/metricgroup.c +++ b/tools/perf/util/metricgroup.c @@ -368,6 +368,7 @@ struct mep { const char *metric_desc; const char *metric_long_desc; const char *metric_expr; + const char *metric_threshold; const char *metric_unit; }; @@ -447,6 +448,7 @@ static int metricgroup__add_to_mep_groups(const struct pmu_metric *pm, me->metric_desc = pm->desc; me->metric_long_desc = pm->long_desc; me->metric_expr = pm->metric_expr; + me->metric_threshold = pm->metric_threshold; me->metric_unit = pm->unit; } } @@ -522,6 +524,7 @@ void metricgroup__print(const struct print_callbacks *print_cb, void *print_stat me->metric_desc, me->metric_long_desc, me->metric_expr, + me->metric_threshold, me->metric_unit); next = rb_next(node); rblist__remove_node(&groups, node); diff --git a/tools/perf/util/print-events.h b/tools/perf/util/print-events.h index 716dcf4b4859..e75a3d7e3fe3 100644 --- a/tools/perf/util/print-events.h +++ b/tools/perf/util/print-events.h @@ -23,6 +23,7 @@ struct print_callbacks { const char *desc, const char *long_desc, const char *expr, + const char *threshold, const char *unit); }; From patchwork Sun Feb 19 09:28:33 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59122 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp775848wrn; Sun, 19 Feb 2023 01:46:17 -0800 (PST) X-Google-Smtp-Source: AK7set8qNYwADSMAcXqWR/ouEPEjtEUf1YKkzxexvLmlRiFAYi5avSiYckKSlqpcL1pMfVPLC1OI X-Received: by 2002:a05:6a20:6a98:b0:b9:7a47:bca5 with SMTP id bi24-20020a056a206a9800b000b97a47bca5mr5818789pzb.43.1676799977501; Sun, 19 Feb 2023 01:46:17 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676799977; cv=none; d=google.com; s=arc-20160816; b=CjKicbgK+C7KiVYvZibkd1aUQysSjqiAkL8vcuiP3Nh4eNJAXWhWsE4LxwN2MgW3tC ugUvGu+wvEDBhbyaTv7+nwTpIiT4tQwWb6Qbj4HWYyDBFxVU6aSiHEJWfVR5ekvgsULe px9brM7cODMBVjs2wiaAJtNanj+QWyHw520OkGCZYGlWN0bsldmHDdizHmgLZ1Fx/sqs v2WQS8I+jSvUDz7YzKSGoCzZZpBMT3UtsMhceDrGgAgsO/uiqHaZMUSAIazumJMTYFrv z4n2Cgx5fBE+IrDiE1FtdVFosveeLS09JR8WoUvkXUqiNgk+ZyWE0lLnTfZMu/1pCt9u IEgw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=3kQG0KGeqgW06OPnyih1uXZZsO66wdHBjTbqSiX//SI=; b=LsdNTQUHI0KMvKsSEG5fQI9Wl//gIOAkEud5AsV5XnXZPVZGDf8C3COkwo5PwKIU+/ KeFqoSJ/Qx4h/pvDDOL1RMj4bOvUPPvoRFpNXUOkOOD+/JD6yvfM4QPAMlIRc5Q91KNW 8ahgB/kIjhs2jTrVh1CjK1F1mhVL1Ef2JtQwyW2AEYz9DuXerqnjGbhHLQfbvS1qmSZ0 b9x7DJaFFQ7FKdOsVSBUuLREPX4h+Djw+I4XVC/Gfrz6+LD3gYP8sTplrlGugpHz8tTk LR0HtwM0ap1fOwZKqN7LhleQfAWx/SI6oLTYkrol1Gt3iVdmhAM4Vwc2MncDLUFLMjpe aLDA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b="gb2QuI/Q"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id 9-20020a630109000000b004dacaa17ecbsi10985436pgb.559.2023.02.19.01.46.05; Sun, 19 Feb 2023 01:46:17 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b="gb2QuI/Q"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230026AbjBSJgz (ORCPT + 99 others); Sun, 19 Feb 2023 04:36:55 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59016 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230022AbjBSJgx (ORCPT ); Sun, 19 Feb 2023 04:36:53 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6AB9C12BC9 for ; Sun, 19 Feb 2023 01:36:00 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id 5-20020a250105000000b009802c10698cso869091ybb.22 for ; Sun, 19 Feb 2023 01:36:00 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=3kQG0KGeqgW06OPnyih1uXZZsO66wdHBjTbqSiX//SI=; b=gb2QuI/Q7UTLV03j+31SjEjAnQVplGs47YvBd6d72MRsfNX5u2GV1bClVFw/KPr/DJ j9VkCa6qiHYUKeSUlqAMjhUyepIsSaIdmc6/CDEnDH5Bg1MruzsNaUi1GrFe/RCHG4Ou 3/q2BxglwECD4NUkzFALnQWIey/UfoJLpkpUM2tCH8Ez4guaASLykOSuwFHIjrha8Gnb tc4FH3ZiS6wOvY2zokF6JUyq2nSs+yVikyByMBdQkxY+ijcxCjlGZ3MBC7jpOELRI0HS mHhwWScV1zH7rzz0rESS6LPTwCbYIbbRMDs+jv17YMGgoYXr5r8mpP8HO3oNUdzQHy/o ZZiw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=3kQG0KGeqgW06OPnyih1uXZZsO66wdHBjTbqSiX//SI=; b=jaurfch+sNWSghb6MLNX5iTBAgIj+hZOUvdDWoH1KofEKgwg3D9WjuORgvk8nnLmhO nM0Es6RqimjvtCBgrlOqHhWIZTMCFkSj0uHtMKJ+Q7KGaTjeXbm32wFk0OUAEKEtP2Dz yjqeOGoAS2aTFHIlajgzFNs+rDhCaLB86KuLBhg1yS3EX1fKeGr4N2i8KtWWKuABTEEN 1yPHhlMLNKv9QJhaZ7wmpHZITa5LDrFMxQKO5dixCksONqxXfrWm0V/G4bXwUDvy5GZP y9LOpsvKF/ZE/Pc8V79cQJqDkKPIBQbqsgb9eYbgUXArYb2JTrSOchA085BngPYzKbU/ QzMw== X-Gm-Message-State: AO0yUKVHuZ4/Z7o1oRSUo1bfk4d+7OzWOyTo6/n4Jkj0j6GYFSnKdGEc 3GGbKfNAJ/G+SlcNu0UTCcGoja17A6EE X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a25:6408:0:b0:911:cc15:db74 with SMTP id y8-20020a256408000000b00911cc15db74mr28488ybb.13.1676799242175; Sun, 19 Feb 2023 01:34:02 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:33 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-37-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 36/51] perf metric: Compute and print threshold values From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758252212665069711?= X-GMAIL-MSGID: =?utf-8?q?1758252212665069711?= Compute the threshold metric and use it to color the metric value as red or green. The threshold expression is used to generate the set of events as the threshold may require additional events. A later patch make this behavior optional with a --metric-no-threshold flag. Signed-off-by: Ian Rogers --- tools/perf/util/metricgroup.c | 24 +++++++++++++++++++++--- tools/perf/util/metricgroup.h | 1 + tools/perf/util/stat-shadow.c | 24 ++++++++++++++++-------- 3 files changed, 38 insertions(+), 11 deletions(-) diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c index b1d56a73223d..d83885697125 100644 --- a/tools/perf/util/metricgroup.c +++ b/tools/perf/util/metricgroup.c @@ -129,6 +129,8 @@ struct metric { const char *modifier; /** The expression to parse, for example, "instructions/cycles". */ const char *metric_expr; + /** Optional threshold expression where zero value is green, otherwise red. */ + const char *metric_threshold; /** * The "ScaleUnit" that scales and adds a unit to the metric during * output. @@ -222,6 +224,7 @@ static struct metric *metric__new(const struct pmu_metric *pm, goto out_err; } m->metric_expr = pm->metric_expr; + m->metric_threshold = pm->metric_threshold; m->metric_unit = pm->unit; m->pctx->sctx.user_requested_cpu_list = NULL; if (user_requested_cpu_list) { @@ -901,6 +904,7 @@ static int __add_metric(struct list_head *metric_list, const struct visited_metric *vm; int ret; bool is_root = !root_metric; + const char *expr; struct visited_metric visited_node = { .name = pm->metric_name, .parent = visited, @@ -963,16 +967,29 @@ static int __add_metric(struct list_head *metric_list, * For both the parent and referenced metrics, we parse * all the metric's IDs and add it to the root context. */ - if (expr__find_ids(pm->metric_expr, NULL, root_metric->pctx) < 0) { + ret = 0; + expr = pm->metric_expr; + if (is_root && pm->metric_threshold) { + /* + * Threshold expressions are built off the actual metric. Switch + * to use that in case of additional necessary events. Change + * the visited node name to avoid this being flagged as + * recursion. + */ + assert(strstr(pm->metric_threshold, pm->metric_name)); + expr = pm->metric_threshold; + visited_node.name = "__threshold__"; + } + if (expr__find_ids(expr, NULL, root_metric->pctx) < 0) { /* Broken metric. */ ret = -EINVAL; - } else { + } + if (!ret) { /* Resolve referenced metrics. */ ret = resolve_metric(metric_list, modifier, metric_no_group, user_requested_cpu_list, system_wide, root_metric, &visited_node, table); } - if (ret) { if (is_root) metric__free(root_metric); @@ -1554,6 +1571,7 @@ static int parse_groups(struct evlist *perf_evlist, const char *str, free(metric_events); goto out; } + expr->metric_threshold = m->metric_threshold; expr->metric_unit = m->metric_unit; expr->metric_events = metric_events; expr->runtime = m->pctx->sctx.runtime; diff --git a/tools/perf/util/metricgroup.h b/tools/perf/util/metricgroup.h index 84030321a057..32eb3a5381fb 100644 --- a/tools/perf/util/metricgroup.h +++ b/tools/perf/util/metricgroup.h @@ -47,6 +47,7 @@ struct metric_expr { const char *metric_expr; /** The name of the meric such as "IPC". */ const char *metric_name; + const char *metric_threshold; /** * The "ScaleUnit" that scales and adds a unit to the metric during * output. For example, "6.4e-05MiB" means to scale the resulting metric diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c index 806b32156459..a41f186c6ec8 100644 --- a/tools/perf/util/stat-shadow.c +++ b/tools/perf/util/stat-shadow.c @@ -777,6 +777,7 @@ static int prepare_metric(struct evsel **metric_events, static void generic_metric(struct perf_stat_config *config, const char *metric_expr, + const char *metric_threshold, struct evsel **metric_events, struct metric_ref *metric_refs, char *name, @@ -789,9 +790,10 @@ static void generic_metric(struct perf_stat_config *config, { print_metric_t print_metric = out->print_metric; struct expr_parse_ctx *pctx; - double ratio, scale; + double ratio, scale, threshold; int i; void *ctxp = out->ctx; + const char *color = NULL; pctx = expr__ctx_new(); if (!pctx) @@ -811,6 +813,12 @@ static void generic_metric(struct perf_stat_config *config, char *unit; char metric_bf[64]; + if (metric_threshold && + expr__parse(&threshold, pctx, metric_threshold) == 0) { + color = fpclassify(threshold) == FP_ZERO + ? PERF_COLOR_GREEN : PERF_COLOR_RED; + } + if (metric_unit && metric_name) { if (perf_pmu__convert_scale(metric_unit, &unit, &scale) >= 0) { @@ -823,22 +831,22 @@ static void generic_metric(struct perf_stat_config *config, scnprintf(metric_bf, sizeof(metric_bf), "%s %s", unit, metric_name); - print_metric(config, ctxp, NULL, "%8.1f", + print_metric(config, ctxp, color, "%8.1f", metric_bf, ratio); } else { - print_metric(config, ctxp, NULL, "%8.2f", + print_metric(config, ctxp, color, "%8.2f", metric_name ? metric_name : out->force_header ? name : "", ratio); } } else { - print_metric(config, ctxp, NULL, NULL, + print_metric(config, ctxp, color, /*unit=*/NULL, out->force_header ? (metric_name ? metric_name : name) : "", 0); } } else { - print_metric(config, ctxp, NULL, NULL, + print_metric(config, ctxp, color, /*unit=*/NULL, out->force_header ? (metric_name ? metric_name : name) : "", 0); } @@ -1214,9 +1222,9 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, list_for_each_entry (mexp, &me->head, nd) { if (num++ > 0) out->new_line(config, ctxp); - generic_metric(config, mexp->metric_expr, mexp->metric_events, - mexp->metric_refs, evsel->name, mexp->metric_name, - mexp->metric_unit, mexp->runtime, + generic_metric(config, mexp->metric_expr, mexp->metric_threshold, + mexp->metric_events, mexp->metric_refs, evsel->name, + mexp->metric_name, mexp->metric_unit, mexp->runtime, map_idx, out, st); } } From patchwork Sun Feb 19 09:28:34 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59133 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp778001wrn; Sun, 19 Feb 2023 01:55:15 -0800 (PST) X-Google-Smtp-Source: AK7set9j+B85Xs8gFMnCAx6+oMhm8CjRfF8HmuwWFI9WkcpDL6ggaz33wsLzQICxP1FSdSx6YZqO X-Received: by 2002:a17:902:ebca:b0:19a:7d0e:ceea with SMTP id p10-20020a170902ebca00b0019a7d0eceeamr121890plg.25.1676800515254; Sun, 19 Feb 2023 01:55:15 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676800515; cv=none; d=google.com; s=arc-20160816; b=an5Af+UZvH3x/GeDVBljJVTf5YvVstWNhCi07rMq/5tq56Rcnq2GC2woGI+HXG5Hr9 prGg5o4ZXlhOj0A/qDku8ZOJdCFE8MK9s6PxS3SFO+pRV62iCMYDQff6J1d/EijCzNUb OifWn652wV5Fa98h4o1JmFxjONzMKr93NTtpHf5fTPQkx2YBe+xpuw9juyFqmHCHN0iM ZFq+f3dLk3cvrxHCr85q8j4oWqgIa2M7jNrUfw6+6QVEokg5B3k5AJLbZy4VK2schNm2 tWr/EkJ4a3isCw8gttiZsVTCSCdTxvgrZSSArVRaPW/9YZjFFEBjzmv/OV0gBOeyBoyA 4dvw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=SBnNbiH+hab12DiOKBd4pcQP1xg7aEKLvUH64+SdruU=; b=m9WoG5TWQNWXQS2P9a+AYm0Vh2+V00DssxGH53fJihUBWPi5S2m0bcpEu5vNQ1vPPA /LzqWGYoOxOKgb9wM1IC1Z36oQv5ftJsbD41HOq222jkhVRUE6Y36BVEKPsWUwTNQs2c y+DiTrRqixZjWTs5fOFh4F2Bxn+qpK8y5tmucoxtEzdu1ZUd1axhKvVVVbb8M+XKqJHQ 8YvTm9CkSB5umA2k/aoroPv5MrkSMgvckn528m9F1euMPChOWCBO059A+DhwCICK5l85 a6D41isnAmXtJTGS2+hArp4tY/QJpRKzrOqwiMdhp6V/yHlT6p1bhujQNfpP526JSJKF KcbA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b="lMQXrD7/"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id kr12-20020a170903080c00b001993610cb43si5326895plb.343.2023.02.19.01.55.02; Sun, 19 Feb 2023 01:55:15 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b="lMQXrD7/"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230197AbjBSJtj (ORCPT + 99 others); Sun, 19 Feb 2023 04:49:39 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45234 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230192AbjBSJti (ORCPT ); Sun, 19 Feb 2023 04:49:38 -0500 Received: from mail-vk1-xa4a.google.com (mail-vk1-xa4a.google.com [IPv6:2607:f8b0:4864:20::a4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2BA4C9010 for ; Sun, 19 Feb 2023 01:49:00 -0800 (PST) Received: by mail-vk1-xa4a.google.com with SMTP id j23-20020ac5ce17000000b003ea10940e34so124644vki.23 for ; Sun, 19 Feb 2023 01:49:00 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=SBnNbiH+hab12DiOKBd4pcQP1xg7aEKLvUH64+SdruU=; b=lMQXrD7/f5Md4Cv/Wu4Z3DRM30d0/Nfh2T/ynQgxAmUmcGDb50w7szQMN0mpyoaGO2 /KdOcvlaYuqHsrNTWXuo8TUwMlOEss4q8f9kTxFsrrK0yzTDOal0zeVsikFKmnYi1xy1 ISw/xdS0uAhog4e1fTOEADlDMT1LTOMPedDe904T1ZO1xypLhlaFsSaeUjrSXaUz6fFk msQHGNFimxoQ+TOS/hoLyrv1/fbNc/ptM/wx+RLmFSGLlTDitFY8whrioOlTCio/r8RJ kf7fNFbYz1FDSf1lcNFUE8BVZf9eCYbnaM+F9qAETfZ5JX+dy6Dc7IZB7eKVK/qH0GCF upeA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=SBnNbiH+hab12DiOKBd4pcQP1xg7aEKLvUH64+SdruU=; b=FPxZRrvochmZbnTpDdnm/Cmu+fSgNW8zFwrmyB70FVnT+tNEfrHyGqhwh8L4cFYcoD dfmSpJ2lMStttjmK0twijBt6ymI9vbGmNz6B5jU2FNm+YqDnPpgX9b55PYKGQ51iR/j5 n9QZhRT3I8J2Qf+Wdt76PwL9sRHBtYqS4UK69ZYko07+7VVkC6mtoAMYVkYIvswEd6CL rtKQlhGOn3jpEtAFLtpUWKjuL/Rr+uD3Q/oXgwNmSXdCr6WG2BigyX8m9CfX5St8P2lN cwmcGmf1VpkzVy3zg3a0NqBlkhhFWlburpNK19PLKm+BLy/Dhd1OKHkSRQqIpF0usFOU xB0g== X-Gm-Message-State: AO0yUKXKWgD5OGtwR4HH2+1uIHVQkMaqPfQqN0xpzn/FsqR+AyF1zaKl BT86ecxW/szVaY05SJDV5Ls9hBIi3EKQ X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a05:6902:10c6:b0:97a:ebd:a594 with SMTP id w6-20020a05690210c600b0097a0ebda594mr390732ybu.3.1676799250687; Sun, 19 Feb 2023 01:34:10 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:34 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-38-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 37/51] perf expr: More explicit NAN handling From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758252777336685087?= X-GMAIL-MSGID: =?utf-8?q?1758252777336685087?= Comparison and logical operations on NAN won't ensure the result is NAN. Ensure NANs are propogated so that threshold expressions like "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15" don't yield a number when the components are NAN. Signed-off-by: Ian Rogers --- tools/perf/util/expr.y | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/tools/perf/util/expr.y b/tools/perf/util/expr.y index 635e562350c5..250e444bf032 100644 --- a/tools/perf/util/expr.y +++ b/tools/perf/util/expr.y @@ -127,7 +127,11 @@ static struct ids handle_id(struct expr_parse_ctx *ctx, char *id, if (!compute_ids || (is_const(LHS.val) && is_const(RHS.val))) { \ assert(LHS.ids == NULL); \ assert(RHS.ids == NULL); \ - RESULT.val = (long)LHS.val OP (long)RHS.val; \ + if (isnan(LHS.val) || isnan(RHS.val)) { \ + RESULT.val = NAN; \ + } else { \ + RESULT.val = (long)LHS.val OP (long)RHS.val; \ + } \ RESULT.ids = NULL; \ } else { \ RESULT = union_expr(LHS, RHS); \ @@ -137,7 +141,11 @@ static struct ids handle_id(struct expr_parse_ctx *ctx, char *id, if (!compute_ids || (is_const(LHS.val) && is_const(RHS.val))) { \ assert(LHS.ids == NULL); \ assert(RHS.ids == NULL); \ - RESULT.val = LHS.val OP RHS.val; \ + if (isnan(LHS.val) || isnan(RHS.val)) { \ + RESULT.val = NAN; \ + } else { \ + RESULT.val = LHS.val OP RHS.val; \ + } \ RESULT.ids = NULL; \ } else { \ RESULT = union_expr(LHS, RHS); \ From patchwork Sun Feb 19 09:28:35 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59137 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp803502wrn; Sun, 19 Feb 2023 03:22:49 -0800 (PST) X-Google-Smtp-Source: AK7set9v2y3crevEuPqJ849ygpOD41VQQSzQmgZzTiF6wNHl893r0COD6Dm2cGfNlnT4FbC3q/Yy X-Received: by 2002:a05:6402:3705:b0:4a3:43c1:8431 with SMTP id ek5-20020a056402370500b004a343c18431mr3152902edb.5.1676805769285; Sun, 19 Feb 2023 03:22:49 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676805769; cv=none; d=google.com; s=arc-20160816; b=UKKH0kMusOqJWYAhhZ1LHalLkJO6aCyzublYJ86b7902LoVxD+i9YY8ReN5RtNdGxf gPlf/tnWsuleqTzSIXXw6Iy0pN2SNeEeF3liO/kIFIZ+0siKLHUst/aL7YwFMyR2JDJ+ JRhN7LGWtlzaw8YU5PIL7iEqFcEydoJ+yXxTC2wq270rSUW1HdP36R+4f6GuHAUYGPNs CELjUpjp0POa5LKeI6meBwXVAI2vg6+7dw7Bp7J2p5LuEBU6X+BrBR3/LYZ6JudGZsgP v+ap2jPYd3nVAWfJt5ZjmCynOAjPkxn9CMXMCBic+3kFN2zvqyo47L6W4xzOO6o9JK3N d/UQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=XBygbP2AnKfDbpz1krdMAZIvQLAdvtaCB6XLYUmfQds=; b=1ANmUiUtete6m7YkPN2ZCIlUE8A46et91Opjvc5z4jvrZAWckoJ+sVC+wZ9wMkBtmD YuTtFgKZHFdjfyT8vZ+P63Z5Jk3CIUCpj4k2F8+BCeTA1g0w3wVOewYt7NAzkzo/VPNw cMQTURhrN026LZBn9xvESAzl1/HLRgfdjnOSYNA0cDRVCaJKZVtgoBZLiDJMqgmGy/LY DTxQpUcC8OqVgbGsRBpmJZy413RarmLWd/XWcciwPBlnou6YhD4ArKnurr3NZ9ikc3Ef HzoZwtUuJ7YY12Fo+1FitrqYmrvWWmzT8/W0l4ns2hKLf2CAG5PNhSesEYyuQJ/FGikc ojvw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=PkbnNXwA; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id r18-20020aa7d592000000b004acbdaf456fsi11997619edq.291.2023.02.19.03.22.25; Sun, 19 Feb 2023 03:22:49 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=PkbnNXwA; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230151AbjBSJrl (ORCPT + 99 others); Sun, 19 Feb 2023 04:47:41 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:42556 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230148AbjBSJrj (ORCPT ); Sun, 19 Feb 2023 04:47:39 -0500 Received: from mail-oo1-xc49.google.com (mail-oo1-xc49.google.com [IPv6:2607:f8b0:4864:20::c49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 39ABBFF3E for ; Sun, 19 Feb 2023 01:47:04 -0800 (PST) Received: by mail-oo1-xc49.google.com with SMTP id u24-20020a4ac998000000b0051f97e7b7f9so62155ooq.13 for ; Sun, 19 Feb 2023 01:47:04 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=XBygbP2AnKfDbpz1krdMAZIvQLAdvtaCB6XLYUmfQds=; b=PkbnNXwAZ7dASxCbIk3PSOXIwA56PiTDT2BrHJWvqt8nqcL3bSxaHP0HjREpFP4pBE zmCKIb+9FhlwQ8T9T3hD1DWUTG0iZf0+D8Fg5gEw/6kW81MXr3SCY3kj3ZCNZJM79rEm Z62LtGdZMOWbkEO42PWyNmuN/kx7l7fYGF+pskPCUg8W0RWDaNemZLiFYJpBXAf1moZy UpDKxP3KBGDK6CwystAwBk4I9aM2uCdMXGtMenYofayeUMOzV5RGtLImtFxLETRLxQfL FHBnkZN/j8gx2E/zhsgX1YogOTj0NgsRmfp+VgFEhl3UoSV3sPJ8dji7jQmoLUTBvZKF 78sQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=XBygbP2AnKfDbpz1krdMAZIvQLAdvtaCB6XLYUmfQds=; b=URAFFmiAplt5pC/2wBTS5FJkddjb5WExWrehsYI+cPt58k8MVkU98mN2sjCg51crI/ tGjqQlTkVcFIW3mNm9gG6OXoCXPKLVJzYNF/3nvN6MxtgclRfwLik/uQTj+3BcB136bl SYphoWdC3EW2AdEz7xNxf5zIeG8RwER+gCOFgoHlGnaQhgQEqbKMKTEpKs8Bgbn4RXal sMEaYt1fCrMGBHGvf0gclHJ28HQ8xVrrxU6ICxslE9PI7+dXPsMFZ4j40rW822hG7bT0 TNX48lqAxrWuNi/Yuu0yYZClOWNxRKbWYN1WveTrmf+m+CbGzKAhNdi1F5IS0gs4qMfL egRA== X-Gm-Message-State: AO0yUKVxTmugJFdO53v27yxUCv1Pfdf/7stkNPBwZsheAc3OjiIx7ZgJ nDbolQeqgzzdcOLX6kbgAm2SImyzY0SZ X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a05:6902:11cd:b0:90d:a6c0:6870 with SMTP id n13-20020a05690211cd00b0090da6c06870mr391398ybu.2.1676799259452; Sun, 19 Feb 2023 01:34:19 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:35 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-39-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 38/51] perf metric: Add --metric-no-threshold option From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758258286509039790?= X-GMAIL-MSGID: =?utf-8?q?1758258286509039790?= Thresholds may need additional events, this can impact things like sharing groups of events to avoid multiplexing. Add a flag to make the threshold calculations optional. The threshold will still be computed if no additional events are necessary. Signed-off-by: Ian Rogers --- tools/perf/builtin-stat.c | 4 +++ tools/perf/tests/expand-cgroup.c | 3 +- tools/perf/tests/parse-metric.c | 1 - tools/perf/tests/pmu-events.c | 4 +-- tools/perf/util/metricgroup.c | 62 ++++++++++++++++++++------------ tools/perf/util/metricgroup.h | 3 +- tools/perf/util/stat-shadow.c | 3 +- tools/perf/util/stat.h | 1 + 8 files changed, 49 insertions(+), 32 deletions(-) diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c index 5d18a5a6f662..5e13171a7bba 100644 --- a/tools/perf/builtin-stat.c +++ b/tools/perf/builtin-stat.c @@ -1256,6 +1256,8 @@ static struct option stat_options[] = { "don't group metric events, impacts multiplexing"), OPT_BOOLEAN(0, "metric-no-merge", &stat_config.metric_no_merge, "don't try to share events between metrics in a group"), + OPT_BOOLEAN(0, "metric-no-threshold", &stat_config.metric_no_threshold, + "don't try to share events between metrics in a group "), OPT_BOOLEAN(0, "topdown", &topdown_run, "measure top-down statistics"), OPT_UINTEGER(0, "td-level", &stat_config.topdown_level, @@ -1852,6 +1854,7 @@ static int add_default_attributes(void) return metricgroup__parse_groups(evsel_list, "transaction", stat_config.metric_no_group, stat_config.metric_no_merge, + stat_config.metric_no_threshold, stat_config.user_requested_cpu_list, stat_config.system_wide, &stat_config.metric_events); @@ -2519,6 +2522,7 @@ int cmd_stat(int argc, const char **argv) metricgroup__parse_groups(evsel_list, metrics, stat_config.metric_no_group, stat_config.metric_no_merge, + stat_config.metric_no_threshold, stat_config.user_requested_cpu_list, stat_config.system_wide, &stat_config.metric_events); diff --git a/tools/perf/tests/expand-cgroup.c b/tools/perf/tests/expand-cgroup.c index 672a27f37060..ec340880a848 100644 --- a/tools/perf/tests/expand-cgroup.c +++ b/tools/perf/tests/expand-cgroup.c @@ -187,8 +187,7 @@ static int expand_metric_events(void) rblist__init(&metric_events); pme_test = find_core_metrics_table("testarch", "testcpu"); - ret = metricgroup__parse_groups_test(evlist, pme_test, metric_str, - false, false, &metric_events); + ret = metricgroup__parse_groups_test(evlist, pme_test, metric_str, &metric_events); if (ret < 0) { pr_debug("failed to parse '%s' metric\n", metric_str); goto out; diff --git a/tools/perf/tests/parse-metric.c b/tools/perf/tests/parse-metric.c index 9fec6040950c..132c9b945a42 100644 --- a/tools/perf/tests/parse-metric.c +++ b/tools/perf/tests/parse-metric.c @@ -98,7 +98,6 @@ static int __compute_metric(const char *name, struct value *vals, /* Parse the metric into metric_events list. */ pme_test = find_core_metrics_table("testarch", "testcpu"); err = metricgroup__parse_groups_test(evlist, pme_test, name, - false, false, &metric_events); if (err) goto out; diff --git a/tools/perf/tests/pmu-events.c b/tools/perf/tests/pmu-events.c index db2fed0c6993..50b99a0f8f59 100644 --- a/tools/perf/tests/pmu-events.c +++ b/tools/perf/tests/pmu-events.c @@ -846,9 +846,7 @@ static int test__parsing_callback(const struct pmu_metric *pm, perf_evlist__set_maps(&evlist->core, cpus, NULL); runtime_stat__init(&st); - err = metricgroup__parse_groups_test(evlist, table, pm->metric_name, - false, false, - &metric_events); + err = metricgroup__parse_groups_test(evlist, table, pm->metric_name, &metric_events); if (err) { if (!strcmp(pm->metric_name, "M1") || !strcmp(pm->metric_name, "M2") || !strcmp(pm->metric_name, "M3")) { diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c index d83885697125..afb6f2fdc24e 100644 --- a/tools/perf/util/metricgroup.c +++ b/tools/perf/util/metricgroup.c @@ -771,6 +771,7 @@ struct metricgroup_add_iter_data { int *ret; bool *has_match; bool metric_no_group; + bool metric_no_threshold; const char *user_requested_cpu_list; bool system_wide; struct metric *root_metric; @@ -786,6 +787,7 @@ static int add_metric(struct list_head *metric_list, const struct pmu_metric *pm, const char *modifier, bool metric_no_group, + bool metric_no_threshold, const char *user_requested_cpu_list, bool system_wide, struct metric *root_metric, @@ -813,6 +815,7 @@ static int add_metric(struct list_head *metric_list, static int resolve_metric(struct list_head *metric_list, const char *modifier, bool metric_no_group, + bool metric_no_threshold, const char *user_requested_cpu_list, bool system_wide, struct metric *root_metric, @@ -861,8 +864,8 @@ static int resolve_metric(struct list_head *metric_list, */ for (i = 0; i < pending_cnt; i++) { ret = add_metric(metric_list, &pending[i].pm, modifier, metric_no_group, - user_requested_cpu_list, system_wide, root_metric, visited, - table); + metric_no_threshold, user_requested_cpu_list, system_wide, + root_metric, visited, table); if (ret) break; } @@ -879,6 +882,7 @@ static int resolve_metric(struct list_head *metric_list, * @metric_no_group: Should events written to events be grouped "{}" or * global. Grouping is the default but due to multiplexing the * user may override. + * @metric_no_threshold: Should threshold expressions be ignored? * @runtime: A special argument for the parser only known at runtime. * @user_requested_cpu_list: Command line specified CPUs to record on. * @system_wide: Are events for all processes recorded. @@ -894,6 +898,7 @@ static int __add_metric(struct list_head *metric_list, const struct pmu_metric *pm, const char *modifier, bool metric_no_group, + bool metric_no_threshold, int runtime, const char *user_requested_cpu_list, bool system_wide, @@ -974,10 +979,12 @@ static int __add_metric(struct list_head *metric_list, * Threshold expressions are built off the actual metric. Switch * to use that in case of additional necessary events. Change * the visited node name to avoid this being flagged as - * recursion. + * recursion. If the threshold events are disabled, just use the + * metric's name as a reference. This allows metric threshold + * computation if there are sufficient events. */ assert(strstr(pm->metric_threshold, pm->metric_name)); - expr = pm->metric_threshold; + expr = metric_no_threshold ? pm->metric_name : pm->metric_threshold; visited_node.name = "__threshold__"; } if (expr__find_ids(expr, NULL, root_metric->pctx) < 0) { @@ -987,8 +994,8 @@ static int __add_metric(struct list_head *metric_list, if (!ret) { /* Resolve referenced metrics. */ ret = resolve_metric(metric_list, modifier, metric_no_group, - user_requested_cpu_list, system_wide, - root_metric, &visited_node, table); + metric_no_threshold, user_requested_cpu_list, + system_wide, root_metric, &visited_node, table); } if (ret) { if (is_root) @@ -1035,6 +1042,7 @@ static int add_metric(struct list_head *metric_list, const struct pmu_metric *pm, const char *modifier, bool metric_no_group, + bool metric_no_threshold, const char *user_requested_cpu_list, bool system_wide, struct metric *root_metric, @@ -1046,9 +1054,9 @@ static int add_metric(struct list_head *metric_list, pr_debug("metric expr %s for %s\n", pm->metric_expr, pm->metric_name); if (!strstr(pm->metric_expr, "?")) { - ret = __add_metric(metric_list, pm, modifier, metric_no_group, 0, - user_requested_cpu_list, system_wide, root_metric, - visited, table); + ret = __add_metric(metric_list, pm, modifier, metric_no_group, + metric_no_threshold, 0, user_requested_cpu_list, + system_wide, root_metric, visited, table); } else { int j, count; @@ -1060,9 +1068,9 @@ static int add_metric(struct list_head *metric_list, */ for (j = 0; j < count && !ret; j++) - ret = __add_metric(metric_list, pm, modifier, metric_no_group, j, - user_requested_cpu_list, system_wide, - root_metric, visited, table); + ret = __add_metric(metric_list, pm, modifier, metric_no_group, + metric_no_threshold, j, user_requested_cpu_list, + system_wide, root_metric, visited, table); } return ret; @@ -1079,8 +1087,8 @@ static int metricgroup__add_metric_sys_event_iter(const struct pmu_metric *pm, return 0; ret = add_metric(d->metric_list, pm, d->modifier, d->metric_no_group, - d->user_requested_cpu_list, d->system_wide, - d->root_metric, d->visited, d->table); + d->metric_no_threshold, d->user_requested_cpu_list, + d->system_wide, d->root_metric, d->visited, d->table); if (ret) goto out; @@ -1124,6 +1132,7 @@ struct metricgroup__add_metric_data { const char *modifier; const char *user_requested_cpu_list; bool metric_no_group; + bool metric_no_threshold; bool system_wide; bool has_match; }; @@ -1141,8 +1150,9 @@ static int metricgroup__add_metric_callback(const struct pmu_metric *pm, data->has_match = true; ret = add_metric(data->list, pm, data->modifier, data->metric_no_group, - data->user_requested_cpu_list, data->system_wide, - /*root_metric=*/NULL, /*visited_metrics=*/NULL, table); + data->metric_no_threshold, data->user_requested_cpu_list, + data->system_wide, /*root_metric=*/NULL, + /*visited_metrics=*/NULL, table); } return ret; } @@ -1163,7 +1173,7 @@ static int metricgroup__add_metric_callback(const struct pmu_metric *pm, * architecture perf is running upon. */ static int metricgroup__add_metric(const char *metric_name, const char *modifier, - bool metric_no_group, + bool metric_no_group, bool metric_no_threshold, const char *user_requested_cpu_list, bool system_wide, struct list_head *metric_list, @@ -1179,6 +1189,7 @@ static int metricgroup__add_metric(const char *metric_name, const char *modifier .metric_name = metric_name, .modifier = modifier, .metric_no_group = metric_no_group, + .metric_no_threshold = metric_no_threshold, .user_requested_cpu_list = user_requested_cpu_list, .system_wide = system_wide, .has_match = false, @@ -1241,6 +1252,7 @@ static int metricgroup__add_metric(const char *metric_name, const char *modifier * architecture perf is running upon. */ static int metricgroup__add_metric_list(const char *list, bool metric_no_group, + bool metric_no_threshold, const char *user_requested_cpu_list, bool system_wide, struct list_head *metric_list, const struct pmu_metrics_table *table) @@ -1259,7 +1271,8 @@ static int metricgroup__add_metric_list(const char *list, bool metric_no_group, *modifier++ = '\0'; ret = metricgroup__add_metric(metric_name, modifier, - metric_no_group, user_requested_cpu_list, + metric_no_group, metric_no_threshold, + user_requested_cpu_list, system_wide, metric_list, table); if (ret == -EINVAL) pr_err("Cannot find metric or group `%s'\n", metric_name); @@ -1449,6 +1462,7 @@ static int parse_ids(bool metric_no_merge, struct perf_pmu *fake_pmu, static int parse_groups(struct evlist *perf_evlist, const char *str, bool metric_no_group, bool metric_no_merge, + bool metric_no_threshold, const char *user_requested_cpu_list, bool system_wide, struct perf_pmu *fake_pmu, @@ -1463,7 +1477,7 @@ static int parse_groups(struct evlist *perf_evlist, const char *str, if (metric_events_list->nr_entries == 0) metricgroup__rblist_init(metric_events_list); - ret = metricgroup__add_metric_list(str, metric_no_group, + ret = metricgroup__add_metric_list(str, metric_no_group, metric_no_threshold, user_requested_cpu_list, system_wide, &metric_list, table); if (ret) @@ -1598,6 +1612,7 @@ int metricgroup__parse_groups(struct evlist *perf_evlist, const char *str, bool metric_no_group, bool metric_no_merge, + bool metric_no_threshold, const char *user_requested_cpu_list, bool system_wide, struct rblist *metric_events) @@ -1608,18 +1623,19 @@ int metricgroup__parse_groups(struct evlist *perf_evlist, return -EINVAL; return parse_groups(perf_evlist, str, metric_no_group, metric_no_merge, - user_requested_cpu_list, system_wide, + metric_no_threshold, user_requested_cpu_list, system_wide, /*fake_pmu=*/NULL, metric_events, table); } int metricgroup__parse_groups_test(struct evlist *evlist, const struct pmu_metrics_table *table, const char *str, - bool metric_no_group, - bool metric_no_merge, struct rblist *metric_events) { - return parse_groups(evlist, str, metric_no_group, metric_no_merge, + return parse_groups(evlist, str, + /*metric_no_group=*/false, + /*metric_no_merge=*/false, + /*metric_no_threshold=*/false, /*user_requested_cpu_list=*/NULL, /*system_wide=*/false, &perf_pmu__fake, metric_events, table); diff --git a/tools/perf/util/metricgroup.h b/tools/perf/util/metricgroup.h index 32eb3a5381fb..8d50052c5b4c 100644 --- a/tools/perf/util/metricgroup.h +++ b/tools/perf/util/metricgroup.h @@ -70,14 +70,13 @@ int metricgroup__parse_groups(struct evlist *perf_evlist, const char *str, bool metric_no_group, bool metric_no_merge, + bool metric_no_threshold, const char *user_requested_cpu_list, bool system_wide, struct rblist *metric_events); int metricgroup__parse_groups_test(struct evlist *evlist, const struct pmu_metrics_table *table, const char *str, - bool metric_no_group, - bool metric_no_merge, struct rblist *metric_events); void metricgroup__print(const struct print_callbacks *print_cb, void *print_state); diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c index a41f186c6ec8..77483eeda0d8 100644 --- a/tools/perf/util/stat-shadow.c +++ b/tools/perf/util/stat-shadow.c @@ -814,7 +814,8 @@ static void generic_metric(struct perf_stat_config *config, char metric_bf[64]; if (metric_threshold && - expr__parse(&threshold, pctx, metric_threshold) == 0) { + expr__parse(&threshold, pctx, metric_threshold) == 0 && + !isnan(threshold)) { color = fpclassify(threshold) == FP_ZERO ? PERF_COLOR_GREEN : PERF_COLOR_RED; } diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h index b1c29156c560..cf2d8aa445f3 100644 --- a/tools/perf/util/stat.h +++ b/tools/perf/util/stat.h @@ -159,6 +159,7 @@ struct perf_stat_config { bool no_csv_summary; bool metric_no_group; bool metric_no_merge; + bool metric_no_threshold; bool stop_read_counter; bool iostat_run; char *user_requested_cpu_list; From patchwork Sun Feb 19 09:28:36 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59124 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp776021wrn; Sun, 19 Feb 2023 01:47:02 -0800 (PST) X-Google-Smtp-Source: AK7set/buXx9KD8UYuuNRa8N97zZbAeguEsbRYTVRhJ3xDn1B8u50Sc/dDf3d2UOyZB5b0DReuFf X-Received: by 2002:a17:906:f14a:b0:8b1:7e88:c20f with SMTP id gw10-20020a170906f14a00b008b17e88c20fmr8756976ejb.15.1676800022652; Sun, 19 Feb 2023 01:47:02 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676800022; cv=none; d=google.com; s=arc-20160816; b=F6chKzf3O/WuKIepMEl17SvAyKDZETrQSJP4B4QoNWUPyyMM9jSlreGlzM4+2kSVKk gPisBSZT9WvPKRQANJRYnViNPBCRFz3R9cxGrXPEqlDCAowmSta8jU1rWtz2U2liTrx9 x74zwKrY3jHawKmpmu1MYOvD/R26jw5cqnnBNSbLYz88LHqpcuyeV45hk/6Nir19Ljsa Kcvr8bAUkIM/HEwdh3gtljTC0SVo+z7ZcQ2aMQqP19B0cvhWW5epyqFeZbYWTIm47sZO DVmeoQCcR5ediDdisE0eQXVARZN7Gcsh9NZ5eMNH/vFOPc0qyTW+RetpwtE1kp2mpD87 TdeQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:from:subject :references:mime-version:message-id:in-reply-to:date:dkim-signature; bh=4ivvDzCiOjMDLj/gcFt7rd0F2g23q1K7jFLZUP1Bknc=; b=0lDmDlHzd8R40mgMLyksSWRTbXl5srui3Ffux8/BMUVcLIViodyvTlnkb0c51CFNL3 L8KaghfPS+vVApXvXqz9fIwvFihX09IuA6su44BREB/+4BzzZwwJ2SGit5QYU3brexdV BRPhKYDy5OTJkZTzLcqU0E9L2MBa8VtyFu14luqJcIDrV/Oqlc7Gg/ScxsT7qhAs6nsV 1+RSOIp8nsZikdKh5IEJMh7cOReGcAa55Igm4TrfohJophrpC6Jw1UW0pzswy+cTiXHL vo3WH8ZSUwWbaJj1QmneoN+D0M7ulL63Sb8kKONx/HQhK9MKKj177fuaFzBgm1M/VDFl V8UA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=pYOV+yD5; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id up33-20020a170907cca100b008bd2a964b93si4939459ejc.300.2023.02.19.01.46.39; Sun, 19 Feb 2023 01:47:02 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=pYOV+yD5; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230037AbjBSJh2 (ORCPT + 99 others); Sun, 19 Feb 2023 04:37:28 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59706 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230013AbjBSJhZ (ORCPT ); Sun, 19 Feb 2023 04:37:25 -0500 Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com [IPv6:2607:f8b0:4864:20::1149]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CC2C4E3BE for ; Sun, 19 Feb 2023 01:36:34 -0800 (PST) Received: by mail-yw1-x1149.google.com with SMTP id 00721157ae682-53687b09838so19197907b3.6 for ; Sun, 19 Feb 2023 01:36:34 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:from:to:cc:subject:date :message-id:reply-to; bh=4ivvDzCiOjMDLj/gcFt7rd0F2g23q1K7jFLZUP1Bknc=; b=pYOV+yD5JlC63LIjHOEgCbCoVbA5/K+27F7L88dBNxpq0XtFj0mZZlqIwBeS92aa/g 4kDUIW6Ukpzk0fFBdMUaZQ9gpbxcONzq+Wex72Os29VOJJeKmMByz2aAyDl1kZOJhSvz gMyXVIcQ66hX6vSSMEXC6o0ZfK1beKFKsed7fXcseehPo4dxwSPM0pj09N90AAhkjkgC RqmIIPmMTskb/OWIeVf6wOoMaMNj2sJEn+pKMsU2ZwLxQqfbvK/cKpI5GR39iETOkuxK apz9HUvn1dPJ6d+YgJ5aG1cft6QskyAlLCTBnyeqR6tQvJk1kYio9jElUCO1VUWcZ9I6 YNbQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=4ivvDzCiOjMDLj/gcFt7rd0F2g23q1K7jFLZUP1Bknc=; b=n2U5+OsFgVwTFNqn7byaXNR0d+BcIPstM1up1jSGuipmj7SJeyDZPr9/rVO3ya7LeQ Rd7Sxq38N0OPxpkSk5G4vMNEyq4eC9yU8/MrkdMuNEBQg+zfOV6MPV/9MLKnPMQDD0JB rxpvWC1jEgoY33TuSo0t4kBmO1GUAv2LkKc3kHeqgzU2fCVnWzrgpSWbeNukvpj22Mja /cH6W5FFLwML83m0iu9GOaotJ6eSO2nYInfuThTeAFz6lwKZeZwp3KhjogW3fO8Sa5+0 j4bFDBp+QGt9VNYRvNjGFTFqs11hrTiBLCK6fC9+dAlqg2UpXXgBTH5mtPeiKJIgwX0L 4eyw== X-Gm-Message-State: AO0yUKVfKLy7jtTPDWZmOThhYf7JgZ9klf6aoTMGeWAlz6ghsXW+4Z63 pfNb9mwfiBnRfsRvJoJHeFRFGLXu5SmL X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a05:6902:291:b0:9a0:1d7b:707b with SMTP id v17-20020a056902029100b009a01d7b707bmr49325ybh.4.1676799268239; Sun, 19 Feb 2023 01:34:28 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:36 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-40-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 39/51] perf stat: Add TopdownL1 metric as a default if present From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758252260727304403?= X-GMAIL-MSGID: =?utf-8?q?1758252260727304403?= When there are no events and on Intel, the topdown events will be added by default if present. To display the metrics associated with these request special handling in stat-shadow.c. To more easily update these metrics use the json metric version via the TopdownL1 group. This makes the handling less platform specific. Modify the metricgroup__has_metric code to also cover metric groups. Signed-off-by: Ian Rogers --- tools/perf/arch/x86/util/evlist.c | 6 +++--- tools/perf/arch/x86/util/topdown.c | 30 ------------------------------ tools/perf/arch/x86/util/topdown.h | 1 - tools/perf/builtin-stat.c | 14 ++++++++++++++ tools/perf/util/metricgroup.c | 6 ++---- 5 files changed, 19 insertions(+), 38 deletions(-) diff --git a/tools/perf/arch/x86/util/evlist.c b/tools/perf/arch/x86/util/evlist.c index cb59ce9b9638..8a7ae4162563 100644 --- a/tools/perf/arch/x86/util/evlist.c +++ b/tools/perf/arch/x86/util/evlist.c @@ -59,10 +59,10 @@ int arch_evlist__add_default_attrs(struct evlist *evlist, struct perf_event_attr *attrs, size_t nr_attrs) { - if (nr_attrs) - return ___evlist__add_default_attrs(evlist, attrs, nr_attrs); + if (!nr_attrs) + return 0; - return topdown_parse_events(evlist); + return ___evlist__add_default_attrs(evlist, attrs, nr_attrs); } struct evsel *arch_evlist__leader(struct list_head *list) diff --git a/tools/perf/arch/x86/util/topdown.c b/tools/perf/arch/x86/util/topdown.c index 54810f9acd6f..eb3a7d9652ab 100644 --- a/tools/perf/arch/x86/util/topdown.c +++ b/tools/perf/arch/x86/util/topdown.c @@ -9,11 +9,6 @@ #include "topdown.h" #include "evsel.h" -#define TOPDOWN_L1_EVENTS "{slots,topdown-retiring,topdown-bad-spec,topdown-fe-bound,topdown-be-bound}" -#define TOPDOWN_L1_EVENTS_CORE "{slots,cpu_core/topdown-retiring/,cpu_core/topdown-bad-spec/,cpu_core/topdown-fe-bound/,cpu_core/topdown-be-bound/}" -#define TOPDOWN_L2_EVENTS "{slots,topdown-retiring,topdown-bad-spec,topdown-fe-bound,topdown-be-bound,topdown-heavy-ops,topdown-br-mispredict,topdown-fetch-lat,topdown-mem-bound}" -#define TOPDOWN_L2_EVENTS_CORE "{slots,cpu_core/topdown-retiring/,cpu_core/topdown-bad-spec/,cpu_core/topdown-fe-bound/,cpu_core/topdown-be-bound/,cpu_core/topdown-heavy-ops/,cpu_core/topdown-br-mispredict/,cpu_core/topdown-fetch-lat/,cpu_core/topdown-mem-bound/}" - /* Check whether there is a PMU which supports the perf metrics. */ bool topdown_sys_has_perf_metrics(void) { @@ -99,28 +94,3 @@ const char *arch_get_topdown_pmu_name(struct evlist *evlist, bool warn) return pmu_name; } - -int topdown_parse_events(struct evlist *evlist) -{ - const char *topdown_events; - const char *pmu_name; - - if (!topdown_sys_has_perf_metrics()) - return 0; - - pmu_name = arch_get_topdown_pmu_name(evlist, false); - - if (pmu_have_event(pmu_name, "topdown-heavy-ops")) { - if (!strcmp(pmu_name, "cpu_core")) - topdown_events = TOPDOWN_L2_EVENTS_CORE; - else - topdown_events = TOPDOWN_L2_EVENTS; - } else { - if (!strcmp(pmu_name, "cpu_core")) - topdown_events = TOPDOWN_L1_EVENTS_CORE; - else - topdown_events = TOPDOWN_L1_EVENTS; - } - - return parse_event(evlist, topdown_events); -} diff --git a/tools/perf/arch/x86/util/topdown.h b/tools/perf/arch/x86/util/topdown.h index 7eb81f042838..46bf9273e572 100644 --- a/tools/perf/arch/x86/util/topdown.h +++ b/tools/perf/arch/x86/util/topdown.h @@ -3,6 +3,5 @@ #define _TOPDOWN_H 1 bool topdown_sys_has_perf_metrics(void); -int topdown_parse_events(struct evlist *evlist); #endif diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c index 5e13171a7bba..796e98e453f6 100644 --- a/tools/perf/builtin-stat.c +++ b/tools/perf/builtin-stat.c @@ -1996,6 +1996,7 @@ static int add_default_attributes(void) stat_config.topdown_level = TOPDOWN_MAX_LEVEL; if (!evsel_list->core.nr_entries) { + /* No events so add defaults. */ if (target__has_cpu(&target)) default_attrs0[0].config = PERF_COUNT_SW_CPU_CLOCK; @@ -2011,6 +2012,19 @@ static int add_default_attributes(void) } if (evlist__add_default_attrs(evsel_list, default_attrs1) < 0) return -1; + /* + * Add TopdownL1 metrics if they exist. To minimize + * multiplexing, don't request threshold computation. + */ + if (metricgroup__has_metric("TopdownL1") && + metricgroup__parse_groups(evsel_list, "TopdownL1", + /*metric_no_group=*/false, + /*metric_no_merge=*/false, + /*metric_no_threshold=*/true, + stat_config.user_requested_cpu_list, + stat_config.system_wide, + &stat_config.metric_events) < 0) + return -1; /* Platform specific attrs */ if (evlist__add_default_attrs(evsel_list, default_null_attrs) < 0) return -1; diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c index afb6f2fdc24e..64a35f2787dc 100644 --- a/tools/perf/util/metricgroup.c +++ b/tools/perf/util/metricgroup.c @@ -1647,10 +1647,8 @@ static int metricgroup__has_metric_callback(const struct pmu_metric *pm, { const char *metric = vdata; - if (!pm->metric_expr) - return 0; - - if (match_metric(pm->metric_name, metric)) + if (match_metric(pm->metric_name, metric) || + match_metric(pm->metric_group, metric)) return 1; return 0; From patchwork Sun Feb 19 09:28:37 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59128 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp776324wrn; Sun, 19 Feb 2023 01:48:10 -0800 (PST) X-Google-Smtp-Source: AK7set++7pjwcsq/4KwHJkKKQG9lVmUeqXF0YQv6+nEnEqksWaBblM0rYvLIK05gs1lShv19jj+s X-Received: by 2002:a17:903:2345:b0:196:3feb:1f1e with SMTP id c5-20020a170903234500b001963feb1f1emr3298007plh.47.1676800090551; Sun, 19 Feb 2023 01:48:10 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676800090; cv=none; d=google.com; s=arc-20160816; b=WmAESlbK5+m6bZaGdqjXByzWZA25EdzxLP3bUyaDFZ/+0VvnUqVGwSQZc3XKarInRb /52kVoe/gaDpp18oM+gg1UIi2UkFCANcGNyjw/qQK/G6AHybZTPbWQiG0Av+kTCvF7F3 kMIp+Y6zTX39Nx2dbN/MiMNSPWJx0vI6YVAzgDKxmXHoCRIgzaDXFPZfv6CxYbzt2rRX 66L+445IsGTT0KehIG8a80muDXEsHBBK66wWGeIfynd62/l1jg0tHqrsOFBxUxx8v570 OHj09Shaa/Sga26Ovrka3A7HNGWu3PoJgRelbk5/JuxTpNDSAZOZ7ZM8TnBvGboCcvXm lBpw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=fWY19TSQgERdPbwErlbK0jtYxZSrTa14TWowXdbFFqA=; b=MCgQ9bbLNWo+iDx4G+VstDBDXGz9FguHKTynZHMa74tW113RERDKUum00l2LBFuptB hIoycJksifMFT27zaBj5CMGmUM0dtWhK551t8/xAxh+o98VYOYcHuKHVFzzIlEHF3mIg t2SudQYdUvKcluvZ5F5WnD+HpLIeoxGbOv2Fn/0gCKLgr49ROZ/wDUuQQXmDTzBYmB5E W99vrgwBaY+WN/5YMOVj+Eec6LvBM/yZ1/oR2ZA/qqp7E2QpkivR5LG4dFTDHBWjC5Fo I7zqZ1Tszz4vHVrixmmencbCZQndBBe76WEdb90R370oOupNih+iKSw2KpQFHXYCV2WV M2qg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=BCodiPm0; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id d17-20020a170903231100b001890c6ff018si6610511plh.486.2023.02.19.01.47.58; Sun, 19 Feb 2023 01:48:10 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=BCodiPm0; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230070AbjBSJjC (ORCPT + 99 others); Sun, 19 Feb 2023 04:39:02 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33496 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230061AbjBSJio (ORCPT ); Sun, 19 Feb 2023 04:38:44 -0500 Received: from mail-yw1-f201.google.com (mail-yw1-f201.google.com [209.85.128.201]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 014E77ECD for ; Sun, 19 Feb 2023 01:37:47 -0800 (PST) Received: by mail-yw1-f201.google.com with SMTP id 00721157ae682-53688fe539aso16853857b3.16 for ; Sun, 19 Feb 2023 01:37:47 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=fWY19TSQgERdPbwErlbK0jtYxZSrTa14TWowXdbFFqA=; b=BCodiPm0UhrFbNnkmEiSM+MMtwfNo/m14XrQ3eMr4N7XoB1C0uGhfU8Ydpwlimt4q1 PuZZluxQHIbtmW6GfZhwXyfvEEXJT1NZIsqObL68eINAqv9QwPI8uriBzqRT6OA0pB/6 55DiiPo5KIVtWFlyjRYm4j9EQyviVMIfz03smYySOvvhY3imVAl+LL/03rNQZhxR4sC6 FGdD+dvjr2vBC+9GKA+jKvuky1Yjs7Skzt6JTDcow7vOCHE8Gj3+KCnz+O6YCfuk7vp+ L8FxI4VmvLraC2PA3Lz+qIyB6qVm8XjqNMNsd34o6RbgdCnZ8/aNtpS8kf8U7rOnHvV2 doSQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=fWY19TSQgERdPbwErlbK0jtYxZSrTa14TWowXdbFFqA=; b=IbubO3fDHMYXL1leAUahnwLrbqfQe2QxXVoy/jrcQWcNWzBG/m0kMs2WNsp/U/rdZR WBE0xxKosB+S9upb2ZtgcX6zM+8M9IBjYSevfDwPZf/9l5NFdQCrMxHfX97I5Q7EVcHo opXPFKLTLXTmMA5YkaFQDOF1/EaYcb9UcWFal0ITjdzzO5vZJ9Udh9w91PK65UxXnb0T MoAKj7V+taSxPaokKpAGOvlRrv+aOMQGNGQhsXHrUq4PWfG1QiICuR3v+ec3O59cAeIq g3wFY8hBgITXOJWHwlDnHwPyi5+0+XpbK8cZbVRUQlJUjMnK/wJIb06TQSPVyxew0Hwk AqgQ== X-Gm-Message-State: AO0yUKUe8xxNCqptNUPPBZdQWjkXptnLNAvFLtNG4+MHij5T6XpdJRuA Jw2OPaxA0ho4vCnF7KejFNMp29hUGPmm X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a0d:f681:0:b0:52f:bc0:2164 with SMTP id g123-20020a0df681000000b0052f0bc02164mr1795352ywf.472.1676799276576; Sun, 19 Feb 2023 01:34:36 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:37 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-41-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 40/51] perf stat: Implement --topdown using json metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758252331708565917?= X-GMAIL-MSGID: =?utf-8?q?1758252331708565917?= Request the topdown metric group of a level with the metrics in the group 'TopdownL' rather than through specific events. As more topdown levels are supported this way, such as 6 on Intel Ice Lake, default to just showing the level 1 metrics. This can be overridden using '--td-level'. Rather than determine the maximum topdown level from sysfs, use the metric group names. Remove some now unused topdown code. Signed-off-by: Ian Rogers --- tools/perf/arch/x86/util/topdown.c | 48 +----------- tools/perf/builtin-stat.c | 118 +++++------------------------ tools/perf/util/metricgroup.c | 31 ++++++++ tools/perf/util/metricgroup.h | 1 + tools/perf/util/topdown.c | 68 +---------------- tools/perf/util/topdown.h | 11 +-- 6 files changed, 58 insertions(+), 219 deletions(-) diff --git a/tools/perf/arch/x86/util/topdown.c b/tools/perf/arch/x86/util/topdown.c index eb3a7d9652ab..9ad5e5c7bd27 100644 --- a/tools/perf/arch/x86/util/topdown.c +++ b/tools/perf/arch/x86/util/topdown.c @@ -1,11 +1,8 @@ // SPDX-License-Identifier: GPL-2.0 -#include #include "api/fs/fs.h" +#include "util/evsel.h" #include "util/pmu.h" #include "util/topdown.h" -#include "util/evlist.h" -#include "util/debug.h" -#include "util/pmu-hybrid.h" #include "topdown.h" #include "evsel.h" @@ -33,30 +30,6 @@ bool topdown_sys_has_perf_metrics(void) return has_perf_metrics; } -/* - * Check whether we can use a group for top down. - * Without a group may get bad results due to multiplexing. - */ -bool arch_topdown_check_group(bool *warn) -{ - int n; - - if (sysctl__read_int("kernel/nmi_watchdog", &n) < 0) - return false; - if (n > 0) { - *warn = true; - return false; - } - return true; -} - -void arch_topdown_group_warn(void) -{ - fprintf(stderr, - "nmi_watchdog enabled with topdown. May give wrong results.\n" - "Disable with echo 0 > /proc/sys/kernel/nmi_watchdog\n"); -} - #define TOPDOWN_SLOTS 0x0400 /* @@ -65,7 +38,6 @@ void arch_topdown_group_warn(void) * Only Topdown metric supports sample-read. The slots * event must be the leader of the topdown group. */ - bool arch_topdown_sample_read(struct evsel *leader) { if (!evsel__sys_has_perf_metrics(leader)) @@ -76,21 +48,3 @@ bool arch_topdown_sample_read(struct evsel *leader) return false; } - -const char *arch_get_topdown_pmu_name(struct evlist *evlist, bool warn) -{ - const char *pmu_name; - - if (!perf_pmu__has_hybrid()) - return "cpu"; - - if (!evlist->hybrid_pmu_name) { - if (warn) - pr_warning("WARNING: default to use cpu_core topdown events\n"); - evlist->hybrid_pmu_name = perf_pmu__hybrid_type_to_pmu("core"); - } - - pmu_name = evlist->hybrid_pmu_name; - - return pmu_name; -} diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c index 796e98e453f6..bdb1ef4fc6ad 100644 --- a/tools/perf/builtin-stat.c +++ b/tools/perf/builtin-stat.c @@ -124,39 +124,6 @@ static const char * transaction_limited_attrs = { "}" }; -static const char * topdown_attrs[] = { - "topdown-total-slots", - "topdown-slots-retired", - "topdown-recovery-bubbles", - "topdown-fetch-bubbles", - "topdown-slots-issued", - NULL, -}; - -static const char *topdown_metric_attrs[] = { - "slots", - "topdown-retiring", - "topdown-bad-spec", - "topdown-fe-bound", - "topdown-be-bound", - NULL, -}; - -static const char *topdown_metric_L2_attrs[] = { - "slots", - "topdown-retiring", - "topdown-bad-spec", - "topdown-fe-bound", - "topdown-be-bound", - "topdown-heavy-ops", - "topdown-br-mispredict", - "topdown-fetch-lat", - "topdown-mem-bound", - NULL, -}; - -#define TOPDOWN_MAX_LEVEL 2 - static const char *smi_cost_attrs = { "{" "msr/aperf/," @@ -1914,86 +1881,41 @@ static int add_default_attributes(void) } if (topdown_run) { - const char **metric_attrs = topdown_metric_attrs; - unsigned int max_level = 1; - char *str = NULL; - bool warn = false; - const char *pmu_name = arch_get_topdown_pmu_name(evsel_list, true); + unsigned int max_level = metricgroups__topdown_max_level(); + char str[] = "TopdownL1"; if (!force_metric_only) stat_config.metric_only = true; - if (pmu_have_event(pmu_name, topdown_metric_L2_attrs[5])) { - metric_attrs = topdown_metric_L2_attrs; - max_level = 2; + if (!max_level) { + pr_err("Topdown requested but the topdown metric groups aren't present.\n" + "(See perf list the metric groups have names like TopdownL1)"); + return -1; } - if (stat_config.topdown_level > max_level) { pr_err("Invalid top-down metrics level. The max level is %u.\n", max_level); return -1; } else if (!stat_config.topdown_level) - stat_config.topdown_level = max_level; + stat_config.topdown_level = 1; - if (topdown_filter_events(metric_attrs, &str, 1, pmu_name) < 0) { - pr_err("Out of memory\n"); - return -1; - } - - if (metric_attrs[0] && str) { - if (!stat_config.interval && !stat_config.metric_only) { - fprintf(stat_config.output, - "Topdown accuracy may decrease when measuring long periods.\n" - "Please print the result regularly, e.g. -I1000\n"); - } - goto setup_metrics; - } - - zfree(&str); - - if (stat_config.aggr_mode != AGGR_GLOBAL && - stat_config.aggr_mode != AGGR_CORE) { - pr_err("top down event configuration requires --per-core mode\n"); - return -1; - } - stat_config.aggr_mode = AGGR_CORE; - if (nr_cgroups || !target__has_cpu(&target)) { - pr_err("top down event configuration requires system-wide mode (-a)\n"); - return -1; - } - - if (topdown_filter_events(topdown_attrs, &str, - arch_topdown_check_group(&warn), - pmu_name) < 0) { - pr_err("Out of memory\n"); - return -1; + if (!stat_config.interval && !stat_config.metric_only) { + fprintf(stat_config.output, + "Topdown accuracy may decrease when measuring long periods.\n" + "Please print the result regularly, e.g. -I1000\n"); } - - if (topdown_attrs[0] && str) { - struct parse_events_error errinfo; - if (warn) - arch_topdown_group_warn(); -setup_metrics: - parse_events_error__init(&errinfo); - err = parse_events(evsel_list, str, &errinfo); - if (err) { - fprintf(stderr, - "Cannot set up top down events %s: %d\n", - str, err); - parse_events_error__print(&errinfo, str); - parse_events_error__exit(&errinfo); - free(str); - return -1; - } - parse_events_error__exit(&errinfo); - } else { - fprintf(stderr, "System does not support topdown\n"); + str[8] = stat_config.topdown_level + '0'; + if (metricgroup__parse_groups(evsel_list, str, + /*metric_no_group=*/false, + /*metric_no_merge=*/false, + /*metric_no_threshold=*/true, + stat_config.user_requested_cpu_list, + stat_config.system_wide, + &stat_config.metric_events) < 0) return -1; - } - free(str); } if (!stat_config.topdown_level) - stat_config.topdown_level = TOPDOWN_MAX_LEVEL; + stat_config.topdown_level = 1; if (!evsel_list->core.nr_entries) { /* No events so add defaults. */ diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c index 64a35f2787dc..de6dd527a2ba 100644 --- a/tools/perf/util/metricgroup.c +++ b/tools/perf/util/metricgroup.c @@ -1665,6 +1665,37 @@ bool metricgroup__has_metric(const char *metric) (void *)metric) ? true : false; } +static int metricgroup__topdown_max_level_callback(const struct pmu_metric *pm, + const struct pmu_metrics_table *table __maybe_unused, + void *data) +{ + unsigned int *max_level = data; + unsigned int level; + const char *p = strstr(pm->metric_group, "TopdownL"); + + if (!p || p[8] == '\0') + return 0; + + level = p[8] - '0'; + if (level > *max_level) + *max_level = level; + + return 0; +} + +unsigned int metricgroups__topdown_max_level(void) +{ + unsigned int max_level = 0; + const struct pmu_metrics_table *table = pmu_metrics_table__find(); + + if (!table) + return false; + + pmu_metrics_table_for_each_metric(table, metricgroup__topdown_max_level_callback, + &max_level); + return max_level; +} + int metricgroup__copy_metric_events(struct evlist *evlist, struct cgroup *cgrp, struct rblist *new_metric_events, struct rblist *old_metric_events) diff --git a/tools/perf/util/metricgroup.h b/tools/perf/util/metricgroup.h index 8d50052c5b4c..77472e35705e 100644 --- a/tools/perf/util/metricgroup.h +++ b/tools/perf/util/metricgroup.h @@ -81,6 +81,7 @@ int metricgroup__parse_groups_test(struct evlist *evlist, void metricgroup__print(const struct print_callbacks *print_cb, void *print_state); bool metricgroup__has_metric(const char *metric); +unsigned int metricgroups__topdown_max_level(void); int arch_get_runtimeparam(const struct pmu_metric *pm); void metricgroup__rblist_exit(struct rblist *metric_events); diff --git a/tools/perf/util/topdown.c b/tools/perf/util/topdown.c index 1090841550f7..18fd5fed5d1a 100644 --- a/tools/perf/util/topdown.c +++ b/tools/perf/util/topdown.c @@ -1,74 +1,8 @@ // SPDX-License-Identifier: GPL-2.0 -#include -#include "pmu.h" -#include "pmu-hybrid.h" #include "topdown.h" - -int topdown_filter_events(const char **attr, char **str, bool use_group, - const char *pmu_name) -{ - int off = 0; - int i; - int len = 0; - char *s; - bool is_hybrid = perf_pmu__is_hybrid(pmu_name); - - for (i = 0; attr[i]; i++) { - if (pmu_have_event(pmu_name, attr[i])) { - if (is_hybrid) - len += strlen(attr[i]) + strlen(pmu_name) + 3; - else - len += strlen(attr[i]) + 1; - attr[i - off] = attr[i]; - } else - off++; - } - attr[i - off] = NULL; - - *str = malloc(len + 1 + 2); - if (!*str) - return -1; - s = *str; - if (i - off == 0) { - *s = 0; - return 0; - } - if (use_group) - *s++ = '{'; - for (i = 0; attr[i]; i++) { - if (!is_hybrid) - strcpy(s, attr[i]); - else - sprintf(s, "%s/%s/", pmu_name, attr[i]); - s += strlen(s); - *s++ = ','; - } - if (use_group) { - s[-1] = '}'; - *s = 0; - } else - s[-1] = 0; - return 0; -} - -__weak bool arch_topdown_check_group(bool *warn) -{ - *warn = false; - return false; -} - -__weak void arch_topdown_group_warn(void) -{ -} +#include __weak bool arch_topdown_sample_read(struct evsel *leader __maybe_unused) { return false; } - -__weak const char *arch_get_topdown_pmu_name(struct evlist *evlist - __maybe_unused, - bool warn __maybe_unused) -{ - return "cpu"; -} diff --git a/tools/perf/util/topdown.h b/tools/perf/util/topdown.h index f9531528c559..1996c5fedcd7 100644 --- a/tools/perf/util/topdown.h +++ b/tools/perf/util/topdown.h @@ -1,14 +1,11 @@ /* SPDX-License-Identifier: GPL-2.0 */ #ifndef TOPDOWN_H #define TOPDOWN_H 1 -#include "evsel.h" -#include "evlist.h" -bool arch_topdown_check_group(bool *warn); -void arch_topdown_group_warn(void); +#include + +struct evsel; + bool arch_topdown_sample_read(struct evsel *leader); -const char *arch_get_topdown_pmu_name(struct evlist *evlist, bool warn); -int topdown_filter_events(const char **attr, char **str, bool use_group, - const char *pmu_name); #endif From patchwork Sun Feb 19 09:28:38 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59132 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp776776wrn; Sun, 19 Feb 2023 01:49:53 -0800 (PST) X-Google-Smtp-Source: AK7set8Ec/u3BRKeyA4GR/MdrLdvmRxlAV5cTZl5EAZcmgfHjNzwOjEqYWodnXpdeIReKO9QxoHA X-Received: by 2002:a05:6a20:7fa0:b0:bf:5e5:9d85 with SMTP id d32-20020a056a207fa000b000bf05e59d85mr8312307pzj.40.1676800192799; Sun, 19 Feb 2023 01:49:52 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676800192; cv=none; d=google.com; s=arc-20160816; b=E7vsoWr7I1BlR8jdkXswN9TZ9U+9qv6KQRXGgEQGJAd3ugoPOwP5c/POf67IjYrEAk X4wvfu9rabQRB8lgU9e2xb1UYJvo87vLwpAsu/YyjnYQAZvgcg8wY9MU/wQ8brV6fjt9 i2dRd+1vW1dzCUMnMQAI/q4NCSaxRryu/5J4QnEFnmEw0rW6NKv6w3grh0b0DNa3iLpm E7LUikYxJ+JbBVsUywDEkdBEMMogMEKVEBhh1+usAC4o4bEMtP7gUrjquFxqESl5dEBo 8OFfG4rAmmNE4FUmZSCRXsA4PdZhfSKZrf09DVz3wROLFd9YNi4DNjrThaxg4XZHzsAG QxIQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=IJ9mpKEII8b9zP7ZWkBkiRr3d/wqZ7eKUUGAHl4KRMM=; b=LzH2Tzi0rj7P+Wcc6DdpCmZOTw9a2QUGAKOhMi+c75Dz1yRCPo1XPUdAFOsiUa48w8 QTG44JoQT7XE9zAiWat9TSY8Du9Im6hEuxUHH5nL4GT+TnGVqjnmDjmm+CbXMluPh0NG zb3z6sJGJy8e/sNeoqYR6bBhe6vsxJI3+ch1sL6KEsfEWh79p45IDJ+dnClUx/jKzZTA /axsxGfPb8NQwjxEzrPk+KD9HyNtIHrlHtE7VUpDyl7NL5r+QMUJWuBbPkJ6mBJG/wqc LMDXARsmxNzseT/rEx516KnqSKteqJxSQIhyoj2pjIZ4WmSSnpSnWJaQWqJX308WTUhc BdzQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=Wq7kd9Az; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id q3-20020a656843000000b004f0131105e9si10996610pgt.855.2023.02.19.01.49.40; Sun, 19 Feb 2023 01:49:52 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=Wq7kd9Az; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230098AbjBSJmc (ORCPT + 99 others); Sun, 19 Feb 2023 04:42:32 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36986 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230013AbjBSJmb (ORCPT ); Sun, 19 Feb 2023 04:42:31 -0500 Received: from mail-yw1-f201.google.com (mail-yw1-f201.google.com [209.85.128.201]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4D8CC11EB3 for ; Sun, 19 Feb 2023 01:41:41 -0800 (PST) Received: by mail-yw1-f201.google.com with SMTP id 00721157ae682-53688522d28so18077227b3.12 for ; Sun, 19 Feb 2023 01:41:41 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=IJ9mpKEII8b9zP7ZWkBkiRr3d/wqZ7eKUUGAHl4KRMM=; b=Wq7kd9AzZfERH/Wm96yVGtChfJIxX+Hq45TeTzrKqX6FBA90UqzrHKNaRhJjgtar5/ bhimsNdCsV5BYLK61cgfvHH1vhKfftVlyZ0hn+nA1cQcUXu1UKlJVhj9lMHk5M6FLuvM vNTfKd0izX9zt/h4iW3ALc3gSdLjLzCbUyNUnDg4Z3VCd1vDmdhXFdZgrkePkoQ/dEtS jK+sZatpxHnfOSjYx2XD6a/9cOoznt6WeG8PHDMQk90hC7bbXgYcy0GaACirsuMbCXoi hhOzFVvuyY2zoB6AUilSgOvhEvFH/siKENcL+09Vk0Xo2eRaQSbzj7mKSMywxjG4M9iv d8NA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=IJ9mpKEII8b9zP7ZWkBkiRr3d/wqZ7eKUUGAHl4KRMM=; b=LGPHG+A9YASGJx5jNL4wlXUutSBkfx2OpFkCwK7yeBqeCfG+jeN+8v8wiCkBeMh4fv 9EQyLhvWDybMAW49lSA6ky66vWXZAIxaaGv5Z/ZF5IecgRo5cm3bQFKUDuLRu2Z7pSSn Mknura5LyxtXnXQazypffLCsxBH+P7XgdgS5sJm2QVeDaQVikdesLTNFq1qrcZ5zUhPk qNOsgvSEmhZo7HQucY9HWftBv/oQ4/4fmhogeZ18VTjApJyPYsmZXO8o0y+68KsX5qdC OZ/WBt2kCnSW8K6E9iuLLwW+tyD0DUTmaNONrhvgun35CnDOrlwUDtaOgiBCzhERf5VX p4HA== X-Gm-Message-State: AO0yUKXBP9NFpOxCXOEQNl6tNMeid7IUfNr3fmGej008TbRjJjSe1RRj ZxW/ZjGr8DnlMU4Ecib4G8+/gDCpIqTE X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a81:ae0a:0:b0:52e:f66d:b70d with SMTP id m10-20020a81ae0a000000b0052ef66db70dmr22944ywh.0.1676799285601; Sun, 19 Feb 2023 01:34:45 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:38 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-42-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 41/51] perf stat: Remove topdown event special handling From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758252439098119526?= X-GMAIL-MSGID: =?utf-8?q?1758252439098119526?= Now the events are computed from json metrics, the hard coded logic can be removed. Signed-off-by: Ian Rogers --- tools/perf/util/stat-shadow.c | 346 ---------------------------------- tools/perf/util/stat.c | 13 -- tools/perf/util/stat.h | 26 --- 3 files changed, 385 deletions(-) diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c index 77483eeda0d8..5189756bf16d 100644 --- a/tools/perf/util/stat-shadow.c +++ b/tools/perf/util/stat-shadow.c @@ -241,45 +241,6 @@ void perf_stat__update_shadow_stats(struct evsel *counter, u64 count, update_runtime_stat(st, STAT_TRANSACTION, map_idx, count, &rsd); else if (perf_stat_evsel__is(counter, ELISION_START)) update_runtime_stat(st, STAT_ELISION, map_idx, count, &rsd); - else if (perf_stat_evsel__is(counter, TOPDOWN_TOTAL_SLOTS)) - update_runtime_stat(st, STAT_TOPDOWN_TOTAL_SLOTS, - map_idx, count, &rsd); - else if (perf_stat_evsel__is(counter, TOPDOWN_SLOTS_ISSUED)) - update_runtime_stat(st, STAT_TOPDOWN_SLOTS_ISSUED, - map_idx, count, &rsd); - else if (perf_stat_evsel__is(counter, TOPDOWN_SLOTS_RETIRED)) - update_runtime_stat(st, STAT_TOPDOWN_SLOTS_RETIRED, - map_idx, count, &rsd); - else if (perf_stat_evsel__is(counter, TOPDOWN_FETCH_BUBBLES)) - update_runtime_stat(st, STAT_TOPDOWN_FETCH_BUBBLES, - map_idx, count, &rsd); - else if (perf_stat_evsel__is(counter, TOPDOWN_RECOVERY_BUBBLES)) - update_runtime_stat(st, STAT_TOPDOWN_RECOVERY_BUBBLES, - map_idx, count, &rsd); - else if (perf_stat_evsel__is(counter, TOPDOWN_RETIRING)) - update_runtime_stat(st, STAT_TOPDOWN_RETIRING, - map_idx, count, &rsd); - else if (perf_stat_evsel__is(counter, TOPDOWN_BAD_SPEC)) - update_runtime_stat(st, STAT_TOPDOWN_BAD_SPEC, - map_idx, count, &rsd); - else if (perf_stat_evsel__is(counter, TOPDOWN_FE_BOUND)) - update_runtime_stat(st, STAT_TOPDOWN_FE_BOUND, - map_idx, count, &rsd); - else if (perf_stat_evsel__is(counter, TOPDOWN_BE_BOUND)) - update_runtime_stat(st, STAT_TOPDOWN_BE_BOUND, - map_idx, count, &rsd); - else if (perf_stat_evsel__is(counter, TOPDOWN_HEAVY_OPS)) - update_runtime_stat(st, STAT_TOPDOWN_HEAVY_OPS, - map_idx, count, &rsd); - else if (perf_stat_evsel__is(counter, TOPDOWN_BR_MISPREDICT)) - update_runtime_stat(st, STAT_TOPDOWN_BR_MISPREDICT, - map_idx, count, &rsd); - else if (perf_stat_evsel__is(counter, TOPDOWN_FETCH_LAT)) - update_runtime_stat(st, STAT_TOPDOWN_FETCH_LAT, - map_idx, count, &rsd); - else if (perf_stat_evsel__is(counter, TOPDOWN_MEM_BOUND)) - update_runtime_stat(st, STAT_TOPDOWN_MEM_BOUND, - map_idx, count, &rsd); else if (evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_FRONTEND)) update_runtime_stat(st, STAT_STALLED_CYCLES_FRONT, map_idx, count, &rsd); @@ -524,156 +485,6 @@ static void print_ll_cache_misses(struct perf_stat_config *config, out->print_metric(config, out->ctx, color, "%7.2f%%", "of all LL-cache accesses", ratio); } -/* - * High level "TopDown" CPU core pipe line bottleneck break down. - * - * Basic concept following - * Yasin, A Top Down Method for Performance analysis and Counter architecture - * ISPASS14 - * - * The CPU pipeline is divided into 4 areas that can be bottlenecks: - * - * Frontend -> Backend -> Retiring - * BadSpeculation in addition means out of order execution that is thrown away - * (for example branch mispredictions) - * Frontend is instruction decoding. - * Backend is execution, like computation and accessing data in memory - * Retiring is good execution that is not directly bottlenecked - * - * The formulas are computed in slots. - * A slot is an entry in the pipeline each for the pipeline width - * (for example a 4-wide pipeline has 4 slots for each cycle) - * - * Formulas: - * BadSpeculation = ((SlotsIssued - SlotsRetired) + RecoveryBubbles) / - * TotalSlots - * Retiring = SlotsRetired / TotalSlots - * FrontendBound = FetchBubbles / TotalSlots - * BackendBound = 1.0 - BadSpeculation - Retiring - FrontendBound - * - * The kernel provides the mapping to the low level CPU events and any scaling - * needed for the CPU pipeline width, for example: - * - * TotalSlots = Cycles * 4 - * - * The scaling factor is communicated in the sysfs unit. - * - * In some cases the CPU may not be able to measure all the formulas due to - * missing events. In this case multiple formulas are combined, as possible. - * - * Full TopDown supports more levels to sub-divide each area: for example - * BackendBound into computing bound and memory bound. For now we only - * support Level 1 TopDown. - */ - -static double sanitize_val(double x) -{ - if (x < 0 && x >= -0.02) - return 0.0; - return x; -} - -static double td_total_slots(int map_idx, struct runtime_stat *st, - struct runtime_stat_data *rsd) -{ - return runtime_stat_avg(st, STAT_TOPDOWN_TOTAL_SLOTS, map_idx, rsd); -} - -static double td_bad_spec(int map_idx, struct runtime_stat *st, - struct runtime_stat_data *rsd) -{ - double bad_spec = 0; - double total_slots; - double total; - - total = runtime_stat_avg(st, STAT_TOPDOWN_SLOTS_ISSUED, map_idx, rsd) - - runtime_stat_avg(st, STAT_TOPDOWN_SLOTS_RETIRED, map_idx, rsd) + - runtime_stat_avg(st, STAT_TOPDOWN_RECOVERY_BUBBLES, map_idx, rsd); - - total_slots = td_total_slots(map_idx, st, rsd); - if (total_slots) - bad_spec = total / total_slots; - return sanitize_val(bad_spec); -} - -static double td_retiring(int map_idx, struct runtime_stat *st, - struct runtime_stat_data *rsd) -{ - double retiring = 0; - double total_slots = td_total_slots(map_idx, st, rsd); - double ret_slots = runtime_stat_avg(st, STAT_TOPDOWN_SLOTS_RETIRED, - map_idx, rsd); - - if (total_slots) - retiring = ret_slots / total_slots; - return retiring; -} - -static double td_fe_bound(int map_idx, struct runtime_stat *st, - struct runtime_stat_data *rsd) -{ - double fe_bound = 0; - double total_slots = td_total_slots(map_idx, st, rsd); - double fetch_bub = runtime_stat_avg(st, STAT_TOPDOWN_FETCH_BUBBLES, - map_idx, rsd); - - if (total_slots) - fe_bound = fetch_bub / total_slots; - return fe_bound; -} - -static double td_be_bound(int map_idx, struct runtime_stat *st, - struct runtime_stat_data *rsd) -{ - double sum = (td_fe_bound(map_idx, st, rsd) + - td_bad_spec(map_idx, st, rsd) + - td_retiring(map_idx, st, rsd)); - if (sum == 0) - return 0; - return sanitize_val(1.0 - sum); -} - -/* - * Kernel reports metrics multiplied with slots. To get back - * the ratios we need to recreate the sum. - */ - -static double td_metric_ratio(int map_idx, enum stat_type type, - struct runtime_stat *stat, - struct runtime_stat_data *rsd) -{ - double sum = runtime_stat_avg(stat, STAT_TOPDOWN_RETIRING, map_idx, rsd) + - runtime_stat_avg(stat, STAT_TOPDOWN_FE_BOUND, map_idx, rsd) + - runtime_stat_avg(stat, STAT_TOPDOWN_BE_BOUND, map_idx, rsd) + - runtime_stat_avg(stat, STAT_TOPDOWN_BAD_SPEC, map_idx, rsd); - double d = runtime_stat_avg(stat, type, map_idx, rsd); - - if (sum) - return d / sum; - return 0; -} - -/* - * ... but only if most of the values are actually available. - * We allow two missing. - */ - -static bool full_td(int map_idx, struct runtime_stat *stat, - struct runtime_stat_data *rsd) -{ - int c = 0; - - if (runtime_stat_avg(stat, STAT_TOPDOWN_RETIRING, map_idx, rsd) > 0) - c++; - if (runtime_stat_avg(stat, STAT_TOPDOWN_BE_BOUND, map_idx, rsd) > 0) - c++; - if (runtime_stat_avg(stat, STAT_TOPDOWN_FE_BOUND, map_idx, rsd) > 0) - c++; - if (runtime_stat_avg(stat, STAT_TOPDOWN_BAD_SPEC, map_idx, rsd) > 0) - c++; - return c >= 2; -} - static void print_smi_cost(struct perf_stat_config *config, int map_idx, struct perf_stat_output_ctx *out, struct runtime_stat *st, @@ -885,7 +696,6 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, void *ctxp = out->ctx; print_metric_t print_metric = out->print_metric; double total, ratio = 0.0, total2; - const char *color = NULL; struct runtime_stat_data rsd = { .ctx = evsel_context(evsel), .cgrp = evsel->cgrp, @@ -1044,162 +854,6 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, avg / (ratio * evsel->scale)); else print_metric(config, ctxp, NULL, NULL, "CPUs utilized", 0); - } else if (perf_stat_evsel__is(evsel, TOPDOWN_FETCH_BUBBLES)) { - double fe_bound = td_fe_bound(map_idx, st, &rsd); - - if (fe_bound > 0.2) - color = PERF_COLOR_RED; - print_metric(config, ctxp, color, "%8.1f%%", "frontend bound", - fe_bound * 100.); - } else if (perf_stat_evsel__is(evsel, TOPDOWN_SLOTS_RETIRED)) { - double retiring = td_retiring(map_idx, st, &rsd); - - if (retiring > 0.7) - color = PERF_COLOR_GREEN; - print_metric(config, ctxp, color, "%8.1f%%", "retiring", - retiring * 100.); - } else if (perf_stat_evsel__is(evsel, TOPDOWN_RECOVERY_BUBBLES)) { - double bad_spec = td_bad_spec(map_idx, st, &rsd); - - if (bad_spec > 0.1) - color = PERF_COLOR_RED; - print_metric(config, ctxp, color, "%8.1f%%", "bad speculation", - bad_spec * 100.); - } else if (perf_stat_evsel__is(evsel, TOPDOWN_SLOTS_ISSUED)) { - double be_bound = td_be_bound(map_idx, st, &rsd); - const char *name = "backend bound"; - static int have_recovery_bubbles = -1; - - /* In case the CPU does not support topdown-recovery-bubbles */ - if (have_recovery_bubbles < 0) - have_recovery_bubbles = pmu_have_event("cpu", - "topdown-recovery-bubbles"); - if (!have_recovery_bubbles) - name = "backend bound/bad spec"; - - if (be_bound > 0.2) - color = PERF_COLOR_RED; - if (td_total_slots(map_idx, st, &rsd) > 0) - print_metric(config, ctxp, color, "%8.1f%%", name, - be_bound * 100.); - else - print_metric(config, ctxp, NULL, NULL, name, 0); - } else if (perf_stat_evsel__is(evsel, TOPDOWN_RETIRING) && - full_td(map_idx, st, &rsd)) { - double retiring = td_metric_ratio(map_idx, - STAT_TOPDOWN_RETIRING, st, - &rsd); - if (retiring > 0.7) - color = PERF_COLOR_GREEN; - print_metric(config, ctxp, color, "%8.1f%%", "Retiring", - retiring * 100.); - } else if (perf_stat_evsel__is(evsel, TOPDOWN_FE_BOUND) && - full_td(map_idx, st, &rsd)) { - double fe_bound = td_metric_ratio(map_idx, - STAT_TOPDOWN_FE_BOUND, st, - &rsd); - if (fe_bound > 0.2) - color = PERF_COLOR_RED; - print_metric(config, ctxp, color, "%8.1f%%", "Frontend Bound", - fe_bound * 100.); - } else if (perf_stat_evsel__is(evsel, TOPDOWN_BE_BOUND) && - full_td(map_idx, st, &rsd)) { - double be_bound = td_metric_ratio(map_idx, - STAT_TOPDOWN_BE_BOUND, st, - &rsd); - if (be_bound > 0.2) - color = PERF_COLOR_RED; - print_metric(config, ctxp, color, "%8.1f%%", "Backend Bound", - be_bound * 100.); - } else if (perf_stat_evsel__is(evsel, TOPDOWN_BAD_SPEC) && - full_td(map_idx, st, &rsd)) { - double bad_spec = td_metric_ratio(map_idx, - STAT_TOPDOWN_BAD_SPEC, st, - &rsd); - if (bad_spec > 0.1) - color = PERF_COLOR_RED; - print_metric(config, ctxp, color, "%8.1f%%", "Bad Speculation", - bad_spec * 100.); - } else if (perf_stat_evsel__is(evsel, TOPDOWN_HEAVY_OPS) && - full_td(map_idx, st, &rsd) && (config->topdown_level > 1)) { - double retiring = td_metric_ratio(map_idx, - STAT_TOPDOWN_RETIRING, st, - &rsd); - double heavy_ops = td_metric_ratio(map_idx, - STAT_TOPDOWN_HEAVY_OPS, st, - &rsd); - double light_ops = retiring - heavy_ops; - - if (retiring > 0.7 && heavy_ops > 0.1) - color = PERF_COLOR_GREEN; - print_metric(config, ctxp, color, "%8.1f%%", "Heavy Operations", - heavy_ops * 100.); - if (retiring > 0.7 && light_ops > 0.6) - color = PERF_COLOR_GREEN; - else - color = NULL; - print_metric(config, ctxp, color, "%8.1f%%", "Light Operations", - light_ops * 100.); - } else if (perf_stat_evsel__is(evsel, TOPDOWN_BR_MISPREDICT) && - full_td(map_idx, st, &rsd) && (config->topdown_level > 1)) { - double bad_spec = td_metric_ratio(map_idx, - STAT_TOPDOWN_BAD_SPEC, st, - &rsd); - double br_mis = td_metric_ratio(map_idx, - STAT_TOPDOWN_BR_MISPREDICT, st, - &rsd); - double m_clears = bad_spec - br_mis; - - if (bad_spec > 0.1 && br_mis > 0.05) - color = PERF_COLOR_RED; - print_metric(config, ctxp, color, "%8.1f%%", "Branch Mispredict", - br_mis * 100.); - if (bad_spec > 0.1 && m_clears > 0.05) - color = PERF_COLOR_RED; - else - color = NULL; - print_metric(config, ctxp, color, "%8.1f%%", "Machine Clears", - m_clears * 100.); - } else if (perf_stat_evsel__is(evsel, TOPDOWN_FETCH_LAT) && - full_td(map_idx, st, &rsd) && (config->topdown_level > 1)) { - double fe_bound = td_metric_ratio(map_idx, - STAT_TOPDOWN_FE_BOUND, st, - &rsd); - double fetch_lat = td_metric_ratio(map_idx, - STAT_TOPDOWN_FETCH_LAT, st, - &rsd); - double fetch_bw = fe_bound - fetch_lat; - - if (fe_bound > 0.2 && fetch_lat > 0.15) - color = PERF_COLOR_RED; - print_metric(config, ctxp, color, "%8.1f%%", "Fetch Latency", - fetch_lat * 100.); - if (fe_bound > 0.2 && fetch_bw > 0.1) - color = PERF_COLOR_RED; - else - color = NULL; - print_metric(config, ctxp, color, "%8.1f%%", "Fetch Bandwidth", - fetch_bw * 100.); - } else if (perf_stat_evsel__is(evsel, TOPDOWN_MEM_BOUND) && - full_td(map_idx, st, &rsd) && (config->topdown_level > 1)) { - double be_bound = td_metric_ratio(map_idx, - STAT_TOPDOWN_BE_BOUND, st, - &rsd); - double mem_bound = td_metric_ratio(map_idx, - STAT_TOPDOWN_MEM_BOUND, st, - &rsd); - double core_bound = be_bound - mem_bound; - - if (be_bound > 0.2 && mem_bound > 0.2) - color = PERF_COLOR_RED; - print_metric(config, ctxp, color, "%8.1f%%", "Memory Bound", - mem_bound * 100.); - if (be_bound > 0.2 && core_bound > 0.1) - color = PERF_COLOR_RED; - else - color = NULL; - print_metric(config, ctxp, color, "%8.1f%%", "Core Bound", - core_bound * 100.); } else if (runtime_stat_n(st, STAT_NSECS, map_idx, &rsd) != 0) { char unit = ' '; char unit_buf[10] = "/sec"; diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c index 534d36d26fc3..0b8c91ca13cd 100644 --- a/tools/perf/util/stat.c +++ b/tools/perf/util/stat.c @@ -91,19 +91,6 @@ static const char *id_str[PERF_STAT_EVSEL_ID__MAX] = { ID(TRANSACTION_START, cpu/tx-start/), ID(ELISION_START, cpu/el-start/), ID(CYCLES_IN_TX_CP, cpu/cycles-ct/), - ID(TOPDOWN_TOTAL_SLOTS, topdown-total-slots), - ID(TOPDOWN_SLOTS_ISSUED, topdown-slots-issued), - ID(TOPDOWN_SLOTS_RETIRED, topdown-slots-retired), - ID(TOPDOWN_FETCH_BUBBLES, topdown-fetch-bubbles), - ID(TOPDOWN_RECOVERY_BUBBLES, topdown-recovery-bubbles), - ID(TOPDOWN_RETIRING, topdown-retiring), - ID(TOPDOWN_BAD_SPEC, topdown-bad-spec), - ID(TOPDOWN_FE_BOUND, topdown-fe-bound), - ID(TOPDOWN_BE_BOUND, topdown-be-bound), - ID(TOPDOWN_HEAVY_OPS, topdown-heavy-ops), - ID(TOPDOWN_BR_MISPREDICT, topdown-br-mispredict), - ID(TOPDOWN_FETCH_LAT, topdown-fetch-lat), - ID(TOPDOWN_MEM_BOUND, topdown-mem-bound), ID(SMI_NUM, msr/smi/), ID(APERF, msr/aperf/), }; diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h index cf2d8aa445f3..42af350a96d9 100644 --- a/tools/perf/util/stat.h +++ b/tools/perf/util/stat.h @@ -25,19 +25,6 @@ enum perf_stat_evsel_id { PERF_STAT_EVSEL_ID__TRANSACTION_START, PERF_STAT_EVSEL_ID__ELISION_START, PERF_STAT_EVSEL_ID__CYCLES_IN_TX_CP, - PERF_STAT_EVSEL_ID__TOPDOWN_TOTAL_SLOTS, - PERF_STAT_EVSEL_ID__TOPDOWN_SLOTS_ISSUED, - PERF_STAT_EVSEL_ID__TOPDOWN_SLOTS_RETIRED, - PERF_STAT_EVSEL_ID__TOPDOWN_FETCH_BUBBLES, - PERF_STAT_EVSEL_ID__TOPDOWN_RECOVERY_BUBBLES, - PERF_STAT_EVSEL_ID__TOPDOWN_RETIRING, - PERF_STAT_EVSEL_ID__TOPDOWN_BAD_SPEC, - PERF_STAT_EVSEL_ID__TOPDOWN_FE_BOUND, - PERF_STAT_EVSEL_ID__TOPDOWN_BE_BOUND, - PERF_STAT_EVSEL_ID__TOPDOWN_HEAVY_OPS, - PERF_STAT_EVSEL_ID__TOPDOWN_BR_MISPREDICT, - PERF_STAT_EVSEL_ID__TOPDOWN_FETCH_LAT, - PERF_STAT_EVSEL_ID__TOPDOWN_MEM_BOUND, PERF_STAT_EVSEL_ID__SMI_NUM, PERF_STAT_EVSEL_ID__APERF, PERF_STAT_EVSEL_ID__MAX, @@ -108,19 +95,6 @@ enum stat_type { STAT_CYCLES_IN_TX, STAT_TRANSACTION, STAT_ELISION, - STAT_TOPDOWN_TOTAL_SLOTS, - STAT_TOPDOWN_SLOTS_ISSUED, - STAT_TOPDOWN_SLOTS_RETIRED, - STAT_TOPDOWN_FETCH_BUBBLES, - STAT_TOPDOWN_RECOVERY_BUBBLES, - STAT_TOPDOWN_RETIRING, - STAT_TOPDOWN_BAD_SPEC, - STAT_TOPDOWN_FE_BOUND, - STAT_TOPDOWN_BE_BOUND, - STAT_TOPDOWN_HEAVY_OPS, - STAT_TOPDOWN_BR_MISPREDICT, - STAT_TOPDOWN_FETCH_LAT, - STAT_TOPDOWN_MEM_BOUND, STAT_SMI_NUM, STAT_APERF, STAT_MAX From patchwork Sun Feb 19 09:28:39 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59138 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp806447wrn; Sun, 19 Feb 2023 03:32:35 -0800 (PST) X-Google-Smtp-Source: AK7set952/zel7/jcCqKZPHD+XybCHR9qZwpI7JNs29XMBkgkTAqOPVWyXotKsKOrPhbpBZevTTZ X-Received: by 2002:a17:906:c407:b0:8af:5154:ff8e with SMTP id u7-20020a170906c40700b008af5154ff8emr6158997ejz.15.1676806355554; Sun, 19 Feb 2023 03:32:35 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676806355; cv=none; d=google.com; s=arc-20160816; b=YmLLLs8wL+aHcC+nKG1xjMdztiI9cJts6Q9sleq1cqgcRK29FQdwfDb6/M7/QbfzuR wXv3j29IetEImQ+x0aZzbYTY9sbKOK8fU4HuDJud83MV6r6D55hoJdiRhT93r95T3ukE Kf7zXIq9HvBBgEqKey8CYIXZMMpU3+QZLR71+KxMTDg61BPbPpdDhFeqO1d4ly0moiiH mhlopecZDkMgKkBT/echk3yyQ1nEpsyp218yGvj+guNhcsXMnMGpZc17Q2Q+h0ygHc3K D9wv+wPJUQgiuzQTiaRDwRpqLBd2nEWrnjg/0E3DUj1sKeW3Xn4uX2x5eZwKq+lakRd1 o5LA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=QEW8cpP+9cEKKFEdVgrXtSkmKtdo53fKw92TT63XLiE=; b=tqibHMbEW+OYDYLUZxpzfp5kpJhUTwlTQ9PtiAxZuj0QQHefNVWTFP5igbDJLb9AwS O36oJQ3QUasz52UfotVKABWkwrfMycilOJwvLbxpAL14BlQuCw8WF+guh2abDWMSN/Kd brS0As9gQsh6bWKY+ezvX8K3tQlepcKqkaiBWsXG9BntSWBNKH0oGFDFjLqiwVLdBosM fV89Bw7aBSpNth1cd+4zx3dMPuYUsbXW54F7irsC99+i45lY6K8yIZ8YOrH7XBDtaayH x3V8NcBPVObRe2nLMx4tKtFMJ9DnP2vI7uIuuGAP00Xy+foecSKapwjAIx4yzJrq11Lr W91w== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b="n4X3/MyC"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id r18-20020aa7d592000000b004acbdaf456fsi11997619edq.291.2023.02.19.03.32.13; Sun, 19 Feb 2023 03:32:35 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b="n4X3/MyC"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229940AbjBSJxj (ORCPT + 99 others); Sun, 19 Feb 2023 04:53:39 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50172 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230222AbjBSJxf (ORCPT ); Sun, 19 Feb 2023 04:53:35 -0500 Received: from mail-oo1-f73.google.com (mail-oo1-f73.google.com [209.85.161.73]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D0B022689 for ; Sun, 19 Feb 2023 01:53:30 -0800 (PST) Received: by mail-oo1-f73.google.com with SMTP id e22-20020a4a5516000000b0051faef7ba52so52529oob.11 for ; Sun, 19 Feb 2023 01:53:30 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=QEW8cpP+9cEKKFEdVgrXtSkmKtdo53fKw92TT63XLiE=; b=n4X3/MyCrrOq2mj67RyYv+wSgQgAcXUl5ZNGMxwZaz9J4PgnjVhdA1gdEU4c7rVlKF RD00qyZUg2Nu0n9I83joDDS11WMNhVImRYuGl74bUdlm/NTcCXRU+s9231VSqdViQ5y5 IOjhRW1LL0rd9D6VBrftyxEETIGZdePeZwPEUGiN5/i9rfa+KjDdpnTAf03uChXg7LNL HofGnGJrz33+hAH8qvx1150CJ4NN8RVbooFRVWHl3Pa2X5o7hZoSId7AYt1X7xwfP/qF /RVOFCjEigymUkIoiJhHOh3HzHiK5xYr4mOzg/3XkdKF3C1nQhiXmDqDiljssst6B9/K PH/A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=QEW8cpP+9cEKKFEdVgrXtSkmKtdo53fKw92TT63XLiE=; b=kgJJp+WcvgAl44DbCiNEl1dcgHkTGfw86fkMQ9ZSdvydvNixJvxAytpOo0F/zBNIgy D0uynyqWcIpcDhpzPuOCKnB74J+BozTZdDnnOspPcmoBmJOR25KVnkOnD7nko6TBb39J sUU0RzMBIdZbwsYtEdMyv324oGBYZ7B/3TCTe6hCjWrQv5Ro07ZTHIlFVXNF96my3MkP AJkM2nkkWEJ1XU8IMms2moBXnCnMfMXO/rrAaC69GRLGchueK/S8zIVwRg6SEsSp2gOm CtGtijeqyeucg0ePeOpQQYQoXCKxdx29/Ye14i/rSbCIe9G9ANwXE0TsPXqOQm74OFA3 7N1g== X-Gm-Message-State: AO0yUKUJoyVA7/XzTZE+steRHwHPSueR89738jfqBe+zGKwz94j/hfVU WMxjG2v5t7o4X5jhvaHrgzPeW1cN/NaX X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a05:6902:11cd:b0:8a3:d147:280b with SMTP id n13-20020a05690211cd00b008a3d147280bmr181444ybu.3.1676799294311; Sun, 19 Feb 2023 01:34:54 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:39 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-43-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 42/51] perf doc: Refresh topdown documentation From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_PASS, USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758258900990940261?= X-GMAIL-MSGID: =?utf-8?q?1758258900990940261?= perf stat now supports --topdown for any platform with the TopdownL1 metric group including Intel before Icelake. Tweak the documentation to reflect this. Signed-off-by: Ian Rogers --- tools/perf/Documentation/perf-stat.txt | 27 +++++----- tools/perf/Documentation/topdown.txt | 70 +++++++++++--------------- 2 files changed, 44 insertions(+), 53 deletions(-) diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt index 18abdc1dce05..29bdcfa93f04 100644 --- a/tools/perf/Documentation/perf-stat.txt +++ b/tools/perf/Documentation/perf-stat.txt @@ -394,10 +394,10 @@ See perf list output for the possible metrics and metricgroups. Do not aggregate counts across all monitored CPUs. --topdown:: -Print complete top-down metrics supported by the CPU. This allows to -determine bottle necks in the CPU pipeline for CPU bound workloads, -by breaking the cycles consumed down into frontend bound, backend bound, -bad speculation and retiring. +Print top-down metrics supported by the CPU. This allows to determine +bottle necks in the CPU pipeline for CPU bound workloads, by breaking +the cycles consumed down into frontend bound, backend bound, bad +speculation and retiring. Frontend bound means that the CPU cannot fetch and decode instructions fast enough. Backend bound means that computation or memory access is the bottle @@ -430,15 +430,18 @@ CPUs the workload runs on. If needed the CPUs can be forced using taskset. --td-level:: -Print the top-down statistics that equal to or lower than the input level. -It allows users to print the interested top-down metrics level instead of -the complete top-down metrics. +Print the top-down statistics that equal the input level. It allows +users to print the interested top-down metrics level instead of the +level 1 top-down metrics. + +As the higher levels gather more metrics and use more counters they +will be less accurate. By convention a metric can be examined by +appending '_group' to it and this will increase accuracy compared to +gathering all metrics for a level. For example, level 1 analysis may +highlight 'tma_frontend_bound'. This metric may be drilled into with +'tma_frontend_bound_group' with +'perf stat -M tma_frontend_bound_group...'. -The availability of the top-down metrics level depends on the hardware. For -example, Ice Lake only supports L1 top-down metrics. The Sapphire Rapids -supports both L1 and L2 top-down metrics. - -Default: 0 means the max level that the current hardware support. Error out if the input is higher than the supported max level. --no-merge:: diff --git a/tools/perf/Documentation/topdown.txt b/tools/perf/Documentation/topdown.txt index a15b93fdcf50..ae0aee86844f 100644 --- a/tools/perf/Documentation/topdown.txt +++ b/tools/perf/Documentation/topdown.txt @@ -1,46 +1,35 @@ -Using TopDown metrics in user space ------------------------------------ +Using TopDown metrics +--------------------- -Intel CPUs (since Sandy Bridge and Silvermont) support a TopDown -methodology to break down CPU pipeline execution into 4 bottlenecks: -frontend bound, backend bound, bad speculation, retiring. +TopDown metrics break apart performance bottlenecks. Starting at level +1 it is typical to get metrics on retiring, bad speculation, frontend +bound, and backend bound. Higher levels provide more detail in to the +level 1 bottlenecks, such as at level 2: core bound, memory bound, +heavy operations, light operations, branch mispredicts, machine +clears, fetch latency and fetch bandwidth. For more details see [1][2][3]. -For more details on Topdown see [1][5] +perf stat --topdown implements this using available metrics that vary +per architecture. -Traditionally this was implemented by events in generic counters -and specific formulas to compute the bottlenecks. - -perf stat --topdown implements this. - -Full Top Down includes more levels that can break down the -bottlenecks further. This is not directly implemented in perf, -but available in other tools that can run on top of perf, -such as toplev[2] or vtune[3] +% perf stat -a --topdown -I1000 +# time % tma_retiring % tma_backend_bound % tma_frontend_bound % tma_bad_speculation + 1.001141351 11.5 34.9 46.9 6.7 + 2.006141972 13.4 28.1 50.4 8.1 + 3.010162040 12.9 28.1 51.1 8.0 + 4.014009311 12.5 28.6 51.8 7.2 + 5.017838554 11.8 33.0 48.0 7.2 + 5.704818971 14.0 27.5 51.3 7.3 +... -New Topdown features in Ice Lake -=============================== +New Topdown features in Intel Ice Lake +====================================== With Ice Lake CPUs the TopDown metrics are directly available as fixed counters and do not require generic counters. This allows to collect TopDown always in addition to other events. -% perf stat -a --topdown -I1000 -# time retiring bad speculation frontend bound backend bound - 1.001281330 23.0% 15.3% 29.6% 32.1% - 2.003009005 5.0% 6.8% 46.6% 41.6% - 3.004646182 6.7% 6.7% 46.0% 40.6% - 4.006326375 5.0% 6.4% 47.6% 41.0% - 5.007991804 5.1% 6.3% 46.3% 42.3% - 6.009626773 6.2% 7.1% 47.3% 39.3% - 7.011296356 4.7% 6.7% 46.2% 42.4% - 8.012951831 4.7% 6.7% 47.5% 41.1% -... - -This also enables measuring TopDown per thread/process instead -of only per core. - -Using TopDown through RDPMC in applications on Ice Lake -====================================================== +Using TopDown through RDPMC in applications on Intel Ice Lake +============================================================= For more fine grained measurements it can be useful to access the new directly from user space. This is more complicated, @@ -301,8 +290,8 @@ This "opens" a new measurement period. A program using RDPMC for TopDown should schedule such a reset regularly, as in every few seconds. -Limits on Ice Lake -================== +Limits on Intel Ice Lake +======================== Four pseudo TopDown metric events are exposed for the end-users, topdown-retiring, topdown-bad-spec, topdown-fe-bound and topdown-be-bound. @@ -318,8 +307,8 @@ a sampling read group. Since the SLOTS event must be the leader of a TopDown group, the second event of the group is the sampling event. For example, perf record -e '{slots, $sampling_event, topdown-retiring}:S' -Extension on Sapphire Rapids Server -=================================== +Extension on Intel Sapphire Rapids Server +========================================= The metrics counter is extended to support TMA method level 2 metrics. The lower half of the register is the TMA level 1 metrics (legacy). The upper half is also divided into four 8-bit fields for the new level 2 @@ -338,7 +327,6 @@ other four level 2 metrics by subtracting corresponding metrics as below. [1] https://software.intel.com/en-us/top-down-microarchitecture-analysis-method-win -[2] https://github.com/andikleen/pmu-tools/wiki/toplev-manual -[3] https://software.intel.com/en-us/intel-vtune-amplifier-xe +[2] https://sites.google.com/site/analysismethods/yasin-pubs +[3] https://perf.wiki.kernel.org/index.php/Top-Down_Analysis [4] https://github.com/andikleen/pmu-tools/tree/master/jevents -[5] https://sites.google.com/site/analysismethods/yasin-pubs From patchwork Sun Feb 19 09:28:40 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59135 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp783048wrn; Sun, 19 Feb 2023 02:11:55 -0800 (PST) X-Google-Smtp-Source: AK7set+11hhO0Yc0fw1Qgp9vbnNAIckFOtflyIbY8diwF+Q4nNWskSmjTFWLKhUn1JOw4v9PvzGh X-Received: by 2002:a17:90b:3ece:b0:233:ebf8:424f with SMTP id rm14-20020a17090b3ece00b00233ebf8424fmr2708660pjb.0.1676801514993; Sun, 19 Feb 2023 02:11:54 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676801514; cv=none; d=google.com; s=arc-20160816; b=qXlfDOiFWBLpsqs4ZPs0pSt7FXp36KGBNMd3BPDZusrIMks2mbf74mpHtq0LAmVxFN b5EoLClXyjKIT7K8/TRNT8Pg7tDrzfKa3wHbavqRxOQ4Cry0lrvF7JmalJGzTOdTO4R4 qpIxNUVvzcW06IVYApdb22qHsTcDk6Ah/iqJ4r8TKWHnQDSTR/YfCqeIzniDETWSpp3j O8dBhYGmpZZddcXhGyP2R2xFOm/OknJR9ej60U0UG14/A5SxOWIqj3AnlETtpcpC8AHc clODxVQm0UPmyuPygtb70HD+WpTnTGgqEQ60jmV/Vd1DqKub4FYLxDOIstuMkJnchVPE BTfQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=83ufvBpHVtY5n0dRrsyH94JHmCN6bDj+Te8NRRpOnv0=; b=MPQmATRsjp6QcVLAPaKZcrzJwGSV2JUau7dQbRwrA7r9nJIVLsJaVKC99FrBoYHvmU Av1wLWoQTIEf7HqrcolH4ggAGMj6taQc6Hh3IQ0Stuoh50tyAaRq7Xhdpt2vQZ4F+GRY BzPgdRIB3sECp73KD/OPbbcHAv78s3iZliFoUPZLDJuQDVZfpWK5xX0p+Fu6/UWHKrI0 irLGt+9rIoEm7NQt0COhH7KDRRvp8drFjEFq7mT2lxIP2oUquwt/usngjYW05Xzi828T jJOlgx0q7PXnJ7H8SgWPiilyvuL3WyJhICyJj/RRCZ8HgrSA1PlwwnUT7wCuK4kSkQwl yj9A== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=lxWRJOz1; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id n2-20020a6546c2000000b00502b0909af3si2986324pgr.220.2023.02.19.02.11.39; Sun, 19 Feb 2023 02:11:54 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=lxWRJOz1; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229459AbjBSJpO (ORCPT + 99 others); Sun, 19 Feb 2023 04:45:14 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40136 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230013AbjBSJpN (ORCPT ); Sun, 19 Feb 2023 04:45:13 -0500 Received: from mail-vk1-xa4a.google.com (mail-vk1-xa4a.google.com [IPv6:2607:f8b0:4864:20::a4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6F943D529 for ; Sun, 19 Feb 2023 01:44:28 -0800 (PST) Received: by mail-vk1-xa4a.google.com with SMTP id x190-20020a1fe0c7000000b004015f66acfdso701940vkg.15 for ; Sun, 19 Feb 2023 01:44:28 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=83ufvBpHVtY5n0dRrsyH94JHmCN6bDj+Te8NRRpOnv0=; b=lxWRJOz1YmAnLVIpr+fpP4VDEQycUyWETLBz0W+CvZOIA6Kr/CoxJ2NsHEn41bEINl xdOo4dSeb6JldrmmHx8t98bCU9i7l7fSaOmRZyGpziCtv6bpIJCU9iN0JcYU+NjDa2O+ 9XTFi0YHaXrGx64qq4FOlIRjg9uoMS8Ya2jorXHuAwkk2xbnSK++ZlozuUPqdODuJFKs 6tKq8t5/yfeYhmVTyuQilkmzIWpSy5DP2e7xVmQeMyMJXwX8z5cUkxyl2CUbyclghtcG xaFP4bc1MAn1II2okkTepHrecIc0FvHgfZ29b/HmYpMob81nIptOTI3KNc6OhwmW2xc0 LwxA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=83ufvBpHVtY5n0dRrsyH94JHmCN6bDj+Te8NRRpOnv0=; b=p5LCHch300syRefupU6RWmSet3KLtIopzrZRCoKDSdjLrNmr/BotxR0Q1xd9HIdjop bCqRiE40fe+mYpp3Q3pVC2szAStq9cHVtHu1Yvf95+wZAx8fN4jCZ8k8KwhhpfYtOZOi 4nIGboX4mur8zX9/foswWpH8yIPlzZflOHws1PsBYaOOqwPPdbQV5CVLj4vxZjUcxpO3 I2MruZBUDwouM03DejZPPLeB4tDE80NVFJNeyV1bwwzX5tAgYEVU+QFWKztHfPH/lRFd qNwlGkHz2hDj1ckoWEi8PavvOTZGZJQqJ+SlhCzyfVqxcWsMDJy9dFxN3gBZG5m+iwL1 FD0w== X-Gm-Message-State: AO0yUKWRGDyBfQR6K46iP0l/erP2jHJJDvY8vdFLohxMyZ5sPqws3v/i 9c3oQ0AExUD2OdKLNKin4rZb7gb70RDk X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a05:6902:10c6:b0:855:fdcb:4467 with SMTP id w6-20020a05690210c600b00855fdcb4467mr151684ybu.0.1676799302521; Sun, 19 Feb 2023 01:35:02 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:40 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-44-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 43/51] perf stat: Remove hard coded transaction events From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758253824922520519?= X-GMAIL-MSGID: =?utf-8?q?1758253824922520519?= The metric group "transaction" is now present for Intel architectures so the legacy hard coded approach won't be used. Remove the associated logic. Signed-off-by: Ian Rogers --- tools/perf/builtin-stat.c | 59 ++++++----------------------------- tools/perf/util/stat-shadow.c | 48 +--------------------------- tools/perf/util/stat.c | 4 --- tools/perf/util/stat.h | 7 ----- 4 files changed, 11 insertions(+), 107 deletions(-) diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c index bdb1ef4fc6ad..e6b60b058257 100644 --- a/tools/perf/builtin-stat.c +++ b/tools/perf/builtin-stat.c @@ -100,30 +100,6 @@ static void print_counters(struct timespec *ts, int argc, const char **argv); -/* Default events used for perf stat -T */ -static const char *transaction_attrs = { - "task-clock," - "{" - "instructions," - "cycles," - "cpu/cycles-t/," - "cpu/tx-start/," - "cpu/el-start/," - "cpu/cycles-ct/" - "}" -}; - -/* More limited version when the CPU does not have all events. */ -static const char * transaction_limited_attrs = { - "task-clock," - "{" - "instructions," - "cycles," - "cpu/cycles-t/," - "cpu/tx-start/" - "}" -}; - static const char *smi_cost_attrs = { "{" "msr/aperf/," @@ -1811,37 +1787,22 @@ static int add_default_attributes(void) return 0; if (transaction_run) { - struct parse_events_error errinfo; /* Handle -T as -M transaction. Once platform specific metrics * support has been added to the json files, all architectures * will use this approach. To determine transaction support * on an architecture test for such a metric name. */ - if (metricgroup__has_metric("transaction")) { - return metricgroup__parse_groups(evsel_list, "transaction", - stat_config.metric_no_group, - stat_config.metric_no_merge, - stat_config.metric_no_threshold, - stat_config.user_requested_cpu_list, - stat_config.system_wide, - &stat_config.metric_events); - } - - parse_events_error__init(&errinfo); - if (pmu_have_event("cpu", "cycles-ct") && - pmu_have_event("cpu", "el-start")) - err = parse_events(evsel_list, transaction_attrs, - &errinfo); - else - err = parse_events(evsel_list, - transaction_limited_attrs, - &errinfo); - if (err) { - fprintf(stderr, "Cannot set up transaction events\n"); - parse_events_error__print(&errinfo, transaction_attrs); + if (!metricgroup__has_metric("transaction")) { + pr_err("Missing transaction metrics"); + return -1; } - parse_events_error__exit(&errinfo); - return err ? -1 : 0; + return metricgroup__parse_groups(evsel_list, "transaction", + stat_config.metric_no_group, + stat_config.metric_no_merge, + stat_config.metric_no_threshold, + stat_config.user_requested_cpu_list, + stat_config.system_wide, + &stat_config.metric_events); } if (smi_cost) { diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c index 5189756bf16d..3cfe4b4eb3de 100644 --- a/tools/perf/util/stat-shadow.c +++ b/tools/perf/util/stat-shadow.c @@ -235,12 +235,6 @@ void perf_stat__update_shadow_stats(struct evsel *counter, u64 count, update_runtime_stat(st, STAT_NSECS, map_idx, count_ns, &rsd); else if (evsel__match(counter, HARDWARE, HW_CPU_CYCLES)) update_runtime_stat(st, STAT_CYCLES, map_idx, count, &rsd); - else if (perf_stat_evsel__is(counter, CYCLES_IN_TX)) - update_runtime_stat(st, STAT_CYCLES_IN_TX, map_idx, count, &rsd); - else if (perf_stat_evsel__is(counter, TRANSACTION_START)) - update_runtime_stat(st, STAT_TRANSACTION, map_idx, count, &rsd); - else if (perf_stat_evsel__is(counter, ELISION_START)) - update_runtime_stat(st, STAT_ELISION, map_idx, count, &rsd); else if (evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_FRONTEND)) update_runtime_stat(st, STAT_STALLED_CYCLES_FRONT, map_idx, count, &rsd); @@ -695,7 +689,7 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, { void *ctxp = out->ctx; print_metric_t print_metric = out->print_metric; - double total, ratio = 0.0, total2; + double total, ratio = 0.0; struct runtime_stat_data rsd = { .ctx = evsel_context(evsel), .cgrp = evsel->cgrp, @@ -808,46 +802,6 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, } else { print_metric(config, ctxp, NULL, NULL, "Ghz", 0); } - } else if (perf_stat_evsel__is(evsel, CYCLES_IN_TX)) { - total = runtime_stat_avg(st, STAT_CYCLES, map_idx, &rsd); - - if (total) - print_metric(config, ctxp, NULL, - "%7.2f%%", "transactional cycles", - 100.0 * (avg / total)); - else - print_metric(config, ctxp, NULL, NULL, "transactional cycles", - 0); - } else if (perf_stat_evsel__is(evsel, CYCLES_IN_TX_CP)) { - total = runtime_stat_avg(st, STAT_CYCLES, map_idx, &rsd); - total2 = runtime_stat_avg(st, STAT_CYCLES_IN_TX, map_idx, &rsd); - - if (total2 < avg) - total2 = avg; - if (total) - print_metric(config, ctxp, NULL, "%7.2f%%", "aborted cycles", - 100.0 * ((total2-avg) / total)); - else - print_metric(config, ctxp, NULL, NULL, "aborted cycles", 0); - } else if (perf_stat_evsel__is(evsel, TRANSACTION_START)) { - total = runtime_stat_avg(st, STAT_CYCLES_IN_TX, map_idx, &rsd); - - if (avg) - ratio = total / avg; - - if (runtime_stat_n(st, STAT_CYCLES_IN_TX, map_idx, &rsd) != 0) - print_metric(config, ctxp, NULL, "%8.0f", - "cycles / transaction", ratio); - else - print_metric(config, ctxp, NULL, NULL, "cycles / transaction", - 0); - } else if (perf_stat_evsel__is(evsel, ELISION_START)) { - total = runtime_stat_avg(st, STAT_CYCLES_IN_TX, map_idx, &rsd); - - if (avg) - ratio = total / avg; - - print_metric(config, ctxp, NULL, "%8.0f", "cycles / elision", ratio); } else if (evsel__is_clock(evsel)) { if ((ratio = avg_stats(&walltime_nsecs_stats)) != 0) print_metric(config, ctxp, NULL, "%8.3f", "CPUs utilized", diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c index 0b8c91ca13cd..b5b18d457254 100644 --- a/tools/perf/util/stat.c +++ b/tools/perf/util/stat.c @@ -87,10 +87,6 @@ bool __perf_stat_evsel__is(struct evsel *evsel, enum perf_stat_evsel_id id) #define ID(id, name) [PERF_STAT_EVSEL_ID__##id] = #name static const char *id_str[PERF_STAT_EVSEL_ID__MAX] = { ID(NONE, x), - ID(CYCLES_IN_TX, cpu/cycles-t/), - ID(TRANSACTION_START, cpu/tx-start/), - ID(ELISION_START, cpu/el-start/), - ID(CYCLES_IN_TX_CP, cpu/cycles-ct/), ID(SMI_NUM, msr/smi/), ID(APERF, msr/aperf/), }; diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h index 42af350a96d9..c5fe847dd344 100644 --- a/tools/perf/util/stat.h +++ b/tools/perf/util/stat.h @@ -21,10 +21,6 @@ struct stats { enum perf_stat_evsel_id { PERF_STAT_EVSEL_ID__NONE = 0, - PERF_STAT_EVSEL_ID__CYCLES_IN_TX, - PERF_STAT_EVSEL_ID__TRANSACTION_START, - PERF_STAT_EVSEL_ID__ELISION_START, - PERF_STAT_EVSEL_ID__CYCLES_IN_TX_CP, PERF_STAT_EVSEL_ID__SMI_NUM, PERF_STAT_EVSEL_ID__APERF, PERF_STAT_EVSEL_ID__MAX, @@ -92,9 +88,6 @@ enum stat_type { STAT_LL_CACHE, STAT_ITLB_CACHE, STAT_DTLB_CACHE, - STAT_CYCLES_IN_TX, - STAT_TRANSACTION, - STAT_ELISION, STAT_SMI_NUM, STAT_APERF, STAT_MAX From patchwork Sun Feb 19 09:28:41 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59141 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp811378wrn; Sun, 19 Feb 2023 03:48:51 -0800 (PST) X-Google-Smtp-Source: AK7set/tzKBBm0bQFda3loDZU5SGK7yFvptTuLOdcMSAcwgpxuR/xFsCzph3HrA2VEJ2fpuRrwsg X-Received: by 2002:a17:90a:1951:b0:234:b6f5:7de0 with SMTP id 17-20020a17090a195100b00234b6f57de0mr1414239pjh.46.1676807331391; Sun, 19 Feb 2023 03:48:51 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676807331; cv=none; d=google.com; s=arc-20160816; b=L4/KQILSEKk5BB9AA0AEGCXBp0OorfV5vS4WRfIWc/O+NqdeCrmRzP3CeaDtcTzmaY RJuJS0v3X1kMwvZpAj4kQphEe1WF+g/tFe4rNdWUPGiLOL0ctVhcSlT0jAaHs0dxw2/j +kPKR+ccevCYNnUcdxQyc/O4lIP3IYiOg0yYscEccDX2HwJH69Y0tTFTZW7Y5LQ2XJ1Z pI75ZTfnd6ObE5SsR3dvF1XOq29MTuEN5rLkisSvVolSXuG0J7SupO/NRfjWFreEcat5 0lsahhc6RLR7hC1wpzDfGzuYtEb7nMbHcfKzyNVcWGmSwr1MWmDH64bOwnN7XpEJDjKp +lww== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=IR4cGxiP4ebyr+2RzCBvW6JFNP0fEMbIUpwuvhKV1BY=; b=gTeKoH+UMbxmT8X2LtbmwGpxCUmBNbLIW2p6/UOoXZ/uQPrOescdhZ7rEyae4Jlvl1 rgDpTtyRg8T8jwnU8pfZtfYlOgMiNxzWp0YfTCBLgS00lZ5MtFfqAHfAgK0SuOtl77Ad qR33+GHHW+dwaLZy8T2vFi+eXuSU9u5alqNhmmvt/NTYmWBW0UhhmAZqLyJfmdreFBth +TSNcqFUsxtzie3KUjV0TfBrL8mCcHe7Q8f/9pfWAAnfjIkKlHSP3f+Q/3lhOxk0q7X/ +OPe0vc9DL1zJ5gj3Y5rd2jv4zNaQh0Yp0fqW2lKjEZzRCAmhgAT8empBxEz2PGZwLuz qWsA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=NUcPy4ZH; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id qe10-20020a17090b4f8a00b0021918bc9a47si12229095pjb.174.2023.02.19.03.48.38; Sun, 19 Feb 2023 03:48:51 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=NUcPy4ZH; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229921AbjBSKC2 (ORCPT + 99 others); Sun, 19 Feb 2023 05:02:28 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55656 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229506AbjBSKCZ (ORCPT ); Sun, 19 Feb 2023 05:02:25 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 69B2DF742 for ; Sun, 19 Feb 2023 02:02:24 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id 5-20020a250105000000b009802c10698cso908929ybb.22 for ; Sun, 19 Feb 2023 02:02:24 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=IR4cGxiP4ebyr+2RzCBvW6JFNP0fEMbIUpwuvhKV1BY=; b=NUcPy4ZHCiZDAbpqYYOc5JdSudhcF9l9kvtbJC+/xwt6fI8QeNC4WkMTcvQQ6804AT YF/xZ2bSpjDIgQdlfWHY/eHe3WKnEpPAPIMT7q5iIrd9Sn14jdygINYb7VGjPcgh8nCF trMCEbNSzgK3iAHSpUdlMuCCMauBc/SxT+NSFCPELU6vvdlM8J62FYawyLz3LzLGC5KK oUyaye5uBSEFDs6oNM8SAhzfix1YWJzevbaptGaf5BqcNtP47rHHphmVSl3Fb5Vw1PHF +L9HwzS3oieL/2/SVOgkIe6RBEYtywm/BAF3tZcvH8plHZ82czV8C5Pc6C6EUryLjpM7 YatA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=IR4cGxiP4ebyr+2RzCBvW6JFNP0fEMbIUpwuvhKV1BY=; b=2GO/CdBNqpXjGmj7YRtuntxZACg99/qMR4mrFY+gxJyImQYB0PJvVAmI/F/5Qjg7fg FtipdiD03aJiHSzivjtX/5w1B5V4vWpgV35U0zzRQoCt1RGgdwp3O00tBLyVZ6lumuwn EtI3osOjQIXYdOuZIVLbGvlbEaUCCPvY6g6pnfl+XTQBf+xaUpqEaJTUIWtqLYKKU6oG f8qkTB5KDL8qu2CMYzWtDfSu2MxKu3uc9kdX9EVSFNhZbVx8QqT3ppkpo5RzG4r/56aJ XcKv6kptESkOu7fdXNh4gr14nPTRI1PGB3+CFFklMCv38V+waveoSBsN1H0Ou6yf84oh WVgA== X-Gm-Message-State: AO0yUKW92pBToX8Oy9+alsyDfZaoUAz6FQLGdyPeVswMIrGMlQh4mUgX eOBFilp6M7qeqwOtcrkSpb9TSagMyXY+ X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a5b:4c8:0:b0:8c2:3263:da34 with SMTP id u8-20020a5b04c8000000b008c23263da34mr1547079ybp.209.1676799311364; Sun, 19 Feb 2023 01:35:11 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:41 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-45-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 44/51] perf stat: Use metrics for --smi-cost From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758259924761661883?= X-GMAIL-MSGID: =?utf-8?q?1758259924761661883?= Rather than parsing events for --smi-cost, use the json metric group 'smi'. Signed-off-by: Ian Rogers --- tools/perf/builtin-stat.c | 34 +++++++++++----------------------- tools/perf/util/stat-shadow.c | 30 ------------------------------ tools/perf/util/stat.c | 2 -- tools/perf/util/stat.h | 4 ---- 4 files changed, 11 insertions(+), 59 deletions(-) diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c index e6b60b058257..9c1fbf154ee3 100644 --- a/tools/perf/builtin-stat.c +++ b/tools/perf/builtin-stat.c @@ -100,14 +100,6 @@ static void print_counters(struct timespec *ts, int argc, const char **argv); -static const char *smi_cost_attrs = { - "{" - "msr/aperf/," - "msr/smi/," - "cycles" - "}" -}; - static struct evlist *evsel_list; static bool all_counters_use_bpf = true; @@ -1666,7 +1658,6 @@ static int perf_stat_init_aggr_mode_file(struct perf_stat *st) */ static int add_default_attributes(void) { - int err; struct perf_event_attr default_attrs0[] = { { .type = PERF_TYPE_SOFTWARE, .config = PERF_COUNT_SW_TASK_CLOCK }, @@ -1806,11 +1797,10 @@ static int add_default_attributes(void) } if (smi_cost) { - struct parse_events_error errinfo; int smi; if (sysfs__read_int(FREEZE_ON_SMI_PATH, &smi) < 0) { - fprintf(stderr, "freeze_on_smi is not supported.\n"); + pr_err("freeze_on_smi is not supported."); return -1; } @@ -1822,23 +1812,21 @@ static int add_default_attributes(void) smi_reset = true; } - if (!pmu_have_event("msr", "aperf") || - !pmu_have_event("msr", "smi")) { - fprintf(stderr, "To measure SMI cost, it needs " - "msr/aperf/, msr/smi/ and cpu/cycles/ support\n"); + if (!metricgroup__has_metric("smi")) { + pr_err("Missing smi metrics"); return -1; } + if (!force_metric_only) stat_config.metric_only = true; - parse_events_error__init(&errinfo); - err = parse_events(evsel_list, smi_cost_attrs, &errinfo); - if (err) { - parse_events_error__print(&errinfo, smi_cost_attrs); - fprintf(stderr, "Cannot set up SMI cost events\n"); - } - parse_events_error__exit(&errinfo); - return err ? -1 : 0; + return metricgroup__parse_groups(evsel_list, "smi", + stat_config.metric_no_group, + stat_config.metric_no_merge, + stat_config.metric_no_threshold, + stat_config.user_requested_cpu_list, + stat_config.system_wide, + &stat_config.metric_events); } if (topdown_run) { diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c index 3cfe4b4eb3de..d14fa531ee27 100644 --- a/tools/perf/util/stat-shadow.c +++ b/tools/perf/util/stat-shadow.c @@ -255,10 +255,6 @@ void perf_stat__update_shadow_stats(struct evsel *counter, u64 count, update_runtime_stat(st, STAT_DTLB_CACHE, map_idx, count, &rsd); else if (evsel__match(counter, HW_CACHE, HW_CACHE_ITLB)) update_runtime_stat(st, STAT_ITLB_CACHE, map_idx, count, &rsd); - else if (perf_stat_evsel__is(counter, SMI_NUM)) - update_runtime_stat(st, STAT_SMI_NUM, map_idx, count, &rsd); - else if (perf_stat_evsel__is(counter, APERF)) - update_runtime_stat(st, STAT_APERF, map_idx, count, &rsd); if (counter->collect_stat) { v = saved_value_lookup(counter, map_idx, true, STAT_NONE, 0, st, @@ -479,30 +475,6 @@ static void print_ll_cache_misses(struct perf_stat_config *config, out->print_metric(config, out->ctx, color, "%7.2f%%", "of all LL-cache accesses", ratio); } -static void print_smi_cost(struct perf_stat_config *config, int map_idx, - struct perf_stat_output_ctx *out, - struct runtime_stat *st, - struct runtime_stat_data *rsd) -{ - double smi_num, aperf, cycles, cost = 0.0; - const char *color = NULL; - - smi_num = runtime_stat_avg(st, STAT_SMI_NUM, map_idx, rsd); - aperf = runtime_stat_avg(st, STAT_APERF, map_idx, rsd); - cycles = runtime_stat_avg(st, STAT_CYCLES, map_idx, rsd); - - if ((cycles == 0) || (aperf == 0)) - return; - - if (smi_num) - cost = (aperf - cycles) / aperf * 100.00; - - if (cost > 10) - color = PERF_COLOR_RED; - out->print_metric(config, out->ctx, color, "%8.1f%%", "SMI cycles%", cost); - out->print_metric(config, out->ctx, NULL, "%4.0f", "SMI#", smi_num); -} - static int prepare_metric(struct evsel **metric_events, struct metric_ref *metric_refs, struct expr_parse_ctx *pctx, @@ -819,8 +791,6 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, if (unit != ' ') snprintf(unit_buf, sizeof(unit_buf), "%c/sec", unit); print_metric(config, ctxp, NULL, "%8.3f", unit_buf, ratio); - } else if (perf_stat_evsel__is(evsel, SMI_NUM)) { - print_smi_cost(config, map_idx, out, st, &rsd); } else { num = 0; } diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c index b5b18d457254..d51d7457f12d 100644 --- a/tools/perf/util/stat.c +++ b/tools/perf/util/stat.c @@ -87,8 +87,6 @@ bool __perf_stat_evsel__is(struct evsel *evsel, enum perf_stat_evsel_id id) #define ID(id, name) [PERF_STAT_EVSEL_ID__##id] = #name static const char *id_str[PERF_STAT_EVSEL_ID__MAX] = { ID(NONE, x), - ID(SMI_NUM, msr/smi/), - ID(APERF, msr/aperf/), }; #undef ID diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h index c5fe847dd344..9af4af3bc3f2 100644 --- a/tools/perf/util/stat.h +++ b/tools/perf/util/stat.h @@ -21,8 +21,6 @@ struct stats { enum perf_stat_evsel_id { PERF_STAT_EVSEL_ID__NONE = 0, - PERF_STAT_EVSEL_ID__SMI_NUM, - PERF_STAT_EVSEL_ID__APERF, PERF_STAT_EVSEL_ID__MAX, }; @@ -88,8 +86,6 @@ enum stat_type { STAT_LL_CACHE, STAT_ITLB_CACHE, STAT_DTLB_CACHE, - STAT_SMI_NUM, - STAT_APERF, STAT_MAX }; From patchwork Sun Feb 19 09:28:42 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59139 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp808211wrn; Sun, 19 Feb 2023 03:38:11 -0800 (PST) X-Google-Smtp-Source: AK7set/7DtvkfspKTKNrs+GOuaE1DHmztEpIysngrey+StN/rc6vnD/5YoDS9p/bY5ixWtp2B8P6 X-Received: by 2002:aa7:dbd3:0:b0:4ab:f6c:1a47 with SMTP id v19-20020aa7dbd3000000b004ab0f6c1a47mr2624626edt.31.1676806691071; Sun, 19 Feb 2023 03:38:11 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676806691; cv=none; d=google.com; s=arc-20160816; b=qZJwh4FNtqwLQpr+bL/hWV0U2z5tBQjBf7ojqBy42djPH0iCmO5hPG/t2xR1VOZH40 HZRllhHIyr6EASUqDjczLT873ZmSw3eIJ+I81LlAcmQjdeBOF/52nE+4Pn6C0wtiNRfL xpwVteusaxyfTV6OyYdV4nDbc5mNz5ylvhF3I3enZsAGqUvOWfsdw0otiqaK4craDtRG tTMWn3h9Gb2n7VIsQArZF1RvUDSUMwt7GCGggfmBTnUxSXlL+DIHQ8G+JQ7UgYMUrQ0O vqEe+K6FZ4j2IokRJ3YMYndhSyX8s4m15FIdOaZZGUt+Gs/nJmysjbCxq7F+XqBFpenx v7vg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=V1c75G2QRp2XfNcd0fyJt7z2QBiUcClmUEJnZutpl3U=; b=wbI7Q4IxTwwpillqFb4sA9205AQugvyoilS7YIJ6Emk2J5zvGxRM5WhR9fTcssSxv4 6pNKhWbeEs6mo7DJCJbuv7OSF+6ZSjIADd9ANGE/wLxjOgHXfiKNVpXzPfNgbMMe0bPp hMo7vmNOglGp+CVrTquh4FSNrFR3olgfk1P4fQTyD3O5v7FKXAR7+9vxGelZXdIWZHlN z50uqPiwaqZ9CZ1Tlmxkl62IW9pNKijiYlCLZNZJUH8sEcC6D3HsnD7xVeYljE6H8eC6 taTvXvJyKuwRSQxNYes4OdDaCIuJcN9g7IgbuWYg7eaWkbiuYbaBzgqL9ocEe3DdMrVe cbEg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=lQc7DLqj; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id j23-20020aa7c417000000b004ab4411217csi12079199edq.394.2023.02.19.03.37.47; Sun, 19 Feb 2023 03:38:11 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=lQc7DLqj; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230137AbjBSJqw (ORCPT + 99 others); Sun, 19 Feb 2023 04:46:52 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41710 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230125AbjBSJqu (ORCPT ); Sun, 19 Feb 2023 04:46:50 -0500 Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com [IPv6:2607:f8b0:4864:20::1149]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8CFA02117 for ; Sun, 19 Feb 2023 01:46:10 -0800 (PST) Received: by mail-yw1-x1149.google.com with SMTP id 00721157ae682-5368a7c71e3so14518577b3.17 for ; Sun, 19 Feb 2023 01:46:10 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=V1c75G2QRp2XfNcd0fyJt7z2QBiUcClmUEJnZutpl3U=; b=lQc7DLqjY+G0h68iUyNnQt0s8JbFjdLvt6vUD0X3O7LQXsct1au6ugX7YKcPBS3UPJ AsAznm2proFWTuMGysZno/FG9nfkQj2SekacMWMc3JcdaBlyiBSXOGz0lhlEgLdgGAxp WDR9gSMRJuJm8sfYf4Js7MhM80qTW0YRs+7lkwr6P601fxAf4quXfIGXkK94Dp9yf7NJ VG/TQ0DMh2RX36SsfyliyUni4YrostDu/WgZ0dbrvUcDxYX103Xt5iUnMSJxA78X1q/J xUWr1XsXdm9DAOmntwU2JGUE0TJhIpozeTXHwgCbdlqWpR3m96Aghhr+kQKjI1/QS/RH /euQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=V1c75G2QRp2XfNcd0fyJt7z2QBiUcClmUEJnZutpl3U=; b=llgK7Nw4Pphw43ybly1oN1M1N4Cb55k/5qGhN4zZUchu+hwOLHYvWK58jAil6yKJVm DW5IqOIRn6FwRgnU28k8KEQ5xNQlHXHx8vHo3rvaDWYWrgz5WOEHcsW75wdyoH8xTKux u8tyme7y2TMMjZmok5fTlOxrKJlXdo3za7FfyrFRh4oqzc4EtIITxUJEInK103BHijSF f/CnsLy7ieAvOJE/T9yvB+th76RrDRJtyOUijZgsRgt14lqiFgP2myJAz9CBe9XNjnXh 1j7tZUEq95n3IprW+ut+IUkS0KcOXJZRyZmUHqhLSHdnssNo8mQQl6IaD9HE0QSWnszF IyfQ== X-Gm-Message-State: AO0yUKVeuL3hHObA3WAm0zg0UhE8Y3V/HIutTME3mrUk1ozs4LZlxWJI khpbRslGiDoa06I0PIUoOlrFo2FEkEPb X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a0d:d6c1:0:b0:514:dae0:21ef with SMTP id y184-20020a0dd6c1000000b00514dae021efmr2065215ywd.133.1676799319684; Sun, 19 Feb 2023 01:35:19 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:42 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-46-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 45/51] perf stat: Remove perf_stat_evsel_id From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758259253004055451?= X-GMAIL-MSGID: =?utf-8?q?1758259253004055451?= perf_stat_evsel_id was used for hard coded metrics. These have now migrated to json metrics and so the id values are no longer necessary. Signed-off-by: Ian Rogers --- tools/perf/util/stat.c | 31 ------------------------------- tools/perf/util/stat.h | 12 ------------ 2 files changed, 43 deletions(-) diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c index d51d7457f12d..8d83d2f4a082 100644 --- a/tools/perf/util/stat.c +++ b/tools/perf/util/stat.c @@ -77,36 +77,6 @@ double rel_stddev_stats(double stddev, double avg) return pct; } -bool __perf_stat_evsel__is(struct evsel *evsel, enum perf_stat_evsel_id id) -{ - struct perf_stat_evsel *ps = evsel->stats; - - return ps->id == id; -} - -#define ID(id, name) [PERF_STAT_EVSEL_ID__##id] = #name -static const char *id_str[PERF_STAT_EVSEL_ID__MAX] = { - ID(NONE, x), -}; -#undef ID - -static void perf_stat_evsel_id_init(struct evsel *evsel) -{ - struct perf_stat_evsel *ps = evsel->stats; - int i; - - /* ps->id is 0 hence PERF_STAT_EVSEL_ID__NONE by default */ - - for (i = 0; i < PERF_STAT_EVSEL_ID__MAX; i++) { - if (!strcmp(evsel__name(evsel), id_str[i]) || - (strstr(evsel__name(evsel), id_str[i]) && evsel->pmu_name - && strstr(evsel__name(evsel), evsel->pmu_name))) { - ps->id = i; - break; - } - } -} - static void evsel__reset_aggr_stats(struct evsel *evsel) { struct perf_stat_evsel *ps = evsel->stats; @@ -166,7 +136,6 @@ static int evsel__alloc_stat_priv(struct evsel *evsel, int nr_aggr) return -ENOMEM; } - perf_stat_evsel_id_init(evsel); evsel__reset_stat_priv(evsel); return 0; } diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h index 9af4af3bc3f2..df6068a3f7bb 100644 --- a/tools/perf/util/stat.h +++ b/tools/perf/util/stat.h @@ -19,11 +19,6 @@ struct stats { u64 max, min; }; -enum perf_stat_evsel_id { - PERF_STAT_EVSEL_ID__NONE = 0, - PERF_STAT_EVSEL_ID__MAX, -}; - /* hold aggregated event info */ struct perf_stat_aggr { /* aggregated values */ @@ -40,8 +35,6 @@ struct perf_stat_aggr { struct perf_stat_evsel { /* used for repeated runs */ struct stats res_stats; - /* evsel id for quick check */ - enum perf_stat_evsel_id id; /* number of allocated 'aggr' */ int nr_aggr; /* aggregated event values */ @@ -187,11 +180,6 @@ static inline void update_rusage_stats(struct rusage_stats *ru_stats, struct rus struct evsel; struct evlist; -bool __perf_stat_evsel__is(struct evsel *evsel, enum perf_stat_evsel_id id); - -#define perf_stat_evsel__is(evsel, id) \ - __perf_stat_evsel__is(evsel, PERF_STAT_EVSEL_ID__ ## id) - extern struct runtime_stat rt_stat; extern struct stats walltime_nsecs_stats; extern struct rusage_stats ru_stats; From patchwork Sun Feb 19 09:28:43 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59126 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp776177wrn; Sun, 19 Feb 2023 01:47:38 -0800 (PST) X-Google-Smtp-Source: AK7set+NeTcSs8oi2rnje9Hnw3CvuOJO7LsT++JVz25aMtdVB8MiZHES5YYzmo02RVtwN1/Ioj2R X-Received: by 2002:a17:902:e745:b0:199:1d6f:3cab with SMTP id p5-20020a170902e74500b001991d6f3cabmr3212587plf.21.1676800058290; Sun, 19 Feb 2023 01:47:38 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676800058; cv=none; d=google.com; s=arc-20160816; b=kIMSTvSieA0lCmUU1FJY6ONr6H2zRxBSgthT8IcFQ1W79WKHpmnul3356qD8kl/LXJ 0XrbNm+rReFmO7JaV67QQMrokujH3GJLAmCRJZGy+4u2yqz53o8iY2YHRAcujyvFcBpv pggm3Arwu+7cdA90W1gBNqhvdvlgtyDAL6eD9tL5o0UZfLfFmEUXbAyq+/C1lZzoAraS FH0RJRaB2nvSdTvY7peNsn2qxWHPGXveQhXrEQnbtbYxTTqjOWYVhWd1ZQI3b2QZpyQE iPDMSZQa6jyks3ufZr7ZEXq371M7+zNQq7EBt/JUCQmL63rbYV6IoKbJ5uslMriqg6+k RCsA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=6ApboYwC5HDpqa4KgO9vE7lItPOtxh0INRVt9hDTELA=; b=QHkSSnwRulMlqXYHGhKmGWzZmWZ9XO/EcUE9GiTNlybCG4m69jptR8fVYgdgcSSrmP 9SM5m4cLvw+nfx+WWi3XRmazD7Le5vJRCzZNprDSV4wqacLlbUS21Vi/X5B1CCO9kBf1 xdsvkQ1XYboRH53yCnefjgLOsHjo++0MRQ1xEow+suvF544wd4ISNa0pMUAQ568Q21+j ixwq1DKCwq4wKFyuN8MtqYUT3lvFY3i4uY+2EHPFmxWgItY9ILua5eJ07IA3kmig2lyV ZGFh1+gGHJrXLI5FLh6CYBb6vQYja3uGyxb1SsdeOt5tMBHzfbitpUdDqD2n0axb8qNM 7+ZA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=eKrS9YbE; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id l10-20020a170902f68a00b0019a826d304esi6645112plg.634.2023.02.19.01.47.25; Sun, 19 Feb 2023 01:47:38 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=eKrS9YbE; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230055AbjBSJi3 (ORCPT + 99 others); Sun, 19 Feb 2023 04:38:29 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33126 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230051AbjBSJi1 (ORCPT ); Sun, 19 Feb 2023 04:38:27 -0500 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8969740D6 for ; Sun, 19 Feb 2023 01:37:30 -0800 (PST) Received: by mail-yb1-xb4a.google.com with SMTP id h204-20020a256cd5000000b00953ffdfbe1aso2267589ybc.23 for ; Sun, 19 Feb 2023 01:37:30 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=6ApboYwC5HDpqa4KgO9vE7lItPOtxh0INRVt9hDTELA=; b=eKrS9YbE5MvFZdViv4py+wujaC6AoA78JFYCdWX32lF9PoiXUgE3idq0OBKkFhvJ8L 8iy6lGZX2pkmRRG5L9VoyF4JMYhESKqhxg1bQRw3cbXu9cp0CTVAsnfEx7iX0m8WcHmJ 0khwL4pKRPFny0ZCfaK1RnTWEWhb3jRklhCcxexJ/lPXaF39kJkIOOWW9PvDQfhH3qw8 i6SXakOmGvzYKxm+6U79/2sZgCBcFrdbqgmoGs3Zp43La0bVDCky0ubQmF3HRG8l/moZ Or4Qbi5jk1XP9gTTmxi3oHJe1J1/5t4dUvhGzSs/SXr7RmRzmHYiWa0KAaYnoxYzSygO fbNw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=6ApboYwC5HDpqa4KgO9vE7lItPOtxh0INRVt9hDTELA=; b=BNixwsjfXHEcPz1iFVz3cp/TAnJUNz1d+zWvKUnI6oHklHpTWfaFA9OOYbWfH5Rjmw VR0AZtoeu0bka7P2ML8Af6IhMWCtp9EXaxkPzwxTJIOM98NHrsVZHAeE2jE//gAHNoPF yPHPXqgpST8mqguYQcTneS8vpcq74Oox2DaMkZe942jlBDPu6HDIQ+fAWAcm5y6mtSGH K1roptYPssbUCUYAAL+mCm30z6f1+VzXSQjOfeVffVmbvPwf6nMg8ZaApOZxsW1nxd7T Z4sa26aUae4N9RRx7MnExm76mMvkae1SOyArXWfIIBU3mstq1AXE5W70XXLnLsvKCOKL Z6Sg== X-Gm-Message-State: AO0yUKVMNWPce3WfhRy1hBY0ScrRrlakv/QggtAv2z/eVxZNGFWq0CYA qo01AvL84nX/CFjfIrKDO54PTLJxvRJ8 X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a81:e94c:0:b0:507:68eb:1b20 with SMTP id e12-20020a81e94c000000b0050768eb1b20mr311210ywm.236.1676799327857; Sun, 19 Feb 2023 01:35:27 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:43 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-47-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 46/51] perf stat: Move enums from header From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758252297689417789?= X-GMAIL-MSGID: =?utf-8?q?1758252297689417789?= The enums are only used in stat-shadow.c, so narrow their scope by moving to the C file. Signed-off-by: Ian Rogers --- tools/perf/util/stat-shadow.c | 25 +++++++++++++++++++++++++ tools/perf/util/stat.h | 27 --------------------------- 2 files changed, 25 insertions(+), 27 deletions(-) diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c index d14fa531ee27..fc948a7e83b7 100644 --- a/tools/perf/util/stat-shadow.c +++ b/tools/perf/util/stat-shadow.c @@ -29,6 +29,31 @@ struct runtime_stat rt_stat; struct stats walltime_nsecs_stats; struct rusage_stats ru_stats; +enum { + CTX_BIT_USER = 1 << 0, + CTX_BIT_KERNEL = 1 << 1, + CTX_BIT_HV = 1 << 2, + CTX_BIT_HOST = 1 << 3, + CTX_BIT_IDLE = 1 << 4, + CTX_BIT_MAX = 1 << 5, +}; + +enum stat_type { + STAT_NONE = 0, + STAT_NSECS, + STAT_CYCLES, + STAT_STALLED_CYCLES_FRONT, + STAT_STALLED_CYCLES_BACK, + STAT_BRANCHES, + STAT_CACHEREFS, + STAT_L1_DCACHE, + STAT_L1_ICACHE, + STAT_LL_CACHE, + STAT_ITLB_CACHE, + STAT_DTLB_CACHE, + STAT_MAX +}; + struct saved_value { struct rb_node rb_node; struct evsel *evsel; diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h index df6068a3f7bb..215c0f5c4db7 100644 --- a/tools/perf/util/stat.h +++ b/tools/perf/util/stat.h @@ -55,33 +55,6 @@ enum aggr_mode { AGGR_MAX }; -enum { - CTX_BIT_USER = 1 << 0, - CTX_BIT_KERNEL = 1 << 1, - CTX_BIT_HV = 1 << 2, - CTX_BIT_HOST = 1 << 3, - CTX_BIT_IDLE = 1 << 4, - CTX_BIT_MAX = 1 << 5, -}; - -#define NUM_CTX CTX_BIT_MAX - -enum stat_type { - STAT_NONE = 0, - STAT_NSECS, - STAT_CYCLES, - STAT_STALLED_CYCLES_FRONT, - STAT_STALLED_CYCLES_BACK, - STAT_BRANCHES, - STAT_CACHEREFS, - STAT_L1_DCACHE, - STAT_L1_ICACHE, - STAT_LL_CACHE, - STAT_ITLB_CACHE, - STAT_DTLB_CACHE, - STAT_MAX -}; - struct runtime_stat { struct rblist value_list; }; From patchwork Sun Feb 19 09:28:44 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59125 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp776038wrn; Sun, 19 Feb 2023 01:47:06 -0800 (PST) X-Google-Smtp-Source: AK7set+SkNb0qKImEUIcZttnYKnulFAtbjphS5ny4bPNOtEAMvT7z4qHq3cqHPY0ypOILlEMFLPA X-Received: by 2002:a17:902:f551:b0:19a:8304:21eb with SMTP id h17-20020a170902f55100b0019a830421ebmr3517243plf.6.1676800026570; Sun, 19 Feb 2023 01:47:06 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676800026; cv=none; d=google.com; s=arc-20160816; b=hiXedJ2ypAhnIcZJKbZjR15+j2jrzh22vEViSybJG3z7IGgHW8A5c00/IgaYPzZOan QEFzdleMU36qYnQBlOs6gVtzvPhSslBYG5ZFb7ogp5bJy4nhB8CMqRNUdA15oGxalpp+ ryrfoqU4j3Lo0KdEiRlwSqh2zmVnFXZ40Ubf8XINfMhJbLc9veAYfRtaDeJPGPv4uKaV 4OqHWO9SUaIsnj4tcgsybZzx/8881+yVnfGG40vyJ7fcFSeGk/2BRUsAlPhPZGh5i1Uk 25NUpwZXSto2PoyYQP0cVkUSwTTcww1YDCPMzzr8mDuztvyXY6VC0OcngqFsKoS9m/l2 9o9A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=W35JnUyB9F6OBcA0PUDeFeV51mTDn0RBQpIJeigYK+U=; b=ZECtcnBh4bW23PyNkOVpGPi/MdtN/LIIIkd8IiUxlpYoqYHFFKmsAUx0jc6e0f5RnB kBte3VBkn3EWntyOvfZgvxUVnMOKjhDD9T9sN8JuNq/5vEinclaSdTzaDrhXkENG+Kxv ftcLh35GORg18LoDzWY/IEqixO/CU/zzZ2zKO/hUqN9BG9DbU7Tl7yZ79Hi3PP7gITiN 4RVzHYY+FYtYrGmjc3wjbkscpb1g0v4oGNLkfDFLScUsHmV6cpywKAd7wH4q1K2kmOs7 WAI3XqRsdszNk0SbCXEx4x2KYjfyI/iVtQGY9P1J49ROmMylIYY4pxWetWWjsRtm9kv4 QgwA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=CmrQbZno; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id q16-20020a170902dad000b0019107d41d49si6796945plx.36.2023.02.19.01.46.53; Sun, 19 Feb 2023 01:47:06 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=CmrQbZno; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229947AbjBSJiP (ORCPT + 99 others); Sun, 19 Feb 2023 04:38:15 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:32812 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230013AbjBSJiL (ORCPT ); Sun, 19 Feb 2023 04:38:11 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 00D9012BE1 for ; Sun, 19 Feb 2023 01:37:15 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id o14-20020a25730e000000b009419f64f6afso2226981ybc.2 for ; Sun, 19 Feb 2023 01:37:15 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=W35JnUyB9F6OBcA0PUDeFeV51mTDn0RBQpIJeigYK+U=; b=CmrQbZnohetBBAm9AiARnIjs80kPdkPczowAQqWob0gZ6l+Q70jedRKRcgTRMb6G8g qqMY0EQyJw3Wdgz1zX44m5fF3RbqjAwLQ8Mo0f9BU2Y5tH3Gls7Zi6Q2VpaFmzbrmXvF rc2hfAh8O6ceIE+qMSb0k0OQgqBDHHxKjKvpWipx9nIy65oL1Innd1aqUQRbD4FZyiYj WjYBvrf5VpkaiQmHJXnnYZt8HcDNBshmDfHqUcQILbHwDomIGn7WkF2vAkx4Zzm1hfFp eL6kz2f6MqFwiqIpuWGBPltCloj1hmjmiMxkXwiVnXR0xbusmve7xv4RhJaxWkN+cwbv wyuQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=W35JnUyB9F6OBcA0PUDeFeV51mTDn0RBQpIJeigYK+U=; b=MUlX4g+8Y9ubpoxJMwxa1Fb7HQY+84M6GJT/GSErpr+j00VxyXSaKcAVp41cxQP5ze uec2sbhpAYoLwCpKe8Q6ebYVnApWYg/8Q5iW/HHa4x07wwL3YgDVVeXR+C8WClemQvuO ojDBmHqmFVeDMezKJa6kc9/QrvD1XwAFmjSYVA2f6r4m/CIQ0ONlMKDi5grgBOma6pRl iWHTMLJN2z7Dw8M1uU2tdzIHF6u7sIsYoxFanFlDJ2WVgL4Dikq3QcbuTM8J3QgXPtLT ryCBrvsvTsw7gNy8HT6fSgboC9FFCgpN2oM9X606gtTgUjs90G1IOO8k6s6uEYDC7pWT BzhA== X-Gm-Message-State: AO0yUKW/XsjZ8rOG9ybRhz6armyiHPsrMyRx0QJQ1aOpisCTsywmxql3 GwM6whfGCMnbo4nQ1PDmub4NOusnABi4 X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a81:6c97:0:b0:52e:df8d:45f4 with SMTP id h145-20020a816c97000000b0052edf8d45f4mr2294792ywc.218.1676799336588; Sun, 19 Feb 2023 01:35:36 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:44 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-48-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 47/51] perf stat: Hide runtime_stat From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758252264343236389?= X-GMAIL-MSGID: =?utf-8?q?1758252264343236389?= runtime_stat is only shared for the sake of tests that don't care about its value. Move the definition into stat-shadow.c and have the tests also use the global version. Signed-off-by: Ian Rogers --- tools/perf/builtin-script.c | 6 +- tools/perf/builtin-stat.c | 4 +- tools/perf/tests/parse-metric.c | 19 ++-- tools/perf/tests/pmu-events.c | 8 +- tools/perf/util/stat-display.c | 5 +- tools/perf/util/stat-shadow.c | 165 +++++++++++++------------------- tools/perf/util/stat.c | 2 +- tools/perf/util/stat.h | 17 +--- 8 files changed, 90 insertions(+), 136 deletions(-) diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c index a792214d1af8..e9b5387161df 100644 --- a/tools/perf/builtin-script.c +++ b/tools/perf/builtin-script.c @@ -2074,8 +2074,7 @@ static void perf_sample__fprint_metric(struct perf_script *script, val = sample->period * evsel->scale; perf_stat__update_shadow_stats(evsel, val, - sample->cpu, - &rt_stat); + sample->cpu); evsel_script(evsel)->val = val; if (evsel_script(leader)->gnum == leader->core.nr_members) { for_each_group_member (ev2, leader) { @@ -2083,8 +2082,7 @@ static void perf_sample__fprint_metric(struct perf_script *script, evsel_script(ev2)->val, sample->cpu, &ctx, - NULL, - &rt_stat); + NULL); } evsel_script(leader)->gnum = 0; } diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c index 9c1fbf154ee3..619387459914 100644 --- a/tools/perf/builtin-stat.c +++ b/tools/perf/builtin-stat.c @@ -434,7 +434,7 @@ static void process_interval(void) clock_gettime(CLOCK_MONOTONIC, &ts); diff_timespec(&rs, &ts, &ref_time); - perf_stat__reset_shadow_per_stat(&rt_stat); + perf_stat__reset_shadow_per_stat(); evlist__reset_aggr_stats(evsel_list); if (read_counters(&rs) == 0) @@ -910,7 +910,7 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx) evlist__copy_prev_raw_counts(evsel_list); evlist__reset_prev_raw_counts(evsel_list); evlist__reset_aggr_stats(evsel_list); - perf_stat__reset_shadow_per_stat(&rt_stat); + perf_stat__reset_shadow_per_stat(); } else { update_stats(&walltime_nsecs_stats, t1 - t0); update_rusage_stats(&ru_stats, &stat_config.ru_data); diff --git a/tools/perf/tests/parse-metric.c b/tools/perf/tests/parse-metric.c index 132c9b945a42..37e3371d978e 100644 --- a/tools/perf/tests/parse-metric.c +++ b/tools/perf/tests/parse-metric.c @@ -30,8 +30,7 @@ static u64 find_value(const char *name, struct value *values) return 0; } -static void load_runtime_stat(struct runtime_stat *st, struct evlist *evlist, - struct value *vals) +static void load_runtime_stat(struct evlist *evlist, struct value *vals) { struct evsel *evsel; u64 count; @@ -39,14 +38,14 @@ static void load_runtime_stat(struct runtime_stat *st, struct evlist *evlist, perf_stat__reset_shadow_stats(); evlist__for_each_entry(evlist, evsel) { count = find_value(evsel->name, vals); - perf_stat__update_shadow_stats(evsel, count, 0, st); + perf_stat__update_shadow_stats(evsel, count, 0); if (!strcmp(evsel->name, "duration_time")) update_stats(&walltime_nsecs_stats, count); } } static double compute_single(struct rblist *metric_events, struct evlist *evlist, - struct runtime_stat *st, const char *name) + const char *name) { struct metric_expr *mexp; struct metric_event *me; @@ -58,7 +57,7 @@ static double compute_single(struct rblist *metric_events, struct evlist *evlist list_for_each_entry (mexp, &me->head, nd) { if (strcmp(mexp->metric_name, name)) continue; - return test_generic_metric(mexp, 0, st); + return test_generic_metric(mexp, 0); } } } @@ -74,7 +73,6 @@ static int __compute_metric(const char *name, struct value *vals, }; const struct pmu_metrics_table *pme_test; struct perf_cpu_map *cpus; - struct runtime_stat st; struct evlist *evlist; int err; @@ -93,7 +91,6 @@ static int __compute_metric(const char *name, struct value *vals, } perf_evlist__set_maps(&evlist->core, cpus, NULL); - runtime_stat__init(&st); /* Parse the metric into metric_events list. */ pme_test = find_core_metrics_table("testarch", "testcpu"); @@ -107,18 +104,17 @@ static int __compute_metric(const char *name, struct value *vals, goto out; /* Load the runtime stats with given numbers for events. */ - load_runtime_stat(&st, evlist, vals); + load_runtime_stat(evlist, vals); /* And execute the metric */ if (name1 && ratio1) - *ratio1 = compute_single(&metric_events, evlist, &st, name1); + *ratio1 = compute_single(&metric_events, evlist, name1); if (name2 && ratio2) - *ratio2 = compute_single(&metric_events, evlist, &st, name2); + *ratio2 = compute_single(&metric_events, evlist, name2); out: /* ... cleanup. */ metricgroup__rblist_exit(&metric_events); - runtime_stat__exit(&st); evlist__free_stats(evlist); perf_cpu_map__put(cpus); evlist__delete(evlist); @@ -300,6 +296,7 @@ static int test_metric_group(void) static int test__parse_metric(struct test_suite *test __maybe_unused, int subtest __maybe_unused) { + perf_stat__init_shadow_stats(); TEST_ASSERT_VAL("IPC failed", test_ipc() == 0); TEST_ASSERT_VAL("frontend failed", test_frontend() == 0); TEST_ASSERT_VAL("DCache_L2 failed", test_dcache_l2() == 0); diff --git a/tools/perf/tests/pmu-events.c b/tools/perf/tests/pmu-events.c index 50b99a0f8f59..122e74c282a7 100644 --- a/tools/perf/tests/pmu-events.c +++ b/tools/perf/tests/pmu-events.c @@ -816,7 +816,6 @@ static int test__parsing_callback(const struct pmu_metric *pm, int k; struct evlist *evlist; struct perf_cpu_map *cpus; - struct runtime_stat st; struct evsel *evsel; struct rblist metric_events = { .nr_entries = 0, @@ -844,7 +843,6 @@ static int test__parsing_callback(const struct pmu_metric *pm, } perf_evlist__set_maps(&evlist->core, cpus, NULL); - runtime_stat__init(&st); err = metricgroup__parse_groups_test(evlist, table, pm->metric_name, &metric_events); if (err) { @@ -867,7 +865,7 @@ static int test__parsing_callback(const struct pmu_metric *pm, k = 1; perf_stat__reset_shadow_stats(); evlist__for_each_entry(evlist, evsel) { - perf_stat__update_shadow_stats(evsel, k, 0, &st); + perf_stat__update_shadow_stats(evsel, k, 0); if (!strcmp(evsel->name, "duration_time")) update_stats(&walltime_nsecs_stats, k); k++; @@ -881,7 +879,7 @@ static int test__parsing_callback(const struct pmu_metric *pm, list_for_each_entry (mexp, &me->head, nd) { if (strcmp(mexp->metric_name, pm->metric_name)) continue; - pr_debug("Result %f\n", test_generic_metric(mexp, 0, &st)); + pr_debug("Result %f\n", test_generic_metric(mexp, 0)); err = 0; (*failures)--; goto out_err; @@ -896,7 +894,6 @@ static int test__parsing_callback(const struct pmu_metric *pm, /* ... cleanup. */ metricgroup__rblist_exit(&metric_events); - runtime_stat__exit(&st); evlist__free_stats(evlist); perf_cpu_map__put(cpus); evlist__delete(evlist); @@ -908,6 +905,7 @@ static int test__parsing(struct test_suite *test __maybe_unused, { int failures = 0; + perf_stat__init_shadow_stats(); pmu_for_each_core_metric(test__parsing_callback, &failures); pmu_for_each_sys_metric(test__parsing_callback, &failures); diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c index 1b5cb20efd23..6c065d0624c3 100644 --- a/tools/perf/util/stat-display.c +++ b/tools/perf/util/stat-display.c @@ -729,7 +729,7 @@ static void printout(struct perf_stat_config *config, struct outstate *os, if (ok) { perf_stat__print_shadow_stats(config, counter, uval, map_idx, - &out, &config->metric_events, &rt_stat); + &out, &config->metric_events); } else { pm(config, os, /*color=*/NULL, /*format=*/NULL, /*unit=*/"", /*val=*/0); } @@ -1089,8 +1089,7 @@ static void print_metric_headers(struct perf_stat_config *config, perf_stat__print_shadow_stats(config, counter, 0, 0, &out, - &config->metric_events, - &rt_stat); + &config->metric_events); } if (!config->json_output) diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c index fc948a7e83b7..f80be6abac90 100644 --- a/tools/perf/util/stat-shadow.c +++ b/tools/perf/util/stat-shadow.c @@ -25,10 +25,13 @@ * AGGR_THREAD: Not supported? */ -struct runtime_stat rt_stat; struct stats walltime_nsecs_stats; struct rusage_stats ru_stats; +static struct runtime_stat { + struct rblist value_list; +} rt_stat; + enum { CTX_BIT_USER = 1 << 0, CTX_BIT_KERNEL = 1 << 1, @@ -125,7 +128,6 @@ static struct saved_value *saved_value_lookup(struct evsel *evsel, bool create, enum stat_type type, int ctx, - struct runtime_stat *st, struct cgroup *cgrp) { struct rblist *rblist; @@ -138,7 +140,7 @@ static struct saved_value *saved_value_lookup(struct evsel *evsel, .cgrp = cgrp, }; - rblist = &st->value_list; + rblist = &rt_stat.value_list; /* don't use context info for clock events */ if (type == STAT_NSECS) @@ -156,9 +158,9 @@ static struct saved_value *saved_value_lookup(struct evsel *evsel, return NULL; } -void runtime_stat__init(struct runtime_stat *st) +void perf_stat__init_shadow_stats(void) { - struct rblist *rblist = &st->value_list; + struct rblist *rblist = &rt_stat.value_list; rblist__init(rblist); rblist->node_cmp = saved_value_cmp; @@ -166,16 +168,6 @@ void runtime_stat__init(struct runtime_stat *st) rblist->node_delete = saved_value_delete; } -void runtime_stat__exit(struct runtime_stat *st) -{ - rblist__exit(&st->value_list); -} - -void perf_stat__init_shadow_stats(void) -{ - runtime_stat__init(&rt_stat); -} - static int evsel_context(struct evsel *evsel) { int ctx = 0; @@ -194,12 +186,12 @@ static int evsel_context(struct evsel *evsel) return ctx; } -static void reset_stat(struct runtime_stat *st) +void perf_stat__reset_shadow_per_stat(void) { struct rblist *rblist; struct rb_node *pos, *next; - rblist = &st->value_list; + rblist = &rt_stat.value_list; next = rb_first_cached(&rblist->entries); while (next) { pos = next; @@ -212,28 +204,22 @@ static void reset_stat(struct runtime_stat *st) void perf_stat__reset_shadow_stats(void) { - reset_stat(&rt_stat); + perf_stat__reset_shadow_per_stat(); memset(&walltime_nsecs_stats, 0, sizeof(walltime_nsecs_stats)); memset(&ru_stats, 0, sizeof(ru_stats)); } -void perf_stat__reset_shadow_per_stat(struct runtime_stat *st) -{ - reset_stat(st); -} - struct runtime_stat_data { int ctx; struct cgroup *cgrp; }; -static void update_runtime_stat(struct runtime_stat *st, - enum stat_type type, +static void update_runtime_stat(enum stat_type type, int map_idx, u64 count, struct runtime_stat_data *rsd) { struct saved_value *v = saved_value_lookup(NULL, map_idx, true, type, - rsd->ctx, st, rsd->cgrp); + rsd->ctx, rsd->cgrp); if (v) update_stats(&v->stats, count); @@ -245,7 +231,7 @@ static void update_runtime_stat(struct runtime_stat *st, * instruction rates, etc: */ void perf_stat__update_shadow_stats(struct evsel *counter, u64 count, - int map_idx, struct runtime_stat *st) + int map_idx) { u64 count_ns = count; struct saved_value *v; @@ -253,43 +239,42 @@ void perf_stat__update_shadow_stats(struct evsel *counter, u64 count, .ctx = evsel_context(counter), .cgrp = counter->cgrp, }; - count *= counter->scale; if (evsel__is_clock(counter)) - update_runtime_stat(st, STAT_NSECS, map_idx, count_ns, &rsd); + update_runtime_stat(STAT_NSECS, map_idx, count_ns, &rsd); else if (evsel__match(counter, HARDWARE, HW_CPU_CYCLES)) - update_runtime_stat(st, STAT_CYCLES, map_idx, count, &rsd); + update_runtime_stat(STAT_CYCLES, map_idx, count, &rsd); else if (evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_FRONTEND)) - update_runtime_stat(st, STAT_STALLED_CYCLES_FRONT, + update_runtime_stat(STAT_STALLED_CYCLES_FRONT, map_idx, count, &rsd); else if (evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_BACKEND)) - update_runtime_stat(st, STAT_STALLED_CYCLES_BACK, + update_runtime_stat(STAT_STALLED_CYCLES_BACK, map_idx, count, &rsd); else if (evsel__match(counter, HARDWARE, HW_BRANCH_INSTRUCTIONS)) - update_runtime_stat(st, STAT_BRANCHES, map_idx, count, &rsd); + update_runtime_stat(STAT_BRANCHES, map_idx, count, &rsd); else if (evsel__match(counter, HARDWARE, HW_CACHE_REFERENCES)) - update_runtime_stat(st, STAT_CACHEREFS, map_idx, count, &rsd); + update_runtime_stat(STAT_CACHEREFS, map_idx, count, &rsd); else if (evsel__match(counter, HW_CACHE, HW_CACHE_L1D)) - update_runtime_stat(st, STAT_L1_DCACHE, map_idx, count, &rsd); + update_runtime_stat(STAT_L1_DCACHE, map_idx, count, &rsd); else if (evsel__match(counter, HW_CACHE, HW_CACHE_L1I)) - update_runtime_stat(st, STAT_L1_ICACHE, map_idx, count, &rsd); + update_runtime_stat(STAT_L1_ICACHE, map_idx, count, &rsd); else if (evsel__match(counter, HW_CACHE, HW_CACHE_LL)) - update_runtime_stat(st, STAT_LL_CACHE, map_idx, count, &rsd); + update_runtime_stat(STAT_LL_CACHE, map_idx, count, &rsd); else if (evsel__match(counter, HW_CACHE, HW_CACHE_DTLB)) - update_runtime_stat(st, STAT_DTLB_CACHE, map_idx, count, &rsd); + update_runtime_stat(STAT_DTLB_CACHE, map_idx, count, &rsd); else if (evsel__match(counter, HW_CACHE, HW_CACHE_ITLB)) - update_runtime_stat(st, STAT_ITLB_CACHE, map_idx, count, &rsd); + update_runtime_stat(STAT_ITLB_CACHE, map_idx, count, &rsd); if (counter->collect_stat) { - v = saved_value_lookup(counter, map_idx, true, STAT_NONE, 0, st, + v = saved_value_lookup(counter, map_idx, true, STAT_NONE, 0, rsd.cgrp); update_stats(&v->stats, count); if (counter->metric_leader) v->metric_total += count; } else if (counter->metric_leader && !counter->merged_stat) { v = saved_value_lookup(counter->metric_leader, - map_idx, true, STAT_NONE, 0, st, rsd.cgrp); + map_idx, true, STAT_NONE, 0, rsd.cgrp); v->metric_total += count; v->metric_other++; } @@ -322,26 +307,24 @@ static const char *get_ratio_color(enum grc_type type, double ratio) return color; } -static double runtime_stat_avg(struct runtime_stat *st, - enum stat_type type, int map_idx, +static double runtime_stat_avg(enum stat_type type, int map_idx, struct runtime_stat_data *rsd) { struct saved_value *v; - v = saved_value_lookup(NULL, map_idx, false, type, rsd->ctx, st, rsd->cgrp); + v = saved_value_lookup(NULL, map_idx, false, type, rsd->ctx, rsd->cgrp); if (!v) return 0.0; return avg_stats(&v->stats); } -static double runtime_stat_n(struct runtime_stat *st, - enum stat_type type, int map_idx, +static double runtime_stat_n(enum stat_type type, int map_idx, struct runtime_stat_data *rsd) { struct saved_value *v; - v = saved_value_lookup(NULL, map_idx, false, type, rsd->ctx, st, rsd->cgrp); + v = saved_value_lookup(NULL, map_idx, false, type, rsd->ctx, rsd->cgrp); if (!v) return 0.0; @@ -351,13 +334,12 @@ static double runtime_stat_n(struct runtime_stat *st, static void print_stalled_cycles_frontend(struct perf_stat_config *config, int map_idx, double avg, struct perf_stat_output_ctx *out, - struct runtime_stat *st, struct runtime_stat_data *rsd) { double total, ratio = 0.0; const char *color; - total = runtime_stat_avg(st, STAT_CYCLES, map_idx, rsd); + total = runtime_stat_avg(STAT_CYCLES, map_idx, rsd); if (total) ratio = avg / total * 100.0; @@ -374,13 +356,12 @@ static void print_stalled_cycles_frontend(struct perf_stat_config *config, static void print_stalled_cycles_backend(struct perf_stat_config *config, int map_idx, double avg, struct perf_stat_output_ctx *out, - struct runtime_stat *st, struct runtime_stat_data *rsd) { double total, ratio = 0.0; const char *color; - total = runtime_stat_avg(st, STAT_CYCLES, map_idx, rsd); + total = runtime_stat_avg(STAT_CYCLES, map_idx, rsd); if (total) ratio = avg / total * 100.0; @@ -393,13 +374,12 @@ static void print_stalled_cycles_backend(struct perf_stat_config *config, static void print_branch_misses(struct perf_stat_config *config, int map_idx, double avg, struct perf_stat_output_ctx *out, - struct runtime_stat *st, struct runtime_stat_data *rsd) { double total, ratio = 0.0; const char *color; - total = runtime_stat_avg(st, STAT_BRANCHES, map_idx, rsd); + total = runtime_stat_avg(STAT_BRANCHES, map_idx, rsd); if (total) ratio = avg / total * 100.0; @@ -412,13 +392,12 @@ static void print_branch_misses(struct perf_stat_config *config, static void print_l1_dcache_misses(struct perf_stat_config *config, int map_idx, double avg, struct perf_stat_output_ctx *out, - struct runtime_stat *st, struct runtime_stat_data *rsd) { double total, ratio = 0.0; const char *color; - total = runtime_stat_avg(st, STAT_L1_DCACHE, map_idx, rsd); + total = runtime_stat_avg(STAT_L1_DCACHE, map_idx, rsd); if (total) ratio = avg / total * 100.0; @@ -431,13 +410,12 @@ static void print_l1_dcache_misses(struct perf_stat_config *config, static void print_l1_icache_misses(struct perf_stat_config *config, int map_idx, double avg, struct perf_stat_output_ctx *out, - struct runtime_stat *st, struct runtime_stat_data *rsd) { double total, ratio = 0.0; const char *color; - total = runtime_stat_avg(st, STAT_L1_ICACHE, map_idx, rsd); + total = runtime_stat_avg(STAT_L1_ICACHE, map_idx, rsd); if (total) ratio = avg / total * 100.0; @@ -449,13 +427,12 @@ static void print_l1_icache_misses(struct perf_stat_config *config, static void print_dtlb_cache_misses(struct perf_stat_config *config, int map_idx, double avg, struct perf_stat_output_ctx *out, - struct runtime_stat *st, struct runtime_stat_data *rsd) { double total, ratio = 0.0; const char *color; - total = runtime_stat_avg(st, STAT_DTLB_CACHE, map_idx, rsd); + total = runtime_stat_avg(STAT_DTLB_CACHE, map_idx, rsd); if (total) ratio = avg / total * 100.0; @@ -467,13 +444,12 @@ static void print_dtlb_cache_misses(struct perf_stat_config *config, static void print_itlb_cache_misses(struct perf_stat_config *config, int map_idx, double avg, struct perf_stat_output_ctx *out, - struct runtime_stat *st, struct runtime_stat_data *rsd) { double total, ratio = 0.0; const char *color; - total = runtime_stat_avg(st, STAT_ITLB_CACHE, map_idx, rsd); + total = runtime_stat_avg(STAT_ITLB_CACHE, map_idx, rsd); if (total) ratio = avg / total * 100.0; @@ -485,13 +461,12 @@ static void print_itlb_cache_misses(struct perf_stat_config *config, static void print_ll_cache_misses(struct perf_stat_config *config, int map_idx, double avg, struct perf_stat_output_ctx *out, - struct runtime_stat *st, struct runtime_stat_data *rsd) { double total, ratio = 0.0; const char *color; - total = runtime_stat_avg(st, STAT_LL_CACHE, map_idx, rsd); + total = runtime_stat_avg(STAT_LL_CACHE, map_idx, rsd); if (total) ratio = avg / total * 100.0; @@ -503,8 +478,7 @@ static void print_ll_cache_misses(struct perf_stat_config *config, static int prepare_metric(struct evsel **metric_events, struct metric_ref *metric_refs, struct expr_parse_ctx *pctx, - int map_idx, - struct runtime_stat *st) + int map_idx) { double scale; char *n; @@ -543,7 +517,7 @@ static int prepare_metric(struct evsel **metric_events, } } else { v = saved_value_lookup(metric_events[i], map_idx, false, - STAT_NONE, 0, st, + STAT_NONE, 0, metric_events[i]->cgrp); if (!v) break; @@ -587,8 +561,7 @@ static void generic_metric(struct perf_stat_config *config, const char *metric_unit, int runtime, int map_idx, - struct perf_stat_output_ctx *out, - struct runtime_stat *st) + struct perf_stat_output_ctx *out) { print_metric_t print_metric = out->print_metric; struct expr_parse_ctx *pctx; @@ -605,7 +578,7 @@ static void generic_metric(struct perf_stat_config *config, pctx->sctx.user_requested_cpu_list = strdup(config->user_requested_cpu_list); pctx->sctx.runtime = runtime; pctx->sctx.system_wide = config->system_wide; - i = prepare_metric(metric_events, metric_refs, pctx, map_idx, st); + i = prepare_metric(metric_events, metric_refs, pctx, map_idx); if (i < 0) { expr__ctx_free(pctx); return; @@ -657,7 +630,7 @@ static void generic_metric(struct perf_stat_config *config, expr__ctx_free(pctx); } -double test_generic_metric(struct metric_expr *mexp, int map_idx, struct runtime_stat *st) +double test_generic_metric(struct metric_expr *mexp, int map_idx) { struct expr_parse_ctx *pctx; double ratio = 0.0; @@ -666,7 +639,7 @@ double test_generic_metric(struct metric_expr *mexp, int map_idx, struct runtime if (!pctx) return NAN; - if (prepare_metric(mexp->metric_events, mexp->metric_refs, pctx, map_idx, st) < 0) + if (prepare_metric(mexp->metric_events, mexp->metric_refs, pctx, map_idx) < 0) goto out; if (expr__parse(&ratio, pctx, mexp->metric_expr)) @@ -681,8 +654,7 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, struct evsel *evsel, double avg, int map_idx, struct perf_stat_output_ctx *out, - struct rblist *metric_events, - struct runtime_stat *st) + struct rblist *metric_events) { void *ctxp = out->ctx; print_metric_t print_metric = out->print_metric; @@ -697,7 +669,7 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, if (config->iostat_run) { iostat_print_metric(config, evsel, out); } else if (evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) { - total = runtime_stat_avg(st, STAT_CYCLES, map_idx, &rsd); + total = runtime_stat_avg(STAT_CYCLES, map_idx, &rsd); if (total) { ratio = avg / total; @@ -707,10 +679,9 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, print_metric(config, ctxp, NULL, NULL, "insn per cycle", 0); } - total = runtime_stat_avg(st, STAT_STALLED_CYCLES_FRONT, map_idx, &rsd); + total = runtime_stat_avg(STAT_STALLED_CYCLES_FRONT, map_idx, &rsd); - total = max(total, runtime_stat_avg(st, - STAT_STALLED_CYCLES_BACK, + total = max(total, runtime_stat_avg(STAT_STALLED_CYCLES_BACK, map_idx, &rsd)); if (total && avg) { @@ -721,8 +692,8 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, ratio); } } else if (evsel__match(evsel, HARDWARE, HW_BRANCH_MISSES)) { - if (runtime_stat_n(st, STAT_BRANCHES, map_idx, &rsd) != 0) - print_branch_misses(config, map_idx, avg, out, st, &rsd); + if (runtime_stat_n(STAT_BRANCHES, map_idx, &rsd) != 0) + print_branch_misses(config, map_idx, avg, out, &rsd); else print_metric(config, ctxp, NULL, NULL, "of all branches", 0); } else if ( @@ -731,8 +702,8 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, ((PERF_COUNT_HW_CACHE_OP_READ) << 8) | ((PERF_COUNT_HW_CACHE_RESULT_MISS) << 16))) { - if (runtime_stat_n(st, STAT_L1_DCACHE, map_idx, &rsd) != 0) - print_l1_dcache_misses(config, map_idx, avg, out, st, &rsd); + if (runtime_stat_n(STAT_L1_DCACHE, map_idx, &rsd) != 0) + print_l1_dcache_misses(config, map_idx, avg, out, &rsd); else print_metric(config, ctxp, NULL, NULL, "of all L1-dcache accesses", 0); } else if ( @@ -741,8 +712,8 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, ((PERF_COUNT_HW_CACHE_OP_READ) << 8) | ((PERF_COUNT_HW_CACHE_RESULT_MISS) << 16))) { - if (runtime_stat_n(st, STAT_L1_ICACHE, map_idx, &rsd) != 0) - print_l1_icache_misses(config, map_idx, avg, out, st, &rsd); + if (runtime_stat_n(STAT_L1_ICACHE, map_idx, &rsd) != 0) + print_l1_icache_misses(config, map_idx, avg, out, &rsd); else print_metric(config, ctxp, NULL, NULL, "of all L1-icache accesses", 0); } else if ( @@ -751,8 +722,8 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, ((PERF_COUNT_HW_CACHE_OP_READ) << 8) | ((PERF_COUNT_HW_CACHE_RESULT_MISS) << 16))) { - if (runtime_stat_n(st, STAT_DTLB_CACHE, map_idx, &rsd) != 0) - print_dtlb_cache_misses(config, map_idx, avg, out, st, &rsd); + if (runtime_stat_n(STAT_DTLB_CACHE, map_idx, &rsd) != 0) + print_dtlb_cache_misses(config, map_idx, avg, out, &rsd); else print_metric(config, ctxp, NULL, NULL, "of all dTLB cache accesses", 0); } else if ( @@ -761,8 +732,8 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, ((PERF_COUNT_HW_CACHE_OP_READ) << 8) | ((PERF_COUNT_HW_CACHE_RESULT_MISS) << 16))) { - if (runtime_stat_n(st, STAT_ITLB_CACHE, map_idx, &rsd) != 0) - print_itlb_cache_misses(config, map_idx, avg, out, st, &rsd); + if (runtime_stat_n(STAT_ITLB_CACHE, map_idx, &rsd) != 0) + print_itlb_cache_misses(config, map_idx, avg, out, &rsd); else print_metric(config, ctxp, NULL, NULL, "of all iTLB cache accesses", 0); } else if ( @@ -771,27 +742,27 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, ((PERF_COUNT_HW_CACHE_OP_READ) << 8) | ((PERF_COUNT_HW_CACHE_RESULT_MISS) << 16))) { - if (runtime_stat_n(st, STAT_LL_CACHE, map_idx, &rsd) != 0) - print_ll_cache_misses(config, map_idx, avg, out, st, &rsd); + if (runtime_stat_n(STAT_LL_CACHE, map_idx, &rsd) != 0) + print_ll_cache_misses(config, map_idx, avg, out, &rsd); else print_metric(config, ctxp, NULL, NULL, "of all LL-cache accesses", 0); } else if (evsel__match(evsel, HARDWARE, HW_CACHE_MISSES)) { - total = runtime_stat_avg(st, STAT_CACHEREFS, map_idx, &rsd); + total = runtime_stat_avg(STAT_CACHEREFS, map_idx, &rsd); if (total) ratio = avg * 100 / total; - if (runtime_stat_n(st, STAT_CACHEREFS, map_idx, &rsd) != 0) + if (runtime_stat_n(STAT_CACHEREFS, map_idx, &rsd) != 0) print_metric(config, ctxp, NULL, "%8.3f %%", "of all cache refs", ratio); else print_metric(config, ctxp, NULL, NULL, "of all cache refs", 0); } else if (evsel__match(evsel, HARDWARE, HW_STALLED_CYCLES_FRONTEND)) { - print_stalled_cycles_frontend(config, map_idx, avg, out, st, &rsd); + print_stalled_cycles_frontend(config, map_idx, avg, out, &rsd); } else if (evsel__match(evsel, HARDWARE, HW_STALLED_CYCLES_BACKEND)) { - print_stalled_cycles_backend(config, map_idx, avg, out, st, &rsd); + print_stalled_cycles_backend(config, map_idx, avg, out, &rsd); } else if (evsel__match(evsel, HARDWARE, HW_CPU_CYCLES)) { - total = runtime_stat_avg(st, STAT_NSECS, map_idx, &rsd); + total = runtime_stat_avg(STAT_NSECS, map_idx, &rsd); if (total) { ratio = avg / total; @@ -805,11 +776,11 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, avg / (ratio * evsel->scale)); else print_metric(config, ctxp, NULL, NULL, "CPUs utilized", 0); - } else if (runtime_stat_n(st, STAT_NSECS, map_idx, &rsd) != 0) { + } else if (runtime_stat_n(STAT_NSECS, map_idx, &rsd) != 0) { char unit = ' '; char unit_buf[10] = "/sec"; - total = runtime_stat_avg(st, STAT_NSECS, map_idx, &rsd); + total = runtime_stat_avg(STAT_NSECS, map_idx, &rsd); if (total) ratio = convert_unit_double(1000000000.0 * avg / total, &unit); @@ -829,7 +800,7 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, generic_metric(config, mexp->metric_expr, mexp->metric_threshold, mexp->metric_events, mexp->metric_refs, evsel->name, mexp->metric_name, mexp->metric_unit, mexp->runtime, - map_idx, out, st); + map_idx, out); } } if (num == 0) diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c index 8d83d2f4a082..0d7538670d67 100644 --- a/tools/perf/util/stat.c +++ b/tools/perf/util/stat.c @@ -659,7 +659,7 @@ static void evsel__update_shadow_stats(struct evsel *evsel) for (i = 0; i < ps->nr_aggr; i++) { struct perf_counts_values *aggr_counts = &ps->aggr[i].counts; - perf_stat__update_shadow_stats(evsel, aggr_counts->val, i, &rt_stat); + perf_stat__update_shadow_stats(evsel, aggr_counts->val, i); } } diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h index 215c0f5c4db7..09975e098bd0 100644 --- a/tools/perf/util/stat.h +++ b/tools/perf/util/stat.h @@ -55,10 +55,6 @@ enum aggr_mode { AGGR_MAX }; -struct runtime_stat { - struct rblist value_list; -}; - struct rusage_stats { struct stats ru_utime_usec_stat; struct stats ru_stime_usec_stat; @@ -153,7 +149,6 @@ static inline void update_rusage_stats(struct rusage_stats *ru_stats, struct rus struct evsel; struct evlist; -extern struct runtime_stat rt_stat; extern struct stats walltime_nsecs_stats; extern struct rusage_stats ru_stats; @@ -162,13 +157,10 @@ typedef void (*print_metric_t)(struct perf_stat_config *config, const char *fmt, double val); typedef void (*new_line_t)(struct perf_stat_config *config, void *ctx); -void runtime_stat__init(struct runtime_stat *st); -void runtime_stat__exit(struct runtime_stat *st); void perf_stat__init_shadow_stats(void); void perf_stat__reset_shadow_stats(void); -void perf_stat__reset_shadow_per_stat(struct runtime_stat *st); -void perf_stat__update_shadow_stats(struct evsel *counter, u64 count, - int map_idx, struct runtime_stat *st); +void perf_stat__reset_shadow_per_stat(void); +void perf_stat__update_shadow_stats(struct evsel *counter, u64 count, int map_idx); struct perf_stat_output_ctx { void *ctx; print_metric_t print_metric; @@ -180,8 +172,7 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, struct evsel *evsel, double avg, int map_idx, struct perf_stat_output_ctx *out, - struct rblist *metric_events, - struct runtime_stat *st); + struct rblist *metric_events); int evlist__alloc_stats(struct perf_stat_config *config, struct evlist *evlist, bool alloc_raw); @@ -220,5 +211,5 @@ void evlist__print_counters(struct evlist *evlist, struct perf_stat_config *conf struct target *_target, struct timespec *ts, int argc, const char **argv); struct metric_expr; -double test_generic_metric(struct metric_expr *mexp, int map_idx, struct runtime_stat *st); +double test_generic_metric(struct metric_expr *mexp, int map_idx); #endif From patchwork Sun Feb 19 09:28:45 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59130 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp776416wrn; Sun, 19 Feb 2023 01:48:29 -0800 (PST) X-Google-Smtp-Source: AK7set9edTuytm001xpgENGpE2bYDO5lmzdRT8xeaCvxNSOEtpPSNJoJ5f1T/60bUwTED2JLKu2Y X-Received: by 2002:a17:903:2344:b0:19a:9859:be26 with SMTP id c4-20020a170903234400b0019a9859be26mr3129783plh.22.1676800109077; Sun, 19 Feb 2023 01:48:29 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676800109; cv=none; d=google.com; s=arc-20160816; b=hA8r62PVtaMYxSg8XdjqOzkUWfuw3tVFfaX5Lw9Cayj3JbL+07pnhXoWHsOO9d2UNg xWlwAq21YPhXEveN/hf1FfJel2X1zpwp/S44p6dCCs6jrwq+USrebLe4kG9HUjpsqjwv ESNgqp+DAYNHXgO+Kvma9bovsGkU6ws1qw+toLbVHpyeW2JWVAf5ls82u7HsFDQl4P4y v1cpGZxd47VT3Kquzo9vNQz7g9FtqvsVuaPLqorT0GxcFvJ48h7L8ib9jHmavbWGqX7W 9JKfEr9Z/dKP+gDiSyJjinGbkzQ0XBg47yyhl5lQopZrPm6zXk0AbgE969vKcj9wJeww pjIg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=/y7SWHTUkfrtXB6+Nyg9zBpnl23ivOFDADIaKoe8cgw=; b=KjfpeL7BfU2/vn3kR7pRjLBWhUpXver0dLFyIFmJmJvg/RW+srGGHvGrxGb57bVpZg chKr7EkosXZPX+6r2bJdRSGpjNrn3uDYgopjsCHliBfYpUXIU+zWE4msxScVHCCbq2L9 q++lFbYlFCsxd4FykT5u8tTWDw/3IW5g2ndtc1xmTjn/K7DHTDQ+zwZZS3jJpiClc8ew PQSPc+Q9GP7I5cnGBmf+AQnogYFeQ68mc52xE4DtcLSZ6fR2M4zTGM/G/DougVng1o0B GxpvEALgem6ZIwTqhQQ/c15qt1G3yjPHkvwEiJ0r9baCPj83nhRy4uLuZlXcfiPxOMdy vhQQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=iH7Yk8Zg; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id b5-20020a655785000000b004fbedfa5b69si9093403pgr.497.2023.02.19.01.48.17; Sun, 19 Feb 2023 01:48:29 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=iH7Yk8Zg; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230086AbjBSJjW (ORCPT + 99 others); Sun, 19 Feb 2023 04:39:22 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34254 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230074AbjBSJjT (ORCPT ); Sun, 19 Feb 2023 04:39:19 -0500 Received: from mail-yw1-f202.google.com (mail-yw1-f202.google.com [209.85.128.202]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id AD98B1117A for ; Sun, 19 Feb 2023 01:38:22 -0800 (PST) Received: by mail-yw1-f202.google.com with SMTP id 00721157ae682-5369e5f7913so1149257b3.1 for ; Sun, 19 Feb 2023 01:38:22 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=/y7SWHTUkfrtXB6+Nyg9zBpnl23ivOFDADIaKoe8cgw=; b=iH7Yk8ZgXy+D6RWwLGF5Fea/CxfUBXBWWaJTGGLOaD3q0mPtVxy87WyEw1kDXvTs8N lVs/8gn3d9xyHWfONVRA4PnFDLyL+j2HaxjoWqsZ4cVF51kSMNi0W/SKen7hmzU52zwp VVn0eX2Kh1EQUjLisJYaA7huB27zZN6I2vuBrfTyBAd22Wuz4jYjc53371S41SDo6ZIz q4HjKc4jHYhXbS47kZnTyg8rlJoqgQ8U4oHn34uWMwSwSK/SvJO7P3UiqNvSmwtKs8xZ NWdNCLIAYUg4Hb1Gpe59FDXDn6ABHPblT6huaDvP2xhkKsnOFy7QJeRV0n6+AixmvJEP PqDg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=/y7SWHTUkfrtXB6+Nyg9zBpnl23ivOFDADIaKoe8cgw=; b=ZXyU81RtMHJz/CawmUVZQ2bDPvtsZMkT8mTKKxkvqrvCaNfq9GLpwMMmQ4Kb9lQIkS cJRrb1ldJRtaDDmx6BHv4bGaM1CCbKH0neWpkC6k2CxhqIt490cd0aHiSHz+QSXU9LyB QjjxX1M/p8Z2bNm4GBz1g8W1oZY6IU/pA4HnrXON8ONod7CT+7BA4O9LfSAvFDn2LqEx Xqww02Yeyyy9KWuyi317HD5btsiyr+/EdIXSsZAlR4nofSPPtSXXmVmmQ01wGTa4UKMO LwMjKjR0rQaLXPKWN5Sr9Po3zGhOtXrwc88H+bg5Ftw0v2AMvxxfI8ZLfE/sD5833MA5 azKw== X-Gm-Message-State: AO0yUKXLTTCodPh4O+nJjcff6h/NDTuSnZUgA8DJkXTeQzRQjcqB/wf3 swW90oZnXdaL7/1EDAKmud8jbadKj0Ld X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a5b:cc7:0:b0:7d1:87e:1713 with SMTP id e7-20020a5b0cc7000000b007d1087e1713mr1553569ybr.102.1676799345329; Sun, 19 Feb 2023 01:35:45 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:45 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-49-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 48/51] perf stat: Add cpu_aggr_map for loop From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758252350900787708?= X-GMAIL-MSGID: =?utf-8?q?1758252350900787708?= Rename variables, add a comment and add a cpu_aggr_map__for_each_idx to aid the readability of the stat-display code. In particular, try to make sure aggr_idx is used consistently to differentiate from other kinds of index. Signed-off-by: Ian Rogers --- tools/perf/util/cpumap.h | 3 + tools/perf/util/stat-display.c | 112 +++++++++++++++-------------- tools/perf/util/stat-shadow.c | 128 ++++++++++++++++----------------- tools/perf/util/stat.c | 8 +-- tools/perf/util/stat.h | 6 +- 5 files changed, 132 insertions(+), 125 deletions(-) diff --git a/tools/perf/util/cpumap.h b/tools/perf/util/cpumap.h index c2f5824a3a22..e3426541e0aa 100644 --- a/tools/perf/util/cpumap.h +++ b/tools/perf/util/cpumap.h @@ -35,6 +35,9 @@ struct cpu_aggr_map { struct aggr_cpu_id map[]; }; +#define cpu_aggr_map__for_each_idx(idx, aggr_map) \ + for ((idx) = 0; (idx) < aggr_map->nr; (idx)++) + struct perf_record_cpu_map_data; bool perf_record_cpu_map_data__test_bit(int i, const struct perf_record_cpu_map_data *data); diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c index 6c065d0624c3..e6035ecbeee8 100644 --- a/tools/perf/util/stat-display.c +++ b/tools/perf/util/stat-display.c @@ -183,7 +183,7 @@ static void print_cgroup(struct perf_stat_config *config, struct cgroup *cgrp) } static void print_aggr_id_std(struct perf_stat_config *config, - struct evsel *evsel, struct aggr_cpu_id id, int nr) + struct evsel *evsel, struct aggr_cpu_id id, int aggr_nr) { FILE *output = config->output; int idx = config->aggr_mode; @@ -225,11 +225,11 @@ static void print_aggr_id_std(struct perf_stat_config *config, return; } - fprintf(output, "%-*s %*d ", aggr_header_lens[idx], buf, 4, nr); + fprintf(output, "%-*s %*d ", aggr_header_lens[idx], buf, 4, aggr_nr); } static void print_aggr_id_csv(struct perf_stat_config *config, - struct evsel *evsel, struct aggr_cpu_id id, int nr) + struct evsel *evsel, struct aggr_cpu_id id, int aggr_nr) { FILE *output = config->output; const char *sep = config->csv_sep; @@ -237,19 +237,19 @@ static void print_aggr_id_csv(struct perf_stat_config *config, switch (config->aggr_mode) { case AGGR_CORE: fprintf(output, "S%d-D%d-C%d%s%d%s", - id.socket, id.die, id.core, sep, nr, sep); + id.socket, id.die, id.core, sep, aggr_nr, sep); break; case AGGR_DIE: fprintf(output, "S%d-D%d%s%d%s", - id.socket, id.die, sep, nr, sep); + id.socket, id.die, sep, aggr_nr, sep); break; case AGGR_SOCKET: fprintf(output, "S%d%s%d%s", - id.socket, sep, nr, sep); + id.socket, sep, aggr_nr, sep); break; case AGGR_NODE: fprintf(output, "N%d%s%d%s", - id.node, sep, nr, sep); + id.node, sep, aggr_nr, sep); break; case AGGR_NONE: if (evsel->percore && !config->percore_show_thread) { @@ -275,26 +275,26 @@ static void print_aggr_id_csv(struct perf_stat_config *config, } static void print_aggr_id_json(struct perf_stat_config *config, - struct evsel *evsel, struct aggr_cpu_id id, int nr) + struct evsel *evsel, struct aggr_cpu_id id, int aggr_nr) { FILE *output = config->output; switch (config->aggr_mode) { case AGGR_CORE: fprintf(output, "\"core\" : \"S%d-D%d-C%d\", \"aggregate-number\" : %d, ", - id.socket, id.die, id.core, nr); + id.socket, id.die, id.core, aggr_nr); break; case AGGR_DIE: fprintf(output, "\"die\" : \"S%d-D%d\", \"aggregate-number\" : %d, ", - id.socket, id.die, nr); + id.socket, id.die, aggr_nr); break; case AGGR_SOCKET: fprintf(output, "\"socket\" : \"S%d\", \"aggregate-number\" : %d, ", - id.socket, nr); + id.socket, aggr_nr); break; case AGGR_NODE: fprintf(output, "\"node\" : \"N%d\", \"aggregate-number\" : %d, ", - id.node, nr); + id.node, aggr_nr); break; case AGGR_NONE: if (evsel->percore && !config->percore_show_thread) { @@ -319,14 +319,14 @@ static void print_aggr_id_json(struct perf_stat_config *config, } static void aggr_printout(struct perf_stat_config *config, - struct evsel *evsel, struct aggr_cpu_id id, int nr) + struct evsel *evsel, struct aggr_cpu_id id, int aggr_nr) { if (config->json_output) - print_aggr_id_json(config, evsel, id, nr); + print_aggr_id_json(config, evsel, id, aggr_nr); else if (config->csv_output) - print_aggr_id_csv(config, evsel, id, nr); + print_aggr_id_csv(config, evsel, id, aggr_nr); else - print_aggr_id_std(config, evsel, id, nr); + print_aggr_id_std(config, evsel, id, aggr_nr); } struct outstate { @@ -335,7 +335,7 @@ struct outstate { bool first; const char *prefix; int nfields; - int nr; + int aggr_nr; struct aggr_cpu_id id; struct evsel *evsel; struct cgroup *cgrp; @@ -355,7 +355,7 @@ static void do_new_line_std(struct perf_stat_config *config, fputc('\n', os->fh); if (os->prefix) fputs(os->prefix, os->fh); - aggr_printout(config, os->evsel, os->id, os->nr); + aggr_printout(config, os->evsel, os->id, os->aggr_nr); if (config->aggr_mode == AGGR_NONE) fprintf(os->fh, " "); fprintf(os->fh, " "); @@ -396,7 +396,7 @@ static void new_line_csv(struct perf_stat_config *config, void *ctx) fputc('\n', os->fh); if (os->prefix) fprintf(os->fh, "%s", os->prefix); - aggr_printout(config, os->evsel, os->id, os->nr); + aggr_printout(config, os->evsel, os->id, os->aggr_nr); for (i = 0; i < os->nfields; i++) fputs(config->csv_sep, os->fh); } @@ -444,7 +444,7 @@ static void new_line_json(struct perf_stat_config *config, void *ctx) fputs("\n{", os->fh); if (os->prefix) fprintf(os->fh, "%s", os->prefix); - aggr_printout(config, os->evsel, os->id, os->nr); + aggr_printout(config, os->evsel, os->id, os->aggr_nr); } /* Filter out some columns that don't work well in metrics only mode */ @@ -645,10 +645,10 @@ static void print_counter_value(struct perf_stat_config *config, } static void abs_printout(struct perf_stat_config *config, - struct aggr_cpu_id id, int nr, + struct aggr_cpu_id id, int aggr_nr, struct evsel *evsel, double avg, bool ok) { - aggr_printout(config, evsel, id, nr); + aggr_printout(config, evsel, id, aggr_nr); print_counter_value(config, evsel, avg, ok); print_cgroup(config, evsel->cgrp); } @@ -678,7 +678,7 @@ static bool is_mixed_hw_group(struct evsel *counter) } static void printout(struct perf_stat_config *config, struct outstate *os, - double uval, u64 run, u64 ena, double noise, int map_idx) + double uval, u64 run, u64 ena, double noise, int aggr_idx) { struct perf_stat_output_ctx out; print_metric_t pm; @@ -721,14 +721,14 @@ static void printout(struct perf_stat_config *config, struct outstate *os, out.force_header = false; if (!config->metric_only) { - abs_printout(config, os->id, os->nr, counter, uval, ok); + abs_printout(config, os->id, os->aggr_nr, counter, uval, ok); print_noise(config, counter, noise, /*before_metric=*/true); print_running(config, run, ena, /*before_metric=*/true); } if (ok) { - perf_stat__print_shadow_stats(config, counter, uval, map_idx, + perf_stat__print_shadow_stats(config, counter, uval, aggr_idx, &out, &config->metric_events); } else { pm(config, os, /*color=*/NULL, /*format=*/NULL, /*unit=*/"", /*val=*/0); @@ -833,20 +833,20 @@ static bool should_skip_zero_counter(struct perf_stat_config *config, } static void print_counter_aggrdata(struct perf_stat_config *config, - struct evsel *counter, int s, + struct evsel *counter, int aggr_idx, struct outstate *os) { FILE *output = config->output; u64 ena, run, val; double uval; struct perf_stat_evsel *ps = counter->stats; - struct perf_stat_aggr *aggr = &ps->aggr[s]; - struct aggr_cpu_id id = config->aggr_map->map[s]; + struct perf_stat_aggr *aggr = &ps->aggr[aggr_idx]; + struct aggr_cpu_id id = config->aggr_map->map[aggr_idx]; double avg = aggr->counts.val; bool metric_only = config->metric_only; os->id = id; - os->nr = aggr->nr; + os->aggr_nr = aggr->nr; os->evsel = counter; /* Skip already merged uncore/hybrid events */ @@ -874,7 +874,7 @@ static void print_counter_aggrdata(struct perf_stat_config *config, uval = val * counter->scale; - printout(config, os, uval, run, ena, avg, s); + printout(config, os, uval, run, ena, avg, aggr_idx); if (!metric_only) fputc('\n', output); @@ -925,7 +925,7 @@ static void print_aggr(struct perf_stat_config *config, struct outstate *os) { struct evsel *counter; - int s; + int aggr_idx; if (!config->aggr_map || !config->aggr_get_id) return; @@ -934,11 +934,11 @@ static void print_aggr(struct perf_stat_config *config, * With metric_only everything is on a single line. * Without each counter has its own line. */ - for (s = 0; s < config->aggr_map->nr; s++) { - print_metric_begin(config, evlist, os, s); + cpu_aggr_map__for_each_idx(aggr_idx, config->aggr_map) { + print_metric_begin(config, evlist, os, aggr_idx); evlist__for_each_entry(evlist, counter) { - print_counter_aggrdata(config, counter, s, os); + print_counter_aggrdata(config, counter, aggr_idx, os); } print_metric_end(config, os); } @@ -949,7 +949,7 @@ static void print_aggr_cgroup(struct perf_stat_config *config, struct outstate *os) { struct evsel *counter, *evsel; - int s; + int aggr_idx; if (!config->aggr_map || !config->aggr_get_id) return; @@ -960,14 +960,14 @@ static void print_aggr_cgroup(struct perf_stat_config *config, os->cgrp = evsel->cgrp; - for (s = 0; s < config->aggr_map->nr; s++) { - print_metric_begin(config, evlist, os, s); + cpu_aggr_map__for_each_idx(aggr_idx, config->aggr_map) { + print_metric_begin(config, evlist, os, aggr_idx); evlist__for_each_entry(evlist, counter) { if (counter->cgrp != os->cgrp) continue; - print_counter_aggrdata(config, counter, s, os); + print_counter_aggrdata(config, counter, aggr_idx, os); } print_metric_end(config, os); } @@ -977,14 +977,14 @@ static void print_aggr_cgroup(struct perf_stat_config *config, static void print_counter(struct perf_stat_config *config, struct evsel *counter, struct outstate *os) { - int s; + int aggr_idx; /* AGGR_THREAD doesn't have config->aggr_get_id */ if (!config->aggr_map) return; - for (s = 0; s < config->aggr_map->nr; s++) { - print_counter_aggrdata(config, counter, s, os); + cpu_aggr_map__for_each_idx(aggr_idx, config->aggr_map) { + print_counter_aggrdata(config, counter, aggr_idx, os); } } @@ -1003,23 +1003,23 @@ static void print_no_aggr_metric(struct perf_stat_config *config, u64 ena, run, val; double uval; struct perf_stat_evsel *ps = counter->stats; - int counter_idx = perf_cpu_map__idx(evsel__cpus(counter), cpu); + int aggr_idx = perf_cpu_map__idx(evsel__cpus(counter), cpu); - if (counter_idx < 0) + if (aggr_idx < 0) continue; os->evsel = counter; os->id = aggr_cpu_id__cpu(cpu, /*data=*/NULL); if (first) { - print_metric_begin(config, evlist, os, counter_idx); + print_metric_begin(config, evlist, os, aggr_idx); first = false; } - val = ps->aggr[counter_idx].counts.val; - ena = ps->aggr[counter_idx].counts.ena; - run = ps->aggr[counter_idx].counts.run; + val = ps->aggr[aggr_idx].counts.val; + ena = ps->aggr[aggr_idx].counts.ena; + run = ps->aggr[aggr_idx].counts.run; uval = val * counter->scale; - printout(config, os, uval, run, ena, 1.0, counter_idx); + printout(config, os, uval, run, ena, 1.0, aggr_idx); } if (!first) print_metric_end(config, os); @@ -1338,7 +1338,7 @@ static void print_percore(struct perf_stat_config *config, bool metric_only = config->metric_only; FILE *output = config->output; struct cpu_aggr_map *core_map; - int s, c, i; + int aggr_idx, core_map_len = 0; if (!config->aggr_map || !config->aggr_get_id) return; @@ -1346,18 +1346,22 @@ static void print_percore(struct perf_stat_config *config, if (config->percore_show_thread) return print_counter(config, counter, os); + /* + * core_map will hold the aggr_cpu_id for the cores that have been + * printed so that each core is printed just once. + */ core_map = cpu_aggr_map__empty_new(config->aggr_map->nr); if (core_map == NULL) { fprintf(output, "Cannot allocate per-core aggr map for display\n"); return; } - for (s = 0, c = 0; s < config->aggr_map->nr; s++) { - struct perf_cpu curr_cpu = config->aggr_map->map[s].cpu; + cpu_aggr_map__for_each_idx(aggr_idx, config->aggr_map) { + struct perf_cpu curr_cpu = config->aggr_map->map[aggr_idx].cpu; struct aggr_cpu_id core_id = aggr_cpu_id__core(curr_cpu, NULL); bool found = false; - for (i = 0; i < c; i++) { + for (int i = 0; i < core_map_len; i++) { if (aggr_cpu_id__equal(&core_map->map[i], &core_id)) { found = true; break; @@ -1366,9 +1370,9 @@ static void print_percore(struct perf_stat_config *config, if (found) continue; - print_counter_aggrdata(config, counter, s, os); + print_counter_aggrdata(config, counter, aggr_idx, os); - core_map->map[c++] = core_id; + core_map->map[core_map_len++] = core_id; } free(core_map); diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c index f80be6abac90..7b48e2bd3ba1 100644 --- a/tools/perf/util/stat-shadow.c +++ b/tools/perf/util/stat-shadow.c @@ -231,7 +231,7 @@ static void update_runtime_stat(enum stat_type type, * instruction rates, etc: */ void perf_stat__update_shadow_stats(struct evsel *counter, u64 count, - int map_idx) + int aggr_idx) { u64 count_ns = count; struct saved_value *v; @@ -242,39 +242,39 @@ void perf_stat__update_shadow_stats(struct evsel *counter, u64 count, count *= counter->scale; if (evsel__is_clock(counter)) - update_runtime_stat(STAT_NSECS, map_idx, count_ns, &rsd); + update_runtime_stat(STAT_NSECS, aggr_idx, count_ns, &rsd); else if (evsel__match(counter, HARDWARE, HW_CPU_CYCLES)) - update_runtime_stat(STAT_CYCLES, map_idx, count, &rsd); + update_runtime_stat(STAT_CYCLES, aggr_idx, count, &rsd); else if (evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_FRONTEND)) update_runtime_stat(STAT_STALLED_CYCLES_FRONT, - map_idx, count, &rsd); + aggr_idx, count, &rsd); else if (evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_BACKEND)) update_runtime_stat(STAT_STALLED_CYCLES_BACK, - map_idx, count, &rsd); + aggr_idx, count, &rsd); else if (evsel__match(counter, HARDWARE, HW_BRANCH_INSTRUCTIONS)) - update_runtime_stat(STAT_BRANCHES, map_idx, count, &rsd); + update_runtime_stat(STAT_BRANCHES, aggr_idx, count, &rsd); else if (evsel__match(counter, HARDWARE, HW_CACHE_REFERENCES)) - update_runtime_stat(STAT_CACHEREFS, map_idx, count, &rsd); + update_runtime_stat(STAT_CACHEREFS, aggr_idx, count, &rsd); else if (evsel__match(counter, HW_CACHE, HW_CACHE_L1D)) - update_runtime_stat(STAT_L1_DCACHE, map_idx, count, &rsd); + update_runtime_stat(STAT_L1_DCACHE, aggr_idx, count, &rsd); else if (evsel__match(counter, HW_CACHE, HW_CACHE_L1I)) - update_runtime_stat(STAT_L1_ICACHE, map_idx, count, &rsd); + update_runtime_stat(STAT_L1_ICACHE, aggr_idx, count, &rsd); else if (evsel__match(counter, HW_CACHE, HW_CACHE_LL)) - update_runtime_stat(STAT_LL_CACHE, map_idx, count, &rsd); + update_runtime_stat(STAT_LL_CACHE, aggr_idx, count, &rsd); else if (evsel__match(counter, HW_CACHE, HW_CACHE_DTLB)) - update_runtime_stat(STAT_DTLB_CACHE, map_idx, count, &rsd); + update_runtime_stat(STAT_DTLB_CACHE, aggr_idx, count, &rsd); else if (evsel__match(counter, HW_CACHE, HW_CACHE_ITLB)) - update_runtime_stat(STAT_ITLB_CACHE, map_idx, count, &rsd); + update_runtime_stat(STAT_ITLB_CACHE, aggr_idx, count, &rsd); if (counter->collect_stat) { - v = saved_value_lookup(counter, map_idx, true, STAT_NONE, 0, + v = saved_value_lookup(counter, aggr_idx, true, STAT_NONE, 0, rsd.cgrp); update_stats(&v->stats, count); if (counter->metric_leader) v->metric_total += count; } else if (counter->metric_leader && !counter->merged_stat) { v = saved_value_lookup(counter->metric_leader, - map_idx, true, STAT_NONE, 0, rsd.cgrp); + aggr_idx, true, STAT_NONE, 0, rsd.cgrp); v->metric_total += count; v->metric_other++; } @@ -307,24 +307,24 @@ static const char *get_ratio_color(enum grc_type type, double ratio) return color; } -static double runtime_stat_avg(enum stat_type type, int map_idx, +static double runtime_stat_avg(enum stat_type type, int aggr_idx, struct runtime_stat_data *rsd) { struct saved_value *v; - v = saved_value_lookup(NULL, map_idx, false, type, rsd->ctx, rsd->cgrp); + v = saved_value_lookup(NULL, aggr_idx, false, type, rsd->ctx, rsd->cgrp); if (!v) return 0.0; return avg_stats(&v->stats); } -static double runtime_stat_n(enum stat_type type, int map_idx, +static double runtime_stat_n(enum stat_type type, int aggr_idx, struct runtime_stat_data *rsd) { struct saved_value *v; - v = saved_value_lookup(NULL, map_idx, false, type, rsd->ctx, rsd->cgrp); + v = saved_value_lookup(NULL, aggr_idx, false, type, rsd->ctx, rsd->cgrp); if (!v) return 0.0; @@ -332,14 +332,14 @@ static double runtime_stat_n(enum stat_type type, int map_idx, } static void print_stalled_cycles_frontend(struct perf_stat_config *config, - int map_idx, double avg, + int aggr_idx, double avg, struct perf_stat_output_ctx *out, struct runtime_stat_data *rsd) { double total, ratio = 0.0; const char *color; - total = runtime_stat_avg(STAT_CYCLES, map_idx, rsd); + total = runtime_stat_avg(STAT_CYCLES, aggr_idx, rsd); if (total) ratio = avg / total * 100.0; @@ -354,14 +354,14 @@ static void print_stalled_cycles_frontend(struct perf_stat_config *config, } static void print_stalled_cycles_backend(struct perf_stat_config *config, - int map_idx, double avg, + int aggr_idx, double avg, struct perf_stat_output_ctx *out, struct runtime_stat_data *rsd) { double total, ratio = 0.0; const char *color; - total = runtime_stat_avg(STAT_CYCLES, map_idx, rsd); + total = runtime_stat_avg(STAT_CYCLES, aggr_idx, rsd); if (total) ratio = avg / total * 100.0; @@ -372,14 +372,14 @@ static void print_stalled_cycles_backend(struct perf_stat_config *config, } static void print_branch_misses(struct perf_stat_config *config, - int map_idx, double avg, + int aggr_idx, double avg, struct perf_stat_output_ctx *out, struct runtime_stat_data *rsd) { double total, ratio = 0.0; const char *color; - total = runtime_stat_avg(STAT_BRANCHES, map_idx, rsd); + total = runtime_stat_avg(STAT_BRANCHES, aggr_idx, rsd); if (total) ratio = avg / total * 100.0; @@ -390,14 +390,14 @@ static void print_branch_misses(struct perf_stat_config *config, } static void print_l1_dcache_misses(struct perf_stat_config *config, - int map_idx, double avg, + int aggr_idx, double avg, struct perf_stat_output_ctx *out, struct runtime_stat_data *rsd) { double total, ratio = 0.0; const char *color; - total = runtime_stat_avg(STAT_L1_DCACHE, map_idx, rsd); + total = runtime_stat_avg(STAT_L1_DCACHE, aggr_idx, rsd); if (total) ratio = avg / total * 100.0; @@ -408,14 +408,14 @@ static void print_l1_dcache_misses(struct perf_stat_config *config, } static void print_l1_icache_misses(struct perf_stat_config *config, - int map_idx, double avg, + int aggr_idx, double avg, struct perf_stat_output_ctx *out, struct runtime_stat_data *rsd) { double total, ratio = 0.0; const char *color; - total = runtime_stat_avg(STAT_L1_ICACHE, map_idx, rsd); + total = runtime_stat_avg(STAT_L1_ICACHE, aggr_idx, rsd); if (total) ratio = avg / total * 100.0; @@ -425,14 +425,14 @@ static void print_l1_icache_misses(struct perf_stat_config *config, } static void print_dtlb_cache_misses(struct perf_stat_config *config, - int map_idx, double avg, + int aggr_idx, double avg, struct perf_stat_output_ctx *out, struct runtime_stat_data *rsd) { double total, ratio = 0.0; const char *color; - total = runtime_stat_avg(STAT_DTLB_CACHE, map_idx, rsd); + total = runtime_stat_avg(STAT_DTLB_CACHE, aggr_idx, rsd); if (total) ratio = avg / total * 100.0; @@ -442,14 +442,14 @@ static void print_dtlb_cache_misses(struct perf_stat_config *config, } static void print_itlb_cache_misses(struct perf_stat_config *config, - int map_idx, double avg, + int aggr_idx, double avg, struct perf_stat_output_ctx *out, struct runtime_stat_data *rsd) { double total, ratio = 0.0; const char *color; - total = runtime_stat_avg(STAT_ITLB_CACHE, map_idx, rsd); + total = runtime_stat_avg(STAT_ITLB_CACHE, aggr_idx, rsd); if (total) ratio = avg / total * 100.0; @@ -459,14 +459,14 @@ static void print_itlb_cache_misses(struct perf_stat_config *config, } static void print_ll_cache_misses(struct perf_stat_config *config, - int map_idx, double avg, + int aggr_idx, double avg, struct perf_stat_output_ctx *out, struct runtime_stat_data *rsd) { double total, ratio = 0.0; const char *color; - total = runtime_stat_avg(STAT_LL_CACHE, map_idx, rsd); + total = runtime_stat_avg(STAT_LL_CACHE, aggr_idx, rsd); if (total) ratio = avg / total * 100.0; @@ -478,7 +478,7 @@ static void print_ll_cache_misses(struct perf_stat_config *config, static int prepare_metric(struct evsel **metric_events, struct metric_ref *metric_refs, struct expr_parse_ctx *pctx, - int map_idx) + int aggr_idx) { double scale; char *n; @@ -516,7 +516,7 @@ static int prepare_metric(struct evsel **metric_events, abort(); } } else { - v = saved_value_lookup(metric_events[i], map_idx, false, + v = saved_value_lookup(metric_events[i], aggr_idx, false, STAT_NONE, 0, metric_events[i]->cgrp); if (!v) @@ -560,7 +560,7 @@ static void generic_metric(struct perf_stat_config *config, const char *metric_name, const char *metric_unit, int runtime, - int map_idx, + int aggr_idx, struct perf_stat_output_ctx *out) { print_metric_t print_metric = out->print_metric; @@ -578,7 +578,7 @@ static void generic_metric(struct perf_stat_config *config, pctx->sctx.user_requested_cpu_list = strdup(config->user_requested_cpu_list); pctx->sctx.runtime = runtime; pctx->sctx.system_wide = config->system_wide; - i = prepare_metric(metric_events, metric_refs, pctx, map_idx); + i = prepare_metric(metric_events, metric_refs, pctx, aggr_idx); if (i < 0) { expr__ctx_free(pctx); return; @@ -630,7 +630,7 @@ static void generic_metric(struct perf_stat_config *config, expr__ctx_free(pctx); } -double test_generic_metric(struct metric_expr *mexp, int map_idx) +double test_generic_metric(struct metric_expr *mexp, int aggr_idx) { struct expr_parse_ctx *pctx; double ratio = 0.0; @@ -639,7 +639,7 @@ double test_generic_metric(struct metric_expr *mexp, int map_idx) if (!pctx) return NAN; - if (prepare_metric(mexp->metric_events, mexp->metric_refs, pctx, map_idx) < 0) + if (prepare_metric(mexp->metric_events, mexp->metric_refs, pctx, aggr_idx) < 0) goto out; if (expr__parse(&ratio, pctx, mexp->metric_expr)) @@ -652,7 +652,7 @@ double test_generic_metric(struct metric_expr *mexp, int map_idx) void perf_stat__print_shadow_stats(struct perf_stat_config *config, struct evsel *evsel, - double avg, int map_idx, + double avg, int aggr_idx, struct perf_stat_output_ctx *out, struct rblist *metric_events) { @@ -669,7 +669,7 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, if (config->iostat_run) { iostat_print_metric(config, evsel, out); } else if (evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) { - total = runtime_stat_avg(STAT_CYCLES, map_idx, &rsd); + total = runtime_stat_avg(STAT_CYCLES, aggr_idx, &rsd); if (total) { ratio = avg / total; @@ -679,10 +679,10 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, print_metric(config, ctxp, NULL, NULL, "insn per cycle", 0); } - total = runtime_stat_avg(STAT_STALLED_CYCLES_FRONT, map_idx, &rsd); + total = runtime_stat_avg(STAT_STALLED_CYCLES_FRONT, aggr_idx, &rsd); total = max(total, runtime_stat_avg(STAT_STALLED_CYCLES_BACK, - map_idx, &rsd)); + aggr_idx, &rsd)); if (total && avg) { out->new_line(config, ctxp); @@ -692,8 +692,8 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, ratio); } } else if (evsel__match(evsel, HARDWARE, HW_BRANCH_MISSES)) { - if (runtime_stat_n(STAT_BRANCHES, map_idx, &rsd) != 0) - print_branch_misses(config, map_idx, avg, out, &rsd); + if (runtime_stat_n(STAT_BRANCHES, aggr_idx, &rsd) != 0) + print_branch_misses(config, aggr_idx, avg, out, &rsd); else print_metric(config, ctxp, NULL, NULL, "of all branches", 0); } else if ( @@ -702,8 +702,8 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, ((PERF_COUNT_HW_CACHE_OP_READ) << 8) | ((PERF_COUNT_HW_CACHE_RESULT_MISS) << 16))) { - if (runtime_stat_n(STAT_L1_DCACHE, map_idx, &rsd) != 0) - print_l1_dcache_misses(config, map_idx, avg, out, &rsd); + if (runtime_stat_n(STAT_L1_DCACHE, aggr_idx, &rsd) != 0) + print_l1_dcache_misses(config, aggr_idx, avg, out, &rsd); else print_metric(config, ctxp, NULL, NULL, "of all L1-dcache accesses", 0); } else if ( @@ -712,8 +712,8 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, ((PERF_COUNT_HW_CACHE_OP_READ) << 8) | ((PERF_COUNT_HW_CACHE_RESULT_MISS) << 16))) { - if (runtime_stat_n(STAT_L1_ICACHE, map_idx, &rsd) != 0) - print_l1_icache_misses(config, map_idx, avg, out, &rsd); + if (runtime_stat_n(STAT_L1_ICACHE, aggr_idx, &rsd) != 0) + print_l1_icache_misses(config, aggr_idx, avg, out, &rsd); else print_metric(config, ctxp, NULL, NULL, "of all L1-icache accesses", 0); } else if ( @@ -722,8 +722,8 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, ((PERF_COUNT_HW_CACHE_OP_READ) << 8) | ((PERF_COUNT_HW_CACHE_RESULT_MISS) << 16))) { - if (runtime_stat_n(STAT_DTLB_CACHE, map_idx, &rsd) != 0) - print_dtlb_cache_misses(config, map_idx, avg, out, &rsd); + if (runtime_stat_n(STAT_DTLB_CACHE, aggr_idx, &rsd) != 0) + print_dtlb_cache_misses(config, aggr_idx, avg, out, &rsd); else print_metric(config, ctxp, NULL, NULL, "of all dTLB cache accesses", 0); } else if ( @@ -732,8 +732,8 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, ((PERF_COUNT_HW_CACHE_OP_READ) << 8) | ((PERF_COUNT_HW_CACHE_RESULT_MISS) << 16))) { - if (runtime_stat_n(STAT_ITLB_CACHE, map_idx, &rsd) != 0) - print_itlb_cache_misses(config, map_idx, avg, out, &rsd); + if (runtime_stat_n(STAT_ITLB_CACHE, aggr_idx, &rsd) != 0) + print_itlb_cache_misses(config, aggr_idx, avg, out, &rsd); else print_metric(config, ctxp, NULL, NULL, "of all iTLB cache accesses", 0); } else if ( @@ -742,27 +742,27 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, ((PERF_COUNT_HW_CACHE_OP_READ) << 8) | ((PERF_COUNT_HW_CACHE_RESULT_MISS) << 16))) { - if (runtime_stat_n(STAT_LL_CACHE, map_idx, &rsd) != 0) - print_ll_cache_misses(config, map_idx, avg, out, &rsd); + if (runtime_stat_n(STAT_LL_CACHE, aggr_idx, &rsd) != 0) + print_ll_cache_misses(config, aggr_idx, avg, out, &rsd); else print_metric(config, ctxp, NULL, NULL, "of all LL-cache accesses", 0); } else if (evsel__match(evsel, HARDWARE, HW_CACHE_MISSES)) { - total = runtime_stat_avg(STAT_CACHEREFS, map_idx, &rsd); + total = runtime_stat_avg(STAT_CACHEREFS, aggr_idx, &rsd); if (total) ratio = avg * 100 / total; - if (runtime_stat_n(STAT_CACHEREFS, map_idx, &rsd) != 0) + if (runtime_stat_n(STAT_CACHEREFS, aggr_idx, &rsd) != 0) print_metric(config, ctxp, NULL, "%8.3f %%", "of all cache refs", ratio); else print_metric(config, ctxp, NULL, NULL, "of all cache refs", 0); } else if (evsel__match(evsel, HARDWARE, HW_STALLED_CYCLES_FRONTEND)) { - print_stalled_cycles_frontend(config, map_idx, avg, out, &rsd); + print_stalled_cycles_frontend(config, aggr_idx, avg, out, &rsd); } else if (evsel__match(evsel, HARDWARE, HW_STALLED_CYCLES_BACKEND)) { - print_stalled_cycles_backend(config, map_idx, avg, out, &rsd); + print_stalled_cycles_backend(config, aggr_idx, avg, out, &rsd); } else if (evsel__match(evsel, HARDWARE, HW_CPU_CYCLES)) { - total = runtime_stat_avg(STAT_NSECS, map_idx, &rsd); + total = runtime_stat_avg(STAT_NSECS, aggr_idx, &rsd); if (total) { ratio = avg / total; @@ -776,11 +776,11 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, avg / (ratio * evsel->scale)); else print_metric(config, ctxp, NULL, NULL, "CPUs utilized", 0); - } else if (runtime_stat_n(STAT_NSECS, map_idx, &rsd) != 0) { + } else if (runtime_stat_n(STAT_NSECS, aggr_idx, &rsd) != 0) { char unit = ' '; char unit_buf[10] = "/sec"; - total = runtime_stat_avg(STAT_NSECS, map_idx, &rsd); + total = runtime_stat_avg(STAT_NSECS, aggr_idx, &rsd); if (total) ratio = convert_unit_double(1000000000.0 * avg / total, &unit); @@ -800,7 +800,7 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, generic_metric(config, mexp->metric_expr, mexp->metric_threshold, mexp->metric_events, mexp->metric_refs, evsel->name, mexp->metric_name, mexp->metric_unit, mexp->runtime, - map_idx, out); + aggr_idx, out); } } if (num == 0) diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c index 0d7538670d67..83dc4c1f4b12 100644 --- a/tools/perf/util/stat.c +++ b/tools/perf/util/stat.c @@ -651,15 +651,15 @@ void perf_stat_process_percore(struct perf_stat_config *config, struct evlist *e static void evsel__update_shadow_stats(struct evsel *evsel) { struct perf_stat_evsel *ps = evsel->stats; - int i; + int aggr_idx; if (ps->aggr == NULL) return; - for (i = 0; i < ps->nr_aggr; i++) { - struct perf_counts_values *aggr_counts = &ps->aggr[i].counts; + for (aggr_idx = 0; aggr_idx < ps->nr_aggr; aggr_idx++) { + struct perf_counts_values *aggr_counts = &ps->aggr[aggr_idx].counts; - perf_stat__update_shadow_stats(evsel, aggr_counts->val, i); + perf_stat__update_shadow_stats(evsel, aggr_counts->val, aggr_idx); } } diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h index 09975e098bd0..b01c828c3799 100644 --- a/tools/perf/util/stat.h +++ b/tools/perf/util/stat.h @@ -160,7 +160,7 @@ typedef void (*new_line_t)(struct perf_stat_config *config, void *ctx); void perf_stat__init_shadow_stats(void); void perf_stat__reset_shadow_stats(void); void perf_stat__reset_shadow_per_stat(void); -void perf_stat__update_shadow_stats(struct evsel *counter, u64 count, int map_idx); +void perf_stat__update_shadow_stats(struct evsel *counter, u64 count, int aggr_idx); struct perf_stat_output_ctx { void *ctx; print_metric_t print_metric; @@ -170,7 +170,7 @@ struct perf_stat_output_ctx { void perf_stat__print_shadow_stats(struct perf_stat_config *config, struct evsel *evsel, - double avg, int map_idx, + double avg, int aggr_idx, struct perf_stat_output_ctx *out, struct rblist *metric_events); @@ -211,5 +211,5 @@ void evlist__print_counters(struct evlist *evlist, struct perf_stat_config *conf struct target *_target, struct timespec *ts, int argc, const char **argv); struct metric_expr; -double test_generic_metric(struct metric_expr *mexp, int map_idx); +double test_generic_metric(struct metric_expr *mexp, int aggr_idx); #endif From patchwork Sun Feb 19 09:28:46 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59131 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp776652wrn; Sun, 19 Feb 2023 01:49:24 -0800 (PST) X-Google-Smtp-Source: AK7set/IL+k6jw33YQAWeVV4aKSaxrkTojqf6MlwACh/WTih/40W2HWrxzXahCegmnx3FsEG2B7r X-Received: by 2002:a17:90b:1d05:b0:232:edb6:9710 with SMTP id on5-20020a17090b1d0500b00232edb69710mr373973pjb.17.1676800164579; Sun, 19 Feb 2023 01:49:24 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676800164; cv=none; d=google.com; s=arc-20160816; b=Lwu9cQD6FyVhxfrq4bWrBj5RSES1PFdqRklFvNVNoT2vSKBVwcFDGDez/EDjKs40fa nrREtvKxFPy3vqpRKvjW3yOtsEKvVpj2Se+5lTF0YIGmSzWVFmmUPzmAFXVGO2KMM5BQ NLSrWFzPxAyHGYXl7+FzXSnTNgSjvHXNN35pqpodvyuX8vVaWCM67Wy6pCR5Cq2HQVax KBjokwgJ/FwHSB8ByHo8S9wYT1IRbxheXcXKI16sAdIhXoOilCH0gpNCtQbGK1q0YfdE UKHJthP+lKtH+g0dRJUc5kqWynpRmZ4sCYmFvcOAxpbO6H/X6YwRph4tc1+jJaqA4aN8 j5+Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=c9cwoCmrhGIfqHNT3FoQmyx1P/u1tZpxQLcinvW6P6c=; b=oY/htdEfFFkEn82eR3qtj0FpS3XEYw9/MteO2qb2sXz0eHABqlF5S+AhQR5rosj5f8 3JbaB1SM5aLwCsX1A4ll4PA5Pi21eHCtWzFqVUJA1SU0LVa3VSp03e445SrhU3WrVG39 QBHd8iAMhXEf7hpFP3PV9XGfmKkmcirKL4l38g02pSHKo+iOmACpo1hIH76tZYLR29u6 6E+UhTIs3WfF/nFAqWVEjXfiK+p5dI0IZItLbkG+GwiXC32Foje5ivftbr0eQ54bRIBN z2Iuu+duDDWsmn1TtHuCl4mw9qFBqGUzLICNvk8HMNyp4z825uKWyALlVNfYOLF+5CaD 83dg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=KZFTskEw; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id h9-20020a17090a648900b002366fbaaa5fsi4173731pjj.121.2023.02.19.01.49.12; Sun, 19 Feb 2023 01:49:24 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=KZFTskEw; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230092AbjBSJjd (ORCPT + 99 others); Sun, 19 Feb 2023 04:39:33 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34422 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230089AbjBSJj0 (ORCPT ); Sun, 19 Feb 2023 04:39:26 -0500 Received: from mail-pj1-f74.google.com (mail-pj1-f74.google.com [209.85.216.74]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0169B12068 for ; Sun, 19 Feb 2023 01:38:27 -0800 (PST) Received: by mail-pj1-f74.google.com with SMTP id y15-20020a17090a104f00b002311a9f6b3cso183449pjd.6 for ; Sun, 19 Feb 2023 01:38:27 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=c9cwoCmrhGIfqHNT3FoQmyx1P/u1tZpxQLcinvW6P6c=; b=KZFTskEwL+dr/8fy74EfeXIcp2au/6QZJ3gGYkGbfhDs/uzHGlgkQlCyOtQqbhL3xY Jy1NQsoj7VKTWKAE/EitP649Ng/7FgcDlXOnxI1iU1cBlsH6DgyFo++fEl+UBe/E/+W5 6hC9CawjtpR0vb/YU+KI0IOuX1SnuuwXheTWvrToMTgJNwtFIZ16uN4w91EFgClU/Kzn OEa/fXjKSNwohQfBZOG6jV6XNsX9hHU7FRETX0HMhyg1Hxjj/M+tT5z24TCVCzcXtwH3 gj4FZtCyVBNx3v/vbbFuafTfVzEK/pt0egT6W2xLL818ez3k8/S5kDDMY5D9Auucjcbq xwzg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=c9cwoCmrhGIfqHNT3FoQmyx1P/u1tZpxQLcinvW6P6c=; b=fVSyhGPRMZ+Q6R7mTrGLK088YAvbYDjy0HH093DVpBwz8iBXrtmDeaT+Eg9AScGtdz JyoXGW9RWbMVrQfdZ1ZDEPVSnDYnYXF6dxWe5TPVYYLO4NIs7hkbGvjR3BxcF4dsso+I Lof5vx5SmWJwhRUFMIoa9/oYucc5689nclTXh4aMl3IEFDUM0cMnIDsPcEfHPoXzjLdE 8l0IRpBT5YxAyoooH/1l7zVbw6635SA0/Mb5yiPsB7C4WOQgrsqRevJiza1AqsJIZekJ 7fZzc/WVN3iNrXU1GGHcTVUsAWfmUKysFYC3x5p783PUFevPvsZDdjPT5mkNYPzuDX8m CpCQ== X-Gm-Message-State: AO0yUKXwPqBTPcYiDH4wGtk0p4tQQnkOXaSb0MvfndzKybQAlBjrt44M mlchAWQerGAbqdn6GnhJBCVTpXj1iZ8W X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a17:902:ab0c:b0:198:fd60:df2c with SMTP id ik12-20020a170902ab0c00b00198fd60df2cmr28114plb.11.1676799353893; Sun, 19 Feb 2023 01:35:53 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:46 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-50-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 49/51] perf metric: Directly use counts rather than saved_value From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_PASS, USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758252409579416978?= X-GMAIL-MSGID: =?utf-8?q?1758252409579416978?= Bugs with double aggregation have been introduced because of aggregation of counters and again with saved_value. Remove the generic metric use-case. Update parse-metric and pmu-events tests to update aggregate rather than saved_value counts. Signed-off-by: Ian Rogers --- tools/perf/tests/parse-metric.c | 4 +-- tools/perf/tests/pmu-events.c | 4 +-- tools/perf/util/stat-shadow.c | 56 +++++++++++---------------------- 3 files changed, 23 insertions(+), 41 deletions(-) diff --git a/tools/perf/tests/parse-metric.c b/tools/perf/tests/parse-metric.c index 37e3371d978e..b9b8a48289c4 100644 --- a/tools/perf/tests/parse-metric.c +++ b/tools/perf/tests/parse-metric.c @@ -35,10 +35,10 @@ static void load_runtime_stat(struct evlist *evlist, struct value *vals) struct evsel *evsel; u64 count; - perf_stat__reset_shadow_stats(); + evlist__alloc_aggr_stats(evlist, 1); evlist__for_each_entry(evlist, evsel) { count = find_value(evsel->name, vals); - perf_stat__update_shadow_stats(evsel, count, 0); + evsel->stats->aggr->counts.val = count; if (!strcmp(evsel->name, "duration_time")) update_stats(&walltime_nsecs_stats, count); } diff --git a/tools/perf/tests/pmu-events.c b/tools/perf/tests/pmu-events.c index 122e74c282a7..4ec2a4ca1a82 100644 --- a/tools/perf/tests/pmu-events.c +++ b/tools/perf/tests/pmu-events.c @@ -863,9 +863,9 @@ static int test__parsing_callback(const struct pmu_metric *pm, * zero when subtracted and so try to make them unique. */ k = 1; - perf_stat__reset_shadow_stats(); + evlist__alloc_aggr_stats(evlist, 1); evlist__for_each_entry(evlist, evsel) { - perf_stat__update_shadow_stats(evsel, k, 0); + evsel->stats->aggr->counts.val = k; if (!strcmp(evsel->name, "duration_time")) update_stats(&walltime_nsecs_stats, k); k++; diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c index 7b48e2bd3ba1..eba98520cea2 100644 --- a/tools/perf/util/stat-shadow.c +++ b/tools/perf/util/stat-shadow.c @@ -234,7 +234,6 @@ void perf_stat__update_shadow_stats(struct evsel *counter, u64 count, int aggr_idx) { u64 count_ns = count; - struct saved_value *v; struct runtime_stat_data rsd = { .ctx = evsel_context(counter), .cgrp = counter->cgrp, @@ -265,19 +264,6 @@ void perf_stat__update_shadow_stats(struct evsel *counter, u64 count, update_runtime_stat(STAT_DTLB_CACHE, aggr_idx, count, &rsd); else if (evsel__match(counter, HW_CACHE, HW_CACHE_ITLB)) update_runtime_stat(STAT_ITLB_CACHE, aggr_idx, count, &rsd); - - if (counter->collect_stat) { - v = saved_value_lookup(counter, aggr_idx, true, STAT_NONE, 0, - rsd.cgrp); - update_stats(&v->stats, count); - if (counter->metric_leader) - v->metric_total += count; - } else if (counter->metric_leader && !counter->merged_stat) { - v = saved_value_lookup(counter->metric_leader, - aggr_idx, true, STAT_NONE, 0, rsd.cgrp); - v->metric_total += count; - v->metric_other++; - } } /* used for get_ratio_color() */ @@ -480,18 +466,17 @@ static int prepare_metric(struct evsel **metric_events, struct expr_parse_ctx *pctx, int aggr_idx) { - double scale; - char *n; - int i, j, ret; + int i; for (i = 0; metric_events[i]; i++) { - struct saved_value *v; - struct stats *stats; - u64 metric_total = 0; - int source_count; + char *n; + double val; + int source_count = 0; if (evsel__is_tool(metric_events[i])) { - source_count = 1; + struct stats *stats; + double scale; + switch (metric_events[i]->tool_event) { case PERF_TOOL_DURATION_TIME: stats = &walltime_nsecs_stats; @@ -515,35 +500,32 @@ static int prepare_metric(struct evsel **metric_events, pr_err("Unknown tool event '%s'", evsel__name(metric_events[i])); abort(); } + val = avg_stats(stats) * scale; + source_count = 1; } else { - v = saved_value_lookup(metric_events[i], aggr_idx, false, - STAT_NONE, 0, - metric_events[i]->cgrp); - if (!v) + struct perf_stat_evsel *ps = metric_events[i]->stats; + struct perf_stat_aggr *aggr = &ps->aggr[aggr_idx]; + + if (!aggr) break; - stats = &v->stats; + /* * If an event was scaled during stat gathering, reverse * the scale before computing the metric. */ - scale = 1.0 / metric_events[i]->scale; - + val = aggr->counts.val * (1.0 / metric_events[i]->scale); source_count = evsel__source_count(metric_events[i]); - - if (v->metric_other) - metric_total = v->metric_total * scale; } n = strdup(evsel__metric_id(metric_events[i])); if (!n) return -ENOMEM; - expr__add_id_val_source_count(pctx, n, - metric_total ? : avg_stats(stats) * scale, - source_count); + expr__add_id_val_source_count(pctx, n, val, source_count); } - for (j = 0; metric_refs && metric_refs[j].metric_name; j++) { - ret = expr__add_ref(pctx, &metric_refs[j]); + for (int j = 0; metric_refs && metric_refs[j].metric_name; j++) { + int ret = expr__add_ref(pctx, &metric_refs[j]); + if (ret) return ret; } From patchwork Sun Feb 19 09:28:47 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59129 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp776374wrn; Sun, 19 Feb 2023 01:48:22 -0800 (PST) X-Google-Smtp-Source: AK7set+Hlz3DzK5jwwRbaWzU/kJ8KfVx19Vbg1dZIJKEPaY2s7BGbWScJu1vqoMGfK8RmqM491e7 X-Received: by 2002:a17:902:e80d:b0:19a:59d1:389e with SMTP id u13-20020a170902e80d00b0019a59d1389emr227211plg.23.1676800102123; Sun, 19 Feb 2023 01:48:22 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676800102; cv=none; d=google.com; s=arc-20160816; b=eRfzfpY3U1F+4P7uWN0uYwkyOxVi2IvVU9DKyf026QNwVlQJhdKnMcrBU6GZLgB1/3 sQFZWMed3polMKsTZJ7sVDe+CmLQJkOVc+gRJmasHN55W1B/9TP6Chb/skP7TR5k9tvo 7FjO5DJsUKdp6mMZB0p5S5uUuxWIXtDnvqDLtMUTiKC1RRn5PoJi/HK5nHR9TO2t/LZW eWVMgt+VEn4f9eF6D23LChUSzovVy7Eb01Y6+dsfP8+C/2N+6p5bwhana98E+5gw4Dya WMILre9swp1ZG3GVLGPnnz8E3u6i3ZDxe0V6Ku+3VR+636GoH7LbI10t4pbM50wR3UyU d/bQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=GaAtWKR9dpkXeAR2EgXOT4Jdr/WSY94NBCFSvIpDY1I=; b=dPQ68efeqiGsIYXdUuhg+El583ORy50W40qHA9afks/yXYzK/rIWOp9KRPqzWVqDgN aJDPkCepfKkRsywKx63Te/7fik/6rihySDOPFRgvRTrxdQ1TefV0IKEZQTr2GkHMoBW+ +D4y0sOTkTIcawr0wjrbmx0uoV16vQs7D6GVxeHPGMMZE5HI2MEVuAbdNaCl0K8mdvGC y34OoVEslFydHSDwoJmwO63RNr2Q/QOb1wwnkpObDOY0DHdB9il5UfnWHg17MicQbzWP NncSpw04uS3fgjAGc5Mg9FLXWily3WpcRGOXxn0cjMg0qmuvQN7cJLa5wDT7QgKrK+Jb Nifw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=VylUNAn6; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id ku6-20020a170903288600b0019a6ca56219si5196262plb.313.2023.02.19.01.48.10; Sun, 19 Feb 2023 01:48:22 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=VylUNAn6; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229751AbjBSJjE (ORCPT + 99 others); Sun, 19 Feb 2023 04:39:04 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33858 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230054AbjBSJjA (ORCPT ); Sun, 19 Feb 2023 04:39:00 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D087D59E3 for ; Sun, 19 Feb 2023 01:38:07 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id k131-20020a256f89000000b0074747131938so226261ybc.12 for ; Sun, 19 Feb 2023 01:38:07 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=GaAtWKR9dpkXeAR2EgXOT4Jdr/WSY94NBCFSvIpDY1I=; b=VylUNAn6gHel5Js0dIwjbwarkGvBURpX13yLOd3CW4OTRSkRMUf8XHI2zSpfzbjBTl SB510gQdp40gx0uSZZLIgNNz4Kd88eaHe+lh/k2K9tn4l3KLtbeqrchgwWNAhiazTwQd Br7tiZX8fDhfDJ0Iq88Ivllx+3qI5YH0zvu/svcdpjZ+dbPcW5Wtvl4LbHk7hp38hz9C M0sCu7FAudJGP8OO1fJYzQVounuONNh5bliMIXH5XMWyetj1YHZsu5e4XHxoyfX1tq// 75Z0Ygyq+8zK+P+XGK8lDR+F98LErBncmwBRFxwIQT5kmERsmDNZl+uUm2HDhfD5Bwu/ 7Mhg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=GaAtWKR9dpkXeAR2EgXOT4Jdr/WSY94NBCFSvIpDY1I=; b=mFtFl5mFVKveobuxkt6VNXqPv8+QAxMk+Yeb3AVroUoQibbisOn2Jg7NVbubxNv/13 DfJ8GwH0lASkw4BnhYYZOZ1UTHtCo4UFiJgJxAraJswI97j53p2zouzDFU+wpmNMwTxg xPUBIanri5mTbGBGaCbIC56HJngy0tIkful9R4HJ28cleGYhhva4RMixmlCkpxHiO00V i9H7l03YsNnuXhwO3rhAbWwZqmCqHPHcG4GIhdOzRTlx+XpOUMKmgTUzdLP04JehQuQH DKhj6FC5xNxkRQzwbXddgaKMk7v+jTx6H8WN8U5lixkV0rnrqgjDpptH+gOXNl4f+Zc1 MnuA== X-Gm-Message-State: AO0yUKXx967UPB3d2GyPh2UAli4vSfXD2pLXI/bxkZOYhDDYECJoPp5j FIXH/dTRh9yMY1aEmdgt6oiXzS7gPsB5 X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a05:6902:11cd:b0:90d:a6c0:6870 with SMTP id n13-20020a05690211cd00b0090da6c06870mr391638ybu.2.1676799362330; Sun, 19 Feb 2023 01:36:02 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:47 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-51-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 50/51] perf stat: Use counts rather than saved_value From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758252344012122561?= X-GMAIL-MSGID: =?utf-8?q?1758252344012122561?= Switch the hard coded metrics to use the aggregate value rather than from saved_value. When computing a metric like IPC the aggregate count comes from instructions then cycles is looked up and if present IPC computed. Rather than lookup from the saved_value rbtree, search the counter's evlist for the desired counter. A new helper evsel__stat_type is used to both quickly find a metric function and to identify when a counter is the one being sought. So that both total and miss counts can be sought, the stat_type enum is expanded. The ratio functions are rewritten to share a common helper with the ratios being directly passed rather than computed from an enum value. Signed-off-by: Ian Rogers --- tools/perf/util/evsel.h | 2 +- tools/perf/util/stat-shadow.c | 534 +++++++++++++++++----------------- 2 files changed, 270 insertions(+), 266 deletions(-) diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h index 24cb807ef6ce..814a49ebb7e3 100644 --- a/tools/perf/util/evsel.h +++ b/tools/perf/util/evsel.h @@ -436,7 +436,7 @@ static inline bool evsel__is_bpf_output(struct evsel *evsel) return evsel__match(evsel, SOFTWARE, SW_BPF_OUTPUT); } -static inline bool evsel__is_clock(struct evsel *evsel) +static inline bool evsel__is_clock(const struct evsel *evsel) { return evsel__match(evsel, SOFTWARE, SW_CPU_CLOCK) || evsel__match(evsel, SOFTWARE, SW_TASK_CLOCK); diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c index eba98520cea2..9d22cde09dc9 100644 --- a/tools/perf/util/stat-shadow.c +++ b/tools/perf/util/stat-shadow.c @@ -45,15 +45,23 @@ enum stat_type { STAT_NONE = 0, STAT_NSECS, STAT_CYCLES, + STAT_INSTRUCTIONS, STAT_STALLED_CYCLES_FRONT, STAT_STALLED_CYCLES_BACK, STAT_BRANCHES, - STAT_CACHEREFS, + STAT_BRANCH_MISS, + STAT_CACHE_REFS, + STAT_CACHE_MISSES, STAT_L1_DCACHE, STAT_L1_ICACHE, STAT_LL_CACHE, STAT_ITLB_CACHE, STAT_DTLB_CACHE, + STAT_L1D_MISS, + STAT_L1I_MISS, + STAT_LL_MISS, + STAT_DTLB_MISS, + STAT_ITLB_MISS, STAT_MAX }; @@ -168,7 +176,7 @@ void perf_stat__init_shadow_stats(void) rblist->node_delete = saved_value_delete; } -static int evsel_context(struct evsel *evsel) +static int evsel_context(const struct evsel *evsel) { int ctx = 0; @@ -253,7 +261,7 @@ void perf_stat__update_shadow_stats(struct evsel *counter, u64 count, else if (evsel__match(counter, HARDWARE, HW_BRANCH_INSTRUCTIONS)) update_runtime_stat(STAT_BRANCHES, aggr_idx, count, &rsd); else if (evsel__match(counter, HARDWARE, HW_CACHE_REFERENCES)) - update_runtime_stat(STAT_CACHEREFS, aggr_idx, count, &rsd); + update_runtime_stat(STAT_CACHE_REFS, aggr_idx, count, &rsd); else if (evsel__match(counter, HW_CACHE, HW_CACHE_L1D)) update_runtime_stat(STAT_L1_DCACHE, aggr_idx, count, &rsd); else if (evsel__match(counter, HW_CACHE, HW_CACHE_L1I)) @@ -266,199 +274,283 @@ void perf_stat__update_shadow_stats(struct evsel *counter, u64 count, update_runtime_stat(STAT_ITLB_CACHE, aggr_idx, count, &rsd); } -/* used for get_ratio_color() */ -enum grc_type { - GRC_STALLED_CYCLES_FE, - GRC_STALLED_CYCLES_BE, - GRC_CACHE_MISSES, - GRC_MAX_NR -}; +static enum stat_type evsel__stat_type(const struct evsel *evsel) +{ + /* Fake perf_hw_cache_op_id values for use with evsel__match. */ + u64 PERF_COUNT_hw_cache_l1d_miss = PERF_COUNT_HW_CACHE_L1D | + ((PERF_COUNT_HW_CACHE_OP_READ) << 8) | + ((PERF_COUNT_HW_CACHE_RESULT_MISS) << 16); + u64 PERF_COUNT_hw_cache_l1i_miss = PERF_COUNT_HW_CACHE_L1I | + ((PERF_COUNT_HW_CACHE_OP_READ) << 8) | + ((PERF_COUNT_HW_CACHE_RESULT_MISS) << 16); + u64 PERF_COUNT_hw_cache_ll_miss = PERF_COUNT_HW_CACHE_LL | + ((PERF_COUNT_HW_CACHE_OP_READ) << 8) | + ((PERF_COUNT_HW_CACHE_RESULT_MISS) << 16); + u64 PERF_COUNT_hw_cache_dtlb_miss = PERF_COUNT_HW_CACHE_DTLB | + ((PERF_COUNT_HW_CACHE_OP_READ) << 8) | + ((PERF_COUNT_HW_CACHE_RESULT_MISS) << 16); + u64 PERF_COUNT_hw_cache_itlb_miss = PERF_COUNT_HW_CACHE_ITLB | + ((PERF_COUNT_HW_CACHE_OP_READ) << 8) | + ((PERF_COUNT_HW_CACHE_RESULT_MISS) << 16); + + if (evsel__is_clock(evsel)) + return STAT_NSECS; + else if (evsel__match(evsel, HARDWARE, HW_CPU_CYCLES)) + return STAT_CYCLES; + else if (evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) + return STAT_INSTRUCTIONS; + else if (evsel__match(evsel, HARDWARE, HW_STALLED_CYCLES_FRONTEND)) + return STAT_STALLED_CYCLES_FRONT; + else if (evsel__match(evsel, HARDWARE, HW_STALLED_CYCLES_BACKEND)) + return STAT_STALLED_CYCLES_BACK; + else if (evsel__match(evsel, HARDWARE, HW_BRANCH_INSTRUCTIONS)) + return STAT_BRANCHES; + else if (evsel__match(evsel, HARDWARE, HW_BRANCH_MISSES)) + return STAT_BRANCH_MISS; + else if (evsel__match(evsel, HARDWARE, HW_CACHE_REFERENCES)) + return STAT_CACHE_REFS; + else if (evsel__match(evsel, HARDWARE, HW_CACHE_MISSES)) + return STAT_CACHE_MISSES; + else if (evsel__match(evsel, HW_CACHE, HW_CACHE_L1D)) + return STAT_L1_DCACHE; + else if (evsel__match(evsel, HW_CACHE, HW_CACHE_L1I)) + return STAT_L1_ICACHE; + else if (evsel__match(evsel, HW_CACHE, HW_CACHE_LL)) + return STAT_LL_CACHE; + else if (evsel__match(evsel, HW_CACHE, HW_CACHE_DTLB)) + return STAT_DTLB_CACHE; + else if (evsel__match(evsel, HW_CACHE, HW_CACHE_ITLB)) + return STAT_ITLB_CACHE; + else if (evsel__match(evsel, HW_CACHE, hw_cache_l1d_miss)) + return STAT_L1D_MISS; + else if (evsel__match(evsel, HW_CACHE, hw_cache_l1i_miss)) + return STAT_L1I_MISS; + else if (evsel__match(evsel, HW_CACHE, hw_cache_ll_miss)) + return STAT_LL_MISS; + else if (evsel__match(evsel, HW_CACHE, hw_cache_dtlb_miss)) + return STAT_DTLB_MISS; + else if (evsel__match(evsel, HW_CACHE, hw_cache_itlb_miss)) + return STAT_ITLB_MISS; + return STAT_NONE; +} -static const char *get_ratio_color(enum grc_type type, double ratio) +static const char *get_ratio_color(const double ratios[3], double val) { - static const double grc_table[GRC_MAX_NR][3] = { - [GRC_STALLED_CYCLES_FE] = { 50.0, 30.0, 10.0 }, - [GRC_STALLED_CYCLES_BE] = { 75.0, 50.0, 20.0 }, - [GRC_CACHE_MISSES] = { 20.0, 10.0, 5.0 }, - }; const char *color = PERF_COLOR_NORMAL; - if (ratio > grc_table[type][0]) + if (val > ratios[0]) color = PERF_COLOR_RED; - else if (ratio > grc_table[type][1]) + else if (val > ratios[1]) color = PERF_COLOR_MAGENTA; - else if (ratio > grc_table[type][2]) + else if (val > ratios[2]) color = PERF_COLOR_YELLOW; return color; } -static double runtime_stat_avg(enum stat_type type, int aggr_idx, - struct runtime_stat_data *rsd) +static double find_stat(const struct evsel *evsel, int aggr_idx, enum stat_type type) { - struct saved_value *v; - - v = saved_value_lookup(NULL, aggr_idx, false, type, rsd->ctx, rsd->cgrp); - if (!v) - return 0.0; - - return avg_stats(&v->stats); + const struct evsel *cur; + int evsel_ctx = evsel_context(evsel); + + evlist__for_each_entry(evsel->evlist, cur) { + struct perf_stat_aggr *aggr; + + /* Ignore the evsel that is being searched from. */ + if (evsel == cur) + continue; + + /* Ignore evsels that are part of different groups. */ + if (evsel->core.leader->nr_members && + evsel->core.leader != cur->core.leader) + continue; + /* Ignore evsels with mismatched modifiers. */ + if (evsel_ctx != evsel_context(cur)) + continue; + /* Ignore if not the cgroup we're looking for. */ + if (evsel->cgrp != cur->cgrp) + continue; + /* Ignore if not the stat we're looking for. */ + if (type != evsel__stat_type(cur)) + continue; + + aggr = &cur->stats->aggr[aggr_idx]; + if (type == STAT_NSECS) + return aggr->counts.val; + return aggr->counts.val * cur->scale; + } + return 0.0; } -static double runtime_stat_n(enum stat_type type, int aggr_idx, - struct runtime_stat_data *rsd) +static void print_ratio(struct perf_stat_config *config, + const struct evsel *evsel, int aggr_idx, + double numerator, struct perf_stat_output_ctx *out, + enum stat_type denominator_type, + const double color_ratios[3], const char *unit) { - struct saved_value *v; + double denominator = find_stat(evsel, aggr_idx, denominator_type); - v = saved_value_lookup(NULL, aggr_idx, false, type, rsd->ctx, rsd->cgrp); - if (!v) - return 0.0; + if (numerator && denominator) { + double ratio = numerator / denominator * 100.0; + const char *color = get_ratio_color(color_ratios, ratio); - return v->stats.n; + out->print_metric(config, out->ctx, color, "%7.2f%%", unit, ratio); + } else + out->print_metric(config, out->ctx, NULL, NULL, unit, 0); } -static void print_stalled_cycles_frontend(struct perf_stat_config *config, - int aggr_idx, double avg, - struct perf_stat_output_ctx *out, - struct runtime_stat_data *rsd) +static void print_stalled_cycles_front(struct perf_stat_config *config, + const struct evsel *evsel, + int aggr_idx, double stalled, + struct perf_stat_output_ctx *out) { - double total, ratio = 0.0; - const char *color; - - total = runtime_stat_avg(STAT_CYCLES, aggr_idx, rsd); - - if (total) - ratio = avg / total * 100.0; + static const double color_ratios[3] = {50.0, 30.0, 10.0}; - color = get_ratio_color(GRC_STALLED_CYCLES_FE, ratio); - - if (ratio) - out->print_metric(config, out->ctx, color, "%7.2f%%", "frontend cycles idle", - ratio); - else - out->print_metric(config, out->ctx, NULL, NULL, "frontend cycles idle", 0); + print_ratio(config, evsel, aggr_idx, stalled, out, STAT_CYCLES, color_ratios, + "frontend cycles idle"); } -static void print_stalled_cycles_backend(struct perf_stat_config *config, - int aggr_idx, double avg, - struct perf_stat_output_ctx *out, - struct runtime_stat_data *rsd) +static void print_stalled_cycles_back(struct perf_stat_config *config, + const struct evsel *evsel, + int aggr_idx, double stalled, + struct perf_stat_output_ctx *out) { - double total, ratio = 0.0; - const char *color; - - total = runtime_stat_avg(STAT_CYCLES, aggr_idx, rsd); - - if (total) - ratio = avg / total * 100.0; + static const double color_ratios[3] = {75.0, 50.0, 20.0}; - color = get_ratio_color(GRC_STALLED_CYCLES_BE, ratio); - - out->print_metric(config, out->ctx, color, "%7.2f%%", "backend cycles idle", ratio); + print_ratio(config, evsel, aggr_idx, stalled, out, STAT_CYCLES, color_ratios, + "backend cycles idle"); } -static void print_branch_misses(struct perf_stat_config *config, - int aggr_idx, double avg, - struct perf_stat_output_ctx *out, - struct runtime_stat_data *rsd) +static void print_branch_miss(struct perf_stat_config *config, + const struct evsel *evsel, + int aggr_idx, double misses, + struct perf_stat_output_ctx *out) { - double total, ratio = 0.0; - const char *color; - - total = runtime_stat_avg(STAT_BRANCHES, aggr_idx, rsd); - - if (total) - ratio = avg / total * 100.0; + static const double color_ratios[3] = {20.0, 10.0, 5.0}; - color = get_ratio_color(GRC_CACHE_MISSES, ratio); - - out->print_metric(config, out->ctx, color, "%7.2f%%", "of all branches", ratio); + print_ratio(config, evsel, aggr_idx, misses, out, STAT_BRANCHES, color_ratios, + "of all branches"); } -static void print_l1_dcache_misses(struct perf_stat_config *config, - int aggr_idx, double avg, - struct perf_stat_output_ctx *out, - struct runtime_stat_data *rsd) +static void print_l1d_miss(struct perf_stat_config *config, + const struct evsel *evsel, + int aggr_idx, double misses, + struct perf_stat_output_ctx *out) { - double total, ratio = 0.0; - const char *color; - - total = runtime_stat_avg(STAT_L1_DCACHE, aggr_idx, rsd); + static const double color_ratios[3] = {20.0, 10.0, 5.0}; - if (total) - ratio = avg / total * 100.0; + print_ratio(config, evsel, aggr_idx, misses, out, STAT_L1_DCACHE, color_ratios, + "of all L1-dcache accesses"); +} - color = get_ratio_color(GRC_CACHE_MISSES, ratio); +static void print_l1i_miss(struct perf_stat_config *config, + const struct evsel *evsel, + int aggr_idx, double misses, + struct perf_stat_output_ctx *out) +{ + static const double color_ratios[3] = {20.0, 10.0, 5.0}; - out->print_metric(config, out->ctx, color, "%7.2f%%", "of all L1-dcache accesses", ratio); + print_ratio(config, evsel, aggr_idx, misses, out, STAT_L1_ICACHE, color_ratios, + "of all L1-icache accesses"); } -static void print_l1_icache_misses(struct perf_stat_config *config, - int aggr_idx, double avg, - struct perf_stat_output_ctx *out, - struct runtime_stat_data *rsd) +static void print_ll_miss(struct perf_stat_config *config, + const struct evsel *evsel, + int aggr_idx, double misses, + struct perf_stat_output_ctx *out) { - double total, ratio = 0.0; - const char *color; + static const double color_ratios[3] = {20.0, 10.0, 5.0}; - total = runtime_stat_avg(STAT_L1_ICACHE, aggr_idx, rsd); + print_ratio(config, evsel, aggr_idx, misses, out, STAT_LL_CACHE, color_ratios, + "of all L1-icache accesses"); +} - if (total) - ratio = avg / total * 100.0; +static void print_dtlb_miss(struct perf_stat_config *config, + const struct evsel *evsel, + int aggr_idx, double misses, + struct perf_stat_output_ctx *out) +{ + static const double color_ratios[3] = {20.0, 10.0, 5.0}; - color = get_ratio_color(GRC_CACHE_MISSES, ratio); - out->print_metric(config, out->ctx, color, "%7.2f%%", "of all L1-icache accesses", ratio); + print_ratio(config, evsel, aggr_idx, misses, out, STAT_DTLB_CACHE, color_ratios, + "of all dTLB cache accesses"); } -static void print_dtlb_cache_misses(struct perf_stat_config *config, - int aggr_idx, double avg, - struct perf_stat_output_ctx *out, - struct runtime_stat_data *rsd) +static void print_itlb_miss(struct perf_stat_config *config, + const struct evsel *evsel, + int aggr_idx, double misses, + struct perf_stat_output_ctx *out) { - double total, ratio = 0.0; - const char *color; + static const double color_ratios[3] = {20.0, 10.0, 5.0}; - total = runtime_stat_avg(STAT_DTLB_CACHE, aggr_idx, rsd); + print_ratio(config, evsel, aggr_idx, misses, out, STAT_ITLB_CACHE, color_ratios, + "of all iTLB cache accesses"); +} - if (total) - ratio = avg / total * 100.0; +static void print_cache_miss(struct perf_stat_config *config, + const struct evsel *evsel, + int aggr_idx, double misses, + struct perf_stat_output_ctx *out) +{ + static const double color_ratios[3] = {20.0, 10.0, 5.0}; - color = get_ratio_color(GRC_CACHE_MISSES, ratio); - out->print_metric(config, out->ctx, color, "%7.2f%%", "of all dTLB cache accesses", ratio); + print_ratio(config, evsel, aggr_idx, misses, out, STAT_CACHE_REFS, color_ratios, + "of all cache refs"); } -static void print_itlb_cache_misses(struct perf_stat_config *config, - int aggr_idx, double avg, - struct perf_stat_output_ctx *out, - struct runtime_stat_data *rsd) +static void print_instructions(struct perf_stat_config *config, + const struct evsel *evsel, + int aggr_idx, double instructions, + struct perf_stat_output_ctx *out) { - double total, ratio = 0.0; - const char *color; + print_metric_t print_metric = out->print_metric; + void *ctxp = out->ctx; + double cycles = find_stat(evsel, aggr_idx, STAT_CYCLES); + double max_stalled = max(find_stat(evsel, aggr_idx, STAT_STALLED_CYCLES_FRONT), + find_stat(evsel, aggr_idx, STAT_STALLED_CYCLES_BACK)); + + if (cycles) { + print_metric(config, ctxp, NULL, "%7.2f ", "insn per cycle", + instructions / cycles); + } else + print_metric(config, ctxp, NULL, NULL, "insn per cycle", 0); + + if (max_stalled && instructions) { + out->new_line(config, ctxp); + print_metric(config, ctxp, NULL, "%7.2f ", "stalled cycles per insn", + max_stalled / instructions); + } +} - total = runtime_stat_avg(STAT_ITLB_CACHE, aggr_idx, rsd); +static void print_cycles(struct perf_stat_config *config, + const struct evsel *evsel, + int aggr_idx, double cycles, + struct perf_stat_output_ctx *out) +{ + double nsecs = find_stat(evsel, aggr_idx, STAT_NSECS); - if (total) - ratio = avg / total * 100.0; + if (cycles && nsecs) { + double ratio = cycles / nsecs; - color = get_ratio_color(GRC_CACHE_MISSES, ratio); - out->print_metric(config, out->ctx, color, "%7.2f%%", "of all iTLB cache accesses", ratio); + out->print_metric(config, out->ctx, NULL, "%8.3f", "GHz", ratio); + } else + out->print_metric(config, out->ctx, NULL, NULL, "GHz", 0); } -static void print_ll_cache_misses(struct perf_stat_config *config, - int aggr_idx, double avg, - struct perf_stat_output_ctx *out, - struct runtime_stat_data *rsd) +static void print_nsecs(struct perf_stat_config *config, + const struct evsel *evsel, + int aggr_idx __maybe_unused, double nsecs, + struct perf_stat_output_ctx *out) { - double total, ratio = 0.0; - const char *color; - - total = runtime_stat_avg(STAT_LL_CACHE, aggr_idx, rsd); - - if (total) - ratio = avg / total * 100.0; + print_metric_t print_metric = out->print_metric; + void *ctxp = out->ctx; + double wall_time = avg_stats(&walltime_nsecs_stats); - color = get_ratio_color(GRC_CACHE_MISSES, ratio); - out->print_metric(config, out->ctx, color, "%7.2f%%", "of all LL-cache accesses", ratio); + if (wall_time) { + print_metric(config, ctxp, NULL, "%8.3f", "CPUs utilized", + nsecs / (wall_time * evsel->scale)); + } else + print_metric(config, ctxp, NULL, NULL, "CPUs utilized", 0); } static int prepare_metric(struct evsel **metric_events, @@ -638,139 +730,51 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config, struct perf_stat_output_ctx *out, struct rblist *metric_events) { - void *ctxp = out->ctx; - print_metric_t print_metric = out->print_metric; - double total, ratio = 0.0; - struct runtime_stat_data rsd = { - .ctx = evsel_context(evsel), - .cgrp = evsel->cgrp, + typedef void (*stat_print_function_t)(struct perf_stat_config *config, + const struct evsel *evsel, + int aggr_idx, double misses, + struct perf_stat_output_ctx *out); + static const stat_print_function_t stat_print_function[STAT_MAX] = { + [STAT_INSTRUCTIONS] = print_instructions, + [STAT_BRANCH_MISS] = print_branch_miss, + [STAT_L1D_MISS] = print_l1d_miss, + [STAT_L1I_MISS] = print_l1i_miss, + [STAT_DTLB_MISS] = print_dtlb_miss, + [STAT_ITLB_MISS] = print_itlb_miss, + [STAT_LL_MISS] = print_ll_miss, + [STAT_CACHE_MISSES] = print_cache_miss, + [STAT_STALLED_CYCLES_FRONT] = print_stalled_cycles_front, + [STAT_STALLED_CYCLES_BACK] = print_stalled_cycles_back, + [STAT_CYCLES] = print_cycles, + [STAT_NSECS] = print_nsecs, }; + print_metric_t print_metric = out->print_metric; + void *ctxp = out->ctx; struct metric_event *me; int num = 1; if (config->iostat_run) { iostat_print_metric(config, evsel, out); - } else if (evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) { - total = runtime_stat_avg(STAT_CYCLES, aggr_idx, &rsd); - - if (total) { - ratio = avg / total; - print_metric(config, ctxp, NULL, "%7.2f ", - "insn per cycle", ratio); - } else { - print_metric(config, ctxp, NULL, NULL, "insn per cycle", 0); - } - - total = runtime_stat_avg(STAT_STALLED_CYCLES_FRONT, aggr_idx, &rsd); - - total = max(total, runtime_stat_avg(STAT_STALLED_CYCLES_BACK, - aggr_idx, &rsd)); - - if (total && avg) { - out->new_line(config, ctxp); - ratio = total / avg; - print_metric(config, ctxp, NULL, "%7.2f ", - "stalled cycles per insn", - ratio); - } - } else if (evsel__match(evsel, HARDWARE, HW_BRANCH_MISSES)) { - if (runtime_stat_n(STAT_BRANCHES, aggr_idx, &rsd) != 0) - print_branch_misses(config, aggr_idx, avg, out, &rsd); - else - print_metric(config, ctxp, NULL, NULL, "of all branches", 0); - } else if ( - evsel->core.attr.type == PERF_TYPE_HW_CACHE && - evsel->core.attr.config == ( PERF_COUNT_HW_CACHE_L1D | - ((PERF_COUNT_HW_CACHE_OP_READ) << 8) | - ((PERF_COUNT_HW_CACHE_RESULT_MISS) << 16))) { - - if (runtime_stat_n(STAT_L1_DCACHE, aggr_idx, &rsd) != 0) - print_l1_dcache_misses(config, aggr_idx, avg, out, &rsd); - else - print_metric(config, ctxp, NULL, NULL, "of all L1-dcache accesses", 0); - } else if ( - evsel->core.attr.type == PERF_TYPE_HW_CACHE && - evsel->core.attr.config == ( PERF_COUNT_HW_CACHE_L1I | - ((PERF_COUNT_HW_CACHE_OP_READ) << 8) | - ((PERF_COUNT_HW_CACHE_RESULT_MISS) << 16))) { - - if (runtime_stat_n(STAT_L1_ICACHE, aggr_idx, &rsd) != 0) - print_l1_icache_misses(config, aggr_idx, avg, out, &rsd); - else - print_metric(config, ctxp, NULL, NULL, "of all L1-icache accesses", 0); - } else if ( - evsel->core.attr.type == PERF_TYPE_HW_CACHE && - evsel->core.attr.config == ( PERF_COUNT_HW_CACHE_DTLB | - ((PERF_COUNT_HW_CACHE_OP_READ) << 8) | - ((PERF_COUNT_HW_CACHE_RESULT_MISS) << 16))) { - - if (runtime_stat_n(STAT_DTLB_CACHE, aggr_idx, &rsd) != 0) - print_dtlb_cache_misses(config, aggr_idx, avg, out, &rsd); - else - print_metric(config, ctxp, NULL, NULL, "of all dTLB cache accesses", 0); - } else if ( - evsel->core.attr.type == PERF_TYPE_HW_CACHE && - evsel->core.attr.config == ( PERF_COUNT_HW_CACHE_ITLB | - ((PERF_COUNT_HW_CACHE_OP_READ) << 8) | - ((PERF_COUNT_HW_CACHE_RESULT_MISS) << 16))) { - - if (runtime_stat_n(STAT_ITLB_CACHE, aggr_idx, &rsd) != 0) - print_itlb_cache_misses(config, aggr_idx, avg, out, &rsd); - else - print_metric(config, ctxp, NULL, NULL, "of all iTLB cache accesses", 0); - } else if ( - evsel->core.attr.type == PERF_TYPE_HW_CACHE && - evsel->core.attr.config == ( PERF_COUNT_HW_CACHE_LL | - ((PERF_COUNT_HW_CACHE_OP_READ) << 8) | - ((PERF_COUNT_HW_CACHE_RESULT_MISS) << 16))) { - - if (runtime_stat_n(STAT_LL_CACHE, aggr_idx, &rsd) != 0) - print_ll_cache_misses(config, aggr_idx, avg, out, &rsd); - else - print_metric(config, ctxp, NULL, NULL, "of all LL-cache accesses", 0); - } else if (evsel__match(evsel, HARDWARE, HW_CACHE_MISSES)) { - total = runtime_stat_avg(STAT_CACHEREFS, aggr_idx, &rsd); - - if (total) - ratio = avg * 100 / total; - - if (runtime_stat_n(STAT_CACHEREFS, aggr_idx, &rsd) != 0) - print_metric(config, ctxp, NULL, "%8.3f %%", - "of all cache refs", ratio); - else - print_metric(config, ctxp, NULL, NULL, "of all cache refs", 0); - } else if (evsel__match(evsel, HARDWARE, HW_STALLED_CYCLES_FRONTEND)) { - print_stalled_cycles_frontend(config, aggr_idx, avg, out, &rsd); - } else if (evsel__match(evsel, HARDWARE, HW_STALLED_CYCLES_BACKEND)) { - print_stalled_cycles_backend(config, aggr_idx, avg, out, &rsd); - } else if (evsel__match(evsel, HARDWARE, HW_CPU_CYCLES)) { - total = runtime_stat_avg(STAT_NSECS, aggr_idx, &rsd); - - if (total) { - ratio = avg / total; - print_metric(config, ctxp, NULL, "%8.3f", "GHz", ratio); - } else { - print_metric(config, ctxp, NULL, NULL, "Ghz", 0); - } - } else if (evsel__is_clock(evsel)) { - if ((ratio = avg_stats(&walltime_nsecs_stats)) != 0) - print_metric(config, ctxp, NULL, "%8.3f", "CPUs utilized", - avg / (ratio * evsel->scale)); - else - print_metric(config, ctxp, NULL, NULL, "CPUs utilized", 0); - } else if (runtime_stat_n(STAT_NSECS, aggr_idx, &rsd) != 0) { - char unit = ' '; - char unit_buf[10] = "/sec"; - - total = runtime_stat_avg(STAT_NSECS, aggr_idx, &rsd); - if (total) - ratio = convert_unit_double(1000000000.0 * avg / total, &unit); - - if (unit != ' ') - snprintf(unit_buf, sizeof(unit_buf), "%c/sec", unit); - print_metric(config, ctxp, NULL, "%8.3f", unit_buf, ratio); } else { - num = 0; + stat_print_function_t fn = stat_print_function[evsel__stat_type(evsel)]; + + if (fn) + fn(config, evsel, aggr_idx, avg, out); + else { + double nsecs = find_stat(evsel, aggr_idx, STAT_NSECS); + + if (nsecs) { + char unit = ' '; + char unit_buf[10] = "/sec"; + double ratio = convert_unit_double(1000000000.0 * avg / nsecs, + &unit); + + if (unit != ' ') + snprintf(unit_buf, sizeof(unit_buf), "%c/sec", unit); + print_metric(config, ctxp, NULL, "%8.3f", unit_buf, ratio); + } else + num = 0; + } } if ((me = metricgroup__lookup(metric_events, evsel, false)) != NULL) { From patchwork Sun Feb 19 09:28:48 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ian Rogers X-Patchwork-Id: 59127 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp776251wrn; Sun, 19 Feb 2023 01:47:54 -0800 (PST) X-Google-Smtp-Source: AK7set8auIK3MBGjQKdmrNKeMT7TwjZ84h3LhEpWX0gcHFa2/dGG5OU5z3FAw/3BQL0sAbFanPDy X-Received: by 2002:a05:6a20:430e:b0:bf:d67e:5517 with SMTP id h14-20020a056a20430e00b000bfd67e5517mr8030073pzk.42.1676800073763; Sun, 19 Feb 2023 01:47:53 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1676800073; cv=none; d=google.com; s=arc-20160816; b=XkYW8+19BCch5tmFGaqYqUNKReZJUGVkHkzJ/ZG3m61YztrbmDrQ4bh2NUy49DTta+ GALHTRtQAUyxT3tt/kVGvdxrjIKu0QXNTLND/yGDIZier5YKlcXitirg19YnJquvPjOk Ko10WSliwBUPhdqj/qOF+L6UYVjLcjVVN3ey+vSj3Csftu23idj0mUmCdWlRKNBTG2re d9i01Jg8F4BhFZY1qquKG2SZsV0Qe4T+pUcwF45jDLJ3WX0rAMbB1IgDC0tFi+eWDQye VTqMXG7lqvTfvotjrp6YhmEeTYzhisBR3LZomn48KPXTsXR7JhA/kivNvn3Jv35XCf53 cGUA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:references:mime-version :message-id:in-reply-to:date:dkim-signature; bh=KWbhNdODInNmkKhRE1J16fJdDjqDUBwLENwO3W9vuiU=; b=FdUE7DqlOwg2qqzdmd7xI2Z2UQP/QKe8Wocp6UH40i1uAy3fRuJgInmMNVdPOGfEiQ Wp1TqexxD2NVhQK6P8i993+PQS2X+hnbweXwvvuZaoUGAWvbNStUd0mNb9UNKR9Mqn63 jnrJPJQubakD95ANtQTwNiA6flQNSLzbxnm/VIygDipbyBhS3GYcfZ/n4aAuLcIie6fl jSk6VZVe8qt49Cy6Hmi/Zp2SEgKuC8kyigs6bEphHgFKELG0O02XHKPuUk6zAplVrODh tA0nYyuSjRDBTP19BB7FpOq5s0P2eVPWnhazW8B3dJlJEDRedTLs1eTqrQV9Y0oh9K+p 2+og== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=Pm859wys; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id i4-20020a63d444000000b004fb681fec99si29487pgj.628.2023.02.19.01.47.41; Sun, 19 Feb 2023 01:47:53 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=Pm859wys; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229640AbjBSJin (ORCPT + 99 others); Sun, 19 Feb 2023 04:38:43 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33434 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230062AbjBSJil (ORCPT ); Sun, 19 Feb 2023 04:38:41 -0500 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BD26C12BC1 for ; Sun, 19 Feb 2023 01:37:42 -0800 (PST) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-5368a9902d0so14232857b3.10 for ; Sun, 19 Feb 2023 01:37:42 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=KWbhNdODInNmkKhRE1J16fJdDjqDUBwLENwO3W9vuiU=; b=Pm859wysWzwzMi1q3PqzD+ne5cXUtXcBdKyZ0RS1dnxvqc3OaHp8DhL81ofEGYOgyI L5IQIUgDp0VOypIYOXzLORvQ8rsun/udjxKhuATtxbN3onPcIRK5hjIiBs6kOb6UyhPu jQuiF01vyuV9aBbDnpjIIgSILCRwX6F5MWuKOHlsyDoDqHN0LdEVSQrp9hBJ/m6KYVni iYurqb7fJVAn6qXnKcJb2No+8gTYalxOJk4e0xCvES+dUhxkkykpJpF8WSCYQnmmkKcF 2YZi+Kuqk7+q7PLIuUiD4DtSXzRupWNlCokHZfTMqJ88/DyhTR0Lsu0EZMX1oRjJSn2P lzPg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=KWbhNdODInNmkKhRE1J16fJdDjqDUBwLENwO3W9vuiU=; b=KQEYgjPkp84L+2rWR2d2uzLZ89KNpwkY5v9pJf19EX48vTH2ot9n4DDOmCBzKpRXhn mvom5vaXdIqRCarf7L50L+/Ct6jy/2ROegg//zA1JlqsYQt265HyFGS+uBM4kPXrKQRc D69ZM5TfbUuSVGUninaV4DoyzuDHTZPRPIcOZCO7cghjpd9vNvaO9zAjAZJTjDujP7Gx cgHy5ndkttkQp95VtDTWSlMC1LlUkUsINa5iAaOOA8K68l74e7sFB7Tz9daYJK5g0h+B gHqdw1dv9cPJ0+UwhAZ/osc6iqrAFu4Mf0u+ZtMAtvgBs3jbJoeCVQujzAWCe9lHW6s7 UKAw== X-Gm-Message-State: AO0yUKU8zgiLhPmR49Q52jo3LvvHbeE50HbEpPKql7lSDkaO5jasGvMq F+BUdsAbxpl0+vtgc6Qk37S15buyIZGJ X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a05:6902:12c5:b0:8d9:7d26:d690 with SMTP id j5-20020a05690212c500b008d97d26d690mr156747ybu.1.1676799370473; Sun, 19 Feb 2023 01:36:10 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:48 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-52-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 51/51] perf stat: Remove saved_value/runtime_stat From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1758252314274225304?= X-GMAIL-MSGID: =?utf-8?q?1758252314274225304?= As saved_value/runtime_stat are only written to and not read, remove the associated logic and writes. Signed-off-by: Ian Rogers --- tools/perf/builtin-script.c | 5 - tools/perf/builtin-stat.c | 6 - tools/perf/tests/parse-metric.c | 1 - tools/perf/tests/pmu-events.c | 1 - tools/perf/util/stat-shadow.c | 198 -------------------------------- tools/perf/util/stat.c | 24 ---- tools/perf/util/stat.h | 4 - 7 files changed, 239 deletions(-) diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c index e9b5387161df..522226114263 100644 --- a/tools/perf/builtin-script.c +++ b/tools/perf/builtin-script.c @@ -2072,9 +2072,6 @@ static void perf_sample__fprint_metric(struct perf_script *script, if (evsel_script(leader)->gnum++ == 0) perf_stat__reset_shadow_stats(); val = sample->period * evsel->scale; - perf_stat__update_shadow_stats(evsel, - val, - sample->cpu); evsel_script(evsel)->val = val; if (evsel_script(leader)->gnum == leader->core.nr_members) { for_each_group_member (ev2, leader) { @@ -2792,8 +2789,6 @@ static int __cmd_script(struct perf_script *script) signal(SIGINT, sig_handler); - perf_stat__init_shadow_stats(); - /* override event processing functions */ if (script->show_task_events) { script->tool.comm = process_comm_event; diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c index 619387459914..d70b1ec88594 100644 --- a/tools/perf/builtin-stat.c +++ b/tools/perf/builtin-stat.c @@ -424,7 +424,6 @@ static void process_counters(void) perf_stat_merge_counters(&stat_config, evsel_list); perf_stat_process_percore(&stat_config, evsel_list); - perf_stat_process_shadow_stats(&stat_config, evsel_list); } static void process_interval(void) @@ -434,7 +433,6 @@ static void process_interval(void) clock_gettime(CLOCK_MONOTONIC, &ts); diff_timespec(&rs, &ts, &ref_time); - perf_stat__reset_shadow_per_stat(); evlist__reset_aggr_stats(evsel_list); if (read_counters(&rs) == 0) @@ -910,7 +908,6 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx) evlist__copy_prev_raw_counts(evsel_list); evlist__reset_prev_raw_counts(evsel_list); evlist__reset_aggr_stats(evsel_list); - perf_stat__reset_shadow_per_stat(); } else { update_stats(&walltime_nsecs_stats, t1 - t0); update_rusage_stats(&ru_stats, &stat_config.ru_data); @@ -2132,8 +2129,6 @@ static int __cmd_report(int argc, const char **argv) input_name = "perf.data"; } - perf_stat__init_shadow_stats(); - perf_stat.data.path = input_name; perf_stat.data.mode = PERF_DATA_MODE_READ; @@ -2413,7 +2408,6 @@ int cmd_stat(int argc, const char **argv) &stat_config.metric_events); zfree(&metrics); } - perf_stat__init_shadow_stats(); if (add_default_attributes()) goto out; diff --git a/tools/perf/tests/parse-metric.c b/tools/perf/tests/parse-metric.c index b9b8a48289c4..c43b056f9fa3 100644 --- a/tools/perf/tests/parse-metric.c +++ b/tools/perf/tests/parse-metric.c @@ -296,7 +296,6 @@ static int test_metric_group(void) static int test__parse_metric(struct test_suite *test __maybe_unused, int subtest __maybe_unused) { - perf_stat__init_shadow_stats(); TEST_ASSERT_VAL("IPC failed", test_ipc() == 0); TEST_ASSERT_VAL("frontend failed", test_frontend() == 0); TEST_ASSERT_VAL("DCache_L2 failed", test_dcache_l2() == 0); diff --git a/tools/perf/tests/pmu-events.c b/tools/perf/tests/pmu-events.c index 4ec2a4ca1a82..6ccd413b5983 100644 --- a/tools/perf/tests/pmu-events.c +++ b/tools/perf/tests/pmu-events.c @@ -905,7 +905,6 @@ static int test__parsing(struct test_suite *test __maybe_unused, { int failures = 0; - perf_stat__init_shadow_stats(); pmu_for_each_core_metric(test__parsing_callback, &failures); pmu_for_each_sys_metric(test__parsing_callback, &failures); diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c index 9d22cde09dc9..ef85f1ae1ab2 100644 --- a/tools/perf/util/stat-shadow.c +++ b/tools/perf/util/stat-shadow.c @@ -16,22 +16,9 @@ #include "iostat.h" #include "util/hashmap.h" -/* - * AGGR_GLOBAL: Use CPU 0 - * AGGR_SOCKET: Use first CPU of socket - * AGGR_DIE: Use first CPU of die - * AGGR_CORE: Use first CPU of core - * AGGR_NONE: Use matching CPU - * AGGR_THREAD: Not supported? - */ - struct stats walltime_nsecs_stats; struct rusage_stats ru_stats; -static struct runtime_stat { - struct rblist value_list; -} rt_stat; - enum { CTX_BIT_USER = 1 << 0, CTX_BIT_KERNEL = 1 << 1, @@ -65,117 +52,6 @@ enum stat_type { STAT_MAX }; -struct saved_value { - struct rb_node rb_node; - struct evsel *evsel; - enum stat_type type; - int ctx; - int map_idx; /* cpu or thread map index */ - struct cgroup *cgrp; - struct stats stats; - u64 metric_total; - int metric_other; -}; - -static int saved_value_cmp(struct rb_node *rb_node, const void *entry) -{ - struct saved_value *a = container_of(rb_node, - struct saved_value, - rb_node); - const struct saved_value *b = entry; - - if (a->map_idx != b->map_idx) - return a->map_idx - b->map_idx; - - /* - * Previously the rbtree was used to link generic metrics. - * The keys were evsel/cpu. Now the rbtree is extended to support - * per-thread shadow stats. For shadow stats case, the keys - * are cpu/type/ctx/stat (evsel is NULL). For generic metrics - * case, the keys are still evsel/cpu (type/ctx/stat are 0 or NULL). - */ - if (a->type != b->type) - return a->type - b->type; - - if (a->ctx != b->ctx) - return a->ctx - b->ctx; - - if (a->cgrp != b->cgrp) - return (char *)a->cgrp < (char *)b->cgrp ? -1 : +1; - - if (a->evsel == b->evsel) - return 0; - if ((char *)a->evsel < (char *)b->evsel) - return -1; - return +1; -} - -static struct rb_node *saved_value_new(struct rblist *rblist __maybe_unused, - const void *entry) -{ - struct saved_value *nd = malloc(sizeof(struct saved_value)); - - if (!nd) - return NULL; - memcpy(nd, entry, sizeof(struct saved_value)); - return &nd->rb_node; -} - -static void saved_value_delete(struct rblist *rblist __maybe_unused, - struct rb_node *rb_node) -{ - struct saved_value *v; - - BUG_ON(!rb_node); - v = container_of(rb_node, struct saved_value, rb_node); - free(v); -} - -static struct saved_value *saved_value_lookup(struct evsel *evsel, - int map_idx, - bool create, - enum stat_type type, - int ctx, - struct cgroup *cgrp) -{ - struct rblist *rblist; - struct rb_node *nd; - struct saved_value dm = { - .map_idx = map_idx, - .evsel = evsel, - .type = type, - .ctx = ctx, - .cgrp = cgrp, - }; - - rblist = &rt_stat.value_list; - - /* don't use context info for clock events */ - if (type == STAT_NSECS) - dm.ctx = 0; - - nd = rblist__find(rblist, &dm); - if (nd) - return container_of(nd, struct saved_value, rb_node); - if (create) { - rblist__add_node(rblist, &dm); - nd = rblist__find(rblist, &dm); - if (nd) - return container_of(nd, struct saved_value, rb_node); - } - return NULL; -} - -void perf_stat__init_shadow_stats(void) -{ - struct rblist *rblist = &rt_stat.value_list; - - rblist__init(rblist); - rblist->node_cmp = saved_value_cmp; - rblist->node_new = saved_value_new; - rblist->node_delete = saved_value_delete; -} - static int evsel_context(const struct evsel *evsel) { int ctx = 0; @@ -194,86 +70,12 @@ static int evsel_context(const struct evsel *evsel) return ctx; } -void perf_stat__reset_shadow_per_stat(void) -{ - struct rblist *rblist; - struct rb_node *pos, *next; - - rblist = &rt_stat.value_list; - next = rb_first_cached(&rblist->entries); - while (next) { - pos = next; - next = rb_next(pos); - memset(&container_of(pos, struct saved_value, rb_node)->stats, - 0, - sizeof(struct stats)); - } -} - void perf_stat__reset_shadow_stats(void) { - perf_stat__reset_shadow_per_stat(); memset(&walltime_nsecs_stats, 0, sizeof(walltime_nsecs_stats)); memset(&ru_stats, 0, sizeof(ru_stats)); } -struct runtime_stat_data { - int ctx; - struct cgroup *cgrp; -}; - -static void update_runtime_stat(enum stat_type type, - int map_idx, u64 count, - struct runtime_stat_data *rsd) -{ - struct saved_value *v = saved_value_lookup(NULL, map_idx, true, type, - rsd->ctx, rsd->cgrp); - - if (v) - update_stats(&v->stats, count); -} - -/* - * Update various tracking values we maintain to print - * more semantic information such as miss/hit ratios, - * instruction rates, etc: - */ -void perf_stat__update_shadow_stats(struct evsel *counter, u64 count, - int aggr_idx) -{ - u64 count_ns = count; - struct runtime_stat_data rsd = { - .ctx = evsel_context(counter), - .cgrp = counter->cgrp, - }; - count *= counter->scale; - - if (evsel__is_clock(counter)) - update_runtime_stat(STAT_NSECS, aggr_idx, count_ns, &rsd); - else if (evsel__match(counter, HARDWARE, HW_CPU_CYCLES)) - update_runtime_stat(STAT_CYCLES, aggr_idx, count, &rsd); - else if (evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_FRONTEND)) - update_runtime_stat(STAT_STALLED_CYCLES_FRONT, - aggr_idx, count, &rsd); - else if (evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_BACKEND)) - update_runtime_stat(STAT_STALLED_CYCLES_BACK, - aggr_idx, count, &rsd); - else if (evsel__match(counter, HARDWARE, HW_BRANCH_INSTRUCTIONS)) - update_runtime_stat(STAT_BRANCHES, aggr_idx, count, &rsd); - else if (evsel__match(counter, HARDWARE, HW_CACHE_REFERENCES)) - update_runtime_stat(STAT_CACHE_REFS, aggr_idx, count, &rsd); - else if (evsel__match(counter, HW_CACHE, HW_CACHE_L1D)) - update_runtime_stat(STAT_L1_DCACHE, aggr_idx, count, &rsd); - else if (evsel__match(counter, HW_CACHE, HW_CACHE_L1I)) - update_runtime_stat(STAT_L1_ICACHE, aggr_idx, count, &rsd); - else if (evsel__match(counter, HW_CACHE, HW_CACHE_LL)) - update_runtime_stat(STAT_LL_CACHE, aggr_idx, count, &rsd); - else if (evsel__match(counter, HW_CACHE, HW_CACHE_DTLB)) - update_runtime_stat(STAT_DTLB_CACHE, aggr_idx, count, &rsd); - else if (evsel__match(counter, HW_CACHE, HW_CACHE_ITLB)) - update_runtime_stat(STAT_ITLB_CACHE, aggr_idx, count, &rsd); -} - static enum stat_type evsel__stat_type(const struct evsel *evsel) { /* Fake perf_hw_cache_op_id values for use with evsel__match. */ diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c index 83dc4c1f4b12..4abfd87c5352 100644 --- a/tools/perf/util/stat.c +++ b/tools/perf/util/stat.c @@ -648,30 +648,6 @@ void perf_stat_process_percore(struct perf_stat_config *config, struct evlist *e evsel__process_percore(evsel); } -static void evsel__update_shadow_stats(struct evsel *evsel) -{ - struct perf_stat_evsel *ps = evsel->stats; - int aggr_idx; - - if (ps->aggr == NULL) - return; - - for (aggr_idx = 0; aggr_idx < ps->nr_aggr; aggr_idx++) { - struct perf_counts_values *aggr_counts = &ps->aggr[aggr_idx].counts; - - perf_stat__update_shadow_stats(evsel, aggr_counts->val, aggr_idx); - } -} - -void perf_stat_process_shadow_stats(struct perf_stat_config *config __maybe_unused, - struct evlist *evlist) -{ - struct evsel *evsel; - - evlist__for_each_entry(evlist, evsel) - evsel__update_shadow_stats(evsel); -} - int perf_event__process_stat_event(struct perf_session *session, union perf_event *event) { diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h index b01c828c3799..41204547b76b 100644 --- a/tools/perf/util/stat.h +++ b/tools/perf/util/stat.h @@ -157,10 +157,7 @@ typedef void (*print_metric_t)(struct perf_stat_config *config, const char *fmt, double val); typedef void (*new_line_t)(struct perf_stat_config *config, void *ctx); -void perf_stat__init_shadow_stats(void); void perf_stat__reset_shadow_stats(void); -void perf_stat__reset_shadow_per_stat(void); -void perf_stat__update_shadow_stats(struct evsel *counter, u64 count, int aggr_idx); struct perf_stat_output_ctx { void *ctx; print_metric_t print_metric; @@ -189,7 +186,6 @@ int perf_stat_process_counter(struct perf_stat_config *config, struct evsel *counter); void perf_stat_merge_counters(struct perf_stat_config *config, struct evlist *evlist); void perf_stat_process_percore(struct perf_stat_config *config, struct evlist *evlist); -void perf_stat_process_shadow_stats(struct perf_stat_config *config, struct evlist *evlist); struct perf_tool; union perf_event;