Message ID | 20230328222735.1367829-2-kan.liang@linux.intel.com |
---|---|
State | New |
Headers | From: kan.liang@linux.intel.com To: peterz@infradead.org, mingo@redhat.com, linux-kernel@vger.kernel.org Cc: ak@linux.intel.com, eranian@google.com, Kan Liang <kan.liang@linux.intel.com> Subject: [PATCH 2/2] perf/x86/intel/ds: Use the size from each PEBS record Date: Tue, 28 Mar 2023 15:27:35 -0700 Message-Id: <20230328222735.1367829-2-kan.liang@linux.intel.com> In-Reply-To: <20230328222735.1367829-1-kan.liang@linux.intel.com> References: <20230328222735.1367829-1-kan.liang@linux.intel.com> |
Series |
[1/2] perf: Add sched_task callback during ctx reschedule
|
|
Commit Message
Liang, Kan
March 28, 2023, 10:27 p.m. UTC
From: Kan Liang <kan.liang@linux.intel.com>

The kernel warning for the unexpected PEBS record can also be observed
during a context switch, when the below commands are running in parallel
for a while on SPR.

  while true; do perf record --no-buildid -a --intr-regs=AX -e
  cpu/event=0xd0,umask=0x81/pp -c 10003 -o /dev/null ./triad; done &

  while true; do perf record -o /tmp/out -W -d -e
  '{ld_blocks.store_forward:period=1000000,
  MEM_TRANS_RETIRED.LOAD_LATENCY:u:precise=2:ldlat=4}'
  -c 1037 ./triad; done
  *The triad program is just the generation of loads/stores.

The current PEBS code assumes that all the PEBS records in the DS buffer
have the same size, aka cpuc->pebs_record_size. It's true for the most
cases, since the DS buffer is always flushed in every context switch.

However, there is a corner case that breaks the assumption.
A system-wide PEBS event with the large PEBS config may be enabled
during a context switch. Some PEBS records for the system-wide PEBS may
be generated while the old task is sched out but the new one hasn't been
sched in yet. When the new task is sched in, the cpuc->pebs_record_size
may be updated for the per-task PEBS events. So the existing system-wide
PEBS records have a different size from the later PEBS records.

Two methods were considered to fix the issue.
One is to flush the DS buffer for the system-wide PEBS right before the
new task sched in. It has to be done in the generic code via the
sched_task() call back. However, the sched_task() is shared among
different ARCHs. The movement may impact other ARCHs, e.g., AMD BRS
requires the sched_task() is called after the PMU has started on a
ctxswin. The method is dropped.

The other method is implemented here. It doesn't assume that all the
PEBS records have the same size any more. The size from each PEBS record
is used to parse the record. For the previous platform (PEBS format < 4),
which doesn't support adaptive PEBS, there is nothing changed.

Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/ds.c        | 31 ++++++++++++++++++++++++++-----
 arch/x86/include/asm/perf_event.h |  6 ++++++
 2 files changed, 32 insertions(+), 5 deletions(-)
Comments
Hi,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on tip/perf/core]
[also build test WARNING on acme/perf/core linus/master v6.3-rc4 next-20230328]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/kan-liang-linux-intel-com/perf-x86-intel-ds-Use-the-size-from-each-PEBS-record/20230329-064258
patch link: https://lore.kernel.org/r/20230328222735.1367829-2-kan.liang%40linux.intel.com
patch subject: [PATCH 2/2] perf/x86/intel/ds: Use the size from each PEBS record
config: x86_64-allyesconfig (https://download.01.org/0day-ci/archive/20230329/202303290854.CQhdHZDG-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
# https://github.com/intel-lab-lkp/linux/commit/a5988003cfa30fb0c88507d2d124eb551d42e1a6
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review kan-liang-linux-intel-com/perf-x86-intel-ds-Use-the-size-from-each-PEBS-record/20230329-064258
git checkout a5988003cfa30fb0c88507d2d124eb551d42e1a6
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 O=build_dir ARCH=x86_64 olddefconfig
make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash arch/x86/
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202303290854.CQhdHZDG-lkp@intel.com/
All warnings (new ones prefixed by >>):
arch/x86/events/intel/ds.c: In function '__intel_pmu_pebs_event':
>> arch/x86/events/intel/ds.c:2042:31: warning: unused variable 'cpuc' [-Wunused-variable]
2042 | struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
| ^~~~
vim +/cpuc +2042 arch/x86/events/intel/ds.c
d31fc13fdcb20e arch/x86/events/intel/ds.c Kan Liang 2018-02-12 2029
9dfa9a5c9bae34 arch/x86/events/intel/ds.c Peter Zijlstra 2020-10-30 2030 static __always_inline void
9dfa9a5c9bae34 arch/x86/events/intel/ds.c Peter Zijlstra 2020-10-30 2031 __intel_pmu_pebs_event(struct perf_event *event,
21509084f999d7 arch/x86/kernel/cpu/perf_event_intel_ds.c Yan, Zheng 2015-05-06 2032 struct pt_regs *iregs,
9dfa9a5c9bae34 arch/x86/events/intel/ds.c Peter Zijlstra 2020-10-30 2033 struct perf_sample_data *data,
21509084f999d7 arch/x86/kernel/cpu/perf_event_intel_ds.c Yan, Zheng 2015-05-06 2034 void *base, void *top,
c22497f5838c23 arch/x86/events/intel/ds.c Kan Liang 2019-04-02 2035 int bit, int count,
c22497f5838c23 arch/x86/events/intel/ds.c Kan Liang 2019-04-02 2036 void (*setup_sample)(struct perf_event *,
c22497f5838c23 arch/x86/events/intel/ds.c Kan Liang 2019-04-02 2037 struct pt_regs *,
c22497f5838c23 arch/x86/events/intel/ds.c Kan Liang 2019-04-02 2038 void *,
c22497f5838c23 arch/x86/events/intel/ds.c Kan Liang 2019-04-02 2039 struct perf_sample_data *,
c22497f5838c23 arch/x86/events/intel/ds.c Kan Liang 2019-04-02 2040 struct pt_regs *))
43cf76312faefe arch/x86/kernel/cpu/perf_event_intel_ds.c Yan, Zheng 2015-05-06 2041 {
c22497f5838c23 arch/x86/events/intel/ds.c Kan Liang 2019-04-02 @2042 struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
d31fc13fdcb20e arch/x86/events/intel/ds.c Kan Liang 2018-02-12 2043 struct hw_perf_event *hwc = &event->hw;
c22497f5838c23 arch/x86/events/intel/ds.c Kan Liang 2019-04-02 2044 struct x86_perf_regs perf_regs;
c22497f5838c23 arch/x86/events/intel/ds.c Kan Liang 2019-04-02 2045 struct pt_regs *regs = &perf_regs.regs;
21509084f999d7 arch/x86/kernel/cpu/perf_event_intel_ds.c Yan, Zheng 2015-05-06 2046 void *at = get_next_pebs_record_by_bit(base, top, bit);
e506d1dac0edb2 arch/x86/events/intel/ds.c Peter Zijlstra 2020-10-30 2047 static struct pt_regs dummy_iregs;
43cf76312faefe arch/x86/kernel/cpu/perf_event_intel_ds.c Yan, Zheng 2015-05-06 2048
d31fc13fdcb20e arch/x86/events/intel/ds.c Kan Liang 2018-02-12 2049 if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
d31fc13fdcb20e arch/x86/events/intel/ds.c Kan Liang 2018-02-12 2050 /*
d31fc13fdcb20e arch/x86/events/intel/ds.c Kan Liang 2018-02-12 2051 * Now, auto-reload is only enabled in fixed period mode.
d31fc13fdcb20e arch/x86/events/intel/ds.c Kan Liang 2018-02-12 2052 * The reload value is always hwc->sample_period.
d31fc13fdcb20e arch/x86/events/intel/ds.c Kan Liang 2018-02-12 2053 * May need to change it, if auto-reload is enabled in
d31fc13fdcb20e arch/x86/events/intel/ds.c Kan Liang 2018-02-12 2054 * freq mode later.
d31fc13fdcb20e arch/x86/events/intel/ds.c Kan Liang 2018-02-12 2055 */
d31fc13fdcb20e arch/x86/events/intel/ds.c Kan Liang 2018-02-12 2056 intel_pmu_save_and_restart_reload(event, count);
d31fc13fdcb20e arch/x86/events/intel/ds.c Kan Liang 2018-02-12 2057 } else if (!intel_pmu_save_and_restart(event))
43cf76312faefe arch/x86/kernel/cpu/perf_event_intel_ds.c Yan, Zheng 2015-05-06 2058 return;
43cf76312faefe arch/x86/kernel/cpu/perf_event_intel_ds.c Yan, Zheng 2015-05-06 2059
35d1ce6bec1336 arch/x86/events/intel/ds.c Kan Liang 2020-09-02 2060 if (!iregs)
35d1ce6bec1336 arch/x86/events/intel/ds.c Kan Liang 2020-09-02 2061 iregs = &dummy_iregs;
35d1ce6bec1336 arch/x86/events/intel/ds.c Kan Liang 2020-09-02 2062
a3d86542de8850 arch/x86/kernel/cpu/perf_event_intel_ds.c Peter Zijlstra 2015-05-12 2063 while (count > 1) {
9dfa9a5c9bae34 arch/x86/events/intel/ds.c Peter Zijlstra 2020-10-30 2064 setup_sample(event, iregs, at, data, regs);
9dfa9a5c9bae34 arch/x86/events/intel/ds.c Peter Zijlstra 2020-10-30 2065 perf_event_output(event, data, regs);
a5988003cfa30f arch/x86/events/intel/ds.c Kan Liang 2023-03-28 2066 at += get_pebs_size(at);
21509084f999d7 arch/x86/kernel/cpu/perf_event_intel_ds.c Yan, Zheng 2015-05-06 2067 at = get_next_pebs_record_by_bit(at, top, bit);
a3d86542de8850 arch/x86/kernel/cpu/perf_event_intel_ds.c Peter Zijlstra 2015-05-12 2068 count--;
21509084f999d7 arch/x86/kernel/cpu/perf_event_intel_ds.c Yan, Zheng 2015-05-06 2069 }
21509084f999d7 arch/x86/kernel/cpu/perf_event_intel_ds.c Yan, Zheng 2015-05-06 2070
9dfa9a5c9bae34 arch/x86/events/intel/ds.c Peter Zijlstra 2020-10-30 2071 setup_sample(event, iregs, at, data, regs);
35d1ce6bec1336 arch/x86/events/intel/ds.c Kan Liang 2020-09-02 2072 if (iregs == &dummy_iregs) {
35d1ce6bec1336 arch/x86/events/intel/ds.c Kan Liang 2020-09-02 2073 /*
35d1ce6bec1336 arch/x86/events/intel/ds.c Kan Liang 2020-09-02 2074 * The PEBS records may be drained in the non-overflow context,
35d1ce6bec1336 arch/x86/events/intel/ds.c Kan Liang 2020-09-02 2075 * e.g., large PEBS + context switch. Perf should treat the
35d1ce6bec1336 arch/x86/events/intel/ds.c Kan Liang 2020-09-02 2076 * last record the same as other PEBS records, and doesn't
35d1ce6bec1336 arch/x86/events/intel/ds.c Kan Liang 2020-09-02 2077 * invoke the generic overflow handler.
35d1ce6bec1336 arch/x86/events/intel/ds.c Kan Liang 2020-09-02 2078 */
9dfa9a5c9bae34 arch/x86/events/intel/ds.c Peter Zijlstra 2020-10-30 2079 perf_event_output(event, data, regs);
35d1ce6bec1336 arch/x86/events/intel/ds.c Kan Liang 2020-09-02 2080 } else {
21509084f999d7 arch/x86/kernel/cpu/perf_event_intel_ds.c Yan, Zheng 2015-05-06 2081 /*
21509084f999d7 arch/x86/kernel/cpu/perf_event_intel_ds.c Yan, Zheng 2015-05-06 2082 * All but the last records are processed.
21509084f999d7 arch/x86/kernel/cpu/perf_event_intel_ds.c Yan, Zheng 2015-05-06 2083 * The last one is left to be able to call the overflow handler.
21509084f999d7 arch/x86/kernel/cpu/perf_event_intel_ds.c Yan, Zheng 2015-05-06 2084 */
9dfa9a5c9bae34 arch/x86/events/intel/ds.c Peter Zijlstra 2020-10-30 2085 if (perf_event_overflow(event, data, regs))
a4eaf7f14675cb arch/x86/kernel/cpu/perf_event_intel_ds.c Peter Zijlstra 2010-06-16 2086 x86_pmu_stop(event, 0);
21509084f999d7 arch/x86/kernel/cpu/perf_event_intel_ds.c Yan, Zheng 2015-05-06 2087 }
2b0b5c6fe9b383 arch/x86/kernel/cpu/perf_event_intel_ds.c Peter Zijlstra 2010-04-08 2088 }
2b0b5c6fe9b383 arch/x86/kernel/cpu/perf_event_intel_ds.c Peter Zijlstra 2010-04-08 2089
Hi,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on tip/perf/core]
[also build test WARNING on acme/perf/core linus/master v6.3-rc4 next-20230328]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/kan-liang-linux-intel-com/perf-x86-intel-ds-Use-the-size-from-each-PEBS-record/20230329-064258
patch link: https://lore.kernel.org/r/20230328222735.1367829-2-kan.liang%40linux.intel.com
patch subject: [PATCH 2/2] perf/x86/intel/ds: Use the size from each PEBS record
config: i386-randconfig-a013-20230327 (https://download.01.org/0day-ci/archive/20230329/202303291028.3Xe9Gdlp-lkp@intel.com/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/a5988003cfa30fb0c88507d2d124eb551d42e1a6
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review kan-liang-linux-intel-com/perf-x86-intel-ds-Use-the-size-from-each-PEBS-record/20230329-064258
git checkout a5988003cfa30fb0c88507d2d124eb551d42e1a6
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash arch/x86/events/intel/
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202303291028.3Xe9Gdlp-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> arch/x86/events/intel/ds.c:2042:24: warning: unused variable 'cpuc' [-Wunused-variable]
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
^
1 warning generated.
On Tue, Mar 28, 2023 at 03:27:35PM -0700, kan.liang@linux.intel.com wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
>
> The kernel warning for the unexpected PEBS record can also be observed
> during a context switch, when the below commands are running in parallel
> for a while on SPR.
>
>   while true; do perf record --no-buildid -a --intr-regs=AX -e
>   cpu/event=0xd0,umask=0x81/pp -c 10003 -o /dev/null ./triad; done &
>
>   while true; do perf record -o /tmp/out -W -d -e
>   '{ld_blocks.store_forward:period=1000000,
>   MEM_TRANS_RETIRED.LOAD_LATENCY:u:precise=2:ldlat=4}'
>   -c 1037 ./triad; done
>   *The triad program is just the generation of loads/stores.
>
> The current PEBS code assumes that all the PEBS records in the DS buffer
> have the same size, aka cpuc->pebs_record_size. It's true for the most
> cases, since the DS buffer is always flushed in every context switch.
>
> However, there is a corner case that breaks the assumption.
> A system-wide PEBS event with the large PEBS config may be enabled
> during a context switch. Some PEBS records for the system-wide PEBS may
> be generated while the old task is sched out but the new one hasn't been
> sched in yet. When the new task is sched in, the cpuc->pebs_record_size
> may be updated for the per-task PEBS events. So the existing system-wide
> PEBS records have a different size from the later PEBS records.
>
> Two methods were considered to fix the issue.
> One is to flush the DS buffer for the system-wide PEBS right before the
> new task sched in. It has to be done in the generic code via the
> sched_task() call back. However, the sched_task() is shared among
> different ARCHs. The movement may impact other ARCHs, e.g., AMD BRS
> requires the sched_task() is called after the PMU has started on a
> ctxswin. The method is dropped.
>
> The other method is implemented here. It doesn't assume that all the
> PEBS records have the same size any more. The size from each PEBS record
> is used to parse the record. For the previous platform (PEBS format < 4),
> which doesn't support adaptive PEBS, there is nothing changed.

Same as with the other; why can't we flush the buffer when we reprogram
the hardware?
On 2023-04-06 9:13 a.m., Peter Zijlstra wrote:
> On Tue, Mar 28, 2023 at 03:27:35PM -0700, kan.liang@linux.intel.com wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> The kernel warning for the unexpected PEBS record can also be observed
>> during a context switch, when the below commands are running in parallel
>> for a while on SPR.
>>
>>   while true; do perf record --no-buildid -a --intr-regs=AX -e
>>   cpu/event=0xd0,umask=0x81/pp -c 10003 -o /dev/null ./triad; done &
>>
>>   while true; do perf record -o /tmp/out -W -d -e
>>   '{ld_blocks.store_forward:period=1000000,
>>   MEM_TRANS_RETIRED.LOAD_LATENCY:u:precise=2:ldlat=4}'
>>   -c 1037 ./triad; done
>>   *The triad program is just the generation of loads/stores.
>>
>> The current PEBS code assumes that all the PEBS records in the DS buffer
>> have the same size, aka cpuc->pebs_record_size. It's true for the most
>> cases, since the DS buffer is always flushed in every context switch.
>>
>> However, there is a corner case that breaks the assumption.
>> A system-wide PEBS event with the large PEBS config may be enabled
>> during a context switch. Some PEBS records for the system-wide PEBS may
>> be generated while the old task is sched out but the new one hasn't been
>> sched in yet. When the new task is sched in, the cpuc->pebs_record_size
>> may be updated for the per-task PEBS events. So the existing system-wide
>> PEBS records have a different size from the later PEBS records.
>>
>> Two methods were considered to fix the issue.
>> One is to flush the DS buffer for the system-wide PEBS right before the
>> new task sched in. It has to be done in the generic code via the
>> sched_task() call back. However, the sched_task() is shared among
>> different ARCHs. The movement may impact other ARCHs, e.g., AMD BRS
>> requires the sched_task() is called after the PMU has started on a
>> ctxswin. The method is dropped.
>>
>> The other method is implemented here. It doesn't assume that all the
>> PEBS records have the same size any more. The size from each PEBS record
>> is used to parse the record. For the previous platform (PEBS format < 4),
>> which doesn't support adaptive PEBS, there is nothing changed.
>
> Same as with the other; why can't we flush the buffer when we reprogram
> the hardware?

For the current code, the pebs_record_size has been updated in another
place before we reprogram the hardware. But I think it's possible to
move the update of the pebs_record_size right before the hardware
reprogram. So we can flush the buffer before everything is updated.

Let me try this method.

Thanks,
Kan
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index a2e566e53076..905135a8b99f 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1546,6 +1546,15 @@ static inline u64 get_pebs_status(void *n)
 	return ((struct pebs_basic *)n)->applicable_counters;
 }
 
+static inline u64 get_pebs_size(void *n)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	if (x86_pmu.intel_cap.pebs_format < 4)
+		return cpuc->pebs_record_size;
+	return intel_adaptive_pebs_size(((struct pebs_basic *)n)->format_size);
+}
+
 #define PERF_X86_EVENT_PEBS_HSW_PREC \
 	(PERF_X86_EVENT_PEBS_ST_HSW | \
 	 PERF_X86_EVENT_PEBS_LD_HSW | \
@@ -1903,9 +1912,9 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
 		}
 	}
 
-	WARN_ONCE(next_record != __pebs + (format_size >> 48),
+	WARN_ONCE(next_record != __pebs + intel_adaptive_pebs_size(format_size),
 			"PEBS record size %llu, expected %llu, config %llx\n",
-			format_size >> 48,
+			intel_adaptive_pebs_size(format_size),
 			(u64)(next_record - __pebs),
 			basic->format_size);
 }
@@ -1927,7 +1936,7 @@ get_next_pebs_record_by_bit(void *base, void *top, int bit)
 	if (base == NULL)
 		return NULL;
 
-	for (at = base; at < top; at += cpuc->pebs_record_size) {
+	for (at = base; at < top; at += get_pebs_size(at)) {
 		unsigned long status = get_pebs_status(at);
 
 		if (test_bit(bit, (unsigned long *)&status)) {
@@ -2054,7 +2063,7 @@ __intel_pmu_pebs_event(struct perf_event *event,
 	while (count > 1) {
 		setup_sample(event, iregs, at, data, regs);
 		perf_event_output(event, data, regs);
-		at += cpuc->pebs_record_size;
+		at += get_pebs_size(at);
 		at = get_next_pebs_record_by_bit(at, top, bit);
 		count--;
 	}
@@ -2278,7 +2287,19 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
 		return;
 	}
 
-	for (at = base; at < top; at += cpuc->pebs_record_size) {
+	/*
+	 * The cpuc->pebs_record_size may be different from the
+	 * size of each PEBS record. For example, a system-wide
+	 * PEBS event with the large PEBS config may be enabled
+	 * during a context switch. Some PEBS records for the
+	 * system-wide PEBS may be generated while the old task
+	 * is sched out but the new one isn't sched in. When the
+	 * new task is sched in, the cpuc->pebs_record_size may
+	 * be updated for the per-task PEBS events. So the
+	 * existing system-wide PEBS records have a different
+	 * size from the later PEBS records.
+	 */
+	for (at = base; at < top; at += get_pebs_size(at)) {
 		u64 pebs_status;
 
 		pebs_status = get_pebs_status(at) & cpuc->pebs_enabled;
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 8fc15ed5e60b..ad5655bb90f6 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -386,6 +386,12 @@ static inline bool is_topdown_idx(int idx)
 /*
  * Adaptive PEBS v4
  */
+#define INTEL_ADAPTIVE_PEBS_SIZE_OFF	48
+
+static inline u64 intel_adaptive_pebs_size(u64 format_size)
+{
+	return (format_size >> INTEL_ADAPTIVE_PEBS_SIZE_OFF);
+}
 
 struct pebs_basic {
 	u64 format_size;