Message ID: <20240123113420.1928154-1-ben.gainey@arm.com>
Headers:
From: Ben Gainey <ben.gainey@arm.com>
To: linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org
Cc: peterz@infradead.org, mingo@redhat.com, acme@kernel.org, mark.rutland@arm.com, alexander.shishkin@linux.intel.com, jolsa@kernel.org, namhyung@kernel.org, irogers@google.com, adrian.hunter@intel.com, will@kernel.org, Ben Gainey <ben.gainey@arm.com>
Subject: [RFC PATCH 0/2] A mechanism for efficient support for per-function metrics
Date: Tue, 23 Jan 2024 11:34:18 +0000
Message-ID: <20240123113420.1928154-1-ben.gainey@arm.com>
Series: A mechanism for efficient support for per-function metrics
Message
Ben Gainey
Jan. 23, 2024, 11:34 a.m. UTC
I've been working on an approach to supporting per-function metrics for aarch64 cores, which requires some changes to the arm_pmuv3 driver, and I'm wondering if this approach would make sense as a generic feature that could be used to enable the same on other architectures?

The basic idea is as follows:

* Periodically sample one or more counters as needed for the chosen set of metrics.
* Record a sample count for each symbol so as to identify hot functions.
* Accumulate counter totals for each of the counters in each of the metrics, *but* only do this where the previous sample's symbol matches the current sample's symbol.

Discarding the counter deltas for any given sample is important to ensure that counters are correctly attributed to a single function; without this step the resulting metrics trend towards some average value across the top 'n' functions. It is acknowledged that this heuristic can fail, for example if samples land either side of a nested call, so a sufficiently small sample window over which the counters are collected must be used to reduce the risk of misattribution.

So far, this can be achieved without any further modifications to perf tools or the kernel. However, as noted, it requires the counter collection window to be sufficiently small; in testing on Neoverse-N1/-V1, a window of about 300 cycles gives good results. Using the cycle counter with a sample_period of 300 is possible, but such an approach generates significant amounts of capture data and introduces a lot of overhead and probe effect. Whilst the kernel will throttle such a configuration, which helps to mitigate a large portion of the bandwidth and capture overhead, it is not something that can be controlled on a per-event basis, or for non-root users, and because throttling is controlled as a percentage of time, its effects vary from machine to machine.
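[Editor's sketch] The accumulation heuristic described above can be expressed as a small post-processing loop. This is illustrative Python only: the `(symbol, deltas)` sample representation is an assumption for the sketch, not the actual `perf script` API or any code from the patches.

```python
# Sketch of the same-symbol accumulation heuristic: keep a per-symbol
# sample count for ranking hot functions, but only attribute counter
# deltas when the previous sample landed in the same symbol.
from collections import defaultdict

def accumulate(samples):
    """samples: iterable of (symbol, {event: counter_delta}) tuples,
    in time order. Returns (per-symbol sample counts, per-symbol
    attributed counter totals)."""
    sample_count = defaultdict(int)
    totals = defaultdict(lambda: defaultdict(int))
    prev_symbol = None
    for symbol, deltas in samples:
        sample_count[symbol] += 1          # always rank hotness
        if symbol == prev_symbol:          # same-as-previous check
            for event, delta in deltas.items():
                totals[symbol][event] += delta
        # otherwise the deltas are discarded, as described above
        prev_symbol = symbol
    return sample_count, totals
```

Without the `symbol == prev_symbol` guard, the totals would blend counts from whatever ran between two samples, trending towards an average over the top 'n' functions.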
For this to work efficiently, it is useful to provide a means to decouple the sample window (time over which events are counted) from the sample period (time between interesting samples). This patchset modifies the Arm PMU driver to support alternating between two sample_period values, providing a simple and inexpensive way for tools to separate out the sample period and the sample window. It is expected to be used with the cycle counter event, alternating between a long and short period and subsequently discarding the counter data for samples with the long period. The combined long and short period gives the overall sampling period, and the short period gives the sample window. The symbol taken from the sample at the end of the long period can be used by tools to ensure correct attribution as described previously. The cycle counter is recommended as it provides the fair temporal distribution of samples required for the per-symbol sample count mentioned previously, and because the PMU can be programmed to overflow after a sufficiently short window; this may not be possible with a software timer (for example). This patch does not restrict the feature to the cycle counter; it is possible there could be other novel uses based on different events.

To test this I have developed a simple `perf script` based Python script. For a limited set of Arm PMU events it will post-process a `perf record`-ing and generate a table of metrics. Alongside this I have developed a benchmark application that rotates through a sequence of different classes of behaviour that can be detected by the Arm PMU (e.g. mispredicts, cache misses, different instruction mixes). The path through the benchmark can be rotated after each iteration so as to ensure the results don't land on some lucky harmonic with the sample period. The script can be used with and without the strobing patch, allowing comparison of the results. Testing was on Juno (A53+A57), N1SDP, Graviton 2 and 3.
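[Editor's sketch] The period arithmetic described above is simple but worth pinning down: the short period is the sample window, and long + short gives the overall sampling period. A minimal helper (illustrative only; the 1,000,000-cycle overall period and 300-cycle window match the figures quoted in testing):

```python
def strobe_periods(overall_period, window):
    """Split an overall sampling period into the alternating long and
    short periods used by the strobing scheme: the short period is the
    sample window, and long + short equals the overall period."""
    assert 0 < window < overall_period
    long_period = overall_period - window
    return long_period, window

# e.g. a 1,000,000-cycle overall period with a ~300-cycle window
long_p, short_p = strobe_periods(1_000_000, 300)
```

Only the sample taken at the end of the short period carries usable counter data; the sample ending the long period supplies the symbol for the same-as-previous check.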
In addition, this approach has been applied to a few of Arm's tools projects and has correctly identified improvements and regressions.

Headline results from testing indicate that a ~300-cycle sample window gives good results with or without the strobing patch. When the strobing patch is used, the resulting `perf.data` files are typically 25-50x smaller than without, and take ~25x less time for the Python script to post-process. Without strobing, the test application's runtime was 20x slower when sampling every 300 cycles as compared to every 1000000 cycles. With strobing enabled such that the long period was 999700 cycles and the short period was 300 cycles, the test application's runtime was only 1.2x slower than sampling every 1000000 cycles. Notably, without the patch, L1D cache miss rates are significantly higher than with the patch, which we attribute to the increased impact on the cache of trapping into the kernel every 300 cycles.

These results are given with `perf_cpu_time_max_percent=25`. When tested with `perf_cpu_time_max_percent=100` the size and time comparisons are more significant. Disabling throttling did not lead to obvious improvements in the collected metrics, suggesting that the sampling approach is sufficient to collect representative metrics.

Cursory testing on a Xeon(R) W-2145, sampling every 300 cycles (without the patch), suggests this approach would work for some counters. Calculating branch miss rates, for example, appears to be correct; likewise UOPS_EXECUTED.THREAD seems to give something like a sensible cycles-per-uop value. On the other hand, the fixed-function instructions counter does not appear to sample correctly (it seems to report either very small or very large numbers). No idea what's going on there, so any insight is welcome...

This patch modifies the arm_pmu and introduces an additional field in config2 to configure the behaviour.
If we think there is broad applicability, would it make sense to move this into a perf_event_attr flag or field, and possibly pull it up into core? If we don't think so, then some feedback around the core header and arm_pmu modifications would be appreciated.

A copy of the post-processing script is available at https://github.com/ARM-software/gator/blob/prototypes/strobing/prototypes/strobing-patches/test-script/generate-function-metrics.py (note that the script itself has a dependency on https://lore.kernel.org/linux-perf-users/20240123103137.1890779-1-ben.gainey@arm.com/).

Ben Gainey (2):
  arm_pmu: Allow the PMU to alternate between two sample_period values.
  arm_pmuv3: Add config bits for sample period strobing

 drivers/perf/arm_pmu.c       | 74 +++++++++++++++++++++++++++++-------
 drivers/perf/arm_pmuv3.c     | 25 ++++++++++++
 include/linux/perf/arm_pmu.h |  1 +
 include/linux/perf_event.h   | 10 ++++-
 4 files changed, 95 insertions(+), 15 deletions(-)
Comments
Ben Gainey <ben.gainey@arm.com> writes:

> I've been working on an approach to supporting per-function metrics for
> aarch64 cores, which requires some changes to the arm_pmuv3 driver, and
> I'm wondering if this approach would make sense as a generic feature
> that could be used to enable the same on other architectures?
>
> The basic idea is as follows:
>
> * Periodically sample one or more counters as needed for the chosen
>   set of metrics.
> * Record a sample count for each symbol so as to identify hot
>   functions.
> * Accumulate counter totals for each of the counters in each of the
>   metrics *but* only do this where the previous sample's symbol
>   matches the current sample's symbol.

It sounds very similar to what perf script -F +metric already does (or did if it wasn't broken currently). It would be a straightforward extension here to add this "same as previous" check.

Of course the feature is somewhat dubious in that it will have a very strong systematic bias against short functions, and even long functions in some alternating execution patterns. I assume you did some experiments to characterize this. It would be important to emphasize this in any documentation.

> For this to work efficiently, it is useful to provide a means to
> decouple the sample window (time over which events are counted) from
> the sample period (time between interesting samples). This patchset
> modifies the Arm PMU driver to support alternating between two
> sample_period values [...]

I don't see anything ARM specific with the technique, so if it's done it should be done generically IMHO.

> Cursory testing on a Xeon(R) W-2145 sampling every 300 cycles (without
> the patch) suggests this approach would work for some counters. [...]
> On the other hand the fixed function instructions counter does not
> appear to sample correctly (it seems to report either very small or
> very large numbers). No idea what's going on there, so any insight
> welcome...

If you use precise samples with 3p there is a restriction on the periods that is enforced by the kernel. Non-precise or single/double p should support arbitrary periods, except that any p is always period + 1.

One drawback of the technique on x86 is that it won't allow multi-record PEBS (collecting samples without interrupts), so the overhead might be intrinsically higher.

-Andi
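[Editor's sketch] The systematic bias Andi describes can be seen in a toy simulation: with the same-as-previous check, a function must span at least one full sample period before two consecutive samples can land in it, so short functions never accumulate attributed counter data. Illustrative Python only, not part of the patches or thread:

```python
def attributed_pairs(trace, period):
    """trace: list of (symbol, cycles) segments executed in order.
    Samples the running symbol every `period` cycles and counts, per
    symbol, how many consecutive-sample pairs pass the
    same-as-previous check (i.e. would yield attributable data)."""
    # Expand to a per-cycle symbol timeline (fine for a toy example).
    timeline = [sym for sym, cycles in trace for _ in range(cycles)]
    samples = timeline[period - 1::period]   # symbol at each overflow
    pairs = {}
    for prev, cur in zip(samples, samples[1:]):
        if prev == cur:
            pairs[cur] = pairs.get(cur, 0) + 1
    return pairs

# 'short' runs 50 cycles between 500-cycle runs of 'long'; with a
# 300-cycle period two consecutive samples can never both land in
# 'short', so it gets zero attributed pairs despite ~9% of cycles.
trace = [("long", 500), ("short", 50)] * 20
```

This is the sense in which the heuristic under-reports short functions: their ranking (raw sample count) survives, but their metric data does not.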
Hi Andi

Thanks for commenting...

On Wed, 2024-02-14 at 01:55 -0800, Andi Kleen wrote:
> Ben Gainey <ben.gainey@arm.com> writes:
>
> > I've been working on an approach to supporting per-function metrics
> > for aarch64 cores [...]
>
> It sounds very similar to what perf script -F +metric already does
> (or did if it wasn't broken currently). It would be a straightforward
> extension here to add this "same as previous" check.

Nice, I wasn't aware of this feature. I'll have a play...

> Of course the feature is somewhat dubious in that it will have a very
> strong systematic bias against short functions, and even long
> functions in some alternating execution patterns. I assume you did
> some experiments to characterize this. It would be important to
> emphasize this in any documentation.

The way I have been thinking about this is that for each sample you always maintain a periodic sample count so that the relative ranking of functions is maintained, and that the "same as previous" check is a way to enhance the attributability of the PMU data for any given sample.

But it is absolutely correct to say that this will bias the availability of PMU data in the way you have described. The bias depends on sample window size, workload characteristics and so on.

It should be possible to provide a per-metric "valid sample" count that can be used to judge the "quality" of the metrics for each symbol, which may allow the user to make some adjustments to the recording parameters (modify the sample period, or the sample window size, for example). I'll have a think about the best way to convey this in the docs.

I have a few ideas for ways to further improve the attributability / identify cases where short functions are missed, but they'd not affect the implementation of anything in the kernel, just perhaps the tool's post-processing.

> > For this to work efficiently, it is useful to provide a means to
> > decouple the sample window (time over which events are counted)
> > from the sample period (time between interesting samples). [...]
>
> I don't see anything ARM specific with the technique, so if it's done
> it should be done generically IMHO

Great. When I was originally thinking about the implementation of the event strobing feature I was thinking:

* Add a `strobe_sample` flag bit to opt into the feature.
  - This will be mutually exclusive with `freq`.
* Add a `strobe_period` field to hold the alternate sample period (for the sample window).
* Have all PMU drivers check and reject the `strobe_sample` flag by default; the swizzling of the period will be done in the PMU driver itself if it makes sense to support this feature for a given PMU.
  - Do you think this is sensible, or would it be better handled in core? Any obvious issues with this approach?

> > Cursory testing on a Xeon(R) W-2145 sampling every 300 cycles
> > (without the patch) suggests this approach would work for some
> > counters. [...]
>
> If you use precise samples with 3p there is a restriction on the
> periods that is enforced by the kernel. Non precise or single/double
> p should support arbitrary, except that any p is always period + 1.

Is there some default value for precise? When testing I didn't set any specific value for the p modifier.

> One drawback of the technique on x86 is that it won't allow multi
> record pebs (collecting samples without interrupts), so the overhead
> might be intrinsically higher.
>
> -Andi

Sure, I think this kind of detail was why I was thinking it should be the PMU driver rather than core that handles the strobing feature, since there may be other considerations / better ways to collect the metrics samples.

Regards
Ben
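[Editor's sketch] The per-metric "valid sample" count proposed above reduces to a simple quality ratio during post-processing. An illustrative sketch (the function and its fields are hypothetical, not part of the posted patches):

```python
def sample_quality(total_samples, valid_samples):
    """Fraction of a symbol's samples whose counter deltas could be
    attributed, i.e. the previous sample landed in the same symbol.
    A low ratio suggests the sample window or period should be
    adjusted, or the symbol's metrics treated with caution."""
    if total_samples == 0:
        return 0.0
    return valid_samples / total_samples
```

As Andi notes in the follow-up, even a high ratio cannot prove the IP stayed in the function between the two samples; it is a heuristic confidence signal, not a guarantee.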
On Wed, Feb 14, 2024 at 07:13:50PM +0000, Ben Gainey wrote:
>
> Nice, I wasn't aware of this feature. I'll have a play...

You have to use an old perf version for now, still need to fix it.

> The way I have been thinking about this is that for each sample you
> always maintain a periodic sample count so that the relative ranking
> of functions is maintained, and that the "same as previous" check is
> a way to enhance the attributability of the PMU data for any given
> sample.
>
> But it is absolutely correct to say that this will bias the
> availability of PMU data in the way you have described. The bias
> depends on sample window size, workload characteristics and so on.

I would be more comfortable with it if you added some randomization on the window sizes. That would limit bias and worst case sampling error.

> It should be possible to provide a per-metric "valid sample" count
> that can be used to judge the "quality" of the metrics for each
> symbol, which may allow the user to make some adjustments to the
> recording parameters (modify the sample period, or the sample window
> size, for example).

Even that would be misleading because it assumes that the IP stayed in the same function between the two samples. But you could have something like

F1 sample
F2
F1 sample

and if you're unlucky this could happen systematically. The randomization would fight it somewhat, but even there you might be very very unlucky. The only sure way to judge it really is to run branch trace in parallel and see if it is correct.

Also there is of course the problem that on a modern core the reordering window might well be larger than your sample window, so any notion of things happening inside a short window is quite fuzzy.

> > I don't see anything ARM specific with the technique, so if it's
> > done it should be done generically IMHO
>
> Great. When I was originally thinking about the implementation of the
> event strobing feature I was thinking:
>
> * Add a `strobe_sample` flag bit to opt into the feature. [...]
>   - Do you think this is sensible, or would it be better handled in
>     core?

I would have a common function in core that is called from the PMU drivers, similar to how the adaptive period is done today.

> > If you use precise samples with 3p there is a restriction on the
> > periods that is enforced by the kernel. Non precise or single/double
> > p should support arbitrary, except that any p is always period + 1.
>
> Is there some default value for precise? When testing I didn't set
> any specific value for the p modifier.

In some cases the perf tool tries to use the highest, e.g. if you don't specify anything. If you used :p (and not :P) it should have taken the number you specified. Single :p is normally not useful because it samples the -1 IP; normal use is either two or three p. Three p is more precise but also has some restrictions on the sampling period and the counters that can be used.

-Andi
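[Editor's sketch] Andi's suggestion to randomize the window sizes could look something like the following in a tool or driver helper. Purely illustrative (the jitter range, RNG choice, and function are assumptions, not anything from the patches):

```python
import random

def jittered_window(base_window, jitter_frac=0.25, rng=random):
    """Pick a randomized sample window around base_window to limit the
    systematic bias against particular function lengths and execution
    patterns: uniform in [base*(1-f), base*(1+f)]."""
    lo = int(base_window * (1 - jitter_frac))
    hi = int(base_window * (1 + jitter_frac))
    return rng.randint(lo, hi)
```

A fresh window would be drawn for each long/short cycle, so no fixed period can stay phase-locked to an alternating execution pattern, though as noted above randomization reduces rather than eliminates the risk.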