[v2,21/24] selftests/resctrl: Read in less obvious order to defeat prefetch optimizations
Message ID | 20230418114506.46788-22-ilpo.jarvinen@linux.intel.com |
---|---|
State | New |
Series | selftests/resctrl: Fixes, cleanups, and rewritten CAT test |
Commit Message
Ilpo Järvinen
April 18, 2023, 11:45 a.m. UTC
When reading memory in order, HW prefetching optimizations will
interfere with measuring how caches and memory are being accessed. This
adds noise into the results.

Change the fill_buf reading loop to avoid an obvious in-order access
pattern: multiply the index by a prime and take it modulo the buffer
size.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
---
tools/testing/selftests/resctrl/fill_buf.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)
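A quick aside on why multiply-by-a-prime-and-modulo reads the whole buffer: the map i -> (i * m) % size is a permutation of 0..size-1 whenever m and size share no common factor, which holds for a prime m unless size happens to be a multiple of it. A minimal standalone sketch (illustrative only, not part of the patch) checking that coverage property:

```c
#include <stdio.h>
#include <stdlib.h>

#define MULT 59	/* the prime used by this version of the patch */

/* Returns 1 if i -> (i * MULT) % size visits every index exactly once. */
static int visits_all(unsigned int size)
{
	unsigned char *seen = calloc(size, 1);
	unsigned int i, hits = 0;

	for (i = 0; i < size; i++) {
		unsigned int idx = (i * MULT) % size;

		if (!seen[idx]) {
			seen[idx] = 1;
			hits++;
		}
	}
	free(seen);
	return hits == size;
}

int main(void)
{
	/* gcd(59, 1000) == 1: every index is visited. */
	printf("size=1000: %s\n", visits_all(1000) ? "full" : "partial");
	/* gcd(59, 236) == 59: only size/59 distinct indexes get read. */
	printf("size=59*4: %s\n", visits_all(59 * 4) ? "full" : "partial");
	return 0;
}
```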
Comments
Hi Ilpo,

> When reading memory in order, HW prefetching optimizations will interfere
> with measuring how caches and memory are being accessed. This adds noise
> into the results.
>
> Change the fill_buf reading loop to not use an obvious in-order access using
> multiply by a prime and modulo.
>
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
> ---
>  tools/testing/selftests/resctrl/fill_buf.c | 17 ++++++++++-------
>  1 file changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/tools/testing/selftests/resctrl/fill_buf.c b/tools/testing/selftests/resctrl/fill_buf.c
> index 7e0d3a1ea555..049a520498a9 100644
> --- a/tools/testing/selftests/resctrl/fill_buf.c
> +++ b/tools/testing/selftests/resctrl/fill_buf.c
> @@ -88,14 +88,17 @@ static void *malloc_and_init_memory(size_t s)
>
>  static int fill_one_span_read(unsigned char *start_ptr, unsigned char *end_ptr)
>  {
> -	unsigned char sum, *p;
> -
> +	unsigned int size = (end_ptr - start_ptr) / (CL_SIZE / 2);
> +	unsigned int count = size;
> +	unsigned char sum;
> +
> +	/*
> +	 * Read the buffer in an order that is unexpected by HW prefetching
> +	 * optimizations to prevent them interfering with the caching pattern.
> +	 */
>  	sum = 0;
> -	p = start_ptr;
> -	while (p < end_ptr) {
> -		sum += *p;
> -		p += (CL_SIZE / 2);
> -	}
> +	while (count--)
> +		sum += start_ptr[((count * 59) % size) * CL_SIZE / 2];

Could you please elaborate why 59 is used?

Best regards,
Shaopeng TAN
On Wed, 31 May 2023, Shaopeng Tan (Fujitsu) wrote:

> Hi Ilpo,
>
> [... quoted patch snipped, see above ...]
>
> Could you please elaborate why 59 is used?

The main reason is that it's a prime number ensuring the whole buffer gets
read. I picked something that doesn't make it wrap on almost every
iteration.
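To make that access order concrete, here is a toy-sized sketch (mine, not from the patch) printing the visit order for a 16-slot buffer; the indexes jump around, every slot is read exactly once, and the sequence only wraps every few steps rather than on every iteration:

```c
#include <stdio.h>

int main(void)
{
	unsigned int size = 16, count = size;

	/* Same indexing as the patch: (count * 59) % size, count descending. */
	while (count--)
		printf("%u ", (count * 59) % size);
	printf("\n");	/* prints: 5 10 15 4 9 14 3 8 13 2 7 12 1 6 11 0 */
	return 0;
}
```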
Hi Ilpo,

> > > [... quoted patch snipped, see above ...]
> >
> > Could you please elaborate why 59 is used?
>
> The main reason is that it's a prime number ensuring the whole buffer gets
> read. I picked something that doesn't make it wrap on almost every
> iteration.

Thanks for your explanation. It seems there is no problem.

Perhaps you have already tested this patch in your environment and got a
test result of "ok". Because HW prefetching does not work well, the IMC
counter fluctuates a lot in my environment, and the test result is
"not ok".

In order to ensure this test set runs in any environment and gets "ok",
would you consider changing the value of MAX_DIFF_PERCENT of each test?
Or changing something else?

```
Environment:
Kernel: 6.4.0-rc2
CPU: Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz

Test result (MBM as an example):
# # Starting MBM BW change ...
# # Mounting resctrl to "/sys/fs/resctrl"
# # Benchmark PID: 8671
# # Writing benchmark parameters to resctrl FS
# # Write schema "MB:0=100" to resctrl FS
# # Checking for pass/fail
# # Fail: Check MBM diff within 5%
# # avg_diff_per: 9%
# # Span in bytes: 262144000
# # avg_bw_imc: 6202
# # avg_bw_resc: 5585
# not ok 1 MBM: bw change
```

Best regards,
Shaopeng TAN
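For reference, the reported 9% is consistent with the bandwidth numbers in the log above, assuming the test derives the percentage as |avg_bw_imc - avg_bw_resc| / avg_bw_imc with integer truncation (my reading of the selftest, stated here as an assumption):

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	long avg_bw_imc = 6202, avg_bw_resc = 5585;	/* values from the log above */
	long avg_diff_per = labs(avg_bw_imc - avg_bw_resc) * 100 / avg_bw_imc;

	/* 617 * 100 / 6202 = 9 after truncation, exceeding the 5% limit. */
	printf("avg_diff_per: %ld%%\n", avg_diff_per);
	return 0;
}
```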
On Thu, 1 Jun 2023, Shaopeng Tan (Fujitsu) wrote:

> > > [... quoted patch snipped, see above ...]
> > >
> > > Could you please elaborate why 59 is used?
> >
> > The main reason is that it's a prime number ensuring the whole buffer gets
> > read. I picked something that doesn't make it wrap on almost every
> > iteration.
>
> Thanks for your explanation. It seems there is no problem.
>
> Perhaps you have already tested this patch in your environment and got a
> test result of "ok".

Yes, it was tested :-) and all looked fine here. But my testing was more
focused on the systems which come with CAT, and on all those this change
clearly improved MBA/MBM results (they became almost always diff=0, except
for the smallest ones in the MBA test).

> Because HW prefetching does not work well,
> the IMC counter fluctuates a lot in my environment,
> and the test result is "not ok".
>
> In order to ensure this test set runs in any environment and gets "ok",
> would you consider changing the value of MAX_DIFF_PERCENT of each test?
> Or changing something else?
>
> [... MBM test log snipped, see above ...]

Oh, I see. It seems that these CPUs break the trend and get much worse and
more unstable for some reason. It might be that some i9 I recently got an
lkp report about has the same problem. I'll look more into this; thanks a
lot for testing and bringing it up.

So to answer your question above, I have no intention to tweak
MAX_DIFF_PERCENT because of this issue; I'll instead try to improve the
approach to defeat the HW prefetcher. If the HW prefetcher is not
defeated, the CAT test LLC misses show a slowly converging ramp which is
not very useful unless the number of runs is increased by much (and
perhaps the first samples dropped entirely). So defeating the prefetcher
is needed, and it would be nice if a non-HW-specific approach could be
used for this.

It will probably take some time... Should I send a v3 with only the fixes
and useful refactors at the head of this series while I try to sort these
problems with the test changes out?
Hi Ilpo,

On 6/2/2023 6:51 AM, Ilpo Järvinen wrote:
> It will probably take some time... Should I send a v3 with only the fixes
> and useful refactors at the head of this series while I try to sort these
> problems with the test changes out?

This sounds good to me. This series is already at 24 patches, so I think
that splitting the redesign of the CAT test from the other fixes would
indeed make this work easier to parse.

Thank you
Reinette
On Thu, 1 Jun 2023, Shaopeng Tan (Fujitsu) wrote:

> > > [... quoted patch snipped, see above ...]
> > >
> > > Could you please elaborate why 59 is used?
> >
> > The main reason is that it's a prime number ensuring the whole buffer gets
> > read. I picked something that doesn't make it wrap on almost every
> > iteration.
>
> Thanks for your explanation. It seems there is no problem.
>
> [... discussion and MBM test log snipped, see above ...]

Could you try if the approach below works better? (I think it should apply
cleanly on top of the fixes+cleanups v3 series which you recently tested;
no need to have the other CAT test changes.)

The biggest differences in terms of result stability in my tests come
from these factors:
- Removed the reversed index order.
- Open-coded the modulo in the loop as subtraction.

In addition, I changed the prime to one which works slightly better than
59. The MBA/MBM results were already <5% with 59 too, due to the other
two changes, but using 23 lowered them further in my tests (with a
Platinum 8260L).
---
From: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

[PATCH] selftests/resctrl: Read in less obvious order to defeat prefetch optimizations

When reading memory in order, HW prefetching optimizations will interfere
with measuring how caches and memory are being accessed. This adds noise
into the results.

Change the fill_buf reading loop to not use an obvious in-order access
using multiply by a prime and modulo.

Using a prime multiplier with modulo ensures the entire buffer is
eventually read. 23 is small enough that the reads are spread out but
wrapping does not occur very frequently (wrapping too often can trigger
L2 hits more frequently which causes noise to the test because getting
the data from LLC is not required).

It was discovered that not all primes work equally well and some can
cause wildly unstable results (e.g., in an earlier version of this patch,
the reads were done in reversed order and 59 was used as the prime,
resulting in unacceptably high and unstable results in the MBA and MBM
tests on some architectures).

Link: https://lore.kernel.org/linux-kselftest/TYAPR01MB6330025B5E6537F94DA49ACB8B499@TYAPR01MB6330.jpnprd01.prod.outlook.com/
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
---
 tools/testing/selftests/resctrl/fill_buf.c | 38 +++++++++++++++++++++++-------
 1 file changed, 30 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/resctrl/fill_buf.c b/tools/testing/selftests/resctrl/fill_buf.c
index f9893edda869..afde37d3fca0 100644
--- a/tools/testing/selftests/resctrl/fill_buf.c
+++ b/tools/testing/selftests/resctrl/fill_buf.c
@@ -74,16 +74,38 @@ static void *malloc_and_init_memory(size_t buf_size)
 	return p;
 }
 
+/*
+ * Buffer index step advance to workaround HW prefetching interfering with
+ * the measurements.
+ *
+ * Must be a prime to step through all indexes of the buffer.
+ *
+ * Some primes work better than others on some architectures (from MBA/MBM
+ * result stability point of view).
+ */
+#define FILL_IDX_MULT	23
+
 static int fill_one_span_read(unsigned char *buf, size_t buf_size)
 {
-	unsigned char *end_ptr = buf + buf_size;
-	unsigned char sum, *p;
-
-	sum = 0;
-	p = buf;
-	while (p < end_ptr) {
-		sum += *p;
-		p += (CL_SIZE / 2);
+	unsigned int size = buf_size / (CL_SIZE / 2);
+	unsigned int i, idx = 0;
+	unsigned char sum = 0;
+
+	/*
+	 * Read the buffer in an order that is unexpected by HW prefetching
+	 * optimizations to prevent them interfering with the caching pattern.
+	 *
+	 * The read order is (in terms of halves of cachelines):
+	 *	i * FILL_IDX_MULT % size
+	 * The formula is open-coded below to avoid modulo inside the loop as
+	 * it improves MBA/MBM result stability on some architectures.
+	 */
+	for (i = 0; i < size; i++) {
+		sum += buf[idx * (CL_SIZE / 2)];
+
+		idx += FILL_IDX_MULT;
+		while (idx >= size)
+			idx -= size;
 	}
 
 	return sum;
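The subtraction loop in the hunk above is a strength-reduced modulo: because idx only grows by FILL_IDX_MULT per iteration, at most one subtraction (a few, if size < FILL_IDX_MULT) brings it back into range. A small standalone check of that equivalence (my sketch, not part of the patch):

```c
#include <assert.h>
#include <stdio.h>

#define FILL_IDX_MULT	23

int main(void)
{
	unsigned int size = 100, idx = 0, ref = 0, i;

	for (i = 0; i < 10 * size; i++) {
		/* Open-coded variant from the patch. */
		idx += FILL_IDX_MULT;
		while (idx >= size)
			idx -= size;

		/* Plain modulo for comparison. */
		ref = (ref + FILL_IDX_MULT) % size;
		assert(idx == ref);
	}
	printf("open-coded wrap matches %% over %u steps\n", i);
	return 0;
}
```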
Hi Ilpo,

> On Thu, 1 Jun 2023, Shaopeng Tan (Fujitsu) wrote:
> > [... quoted patch and earlier discussion snipped, see above ...]
>
> Could you try if the approach below works better (I think it should apply
> cleanly on top of the fixes+cleanups v3 series which you recently tested,
> no need to have the other CAT test changes).

I ran the test set several times.
MBA and MBM seem fine, but CAT is always "not ok".
The result is below.

---
$ sudo make -C tools/testing/selftests/resctrl run_tests
make: Entering directory '/**/tools/testing/selftests/resctrl'
TAP version 13
1..1
# selftests: resctrl: resctrl_tests
# TAP version 13
# # Pass: Check kernel supports resctrl filesystem
# # Pass: Check resctrl mountpoint "/sys/fs/resctrl" exists
# # resctrl filesystem not mounted
# # dmesg: [    3.658029] resctrl: L3 allocation detected
# # dmesg: [    3.658420] resctrl: MB allocation detected
# # dmesg: [    3.658604] resctrl: L3 monitoring detected
# 1..4
# # Starting MBM BW change ...
# # Mounting resctrl to "/sys/fs/resctrl"
# # Benchmark PID: 9555
# # Writing benchmark parameters to resctrl FS
# # Write schema "MB:0=100" to resctrl FS
# # Checking for pass/fail
# # Pass: Check MBM diff within 5%
# # avg_diff_per: 0%
# # Span (MB): 250
# # avg_bw_imc: 6880
# # avg_bw_resc: 6895
# ok 1 MBM: bw change
# # Starting MBA Schemata change ...
# # Mounting resctrl to "/sys/fs/resctrl"
# # Benchmark PID: 9561
# # Writing benchmark parameters to resctrl FS
# # Write schema "MB:0=100" to resctrl FS
# # Write schema "MB:0=90" to resctrl FS
# # Write schema "MB:0=80" to resctrl FS
# # Write schema "MB:0=70" to resctrl FS
# # Write schema "MB:0=60" to resctrl FS
# # Write schema "MB:0=50" to resctrl FS
# # Write schema "MB:0=40" to resctrl FS
# # Write schema "MB:0=30" to resctrl FS
# # Write schema "MB:0=20" to resctrl FS
# # Write schema "MB:0=10" to resctrl FS
# # Results are displayed in (MB)
# # Pass: Check MBA diff within 5% for schemata 100
# # avg_diff_per: 0%
# # avg_bw_imc: 6874
# # avg_bw_resc: 6904
# # Pass: Check MBA diff within 5% for schemata 90
# # avg_diff_per: 1%
# # avg_bw_imc: 6738
# # avg_bw_resc: 6807
# # Pass: Check MBA diff within 5% for schemata 80
# # avg_diff_per: 1%
# # avg_bw_imc: 6735
# # avg_bw_resc: 6803
# # Pass: Check MBA diff within 5% for schemata 70
# # avg_diff_per: 1%
# # avg_bw_imc: 6702
# # avg_bw_resc: 6770
# # Pass: Check MBA diff within 5% for schemata 60
# # avg_diff_per: 1%
# # avg_bw_imc: 6632
# # avg_bw_resc: 6725
# # Pass: Check MBA diff within 5% for schemata 50
# # avg_diff_per: 1%
# # avg_bw_imc: 6510
# # avg_bw_resc: 6635
# # Pass: Check MBA diff within 5% for schemata 40
# # avg_diff_per: 2%
# # avg_bw_imc: 6206
# # avg_bw_resc: 6342
# # Pass: Check MBA diff within 5% for schemata 30
# # avg_diff_per: 1%
# # avg_bw_imc: 3826
# # avg_bw_resc: 3896
# # Pass: Check MBA diff within 5% for schemata 20
# # avg_diff_per: 1%
# # avg_bw_imc: 2820
# # avg_bw_resc: 2862
# # Pass: Check MBA diff within 5% for schemata 10
# # avg_diff_per: 1%
# # avg_bw_imc: 1876
# # avg_bw_resc: 1898
# # Pass: Check schemata change using MBA
# ok 2 MBA: schemata change
# # Starting CMT test ...
# # Mounting resctrl to "/sys/fs/resctrl"
# # Cache size :25952256
# # Benchmark PID: 9573
# # Writing benchmark parameters to resctrl FS
# # Checking for pass/fail
# # Pass: Check cache miss rate within 15%
# # Percent diff=10
# # Number of bits: 5
# # Average LLC val: 12994560
# # Cache span (bytes): 11796480
# ok 3 CMT: test
# # Starting CAT test ...
# # Mounting resctrl to "/sys/fs/resctrl"
# # Cache size :25952256
# # Writing benchmark parameters to resctrl FS
# # Write schema "L3:0=3f" to resctrl FS
# # Checking for pass/fail
# # Fail: Check cache miss rate within 4%
# # Percent diff=24
# # Number of bits: 6
# # Average LLC val: 275418
# # Cache span (lines): 221184
# not ok 4 CAT: test
# # Totals: pass:3 fail:1 xfail:0 xpass:0 skip:0 error:0
not ok 1 selftests: resctrl: resctrl_tests # exit=1
make: Leaving directory '/**/tools/testing/selftests/resctrl'
---

Best regards,
Shaopeng TAN
On Fri, 16 Jun 2023, Shaopeng Tan (Fujitsu) wrote:

> Hi Ilpo,
>
> > [... quoted patch and earlier discussion snipped, see above ...]
> >
> > Could you try if the approach below works better (I think it should apply
> > cleanly on top of the fixes+cleanups v3 series which you recently tested,
> > no need to have the other CAT test changes).
>
> I ran the test set several times.
> MBA and MBM seem fine, but CAT is always "not ok".
> The result is below.

Ok, thanks a lot for confirming. I was just interested to see the MBA/MBM
test results.

I'm not surprised the old CAT test is failing with "not ok"; I see that
occurring quite often with the old CAT test. It is one of the reasons why
it is being rewritten, although the main motivator is that the old one
doesn't really even test CAT because it flushes the LLC and reads the
buffer only once after the flush. The rewritten CAT test should work
better in this regard, but it was not part of the fixes+cleanups series
(v3) + this patch.
diff --git a/tools/testing/selftests/resctrl/fill_buf.c b/tools/testing/selftests/resctrl/fill_buf.c
index 7e0d3a1ea555..049a520498a9 100644
--- a/tools/testing/selftests/resctrl/fill_buf.c
+++ b/tools/testing/selftests/resctrl/fill_buf.c
@@ -88,14 +88,17 @@ static void *malloc_and_init_memory(size_t s)
 
 static int fill_one_span_read(unsigned char *start_ptr, unsigned char *end_ptr)
 {
-	unsigned char sum, *p;
-
+	unsigned int size = (end_ptr - start_ptr) / (CL_SIZE / 2);
+	unsigned int count = size;
+	unsigned char sum;
+
+	/*
+	 * Read the buffer in an order that is unexpected by HW prefetching
+	 * optimizations to prevent them interfering with the caching pattern.
+	 */
 	sum = 0;
-	p = start_ptr;
-	while (p < end_ptr) {
-		sum += *p;
-		p += (CL_SIZE / 2);
-	}
+	while (count--)
+		sum += start_ptr[((count * 59) % size) * CL_SIZE / 2];
 
 	return sum;
 }