Message ID | cover.1695704179.git.yu.c.chen@intel.com |
---|---|
Headers |
From: Chen Yu <yu.c.chen@intel.com>
To: Peter Zijlstra <peterz@infradead.org>, Mathieu Desnoyers <mathieu.desnoyers@efficios.com>, Ingo Molnar <mingo@redhat.com>, Vincent Guittot <vincent.guittot@linaro.org>, Juri Lelli <juri.lelli@redhat.com>
Cc: Tim Chen <tim.c.chen@intel.com>, Aaron Lu <aaron.lu@intel.com>, Dietmar Eggemann <dietmar.eggemann@arm.com>, Steven Rostedt <rostedt@goodmis.org>, Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>, Daniel Bristot de Oliveira <bristot@redhat.com>, Valentin Schneider <vschneid@redhat.com>, K Prateek Nayak <kprateek.nayak@amd.com>, "Gautham R. Shenoy" <gautham.shenoy@amd.com>, linux-kernel@vger.kernel.org, Chen Yu <yu.chen.surf@gmail.com>, Chen Yu <yu.c.chen@intel.com>
Subject: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup
Date: Tue, 26 Sep 2023 13:10:21 +0800
Message-Id: <cover.1695704179.git.yu.c.chen@intel.com>
List-ID: <linux-kernel.vger.kernel.org> |
Series |
Introduce SIS_CACHE to choose previous CPU during task wakeup
|
|
Message
Chen Yu
Sept. 26, 2023, 5:10 a.m. UTC
RFC -> v1:
- Drop RFC.
- Only record the short sleeping time for each task, to better honor
  burst-sleeping tasks. (Mathieu Desnoyers)
- Keep the forward movement monotonic for the runqueue's cache-hot
  timeout value. (Mathieu Desnoyers, Aaron Lu)
- Introduce a new helper function cache_hot_cpu() that considers
  rq->cache_hot_timeout. (Aaron Lu)
- Add analysis of why inhibiting task migration could bring better
  throughput for some benchmarks. (Gautham R. Shenoy)
- Choose the first cache-hot CPU if all idle CPUs are cache-hot in
  select_idle_cpu(), to avoid possible task stacking on the waker's
  CPU. (K Prateek Nayak)

Thanks for your comments and review!

----------------------------------------------------------------------

When task p is woken up, the scheduler leverages select_idle_sibling()
to find an idle CPU for it. p's previous CPU is usually preferred
because it can improve cache locality. In many cases, however, the
previous CPU has already been taken by other wakees, so p has to find
another idle CPU.

Inhibiting task migration while preserving the scheduler's work
conservation could benefit many workloads. Inspired by Mathieu's
proposal to limit the task migration ratio[1], this series considers
the task's average sleep duration. If the task is a short sleeper, its
previous CPU is tagged as cache-hot for a short while. During this
reservation period, other wakees are not allowed to pick this idle CPU
until a timeout. If the task is woken up again soon, it finds its
previous CPU still idle and chooses it in select_idle_sibling().

This series is based on tip/sched/core, on top of commit afc1996859a2
("sched/fair: Ratelimit update to tg->load_avg"). That commit has
already significantly reduced the cost of task migration; SIS_CACHE
reduces it further. SIS_CACHE shows a noticeable throughput
improvement in netperf/tbench at around 100% load.

[patch 1/2] records the task's average short sleeping time in its
sched_entity structure.
[patch 2/2] introduces SIS_CACHE to skip cache-hot idle CPUs during
wakeup.

Link: https://lore.kernel.org/lkml/20230905171105.1005672-2-mathieu.desnoyers@efficios.com/ #1

Chen Yu (2):
  sched/fair: Record the short sleeping time of a task
  sched/fair: skip the cache hot CPU in select_idle_cpu()

 include/linux/sched.h   |  3 ++
 kernel/sched/fair.c     | 86 +++++++++++++++++++++++++++++++++++++++--
 kernel/sched/features.h |  1 +
 kernel/sched/sched.h    |  1 +
 4 files changed, 87 insertions(+), 4 deletions(-)
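The reservation scheme described above can be sketched in plain C. This is a hedged userspace simulation, not the kernel implementation: the names cache_hot_cpu() and rq->cache_hot_timeout come from the changelog, but the struct layouts, the SHORT_SLEEP_NS threshold, and the averaging rule below are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define SHORT_SLEEP_NS 1000000ULL  /* assumed threshold for a "short" sleep: 1 ms */

struct task {
    uint64_t avg_short_sleep_ns; /* running average of recent short sleeps */
    int prev_cpu;
};

struct runqueue {
    uint64_t cache_hot_timeout;  /* absolute time until this CPU is "reserved" */
};

/* On dequeue (sleep start): a short sleeper reserves its runqueue for
 * roughly one expected sleep period. */
static void task_sleep(struct runqueue *rq, struct task *p, uint64_t now)
{
    if (p->avg_short_sleep_ns && p->avg_short_sleep_ns < SHORT_SLEEP_NS) {
        uint64_t until = now + p->avg_short_sleep_ns;
        /* keep the timeout moving monotonically forward, per the changelog */
        if (until > rq->cache_hot_timeout)
            rq->cache_hot_timeout = until;
    }
}

/* During another task's wakeup scan: is this idle CPU still reserved
 * ("cache hot") for its previous occupant? */
static int cache_hot_cpu(const struct runqueue *rq, uint64_t now)
{
    return now < rq->cache_hot_timeout;
}

/* On wakeup: fold the observed sleep into the average, but only record
 * short sleeps, so burst-sleeping tasks are honored. */
static void task_wakeup(struct task *p, uint64_t slept_ns)
{
    if (slept_ns < SHORT_SLEEP_NS)
        p->avg_short_sleep_ns = (p->avg_short_sleep_ns + slept_ns) / 2;
}
```

A sleep with avg_short_sleep_ns = 500 us reserves the CPU until now + 500 us; a later, shorter reservation attempt does not move the timeout backward.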
Comments
* Chen Yu <yu.c.chen@intel.com> wrote:

> When task p is woken up, the scheduler leverages select_idle_sibling()
> to find an idle CPU for it. p's previous CPU is usually a preference
> because it can improve cache locality. However in many cases, the
> previous CPU has already been taken by other wakees, thus p has to
> find another idle CPU.
>
> Inhibit the task migration while keeping the work conservation of
> scheduler could benefit many workloads. Inspired by Mathieu's
> proposal to limit the task migration ratio[1], this patch considers
> the task average sleep duration. If the task is a short sleeping one,
> then tag its previous CPU as cache hot for a short while. During this
> reservation period, other wakees are not allowed to pick this idle CPU
> until a timeout. Later if the task is woken up again, it can find its
> previous CPU still idle, and choose it in select_idle_sibling().

Yeah, so I'm not convinced about this at this stage.

By allowing a task to basically hog a CPU after it has gone idle already,
however briefly, we reduce resource utilization efficiency for the sake
of singular benchmark workloads.

In a mixed environment the cost of leaving CPUs idle longer than necessary
will show up - and none of these benchmarks show that kind of side effect
and indirect overhead.

This feature would be a lot more convincing if it tried to measure overhead
in the pathological case, not the case it's been written for.

Thanks,

	Ingo
On Wed, 2023-09-27 at 10:00 +0200, Ingo Molnar wrote:
> * Chen Yu <yu.c.chen@intel.com> wrote:
>
> [..snip..]
>
> Yeah, so I'm not convinced about this at this stage.
>
> By allowing a task to basically hog a CPU after it has gone idle already,
> however briefly, we reduce resource utilization efficiency for the sake
> of singular benchmark workloads.
>
> In a mixed environment the cost of leaving CPUs idle longer than necessary
> will show up - and none of these benchmarks show that kind of side effect
> and indirect overhead.
>
> This feature would be a lot more convincing if it tried to measure overhead
> in the pathological case, not the case it's been written for.

Ingo,

Mathieu's patches on detecting overly high task migrations and then rate
limiting migration are a way to detect that tasks are playing CPU musical
chairs and are in a pathological state. Will the migration rate be a
reasonable indicator that we need to do something to reduce pathological
migrations, as the SIS_CACHE proposal does, so the tasks don't get jerked
all over? Or do you have some other, better indicators in mind?

We did some experiments with an OLTP workload on a 112 core, 2 socket SPR
machine. The OLTP workload has a mixture of threads handling database
updates on disks and handling transaction queries over the network. For
Mathieu's original task migration rate limit patches we saw a 1.2%
improvement, and for Chen Yu's SIS_CACHE proposal we saw a 0.7%
improvement. The system runs at ~94% busy, so it is under high
utilization. The variation of this workload is less than 0.2%. There are
improvements for such a mixed workload, though not as much as for the
microbenchmarks. These data are preliminary and we are still doing more
experiments.

For the OLTP experiments, each socket's 64 cores are divided into
sub-NUMA clusters of 4 nodes of 16 cores each, so the scheduling overhead
of the idle CPU search is much less than if SNC were off.

Thanks.

Tim
Hi Ingo,

On 2023-09-27 at 10:00:11 +0200, Ingo Molnar wrote:
> [..snip..]
>
> Yeah, so I'm not convinced about this at this stage.
>
> By allowing a task to basically hog a CPU after it has gone idle already,
> however briefly, we reduce resource utilization efficiency for the sake
> of singular benchmark workloads.

Currently the code does not really reserve the idle CPU or force it to
stay idle. We just suggest a search order to other wakees when they look
for an idle CPU. If all idle CPUs are in the reserved state, the first
reserved idle CPU is picked up rather than left idle, so the idle CPU
resource can still be fully utilized. The main impact, if I understand
correctly, is wakeup latency. Let me run the latest schbench and monitor
those latency statistics in detail.

> In a mixed environment the cost of leaving CPUs idle longer than necessary
> will show up - and none of these benchmarks show that kind of side effect
> and indirect overhead.
>
> This feature would be a lot more convincing if it tried to measure overhead
> in the pathological case, not the case it's been written for.

Thanks for the suggestion, Ingo. Yes, we should launch more tests to
evaluate this proposal. As Tim mentioned, we have previously tested it
using an OLTP benchmark, as described in PATCH [2/2]. I'm thinking of
running more benchmarks to get a wider understanding of how this change
would impact them, both the positive and the negative parts.

thanks,
Chenyu
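The fallback Chen Yu describes, where a reserved CPU is still handed out rather than left idle when no other idle CPU exists, can be illustrated with a small sketch. This is illustrative userspace code under assumed names (select_idle_cpu_sketch is not the kernel function), showing only the search-order change: cache-hot idle CPUs are skipped but remembered, and the first one is used as a last resort to preserve work conservation.

```c
#include <assert.h>

/* Return the first idle CPU that is not reserved ("cache hot"); if all
 * idle CPUs are reserved, return the first reserved idle CPU rather
 * than leaving it idle; -1 if no CPU is idle at all. */
static int select_idle_cpu_sketch(const int *idle, const int *cache_hot, int ncpu)
{
    int first_hot = -1;

    for (int cpu = 0; cpu < ncpu; cpu++) {
        if (!idle[cpu])
            continue;
        if (cache_hot[cpu]) {
            if (first_hot < 0)
                first_hot = cpu;   /* remember it, keep scanning */
            continue;
        }
        return cpu;                /* idle and not reserved */
    }
    /* all idle CPUs were cache-hot: still pick one (work conservation) */
    return first_hot;
}
```

This is also why the design avoids task stacking on the waker's CPU: the wakee still lands on an idle CPU, just a reserved one, instead of queueing behind the waker.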
Hello Chenyu,

On 9/26/2023 10:40 AM, Chen Yu wrote:
> RFC -> v1:
> [..snip..]
>
> Thanks for your comments and review!

Sorry for the delay! I'll leave the test results from a 3rd Generation
EPYC system below.

tl;dr

- Small regression in tbench and netperf, possibly due to more searching
  for an idle CPU.

- Small regression in schbench (old) at 256 workers, albeit with large
  run-to-run variance.

- Other benchmarks are more or less the same.

I'll leave the full results below.

o System details

- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- Boost enabled, C2 Disabled (POLL and MWAIT based C1 remained enabled)

o Kernel Details

- tip:       tip:sched/core at commit 5fe7765997b1 ("sched/deadline: Make
             dl_rq->pushable_dl_tasks update drive dl_rq->overloaded")
- SIS_CACHE: tip + this series

o Benchmark results

==================================================================
Test          : hackbench
Units         : Normalized time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
Case:          tip[pct imp](CV)      SIS_CACHE[pct imp](CV)
 1-groups     1.00 [ -0.00]( 2.36)     1.01 [ -1.47]( 3.02)
 2-groups     1.00 [ -0.00]( 2.35)     0.99 [  0.92]( 1.01)
 4-groups     1.00 [ -0.00]( 1.79)     0.98 [  2.34]( 0.63)
 8-groups     1.00 [ -0.00]( 0.84)     0.98 [  1.73]( 1.02)
16-groups     1.00 [ -0.00]( 2.39)     0.97 [  2.76]( 2.33)

==================================================================
Test          : tbench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:       tip[pct imp](CV)      SIS_CACHE[pct imp](CV)
    1         1.00 [  0.00]( 0.86)     0.97 [ -2.68]( 0.74)
    2         1.00 [  0.00]( 0.99)     0.98 [ -2.18]( 0.17)
    4         1.00 [  0.00]( 0.49)     0.98 [ -2.47]( 1.15)
    8         1.00 [  0.00]( 0.96)     0.96 [ -3.81]( 0.24)
   16         1.00 [  0.00]( 1.38)     0.96 [ -4.33]( 1.31)
   32         1.00 [  0.00]( 1.64)     0.95 [ -4.70]( 1.59)
   64         1.00 [  0.00]( 0.92)     0.97 [ -2.97]( 0.49)
  128         1.00 [  0.00]( 0.57)     0.99 [ -1.15]( 0.57)
  256         1.00 [  0.00]( 0.38)     1.00 [  0.03]( 0.79)
  512         1.00 [  0.00]( 0.04)     1.00 [  0.43]( 0.34)
 1024         1.00 [  0.00]( 0.20)     1.00 [  0.41]( 0.13)

==================================================================
Test          : stream-10
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:          tip[pct imp](CV)      SIS_CACHE[pct imp](CV)
 Copy         1.00 [  0.00]( 2.52)     0.93 [ -6.90]( 6.75)
Scale         1.00 [  0.00]( 6.38)     0.99 [ -1.18]( 7.45)
  Add         1.00 [  0.00]( 6.54)     0.97 [ -2.55]( 7.34)
Triad         1.00 [  0.00]( 5.18)     0.95 [ -4.64]( 6.81)

==================================================================
Test          : stream-100
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:          tip[pct imp](CV)      SIS_CACHE[pct imp](CV)
 Copy         1.00 [  0.00]( 0.74)     1.00 [ -0.20]( 1.69)
Scale         1.00 [  0.00]( 6.25)     1.03 [  3.46]( 0.55)
  Add         1.00 [  0.00]( 6.53)     1.05 [  4.58]( 0.43)
Triad         1.00 [  0.00]( 5.14)     0.98 [ -1.78]( 6.24)

==================================================================
Test          : netperf
Units         : Normalized Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:       tip[pct imp](CV)      SIS_CACHE[pct imp](CV)
  1-clients    1.00 [  0.00]( 0.27)     0.98 [ -1.50]( 0.14)
  2-clients    1.00 [  0.00]( 1.32)     0.98 [ -2.35]( 0.54)
  4-clients    1.00 [  0.00]( 0.40)     0.98 [ -2.35]( 0.56)
  8-clients    1.00 [  0.00]( 0.97)     0.97 [ -2.72]( 0.50)
 16-clients    1.00 [  0.00]( 0.54)     0.96 [ -3.92]( 0.86)
 32-clients    1.00 [  0.00]( 1.38)     0.97 [ -3.10]( 0.44)
 64-clients    1.00 [  0.00]( 1.78)     0.97 [ -3.44]( 1.70)
128-clients    1.00 [  0.00]( 1.09)     0.94 [ -5.75]( 2.67)
256-clients    1.00 [  0.00]( 4.45)     0.97 [ -2.61]( 4.93)
512-clients    1.00 [  0.00](54.70)     0.98 [ -1.64](55.09)

==================================================================
Test          : schbench
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers:      tip[pct imp](CV)      SIS_CACHE[pct imp](CV)
    1         1.00 [ -0.00]( 3.95)     0.97 [  2.56](10.42)
    2         1.00 [ -0.00]( 5.89)     0.83 [ 16.67](22.56)
    4         1.00 [ -0.00](14.28)     1.00 [ -0.00](14.75)
    8         1.00 [ -0.00]( 4.90)     0.84 [ 15.69]( 6.01)
   16         1.00 [ -0.00]( 4.15)     1.00 [ -0.00]( 4.41)
   32         1.00 [ -0.00]( 5.10)     1.01 [ -1.10]( 3.44)
   64         1.00 [ -0.00]( 2.69)     1.04 [ -3.72]( 2.57)
  128         1.00 [ -0.00]( 2.63)     0.94 [  6.29]( 2.55)
  256         1.00 [ -0.00](26.75)     1.51 [-50.57](11.40)
  512         1.00 [ -0.00]( 2.93)     0.96 [  3.52]( 3.56)

==================================================================
Test          : ycsb-cassandra
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
Metric         tip       SIS_CACHE(pct imp)
Throughput     1.00      1.00 (%diff: 0.27%)

==================================================================
Test          : ycsb-mongodb
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
Metric         tip       SIS_CACHE(pct imp)
Throughput     1.00      1.00 (%diff: -0.45%)

==================================================================
Test          : DeathStarBench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
Pinning   scaling   tip      SIS_CACHE(pct imp)
1CCD      1         1.00     1.00 (%diff: -0.47%)
2CCD      2         1.00     0.98 (%diff: -2.34%)
4CCD      4         1.00     1.00 (%diff: -0.29%)
8CCD      8         1.00     1.01 (%diff: 0.54%)

> [..snip..]

--
Thanks and Regards,
Prateek
Hi Prateek,

On 2023-10-05 at 11:52:13 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
>
> [..snip..]
>
> Sorry for the delay! I'll leave the test results from a 3rd Generation
> EPYC system below.
>
> tl;dr
>
> - Small regression in tbench and netperf, possibly due to more searching
>   for an idle CPU.
>
> - Small regression in schbench (old) at 256 workers, albeit with large
>   run-to-run variance.
>
> - Other benchmarks are more or less the same.
>
> Test          : schbench
> Units         : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic     : Median
> ==================================================================
> #workers:      tip[pct imp](CV)      SIS_CACHE[pct imp](CV)
>     1         1.00 [ -0.00]( 3.95)     0.97 [  2.56](10.42)
>     2         1.00 [ -0.00]( 5.89)     0.83 [ 16.67](22.56)
>     4         1.00 [ -0.00](14.28)     1.00 [ -0.00](14.75)
>     8         1.00 [ -0.00]( 4.90)     0.84 [ 15.69]( 6.01)
>    16         1.00 [ -0.00]( 4.15)     1.00 [ -0.00]( 4.41)
>    32         1.00 [ -0.00]( 5.10)     1.01 [ -1.10]( 3.44)
>    64         1.00 [ -0.00]( 2.69)     1.04 [ -3.72]( 2.57)
>   128         1.00 [ -0.00]( 2.63)     0.94 [  6.29]( 2.55)
>   256         1.00 [ -0.00](26.75)     1.51 [-50.57](11.40)

Thanks for the testing. The latency regression in schbench is quite
obvious and, as you mentioned, possibly due to a longer scan time in
select_idle_cpu(). I'll run the same test on a split-LLC system to see
whether I can reproduce the issue.

I'm also working with Mathieu on another direction: choosing the
previous CPU over the current CPU when the system is overloaded. That
should be more moderate, and I'll post the test results later.

thanks,
Chenyu
Hi Chen Yu,

On 26/09/23 10:40, Chen Yu wrote:
> RFC -> v1:
> [..snip..]
>
> Thanks for your comments and review!
>
> ----------------------------------------------------------------------

Regarding the trade-off between a longer scan for an idle CPU and the
cache benefits, I ran some benchmarks.

Tested the patch on a Power system with 12 cores, for a total of 96
CPUs. The system has two NUMA nodes.

Below are some of the benchmark results.

schbench 99.0th latency (lower is better)
========
case      load          baseline[pct imp](std%)    SIS_CACHE[pct imp](std%)
normal    1-mthreads    1.00 [  0.00]( 3.66)       1.00 [    0.00]( 1.71)
normal    2-mthreads    1.00 [  0.00]( 4.55)       1.02 [   -2.00]( 3.00)
normal    4-mthreads    1.00 [  0.00]( 4.77)       0.96 [   +4.00]( 4.27)
normal    6-mthreads    1.00 [  0.00](60.37)       2.66 [ -166.00](23.67)

The schbench results show that there is not much impact on wakeup
latencies from the extra iterations in the search for an idle CPU in
the select_idle_cpu() code path; interestingly, the numbers are
slightly better for SIS_CACHE in the 4-mthreads case. I think we can
ignore the last case due to huge run-to-run variations.

producer_consumer avg time/access (lower is better)
========
loads per consumer iteration    baseline[pct imp](std%)    SIS_CACHE[pct imp](std%)
  5                             1.00 [  0.00]( 0.00)       0.87 [ +13.0]( 1.92)
 20                             1.00 [  0.00]( 0.00)       0.92 [ +8.00]( 0.00)
 50                             1.00 [  0.00]( 0.00)       1.00 [  0.00]( 0.00)
100                             1.00 [  0.00]( 0.00)       1.00 [  0.00]( 0.00)

The main goal of the patch, improving cache locality, is reflected in
this workload, where SIS_CACHE improves mainly when the number of loads
per consumer iteration is lower.

hackbench normalized time in seconds (lower is better)
========
case               load        baseline[pct imp](std%)    SIS_CACHE[pct imp](std%)
process-pipe       1-groups    1.00 [  0.00]( 1.50)       1.02 [ -2.00]( 3.36)
process-pipe       2-groups    1.00 [  0.00]( 4.76)       0.99 [ +1.00]( 5.68)
process-sockets    1-groups    1.00 [  0.00]( 2.56)       1.00 [  0.00]( 0.86)
process-sockets    2-groups    1.00 [  0.00]( 0.50)       0.99 [ +1.00]( 0.96)
threads-pipe       1-groups    1.00 [  0.00]( 3.87)       0.71 [ +29.0]( 3.56)
threads-pipe       2-groups    1.00 [  0.00]( 1.60)       0.97 [ +3.00]( 3.44)
threads-sockets    1-groups    1.00 [  0.00]( 7.65)       0.99 [ +1.00]( 1.05)
threads-sockets    2-groups    1.00 [  0.00]( 3.12)       1.03 [ -3.00]( 1.70)

The hackbench results are similar on both kernels, except for a 29%
improvement in the threads-pipe case with 1 group.

Daytrader throughput (higher is better)
========
As per Ingo's suggestion, I ran a real-life workload, daytrader.

baseline:
===================================================================================
                  Instance 1
Throughputs   Ave. Resp. Time   Min. Resp. Time   Max. Resp. Time
===========   ===============   ===============   ===============
10124.5       2                 0                 3970

SIS_CACHE:
===================================================================================
                  Instance 1
Throughputs   Ave. Resp. Time   Min. Resp. Time   Max. Resp. Time
===========   ===============   ===============   ===============
10319.5       2                 0                 5771

In the above run, daytrader performance was 2% better with SIS_CACHE.

Thanks and Regards
Madadi Vineeth Reddy
Hi Madadi,

On 2023-10-17 at 15:19:24 +0530, Madadi Vineeth Reddy wrote:
> Hi Chen Yu,
>
> [..snip..]
>
> Regarding the trade-off between a longer scan for an idle CPU and the
> cache benefits, I ran some benchmarks.

Thanks very much for your interest and your time on the patch.

> Tested the patch on a Power system with 12 cores, for a total of 96
> CPUs. The system has two NUMA nodes.
>
> schbench 99.0th latency (lower is better)
> ========
> case      load          baseline[pct imp](std%)    SIS_CACHE[pct imp](std%)
> normal    1-mthreads    1.00 [  0.00]( 3.66)       1.00 [    0.00]( 1.71)
> normal    2-mthreads    1.00 [  0.00]( 4.55)       1.02 [   -2.00]( 3.00)
> normal    4-mthreads    1.00 [  0.00]( 4.77)       0.96 [   +4.00]( 4.27)
> normal    6-mthreads    1.00 [  0.00](60.37)       2.66 [ -166.00](23.67)
>
> The schbench results show that there is not much impact on wakeup
> latencies from the extra iterations in the search for an idle CPU in
> the select_idle_cpu() code path; interestingly, the numbers are
> slightly better for SIS_CACHE in the 4-mthreads case.

The 4% improvement is within std%, so I suppose we did not see much
difference in the 4-mthreads case.

> I think we can ignore the last case due to huge run-to-run variations.

Although the run-to-run variation is large, the decrease seems to be
within that range. Prateek has also reported that when the system is
overloaded there could be some regression from schbench:
https://lore.kernel.org/lkml/27651e14-f441-c1e2-9b5b-b958d6aadc79@amd.com/
Could you also post the raw data printed by schbench? Using the latest
schbench might also give the latency in more detail.

> [..snip..]
>
> In the above run, daytrader performance was 2% better with SIS_CACHE.

Thanks for bringing this good news: a real-life workload benefits from
this change. I'll tune this patch a little to address the regression
from schbench. Also, I'm working with Mathieu on his proposal to make
the wakee choose its previous CPU more easily (similar to SIS_CACHE,
but a little simpler), and we'll check how to make more platforms
benefit from this change:
https://lore.kernel.org/lkml/20231012203626.1298944-1-mathieu.desnoyers@efficios.com/

thanks,
Chenyu
Hi Chen Yu,

On 17/10/23 16:39, Chen Yu wrote:
> Hi Madadi,
>
> On 2023-10-17 at 15:19:24 +0530, Madadi Vineeth Reddy wrote:
>> Hi Chen Yu,
>>
>> On 26/09/23 10:40, Chen Yu wrote:
>>> RFC -> v1:
>>> - drop RFC
>>> - Only record the short sleeping time for each task, to better honor the
>>>   burst sleeping tasks. (Mathieu Desnoyers)
>>> - Keep the forward movement monotonic for runqueue's cache-hot timeout value.
>>>   (Mathieu Desnoyers, Aaron Lu)
>>> - Introduce a new helper function cache_hot_cpu() that considers
>>>   rq->cache_hot_timeout. (Aaron Lu)
>>> - Add analysis of why inhibiting task migration could bring better throughput
>>>   for some benchmarks. (Gautham R. Shenoy)
>>> - Choose the first cache-hot CPU, if all idle CPUs are cache-hot in
>>>   select_idle_cpu(). To avoid possible task stacking on the waker's CPU.
>>>   (K Prateek Nayak)
>>>
>>> Thanks for your comments and review!
>>>
>>> ----------------------------------------------------------------------
>>
>> Regarding making the scan for finding an idle cpu longer vs cache benefits,
>> I ran some benchmarks.
>>
>
> Thanks very much for your interest and your time on the patch.
>
>> Tested the patch on a power system with 12 cores. Total of 96 CPUs.
>> System has two NUMA nodes.
>>
>> Below are some of the benchmark results
>>
>> schbench 99.0th latency (lower is better)
>> ========
>> case     load         baseline[pct imp](std%)   SIS_CACHE[pct imp](std%)
>> normal   1-mthreads   1.00 [ 0.00]( 3.66)       1.00 [ 0.00]( 1.71)
>> normal   2-mthreads   1.00 [ 0.00]( 4.55)       1.02 [ -2.00]( 3.00)
>> normal   4-mthreads   1.00 [ 0.00]( 4.77)       0.96 [ +4.00]( 4.27)
>> normal   6-mthreads   1.00 [ 0.00]( 60.37)      2.66 [-166.00](23.67)
>>
>> schbench results are showing that there is not much impact in wakeup latencies due to more iterations
>> in search for an idle cpu in the select_idle_cpu code path, and interestingly numbers are slightly better
>> for SIS_CACHE in case of 4-mthreads.
>
> The 4% improvement is within std%, so I suppose we did not see much difference in the 4-mthreads case.
>
>> I think we can ignore the last case due to huge run to run variations.
>
> Although the run-to-run variation is large, it seems that the decrease is within that range.
> Prateek has also reported that when the system is overloaded there could be some regression
> from schbench:
> https://lore.kernel.org/lkml/27651e14-f441-c1e2-9b5b-b958d6aadc79@amd.com/
> Could you also post the raw data printed by schbench? And maybe using the latest schbench
> could get the latency in detail.
>

raw data by schbench(old) with 6-mthreads
======================

Baseline (5 runs)
========
Latency percentiles (usec)
        50.0000th: 22
        75.0000th: 29
        90.0000th: 34
        95.0000th: 37
        *99.0000th: 981
        99.5000th: 4424
        99.9000th: 9200
        min=0, max=29497

Latency percentiles (usec)
        50.0000th: 23
        75.0000th: 29
        90.0000th: 35
        95.0000th: 38
        *99.0000th: 495
        99.5000th: 3924
        99.9000th: 9872
        min=0, max=29997

Latency percentiles (usec)
        50.0000th: 23
        75.0000th: 30
        90.0000th: 36
        95.0000th: 39
        *99.0000th: 1326
        99.5000th: 4744
        99.9000th: 10000
        min=0, max=23394

Latency percentiles (usec)
        50.0000th: 23
        75.0000th: 29
        90.0000th: 34
        95.0000th: 37
        *99.0000th: 55
        99.5000th: 3292
        99.9000th: 9104
        min=0, max=25196

Latency percentiles (usec)
        50.0000th: 23
        75.0000th: 29
        90.0000th: 34
        95.0000th: 37
        *99.0000th: 711
        99.5000th: 4600
        99.9000th: 9424
        min=0, max=19997

SIS_CACHE (5 runs)
=========
Latency percentiles (usec)
        50.0000th: 23
        75.0000th: 30
        90.0000th: 35
        95.0000th: 38
        *99.0000th: 1894
        99.5000th: 5464
        99.9000th: 10000
        min=0, max=19157

Latency percentiles (usec)
        50.0000th: 22
        75.0000th: 29
        90.0000th: 34
        95.0000th: 37
        *99.0000th: 2396
        99.5000th: 6664
        99.9000th: 10000
        min=0, max=24029

Latency percentiles (usec)
        50.0000th: 22
        75.0000th: 29
        90.0000th: 34
        95.0000th: 37
        *99.0000th: 2132
        99.5000th: 6296
        99.9000th: 10000
        min=0, max=25313

Latency percentiles (usec)
        50.0000th: 22
        75.0000th: 29
        90.0000th: 34
        95.0000th: 37
        *99.0000th: 1090
        99.5000th: 6232
        99.9000th: 9744
        min=0, max=27264

Latency percentiles (usec)
        50.0000th: 22
        75.0000th: 29
        90.0000th: 34
        95.0000th: 38
        *99.0000th: 1786
        99.5000th: 5240
        99.9000th: 9968
        min=0, max=24754

The above data, as indicated, has large run-to-run variation, and in general the latency is
higher in case of SIS_CACHE at the 99th %ile.

schbench(new) with 6-mthreads
=============

Baseline
========
Wakeup Latencies percentiles (usec) runtime 30 (s) (209403 total samples)
        50.0th: 8      (43672 samples)
        90.0th: 13     (83908 samples)
        * 99.0th: 20     (18323 samples)
        99.9th: 775    (1785 samples)
        min=1, max=8400
Request Latencies percentiles (usec) runtime 30 (s) (209543 total samples)
        50.0th: 13648  (59873 samples)
        90.0th: 14000  (82767 samples)
        * 99.0th: 14320  (16342 samples)
        99.9th: 18720  (1670 samples)
        min=5130, max=38334
RPS percentiles (requests) runtime 30 (s) (31 total samples)
        20.0th: 6968   (8 samples)
        * 50.0th: 6984   (23 samples)
        90.0th: 6984   (0 samples)
        min=6835, max=6991
average rps: 6984.77

SIS_CACHE
=========
Wakeup Latencies percentiles (usec) runtime 30 (s) (209295 total samples)
        50.0th: 9      (49267 samples)
        90.0th: 14     (86522 samples)
        * 99.0th: 21     (14091 samples)
        99.9th: 1146   (1722 samples)
        min=1, max=10427
Request Latencies percentiles (usec) runtime 30 (s) (209432 total samples)
        50.0th: 13616  (62838 samples)
        90.0th: 14000  (85301 samples)
        * 99.0th: 14352  (16149 samples)
        99.9th: 21408  (1660 samples)
        min=5070, max=41866
RPS percentiles (requests) runtime 30 (s) (31 total samples)
        20.0th: 6968   (7 samples)
        * 50.0th: 6984   (21 samples)
        90.0th: 6984   (0 samples)
        min=6672, max=6996
average rps: 6981.07

With the new schbench, I didn't observe run-to-run variation, and there was also no regression
in case of SIS_CACHE at the 99th %ile.
>> producer_consumer avg time/access (lower is better)
>> ========
>> loads per consumer iteration   baseline[pct imp](std%)   SIS_CACHE[pct imp](std%)
>> 5                              1.00 [ 0.00]( 0.00)       0.87 [ +13.0]( 1.92)
>> 20                             1.00 [ 0.00]( 0.00)       0.92 [ +8.00]( 0.00)
>> 50                             1.00 [ 0.00]( 0.00)       1.00 [ 0.00]( 0.00)
>> 100                            1.00 [ 0.00]( 0.00)       1.00 [ 0.00]( 0.00)
>>
>> The main goal of the patch of improving cache locality is reflected as SIS_CACHE only improves in this workload,
>> mainly when loads per consumer iteration is lower.
>>
>> hackbench normalized time in seconds (lower is better)
>> ========
>> case             load       baseline[pct imp](std%)   SIS_CACHE[pct imp](std%)
>> process-pipe     1-groups   1.00 [ 0.00]( 1.50)       1.02 [ -2.00]( 3.36)
>> process-pipe     2-groups   1.00 [ 0.00]( 4.76)       0.99 [ +1.00]( 5.68)
>> process-sockets  1-groups   1.00 [ 0.00]( 2.56)       1.00 [ 0.00]( 0.86)
>> process-sockets  2-groups   1.00 [ 0.00]( 0.50)       0.99 [ +1.00]( 0.96)
>> threads-pipe     1-groups   1.00 [ 0.00]( 3.87)       0.71 [ +29.0]( 3.56)
>> threads-pipe     2-groups   1.00 [ 0.00]( 1.60)       0.97 [ +3.00]( 3.44)
>> threads-sockets  1-groups   1.00 [ 0.00]( 7.65)       0.99 [ +1.00]( 1.05)
>> threads-sockets  2-groups   1.00 [ 0.00]( 3.12)       1.03 [ -3.00]( 1.70)
>>
>> hackbench results are similar in both kernels except the case where there is an improvement of
>> 29% in case of threads-pipe case with 1 groups.
>>
>> Daytrader throughput (higher is better)
>> ========
>>
>> As per Ingo suggestion, ran a real life workload daytrader
>>
>> baseline:
>> ===================================================================================
>>                                  Instance 1
>>  Throughputs   Ave. Resp. Time   Min. Resp. Time   Max. Resp. Time
>> ============   ===============   ===============   ===============
>>     10124.5                 2                 0              3970
>>
>> SIS_CACHE:
>> ===================================================================================
>>                                  Instance 1
>>  Throughputs   Ave. Resp. Time   Min. Resp. Time   Max. Resp. Time
>> ============   ===============   ===============   ===============
>>     10319.5                 2                 0              5771
>>
>> In the above run, daytrader performance was 2% better in case of SIS_CACHE.
>>
>
> Thanks for bringing this good news, a real life workload benefits from this change.
> I'll tune this patch a little bit to address the regression from schbench. Also to mention
> that, I'm working with Mathieu on his proposal to make the wakee choose its previous
> CPU more easily (similar to SIS_CACHE, but a little simpler), and we'll check how to make
> more platforms benefit from this change.
> https://lore.kernel.org/lkml/20231012203626.1298944-1-mathieu.desnoyers@efficios.com/

Oh..ok. Thanks for the pointer!

>
> thanks,
> Chenyu
>

Thanks and Regards
Madadi Vineeth Reddy
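[Editor's note: the changelog quoted above summarizes the SIS_CACHE heuristic: record a task's
short sleeps, keep its runqueue "cache-hot" until a monotonically advancing timeout, and have
select_idle_cpu() skip cache-hot idle CPUs unless every idle CPU is cache-hot, in which case it
picks the first one to avoid stacking on the waker. The toy model below only illustrates that
control flow; all names, the threshold value, and the data layout are invented here, and the real
implementation lives in the kernel's scheduler code and differs in detail.]

```python
SHORT_SLEEP_NS = 1_000_000  # illustrative threshold for a "short" sleep

class ToyRq:
    """Minimal stand-in for a runqueue carrying a cache-hot deadline."""
    def __init__(self):
        self.cache_hot_timeout = 0  # absolute time until which this CPU is cache-hot

def task_sleeps(rq, now_ns, expected_sleep_ns):
    # Only short sleepers extend the deadline, and it only moves forward
    # (monotonic), mirroring the changelog bullets above.
    if expected_sleep_ns <= SHORT_SLEEP_NS:
        rq.cache_hot_timeout = max(rq.cache_hot_timeout, now_ns + expected_sleep_ns)

def cache_hot_cpu(rq, now_ns):
    return now_ns < rq.cache_hot_timeout

def select_idle_cpu(idle_rqs, now_ns):
    """Prefer an idle CPU that is not cache-hot; if every idle CPU is
    cache-hot, fall back to the first one rather than the waker's CPU."""
    for cpu, rq in idle_rqs:
        if not cache_hot_cpu(rq, now_ns):
            return cpu
    return idle_rqs[0][0] if idle_rqs else None

rq0, rq1 = ToyRq(), ToyRq()
task_sleeps(rq0, 100, 500_000)   # a short sleeper parks on CPU 0
print(select_idle_cpu([(0, rq0), (1, rq1)], 200))  # CPU 0 is reserved, so CPU 1 wins
```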
On 2023-10-19 at 01:02:16 +0530, Madadi Vineeth Reddy wrote:
> Hi Chen Yu,
> On 17/10/23 16:39, Chen Yu wrote:
> > Hi Madadi,
> >
> > On 2023-10-17 at 15:19:24 +0530, Madadi Vineeth Reddy wrote:
> >> Hi Chen Yu,
> >>
> >> On 26/09/23 10:40, Chen Yu wrote:
> >>> RFC -> v1:
> >>> - drop RFC
> >>> - Only record the short sleeping time for each task, to better honor the
> >>>   burst sleeping tasks. (Mathieu Desnoyers)
> >>> - Keep the forward movement monotonic for runqueue's cache-hot timeout value.
> >>>   (Mathieu Desnoyers, Aaron Lu)
> >>> - Introduce a new helper function cache_hot_cpu() that considers
> >>>   rq->cache_hot_timeout. (Aaron Lu)
> >>> - Add analysis of why inhibiting task migration could bring better throughput
> >>>   for some benchmarks. (Gautham R. Shenoy)
> >>> - Choose the first cache-hot CPU, if all idle CPUs are cache-hot in
> >>>   select_idle_cpu(). To avoid possible task stacking on the waker's CPU.
> >>>   (K Prateek Nayak)
> >>>
> >>> Thanks for your comments and review!
> >>>
> >>> ----------------------------------------------------------------------
> >>
> >> Regarding making the scan for finding an idle cpu longer vs cache benefits,
> >> I ran some benchmarks.
> >>
> >
> > Thanks very much for your interest and your time on the patch.
> >
> >> Tested the patch on power system with 12 cores. Total of 96 CPU's.
> >> System has two NUMA nodes.
> >>
> >> Below are some of the benchmark results
> >>
> >> schbench 99.0th latency (lower is better)
> >> ========
> >> case     load         baseline[pct imp](std%)   SIS_CACHE[pct imp](std%)
> >> normal   1-mthreads   1.00 [ 0.00]( 3.66)       1.00 [ 0.00]( 1.71)
> >> normal   2-mthreads   1.00 [ 0.00]( 4.55)       1.02 [ -2.00]( 3.00)
> >> normal   4-mthreads   1.00 [ 0.00]( 4.77)       0.96 [ +4.00]( 4.27)
> >> normal   6-mthreads   1.00 [ 0.00]( 60.37)      2.66 [-166.00](23.67)
> >>
> >>
> >> schbench results are showing that there is not much impact in wakeup latencies due to more iterations
> >> in search for an idle cpu in the select_idle_cpu code path and interestingly numbers are slightly better
> >> for SIS_CACHE in case of 4-mthreads.
> >
> > The 4% improvement is within std%, so I suppose we did not see much difference in 4 mthreads case.
> >
> >> I think we can ignore the last case due to huge run to run variations.
> >
> > Although the run-to-run variation is large, it seems that the decrease is within that range.
> > Prateek has also reported that when the system is overloaded there could be some regression
> > from schbench:
> > https://lore.kernel.org/lkml/27651e14-f441-c1e2-9b5b-b958d6aadc79@amd.com/
> > Could you also post the raw data printed by schbench? And maybe using the latest schbench could get the
> > latency in detail.
> >
>
> raw data by schbench(old) with 6-mthreads
> ======================
>
> Baseline (5 runs)
> ========
> Latency percentiles (usec)
>         50.0000th: 22
>         75.0000th: 29
>         90.0000th: 34
>         95.0000th: 37
>         *99.0000th: 981
>         99.5000th: 4424
>         99.9000th: 9200
>         min=0, max=29497
>
> Latency percentiles (usec)
>         50.0000th: 23
>         75.0000th: 29
>         90.0000th: 35
>         95.0000th: 38
>         *99.0000th: 495
>         99.5000th: 3924
>         99.9000th: 9872
>         min=0, max=29997
>
> Latency percentiles (usec)
>         50.0000th: 23
>         75.0000th: 30
>         90.0000th: 36
>         95.0000th: 39
>         *99.0000th: 1326
>         99.5000th: 4744
>         99.9000th: 10000
>         min=0, max=23394
>
> Latency percentiles (usec)
>         50.0000th: 23
>         75.0000th: 29
>         90.0000th: 34
>         95.0000th: 37
>         *99.0000th: 55
>         99.5000th: 3292
>         99.9000th: 9104
>         min=0, max=25196
>
> Latency percentiles (usec)
>         50.0000th: 23
>         75.0000th: 29
>         90.0000th: 34
>         95.0000th: 37
>         *99.0000th: 711
>         99.5000th: 4600
>         99.9000th: 9424
>         min=0, max=19997
>
> SIS_CACHE (5 runs)
> =========
> Latency percentiles (usec)
>         50.0000th: 23
>         75.0000th: 30
>         90.0000th: 35
>         95.0000th: 38
>         *99.0000th: 1894
>         99.5000th: 5464
>         99.9000th: 10000
>         min=0, max=19157
>
> Latency percentiles (usec)
>         50.0000th: 22
>         75.0000th: 29
>         90.0000th: 34
>         95.0000th: 37
>         *99.0000th: 2396
>         99.5000th: 6664
>         99.9000th: 10000
>         min=0, max=24029
>
> Latency percentiles (usec)
>         50.0000th: 22
>         75.0000th: 29
>         90.0000th: 34
>         95.0000th: 37
>         *99.0000th: 2132
>         99.5000th: 6296
>         99.9000th: 10000
>         min=0, max=25313
>
> Latency percentiles (usec)
>         50.0000th: 22
>         75.0000th: 29
>         90.0000th: 34
>         95.0000th: 37
>         *99.0000th: 1090
>         99.5000th: 6232
>         99.9000th: 9744
>         min=0, max=27264
>
> Latency percentiles (usec)
>         50.0000th: 22
>         75.0000th: 29
>         90.0000th: 34
>         95.0000th: 38
>         *99.0000th: 1786
>         99.5000th: 5240
>         99.9000th: 9968
>         min=0, max=24754
>
> The above data as indicated has large run to run variation and in general, the latency is
> high in case of SIS_CACHE for the 99th %ile.
>
>
> schbench(new) with 6-mthreads
> =============
>
> Baseline
> ========
> Wakeup Latencies percentiles (usec) runtime 30 (s) (209403 total samples)
>         50.0th: 8      (43672 samples)
>         90.0th: 13     (83908 samples)
>         * 99.0th: 20     (18323 samples)
>         99.9th: 775    (1785 samples)
>         min=1, max=8400
> Request Latencies percentiles (usec) runtime 30 (s) (209543 total samples)
>         50.0th: 13648  (59873 samples)
>         90.0th: 14000  (82767 samples)
>         * 99.0th: 14320  (16342 samples)
>         99.9th: 18720  (1670 samples)
>         min=5130, max=38334
> RPS percentiles (requests) runtime 30 (s) (31 total samples)
>         20.0th: 6968   (8 samples)
>         * 50.0th: 6984   (23 samples)
>         90.0th: 6984   (0 samples)
>         min=6835, max=6991
> average rps: 6984.77
>
>
> SIS_CACHE
> =========
> Wakeup Latencies percentiles (usec) runtime 30 (s) (209295 total samples)
>         50.0th: 9      (49267 samples)
>         90.0th: 14     (86522 samples)
>         * 99.0th: 21     (14091 samples)
>         99.9th: 1146   (1722 samples)
>         min=1, max=10427
> Request Latencies percentiles (usec) runtime 30 (s) (209432 total samples)
>         50.0th: 13616  (62838 samples)
>         90.0th: 14000  (85301 samples)
>         * 99.0th: 14352  (16149 samples)
>         99.9th: 21408  (1660 samples)
>         min=5070, max=41866
> RPS percentiles (requests) runtime 30 (s) (31 total samples)
>         20.0th: 6968   (7 samples)
>         * 50.0th: 6984   (21 samples)
>         90.0th: 6984   (0 samples)
>         min=6672, max=6996
> average rps: 6981.07
>
> In new schbench, I didn't observe run to run variation and also there was no regression
> in case of SIS_CACHE for the 99th %ile.
>

Thanks for the test, Madadi. In my opinion we can stick with the new schbench in the future.
I'll double-check on my test machine.

thanks,
Chenyu
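[Editor's note: the starred lines in the schbench reports above are latency percentiles over the
collected wakeup/request samples. schbench itself buckets samples internally, so the sketch below
is only a nearest-rank approximation of that computation, with made-up sample values, included to
show how the 50th/99th figures relate to the raw samples.]

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest value such that at least
    pct% of the samples are <= it. schbench reports bucketed values,
    so this only approximates its output."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

lat = [22, 29, 34, 37, 981, 4424, 9200, 15, 18, 25]  # made-up usec samples
print(percentile(lat, 50), percentile(lat, 99))
```

The tail percentiles (99th and above) are driven by only a handful of samples per run, which is
one reason the old schbench's 99th %ile swings so much from run to run at this load.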