Message ID: 20231019160523.1582101-1-mathieu.desnoyers@efficios.com
From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Peter Zijlstra <peterz@infradead.org>
Subject: [RFC PATCH v2 0/2] sched/fair migration reduction features
Date: Thu, 19 Oct 2023 12:05:21 -0400
Series: sched/fair migration reduction features
Message
Mathieu Desnoyers
Oct. 19, 2023, 4:05 p.m. UTC
Hi,

This series introduces two new scheduler features: UTIL_FITS_CAPACITY
and SELECT_BIAS_PREV. When used together, they achieve a 41% speedup of
a hackbench workload which leaves some idle CPU time on a 192-core AMD
EPYC.

The main metrics which are significantly improved are:

- cpu-migrations are reduced by 80%,
- CPU utilization is increased by 17%.

Feedback is welcome. I am especially interested to learn whether this
series has positive or detrimental effects on performance of other
workloads.

The main changes since v1 are to take into account feedback from Chen Yu
and keep a 20% margin of unused utilization in the capacity fit, and to
use scale_rt_capacity(), which is more precise.

Thanks,

Mathieu

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Chen Yu <yu.c.chen@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Gautham R. Shenoy <gautham.shenoy@amd.com>
Cc: x86@kernel.org

Mathieu Desnoyers (2):
  sched/fair: Introduce UTIL_FITS_CAPACITY feature (v2)
  sched/fair: Introduce SELECT_BIAS_PREV to reduce migrations

 kernel/sched/fair.c     | 68 ++++++++++++++++++++++++++++++++++++-----
 kernel/sched/features.h | 12 ++++++++
 kernel/sched/sched.h    |  5 +++
 3 files changed, 77 insertions(+), 8 deletions(-)
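For reference, here is a minimal sketch of the kind of capacity-fit
test the changelog describes (a 20% margin of unused utilization on top
of scale_rt_capacity()); the helper name and exact arithmetic are
illustrative, not the literal patch:

/*
 * Sketch only: the wakee fits on a CPU if the CPU's utilization plus
 * the task's utilization stays below 80% of the capacity remaining
 * after RT/IRQ pressure (i.e., scale_rt_capacity()).
 */
static bool sched_util_fits_capacity(unsigned long cpu_util,
				     unsigned long task_util,
				     unsigned long capacity)
{
	/* Keep a 20% margin of unused utilization. */
	return (cpu_util + task_util) * 5 < capacity * 4;
}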
Comments
Hello Mathieu,

On 10/19/2023 9:35 PM, Mathieu Desnoyers wrote:
> Hi,
>
> This series introduces two new scheduler features: UTIL_FITS_CAPACITY
> and SELECT_BIAS_PREV. When used together, they achieve a 41% speedup of
> a hackbench workload which leaves some idle CPU time on a 192-core AMD
> EPYC.
>
> The main metrics which are significantly improved are:
>
> - cpu-migrations are reduced by 80%,
> - CPU utilization is increased by 17%.
>
> Feedback is welcome. I am especially interested to learn whether this
> series has positive or detrimental effects on performance of other
> workloads.

I got a chance to test this series on a dual socket 3rd Generation EPYC
System (2 x 64C/128T). Following is a quick summary:

- stream and ycsb-mongodb don't see any changes.

- hackbench and DeathStarBench see a major improvement. Both are high
  utilization workloads with CPUs being overloaded most of the time.
  DeathStarBench is known to benefit from a lower migration count. It
  was discussed by Gautham at OSPM '23.

- tbench, netperf, and schbench regress. The former two when the
  system is near fully loaded, and the latter for most cases. All these
  benchmarks are client-server / messenger-worker oriented and are
  known to perform better when client-server / messenger-worker pairs
  are on the same CCX (LLC domain).

Detailed results are as follows:

o Machine details

- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- C2 Disabled (POLL and C1(MWAIT) remained enabled)

o Kernel Details

- tip: tip:sched/core at commit 984ffb6a4366 ("sched/fair: Remove
  SIS_PROP")

- wake_prev_bias: tip + this series + Peter's suggestion to optimize
  sched_util_fits_capacity_active()

I've taken the liberty of resolving the conflict with the recently
added cluster wakeup optimization by prioritizing the SELECT_BIAS_PREV
feature. select_idle_sibling() looks as follows:

select_idle_sibling(...)
{
	...

	/*
	 * With the SELECT_BIAS_PREV feature, if the previous CPU is
	 * cache affine, prefer the previous CPU when all CPUs are busy
	 * to inhibit migration.
	 */
	if (sched_feat(SELECT_BIAS_PREV) &&
	    prev != target && cpus_share_cache(prev, target))
		return prev;

	/*
	 * For cluster machines which have a lower sharing cache, like
	 * L2 or the LLC tag, we tend to find an idle CPU in the
	 * target's cluster first. But prev_cpu or recent_used_cpu may
	 * also be a good candidate; use them if possible when no idle
	 * CPU is found in select_idle_cpu().
	 */
	if ((unsigned int)prev_aff < nr_cpumask_bits)
		return prev_aff;
	if ((unsigned int)recent_used_cpu < nr_cpumask_bits)
		return recent_used_cpu;

	return target;
}

Please let me know if you have a different ordering in mind.
o Benchmark results

==================================================================
Test          : hackbench
Units         : Normalized time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
Case:           tip[pct imp](CV)    wake_prev_bias[pct imp](CV)
 1-groups     1.00 [ -0.00]( 2.88)     0.97 [  2.88]( 1.78)
 2-groups     1.00 [ -0.00]( 2.03)     0.91 [  8.79]( 1.19)
 4-groups     1.00 [ -0.00]( 1.42)     0.87 [ 13.07]( 1.77)
 8-groups     1.00 [ -0.00]( 1.37)     0.86 [ 13.70]( 0.98)
16-groups     1.00 [ -0.00]( 2.54)     0.90 [  9.74]( 1.65)

==================================================================
Test          : tbench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:        tip[pct imp](CV)    wake_prev_bias[pct imp](CV)
    1         1.00 [  0.00]( 0.63)     0.99 [ -0.53]( 0.97)
    2         1.00 [  0.00]( 0.89)     1.00 [  0.21]( 0.99)
    4         1.00 [  0.00]( 1.34)     1.01 [  0.70]( 0.88)
    8         1.00 [  0.00]( 0.49)     1.00 [  0.40]( 0.55)
   16         1.00 [  0.00]( 1.51)     0.99 [ -0.51]( 1.23)
   32         1.00 [  0.00]( 0.74)     0.97 [ -2.57]( 0.59)
   64         1.00 [  0.00]( 0.92)     0.95 [ -4.69]( 0.70)
  128         1.00 [  0.00]( 0.97)     0.91 [ -8.58]( 0.29)
  256         1.00 [  0.00]( 1.14)     0.90 [ -9.86]( 2.40)
  512         1.00 [  0.00]( 0.35)     0.97 [ -2.91]( 1.78)
 1024         1.00 [  0.00]( 0.07)     0.96 [ -4.15]( 1.43)

==================================================================
Test          : stream-10
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:           tip[pct imp](CV)    wake_prev_bias[pct imp](CV)
 Copy         1.00 [  0.00]( 8.25)     1.04 [  3.53](10.84)
Scale         1.00 [  0.00]( 5.65)     0.99 [ -0.85]( 5.94)
  Add         1.00 [  0.00]( 5.73)     1.00 [  0.50]( 7.68)
Triad         1.00 [  0.00]( 3.41)     1.00 [  0.12]( 6.25)

==================================================================
Test          : stream-100
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:           tip[pct imp](CV)    wake_prev_bias[pct imp](CV)
 Copy         1.00 [  0.00]( 1.75)     1.01 [  1.18]( 1.61)
Scale         1.00 [  0.00]( 0.92)     1.00 [ -0.14]( 1.37)
  Add         1.00 [  0.00]( 0.32)     0.99 [ -0.54]( 1.34)
Triad         1.00 [  0.00]( 5.97)     1.00 [  0.37]( 6.34)

==================================================================
Test          : netperf
Units         : Normalized Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:        tip[pct imp](CV)    wake_prev_bias[pct imp](CV)
  1-clients    1.00 [  0.00]( 0.67)     1.00 [  0.08]( 0.15)
  2-clients    1.00 [  0.00]( 0.15)     1.00 [  0.10]( 0.57)
  4-clients    1.00 [  0.00]( 0.58)     1.00 [  0.10]( 0.74)
  8-clients    1.00 [  0.00]( 0.46)     1.00 [  0.31]( 0.64)
 16-clients    1.00 [  0.00]( 0.84)     0.99 [ -0.56]( 1.78)
 32-clients    1.00 [  0.00]( 1.07)     1.00 [  0.04]( 1.40)
 64-clients    1.00 [  0.00]( 1.53)     1.01 [  0.68]( 2.27)
128-clients    1.00 [  0.00]( 1.17)     0.99 [ -0.70]( 1.17)
256-clients    1.00 [  0.00]( 5.42)     0.91 [ -9.31](10.74)
512-clients    1.00 [  0.00](48.07)     1.00 [ -0.07](47.71)

==================================================================
Test          : schbench
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers:       tip[pct imp](CV)    wake_prev_bias[pct imp](CV)
    1         1.00 [ -0.00](12.00)     1.06 [ -5.56]( 2.99)
    2         1.00 [ -0.00]( 6.96)     1.08 [ -7.69]( 2.38)
    4         1.00 [ -0.00](13.57)     1.07 [ -7.32](12.95)
    8         1.00 [ -0.00]( 6.45)     0.98 [  2.08](10.86)
   16         1.00 [ -0.00]( 3.45)     1.02 [ -1.72]( 1.69)
   32         1.00 [ -0.00]( 3.00)     1.05 [ -5.00](10.92)
   64         1.00 [ -0.00]( 2.18)     1.04 [ -4.17]( 1.15)
  128         1.00 [ -0.00]( 7.15)     1.07 [ -6.65]( 8.45)
  256         1.00 [ -0.00](30.20)     1.72 [-72.03](30.62)
  512         1.00 [ -0.00]( 4.90)     0.97 [  3.25]( 1.92)

==================================================================
Test          : ycsb-mongodb
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
metric          tip     wake_prev_bias(%diff)
throughput      1.00    0.99 (%diff: -0.94%)

==================================================================
Test          : DeathStarBench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
Pinning  scaling    tip     wake_prev_bias(%diff)
 1CCD       1       1.00    1.10 (%diff: 10.04%)
 2CCD       2       1.00    1.06 (%diff: 5.90%)
 4CCD       4       1.00    1.04 (%diff: 3.74%)
 8CCD       8       1.00    1.03 (%diff: 2.98%)

--
It is a mixed bag of results, as expected. I would love to hear your
thoughts on the results. Meanwhile, I'll try to get some more data
from other benchmarks.

> [..snip..]

--
Thanks and Regards,
Prateek
On 2023-10-27 at 08:57:00 +0530, K Prateek Nayak wrote:
> Hello Mathieu,
>
> On 10/19/2023 9:35 PM, Mathieu Desnoyers wrote:
> > Hi,
> >
> > This series introduces two new scheduler features: UTIL_FITS_CAPACITY
> > and SELECT_BIAS_PREV. When used together, they achieve a 41% speedup of
> > a hackbench workload which leaves some idle CPU time on a 192-core AMD
> > EPYC.
> >
> > The main metrics which are significantly improved are:
> >
> > - cpu-migrations are reduced by 80%,
> > - CPU utilization is increased by 17%.
> >
> > Feedback is welcome. I am especially interested to learn whether this
> > series has positive or detrimental effects on performance of other
> > workloads.
>
> I got a chance to test this series on a dual socket 3rd Generation EPYC
> System (2 x 64C/128T). Following is a quick summary:
>
> - stream and ycsb-mongodb don't see any changes.
>
> - hackbench and DeathStarBench see a major improvement. Both are high
>   utilization workloads with CPUs being overloaded most of the time.
>   DeathStarBench is known to benefit from a lower migration count. It
>   was discussed by Gautham at OSPM '23.
>
> - tbench, netperf, and schbench regress. The former two when the
>   system is near fully loaded, and the latter for most cases.

Does it mean hackbench gets benefits when the system is overloaded,
while tbench/netperf do not benefit when the system is underloaded?

> All these benchmarks are client-server / messenger-worker oriented and
> are known to perform better when client-server / messenger-worker pairs
> are on the same CCX (LLC domain).

I thought hackbench should also be of client-server mode, because
hackbench has socket/pipe modes and exchanges data between sender and
receiver.

This reminds me of your proposal to provide a user hint to the
scheduler on whether to do task consolidation vs. task spreading;
could this also be applied to Mathieu's case? For a task or task group
with the "consolidate" flag set, tasks would prefer to be woken up on
the target/previous CPU if the wakee fits on that CPU. In this way we
could bring the benefit without introducing regressions.

thanks,
Chenyu
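To make the hint idea concrete, here is a minimal sketch; the hint
parameter and its wiring into the wakeup path are hypothetical, not an
existing kernel interface, and the fit test reuses the helper sketched
after the cover letter above:

/*
 * Sketch of a "consolidate" wakeup hint (hypothetical interface):
 * when the hint is set and the wakee still fits on its previous CPU,
 * prefer it for cache locality; otherwise fall back to the CPU the
 * regular idle search picked.
 */
static int select_wake_cpu_hinted(struct task_struct *p, int prev,
				  int target, bool consolidate_hint)
{
	if (consolidate_hint &&
	    sched_util_fits_capacity(cpu_util_cfs(prev), task_util_est(p),
				     scale_rt_capacity(prev)))
		return prev;	/* consolidate: keep cache locality */

	return target;		/* spread: use the idle-search result */
}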
Hello Chenyu,

On 11/6/2023 11:22 AM, Chen Yu wrote:
> On 2023-10-27 at 08:57:00 +0530, K Prateek Nayak wrote:
>> Hello Mathieu,
>>
>> On 10/19/2023 9:35 PM, Mathieu Desnoyers wrote:
>>> Hi,
>>>
>>> This series introduces two new scheduler features: UTIL_FITS_CAPACITY
>>> and SELECT_BIAS_PREV. When used together, they achieve a 41% speedup of
>>> a hackbench workload which leaves some idle CPU time on a 192-core AMD
>>> EPYC.
>>>
>>> The main metrics which are significantly improved are:
>>>
>>> - cpu-migrations are reduced by 80%,
>>> - CPU utilization is increased by 17%.
>>>
>>> Feedback is welcome. I am especially interested to learn whether this
>>> series has positive or detrimental effects on performance of other
>>> workloads.
>>
>> I got a chance to test this series on a dual socket 3rd Generation EPYC
>> System (2 x 64C/128T). Following is a quick summary:
>>
>> - stream and ycsb-mongodb don't see any changes.
>>
>> - hackbench and DeathStarBench see a major improvement. Both are high
>>   utilization workloads with CPUs being overloaded most of the time.
>>   DeathStarBench is known to benefit from a lower migration count. It
>>   was discussed by Gautham at OSPM '23.
>>
>> - tbench, netperf, and schbench regress. The former two when the
>>   system is near fully loaded, and the latter for most cases.
>
> Does it mean hackbench gets benefits when the system is overloaded,
> while tbench/netperf do not benefit when the system is underloaded?

Yup! Seems like that from the results. From what I have seen so far,
there seems to be a work conservation aspect to hackbench where, if we
reduce the time spent in the kernel (by reducing the time to decide on
the target CPU, which Mathieu's patch [this one] achieves; there is also
a second order effect from another one of Mathieu's patches that uses
the wakelist but indirectly curbs the SIS_UTIL limits based on Aaron's
observation [1], thus reducing time spent in select_idle_cpu()),
hackbench results seem to improve.

[1] https://lore.kernel.org/lkml/20230905072141.GA253439@ziqianlu-dell/

schbench, tbench, and netperf see that wakeups are faster when the
client and server are on the same LLC, so consolidation, as long as
there is one task per runqueue in the underloaded case, is better than
just keeping them on separate LLCs.

>> All these benchmarks are client-server / messenger-worker oriented and
>> are known to perform better when client-server / messenger-worker pairs
>> are on the same CCX (LLC domain).
>
> I thought hackbench should also be of client-server mode, because
> hackbench has socket/pipe modes and exchanges data between sender and
> receiver.

Yes, but its N:M nature makes it slightly complicated to understand
where the cache benefits disappear and the work conservation benefits
become more prominent.

> This reminds me of your proposal to provide a user hint to the
> scheduler on whether to do task consolidation vs. task spreading;
> could this also be applied to Mathieu's case? For a task or task group
> with the "consolidate" flag set, tasks would prefer to be woken up on
> the target/previous CPU if the wakee fits on that CPU. In this way we
> could bring the benefit without introducing regressions.

I think even a simple WF_SYNC check will help the tbench and netperf
cases. Let me get back to you with some data on different variants of
hackbench with the latest tip.

> thanks,
> Chenyu

--
Thanks and Regards,
Prateek
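For illustration, the simplest form such a WF_SYNC special case could
take is sketched below; this is a sketch only, not a tested patch, and
as Mathieu points out later in the thread, hackbench also issues
WF_SYNC wakeups, so this check alone cannot separate 1:1 pairs from
N:M groups:

/*
 * Sketch only: on a synchronous wakeup, keep the wakee near the waker
 * (the target side); otherwise apply the prev-CPU bias when prev and
 * target share a cache.
 */
static int select_biased_cpu(struct task_struct *p, int prev, int target,
			     int wake_flags)
{
	if (wake_flags & WF_SYNC)
		return target;	/* 1:1 ping-pong: stay near the waker */

	if (sched_feat(SELECT_BIAS_PREV) && prev != target &&
	    cpus_share_cache(prev, target))
		return prev;	/* inhibit migration */

	return target;
}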
On 2023-10-26 23:27, K Prateek Nayak wrote:
[...]
> --
> It is a mixed bag of results, as expected. I would love to hear your
> thoughts on the results. Meanwhile, I'll try to get some more data
> from other benchmarks.

I suspect that workloads that exhibit a client-server (1:1) pairing
pattern are hurt by the bias towards leaving tasks on their prev
runqueue: they benefit from moving both client/server tasks as close as
possible so they share either the same core or a common cache.

The hackbench workload is also client-server, but there are N client
and N server threads, creating an N:N relationship which really does
not work well when trying to pull tasks on sync wakeup: tasks then
bounce all over the place.

It's tricky though. If we try to fix the "1:1" client-server pattern
with a heuristic, we may miss scenarios which are close to 1:1 but
don't exactly match.

I'm working on a rewrite of select_task_rq_fair(), with the aim to
tackle the more general task placement problem taking into account the
following:

- We want to converge towards a task placement that moves tasks with
  the most waker/wakee interactions as close as possible in the cache
  topology,
- We can use the core util_est/capacity metrics to calculate whether we
  have capacity left to enqueue a task in a core's runqueue.
- The underlying assumption is that work conserving [1] is not a good
  characteristic to aim for, because it does not take into account the
  overhead associated with migrations, and thus the lack of cache
  locality.

Thanks,

Mathieu

[1] I use the definition of "work conserving" found here:
https://people.ece.ubc.ca/sasha/papers/eurosys16-final29.pdf
On 2023-11-06 02:06, K Prateek Nayak wrote:
> Hello Chenyu,
>
> On 11/6/2023 11:22 AM, Chen Yu wrote:
>> On 2023-10-27 at 08:57:00 +0530, K Prateek Nayak wrote:
>>> Hello Mathieu,
>>>
>>> On 10/19/2023 9:35 PM, Mathieu Desnoyers wrote:
>>>> Hi,
>>>>
>>>> This series introduces two new scheduler features: UTIL_FITS_CAPACITY
>>>> and SELECT_BIAS_PREV. When used together, they achieve a 41% speedup of
>>>> a hackbench workload which leaves some idle CPU time on a 192-core AMD
>>>> EPYC.
>>>>
>>>> The main metrics which are significantly improved are:
>>>>
>>>> - cpu-migrations are reduced by 80%,
>>>> - CPU utilization is increased by 17%.
>>>>
>>>> Feedback is welcome. I am especially interested to learn whether this
>>>> series has positive or detrimental effects on performance of other
>>>> workloads.
>>>
>>> I got a chance to test this series on a dual socket 3rd Generation EPYC
>>> System (2 x 64C/128T). Following is a quick summary:
>>>
>>> - stream and ycsb-mongodb don't see any changes.
>>>
>>> - hackbench and DeathStarBench see a major improvement. Both are high
>>>   utilization workloads with CPUs being overloaded most of the time.
>>>   DeathStarBench is known to benefit from a lower migration count. It
>>>   was discussed by Gautham at OSPM '23.
>>>
>>> - tbench, netperf, and schbench regress. The former two when the
>>>   system is near fully loaded, and the latter for most cases.
>>
>> Does it mean hackbench gets benefits when the system is overloaded,
>> while tbench/netperf do not benefit when the system is underloaded?
>
> Yup! Seems like that from the results. From what I have seen so far,
> there seems to be a work conservation aspect to hackbench where, if we
> reduce the time spent in the kernel (by reducing the time to decide on
> the target CPU, which Mathieu's patch [this one] achieves;

I am confused by this comment.

Quoting Daniel Bristot, "work conserving" is defined as "in a system
with M processors, the M "highest priority" tasks must be running (in
real-time)". This should apply to other scheduling classes as well.
This definition fits with this paper's definition [1]: "The Linux
scheduler is work-conserving, meaning that it should never leave cores
idle if there is work to do."

Do you mean something different by "work conservation"?

Just in case, I've made the following experiment to figure out if my
patches benefit from having less time spent in select_task_rq_fair().
I have copied the original select_idle_sibling() into a separate
function select_idle_sibling_orig(), which I call at the beginning of
the new "biased" select_idle_sibling(). I use its result in an empty
asm volatile, which ensures that the code is not optimized away. Then
the biased function selects the runqueue with the new biased approach.

The result with hackbench is that the speedup is still pretty much the
same with or without the added select_idle_sibling_orig() call.

Based on this, my understanding is that the speedup comes from
minimizing the amount of migrations (and the side effects caused by
those migrations, such as runqueue locks and cache misses), rather than
from making select_idle_sibling() faster.

So based on this, I suspect that we could add some overhead to
select_task_rq_fair() if it means we make a better task placement
decision and minimize migrations, and that would still provide an
overall benefit performance-wise.
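Concretely, the experiment described above amounts to something like
the following sketch (select_idle_sibling_biased() stands in for the
new selection logic; the names are illustrative, not the actual patch):

static int select_idle_sibling(struct task_struct *p, int prev, int target)
{
	/* Pay the full cost of the original selection logic. */
	int orig_cpu = select_idle_sibling_orig(p, prev, target);

	/*
	 * Empty asm consuming the result: the compiler cannot elide
	 * the call above, yet the result is never otherwise used.
	 */
	asm volatile("" : : "r" (orig_cpu));

	/* The actual decision comes from the biased approach. */
	return select_idle_sibling_biased(p, prev, target);
}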
> there is also
> a second order effect from another one of Mathieu's patches that uses
> the wakelist but indirectly curbs the SIS_UTIL limits based on Aaron's
> observation [1], thus reducing time spent in select_idle_cpu()),
> hackbench results seem to improve.

It's possible that an indirect effect of the bias towards the prev
runqueue is to affect the metrics used by select_idle_cpu() as well and
make it return early.

I've tried adding a 1000-iteration barrier() loop within
select_idle_sibling_orig(), and indeed the hackbench time goes from 29s
to 31s. Therefore, slowing down the task rq selection does have some
impact.

> [1] https://lore.kernel.org/lkml/20230905072141.GA253439@ziqianlu-dell/
>
> schbench, tbench, and netperf see that wakeups are faster when the
> client and server are on the same LLC, so consolidation, as long as
> there is one task per runqueue in the underloaded case, is better than
> just keeping them on separate LLCs.

What is faster for the 1:1 client/server ping-pong scenario: having the
client and server on the same LLC, but different runqueues, or having
them share a single runqueue? If they wait for each other, then I
suspect it's better to place them on the same runqueue as long as there
is capacity left.

>>> All these benchmarks are client-server / messenger-worker oriented
>>> and are known to perform better when client-server / messenger-worker
>>> pairs are on the same CCX (LLC domain).
>>
>> I thought hackbench should also be of client-server mode, because
>> hackbench has socket/pipe modes and exchanges data between sender and
>> receiver.
>
> Yes, but its N:M nature makes it slightly complicated to understand
> where the cache benefits disappear and the work conservation benefits
> become more prominent.

The N:M nature of hackbench AFAIU causes the N server *and* M client
tasks to pull each other pretty much randomly, therefore trashing cache
locality.

I'm still unclear about the definition of "work conservation" in this
discussion.

>> This reminds me of your proposal to provide a user hint to the
>> scheduler on whether to do task consolidation vs. task spreading;
>> could this also be applied to Mathieu's case? For a task or task group
>> with the "consolidate" flag set, tasks would prefer to be woken up on
>> the target/previous CPU if the wakee fits on that CPU. In this way we
>> could bring the benefit without introducing regressions.
>
> I think even a simple WF_SYNC check will help the tbench and netperf
> cases. Let me get back to you with some data on different variants of
> hackbench with the latest tip.

AFAIU (to be double-checked) the hackbench workload also uses WF_SYNC,
which prevents us from using this flag to distinguish between 1:1
server/client and N:M scenarios. Or am I missing something?

Thanks,

Mathieu

[1] https://people.ece.ubc.ca/sasha/papers/eurosys16-final29.pdf
Hello Mathieu,

On 11/6/2023 10:48 PM, Mathieu Desnoyers wrote:
> On 2023-11-06 02:06, K Prateek Nayak wrote:
>> Hello Chenyu,
>>
>> On 11/6/2023 11:22 AM, Chen Yu wrote:
>>> On 2023-10-27 at 08:57:00 +0530, K Prateek Nayak wrote:
>>>> Hello Mathieu,
>>>>
>>>> On 10/19/2023 9:35 PM, Mathieu Desnoyers wrote:
>>>>> Hi,
>>>>>
>>>>> This series introduces two new scheduler features: UTIL_FITS_CAPACITY
>>>>> and SELECT_BIAS_PREV. When used together, they achieve a 41% speedup of
>>>>> a hackbench workload which leaves some idle CPU time on a 192-core AMD
>>>>> EPYC.
>>>>>
>>>>> The main metrics which are significantly improved are:
>>>>>
>>>>> - cpu-migrations are reduced by 80%,
>>>>> - CPU utilization is increased by 17%.
>>>>>
>>>>> Feedback is welcome. I am especially interested to learn whether this
>>>>> series has positive or detrimental effects on performance of other
>>>>> workloads.
>>>>
>>>> I got a chance to test this series on a dual socket 3rd Generation EPYC
>>>> System (2 x 64C/128T). Following is a quick summary:
>>>>
>>>> - stream and ycsb-mongodb don't see any changes.
>>>>
>>>> - hackbench and DeathStarBench see a major improvement. Both are high
>>>>   utilization workloads with CPUs being overloaded most of the time.
>>>>   DeathStarBench is known to benefit from a lower migration count. It
>>>>   was discussed by Gautham at OSPM '23.
>>>>
>>>> - tbench, netperf, and schbench regress. The former two when the
>>>>   system is near fully loaded, and the latter for most cases.
>>>
>>> Does it mean hackbench gets benefits when the system is overloaded,
>>> while tbench/netperf do not benefit when the system is underloaded?
>>
>> Yup! Seems like that from the results. From what I have seen so far,
>> there seems to be a work conservation aspect to hackbench where, if we
>> reduce the time spent in the kernel (by reducing the time to decide on
>> the target CPU, which Mathieu's patch [this one] achieves;
>
> I am confused by this comment.
>
> Quoting Daniel Bristot, "work conserving" is defined as "in a system
> with M processors, the M "highest priority" tasks must be running (in
> real-time)". This should apply to other scheduling classes as well.
> This definition fits with this paper's definition [1]: "The Linux
> scheduler is work-conserving, meaning that it should never leave cores
> idle if there is work to do."
>
> Do you mean something different by "work conservation"?

Sorry for the confusion. My interpretation of the term "work
conservation" was that when there are multiple runnable tasks in the
system, each task more or less gets the same amount of CPU time. In the
case of hackbench specifically, it is time in userspace.

> Just in case, I've made the following experiment to figure out if my
> patches benefit from having less time spent in select_task_rq_fair().
> I have copied the original select_idle_sibling() into a separate
> function select_idle_sibling_orig(), which I call at the beginning of
> the new "biased" select_idle_sibling(). I use its result in an empty
> asm volatile, which ensures that the code is not optimized away. Then
> the biased function selects the runqueue with the new biased approach.

So in a way you are doing two calls to select_idle_sibling() each time?
Or is it more like:

select_idle_sibling(...)
{
	int cpu = select_idle_sibling_orig();

	/*
	 * Take the full cost of select_idle_sibling_orig(),
	 * but return prev_cpu if it is still the optimal
	 * target for the wakeup with the biases.
	 */
	if (sched_feat(SELECT_BIAS_PREV) && prev_cpu_still_optimal(p))
		return prev_cpu;

	return cpu;
}

> The result with hackbench is that the speedup is still pretty much the
> same with or without the added select_idle_sibling_orig() call.
>
> Based on this, my understanding is that the speedup comes from
> minimizing the amount of migrations (and the side effects caused by
> those migrations, such as runqueue locks and cache misses), rather than
> from making select_idle_sibling() faster.
>
> So based on this, I suspect that we could add some overhead to
> select_task_rq_fair() if it means we make a better task placement
> decision and minimize migrations, and that would still provide an
> overall benefit performance-wise.

Some of my older experiments when SIS_NODE was proposed suggested the
opposite, but things might have changed now :) I'll get back to you on
this.

>> there is also
>> a second order effect from another one of Mathieu's patches that uses
>> the wakelist but indirectly curbs the SIS_UTIL limits based on Aaron's
>> observation [1], thus reducing time spent in select_idle_cpu()),
>> hackbench results seem to improve.
>
> It's possible that an indirect effect of the bias towards the prev
> runqueue is to affect the metrics used by select_idle_cpu() as well and
> make it return early.
>
> I've tried adding a 1000-iteration barrier() loop within
> select_idle_sibling_orig(), and indeed the hackbench time goes from 29s
> to 31s. Therefore, slowing down the task rq selection does have some
> impact.
>
>> [1] https://lore.kernel.org/lkml/20230905072141.GA253439@ziqianlu-dell/
>>
>> schbench, tbench, and netperf see that wakeups are faster when the
>> client and server are on the same LLC, so consolidation, as long as
>> there is one task per runqueue in the underloaded case, is better than
>> just keeping them on separate LLCs.
>
> What is faster for the 1:1 client/server ping-pong scenario: having the
> client and server on the same LLC, but different runqueues, or having
> them share a single runqueue?

Client and server on the same LLC, but on different cores, give the
best result.

> If they wait for each other, then I suspect it's better to place them
> on the same runqueue as long as there is capacity left.

Yup, that is correct.

>>>> All these benchmarks are client-server / messenger-worker oriented
>>>> and are known to perform better when client-server / messenger-worker
>>>> pairs are on the same CCX (LLC domain).
>>>
>>> I thought hackbench should also be of client-server mode, because
>>> hackbench has socket/pipe modes and exchanges data between sender and
>>> receiver.
>>
>> Yes, but its N:M nature makes it slightly complicated to understand
>> where the cache benefits disappear and the work conservation benefits
>> become more prominent.
>
> The N:M nature of hackbench AFAIU causes the N server *and* M client
> tasks to pull each other pretty much randomly, therefore trashing cache
> locality.
>
> I'm still unclear about the definition of "work conservation" in this
> discussion.

In my previous observations, if you can minimize the time spent
scheduling the wakee and return back to userspace faster, the benchmark
benefited overall. But then the MM_CID observation goes against this
¯\_(ツ)_/¯ or maybe there is a higher order effect that I might be
missing.

>>> This reminds me of your proposal to provide a user hint to the
>>> scheduler on whether to do task consolidation vs. task spreading;
>>> could this also be applied to Mathieu's case? For a task or task group
>>> with the "consolidate" flag set, tasks would prefer to be woken up on
>>> the target/previous CPU if the wakee fits on that CPU. In this way we
>>> could bring the benefit without introducing regressions.
>>
>> I think even a simple WF_SYNC check will help the tbench and netperf
>> cases. Let me get back to you with some data on different variants of
>> hackbench with the latest tip.
>
> AFAIU (to be double-checked) the hackbench workload also uses WF_SYNC,
> which prevents us from using this flag to distinguish between 1:1
> server/client and N:M scenarios. Or am I missing something?

Yup! You are right. My bad.

> Thanks,
>
> Mathieu
>
> [1] https://people.ece.ubc.ca/sasha/papers/eurosys16-final29.pdf

--
Thanks and Regards,
Prateek
On 2023-11-06 at 11:32:02 -0500, Mathieu Desnoyers wrote:
> On 2023-10-26 23:27, K Prateek Nayak wrote:
> [...]
>> --
>> It is a mixed bag of results, as expected. I would love to hear your
>> thoughts on the results. Meanwhile, I'll try to get some more data
>> from other benchmarks.
>
> I suspect that workloads that exhibit a client-server (1:1) pairing
> pattern are hurt by the bias towards leaving tasks on their prev
> runqueue: they benefit from moving both client/server tasks as close as
> possible so they share either the same core or a common cache.

Yes, this should be true if the wakee's previous runqueue is not idle,
at least on Prateek's machine. Does it mean the change in PATCH 2/2
that chooses the previous CPU over the target CPU when all CPUs are
busy might not be a universal win for 1:1 workloads?

> The hackbench workload is also client-server, but there are N client
> and N server threads, creating an N:N relationship which really does
> not work well when trying to pull tasks on sync wakeup: tasks then
> bounce all over the place.
>
> It's tricky though. If we try to fix the "1:1" client-server pattern
> with a heuristic, we may miss scenarios which are close to 1:1 but
> don't exactly match.
>
> I'm working on a rewrite of select_task_rq_fair(), with the aim to
> tackle the more general task placement problem taking into account the
> following:
>
> - We want to converge towards a task placement that moves tasks with
>   the most waker/wakee interactions as close as possible in the cache
>   topology,
> - We can use the core util_est/capacity metrics to calculate whether we
>   have capacity left to enqueue a task in a core's runqueue.
> - The underlying assumption is that work conserving [1] is not a good
>   characteristic to aim for, because it does not take into account the
>   overhead associated with migrations, and thus the lack of cache
>   locality.

Agree. One pain point is how to figure out the requirement of a wakee:
does the wakee want an idle CPU, or does it want cache locality? One
heuristic I'm thinking of to predict whether a task is cache sensitive:
check both the task's average runtime and its average sleep time. If
the runtime is long, it usually indicates that the task has a large
cache footprint, in terms of icache/dcache. If the sleep time is short,
it means that the task is likely to revisit its hot cache soon.

thanks,
Chenyu
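As a sketch, that heuristic could look like the following; the
avg_runtime/avg_sleep fields and both thresholds are hypothetical, as
the kernel does not currently track these averages in this form:

/* Hypothetical thresholds; tuning would require experimentation. */
#define CACHE_HOT_MIN_RUNTIME_NS	(1 * NSEC_PER_MSEC)
#define CACHE_HOT_MAX_SLEEP_NS		(100 * NSEC_PER_USEC)

/*
 * Sketch: a long average runtime suggests a large icache/dcache
 * footprint, and a short average sleep suggests the cache is likely
 * still hot on wakeup, so such a task would prefer cache locality
 * over an idle CPU.
 */
static bool task_prefers_cache_locality(struct task_struct *p)
{
	return p->se.avg_runtime > CACHE_HOT_MIN_RUNTIME_NS &&
	       p->se.avg_sleep   < CACHE_HOT_MAX_SLEEP_NS;
}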