Message ID | 20230306132521.968182689@infradead.org |
---|---|
Headers |
From: Peter Zijlstra <peterz@infradead.org>
To: mingo@kernel.org, vincent.guittot@linaro.org
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org, juri.lelli@redhat.com, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, corbet@lwn.net, qyousef@layalina.io, chris.hyser@oracle.com, patrick.bellasi@matbug.net, pjt@google.com, pavel@ucw.cz, qperret@google.com, tim.c.chen@linux.intel.com, joshdon@google.com, timj@gnu.org, kprateek.nayak@amd.com, yu.c.chen@intel.com, youssefesmat@chromium.org, joel@joelfernandes.org
Date: Mon, 06 Mar 2023 14:25:21 +0100
Subject: [PATCH 00/10] sched: EEVDF using latency-nice |
Series |
sched: EEVDF using latency-nice
Message
Peter Zijlstra
March 6, 2023, 1:25 p.m. UTC
Hi!

Ever since looking at the latency-nice patches, I've wondered if EEVDF would not make more sense, and I did point Vincent at some older patches I had for that (which is where his augmented rbtree thing comes from).

Also, since I really dislike the dual tree, I also figured we could dynamically switch between an augmented tree and not (and while I have code for that, it's not included in this posting because with the current results I don't think we actually need it).

Anyway, since I'm somewhat under the weather, I spent last week desperately trying to connect a small cluster of neurons in defiance of the snot overlord and bring back the EEVDF patches from the dark crypts where they'd been gathering cobwebs for the past 13-odd years.

By Friday they worked well enough, and this morning (because obviously I forgot the weekend is ideal to run benchmarks) I ran a bunch of hackbench, netperf, tbench and sysbench -- there's a bunch of wins and losses, but nothing that indicates a total fail.

( in fact, some of the schbench results seem to indicate EEVDF schedules a lot more consistently than CFS and has a bunch of latency wins )

( hackbench also doesn't show the augmented tree and generally more expensive pick to be a loss; in fact it shows a slight win here )

hackbench load + cyclictest --policy other results:

                              EEVDF       CFS

             # Min Latencies: 00053
  LNICE(19)  # Avg Latencies: 04350
             # Max Latencies: 76019

             # Min Latencies: 00052      00053
  LNICE(0)   # Avg Latencies: 00690      00687
             # Max Latencies: 14145      13913

             # Min Latencies: 00019
  LNICE(-19) # Avg Latencies: 00261
             # Max Latencies: 05642

The nice -19 numbers aren't as pretty as Vincent's, but at the end I was going cross-eyed from staring at tree prints and I just couldn't figure out where it was going sideways.

There's definitely more benchmarking/tweaking to be done (0-day already reported a stress-ng loss), but if we can pull this off we can delete a whole bunch of icky heuristics code. EEVDF is a much better defined policy than what we currently have.
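For readers new to the policy: the rule EEVDF is named for -- restrict the pick to eligible tasks (those owed service, i.e. with non-negative lag) and among them run the one with the Earliest Eligible Virtual Deadline -- can be sketched as a toy simulation. This is purely illustrative, with made-up task values; it is not the kernel code:

```python
# Toy EEVDF pick: among tasks whose vruntime has not run ahead of the
# weighted average (lag >= 0), choose the earliest virtual deadline.
# Illustrative only -- field names and numbers are invented.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    weight: int        # scheduling weight (nice level mapped to weight)
    vruntime: float    # virtual time received so far
    deadline: float    # virtual deadline (vruntime + request/weight)

def avg_vruntime(tasks):
    # Weighted average vruntime; tasks at or behind it are "eligible".
    total = sum(t.weight for t in tasks)
    return sum(t.vruntime * t.weight for t in tasks) / total

def pick_eevdf(tasks):
    v = avg_vruntime(tasks)
    # lag = v - vruntime, so lag >= 0 means vruntime <= v.
    eligible = [t for t in tasks if t.vruntime <= v]
    return min(eligible, key=lambda t: t.deadline)

tasks = [
    Task("batch",   weight=1, vruntime=12.0, deadline=20.0),
    Task("latency", weight=4, vruntime=9.0,  deadline=10.0),
    Task("bg",      weight=1, vruntime=8.0,  deadline=30.0),
]
print(pick_eevdf(tasks).name)
```

Here "batch" has run ahead of the weighted average vruntime (about 9.33), so it is ineligible; of the two eligible tasks, "latency" has the earlier deadline and is picked. A shorter requested slice (smaller latency-nice) yields an earlier deadline, which is how latency sensitivity maps onto this policy.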
Comments
On Mon, 6 Mar 2023 at 15:17, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Hi!
>
> Ever since looking at the latency-nice patches, I've wondered if EEVDF would
> not make more sense, and I did point Vincent at some older patches I had for
> that (which is where his augmented rbtree thing comes from).
>
> Anyway, since I'm somewhat under the weather, I spent last week desperately
> trying to connect a small cluster of neurons in defiance of the snot overlord
> and bring back the EEVDF patches from the dark crypts where they'd been
> gathering cobwebs for the past 13-odd years.

I haven't studied your patchset in detail yet, but at a first glance this seems to be a major rework of the CFS task placement, and the latency handling is just an add-on on top of the move to EEVDF scheduling.
On Tue, Mar 07, 2023 at 11:27:37AM +0100, Vincent Guittot wrote:
> I haven't studied your patchset in detail yet, but at a first glance this
> seems to be a major rework of the CFS task placement, and the latency
> handling is just an add-on on top of the move to EEVDF scheduling.

It completely reworks the base scheduler: placement, preemption, picking -- everything. The only thing the two have in common is that they're both virtual-time-based schedulers.

The big advantage I see is that EEVDF is fairly well known and studied, and a much better defined scheduler than WFQ. Specifically, WFQ is only well defined in how much time is given to any task (bandwidth), but says nothing about how that time is distributed; there is no native preemption condition/constraint etc. -- all the code we have for that is mostly random heuristics. The WF2Q/EEVDF class of schedulers, on the other hand, *does* define all that, so there is a lot less wiggle room as a result.

The avg_vruntime / placement stuff I did is fundamental to how it controls bandwidth distribution and guarantees the WFQ subset. Specifically, by limiting the pick to the subset of tasks that have positive lag (owed time), it guarantees this fairness -- but that means we need a working measure of lag. Similarly, since the whole 'when' question is well defined in order to provide the additional latency goals of these schedulers, placement is crucial. Things like the sleeper bonus are fundamentally incompatible with latency guarantees -- both affect the 'when'.

The initial EEVDF paper is here:

  https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=805acf7726282721504c8f00575d91ebfd750564

It contains a few 'mistakes' and oversights, but those should not matter.

Anyway, I'm still struggling to make complete sense of what you did -- will continue to stare at that.
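The lag accounting described here can be illustrated numerically. In this small sketch (weights and vruntimes are invented for the example), lag is the gap between the weighted average vruntime and a task's own vruntime; the eligible set is the subset owed time, and the weighted lags necessarily sum to zero:

```python
# Toy lag computation: lag_i = avg_vruntime - vruntime_i.
# Tasks with lag >= 0 are owed service and form the eligible set.
# Illustrative floating-point values; not the kernel's fixed-point math.

tasks = {
    # name: (weight, vruntime)
    "a": (2, 10.0),
    "b": (1, 14.0),
    "c": (1, 8.0),
}

total_weight = sum(w for w, _ in tasks.values())
avg_v = sum(w * v for w, v in tasks.values()) / total_weight

lag = {name: avg_v - v for name, (_, v) in tasks.items()}
eligible = {name for name, l in lag.items() if l >= 0}

# Key invariant of this definition: the weighted sum of lags is zero,
# so time owed (positive lag) is exactly balanced by time overdrawn.
weighted_lag_sum = sum(tasks[n][0] * lag[n] for n in tasks)
print(sorted(eligible), round(weighted_lag_sum, 9))
```

With these numbers avg_v is 10.5, so "a" and "c" are eligible while "b" (which has run ahead) is not; the balancing invariant is what makes limiting the pick to positive-lag tasks preserve WFQ-style fairness.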
> There's definitely more benchmarking/tweaking to be done (0-day already
> reported a stress-ng loss), but if we can pull this off we can delete a whole
> bunch of icky heuristics code. EEVDF is a much better defined policy than what
> we currently have.

Tested the patch series on powerpc systems. The test was done the same way as for Vincent's v12 series: create two cgroups, run stress-ng -l 50 --cpu=<total_cpu> in cgroup2 and the micro-benchmarks in cgroup1, with different latency values assigned to cgroup1. Tested on two different systems: one with 480 CPUs and one with 96 CPUs.

++++++++
Summary:
++++++++
For hackbench, the 480 CPU system shows a good improvement; the 96 CPU system shows the same numbers as 6.2. The smaller system was showing regressing results as discussed in Vincent's v12 series; with this patch set, there is no regression.

schbench shows a good improvement compared to v6.2 at LN=0 or LN=-20, whereas at LN=19 it shows a regression. Please suggest if any variation of the benchmark, or a different benchmark, should be run.
++++++++++++++++++
480 CPU system
++++++++++++++++++

==========
schbench
==========
                v6.2 | v6.2+LN=0 | v6.2+LN=-20 | v6.2+LN=19
1 Threads
  50.0th:     14.00 |     12.00 |       14.50 |      15.00
  75.0th:     16.50 |     14.50 |       17.00 |      18.00
  90.0th:     18.50 |     17.00 |       19.50 |      20.00
  95.0th:     20.50 |     18.50 |       22.00 |      23.50
  99.0th:     27.50 |     24.50 |       31.50 |     155.00
  99.5th:     36.00 |     30.00 |       44.50 |    2991.00
  99.9th:     81.50 |    171.50 |      153.00 |    4621.00
2 Threads
  50.0th:     14.00 |     15.50 |       17.00 |      16.00
  75.0th:     17.00 |     18.00 |       19.00 |      19.00
  90.0th:     20.00 |     21.00 |       22.00 |      22.50
  95.0th:     23.00 |     23.00 |       25.00 |      25.50
  99.0th:     71.00 |     30.50 |       35.50 |     990.50
  99.5th:   1170.00 |     53.00 |       71.00 |    3719.00
  99.9th:   5088.00 |    245.50 |      138.00 |    6644.00
4 Threads
  50.0th:     20.50 |     20.00 |       20.00 |      19.50
  75.0th:     24.50 |     23.00 |       23.00 |      23.50
  90.0th:     31.00 |     27.00 |       26.50 |      27.50
  95.0th:    260.50 |     29.50 |       29.00 |      35.00
  99.0th:   3644.00 |    106.00 |       37.50 |    2884.00
  99.5th:   5152.00 |    227.00 |       92.00 |    5496.00
  99.9th:   8076.00 |   3662.50 |      517.00 |    8640.00
8 Threads
  50.0th:     26.00 |     23.50 |       22.50 |      25.00
  75.0th:     32.50 |     29.50 |       27.50 |      31.00
  90.0th:     41.50 |     34.50 |       31.50 |      39.00
  95.0th:    794.00 |     37.00 |       34.50 |     579.50
  99.0th:   5992.00 |     48.50 |       52.00 |    5872.00
  99.5th:   7208.00 |    100.50 |       97.50 |    7280.00
  99.9th:   9392.00 |   4098.00 |     1226.00 |    9328.00
16 Threads
  50.0th:     37.50 |     33.00 |       34.00 |      37.00
  75.0th:     49.50 |     43.50 |       44.00 |      49.00
  90.0th:     70.00 |     52.00 |       53.00 |      66.00
  95.0th:   1284.00 |     57.50 |       59.00 |    1162.50
  99.0th:   5600.00 |     79.50 |      111.50 |    5912.00
  99.5th:   7216.00 |    282.00 |      194.50 |    7392.00
  99.9th:   9328.00 |   4026.00 |     2009.00 |    9440.00
32 Threads
  50.0th:     59.00 |     56.00 |       57.00 |      59.00
  75.0th:     83.00 |     77.50 |       79.00 |      83.00
  90.0th:    118.50 |     94.00 |       95.00 |     120.50
  95.0th:   1921.00 |    104.50 |      104.00 |    1800.00
  99.0th:   6672.00 |    425.00 |      255.00 |    6384.00
  99.5th:   8252.00 |   2800.00 |     1252.00 |    7696.00
  99.9th:  10448.00 |   7264.00 |     5888.00 |    9504.00

=========
hackbench
=========
Type           groups  v6.2 | v6.2+LN=0 | v6.2+LN=-20 | v6.2+LN=19
Process        10      0.19 |      0.18 |        0.17 |       0.18
Process        20      0.34 |      0.32 |        0.33 |       0.31
Process        30      0.45 |      0.42 |        0.43 |       0.43
Process        40      0.58 |      0.53 |        0.53 |       0.53
Process        50      0.70 |      0.64 |        0.64 |       0.65
Process        60      0.82 |      0.74 |        0.75 |       0.76
thread         10      0.20 |      0.19 |        0.19 |       0.19
thread         20      0.36 |      0.34 |        0.34 |       0.34
Process(Pipe)  10      0.24 |      0.15 |        0.15 |       0.15
Process(Pipe)  20      0.46 |      0.22 |        0.22 |       0.21
Process(Pipe)  30      0.65 |      0.30 |        0.29 |       0.29
Process(Pipe)  40      0.90 |      0.35 |        0.36 |       0.34
Process(Pipe)  50      1.04 |      0.38 |        0.39 |       0.38
Process(Pipe)  60      1.16 |      0.42 |        0.42 |       0.43
thread(Pipe)   10      0.19 |      0.13 |        0.13 |       0.13
thread(Pipe)   20      0.46 |      0.21 |        0.21 |       0.21

++++++++++++++++++
96 CPU system
++++++++++++++++++

===========
schbench
===========
                v6.2 | v6.2+LN=0 | v6.2+LN=-20 | v6.2+LN=19
1 Thread
  50.0th:     10.50 |     10.00 |       10.00 |      11.00
  75.0th:     12.50 |     11.50 |       11.50 |      12.50
  90.0th:     15.00 |     13.00 |       13.50 |      16.50
  95.0th:     47.50 |     15.00 |       15.00 |     274.50
  99.0th:   4744.00 |     17.50 |       18.00 |    5032.00
  99.5th:   7640.00 |     18.50 |      525.00 |    6636.00
  99.9th:   8916.00 |    538.00 |     6704.00 |    9264.00
2 Threads
  50.0th:     11.00 |     10.00 |       11.00 |      11.00
  75.0th:     13.50 |     12.00 |       12.50 |      13.50
  90.0th:     17.00 |     14.00 |       14.00 |      17.00
  95.0th:    451.50 |     16.00 |       15.50 |     839.00
  99.0th:   5488.00 |     20.50 |       18.00 |    6312.00
  99.5th:   6712.00 |    986.00 |       19.00 |    7664.00
  99.9th:   9856.00 |   4913.00 |     1154.00 |    8736.00
4 Threads
  50.0th:     13.00 |     12.00 |       12.00 |      13.00
  75.0th:     15.00 |     14.00 |       14.00 |      15.00
  90.0th:     23.50 |     16.00 |       16.00 |      20.00
  95.0th:   2508.00 |     17.50 |       17.50 |    1818.00
  99.0th:   7232.00 |    777.00 |       38.50 |    5952.00
  99.5th:   8720.00 |   3548.00 |     1926.00 |    7788.00
  99.9th:  10352.00 |   6320.00 |     7160.00 |   10000.00
8 Threads
  50.0th:     16.00 |     15.00 |       15.00 |      16.00
  75.0th:     20.00 |     18.00 |       18.00 |      19.50
  90.0th:    371.50 |     20.00 |       21.00 |     245.50
  95.0th:   2992.00 |     22.00 |       23.00 |    2608.00
  99.0th:   7784.00 |   1084.50 |      563.50 |    7136.00
  99.5th:   9488.00 |   2612.00 |     2696.00 |    8720.00
  99.9th:  15568.00 |   6656.00 |     7496.00 |   10000.00
16 Threads
  50.0th:     23.00 |     21.00 |       20.00 |      22.50
  75.0th:     31.00 |     27.50 |       26.00 |      29.50
  90.0th:   1981.00 |     32.50 |       30.50 |    1500.50
  95.0th:   4856.00 |    304.50 |       34.00 |    4046.00
  99.0th:  10112.00 |   5720.00 |     4590.00 |    8220.00
  99.5th:  13104.00 |   7828.00 |     7008.00 |    9312.00
  99.9th:  18624.00 |   9856.00 |     9504.00 |   11984.00
32 Threads
  50.0th:     36.50 |     34.50 |       33.50 |      35.50
  75.0th:     56.50 |     48.00 |       46.00 |      52.50
  90.0th:   4728.00 |   1470.50 |      376.00 |    3624.00
  95.0th:   7808.00 |   4130.00 |     3850.00 |    6488.00
  99.0th:  15776.00 |   8972.00 |     9060.00 |    9872.00
  99.5th:  19072.00 |  11328.00 |    12224.00 |   11520.00
  99.9th:  28864.00 |  18016.00 |    18368.00 |   18848.00

==========
Hackbench
==========
Type           groups  v6.2 | v6.2+LN=0 | v6.2+LN=-20 | v6.2+LN=19
Process        10      0.33 |      0.33 |        0.33 |       0.33
Process        20      0.61 |      0.56 |        0.58 |       0.57
Process        30      0.87 |      0.82 |        0.81 |       0.81
Process        40      1.10 |      1.05 |        1.06 |       1.05
Process        50      1.34 |      1.28 |        1.29 |       1.29
Process        60      1.58 |      1.53 |        1.52 |       1.51
thread         10      0.36 |      0.35 |        0.35 |       0.35
thread         20      0.64 |      0.63 |        0.62 |       0.62
Process(Pipe)  10      0.18 |      0.18 |        0.18 |       0.17
Process(Pipe)  20      0.32 |      0.31 |        0.31 |       0.31
Process(Pipe)  30      0.42 |      0.41 |        0.41 |       0.42
Process(Pipe)  40      0.56 |      0.53 |        0.55 |       0.53
Process(Pipe)  50      0.68 |      0.66 |        0.66 |       0.66
Process(Pipe)  60      0.80 |      0.78 |        0.78 |       0.78
thread(Pipe)   10      0.20 |      0.18 |        0.19 |       0.18
thread(Pipe)   20      0.34 |      0.34 |        0.33 |       0.33

Tested-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Hello Peter,

Leaving some results from my testing on a dual socket Zen3 machine (2 x 64C/128T) below.

tl;dr

o I've not tested workloads with nice and latency nice yet, focusing more on the out-of-the-box performance. No changes to sched_feat were made for the same reason.

o Except for hackbench (m:n communication relationship), I do not see any regression for other standard benchmarks (mostly 1:1 or 1:n relation) when the system is below fully loaded.

o At the fully loaded scenario, schbench seems to be unhappy. Looking at the data from /proc/<pid>/sched for the tasks with schedstats enabled, there is an increase in the number of context switches and the total wait sum. When the system is overloaded, things flip and the schbench tail latency improves drastically. I suspect the involuntary context switches help workers make progress much sooner after wakeup compared to tip, thus leading to lower tail latency.

o For the same reason as above, tbench throughput takes a hit, with the number of involuntary context switches increasing drastically for the tbench server. There is also an increase in wait sum.

o A couple of real-world workloads were also tested. DeathStarBench throughput tanks much more with the updated version in your tree compared to this series as is. SpecJBB Max-jOPS sees large improvements, but this comes at the cost of a drop in Critical-jOPS, signifying an increase in either wait time or involuntary context switches, which can lead to transactions taking longer to complete.

o Apart from DeathStarBench, all the trends reported remain the same when comparing the version in your tree and this series, as is, applied on the same base kernel.

I'll leave the detailed results below and some limited analysis.

On 3/6/2023 6:55 PM, Peter Zijlstra wrote:
> Hi!
>
> Ever since looking at the latency-nice patches, I've wondered if EEVDF would
> not make more sense, and I did point Vincent at some older patches I had for
> that (which is where his augmented rbtree thing comes from).

Following are the results from testing the series on a dual socket Zen3 machine (2 x 64C/128T):

NPS Modes are used to logically divide a single socket into multiple NUMA regions. Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node. Total 2 NUMA nodes in the dual socket machine.
Node 0: 0-63, 128-191
Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions. Total 4 NUMA nodes exist over 2 sockets.
Node 0: 0-31, 128-159 Node 1: 32-63, 160-191 Node 2: 64-95, 192-223 Node 3: 96-127, 223-255 NPS4: Each socket is logically divided into 4 NUMA regions. Total 8 NUMA nodes exist over 2 socket. Node 0: 0-15, 128-143 Node 1: 16-31, 144-159 Node 2: 32-47, 160-175 Node 3: 48-63, 176-191 Node 4: 64-79, 192-207 Node 5: 80-95, 208-223 Node 6: 96-111, 223-231 Node 7: 112-127, 232-255 Kernel versions: - tip: 6.2.0-rc6 tip sched/core - eevdf: 6.2.0-rc6 tip sched/core + eevdf commits from your tree (https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/eevdf) - eevdf prev: 6.2.0-rc6 tip sched/core + this series as is When the testing started, the tip was at: commit 7c4a5b89a0b5 "sched/rt: pick_next_rt_entity(): check list_entry" Benchmark Results: ~~~~~~~~~~~~~ ~ hackbench ~ ~~~~~~~~~~~~~ o NPS1 Test: tip eevdf 1-groups: 4.63 (0.00 pct) 4.52 (2.37 pct) 2-groups: 4.42 (0.00 pct) 5.41 (-22.39 pct) * 4-groups: 4.21 (0.00 pct) 5.26 (-24.94 pct) * 8-groups: 4.95 (0.00 pct) 5.01 (-1.21 pct) 16-groups: 5.43 (0.00 pct) 6.24 (-14.91 pct) * o NPS2 Test: tip eevdf 1-groups: 4.68 (0.00 pct) 4.56 (2.56 pct) 2-groups: 4.45 (0.00 pct) 5.19 (-16.62 pct) * 4-groups: 4.19 (0.00 pct) 4.53 (-8.11 pct) * 8-groups: 4.80 (0.00 pct) 4.81 (-0.20 pct) 16-groups: 5.60 (0.00 pct) 6.22 (-11.07 pct) * o NPS4 Test: tip eevdf 1-groups: 4.68 (0.00 pct) 4.57 (2.35 pct) 2-groups: 4.56 (0.00 pct) 5.19 (-13.81 pct) * 4-groups: 4.50 (0.00 pct) 4.96 (-10.22 pct) * 8-groups: 5.76 (0.00 pct) 5.49 (4.68 pct) 16-groups: 5.60 (0.00 pct) 6.53 (-16.60 pct) * ~~~~~~~~~~~~ ~ schbench ~ ~~~~~~~~~~~~ o NPS1 #workers: tip eevdf 1: 36.00 (0.00 pct) 36.00 (0.00 pct) 2: 37.00 (0.00 pct) 37.00 (0.00 pct) 4: 38.00 (0.00 pct) 39.00 (-2.63 pct) 8: 52.00 (0.00 pct) 50.00 (3.84 pct) 16: 66.00 (0.00 pct) 68.00 (-3.03 pct) 32: 111.00 (0.00 pct) 109.00 (1.80 pct) 64: 213.00 (0.00 pct) 212.00 (0.46 pct) 128: 502.00 (0.00 pct) 637.00 (-26.89 pct) * 256: 45632.00 (0.00 pct) 24992.00 (45.23 pct) ^ 512: 78720.00 (0.00 
pct) 44096.00 (43.98 pct) ^ o NPS2 #workers: tip eevdf 1: 31.00 (0.00 pct) 23.00 (25.80 pct) 2: 32.00 (0.00 pct) 33.00 (-3.12 pct) 4: 39.00 (0.00 pct) 37.00 (5.12 pct) 8: 52.00 (0.00 pct) 49.00 (5.76 pct) 16: 67.00 (0.00 pct) 68.00 (-1.49 pct) 32: 113.00 (0.00 pct) 112.00 (0.88 pct) 64: 213.00 (0.00 pct) 214.00 (-0.46 pct) 128: 508.00 (0.00 pct) 491.00 (3.34 pct) 256: 46912.00 (0.00 pct) 22304.00 (52.45 pct) ^ 512: 76672.00 (0.00 pct) 42944.00 (43.98 pct) ^ o NPS4 #workers: tip eevdf 1: 33.00 (0.00 pct) 30.00 (9.09 pct) 2: 40.00 (0.00 pct) 36.00 (10.00 pct) 4: 44.00 (0.00 pct) 41.00 (6.81 pct) 8: 73.00 (0.00 pct) 73.00 (0.00 pct) 16: 71.00 (0.00 pct) 71.00 (0.00 pct) 32: 111.00 (0.00 pct) 115.00 (-3.60 pct) 64: 217.00 (0.00 pct) 211.00 (2.76 pct) 128: 509.00 (0.00 pct) 553.00 (-8.64 pct) * 256: 44352.00 (0.00 pct) 26848.00 (39.46 pct) ^ 512: 75392.00 (0.00 pct) 44352.00 (41.17 pct) ^ ~~~~~~~~~~ ~ tbench ~ ~~~~~~~~~~ o NPS1 Clients: tip eevdf 1 483.10 (0.00 pct) 476.46 (-1.37 pct) 2 956.03 (0.00 pct) 943.12 (-1.35 pct) 4 1786.36 (0.00 pct) 1760.64 (-1.43 pct) 8 3304.47 (0.00 pct) 3105.19 (-6.03 pct) 16 5440.44 (0.00 pct) 5609.24 (3.10 pct) 32 10462.02 (0.00 pct) 10416.02 (-0.43 pct) 64 18995.99 (0.00 pct) 19317.34 (1.69 pct) 128 27896.44 (0.00 pct) 28459.38 (2.01 pct) 256 49742.89 (0.00 pct) 46371.44 (-6.77 pct) * 512 49583.01 (0.00 pct) 45717.22 (-7.79 pct) * 1024 48467.75 (0.00 pct) 43475.31 (-10.30 pct) * o NPS2 Clients: tip eevdf 1 472.57 (0.00 pct) 475.35 (0.58 pct) 2 938.27 (0.00 pct) 942.19 (0.41 pct) 4 1764.34 (0.00 pct) 1783.50 (1.08 pct) 8 3043.57 (0.00 pct) 3205.85 (5.33 pct) 16 5103.53 (0.00 pct) 5154.94 (1.00 pct) 32 9767.22 (0.00 pct) 9793.81 (0.27 pct) 64 18712.65 (0.00 pct) 18601.10 (-0.59 pct) 128 27691.95 (0.00 pct) 27542.57 (-0.53 pct) 256 47939.24 (0.00 pct) 43401.62 (-9.46 pct) * 512 47843.70 (0.00 pct) 43971.16 (-8.09 pct) * 1024 48412.05 (0.00 pct) 42808.58 (-11.57 pct) * o NPS4 Clients: tip eevdf 1 486.74 (0.00 pct) 484.88 (-0.38 pct) 2 
950.50 (0.00 pct) 950.04 (-0.04 pct) 4 1778.58 (0.00 pct) 1796.03 (0.98 pct) 8 3106.36 (0.00 pct) 3180.09 (2.37 pct) 16 5139.81 (0.00 pct) 5139.50 (0.00 pct) 32 9911.04 (0.00 pct) 10086.37 (1.76 pct) 64 18201.46 (0.00 pct) 18289.40 (0.48 pct) 128 27284.67 (0.00 pct) 26947.19 (-1.23 pct) 256 46793.72 (0.00 pct) 43971.87 (-6.03 pct) * 512 48841.96 (0.00 pct) 44255.01 (-9.39 pct) * 1024 48811.99 (0.00 pct) 43118.99 (-11.66 pct) * ~~~~~~~~~~ ~ stream ~ ~~~~~~~~~~ o NPS1 - 10 Runs: Test: tip eevdf Copy: 321229.54 (0.00 pct) 332975.45 (3.65 pct) Scale: 207471.32 (0.00 pct) 212534.83 (2.44 pct) Add: 234962.15 (0.00 pct) 243011.39 (3.42 pct) Triad: 246256.00 (0.00 pct) 256453.73 (4.14 pct) - 100 Runs: Test: tip eevdf Copy: 332714.94 (0.00 pct) 333183.42 (0.14 pct) Scale: 216140.84 (0.00 pct) 212160.53 (-1.84 pct) Add: 239605.00 (0.00 pct) 233168.69 (-2.68 pct) Triad: 258580.84 (0.00 pct) 256972.33 (-0.62 pct) o NPS2 - 10 Runs: Test: tip eevdf Copy: 324423.92 (0.00 pct) 340685.20 (5.01 pct) Scale: 215993.56 (0.00 pct) 217895.31 (0.88 pct) Add: 250590.28 (0.00 pct) 257495.12 (2.75 pct) Triad: 261284.44 (0.00 pct) 261373.49 (0.03 pct) - 100 Runs: Test: tip eevdf Copy: 325993.72 (0.00 pct) 341244.18 (4.67 pct) Scale: 227201.27 (0.00 pct) 227255.98 (0.02 pct) Add: 256601.84 (0.00 pct) 258026.75 (0.55 pct) Triad: 260222.19 (0.00 pct) 269878.75 (3.71 pct) o NPS4 - 10 Runs: Test: tip eevdf Copy: 356850.80 (0.00 pct) 371230.27 (4.02 pct) Scale: 247219.39 (0.00 pct) 237846.20 (-3.79 pct) Add: 268588.78 (0.00 pct) 261088.54 (-2.79 pct) Triad: 272932.59 (0.00 pct) 284068.07 (4.07 pct) - 100 Runs: Test: tip eevdf Copy: 365965.18 (0.00 pct) 371186.97 (1.42 pct) Scale: 246068.58 (0.00 pct) 245991.10 (-0.03 pct) Add: 263677.73 (0.00 pct) 269021.14 (2.02 pct) Triad: 273701.36 (0.00 pct) 280566.44 (2.50 pct) ~~~~~~~~~~~~~ ~ Unixbench ~ ~~~~~~~~~~~~~ o NPS1 Test Metric Parallelism tip eevdf unixbench-dhry2reg Hmean unixbench-dhry2reg-1 49077561.21 ( 0.00%) 49144835.64 ( 0.14%) 
unixbench-dhry2reg  Hmean     unixbench-dhry2reg-512 6285373890.61 (   0.00%) 6270537933.92 (  -0.24%)
unixbench-syscall   Amean     unixbench-syscall-1       2664815.40 (   0.00%)    2679289.17 *  -0.54%*
unixbench-syscall   Amean     unixbench-syscall-512     7848462.70 (   0.00%)    7456802.37 *   4.99%*
unixbench-pipe      Hmean     unixbench-pipe-1          2531131.89 (   0.00%)    2475863.05 *  -2.18%*
unixbench-pipe      Hmean     unixbench-pipe-512      305171024.40 (   0.00%)  301182156.60 (  -1.31%)
unixbench-spawn     Hmean     unixbench-spawn-1            4058.05 (   0.00%)       4284.38 *   5.58%*
unixbench-spawn     Hmean     unixbench-spawn-512         79893.24 (   0.00%)      78234.45 *  -2.08%*
unixbench-execl     Hmean     unixbench-execl-1            4148.64 (   0.00%)       4086.73 *  -1.49%*
unixbench-execl     Hmean     unixbench-execl-512         11077.20 (   0.00%)      11137.79 (   0.55%)

o NPS2

Test                Metric    Parallelism                    tip                     eevdf
unixbench-dhry2reg  Hmean     unixbench-dhry2reg-1     49394822.56 (   0.00%)   49175574.26 (  -0.44%)
unixbench-dhry2reg  Hmean     unixbench-dhry2reg-512 6267817215.36 (   0.00%) 6282838979.08 *   0.24%*
unixbench-syscall   Amean     unixbench-syscall-1       2663675.03 (   0.00%)    2677018.53 *  -0.50%*
unixbench-syscall   Amean     unixbench-syscall-512     7342392.90 (   0.00%)    7443264.00 *  -1.37%*
unixbench-pipe      Hmean     unixbench-pipe-1          2533194.04 (   0.00%)    2475969.01 *  -2.26%*
unixbench-pipe      Hmean     unixbench-pipe-512      303588239.03 (   0.00%)  302217597.98 *  -0.45%*
unixbench-spawn     Hmean     unixbench-spawn-1            5141.40 (   0.00%)       4862.78 (  -5.42%)  *
unixbench-spawn     Hmean     unixbench-spawn-512         82993.79 (   0.00%)      79139.42 *  -4.64%*  *
unixbench-execl     Hmean     unixbench-execl-1            4140.15 (   0.00%)       4084.20 *  -1.35%*
unixbench-execl     Hmean     unixbench-execl-512         12229.25 (   0.00%)      11445.22 (  -6.41%)  *

o NPS4

Test                Metric    Parallelism                    tip                     eevdf
unixbench-dhry2reg  Hmean     unixbench-dhry2reg-1     48970677.27 (   0.00%)   49070289.56 (   0.20%)
unixbench-dhry2reg  Hmean     unixbench-dhry2reg-512 6297506696.81 (   0.00%) 6311038905.07 (   0.21%)
unixbench-syscall   Amean     unixbench-syscall-1       2664715.13 (   0.00%)    2677752.20 *  -0.49%*
unixbench-syscall   Amean     unixbench-syscall-512     7938670.70 (   0.00%)    7972291.60 (  -0.42%)
unixbench-pipe      Hmean     unixbench-pipe-1          2527605.54 (   0.00%)    2476140.77 *  -2.04%*
unixbench-pipe      Hmean     unixbench-pipe-512      305068507.23 (   0.00%)  304114548.50 (  -0.31%)
unixbench-spawn     Hmean     unixbench-spawn-1            5207.34 (   0.00%)       4964.39 (  -4.67%)  *
unixbench-spawn     Hmean     unixbench-spawn-512         81352.38 (   0.00%)      74467.00 *  -8.46%*  *
unixbench-execl     Hmean     unixbench-execl-1            4131.37 (   0.00%)       4044.09 *  -2.11%*
unixbench-execl     Hmean     unixbench-execl-512         13025.56 (   0.00%)      11124.77 * -14.59%*  *

~~~~~~~~~~~
~ netperf ~
~~~~~~~~~~~

o NPS1

                     tip                     eevdf
  1-clients:  107932.22 (0.00 pct)    106167.39 (-1.63 pct)
  2-clients:  106887.99 (0.00 pct)    105304.25 (-1.48 pct)
  4-clients:  106676.11 (0.00 pct)    104328.10 (-2.20 pct)
  8-clients:   98645.45 (0.00 pct)     94076.26 (-4.63 pct)
 16-clients:   88881.23 (0.00 pct)     86831.85 (-2.30 pct)
 32-clients:   86654.28 (0.00 pct)     86313.80 (-0.39 pct)
 64-clients:   81431.90 (0.00 pct)     74885.75 (-8.03 pct)
128-clients:   55993.77 (0.00 pct)     55378.10 (-1.09 pct)
256-clients:   43865.59 (0.00 pct)     44326.30 (1.05 pct)

o NPS2

                     tip                     eevdf
  1-clients:  106711.81 (0.00 pct)    108576.27 (1.74 pct)
  2-clients:  106987.79 (0.00 pct)    108348.24 (1.27 pct)
  4-clients:  105275.37 (0.00 pct)    105702.12 (0.40 pct)
  8-clients:  103028.31 (0.00 pct)     96250.20 (-6.57 pct)
 16-clients:   87382.43 (0.00 pct)     87683.29 (0.34 pct)
 32-clients:   86578.14 (0.00 pct)     86968.29 (0.45 pct)
 64-clients:   81470.63 (0.00 pct)     75906.15 (-6.83 pct)
128-clients:   54803.35 (0.00 pct)     55051.90 (0.45 pct)
256-clients:   42910.29 (0.00 pct)     44062.33 (2.68 pct)

~~~~~~~~~~~
~ SpecJBB ~
~~~~~~~~~~~

o NPS1

                   tip        eevdf
Max-jOPS          100%     115.71% (+15.71%)  ^
Critical-jOPS     100%      93.59% (-6.41%)   *

~~~~~~~~~~~~~~~~~~
~ DeathStarBench ~
~~~~~~~~~~~~~~~~~~

o NPS1

#CCX                                  1 CCX     2 CCX     3 CCX     4 CCX
o eevdf compared to tip              -10.93    -14.35     -9.74     -6.07
o eevdf prev (this series as is)
  compared to tip                     -1.99     -6.64     -4.99     -3.87

Note: #CCX is the number of LLCs the services are pinned to.
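(For reference, the "(x.xx pct)" figures throughout these tables are plain percent change relative to the tip baseline; a minimal sketch of that computation, with an illustrative function name:)

```python
def pct_diff(tip: float, eevdf: float) -> float:
    """Percent change of the eevdf result relative to the tip baseline."""
    return round((eevdf - tip) / tip * 100.0, 2)

# e.g. the 256-entry row in the first table above:
print(pct_diff(46793.72, 43971.87))  # -6.03
```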
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ Some Preliminary Analysis ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

tl;dr

- There seems to be an increase in the number of involuntary context
  switches when the system is overloaded. This probably allows a newly
  waking task to make progress, benefiting a latency-sensitive workload
  like schbench in the overloaded scenario compared to tip, but hurts
  tbench performance. When the system is fully loaded, the larger average
  wait time seems to hurt schbench performance. More analysis is needed to
  get to the bottom of the problem.

- For the hackbench 2-groups scenario, the wait time seems to go up
  drastically.

Scheduler statistics of interest are listed in detail below.

Note: Units of all metrics denoting time are ms. They are processed from
per-task schedstats in /proc/<pid>/sched.

o Hackbench (2 Groups) (NPS1)

                                     tip               eevdf             %diff
Comm                           sched-messaging   sched-messaging        N/A
Sum of avg_atom                    282.0024818       19.04355233      -93.24702669
Average of avg_atom                  3.481512121      0.235105584     -93.24702669
Sum of avg_per_cpu                1761.949461        61.52537145      -96.50810805
Average of avg_per_cpu              21.75246248       0.759572487     -96.50810805
Average of avg_wait_time             0.007239228      0.012899105      78.18343632
Sum of nr_switches             4897740           4728784               -3.449672706
Sum of nr_voluntary_switches   4742512           4621606               -2.549408415
Sum of nr_involuntary_switches  155228            107178              -30.95446698
Sum of nr_wakeups              4742648           4623175               -2.51912012
Sum of nr_migrations           1263925            930600              -26.37221354
Sum of sum_exec_runtime         288481.15         262255.2574          -9.091024712
Sum of sum_idle_runtime        2576164.568       2851759.68            10.69788457
Sum of sum_sleep_runtime         76890.14753       78632.31679          2.265789982
Sum of wait_count              4897894           4728939               -3.449543824
Sum of wait_sum                   3041.78227      24167.4694          694.5167422

o schbench (2 messengers, 128 workers - fully loaded) (NPS1)

                                     tip               eevdf             %diff
Comm                            schbench          schbench             N/A
Sum of avg_atom                   7538.162897       7289.565705        -3.297848503
Average of avg_atom                 29.10487605       28.14504133      -3.297848503
Sum of avg_per_cpu              630248.6079        471215.3671        -25.23341406
Average of avg_per_cpu            2433.392309       1819.364352       -25.23341406
Average of avg_wait_time             0.054147456      25.34304285    46703.75524
Sum of nr_switches               85210             88176                3.480812111
Sum of nr_voluntary_switches     83165             83457                0.351109241
Sum of nr_involuntary_switches    2045              4719              130.7579462
Sum of nr_wakeups                83168             83459                0.34989419
Sum of nr_migrations              3265              3025               -7.350689127
Sum of sum_exec_runtime        2476504.52        2469058.164           -0.300680129
Sum of sum_idle_runtime      110294825.8       132520924.2             20.15153321
Sum of sum_sleep_runtime       5293337.741       5297778.714            0.083897408
Sum of sum_block_runtime            56.043253        15.12936         -73.00413664
Sum of wait_count                85615             88606                3.493546692
Sum of wait_sum                   4653.340163      9605.221964        106.4156418

o schbench (2 messengers, 256 workers - overloaded) (NPS1)

                                     tip               eevdf             %diff
Comm                            schbench          schbench             N/A
Sum of avg_atom                  11676.77306        4803.485728       -58.8629007
Average of avg_atom                 22.67334574        9.327156753    -58.8629007
Sum of avg_per_cpu               55235.68013       38286.47722       -30.68524343
Average of avg_per_cpu             107.2537478        74.34267421     -30.68524343
Average of avg_wait_time             2.23189096        2.58191945      15.68304621
Sum of nr_switches              202862            425258              109.6292061
Sum of nr_voluntary_switches    163079            165058                1.213522281
Sum of nr_involuntary_switches   39783            260200              554.0482115
Sum of nr_wakeups               163082            165058                1.211660392
Sum of nr_migrations             44199             54894               24.19738003
Sum of sum_exec_runtime        4586675.667       3963846.024          -13.57910801
Sum of sum_idle_runtime      201050644.2       195126863.7             -2.946412087
Sum of sum_sleep_runtime      10418117.66       10402686.4             -0.148119407
Sum of sum_block_runtime          1548.979156       516.115078        -66.68030838
Sum of wait_count               203377            425792              109.3609405
Sum of wait_sum                 455609.3122      1100885.201          141.6292142

o tbench (256 clients - overloaded) (NPS1)

- tbench client

                                     tip               eevdf             %diff
Comm                            tbench            tbench               N/A
Sum of avg_atom                      3.594587941       5.112101854     42.21663064
Average of avg_atom                  0.013986724       0.019891447     42.21663064
Sum of avg_per_cpu              392838.0975        142065.4206        -63.83613975
Average of avg_per_cpu            1528.552909        552.7837377      -63.83613975
Average of avg_wait_time             0.010512441       0.006861579    -34.72895916
Sum of nr_switches            692845080         511780111             -26.1335433
Sum of nr_voluntary_switches  178151085         371234907             108.3820635
Sum of nr_involuntary_switches 514693995        140545204             -72.69344399
Sum of nr_wakeups             178151085         371234909             108.3820646
Sum of nr_migrations             45279             71177               57.19649286
Sum of sum_exec_runtime        9192343.465       9624025.792            4.69610746
Sum of sum_idle_runtime        7125370.721      16145736.39           126.5950365
Sum of sum_sleep_runtime       2222469.726       5792868.629          160.650058
Sum of sum_block_runtime            68.60879        446.080476        550.1797743
Sum of wait_count             692845479         511780543             -26.13352349
Sum of wait_sum                7287852.246       3297894.139          -54.7480653

- tbench server

                                     tip               eevdf             %diff
Comm                            tbench_srv        tbench_srv           N/A
Sum of avg_atom                      5.077837807       5.447267364      7.275331971
Average of avg_atom                  0.019758124       0.021195593      7.275331971
Sum of avg_per_cpu              538586.1634         87925.51225       -83.67475471
Average of avg_per_cpu            2095.666006         342.1226158      -83.67475471
Average of avg_wait_time             0.000827346       0.006505748    686.3392261
Sum of nr_switches            692980666         511838912             -26.13951051
Sum of nr_voluntary_switches  690367607         390304935             -43.46418762
Sum of nr_involuntary_switches  2613059         121533977            4551.023073
Sum of nr_wakeups             690367607         390304935             -43.46418762
Sum of nr_migrations             39486             84474              113.9340526
Sum of sum_exec_runtime        9176708.278       8734423.401           -4.819646259
Sum of sum_idle_runtime         413900.3645       447180.3879           8.040588086
Sum of sum_sleep_runtime       8966201.976       6690818.107          -25.37734345
Sum of sum_block_runtime             1.776413         1.617435         -8.949382829
Sum of wait_count             692980942         511839229             -26.13949418
Sum of wait_sum                 565739.6984      3295519.077          482.5150836

> > The nice -19 numbers aren't as pretty as Vincent's, but at the end I was
> > going cross-eyed from staring at tree prints and I just couldn't figure
> > out where it was going side-ways.
> >
> > There's definitely more benchmarking/tweaking to be done (0-day already
> > reported a stress-ng loss), but if we can pull this off we can delete a
> > whole bunch of icky heuristics code.
> > EEVDF is a much better defined policy than what we currently have.

DeathStarBench and SpecJBB are slightly more complex to analyze. I'll get
the schedstat data for both soon.

I'll rerun some of the above workloads with NO_PRESERVE_LAG to see if that
makes any difference. In the meantime, if you need more data from the test
system for any particular workload, please do let me know. I will collect
the per-task and system-wide schedstat data for the workload, as it is
rather inexpensive to collect and gives good insights, but if you need any
other data, I'll be more than happy to get those too for analysis.

--
Thanks and Regards,
Prateek
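P.S. For anyone wanting to reproduce the aggregation: the sums above were
computed from /proc/<pid>/sched, which prints one "name : value" pair per
line. A minimal sketch of such an aggregator (helper names are mine, and
the exact field set depends on the kernel version and CONFIG_SCHEDSTATS):

```python
from pathlib import Path

def parse_sched(text: str) -> dict:
    """Parse the 'name : value' lines of a /proc/<pid>/sched dump."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue
        try:
            stats[name.strip()] = float(value.strip())
        except ValueError:
            pass  # header line or non-numeric field
    return stats

def sum_field(pids, field):
    """Sum one schedstat field over a set of PIDs (e.g. all tbench tasks)."""
    return sum(
        parse_sched(Path(f"/proc/{pid}/sched").read_text()).get(field, 0.0)
        for pid in pids
    )
```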
Hello Peter,

One important detail I forgot to mention: when I picked the eevdf commits
from your tree
(https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=sched/core),
they were based on v6.3-rc1 with the sched/eevdf HEAD at:

commit: 0dddbc0b54ad ("sched/fair: Implement an EEVDF like policy")

On 3/22/2023 12:19 PM, K Prateek Nayak wrote:
> Hello Peter,
>
> Leaving some results from my testing on a dual socket Zen3 machine
> (2 x 64C/128T) below.
>
> tl;dr
>
> o I've not tested workloads with nice and latency nice yet, focusing more
>   on the out-of-the-box performance. No changes to sched_feat were made
>   for the same reason.
>
> o Except for hackbench (m:n communication relationship), I do not see any
>   regression for other standard benchmarks (mostly a 1:1 or 1:n relation)
>   when the system is below fully loaded.
>
> o At the fully loaded scenario, schbench seems to be unhappy. Looking at
>   the data from /proc/<pid>/sched for the tasks with schedstats enabled,
>   there is an increase in the number of context switches and the total
>   wait sum. When the system is overloaded, things flip and the schbench
>   tail latency improves drastically. I suspect the involuntary
>   context-switches help workers make progress much sooner after wakeup
>   compared to tip, thus leading to lower tail latency.
>
> o For the same reason as above, tbench throughput takes a hit, with the
>   number of involuntary context-switches increasing drastically for the
>   tbench server. There is also an increase in wait sum noticed.
>
> o A couple of real-world workloads were also tested. DeathStarBench
>   throughput tanks much more with the updated version in your tree
>   compared to this series as is.
>   SpecJBB Max-jOPS sees large improvements but comes at the cost of a
>   drop in Critical-jOPS, signifying an increase in either wait time or
>   involuntary context-switches, which can lead to transactions taking
>   longer to complete.
>
> o Apart from DeathStarBench, all the trends reported remain the same
>   when comparing the version in your tree and this series, as is,
>   applied on the same base kernel.
>
> I'll leave the detailed results below and some limited analysis.
>
> On 3/6/2023 6:55 PM, Peter Zijlstra wrote:
>> Hi!
>>
>> Ever since looking at the latency-nice patches, I've wondered if EEVDF
>> would not make more sense, and I did point Vincent at some older patches
>> I had for that (which is where his augmented rbtree thing comes from).
>>
>> Also, since I really dislike the dual tree, I also figured we could
>> dynamically switch between an augmented tree and not (and while I have
>> code for that, that's not included in this posting because with the
>> current results I don't think we actually need this).
>>
>> Anyway, since I'm somewhat under the weather, I spent last week
>> desperately trying to connect a small cluster of neurons in defiance of
>> the snot overlord and bring back the EEVDF patches from the dark crypts
>> where they'd been gathering cobwebs for the past 13 odd years.
>>
>> By Friday they worked well enough, and this morning (because obviously I
>> forgot the weekend is ideal to run benchmarks) I ran a bunch of
>> hackbench, netperf, tbench and sysbench -- there's a bunch of wins and
>> losses, but nothing that indicates a total fail.
>>
>> ( in fact, some of the schbench results seem to indicate EEVDF schedules
>>   a lot more consistently than CFS and has a bunch of latency wins )
>>
>> ( hackbench also doesn't show the augmented tree and generally more
>>   expensive pick to be a loss, in fact it shows a slight win here )
>>
>>
>> hackbench load + cyclictest --policy other results:
>>
>>
>>                         EEVDF                    CFS
>>
>>             # Min Latencies: 00053
>> LNICE(19)   # Avg Latencies: 04350
>>             # Max Latencies: 76019
>>
>>             # Min Latencies: 00052        00053
>> LNICE(0)    # Avg Latencies: 00690        00687
>>             # Max Latencies: 14145        13913
>>
>>             # Min Latencies: 00019
>> LNICE(-19)  # Avg Latencies: 00261
>>             # Max Latencies: 05642
>>
>
> Following are the results from testing the series on a dual socket
> Zen3 machine (2 x 64C/128T):
>
> NPS Modes are used to logically divide a single socket into
> multiple NUMA regions.
> Following is the NUMA configuration for each NPS mode on the system:
>
> NPS1: Each socket is a NUMA node.
>     Total 2 NUMA nodes in the dual socket machine.
>
>     Node 0: 0-63, 128-191
>     Node 1: 64-127, 192-255
>
> NPS2: Each socket is further logically divided into 2 NUMA regions.
>     Total 4 NUMA nodes exist over 2 sockets.
>
>     Node 0: 0-31, 128-159
>     Node 1: 32-63, 160-191
>     Node 2: 64-95, 192-223
>     Node 3: 96-127, 224-255
>
> NPS4: Each socket is logically divided into 4 NUMA regions.
>     Total 8 NUMA nodes exist over 2 sockets.
>
>     Node 0: 0-15, 128-143
>     Node 1: 16-31, 144-159
>     Node 2: 32-47, 160-175
>     Node 3: 48-63, 176-191
>     Node 4: 64-79, 192-207
>     Node 5: 80-95, 208-223
>     Node 6: 96-111, 224-239
>     Node 7: 112-127, 240-255
>
> Kernel versions:
> - tip:   6.2.0-rc6 tip sched/core
> - eevdf: 6.2.0-rc6 tip sched/core
>          + eevdf commits from your tree
>          (https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/eevdf)

I had cherry picked the following commits for eevdf:

commit: b84a8f6b6fa3 ("sched: Introduce latency-nice as a per-task attribute")
commit: eea7fc6f13b4 ("sched/core: Propagate parent task's latency requirements to the child task")
commit: a143d2bcef65 ("sched: Allow sched_{get,set}attr to change latency_nice of the task")
commit: d9790468df14 ("sched/fair: Add latency_offset")
commit: 3d4d37acaba4 ("sched/fair: Add sched group latency support")
commit: 707840ffc8fa ("sched/fair: Add avg_vruntime")
commit: 394af9db316b ("sched/fair: Remove START_DEBIT")
commit: 89b2a2ee0e9d ("sched/fair: Add lag based placement")
commit: e3db9631d8ca ("rbtree: Add rb_add_augmented_cached() helper")
commit: 0dddbc0b54ad ("sched/fair: Implement an EEVDF like policy")

from the sched/eevdf branch in your tree onto the tip branch back when I
started testing. I notice some more changes have been added since then.
Queuing testing of the latest changes on the updated tip:sched/core based
on v6.3-rc3. I was able to cherry pick the latest commits from sched/eevdf
cleanly.

> - eevdf prev: 6.2.0-rc6 tip sched/core + this series as is
>
> When the testing started, the tip was at:
> commit 7c4a5b89a0b5 "sched/rt: pick_next_rt_entity(): check list_entry"
> [..snip..]

--
Thanks and Regards,
Prateek
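(The cherry-pick flow described above can be sketched as follows; the
remote names "peterz" and "tip" and the local branch name are assumptions,
while the SHAs are the ones listed in the mail:)

```shell
# Fetch the eevdf work and replay the listed commits onto tip:sched/core.
git fetch peterz sched/eevdf
git checkout -b eevdf tip/sched/core
git cherry-pick b84a8f6b6fa3 eea7fc6f13b4 a143d2bcef65 d9790468df14 \
                3d4d37acaba4 707840ffc8fa 394af9db316b 89b2a2ee0e9d \
                e3db9631d8ca 0dddbc0b54ad
```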
Hi!

> Ever since looking at the latency-nice patches, I've wondered if EEVDF would
> not make more sense, and I did point Vincent at some older patches I had for
> that (which is where his augmented rbtree thing comes from).

Link for context: https://lwn.net/Articles/925371/ .

"EEVDF" is not a commonly known acronym :-).

BR,
Pavel