Message ID | 20221019122859.18399-1-wuyun.abel@bytedance.com |
---|---|
Headers |
From: Abel Wu <wuyun.abel@bytedance.com>
To: Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@kernel.org>, Mel Gorman <mgorman@suse.de>, Vincent Guittot <vincent.guittot@linaro.org>, Dietmar Eggemann <dietmar.eggemann@arm.com>, Valentin Schneider <valentin.schneider@arm.com>
Cc: Josh Don <joshdon@google.com>, Chen Yu <yu.c.chen@intel.com>, Tim Chen <tim.c.chen@linux.intel.com>, K Prateek Nayak <kprateek.nayak@amd.com>, "Gautham R . Shenoy" <gautham.shenoy@amd.com>, Aubrey Li <aubrey.li@intel.com>, Qais Yousef <qais.yousef@arm.com>, Juri Lelli <juri.lelli@redhat.com>, Rik van Riel <riel@surriel.com>, Yicong Yang <yangyicong@huawei.com>, Barry Song <21cnbao@gmail.com>, linux-kernel@vger.kernel.org, Abel Wu <wuyun.abel@bytedance.com>
Subject: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS
Date: Wed, 19 Oct 2022 20:28:55 +0800
Message-Id: <20221019122859.18399-1-wuyun.abel@bytedance.com>
X-Mailer: git-send-email 2.37.3
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Series | sched/fair: Improve scan efficiency of SIS |
Message
Abel Wu
Oct. 19, 2022, 12:28 p.m. UTC
This patchset tries to improve SIS scan efficiency by recording idle
cpus in a cpumask for each LLC which will be used as a target cpuset
in the domain scan. The cpus are recorded at CORE granule to avoid
tasks being stacked on the same core.

v5 -> v6:
  - Rename SIS_FILTER to SIS_CORE as it can only be activated when
    SMT is enabled and better describes the behavior of CORE granule
    update & load delivery.
  - Removed the part of limited scan for idle cores since it might be
    better to open another thread to discuss the strategies such as
    limited or scaled depth. But keep the part of full scan for idle
    cores when the LLC is overloaded, because SIS_CORE can greatly
    reduce the overhead of a full scan in that case.
  - Removed the sd_is_busy state, which indicates an LLC is fully
    busy so the SIS domain scan can safely be skipped. I would prefer
    to leave this to SIS_UTIL.
  - The filter generation mechanism is replaced by in-place updates
    during the domain scan to better deal with partial scan failures.
  - Collect Reviewed-bys from Tim Chen

v4 -> v5:
  - Add limited scan for idle cores when overloaded, suggested by Mel
  - Split out several patches since they are irrelevant to this scope
  - Add quick check on ttwu_pending before core update
  - Wrap the filter into the SIS_FILTER feature, suggested by Chen Yu
  - Move the main filter logic to the idle path, because the newidle
    balance can bail out early if rq->avg_idle is small enough and
    lose chances to update the filter.

v3 -> v4:
  - Update filter in load_balance rather than in the tick
  - Now the filter contains unoccupied cpus rather than overloaded ones
  - Added mechanisms to deal with the false positive cases

v2 -> v3:
  - Removed sched-idle balance feature and focus on SIS
  - Take non-CFS tasks into consideration
  - Several fixes/improvements suggested by Josh Don

v1 -> v2:
  - Several optimizations on sched-idle balancing
  - Ignore asym topos in can_migrate_task
  - Add more benchmarks including SIS efficiency
  - Re-organize patches as suggested by Mel Gorman

Abel Wu (4):
  sched/fair: Skip core update if task pending
  sched/fair: Ignore SIS_UTIL when has_idle_core
  sched/fair: Introduce SIS_CORE
  sched/fair: Deal with SIS scan failures

 include/linux/sched/topology.h |  15 ++++
 kernel/sched/fair.c            | 122 +++++++++++++++++++++++++++++----
 kernel/sched/features.h        |   7 ++
 kernel/sched/sched.h           |   3 +
 kernel/sched/topology.c        |   8 ++-
 5 files changed, 141 insertions(+), 14 deletions(-)
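As a rough illustration of the mechanism described above (this is not
the patch code: the llc_idle_cores_mask() accessor and the call site
are assumptions made purely for illustration), the wakeup-side scan
with such a CORE-granule filter would look roughly like:

/*
 * Illustrative sketch only -- not the actual SIS_CORE implementation.
 * llc_idle_cores_mask() is an assumed accessor returning the per-LLC
 * cpumask that records, at core granularity, which CPUs are idle.
 */
static int sis_core_scan(struct task_struct *p, struct sched_domain *sd,
			 int target)
{
	struct cpumask *idle_cores = llc_idle_cores_mask(sd);
	int cpu;

	/* Walk only the CPUs recorded as idle instead of the whole LLC. */
	for_each_cpu_wrap(cpu, idle_cores, target) {
		if (!cpumask_test_cpu(cpu, p->cpus_ptr))
			continue;

		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
			return cpu;

		/*
		 * False positive: this core is busy after all.  Clear it
		 * in place so the rest of this scan (and later ones) do
		 * not revisit it.
		 */
		cpumask_clear_cpu(cpu, idle_cores);
	}

	return -1;
}

Because a CPU only appears in the mask when its whole core is idle,
picking targets from it avoids stacking a newly woken task on a core
that already has a busy sibling, and clearing stale entries in place
during the scan matches the v6 changelog note about replacing the
filter generation mechanism with in-place updates.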
Comments
Ping :)

On 10/19/22 8:28 PM, Abel Wu wrote:
> This patchset tries to improve SIS scan efficiency by recording idle
> cpus in a cpumask for each LLC which will be used as a target cpuset
> in the domain scan. The cpus are recorded at CORE granule to avoid
> tasks being stacked on the same core.
>
> [..snip..]
>
Hello Abel,

Sorry for the delay. I've tested the patch on a dual socket Zen3 system
(2 x 64C/128T).

tl;dr

o I do not notice any regressions with the standard benchmarks.
o schbench sees a nice improvement in tail latency when the number of
  workers is equal to the number of cores in the system in NPS1 and
  NPS2 mode. (Marked with "^")
o A few data points show improvements in tbench in NPS1 and NPS2 mode.
  (Marked with "^")

I'm still in the process of running larger workloads. If there is any
specific workload you would like me to run on the test system, please
do let me know. Below is the detailed report:

Following are the results from running standard benchmarks on a
dual socket Zen3 (2 x 64C/128T) machine configured in different
NPS modes.

NPS Modes are used to logically divide a single socket into
multiple NUMA regions. Following is the NUMA configuration for
each NPS mode on the system:

NPS1: Each socket is a NUMA node.
      Total 2 NUMA nodes in the dual socket machine.

      Node 0: 0-63,   128-191
      Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
      Total 4 NUMA nodes exist over 2 sockets.

      Node 0: 0-31,   128-159
      Node 1: 32-63,  160-191
      Node 2: 64-95,  192-223
      Node 3: 96-127, 224-255

NPS4: Each socket is logically divided into 4 NUMA regions.
      Total 8 NUMA nodes exist over 2 sockets.

      Node 0: 0-15,    128-143
      Node 1: 16-31,   144-159
      Node 2: 32-47,   160-175
      Node 3: 48-63,   176-191
      Node 4: 64-79,   192-207
      Node 5: 80-95,   208-223
      Node 6: 96-111,  224-239
      Node 7: 112-127, 240-255

Benchmark Results:

Kernel versions:
- tip:      5.19.0 tip sched/core
- sis_core: 5.19.0 tip sched/core + this series

When we started testing, the tip was at:
commit fdf756f71271 ("sched: Fix more TASK_state comparisons")

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

o NPS1

Test:             tip                    sis_core
 1-groups:     4.06 (0.00 pct)        4.26 (-4.92 pct)   *
 1-groups:     4.14 (0.00 pct)        4.09 (1.20 pct)    [Verification Run]
 2-groups:     4.76 (0.00 pct)        4.71 (1.05 pct)
 4-groups:     5.22 (0.00 pct)        5.11 (2.10 pct)
 8-groups:     5.35 (0.00 pct)        5.31 (0.74 pct)
16-groups:     7.21 (0.00 pct)        6.80 (5.68 pct)

o NPS2

Test:             tip                    sis_core
 1-groups:     4.09 (0.00 pct)        4.08 (0.24 pct)
 2-groups:     4.70 (0.00 pct)        4.69 (0.21 pct)
 4-groups:     5.05 (0.00 pct)        4.92 (2.57 pct)
 8-groups:     5.35 (0.00 pct)        5.26 (1.68 pct)
16-groups:     6.37 (0.00 pct)        6.34 (0.47 pct)

o NPS4

Test:             tip                    sis_core
 1-groups:     4.07 (0.00 pct)        3.99 (1.96 pct)
 2-groups:     4.65 (0.00 pct)        4.59 (1.29 pct)
 4-groups:     5.13 (0.00 pct)        5.00 (2.53 pct)
 8-groups:     5.47 (0.00 pct)        5.43 (0.73 pct)
16-groups:     6.82 (0.00 pct)        6.56 (3.81 pct)

~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~

o NPS1

#workers:         tip                    sis_core
  1:          33.00 (0.00 pct)       33.00 (0.00 pct)
  2:          35.00 (0.00 pct)       35.00 (0.00 pct)
  4:          39.00 (0.00 pct)       38.00 (2.56 pct)
  8:          49.00 (0.00 pct)       48.00 (2.04 pct)
 16:          63.00 (0.00 pct)       66.00 (-4.76 pct)
 32:         109.00 (0.00 pct)      107.00 (1.83 pct)
 64:         208.00 (0.00 pct)      216.00 (-3.84 pct)
128:         559.00 (0.00 pct)      469.00 (16.10 pct)   ^
256:       45888.00 (0.00 pct)    47552.00 (-3.62 pct)
512:       80000.00 (0.00 pct)    79744.00 (0.32 pct)

o NPS2

#workers:         tip                    sis_core
  1:          30.00 (0.00 pct)       32.00 (-6.66 pct)
  2:          37.00 (0.00 pct)       34.00 (8.10 pct)
  4:          39.00 (0.00 pct)       36.00 (7.69 pct)
  8:          51.00 (0.00 pct)       49.00 (3.92 pct)
 16:          67.00 (0.00 pct)       66.00 (1.49 pct)
 32:         117.00 (0.00 pct)      109.00 (6.83 pct)
 64:         216.00 (0.00 pct)      213.00 (1.38 pct)
128:         529.00 (0.00 pct)      465.00 (12.09 pct)   ^
256:       47040.00 (0.00 pct)    46528.00 (1.08 pct)
512:       84864.00 (0.00 pct)    83584.00 (1.50 pct)

o NPS4

#workers:         tip                    sis_core
  1:          23.00 (0.00 pct)       28.00 (-21.73 pct)
  2:          28.00 (0.00 pct)       36.00 (-28.57 pct)
  4:          41.00 (0.00 pct)       43.00 (-4.87 pct)
  8:          60.00 (0.00 pct)       48.00 (20.00 pct)
 16:          71.00 (0.00 pct)       69.00 (2.81 pct)
 32:         117.00 (0.00 pct)      115.00 (1.70 pct)
 64:         227.00 (0.00 pct)      228.00 (-0.44 pct)
128:         545.00 (0.00 pct)      545.00 (0.00 pct)
256:       45632.00 (0.00 pct)    47680.00 (-4.48 pct)
512:       81024.00 (0.00 pct)    76416.00 (5.68 pct)

Note: For lower worker counts, schbench can show run to run variation
depending on external factors, so regressions at lower worker counts
can be ignored. Those results are included to spot any large blow up
in the tail latency at larger worker counts.

~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

o NPS1

Clients:          tip                    sis_core
    1         578.37 (0.00 pct)      582.09 (0.64 pct)
    2        1062.09 (0.00 pct)     1063.95 (0.17 pct)
    4        1800.62 (0.00 pct)     1879.18 (4.36 pct)
    8        3211.02 (0.00 pct)     3220.44 (0.29 pct)
   16        4848.92 (0.00 pct)     4890.08 (0.84 pct)
   32        9091.36 (0.00 pct)     9721.13 (6.92 pct)    ^
   64       15454.01 (0.00 pct)    15124.42 (-2.13 pct)
  128        3511.33 (0.00 pct)    14314.79 (307.67 pct)
  128       19910.99 (0.00 pct)    19935.61 (0.12 pct)    [Verification Run]
  256       50019.32 (0.00 pct)    50708.24 (1.37 pct)
  512       44317.68 (0.00 pct)    44787.48 (1.06 pct)
 1024       41200.85 (0.00 pct)    42079.29 (2.13 pct)

o NPS2

Clients:          tip                    sis_core
    1         576.05 (0.00 pct)      579.18 (0.54 pct)
    2        1037.68 (0.00 pct)     1070.49 (3.16 pct)
    4        1818.13 (0.00 pct)     1860.22 (2.31 pct)
    8        3004.16 (0.00 pct)     3087.09 (2.76 pct)
   16        4520.11 (0.00 pct)     4789.53 (5.96 pct)
   32        8624.23 (0.00 pct)     9439.50 (9.45 pct)    ^
   64       14886.75 (0.00 pct)    15004.96 (0.79 pct)
  128       20602.00 (0.00 pct)    17730.31 (-13.93 pct)  *
  128       20602.00 (0.00 pct)    19585.20 (-4.93 pct)   [Verification Run]
  256       45566.83 (0.00 pct)    47922.70 (5.17 pct)
  512       42717.49 (0.00 pct)    43809.68 (2.55 pct)
 1024       40936.61 (0.00 pct)    40787.71 (-0.36 pct)

o NPS4

Clients:          tip                    sis_core
    1         576.36 (0.00 pct)      580.83 (0.77 pct)
    2        1044.26 (0.00 pct)     1066.50 (2.12 pct)
    4        1839.77 (0.00 pct)     1867.56 (1.51 pct)
    8        3043.53 (0.00 pct)     3115.17 (2.35 pct)
   16        5207.54 (0.00 pct)     4847.53 (-6.91 pct)   *
   16        4722.56 (0.00 pct)     4811.29 (1.87 pct)    [Verification Run]
   32        9263.86 (0.00 pct)     9478.68 (2.31 pct)
   64       14959.66 (0.00 pct)    15267.39 (2.05 pct)
  128       20698.65 (0.00 pct)    20432.19 (-1.28 pct)
  256       46666.21 (0.00 pct)    46664.81 (0.00 pct)
  512       41532.80 (0.00 pct)    44241.12 (6.52 pct)
 1024       39459.49 (0.00 pct)    41043.22 (4.01 pct)

Note: On the tested kernel, with 128 clients, tbench can run into a
bottleneck during C2 exit. More details can be found at:
https://lore.kernel.org/lkml/20220921063638.2489-1-kprateek.nayak@amd.com/
This issue has been fixed in v6.0 but the fix was not part of the tip
kernel when I started testing. This data point has been rerun with C2
disabled to get representative results.

~~~~~~~~~~
~ Stream ~
~~~~~~~~~~

o NPS1

-> 10 Runs:

Test:             tip                    sis_core
Copy:      328419.14 (0.00 pct)    337857.83 (2.87 pct)
Scale:     206071.21 (0.00 pct)    212133.82 (2.94 pct)
Add:       235271.48 (0.00 pct)    243811.97 (3.63 pct)
Triad:     253175.80 (0.00 pct)    252333.43 (-0.33 pct)

-> 100 Runs:

Test:             tip                    sis_core
Copy:      328209.61 (0.00 pct)    339817.27 (3.53 pct)
Scale:     216310.13 (0.00 pct)    218635.16 (1.07 pct)
Add:       244417.83 (0.00 pct)    245641.47 (0.50 pct)
Triad:     237508.83 (0.00 pct)    255387.28 (7.52 pct)

o NPS2

-> 10 Runs:

Test:             tip                    sis_core
Copy:      336503.88 (0.00 pct)    339684.21 (0.94 pct)
Scale:     218035.23 (0.00 pct)    217601.11 (-0.19 pct)
Add:       257677.42 (0.00 pct)    258608.34 (0.36 pct)
Triad:     268872.37 (0.00 pct)    272548.09 (1.36 pct)

-> 100 Runs:

Test:             tip                    sis_core
Copy:      332304.34 (0.00 pct)    341565.75 (2.78 pct)
Scale:     223421.60 (0.00 pct)    224267.40 (0.37 pct)
Add:       252363.56 (0.00 pct)    254926.98 (1.01 pct)
Triad:     266687.56 (0.00 pct)    270782.81 (1.53 pct)

o NPS4

-> 10 Runs:

Test:             tip                    sis_core
Copy:      353515.62 (0.00 pct)    342060.85 (-3.24 pct)
Scale:     228854.37 (0.00 pct)    218262.41 (-4.62 pct)
Add:       254942.12 (0.00 pct)    241975.90 (-5.08 pct)
Triad:     270521.87 (0.00 pct)    257686.71 (-4.74 pct)

-> 100 Runs:

Test:             tip                    sis_core
Copy:      374520.81 (0.00 pct)    369353.13 (-1.37 pct)
Scale:     246280.23 (0.00 pct)    253881.69 (3.08 pct)
Add:       262772.72 (0.00 pct)    266484.58 (1.41 pct)
Triad:     283740.92 (0.00 pct)    279981.18 (-1.32 pct)

On 10/19/2022 5:58 PM, Abel Wu wrote:
> This patchset tries to improve SIS scan efficiency by recording idle
> cpus in a cpumask for each LLC which will be used as a target cpuset
> in the domain scan. The cpus are recorded at CORE granule to avoid
> tasks being stacked on the same core.
>
> [..snip..]

I ran pgbench from mmtests but realised there is too much run to run
variation on the system. I'm planning on running the MongoDB benchmark,
which is more stable on the system, and a couple more workloads, but
the initial results look good. I'll get back with results later this
week or by early next week. Meanwhile, if you need data for any
specific workload on the test system, please do let me know.

--
Thanks and Regards,
Prateek
Hi Prateek, thanks very much for your detailed testing!

On 11/14/22 1:45 PM, K Prateek Nayak wrote:
> Hello Abel,
>
> Sorry for the delay. I've tested the patch on a dual socket Zen3 system
> (2 x 64C/128T).
>
> tl;dr
>
> o I do not notice any regressions with the standard benchmarks.
> o schbench sees a nice improvement in tail latency when the number of
>   workers is equal to the number of cores in the system in NPS1 and
>   NPS2 mode. (Marked with "^")
> o A few data points show improvements in tbench in NPS1 and NPS2 mode.
>   (Marked with "^")
>
> I'm still in the process of running larger workloads. If there is any
> specific workload you would like me to run on the test system, please
> do let me know. Below is the detailed report:

Not particularly in my mind, and I think testing larger workloads is
great. Thanks!

> [..snip NUMA configuration, kernel versions and hackbench NPS1/NPS2 results..]
>
> o NPS4
>
> Test:             tip                    sis_core
>  1-groups:     4.07 (0.00 pct)        3.99 (1.96 pct)
>  2-groups:     4.65 (0.00 pct)        4.59 (1.29 pct)
>  4-groups:     5.13 (0.00 pct)        5.00 (2.53 pct)
>  8-groups:     5.47 (0.00 pct)        5.43 (0.73 pct)
> 16-groups:     6.82 (0.00 pct)        6.56 (3.81 pct)

Although each cpu will get 2.5 tasks with 16 groups, which can be
considered overloaded, I tested on an AMD EPYC 7Y83 machine and the
total cpu usage was ~82% (with some older kernel version), so there is
still lots of idle time.

I guess cutting off at 16-groups is because it is loaded enough
compared to real workloads, so testing more groups might just be a
waste of time?

Thanks & Best Regards,
Abel

> [..snip the rest of the quoted report..]
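A quick back-of-the-envelope check of the 2.5-tasks-per-cpu figure
mentioned above, assuming hackbench (perf bench sched messaging) runs
its default 20 sender/receiver pairs, i.e. 40 tasks, per group:

    16 groups * 40 tasks/group = 640 tasks
    640 tasks / 256 CPUs       = 2.5 tasks per runqueue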
Hello Abel,

Thank you for taking a look at the report.

On 11/15/2022 2:01 PM, Abel Wu wrote:
> [..snip..]
>
> Although each cpu will get 2.5 tasks with 16 groups, which can be
> considered overloaded, I tested on an AMD EPYC 7Y83 machine and the
> total cpu usage was ~82% (with some older kernel version), so there is
> still lots of idle time.
>
> I guess cutting off at 16-groups is because it is loaded enough
> compared to real workloads, so testing more groups might just be a
> waste of time?

The machine has 16 LLCs so I capped the results at 16-groups.
Previously I had seen some run-to-run variance with larger group counts
so I limited the reports to 16-groups. I'll run hackbench with a larger
number of groups (32, 64, 128, 256) and get back to you with the
results, along with results for a couple of long-running workloads.

> Thanks & Best Regards,
> Abel
>
> [..snip..]

--
Thanks and Regards,
Prateek
Hello Abel,

Following are the results for hackbench with a larger number of groups,
ycsb-mongodb, Spec-JBB, and unixbench. Apart from a regression in
unixbench spawn in NPS2 and NPS4 mode and unixbench syscall in NPS4
mode, everything looks good. Detailed results are below:

~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~

o NPS1:

tip:       131696.33 (var: 2.03%)
sis_core:  129519.00 (var: 1.46%)    (-1.65%)

o NPS2:

tip:       129895.33 (var: 2.34%)
sis_core:  130774.33 (var: 2.57%)    (+0.67%)

o NPS4:

tip:       131165.00 (var: 1.06%)
sis_core:  133547.33 (var: 3.90%)    (+1.81%)

~~~~~~~~~~~~~~~~~
~ Spec-JBB NPS1 ~
~~~~~~~~~~~~~~~~~

Max-jOPS and Critical-jOPS are the same as the tip kernel.

~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~

-> unixbench-dhry2reg

o NPS1

kernel:                                      tip                       sis_core
Min       unixbench-dhry2reg-1       48876615.50 (   0.00%)    48891544.00 (   0.03%)
Min       unixbench-dhry2reg-512   6260344658.90 (   0.00%)  6282967594.10 (   0.36%)
Hmean     unixbench-dhry2reg-1       49299721.81 (   0.00%)    49233828.70 (  -0.13%)
Hmean     unixbench-dhry2reg-512   6267459427.19 (   0.00%)  6288772961.79 *   0.34%*
CoeffVar  unixbench-dhry2reg-1             0.90 (   0.00%)           0.68 (  24.66%)
CoeffVar  unixbench-dhry2reg-512           0.10 (   0.00%)           0.10 (   7.54%)

o NPS2

kernel:                                      tip                       sis_core
Min       unixbench-dhry2reg-1       48828251.70 (   0.00%)    48856709.20 (   0.06%)
Min       unixbench-dhry2reg-512   6244987739.10 (   0.00%)  6271229549.10 (   0.42%)
Hmean     unixbench-dhry2reg-1       48869882.65 (   0.00%)    49302481.81 (   0.89%)
Hmean     unixbench-dhry2reg-512   6261073948.84 (   0.00%)  6272564898.35 (   0.18%)
CoeffVar  unixbench-dhry2reg-1             0.08 (   0.00%)           0.87 (-945.28%)
CoeffVar  unixbench-dhry2reg-512           0.23 (   0.00%)           0.03 (  85.94%)

o NPS4

kernel:                                      tip                       sis_core
Min       unixbench-dhry2reg-1       48523981.30 (   0.00%)    49083957.50 (   1.15%)
Min       unixbench-dhry2reg-512   6253738837.10 (   0.00%)  6271747119.10 (   0.29%)
Hmean     unixbench-dhry2reg-1       48781044.09 (   0.00%)    49232218.87 *   0.92%*
Hmean     unixbench-dhry2reg-512   6264428474.90 (   0.00%)  6280484789.64 (   0.26%)
CoeffVar  unixbench-dhry2reg-1             0.46 (   0.00%)           0.26 (  42.63%)
CoeffVar  unixbench-dhry2reg-512           0.17 (   0.00%)           0.21 ( -26.72%)

-> unixbench-syscall

o NPS1

kernel:                                      tip                       sis_core
Min       unixbench-syscall-1         2975654.80 (   0.00%)     2978489.40 (  -0.10%)
Min       unixbench-syscall-512       7840226.50 (   0.00%)     7822133.40 (   0.23%)
Amean     unixbench-syscall-1         2976326.47 (   0.00%)     2980985.27 *  -0.16%*
Amean     unixbench-syscall-512       7850493.90 (   0.00%)     7844527.50 (   0.08%)
CoeffVar  unixbench-syscall-1               0.03 (   0.00%)           0.07 (-154.43%)
CoeffVar  unixbench-syscall-512             0.13 (   0.00%)           0.34 (-158.96%)

o NPS2

kernel:                                      tip                       sis_core
Min       unixbench-syscall-1         2969863.60 (   0.00%)     2977936.50 (  -0.27%)
Min       unixbench-syscall-512       8053157.60 (   0.00%)     8072239.00 (  -0.24%)
Amean     unixbench-syscall-1         2970462.30 (   0.00%)     2981732.50 *  -0.38%*
Amean     unixbench-syscall-512       8061454.50 (   0.00%)     8079287.73 *  -0.22%*
CoeffVar  unixbench-syscall-1               0.02 (   0.00%)           0.11 (-527.26%)
CoeffVar  unixbench-syscall-512             0.12 (   0.00%)           0.08 (  37.30%)

o NPS4

kernel:                                      tip                       sis_core
Min       unixbench-syscall-1         2971799.80 (   0.00%)     2979335.60 (  -0.25%)
Min       unixbench-syscall-512       7824196.90 (   0.00%)     8155610.20 (  -4.24%)
Amean     unixbench-syscall-1         2973045.43 (   0.00%)     2982036.13 *  -0.30%*
Amean     unixbench-syscall-512       7826302.17 (   0.00%)     8173026.57 *  -4.43%*  <-- Regression in syscall for larger worker count
CoeffVar  unixbench-syscall-1               0.04 (   0.00%)           0.09 (-139.63%)
CoeffVar  unixbench-syscall-512             0.03 (   0.00%)           0.20 (-701.13%)

-> unixbench-pipe

o NPS1

kernel:                                      tip                       sis_core
Min       unixbench-pipe-1            2894765.30 (   0.00%)     2891505.30 (  -0.11%)
Min       unixbench-pipe-512        329818573.50 (   0.00%)   325610257.80 (  -1.28%)
Hmean     unixbench-pipe-1            2898803.38 (   0.00%)     2896940.25 (  -0.06%)
Hmean     unixbench-pipe-512        330226401.69 (   0.00%)   326311984.29 *  -1.19%*
CoeffVar  unixbench-pipe-1                  0.14 (   0.00%)           0.17 ( -21.99%)
CoeffVar  unixbench-pipe-512                0.11 (   0.00%)           0.20 ( -88.38%)

o NPS2

kernel:                                      tip                       sis_core
Min       unixbench-pipe-1            2895327.90 (   0.00%)     2894798.20 (  -0.02%)
Min       unixbench-pipe-512        328350065.60 (   0.00%)   325681163.10 (  -0.81%)
Hmean     unixbench-pipe-1            2899129.86 (   0.00%)     2897067.80 (  -0.07%)
Hmean     unixbench-pipe-512        329436096.80 (   0.00%)   326023030.94 *  -1.04%*
CoeffVar  unixbench-pipe-1                  0.12 (   0.00%)           0.09 (  21.96%)
CoeffVar  unixbench-pipe-512                0.30 (   0.00%)           0.12 (  60.80%)

o NPS4

kernel:                                      tip                       sis_core
Min       unixbench-pipe-1            2901525.60 (   0.00%)     2885730.80 (  -0.54%)
Min       unixbench-pipe-512        330265873.90 (   0.00%)   326730770.60 (  -1.07%)
Hmean     unixbench-pipe-1            2906184.70 (   0.00%)     2891616.18 *  -0.50%*
Hmean     unixbench-pipe-512        330854683.27 (   0.00%)   327113296.63 *  -1.13%*
CoeffVar  unixbench-pipe-1                  0.14 (   0.00%)           0.19 ( -33.74%)
CoeffVar  unixbench-pipe-512                0.16 (   0.00%)           0.11 (  31.75%)

-> unixbench-spawn

o NPS1

kernel:                                      tip                       sis_core
Min       unixbench-spawn-1              6536.50 (   0.00%)        6000.30 (  -8.20%)
Min       unixbench-spawn-512           72571.40 (   0.00%)       70829.60 (  -2.40%)
Hmean     unixbench-spawn-1              6811.16 (   0.00%)        7016.11 (   3.01%)
Hmean     unixbench-spawn-512           72801.77 (   0.00%)       71012.03 *  -2.46%*
CoeffVar  unixbench-spawn-1                 3.69 (   0.00%)          13.52 (-266.69%)
CoeffVar  unixbench-spawn-512               0.27 (   0.00%)           0.22 (  18.25%)

o NPS2

kernel:                                      tip                       sis_core
Min       unixbench-spawn-1              7042.20 (   0.00%)        7078.70 (   0.52%)
Min       unixbench-spawn-512           85571.60 (   0.00%)       77362.60 (  -9.59%)
Hmean     unixbench-spawn-1              7199.01 (   0.00%)        7276.55 (   1.08%)
Hmean     unixbench-spawn-512           85717.77 (   0.00%)       77923.73 *  -9.09%*  <-- Regression in spawn test for larger worker count
CoeffVar  unixbench-spawn-1                 3.50 (   0.00%)           3.30 (   5.70%)
CoeffVar  unixbench-spawn-512               0.20 (   0.00%)           0.82 (-304.88%)

o NPS4

kernel:                                      tip                       sis_core
Min       unixbench-spawn-1              7521.90 (   0.00%)        8102.80 (   7.72%)
Min       unixbench-spawn-512           84245.70 (   0.00%)       73074.50 ( -13.26%)
Hmean     unixbench-spawn-1              7659.12 (   0.00%)        8645.19 *  12.87%*
Hmean     unixbench-spawn-512           84908.77 (   0.00%)       73409.49 * -13.54%*  <-- Regression in spawn test for larger worker count
CoeffVar  unixbench-spawn-1                 1.92 (   0.00%)           5.78 (-200.56%)
CoeffVar  unixbench-spawn-512               0.76 (   0.00%)           0.41 (  46.58%)

-> unixbench-execl

o NPS1

kernel:                                      tip                       sis_core
Min       unixbench-execl-1              5421.50 (   0.00%)        5471.50 (   0.92%)
Min       unixbench-execl-512           11213.50 (   0.00%)       11677.20 (   4.14%)
Hmean     unixbench-execl-1              5443.75 (   0.00%)        5475.36 *   0.58%*
Hmean     unixbench-execl-512           11311.94 (   0.00%)       11804.52 *   4.35%*
CoeffVar  unixbench-execl-1                 0.38 (   0.00%)           0.12 (  69.22%)
CoeffVar  unixbench-execl-512               1.03 (   0.00%)           1.73 ( -68.91%)

o NPS2

kernel:                                      tip                       sis_core
Min       unixbench-execl-1              5089.10 (   0.00%)        5405.40 (   6.22%)
Min       unixbench-execl-512           11772.70 (   0.00%)       11917.20 (   1.23%)
Hmean     unixbench-execl-1              5321.65 (   0.00%)        5421.41 (   1.87%)
Hmean     unixbench-execl-512           12201.73 (   0.00%)       12327.95 (   1.03%)
CoeffVar  unixbench-execl-1                 3.87 (   0.00%)           0.28 (  92.88%)
CoeffVar  unixbench-execl-512               6.23 (   0.00%)           5.78 (   7.21%)

o NPS4

kernel:                                      tip                       sis_core
Min       unixbench-execl-1              5099.40 (   0.00%)        5479.60 (   7.46%)
Min       unixbench-execl-512           11692.80 (   0.00%)       12205.50 (   4.38%)
Hmean     unixbench-execl-1              5136.86 (   0.00%)        5487.93 *   6.83%*
Hmean     unixbench-execl-512           12053.71 (   0.00%)       12712.96 (   5.47%)
CoeffVar  unixbench-execl-1                 1.05 (   0.00%)           0.14 (  86.57%)
CoeffVar  unixbench-execl-512               3.85 (   0.00%)           5.86 ( -52.14%)

For the unixbench regressions, I do not see anything obvious jump out
in the perf traces captured with IBS. top shows over 99% utilization,
which would ideally mean there are not many updates to the mask. I'll
take a closer look at the spawn test case and get back to you.

On 11/15/2022 4:58 PM, K Prateek Nayak wrote:
> [..snip..]
>
> The machine has 16 LLCs so I capped the results at 16-groups.
> Previously I had seen some run-to-run variance with larger group counts
> so I limited the reports to 16-groups. I'll run hackbench with a larger
> number of groups (32, 64, 128, 256) and get back to you with the
> results, along with results for a couple of long-running workloads.

~~~~~~~~~~~~~
~ Hackbench ~
~~~~~~~~~~~~~

$ perf bench sched messaging -p -l 50000 -g <groups>

o NPS1

kernel:            tip                    sis_core
 32-groups:      6.20 (0.00 pct)        5.86 (5.48 pct)
 64-groups:     16.55 (0.00 pct)       15.21 (8.09 pct)
128-groups:     42.57 (0.00 pct)       34.63 (18.65 pct)
256-groups:     71.69 (0.00 pct)       67.11 (6.38 pct)
512-groups:    108.48 (0.00 pct)      110.23 (-1.61 pct)

o NPS2

kernel:            tip                    sis_core
 32-groups:      6.56 (0.00 pct)        5.60 (14.63 pct)
 64-groups:     15.74 (0.00 pct)       14.45 (8.19 pct)
128-groups:     39.93 (0.00 pct)       35.33 (11.52 pct)
256-groups:     74.49 (0.00 pct)       69.65 (6.49 pct)
512-groups:    112.22 (0.00 pct)      113.75 (-1.36 pct)

o NPS4:

kernel:            tip                    sis_core
 32-groups:      9.48 (0.00 pct)        5.64 (40.50 pct)
 64-groups:     15.38 (0.00 pct)       14.13 (8.12 pct)
128-groups:     39.93 (0.00 pct)       34.47 (13.67 pct)
256-groups:     75.31 (0.00 pct)       67.98 (9.73 pct)
512-groups:    115.37 (0.00 pct)      111.15 (3.65 pct)

Note: Hackbench with 32-groups shows run to run variation on tip but
is more stable with sis_core. Hackbench for 64-groups and beyond is
stable on both kernels.

> [..snip..]

Apart from the couple of regressions in unixbench, everything looks
good. If you would like me to get any more data for any workload on
the test system, please do let me know.

--
Thanks and Regards,
Prateek
Hi Prateek, thanks again for your detailed test!

On 11/22/22 7:28 PM, K Prateek Nayak wrote:
> Hello Abel,
>
> Following are the results for hackbench with a larger number of groups,
> ycsb-mongodb, Spec-JBB, and unixbench. Apart from a regression in
> unixbench spawn in NPS2 and NPS4 mode and unixbench syscall in NPS4
> mode, everything looks good.
>
> [..snip the detailed unixbench results..]
>
> For the unixbench regressions, I do not see anything obvious jump out
> in the perf traces captured with IBS. top shows over 99% utilization,
> which would ideally mean there are not many updates to the mask. I'll
> take a closer look at the spawn test case and get back to you.

These regressions seem to be common in fully parallel tests. I guess it
might be due to over-updating the idle cpumask when the LLC is
overloaded, which is not necessary if SIS_UTIL is enabled, but I need
to dig into it further. Maybe the rq avg_idle or nr_idle_scan needs to
be taken into consideration as well. Thanks for providing this
important information.

> ~~~~~~~~~~~~~
> ~ Hackbench ~
> ~~~~~~~~~~~~~
>
> $ perf bench sched messaging -p -l 50000 -g <groups>
>
> [..snip hackbench results for 32 to 512 groups..]
>
> Note: Hackbench with 32-groups shows run to run variation on tip but
> is more stable with sis_core. Hackbench for 64-groups and beyond is
> stable on both kernels.

The result is consistent with mine except for 512-groups, which I
didn't test. The 512-groups test may have the same problem mentioned
above.

Thanks & Regards,
Abel
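To make the over-updating concern above concrete, here is one possible,
purely illustrative way the filter update could be gated on the
SIS_UTIL scan depth and rq->avg_idle. update_idle_cores_mask() is an
assumed helper, not something from the series, and the exact thresholds
are guesses:

/*
 * Illustrative sketch only -- one possible way to avoid dirtying the
 * idle-core filter when the LLC is overloaded.  The caller is assumed
 * to hold rcu_read_lock().
 */
static void update_idle_core_filter(struct rq *rq, bool idle)
{
	int cpu = cpu_of(rq);
	struct sched_domain_shared *sds;

	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
	if (!sds)
		return;

	/*
	 * If SIS_UTIL has throttled the scan depth to (nearly) nothing,
	 * the LLC is overloaded and the filter is unlikely to be read,
	 * so skip the cpumask write and spare the shared cacheline.
	 */
	if (READ_ONCE(sds->nr_idle_scan) <= 1)
		return;

	/* Very short idle periods are not worth advertising either. */
	if (idle && rq->avg_idle < sysctl_sched_migration_cost)
		return;

	update_idle_cores_mask(sds, cpu, idle);	/* assumed helper */
}

Skipping the write when nr_idle_scan has collapsed would avoid touching
the shared cacheline exactly when the LLC is busiest, which is the
situation the fully parallel unixbench runs create; whether that is the
actual cause of the spawn/syscall regressions is still to be confirmed.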
Hello Abel,

I've retested the patches on the updated tip and the results are
still promising.

tl;dr

o Hackbench sees improvements when the machine is overloaded.
o tbench shows improvements when the machine is overloaded.
o The unixbench regression seen previously seems to be unrelated
  to the patch, as the spawn test scores are vastly different after
  a reboot/kexec for the same kernel.
o Other benchmarks show slight improvements or are comparable to
  the numbers on tip.

Following are the results from running standard benchmarks on a
dual socket Zen3 (2 x 64C/128T) machine configured in different
NPS modes.

NPS Modes are used to logically divide a single socket into
multiple NUMA regions. Following is the NUMA configuration for
each NPS mode on the system:

NPS1: Each socket is a NUMA node.
    Total 2 NUMA nodes in the dual socket machine.

    Node 0: 0-63,   128-191
    Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
    Total 4 NUMA nodes exist over 2 sockets.

    Node 0: 0-31,   128-159
    Node 1: 32-63,  160-191
    Node 2: 64-95,  192-223
    Node 3: 96-127, 224-255

NPS4: Each socket is logically divided into 4 NUMA regions.
    Total 8 NUMA nodes exist over 2 sockets.

    Node 0: 0-15,    128-143
    Node 1: 16-31,   144-159
    Node 2: 32-47,   160-175
    Node 3: 48-63,   176-191
    Node 4: 64-79,   192-207
    Node 5: 80-95,   208-223
    Node 6: 96-111,  224-239
    Node 7: 112-127, 240-255

Following are the Kernel versions:

tip:      6.2.0-rc2 tip:sched/core at commit: bbd0b031509b
          "sched/rseq: Fix concurrency ID handling of usermodehelper
          kthreads"
sis_core: tip + series

The patches applied cleanly on tip.

Benchmark Results:

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

NPS1

Test:            tip                   sis_core
 1-groups:    4.36 (0.00 pct)       4.17 (4.35 pct)
 2-groups:    5.17 (0.00 pct)       5.03 (2.70 pct)
 4-groups:    4.17 (0.00 pct)       4.14 (0.71 pct)
 8-groups:    4.64 (0.00 pct)       4.63 (0.21 pct)
16-groups:    5.43 (0.00 pct)       5.32 (2.02 pct)

NPS2

Test:            tip                   sis_core
 1-groups:    4.43 (0.00 pct)       4.27 (3.61 pct)
 2-groups:    4.61 (0.00 pct)       4.92 (-6.72 pct)   *
 2-groups:    4.52 (0.00 pct)       4.55 (-0.66 pct)   [Verification Run]
 4-groups:    4.25 (0.00 pct)       4.10 (3.52 pct)
 8-groups:    4.91 (0.00 pct)       4.53 (7.73 pct)
16-groups:    5.84 (0.00 pct)       5.54 (5.13 pct)

NPS4

Test:            tip                   sis_core
 1-groups:    4.34 (0.00 pct)       4.23 (2.53 pct)
 2-groups:    4.64 (0.00 pct)       4.84 (-4.31 pct)
 4-groups:    4.20 (0.00 pct)       4.17 (0.71 pct)
 8-groups:    5.21 (0.00 pct)       5.06 (2.87 pct)
16-groups:    6.24 (0.00 pct)       5.60 (10.25 pct)

~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~

NPS1

#workers:        tip                    sis_core
  1:          36.00 (0.00 pct)       23.00 (36.11 pct)
  2:          37.00 (0.00 pct)       37.00 (0.00 pct)
  4:          37.00 (0.00 pct)       38.00 (-2.70 pct)
  8:          47.00 (0.00 pct)       52.00 (-10.63 pct)
 16:          64.00 (0.00 pct)       65.00 (-1.56 pct)
 32:         109.00 (0.00 pct)      111.00 (-1.83 pct)
 64:         222.00 (0.00 pct)      215.00 (3.15 pct)
128:         515.00 (0.00 pct)      486.00 (5.63 pct)
256:       39744.00 (0.00 pct)    47808.00 (-20.28 pct)  * (Machine Overloaded ~ 2 tasks per rq)
256:       43242.00 (0.00 pct)    42293.00 (2.19 pct)    [Verification Run]
512:       81280.00 (0.00 pct)    76416.00 (5.98 pct)

NPS2

#workers:        tip                    sis_core
  1:          27.00 (0.00 pct)       27.00 (0.00 pct)
  2:          31.00 (0.00 pct)       30.00 (3.22 pct)
  4:          38.00 (0.00 pct)       37.00 (2.63 pct)
  8:          50.00 (0.00 pct)       46.00 (8.00 pct)
 16:          66.00 (0.00 pct)       68.00 (-3.03 pct)
 32:         116.00 (0.00 pct)      113.00 (2.58 pct)
 64:         210.00 (0.00 pct)      228.00 (-8.57 pct)   *
 64:         206.00 (0.00 pct)      219.00 (-6.31 pct)   [Verification Run]
128:         523.00 (0.00 pct)      559.00 (-6.88 pct)   *
128:         474.00 (0.00 pct)      497.00 (-4.85 pct)   [Verification Run]
256:       44864.00 (0.00 pct)    47040.00 (-4.85 pct)
512:       78464.00 (0.00 pct)    81280.00 (-3.58 pct)

NPS4

#workers:        tip                    sis_core
  1:          32.00 (0.00 pct)       27.00 (15.62 pct)
  2:          32.00 (0.00 pct)       35.00 (-9.37 pct)
  4:          34.00 (0.00 pct)       41.00 (-20.58 pct)
  8:          58.00 (0.00 pct)       58.00 (0.00 pct)
 16:          67.00 (0.00 pct)       69.00 (-2.98 pct)
 32:         118.00 (0.00 pct)      112.00 (5.08 pct)
 64:         224.00 (0.00 pct)      209.00 (6.69 pct)
128:         533.00 (0.00 pct)      519.00 (2.62 pct)
256:       43456.00 (0.00 pct)    45248.00 (-4.12 pct)
512:       78976.00 (0.00 pct)    76160.00 (3.56 pct)

~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

NPS1

Clients:         tip                    sis_core
   1        539.96 (0.00 pct)       538.19 (-0.32 pct)
   2       1068.21 (0.00 pct)      1063.04 (-0.48 pct)
   4       1994.76 (0.00 pct)      1990.47 (-0.21 pct)
   8       3602.30 (0.00 pct)      3496.07 (-2.94 pct)
  16       6075.49 (0.00 pct)      6061.74 (-0.22 pct)
  32      11641.07 (0.00 pct)     11904.58 (2.26 pct)
  64      21529.16 (0.00 pct)     22124.81 (2.76 pct)
 128      30852.92 (0.00 pct)     31258.56 (1.31 pct)
 256      51901.20 (0.00 pct)     53249.69 (2.59 pct)
 512      46797.40 (0.00 pct)     54477.79 (16.41 pct)
1024      46057.28 (0.00 pct)     53676.58 (16.54 pct)

NPS2

Clients:         tip                    sis_core
   1        536.11 (0.00 pct)       541.18 (0.94 pct)
   2       1044.58 (0.00 pct)      1064.16 (1.87 pct)
   4       2043.92 (0.00 pct)      2017.84 (-1.27 pct)
   8       3572.50 (0.00 pct)      3494.83 (-2.17 pct)
  16       6040.97 (0.00 pct)      5530.10 (-8.45 pct)   *
  16       5814.03 (0.00 pct)      6012.33 (3.41 pct)    [Verification Run]
  32      10794.10 (0.00 pct)     10841.68 (0.44 pct)
  64      20905.89 (0.00 pct)     21438.82 (2.54 pct)
 128      30885.39 (0.00 pct)     30064.78 (-2.65 pct)
 256      48901.25 (0.00 pct)     51395.08 (5.09 pct)
 512      49673.91 (0.00 pct)     51725.89 (4.13 pct)
1024      47626.34 (0.00 pct)     52662.01 (10.57 pct)

NPS4

Clients:         tip                    sis_core
   1        544.91 (0.00 pct)       544.66 (-0.04 pct)
   2       1046.49 (0.00 pct)      1072.42 (2.47 pct)
   4       2007.11 (0.00 pct)      1970.05 (-1.84 pct)
   8       3590.66 (0.00 pct)      3670.45 (2.22 pct)
  16       5956.60 (0.00 pct)      6045.07 (1.48 pct)
  32      10431.73 (0.00 pct)     10439.40 (0.07 pct)
  64      21563.37 (0.00 pct)     19344.05 (-10.29 pct)  *
  64      19387.71 (0.00 pct)     19050.47 (-1.73 pct)   [Verification Run]
 128      30352.16 (0.00 pct)     26998.85 (-11.04 pct)  *
 128      29110.99 (0.00 pct)     29690.37 (1.99 pct)    [Verification Run]
 256      49504.51 (0.00 pct)     50921.66 (2.86 pct)
 512      44916.61 (0.00 pct)     52176.11 (16.16 pct)
1024      49986.21 (0.00 pct)     51639.91 (3.30 pct)

~~~~~~~~~~
~ stream ~
~~~~~~~~~~

NPS1

10 Runs:

Test:            tip                    sis_core
 Copy:   339390.30 (0.00 pct)    324656.88 (-4.34 pct)
Scale:   212472.78 (0.00 pct)    210641.39 (-0.86 pct)
  Add:   247598.48 (0.00 pct)    241669.10 (-2.39 pct)
Triad:   261852.07 (0.00 pct)    252088.55 (-3.72 pct)

100 Runs:

Test:            tip                    sis_core
 Copy:   335938.02 (0.00 pct)    331491.32 (-1.32 pct)
Scale:   212597.92 (0.00 pct)    218705.84 (2.87 pct)
  Add:   248294.62 (0.00 pct)    243830.42 (-1.79 pct)
Triad:   258400.88 (0.00 pct)    248178.42 (-3.95 pct)

NPS2

10 Runs:

Test:            tip                    sis_core
 Copy:   334500.32 (0.00 pct)    335317.70 (0.24 pct)
Scale:   216804.76 (0.00 pct)    217862.71 (0.48 pct)
  Add:   250787.33 (0.00 pct)    258839.00 (3.21 pct)
Triad:   259451.40 (0.00 pct)    264847.88 (2.07 pct)

100 Runs:

Test:            tip                    sis_core
 Copy:   326385.13 (0.00 pct)    338030.70 (3.56 pct)
Scale:   216440.37 (0.00 pct)    230053.24 (6.28 pct)
  Add:   255062.22 (0.00 pct)    259197.23 (1.62 pct)
Triad:   265442.03 (0.00 pct)    271365.65 (2.23 pct)

NPS4

10 Runs:

Test:            tip                    sis_core
 Copy:   363927.86 (0.00 pct)    361014.15 (-0.80 pct)
Scale:   238190.49 (0.00 pct)    242176.02 (1.67 pct)
  Add:   262806.49 (0.00 pct)    266348.50 (1.34 pct)
Triad:   276492.33 (0.00 pct)    276769.10 (0.10 pct)

100 Runs:

Test:            tip                    sis_core
 Copy:   365041.37 (0.00 pct)    349299.35 (-4.31 pct)
Scale:   239295.27 (0.00 pct)    229944.85 (-3.90 pct)
  Add:   264085.21 (0.00 pct)    252651.56 (-4.32 pct)
Triad:   279664.56 (0.00 pct)    274254.22 (-1.93 pct)
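A note on reading the tables above: the convention is not stated
explicitly in the report, but it can be inferred from the numbers.
Each entry is "value (pct)", where pct is the improvement of sis_core
relative to tip, so a positive pct is always better and a negative
pct is always a regression, regardless of whether the raw metric is
lower-is-better (hackbench runtime, schbench latency) or
higher-is-better (tbench/stream throughput). For example:

  hackbench, NPS4, 16-groups (lower is better):
    (6.24 - 5.60) / 6.24 * 100  ~=  10.25 pct

  tbench, NPS1, 512 clients (higher is better):
    (54477.79 - 46797.40) / 46797.40 * 100  ~=  16.41 pct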
~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~

o NPS1

tip:        131328.67 (var: 2.97%)
sis_core:   131702.33 (var: 3.61%)    (0.28%)

o NPS2:

tip:        132482.33 (var: 2.06%)
sis_core:   132338.33 (var: 0.97%)    (-0.11%)

o NPS4:

tip:        134130.00 (var: 4.12%)
sis_core:   133224.33 (var: 4.13%)    (-0.67%)

~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~

o NPS1

Test                Metric  Parallelism                    tip                    sis_core
unixbench-dhry2reg  Hmean   unixbench-dhry2reg-1      48770555.20 (  0.00%)    49025161.73 (  0.52%)
unixbench-dhry2reg  Hmean   unixbench-dhry2reg-512  6268185467.60 (  0.00%)  6266351964.20 ( -0.03%)
unixbench-syscall   Amean   unixbench-syscall-1        2685321.17 (  0.00%)     2694468.30 * -0.34%*
unixbench-syscall   Amean   unixbench-syscall-512      7291476.20 (  0.00%)     7295087.67 ( -0.05%)
unixbench-pipe      Hmean   unixbench-pipe-1           2480858.53 (  0.00%)     2536923.44 *  2.26%*
unixbench-pipe      Hmean   unixbench-pipe-512       300739256.62 (  0.00%)   303470605.93 *  0.91%*
unixbench-spawn     Hmean   unixbench-spawn-1             4358.14 (  0.00%)        4104.88 ( -5.81%)   * (Known to be unstable)
unixbench-spawn     Hmean   unixbench-spawn-1             4711.00 (  0.00%)        4006.20 ( -14.96%)  [Verification Run]
unixbench-spawn     Hmean   unixbench-spawn-512          76497.32 (  0.00%)       75555.94 * -1.23%*
unixbench-execl     Hmean   unixbench-execl-1             4147.12 (  0.00%)        4157.33 (  0.25%)
unixbench-execl     Hmean   unixbench-execl-512          12435.26 (  0.00%)       11992.43 ( -3.56%)

o NPS2

Test                Metric  Parallelism                    tip                    sis_core
unixbench-dhry2reg  Hmean   unixbench-dhry2reg-1      48872335.50 (  0.00%)    48902553.70 (  0.06%)
unixbench-dhry2reg  Hmean   unixbench-dhry2reg-512  6264134378.20 (  0.00%)  6260631689.40 ( -0.06%)
unixbench-syscall   Amean   unixbench-syscall-1        2683903.13 (  0.00%)     2694829.17 * -0.41%*
unixbench-syscall   Amean   unixbench-syscall-512      7746773.60 (  0.00%)     7493782.67 *  3.27%*
unixbench-pipe      Hmean   unixbench-pipe-1           2476724.23 (  0.00%)     2537127.96 *  2.44%*
unixbench-pipe      Hmean   unixbench-pipe-512       300277350.41 (  0.00%)   302979776.19 *  0.90%*
unixbench-spawn     Hmean   unixbench-spawn-1             5026.50 (  0.00%)        4680.63 ( -6.88%)   *
unixbench-spawn     Hmean   unixbench-spawn-1             5421.70 (  0.00%)        5311.50 ( -2.03%)   [Verification Run]
unixbench-spawn     Hmean   unixbench-spawn-512          80549.70 (  0.00%)       78888.60 ( -2.06%)
unixbench-execl     Hmean   unixbench-execl-1             4151.70 (  0.00%)        3913.76 * -5.73%*   *
unixbench-execl     Hmean   unixbench-execl-1             4304.30 (  0.00%)        4303.20 ( -0.02%)   [Verification Run]
unixbench-execl     Hmean   unixbench-execl-512          13605.15 (  0.00%)       13129.23 ( -3.50%)

o NPS4

Test                Metric  Parallelism                    tip                    sis_core
unixbench-dhry2reg  Hmean   unixbench-dhry2reg-1      48506771.20 (  0.00%)    48894866.70 (  0.80%)
unixbench-dhry2reg  Hmean   unixbench-dhry2reg-512  6280954362.50 (  0.00%)  6282759876.40 (  0.03%)
unixbench-syscall   Amean   unixbench-syscall-1        2687259.30 (  0.00%)     2695379.93 * -0.30%*
unixbench-syscall   Amean   unixbench-syscall-512      7350275.67 (  0.00%)     7366923.73 ( -0.23%)
unixbench-pipe      Hmean   unixbench-pipe-1           2478893.01 (  0.00%)     2540015.88 *  2.47%*
unixbench-pipe      Hmean   unixbench-pipe-512       301830155.61 (  0.00%)   304305539.27 *  0.82%*
unixbench-spawn     Hmean   unixbench-spawn-1             5208.55 (  0.00%)        5273.11 (  1.24%)
unixbench-spawn     Hmean   unixbench-spawn-512          80745.79 (  0.00%)       81940.71 *  1.48%*
unixbench-execl     Hmean   unixbench-execl-1             4072.72 (  0.00%)        4126.13 *  1.31%*
unixbench-execl     Hmean   unixbench-execl-512          13746.56 (  0.00%)       12848.77 ( -6.53%)   *
unixbench-execl     Hmean   unixbench-execl-512          13898.30 (  0.00%)       13959.70 (  0.44%)   [Verification Run]

On 10/19/2022 5:58 PM, Abel Wu wrote:
> This patchset tries to improve SIS scan efficiency by recording idle
> cpus in a cpumask for each LLC which will be used as a target cpuset
> in the domain scan.
> The cpus are recorded at CORE granule to avoid tasks being stacked
> on the same core.
>
> v5 -> v6:
> - Rename SIS_FILTER to SIS_CORE as it can only be activated when
>   SMT is enabled and better describes the behavior of CORE granule
>   update & load delivery.
> - Removed the part of limited scan for idle cores since it might be
>   better to open another thread to discuss the strategies such as
>   limited or scaled depth. But keep the part of full scan for idle
>   cores when LLC is overloaded because SIS_CORE can greatly reduce
>   the overhead of full scan in such case.
> - Removed the state of sd_is_busy which indicates an LLC is fully
>   busy and we can safely skip the SIS domain scan. I would prefer
>   leave this to SIS_UTIL.
> - The filter generation mechanism is replaced by in-place updates
>   during domain scan to better deal with partial scan failures.
> - Collect Reviewed-bys from Tim Chen
>
> v4 -> v5:
> - Add limited scan for idle cores when overloaded, suggested by Mel
> - Split out several patches since they are irrelevant to this scope
> - Add quick check on ttwu_pending before core update
> - Wrap the filter into SIS_FILTER feature, suggested by Chen Yu
> - Move the main filter logic to the idle path, because the newidle
>   balance can bail out early if rq->avg_idle is small enough and
>   lose chances to update the filter.
>
> v3 -> v4:
> - Update filter in load_balance rather than in the tick
> - Now the filter contains unoccupied cpus rather than overloaded ones
> - Added mechanisms to deal with the false positive cases
>
> v2 -> v3:
> - Removed sched-idle balance feature and focus on SIS
> - Take non-CFS tasks into consideration
> - Several fixes/improvement suggested by Josh Don
>
> v1 -> v2:
> - Several optimizations on sched-idle balancing
> - Ignore asym topos in can_migrate_task
> - Add more benchmarks including SIS efficiency
> - Re-organize patch as suggested by Mel Gorman
>
> Abel Wu (4):
>   sched/fair: Skip core update if task pending
>   sched/fair: Ignore SIS_UTIL when has_idle_core
>   sched/fair: Introduce SIS_CORE
>   sched/fair: Deal with SIS scan failures
>
>  include/linux/sched/topology.h |  15 ++++
>  kernel/sched/fair.c            | 122 +++++++++++++++++++++++++++++----
>  kernel/sched/features.h        |   7 ++
>  kernel/sched/sched.h           |   3 +
>  kernel/sched/topology.c        |   8 ++-
>  5 files changed, 141 insertions(+), 14 deletions(-)

Testing with a couple of larger workloads like SpecJBB is still
underway. I'll update the thread with the results once they are done.

The idea is promising. I'll also try to run schbench / hackbench
pinned in a manner such that all wakeups happen on an external LLC
to spot any impact of rapid changes to the idle cpu mask of an
external LLC.

Please let me know if you would like me to test or get data for any
particular benchmark from my test setup.

--
Thanks and Regards,
Prateek
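For readers who haven't followed the series, the mechanism described
in the quoted cover letter above can be sketched roughly as follows.
This is illustrative pseudocode only, not the actual patches: the
idle_cpus field on sched_domain_shared and the function names here
are invented for the sake of the example.

/*
 * Illustrative sketch only -- not the actual patches. The ->idle_cpus
 * cpumask on sched_domain_shared and update_idle_cpumask() are made-up
 * names used to convey the idea from the cover letter.
 */
static void update_idle_cpumask(int cpu, bool idle)
{
	struct sched_domain_shared *sds;

	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
	if (!sds)
		return;

	/*
	 * Record CPUs at core granularity: a CPU is advertised as a
	 * wakeup target only while its core has idle capacity, so
	 * wakeups do not stack tasks onto the SMT siblings of an
	 * already busy core.
	 */
	if (idle)
		cpumask_set_cpu(cpu, sds->idle_cpus);
	else
		cpumask_clear_cpu(cpu, sds->idle_cpus);
}

/*
 * The SIS domain scan is then bounded by the recorded idle CPUs
 * instead of walking the whole LLC (hypothetical helper):
 */
static int select_idle_cpu_filtered(struct task_struct *p,
				    struct sched_domain_shared *sds,
				    struct cpumask *cpus, int target)
{
	int cpu;

	/* Scan only CPUs currently advertised as idle in this LLC. */
	cpumask_and(cpus, sds->idle_cpus, p->cpus_ptr);

	for_each_cpu_wrap(cpu, cpus, target) {
		if (available_idle_cpu(cpu))
			return cpu;
	}

	return -1;
}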
Hi Prateek, thanks very much for your solid testing!

On 2/7/23 11:42 AM, K Prateek Nayak wrote:
> Hello Abel,
>
> I've retested the patches on the updated tip and the results
> are still promising.
>
> tl;dr
>
> o Hackbench sees improvements when the machine is overloaded.
> o tbench shows improvements when the machine is overloaded.
> o The unixbench regression seen previously seems to be unrelated
>   to the patch as the spawn test scores are vastly different
>   after a reboot/kexec for the same kernel.
> o Other benchmarks show slight improvements or are comparable to
>   the numbers on tip.

Cheers! Yet I still see some minor regressions in the report below.
As we discussed last time, reducing unnecessary updates on the idle
cpumask when the LLC is overloaded should help.

Thanks & Best regards,
Abel
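One possible shape of the idea mentioned above, reusing the
hypothetical helper from the earlier sketch. This is purely
illustrative and not from the series; only nr_idle_scan is an
existing sched_domain_shared field (the SIS_UTIL scan hint), and
using it as the guard condition is an assumption.

	/*
	 * Hypothetical guard (not from the series): if SIS_UTIL already
	 * reports the LLC as having effectively no idle capacity,
	 * updating the shared idle cpumask is unlikely to help wakeups
	 * and only adds cacheline contention, so skip it.
	 */
	if (!READ_ONCE(sds->nr_idle_scan))
		return;

	update_idle_cpumask(cpu, idle);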