[v3,7/7] sched: Shard per-LLC shared runqueues

The SHARED_RUNQ scheduler feature creates a FIFO queue per LLC that
tasks are put into on enqueue, and pulled from when a core in that LLC
would otherwise go idle. For CPUs with large LLCs, this can sometimes
cause significant contention, as illustrated in [0].

[0]: https://lore.kernel.org/all/c8419d9b-2b31-2190-3058-3625bdbcb13d@meta.com/

To mitigate this contention, we can instead split each per-LLC runqueue
into multiple shards, each protected by its own lock, so that fewer CPUs
compete for any one lock.
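
As a rough illustration of the idea (not the code in this patch), the
sketch below shows one way a CPU could be mapped to a shard. Only the
shard size of 6 comes from this commit message. The helper names and the
plain modulo mapping are assumptions made for the example.

```c
#include <assert.h>

/* Shard size used by this patch, per the commit message. */
#define SHARD_SZ 6

/*
 * Hypothetical helper: number of shards needed to cover an LLC with
 * llc_size CPUs, rounding up so that no shard exceeds SHARD_SZ CPUs.
 */
static unsigned int num_shards(unsigned int llc_size)
{
	assert(llc_size > 0);
	return (llc_size + SHARD_SZ - 1) / SHARD_SZ;
}

/*
 * Hypothetical helper: pick the shard a CPU enqueues into. A plain
 * modulo is assumed here; the real mapping may differ.
 */
static unsigned int shard_idx(unsigned int cpu, unsigned int nshards)
{
	assert(nshards > 0);
	return cpu % nshards;
}
```

Under these assumptions, an LLC with 13 CPUs would get 3 shards, and each
CPU would only ever take the lock of the shard it maps to on enqueue.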

Sharding doesn't eliminate contention, but it does reduce it. For
example, if we run the following schbench command, which does almost
nothing other than pound the runqueue:

schbench -L -m 52 -p 512 -r 10 -t 1

we observe with lockstats that sharding significantly decreases
contention.

3 shards:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name         con-bounces    contentions       waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&shard->lock:      31510503       31510711           0.08          19.98        168932319.64     5.36            31700383      31843851       0.03           17.50        10273968.33      0.32
------------
&shard->lock       15731657          [<0000000068c0fd75>] pick_next_task_fair+0x4dd/0x510
&shard->lock       15756516          [<000000001faf84f9>] enqueue_task_fair+0x459/0x530
&shard->lock          21766          [<00000000126ec6ab>] newidle_balance+0x45a/0x650
&shard->lock            772          [<000000002886c365>] dequeue_task_fair+0x4c9/0x540
------------
&shard->lock          23458          [<00000000126ec6ab>] newidle_balance+0x45a/0x650
&shard->lock       16505108          [<000000001faf84f9>] enqueue_task_fair+0x459/0x530
&shard->lock       14981310          [<0000000068c0fd75>] pick_next_task_fair+0x4dd/0x510
&shard->lock            835          [<000000002886c365>] dequeue_task_fair+0x4c9/0x540

No sharding:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name        con-bounces    contentions         waittime-min   waittime-max waittime-total         waittime-avg    acq-bounces   acquisitions   holdtime-min  holdtime-max holdtime-total   holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&shard->lock:     117868635      118361486           0.09           393.01       1250954097.25          10.57           119345882     119780601      0.05          343.35       38313419.51      0.32
------------
&shard->lock       59169196          [<0000000060507011>] __enqueue_entity+0xdc/0x110
&shard->lock       59084239          [<00000000f1c67316>] __dequeue_entity+0x78/0xa0
&shard->lock         108051          [<00000000084a6193>] newidle_balance+0x45a/0x650
------------
&shard->lock       60028355          [<0000000060507011>] __enqueue_entity+0xdc/0x110
&shard->lock         119882          [<00000000084a6193>] newidle_balance+0x45a/0x650
&shard->lock       58213249          [<00000000f1c67316>] __dequeue_entity+0x78/0xa0

The contention is ~3-4x worse if we don't shard at all (118361486
contentions versus 31510711). This roughly matches the fact that we had
3 shards on the host where this was collected. The sharding granularity
could be made tunable in future patch sets by adding a debugfs knob. If
we make the shards even smaller (what's in this patch, i.e. a size of
6), the contention goes away almost entirely:

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name         con-bounces    contentions   waittime-min  waittime-max waittime-total   waittime-avg   acq-bounces   acquisitions   holdtime-min  holdtime-max holdtime-total   holdtime-avg
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&shard->lock:      13839849       13877596      0.08          13.23        5389564.95       0.39           46910241      48069307       0.06          16.40        16534469.35      0.34
------------
&shard->lock           3559          [<00000000ea455dcc>] newidle_balance+0x45a/0x650
&shard->lock        6992418          [<000000002266f400>] __dequeue_entity+0x78/0xa0
&shard->lock        6881619          [<000000002a62f2e0>] __enqueue_entity+0xdc/0x110
------------
&shard->lock        6640140          [<000000002266f400>] __dequeue_entity+0x78/0xa0
&shard->lock           3523          [<00000000ea455dcc>] newidle_balance+0x45a/0x650
&shard->lock        7233933          [<000000002a62f2e0>] __enqueue_entity+0xdc/0x110
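
As a quick check of the figures quoted above, the ratios can be
recomputed from the contentions and waittime-total columns of the
&shard->lock rows. The sketch below is not part of the patch. The struct
and helper names are made up for illustration, and the numbers are
copied from the three tables above.

```c
#include <assert.h>

/* One lock_stat row, reduced to the two columns compared here. */
struct lock_row {
	double contentions;
	double waittime_total;
};

/* Figures copied from the &shard->lock rows of the three tables above. */
static const struct lock_row no_sharding  = { 118361486.0, 1250954097.25 };
static const struct lock_row three_shards = {  31510711.0,  168932319.64 };
static const struct lock_row shard_size_6 = {  13877596.0,    5389564.95 };

/* How many times more contentions row a saw than row b. */
static double contention_ratio(const struct lock_row *a,
			       const struct lock_row *b)
{
	assert(b->contentions > 0.0);
	return a->contentions / b->contentions;
}

/* How many times more total wait time row a accumulated than row b. */
static double waittime_ratio(const struct lock_row *a,
			     const struct lock_row *b)
{
	assert(b->waittime_total > 0.0);
	return a->waittime_total / b->waittime_total;
}
```

With these inputs, contention_ratio(&no_sharding, &three_shards) comes
out at roughly 3.8, and waittime_ratio(&three_shards, &shard_size_6) at
roughly 31, which is what "goes away almost entirely" refers to.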

Interestingly, SHARED_RUNQ performs worse than NO_SHARED_RUNQ on the
schbench benchmark on Milan, and the shard lock is not the most
contended lock there: we contend even more on the rq lock:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name         con-bounces    contentions   waittime-min  waittime-max waittime-total   waittime-avg   acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&rq->__lock:       9617614        9656091       0.10          79.64        69665812.00      7.21           18092700      67652829       0.11           82.38        344524858.87     5.09
-----------
&rq->__lock        6301611          [<000000003e63bf26>] task_rq_lock+0x43/0xe0
&rq->__lock        2530807          [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock         109360          [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock         178218          [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
-----------
&rq->__lock        3245506          [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock        1294355          [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
&rq->__lock        2837804          [<000000003e63bf26>] task_rq_lock+0x43/0xe0
&rq->__lock        1627866          [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10

..................................................................................................................................................................................................

&shard->lock:       7338558       7343244       0.10          35.97        7173949.14       0.98           30200858      32679623       0.08           35.59        16270584.52      0.50
------------
&shard->lock        2004142          [<00000000f8aa2c91>] __dequeue_entity+0x78/0xa0
&shard->lock        2611264          [<00000000473978cc>] newidle_balance+0x45a/0x650
&shard->lock        2727838          [<0000000028f55bb5>] __enqueue_entity+0xdc/0x110
------------
&shard->lock        2737232          [<00000000473978cc>] newidle_balance+0x45a/0x650
&shard->lock        1693341          [<00000000f8aa2c91>] __dequeue_entity+0x78/0xa0
&shard->lock        2912671          [<0000000028f55bb5>] __enqueue_entity+0xdc/0x110

...................................................................................................................................................................................................

If we look at the lock stats with SHARED_RUNQ disabled, the rq lock is
still the most contended lock, but significantly less so than with
SHARED_RUNQ enabled:

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name          con-bounces    contentions   waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&rq->__lock:        791277         791690        0.12           110.54       4889787.63       6.18            1575996       62390275       0.13           112.66       316262440.56     5.07
-----------
&rq->__lock         263343          [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock          19394          [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock           4143          [<000000003b542e83>] __task_rq_lock+0x51/0xf0
&rq->__lock          51094          [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
-----------
&rq->__lock          23756          [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock         379048          [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock            677          [<000000003b542e83>] __task_rq_lock+0x51/0xf0
&rq->__lock          47962          [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170

In general, the takeaway here is that sharding does help with
contention, but it's not necessarily one size fits all, and it's
workload dependent. For now, let's include sharding to try and avoid
contention, and because it doesn't seem to regress CPUs that don't need
it such as the AMD 7950X.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: David Vernet <void@manifault.com>
---
 kernel/sched/fair.c  | 149 ++++++++++++++++++++++++++++++-------------
 kernel/sched/sched.h |   3 +-
 2 files changed, 108 insertions(+), 44 deletions(-)

Message ID	20230809221218.163894-8-void@manifault.com
State	New
Headers	From: David Vernet <void@manifault.com>
To: linux-kernel@vger.kernel.org
Cc: peterz@infradead.org, mingo@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com, tj@kernel.org, roman.gushchin@linux.dev, gautham.shenoy@amd.com, kprateek.nayak@amd.com, aaron.lu@intel.com, wuyun.abel@bytedance.com, kernel-team@meta.com
Subject: [PATCH v3 7/7] sched: Shard per-LLC shared runqueues
Date: Wed, 9 Aug 2023 17:12:18 -0500
Message-ID: <20230809221218.163894-8-void@manifault.com>
In-Reply-To: <20230809221218.163894-1-void@manifault.com>
References: <20230809221218.163894-1-void@manifault.com>
Series	sched: Implement shared runqueue in CFS
[v3,0/7] sched: Implement shared runqueue in CFS
[v3,1/7] sched: Expose move_queued_task() from core.c
[v3,2/7] sched: Move is_cpu_allowed() into sched.h
[v3,3/7] sched: Check cpu_active() earlier in newidle_balance()
[v3,4/7] sched: Enable sched_feat callbacks on enable/disable
[v3,5/7] sched/fair: Add SHARED_RUNQ sched feature and skeleton calls
[v3,6/7] sched: Implement shared runqueue in CFS
[v3,7/7] sched: Shard per-LLC shared runqueues
