[v3,0/7] sched: Implement shared runqueue in CFS

Message ID 20230809221218.163894-1-void@manifault.com

David Vernet Aug. 9, 2023, 10:12 p.m. UTC
  Changes
-------

This is v3 of the shared runqueue patchset. This patch set is based off
of commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
bandwidth in use") on the sched/core branch of tip.git.

v1 (RFC): https://lore.kernel.org/lkml/20230613052004.2836135-1-void@manifault.com/
v2: https://lore.kernel.org/lkml/20230710200342.358255-1-void@manifault.com/

v2 -> v3 changes:
- Don't leave stale tasks in the lists when the SHARED_RUNQ feature is
  disabled (Abel Wu)

- Use raw spin lock instead of spinlock_t (Peter)

- Fix return value from shared_runq_pick_next_task() to match the
  semantics expected by newidle_balance() (Gautham, Abel)

- Fold patch __enqueue_entity() / __dequeue_entity() into previous patch
  (Peter)

- Skip <= LLC domains in newidle_balance() if SHARED_RUNQ is enabled
  (Peter)

- Properly support hotplug and recreating sched domains (Peter)

- Avoid unnecessary task_rq_unlock() + raw_spin_rq_lock() when src_rq ==
  target_rq in shared_runq_pick_next_task() (Abel)

- Only issue list_del_init() in shared_runq_dequeue_task() if the task
  is still in the list after acquiring the lock (Aaron Lu)

- Slightly change shared_runq_shard_idx() to make it more likely to keep
  SMT siblings on the same bucket (Peter)

v1 -> v2 changes:
- Change name from swqueue to shared_runq (Peter)

- Shard per-LLC shared runqueues to avoid contention on scheduler-heavy
  workloads (Peter)

- Pull tasks from the shared_runq in newidle_balance() rather than in
  pick_next_task_fair() (Peter and Vincent)

- Rename a few functions to reflect their actual purpose. For example,
  shared_runq_dequeue_task() instead of swqueue_remove_task() (Peter)

- Expose move_queued_task() from core.c rather than migrate_task_to()
  (Peter)

- Properly check is_cpu_allowed() when pulling a task from a shared_runq
  to ensure it can actually be migrated (Peter and Gautham)

- Dropped RFC tag

Overview
========

The scheduler must constantly strike a balance between work
conservation and avoiding costly migrations that harm performance due
to e.g. decreased cache locality. The matter is further complicated by
the topology of the system. Migrating a task between cores on the same
LLC may be more optimal than keeping a task local to the CPU, whereas
migrating a task between LLCs or NUMA nodes may tip the balance in the
other direction.

With that in mind, while CFS is by and large a work conserving
scheduler, there are certain instances where the scheduler will choose
to keep a task local to a CPU when it would have been more optimal to
migrate it to an idle core.

An example of such a workload is the HHVM / web workload at Meta. HHVM
is a VM that JITs Hack and PHP code in service of web requests. Like
other JIT / compilation workloads, it tends to be heavily CPU bound and
to exhibit generally poor cache locality. To try to address this, we set
several debugfs (/sys/kernel/debug/sched) knobs on our HHVM workloads:

- migration_cost_ns -> 0
- latency_ns -> 20000000
- min_granularity_ns -> 10000000
- wakeup_granularity_ns -> 12000000

These knobs are intended both to encourage the scheduler to be as work
conserving as possible (migration_cost_ns -> 0), and also to keep tasks
running for relatively long time slices so as to avoid the overhead of
context switching (the other knobs). Collectively, these knobs provide a
substantial performance win, resulting in roughly a 20% improvement in
throughput. Worth noting, however, is that this improvement is _not_ at
full machine saturation.

That said, even with these knobs, we noticed that CPUs were still going
idle even when the host was overcommitted. In response, we wrote the
"shared runqueue" (SHARED_RUNQ) feature proposed in this patch set. The
idea behind SHARED_RUNQ is simple: it enables the scheduler to be more
aggressively work conserving by placing a waking task into a sharded
per-LLC FIFO queue that can be pulled from by another core in the LLC
FIFO queue which can then be pulled from before it goes idle.

With this simple change, we were able to achieve a 1 - 1.6% improvement
in throughput, as well as a small, consistent improvement in p95 and p99
latencies, in HHVM. These performance improvements were in addition to
the wins from the debugfs knobs mentioned above. Results for other
benchmarks are outlined below in the Results section.

Design
======

Note that the design described here reflects sharding, which is the
implementation added in the final patch of the series (following the
initial unsharded implementation added in patch 6/7). The design is
described that way in this cover letter because the benchmarks in the
Results section below all reflect a sharded SHARED_RUNQ.

The design of SHARED_RUNQ is quite simple. A shared_runq is an array of
struct shared_runq_shard objects, each of which is just a struct
list_head of tasks and a raw spinlock:

struct shared_runq_shard {
	struct list_head list;
	raw_spinlock_t lock;
} ____cacheline_aligned;

struct shared_runq {
	u32 num_shards;
	struct shared_runq_shard shards[];
} ____cacheline_aligned;

We create a struct shared_runq per LLC, ensuring they're in their own
cachelines to avoid false sharing between CPUs on different LLCs, and we
create a number of struct shared_runq_shard objects that are housed
there.
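
As a rough illustration of how a CPU maps to one of those shards: the
actual shared_runq_shard_idx() in the series differs in its details,
but the idea is to bucket the LLC's CPUs into groups of
SHARED_RUNQ_SHARD_SZ (the per-shard CPU count from the series), keyed
so that SMT siblings land in the same bucket. A hypothetical sketch:

static u32 shard_idx_sketch(const struct shared_runq *shared_runq, int cpu)
{
	/* Use the first SMT sibling's id so siblings share a bucket. */
	int base = cpumask_first(topology_sibling_cpumask(cpu));

	return (base / SHARED_RUNQ_SHARD_SZ) % shared_runq->num_shards;
}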

When a task first wakes up, it enqueues itself in the shared_runq_shard
of its current LLC at the end of enqueue_task_fair(). Enqueues only
happen if the task was not manually migrated to the current core by
select_task_rq(), and is not pinned to a specific CPU.

A core will pull a task from the shards in its LLC's shared_runq at the
beginning of newidle_balance().
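
For reference, here is a condensed sketch of that enqueue / pull pair.
It is not the exact code from the series (locking and feature-flag
details are simplified, and shard_of() stands in for the real
shard-lookup helper), but the pop half closely mirrors the
shared_runq_pop_task() that appears in the diff further down this
thread:

/* Enqueue path: called at the end of enqueue_task_fair(). */
static void shared_runq_push_task(struct rq *rq, struct task_struct *p)
{
	struct shared_runq_shard *shard = shard_of(rq);

	/*
	 * Only reached if the task wasn't placed on this CPU by
	 * select_task_rq() and isn't pinned to a specific CPU.
	 */
	raw_spin_lock(&shard->lock);
	list_add_tail(&p->shared_runq_node, &shard->list);
	raw_spin_unlock(&shard->lock);
}

/* Pull path: called from newidle_balance() before going idle. */
static struct task_struct *
shared_runq_pop_task(struct shared_runq_shard *shard, int target)
{
	struct task_struct *p;

	if (list_empty(&shard->list))
		return NULL;

	raw_spin_lock(&shard->lock);
	p = list_first_entry_or_null(&shard->list, struct task_struct,
				     shared_runq_node);
	if (p && is_cpu_allowed(p, target))
		list_del_init(&p->shared_runq_node);
	else
		p = NULL;
	raw_spin_unlock(&shard->lock);

	return p;
}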

Difference between SHARED_RUNQ and SIS_NODE
===========================================

In [0], Peter proposed a patch that addresses Tejun's observation that
when workqueues are targeted towards a specific LLC on his Zen2 machine
with small CCXs, there is significant idle time due to
select_idle_sibling() not considering anything outside of the current
LLC.

This patch (SIS_NODE) is essentially the complement to the proposal
here. SIS_NODE causes waking tasks to look for idle cores in neighboring
LLCs on the same die, whereas SHARED_RUNQ causes cores about to go idle
to look for enqueued tasks. That said, in its current form, the two
features operate at different scopes: SIS_NODE searches for idle cores
between LLCs, while SHARED_RUNQ enqueues tasks within a single LLC.

The patch has since been removed in [1], and we compared the results to
SHARED_RUNQ (previously called "swqueue") in [2]. SIS_NODE did not
outperform SHARED_RUNQ on any of the benchmarks, so we elect not to
compare against it again for this patch set.

[0]: https://lore.kernel.org/all/20230530113249.GA156198@hirez.programming.kicks-ass.net/
[1]: https://lore.kernel.org/all/20230605175636.GA4253@hirez.programming.kicks-ass.net/
[2]: https://lore.kernel.org/lkml/20230613052004.2836135-1-void@manifault.com/

Worth noting as well is that it was pointed out in [3] that the logic
behind including SIS_NODE in the first place should apply to SHARED_RUNQ
(meaning that e.g. very small Zen2 CPUs with only 3 or 4 cores per LLC
should benefit from having a single shared_runq stretch across multiple
LLCs). I drafted a patch that implements this by requiring a minimum LLC
size for creating a shard and stretching a shared_runq across multiple
LLCs if they're smaller than that size, and sent it to Tejun to test on
his Zen2. Tejun reported back that SIS_NODE did not seem to make a
difference:

[3]: https://lore.kernel.org/lkml/20230711114207.GK3062772@hirez.programming.kicks-ass.net/

			    o____________o__________o
			    |    mean    | Variance |
			    o------------o----------o
Vanilla:		    | 108.84s    | 0.0057   |
NO_SHARED_RUNQ:		    | 108.82s    | 0.119s   |
SHARED_RUNQ:		    | 108.17s    | 0.038s   |
SHARED_RUNQ w/ SIS_NODE:    | 108.87s    | 0.111s   |
			    o------------o----------o

I similarly tried running kcompile on SHARED_RUNQ with SIS_NODE on my
7950X Zen4, but didn't see any gain relative to plain SHARED_RUNQ (though
a gain was observed relative to NO_SHARED_RUNQ, as described below).

Results
=======

Note that the motivation for the shared runqueue feature originally came
from experiments in the sched_ext framework that's currently being
proposed upstream. The ~1 - 1.6% improvement in HHVM throughput is
similarly visible using work-conserving sched_ext schedulers (even very
simple ones like global FIFO).

On both single and multi socket / CCX hosts, this can measurably improve
performance. In addition to the performance gains observed on our
internal web workloads, we also observed an improvement in common
workloads such as kernel compile and hackbench when running with the
shared runqueue.

On the other hand, some workloads suffer from SHARED_RUNQ, namely those
that hammer the runqueue hard, such as netperf UDP_RR, or schbench -L
-m 52 -p 512 -r 10 -t 1. This can be mitigated somewhat by sharding the
shared data structures within a CCX, but it doesn't seem to eliminate
all contention in every scenario. On the positive side, it seems that
sharding does not materially harm the benchmarks run for this patch
series; in fact, it seems to improve some workloads such as kernel
compile.

Note that for the kernel compile workloads below, the compilation was
done by running make -j$(nproc) built-in.a on several different types of
hosts configured with make allyesconfig on commit a27648c74210 ("afs:
Fix setting of mtime when creating a file/dir/symlink") on Linus' tree
(boost and turbo were disabled on all of these hosts when the
experiments were performed).

Finally, note that these results were from the patch set built off of
commit ebb83d84e49b ("sched/core: Avoid multiple calling
update_rq_clock() in __cfsb_csd_unthrottle()") on the sched/core branch
of tip.git for easy comparison with the v2 patch set results. The
patches in their final form from this set were rebased onto commit
88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs bandwidth in
use") on the sched/core branch of tip.git.

=== Single-socket | 16 core / 32 thread | 2-CCX | AMD 7950X Zen4 ===

CPU max MHz: 5879.8818
CPU min MHz: 3000.0000

Command: make -j$(nproc) built-in.a
			    o____________o__________o
			    |    mean    | Variance |
			    o------------o----------o
NO_SHARED_RUNQ:		    | 581.95s    | 2.639s   |
SHARED_RUNQ:		    | 577.02s    | 0.084s   |
			    o------------o----------o

Takeaway: SHARED_RUNQ results in a statistically significant ~0.85%
improvement over NO_SHARED_RUNQ. This suggests that enqueuing tasks in
the shared runqueue on every enqueue improves work conservation, and
that thanks to sharding, it does not result in contention.

Command: hackbench --loops 10000
                            o____________o__________o
                            |    mean    | Variance |
                            o------------o----------o
NO_SHARED_RUNQ:             | 2.2492s    | .00001s  |
SHARED_RUNQ:		    | 2.0217s    | .00065s  |
                            o------------o----------o

Takeaway: SHARED_RUNQ performs exceptionally well compared to
NO_SHARED_RUNQ here, beating it by over 10%. This was a surprising
result, given that it would seem advantageous to err on the side of
avoiding migration in hackbench (tasks are short lived, sending only
10k small messages), but the results of the benchmark suggest that
minimizing runqueue delays is preferable.

Command:
for i in `seq 128`; do
    netperf -6 -t UDP_RR -c -C -l $runtime &
done
                            o_______________________o
                            | Throughput | Variance |
                            o-----------------------o
NO_SHARED_RUNQ:             | 25037.45   | 2243.44  |
SHARED_RUNQ:                | 24952.50   | 1268.06  |
                            o-----------------------o

Takeaway: No statistically significant difference here, though it is
worth noting that there is no regression for the shared runqueue on the
7950X, while there is a small regression for SHARED_RUNQ on the Skylake
and Milan hosts, as described below.

=== Single-socket | 18 core / 36 thread | 1-CCX | Intel Skylake ===

CPU max MHz: 1601.0000
CPU min MHz: 800.0000

Command: make -j$(nproc) built-in.a
			    o____________o__________o
			    |    mean    | Variance |
			    o------------o----------o
NO_SHARED_RUNQ:		    | 1517.44s   | 2.8322s  |
SHARED_RUNQ:		    | 1516.51s   | 2.9450s  |
			    o------------o----------o

Takeaway: There's no statistically significant gain here. I observed
what I claimed was a .23% win in v2, but it appears that this is not
actually statistically significant.

Command: hackbench --loops 10000
                            o____________o__________o
                            |    mean    | Variance |
                            o------------o----------o
NO_SHARED_RUNQ:             | 5.3370s    | .0012s   |
SHARED_RUNQ:		    | 5.2668s    | .0033s   |
                            o------------o----------o

Takeaway: SHARED_RUNQ results in a ~1.3% improvement over
NO_SHARED_RUNQ. This is also statistically significant, but smaller than
the 10+% improvement observed on the 7950X.

Command:
for i in `seq 128`; do
        netperf -6 -t UDP_RR -c -C -l $runtime &
done
                            o_______________________o
                            | Throughput | Variance |
                            o-----------------------o
NO_SHARED_RUNQ:             | 15699.32   | 377.01   |
SHARED_RUNQ:                | 14966.42   | 714.13   |
                            o-----------------------o

Takeaway: NO_SHARED_RUNQ beats SHARED_RUNQ by ~4.6%. This result makes
sense -- the workload is very heavy on the runqueue, so enqueuing tasks
in the shared runqueue in __enqueue_entity() would intuitively result in
increased contention on the shard lock.

=== Single-socket | 72-core | 6-CCX | AMD Milan Zen3 ===

CPU max MHz: 700.0000
CPU min MHz: 700.0000

Command: make -j$(nproc) built-in.a
			    o____________o__________o
			    |    mean    | Variance |
			    o------------o----------o
NO_SHARED_RUNQ:		    | 1568.55s   | 0.1568s  |
SHARED_RUNQ:		    | 1568.26s   | 1.2168s  |
			    o------------o----------o

Takeaway: No statistically significant difference here. It might be
worth experimenting with work stealing in a follow-on patch set.

Command: hackbench --loops 10000
                            o____________o__________o
                            |    mean    | Variance |
                            o------------o----------o
NO_SHARED_RUNQ:             | 5.2716s    | .00143s  |
SHARED_RUNQ:		    | 5.1716s    | .00289s  |
                            o------------o----------o

Takeaway: SHARED_RUNQ again wins, by about 2%.

Command:
for i in `seq 128`; do
        netperf -6 -t UDP_RR -c -C -l $runtime &
done
                            o_______________________o
                            | Throughput | Variance |
                            o-----------------------o
NO_SHARED_RUNQ:             | 17482.03   | 4675.99  |
SHARED_RUNQ:                | 16697.25   | 9812.23  |
                            o-----------------------o

Takeaway: Similar to the Skylake runs, NO_SHARED_RUNQ still beats
SHARED_RUNQ, this time by ~4.5%. It's worth noting that in v2,
NO_SHARED_RUNQ was only ~1.8% faster. The variance is very high here, so
the results of this benchmark should be taken with a large grain of
salt (noting that we do consistently see NO_SHARED_RUNQ on top, due to
not contending on the shard lock).

Finally, let's look at how sharding affects the following schbench
incantation suggested by Chris in [4]:

schbench -L -m 52 -p 512 -r 10 -t 1

[4]: https://lore.kernel.org/lkml/c8419d9b-2b31-2190-3058-3625bdbcb13d@meta.com/

The TL;DR is that sharding improves things a lot, but doesn't completely
fix the problem. Here are the results from running the schbench command
on the 18 core / 36 thread single CCX, single-socket Skylake:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name         con-bounces    contentions       waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&shard->lock:      31510503       31510711           0.08          19.98        168932319.64     5.36            31700383      31843851       0.03           17.50        10273968.33      0.32
------------
&shard->lock       15731657          [<0000000068c0fd75>] pick_next_task_fair+0x4dd/0x510
&shard->lock       15756516          [<000000001faf84f9>] enqueue_task_fair+0x459/0x530
&shard->lock          21766          [<00000000126ec6ab>] newidle_balance+0x45a/0x650
&shard->lock            772          [<000000002886c365>] dequeue_task_fair+0x4c9/0x540
------------
&shard->lock          23458          [<00000000126ec6ab>] newidle_balance+0x45a/0x650
&shard->lock       16505108          [<000000001faf84f9>] enqueue_task_fair+0x459/0x530
&shard->lock       14981310          [<0000000068c0fd75>] pick_next_task_fair+0x4dd/0x510
&shard->lock            835          [<000000002886c365>] dequeue_task_fair+0x4c9/0x540

These results are from a run where we create only 3 shards (16 logical
cores per shard), so the contention may be a result of overly-coarse
sharding. If we run the schbench incantation with no sharding
whatsoever, we see the following significantly worse lock contention
stats:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name        con-bounces    contentions         waittime-min   waittime-max waittime-total         waittime-avg    acq-bounces   acquisitions   holdtime-min  holdtime-max holdtime-total   holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&shard->lock:     117868635      118361486           0.09           393.01       1250954097.25          10.57           119345882     119780601      0.05          343.35       38313419.51      0.32
------------
&shard->lock       59169196          [<0000000060507011>] __enqueue_entity+0xdc/0x110
&shard->lock       59084239          [<00000000f1c67316>] __dequeue_entity+0x78/0xa0
&shard->lock         108051          [<00000000084a6193>] newidle_balance+0x45a/0x650
------------
&shard->lock       60028355          [<0000000060507011>] __enqueue_entity+0xdc/0x110
&shard->lock         119882          [<00000000084a6193>] newidle_balance+0x45a/0x650
&shard->lock       58213249          [<00000000f1c67316>] __dequeue_entity+0x78/0xa0

The contention is ~3-4x worse if we don't shard at all. This roughly
matches the fact that we had 3 shards on the first workload run above.
If we make the shards even smaller, the contention is comparatively much
lower:

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name         con-bounces    contentions   waittime-min  waittime-max waittime-total   waittime-avg   acq-bounces   acquisitions   holdtime-min  holdtime-max holdtime-total   holdtime-avg
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&shard->lock:      13839849       13877596      0.08          13.23        5389564.95       0.39           46910241      48069307       0.06          16.40        16534469.35      0.34
------------
&shard->lock           3559          [<00000000ea455dcc>] newidle_balance+0x45a/0x650
&shard->lock        6992418          [<000000002266f400>] __dequeue_entity+0x78/0xa0
&shard->lock        6881619          [<000000002a62f2e0>] __enqueue_entity+0xdc/0x110
------------
&shard->lock        6640140          [<000000002266f400>] __dequeue_entity+0x78/0xa0
&shard->lock           3523          [<00000000ea455dcc>] newidle_balance+0x45a/0x650
&shard->lock        7233933          [<000000002a62f2e0>] __enqueue_entity+0xdc/0x110

Interestingly, SHARED_RUNQ performs worse than NO_SHARED_RUNQ on the schbench
benchmark on Milan as well, but we contend more on the rq lock than the
shard lock:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name         con-bounces    contentions   waittime-min  waittime-max waittime-total   waittime-avg   acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&rq->__lock:       9617614        9656091       0.10          79.64        69665812.00      7.21           18092700      67652829       0.11           82.38        344524858.87     5.09
-----------
&rq->__lock        6301611          [<000000003e63bf26>] task_rq_lock+0x43/0xe0
&rq->__lock        2530807          [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock         109360          [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock         178218          [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
-----------
&rq->__lock        3245506          [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock        1294355          [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
&rq->__lock        2837804          [<000000003e63bf26>] task_rq_lock+0x43/0xe0
&rq->__lock        1627866          [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10

..................................................................................................................................................................................................

&shard->lock:       7338558       7343244       0.10          35.97        7173949.14       0.98           30200858      32679623       0.08           35.59        16270584.52      0.50
------------
&shard->lock        2004142          [<00000000f8aa2c91>] __dequeue_entity+0x78/0xa0
&shard->lock        2611264          [<00000000473978cc>] newidle_balance+0x45a/0x650
&shard->lock        2727838          [<0000000028f55bb5>] __enqueue_entity+0xdc/0x110
------------
&shard->lock        2737232          [<00000000473978cc>] newidle_balance+0x45a/0x650
&shard->lock        1693341          [<00000000f8aa2c91>] __dequeue_entity+0x78/0xa0
&shard->lock        2912671          [<0000000028f55bb5>] __enqueue_entity+0xdc/0x110

...................................................................................................................................................................................................

If we look at the lock stats with SHARED_RUNQ disabled, the rq lock is
still the most contended, but significantly less so than with it
enabled:

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name          con-bounces    contentions   waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&rq->__lock:        791277         791690        0.12           110.54       4889787.63       6.18            1575996       62390275       0.13           112.66       316262440.56     5.07
-----------
&rq->__lock         263343          [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock          19394          [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock           4143          [<000000003b542e83>] __task_rq_lock+0x51/0xf0
&rq->__lock          51094          [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
-----------
&rq->__lock          23756          [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock         379048          [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock            677          [<000000003b542e83>] __task_rq_lock+0x51/0xf0

Worth noting is that increasing the granularity of the shards in general
improves performance for very runqueue-heavy workloads such as netperf
UDP_RR and this schbench command, but it doesn't necessarily make a big
difference for every workload, or for sufficiently small CCXs such as
the 7950X's. It may make sense to eventually allow users to control this
with a debugfs knob, but for now we elect to choose a default that
resulted in good performance for the benchmarks run for this patch
series.
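
For reference, that granularity is currently a compile-time constant
(the SHARED_RUNQ_SHARD_SZ define visible in the diff further down this
thread). Assuming the per-LLC shard count is derived from it the same
way as the SHARED_RUNQ_MAX_SHARDS macro, the relationship is roughly:

/* This would likely work better as a configurable knob via debugfs */
#define SHARED_RUNQ_SHARD_SZ 6

/* One shard per SHARED_RUNQ_SHARD_SZ CPUs in the LLC, rounded up. */
static u32 shared_runq_nr_shards(u32 llc_weight)
{
	return DIV_ROUND_UP(llc_weight, SHARED_RUNQ_SHARD_SZ);
}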

Conclusion
==========

SHARED_RUNQ in this form provides statistically significant wins for
several types of workloads and various CPU topologies. The reason for
this is roughly the same for all workloads: SHARED_RUNQ encourages work
conservation inside of a CCX by having a CPU do an O(# per-LLC shards)
iteration over the shared_runq shards in an LLC. We could similarly do
an O(n) iteration over all of the runqueues in the current LLC when a
core is going idle, but that's quite costly (especially for larger
LLCs), and sharded SHARED_RUNQ seems to provide a performant middle
ground between doing an O(n) walk, and doing an O(1) pull from a single
per-LLC shared runq.
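
Concretely, the newidle-path walk described above looks roughly like
the following. This is a condensed sketch based on the
shared_runq_pick_next_task() hunks quoted later in this thread (the
real function also re-checks the task under its rq lock before
migrating it); shared_runq_find_task() is just an illustrative name:

static struct task_struct *
shared_runq_find_task(struct shared_runq *shared_runq, struct rq *rq,
		      u32 starting_idx)
{
	struct task_struct *p = NULL;
	u32 i, curr_idx;

	/* Start at this CPU's own shard and wrap around the LLC. */
	for (i = 0; i < shared_runq->num_shards; i++) {
		curr_idx = (starting_idx + i) % shared_runq->num_shards;
		p = shared_runq_pop_task(&shared_runq->shards[curr_idx],
					 cpu_of(rq));
		if (p)
			break;
	}

	return p;
}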

For the workloads above, kernel compile and hackbench were clear winners
for SHARED_RUNQ (especially in __enqueue_entity()). The reason for the
improvement in kernel compile is of course that we have a heavily
CPU-bound workload where cache locality doesn't mean much; getting a CPU
is the #1 goal. As mentioned above, while I didn't expect to see an
improvement in hackbench, the results of the benchmark suggest that
minimizing runqueue delays is preferable to optimizing for L1/L2
locality.

Not all workloads benefit from SHARED_RUNQ, however. Workloads that
hammer the runqueue hard, such as netperf UDP_RR or schbench -L -m 52
-p 512 -r 10 -t 1, tend to run into contention on the shard locks,
especially when enqueuing tasks in __enqueue_entity(). This can be
mitigated significantly by sharding the shared data structures within a
CCX, but it doesn't eliminate all contention, as described above.

Worth noting as well is that Gautham Shenoy ran some interesting
experiments on a few more ideas in [5], such as walking the shared_runq
on the pop path until a task is found that can be migrated to the
calling CPU. I didn't run those experiments in this patch set, but it
might be worth doing so.

[5]: https://lore.kernel.org/lkml/ZJkqeXkPJMTl49GB@BLR-5CG11610CF.amd.com/

Gautham also ran some other benchmarks in [6], which we may want to try
again on this v3, but with boost disabled.

[6]: https://lore.kernel.org/lkml/ZLpMGVPDXqWEu+gm@BLR-5CG11610CF.amd.com/

Finally, while SHARED_RUNQ in this form encourages work conservation, it
of course does not guarantee it given that we don't implement any kind
of work stealing between shared_runq's. In the future, we could
potentially push CPU utilization even higher by enabling work stealing
between shared_runq's, likely between CCXs on the same NUMA node.

Originally-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: David Vernet <void@manifault.com>

David Vernet (7):
  sched: Expose move_queued_task() from core.c
  sched: Move is_cpu_allowed() into sched.h
  sched: Check cpu_active() earlier in newidle_balance()
  sched: Enable sched_feat callbacks on enable/disable
  sched/fair: Add SHARED_RUNQ sched feature and skeleton calls
  sched: Implement shared runqueue in CFS
  sched: Shard per-LLC shared runqueues

 include/linux/sched.h   |   2 +
 kernel/sched/core.c     |  52 ++----
 kernel/sched/debug.c    |  18 ++-
 kernel/sched/fair.c     | 340 +++++++++++++++++++++++++++++++++++++++-
 kernel/sched/features.h |   1 +
 kernel/sched/sched.h    |  56 ++++++-
 kernel/sched/topology.c |   4 +-
 7 files changed, 420 insertions(+), 53 deletions(-)
  

Comments

David Vernet Aug. 18, 2023, 5:03 a.m. UTC | #1
On Thu, Aug 17, 2023 at 02:12:03PM +0530, Gautham R. Shenoy wrote:
> Hello David,

Hello Gautham,

Thanks a lot as always for running some benchmarks and analyzing these
changes.

> On Wed, Aug 09, 2023 at 05:12:11PM -0500, David Vernet wrote:
> > Changes
> > -------
> > 
> > This is v3 of the shared runqueue patchset. This patch set is based off
> > of commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
> > bandwidth in use") on the sched/core branch of tip.git.
> 
> 
> I tested the patches on Zen3 and Zen4 EPYC Servers like last time. I
> notice that apart from hackbench, every other benchmark is showing
> regressions with this patch series. Quick summary of my observations:

Just to verify per our prior conversation [0], was this latest set of
benchmarks run with boost disabled? Your analysis below seems to
indicate pretty clearly that v3 regresses some workloads regardless of
boost, but I did want to double check regardless, just to bear it in
mind when looking over the results.

[0]: https://lore.kernel.org/all/ZMn4dLePB45M5CGa@BLR-5CG11610CF.amd.com/

> * With shared-runqueue enabled, tbench and netperf both stop scaling
>   when we go beyond 32 clients and the scaling issue persists until
>   the system is overutilized. When the system is overutilized,
>   shared-runqueue is able to recover quite splendidly and outperform
>   tip.

Hmm, I still don't understand why we perform better when the system is
overutilized. I'd expect vanilla CFS to perform better than shared_runq
in such a scenario in general, as the system will be fully utilized
regardless.

> * stream doesn't show any significant difference with the
>   shared-runqueue as expected.
> 
> * schbench shows no major regressions for the requests-per-second and
>   the request-latency until the system is completely saturated at
>   which point, I do see some improvements with the shared
>   runqueue. However, the wakeup-latency is bad when the system is
>   moderately utilized.
> 
> * mongodb shows 3.5% regression with shared runqueue enabled.

This is indeed surprising. I think I have a theory, as described below.

> 
> Please find the detailed results at the end of this mail.
> 
> Scalability for tbench and netperf
> ==================================
> I want to call out the reason for the scaling issues observed with
> tbench and netperf when the number of clients are between 32 to 256. I
> will use tbench here to illustrate the analysis.
> 
> As I had mentioned, in my response to Aaron's RFC,
> (https://lore.kernel.org/lkml/20230816024831.682107-2-aaron.lu@intel.com/#t)
> in the aforementioned cases, I could observe a bottleneck with
> update_cfs_group() and update_load_avg() which is due to the fact that
> we do a lot more task migrations when the shared runqueue is enabled.
> 
>   Overhead  Command  Shared Object     Symbol
> +   20.54%  tbench   [kernel.vmlinux]  [k] update_cfs_group
> +   15.78%  tbench   [kernel.vmlinux]  [k] update_load_avg
> 
> Applying Aaron's ratelimiting patch helps improve the
> scalability. Previously the throughput values for 32 clients, 64
> clients, 128 clients and 256 clients were very close to each other but
> with Aaron's patch, that improved. However, the regression still
> persisted.
> 
> ==================================================================
> Test          : tbench 
> Units         : Normalized throughput 
> Interpretation: Higher is better 
> Statistic     : AMean 
> ==================================================================
> Clients:  tip[pct imp](CV)       sh_rq_v3[pct imp](CV)    sh_rq_v3_tgload_fix[pct imp](CV)
>    32     1.00 [  0.00]( 2.90)     0.44 [-55.53]( 1.44)     0.98 [ -2.23]( 1.72)
>    64     1.00 [  0.00]( 1.02)     0.27 [-72.58]( 0.35)     0.74 [-25.64]( 2.43)
>   128     1.00 [  0.00]( 0.88)     0.19 [-81.29]( 0.51)     0.52 [-48.47]( 3.92)
>   256     1.00 [  0.00]( 0.28)     0.17 [-82.80]( 0.29)     0.88 [-12.23]( 1.76)

Just to make sure we're on the same page, "CV" here is the coefficient
of variation (i.e. standard deviation / mean), correct?

> With Aaron's fix, perf showed that there were a lot of samples for
> update_sd_lb_stats().
> 
> Samples: 8M of event 'ibs_op//', Event count (approx.): 28860989545448
>   Overhead  Command  Shared Object         Symbol
> -   13.00%  tbench   [kernel.vmlinux]      [k] update_sd_lb_stats.constprop.0
>    - 7.21% update_sd_lb_stats.constprop.0                                    
>       - 7.21% find_busiest_group                                                
>            load_balance                                                         
>          - newidle_balance                                                      
>             + 5.90% pick_next_task_fair                                         
>             + 1.31% balance_fair                                                
>    - 3.05% cpu_util                                                             
>       - 2.63% update_sd_lb_stats.constprop.0                                    
>            find_busiest_group                                                   
>            load_balance                                                         
>          + newidle_balance                                                      
>    - 1.67% idle_cpu                                                             
>       - 1.36% update_sd_lb_stats.constprop.0                                    
>            find_busiest_group                                                   
>            load_balance                                                         
>          - newidle_balance                                                      
>             + 1.11% pick_next_task_fair   
> 
> perf annotate shows the hotspot to be a harmless looking "add"
> instruction update_sg_lb_stats() which adds a value obtained from
> cfs_rq->avg.load_avg to sg->group_load.
> 
>        │     cfs_rq_load_avg():
>        │     return cfs_rq->avg.load_avg;
>   0.31 │       mov    0x220(%r8),%rax
>        │     update_sg_lb_stats():
>        │     sgs->group_load += load;
>  15.90 │       add    %rax,0x8(%r13)
>        │     cfs_rq_load_avg():
>        │     return cfs_rq->avg.load_avg;
> 
> So, I counted the number of times the CPUs call find_busiest_group()
> without and with shared_rq and the distribution is quite stark.
> 
> =====================================================
> per-cpu             :          Number of CPUs       :
> find_busiest_group  :----------------:--------------:
> count               :  without-sh.rq :  with-sh.rq  :
> =====================================:===============
> [      0 -  200000) :     77
> [ 200000 -  400000) :     41
> [ 400000 -  600000) :     64
> [ 600000 -  800000) :     63
> [ 800000 - 1000000) :     66
> [1000000 - 1200000) :     69
> [1200000 - 1400000) :     52
> [1400000 - 1600000) :     34              5
> [1600000 - 1800000) :     17		 31
> [1800000 - 2000000) :      6		 59
> [2000000 - 2200000) :     13		109
> [2200000 - 2400000) :      4		120
> [2400000 - 2600000) :      3		157
> [2600000 - 2800000) :      1		 29
> [2800000 - 3000000) :      1		  2
> [9200000 - 9400000) :      1
> 
> As you can notice, the number of calls to find_busiest_group() without
> the shared.rq is greater at the lower end of distribution, which
> implies fewer calls in total. With shared-rq enabled, the distribution
> is normal, but shifted to the right, which implies a lot more calls to
> find_busiest_group().

Huh, that's very much unexpected for obvious reasons -- we should be
hitting the load_balance() path _less_ due to scheduling tasks with the
shared_runq.

> To investigate further, where this is coming from, I reran tbench with
> sched-scoreboard (https://github.com/AMDESE/sched-scoreboard), and the
> schedstats shows the the total wait-time of the tasks on the runqueue
> *increases* by a significant amount when shared-rq is enabled.
> 
> Further, if you notice the newidle load_balance() attempts at the DIE
> and the NUMA domains, they are significantly higher when shared-rq is
> enabled. So it appears that a lot more time is being spent trying to
> do load-balancing when shared runqueue is enabled, which is counter
> intuitive.

Certainly agreed on it being counter intuitive. I wonder if this is due
to this change in the latest v3 revision [1]:

@@ -12093,6 +12308,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	rcu_read_lock();
 	sd = rcu_dereference_check_sched_domain(this_rq->sd);
 
+	/*
+	 * Skip <= LLC domains as they likely won't have any tasks if the
+	 * shared runq is empty.
+	 */
+	if (sched_feat(SHARED_RUNQ)) {
+		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
+		if (likely(sd))
+			sd = sd->parent;
+	}
+
 	if (!READ_ONCE(this_rq->rd->overload) ||
 	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

[1]: https://lore.kernel.org/all/20230809221218.163894-7-void@manifault.com/

I originally added this following Peter's suggestion in v2 [2] with the
idea that we'd skip the <= LLC domains when shared_runq is enabled, but
in hindsight, we also aren't walking the shared_runq list until we find
a task that can run on the current core. If there's a task, and it can't
run on the current core, we give up and proceed to the rest of
newidle_balance(). So it's possible that we're incorrectly assuming
there's no task in the current LLC because it wasn't enqueued at the
head of the shared_runq. I think we'd only want to add this improvement
if we walked the list as you tried in v1 of the patch set, or revert it
otherwise.

[2]: https://lore.kernel.org/lkml/20230711094547.GE3062772@hirez.programming.kicks-ass.net/

Would you be able to try running your benchmarks again with that change
removed, or with your original idea of walking the list added? I've
tried to reproduce this issue on my 7950x, as well as a single-socket /
CCX 26 core / 52 thread Cooper Lake host, but have been unable to. For
example, if I run funccount.py 'load_balance' -d 30 (from [3]) while
running the below netperf command on the Cooper Lake, this is what I
see:

for i in `seq 128`; do netperf -6 -t UDP_RR -c -C -l 60 & done

NO_SHARED_RUNQ
--------------
FUNC                                    COUNT
b'load_balance'                         39636


SHARED_RUNQ
-----------
FUNC                                    COUNT
b'load_balance'                         32345

[3]: https://github.com/iovisor/bcc/blob/master/tools/funccount.py

The stack traces also don't show us running load_balance() excessively.
Please feel free to share how you're running tbench as well and I can
try that on my end.

> ----------------------------------------------------------------------------------------------------
> Time elapsed (in jiffies)                                  :         39133,        39132           
> ----------------------------------------------------------------------------------------------------
> cpu:  all_cpus (avg) vs cpu:  all_cpus (avg)
> ----------------------------------------------------------------------------------------------------
> sched_yield count                                          :             0,            0           
> Legacy counter can be ignored                              :             0,            0           
> schedule called                                            :       9112673,      5014567  | -44.97|
> schedule left the processor idle                           :       4554145,      2460379  | -45.97|
> try_to_wake_up was called                                  :       4556347,      2552974  | -43.97|
> try_to_wake_up was called to wake up the local cpu         :          2227,         1350  | -39.38| 
> total runtime by tasks on this processor (in ns)           :   41093465125,  33816591424  | -17.71|
> total waittime by tasks on this processor (in ns)          :      21832848,   3382037232  |15390.59| <======
> total timeslices run on this cpu                           :       4558524,      2554181  | -43.97|
> 
> ----------------------------------------------------------------------------------------------------
> domain:  SMT (NO_SHARED_RUNQ vs SHARED_RUNQ)
> ----------------------------------------------------------------------------------------------------
> < ----------------------------------------  Category:  newidle ---------------------------------------- >
> load_balance count on cpu newly idle                       :        964585,       619463  | -35.78|
> load_balance found balanced on cpu newly idle              :        964573,       619303  | -35.80|
>   ->load_balance failed to find bsy q on cpu newly idle    :             0,            0           
>   ->load_balance failed to find bsy grp on cpu newly idle  :        964423,       617603  | -35.96|
> load_balance move task failed on cpu newly idle            :             5,          110  |2100.00|
> *load_balance success cnt on cpu newidle                   :             7,           50  | 614.29|
> pull_task count on cpu newly idle                          :             6,           48  | 700.00|
> *avg task pulled per successfull lb attempt (cpu newidle)  :       0.85714,      0.96000  |  12.00|
>   ->pull_task whn target task was cache-hot on cpu newidle :             0,            0           
> 
> ----------------------------------------------------------------------------------------------------
> domain:  MC (NO_SHARED_RUNQ vs SHARED_RUNQ)
> ----------------------------------------------------------------------------------------------------
> < ----------------------------------------  Category:  newidle ---------------------------------------- >
> load_balance count on cpu newly idle                       :        803080,       615613  | -23.34|
> load_balance found balanced on cpu newly idle              :        641630,       568818  | -11.35|
>   ->load_balance failed to find bsy q on cpu newly idle    :           178,          616  | 246.07|
>   ->load_balance failed to find bsy grp on cpu newly idle  :        641446,       568082  | -11.44|
> load_balance move task failed on cpu newly idle            :        161448,        46296  | -71.32|
> *load_balance success cnt on cpu newidle                   :             2,          499  |24850.00|
> pull_task count on cpu newly idle                          :             1,          498  |49700.00|
> *avg task pulled per successfull lb attempt (cpu newidle)  :       0.50000,      0.99800  |  99.60|
>   ->pull_task whn target task was cache-hot on cpu newidle :             0,            0           
> 
> ----------------------------------------------------------------------------------------------------
> domain:  DIE cpus = all_cpus (avg) vs domain:  DIE cpus = all_cpus (avg)
> ----------------------------------------------------------------------------------------------------
> < ----------------------------------------  Category:  newidle ---------------------------------------- >
> load_balance count on cpu newly idle                       :          2761,       566824  |20429.66| <======
> load_balance found balanced on cpu newly idle              :          1737,       284232  |16263.39|
>   ->load_balance failed to find bsy q on cpu newly idle    :             0,          537            
>   ->load_balance failed to find bsy grp on cpu newly idle  :          1736,       283427  |16226.44|
> load_balance move task failed on cpu newly idle            :          1023,       282021  |27468.04|
> *load_balance success cnt on cpu newidle                   :             1,          571  |57000.00|
> pull_task count on cpu newly idle                          :             0,          571           
> *avg task pulled per successfull lb attempt (cpu newidle)  :             0,            1           
>   ->pull_task whn target task was cache-hot on cpu newidle :             0,            0           
> 
> ----------------------------------------------------------------------------------------------------
> domain:  NUMA cpus = all_cpus (avg) vs domain:  NUMA cpus = all_cpus (avg)
> ----------------------------------------------------------------------------------------------------
> < ----------------------------------------  Category:  newidle ---------------------------------------- >
> load_balance count on cpu newly idle                       :            38,        47936  |126047.37| <======
> load_balance found balanced on cpu newly idle              :            20,        26628  |133040.00|
>   ->load_balance failed to find bsy q on cpu newly idle    :             0,            0             
>   ->load_balance failed to find bsy grp on cpu newly idle  :            20,        26531  |132555.00|
> load_balance move task failed on cpu newly idle            :            18,        21167  |117494.44|
> *load_balance success cnt on cpu newidle                   :             0,          141             
> pull_task count on cpu newly idle                          :             0,          140           
> *avg task pulled per successfull lb attempt (cpu newidle)  :             0,      0.99291           
>   ->pull_task whn target task was cache-hot on cpu newidle :             0,            0           
> 
> < ----------------------------------------  Wakeup info:  ---------------------------------------- >
> Wakeups on same                          CPU (avg)      :          2227,         1350  | -39.38|
> Wakeups on same         SMT cpus = all_cpus (avg)       :         85553,        30942  | -63.83|
> Wakeups on same         MC cpus = all_cpus (avg)        :       4468548,      2520585  | -43.59|
> Wakeups on same         DIE cpus = all_cpus (avg)       :             9,           60  | 566.67|
> Wakeups on same         NUMA cpus = all_cpus (avg)      :             8,           35  | 337.50|
> 
> Affine wakeups on same  SMT cpus = all_cpus (avg)       :         85484,        18848  | -77.95|
> Affine wakeups on same  MC cpus = all_cpus (avg)        :       4465108,      1511225  | -66.15| <======
> Affine wakeups on same  DIE cpus = all_cpus (avg)       :             1,          569  |56800.00:
> Affine wakeups on same  NUMA cpus = all_cpus (avg)      :             0,          140           
> 
> 
> 
> Detailed Results are as follows:
> =============================================================
> Test Machine : 2 Socket Zen4 with 128 cores per socket, SMT enabled.
> 
> tip                 : commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
>                       bandwidth in use")
> sh_rq_v3            : This patchset with SHARED_RUNQ feature enabled.
> 
> sh_rq_v3_tgload_fix : This patchset along with Aaron's patch
>                       (https://lore.kernel.org/lkml/20230816024831.682107-2-aaron.lu@intel.com/)
> 
> The trend is similar on a 2 Socket Zen3 with 64 cores per socket, SMT
> enabled. So, I am omitting it.
> 
> ==================================================================
> Test          : hackbench 
> Units         : Normalized time in seconds 
> Interpretation: Lower is better 
> Statistic     : AMean 
> ==================================================================
> Case:        tip[pct imp](CV)        sh_rq_v3[pct imp](CV)    sh_rq_v3_tgload_fix[pct imp](CV)
>  1-groups     1.00 [  0.00]( 8.41)     0.96 [  3.63]( 6.04)     0.94 [  6.48]( 9.16)
>  2-groups     1.00 [  0.00](12.96)     0.96 [  4.46]( 9.76)     0.89 [ 11.02]( 8.28)
>  4-groups     1.00 [  0.00]( 2.90)     0.85 [ 14.77]( 9.18)     0.86 [ 14.35](13.26)
>  8-groups     1.00 [  0.00]( 1.06)     0.91 [  8.96]( 2.83)     0.94 [  6.34]( 2.02)
> 16-groups     1.00 [  0.00]( 0.57)     1.19 [-18.91]( 2.82)     0.74 [ 26.02]( 1.33)

Nice, this matches what I observed when running my benchmarks as well.

> ==================================================================
> Test          : tbench 
> Units         : Normalized throughput 
> Interpretation: Higher is better 
> Statistic     : AMean 
> ==================================================================
> Clients:   tip[pct imp](CV)      sh_rq_v3[pct imp](CV)    sh_rq_v3_tgload_fix[pct imp](CV)
>     1     1.00 [  0.00]( 0.26)     0.99 [ -1.25]( 0.13)     0.98 [ -2.15]( 0.49)
>     2     1.00 [  0.00]( 0.37)     0.98 [ -2.33]( 0.88)     0.98 [ -2.21]( 0.53)
>     4     1.00 [  0.00]( 0.66)     0.99 [ -1.32]( 0.91)     0.98 [ -2.12]( 0.79)
>     8     1.00 [  0.00]( 2.14)     0.99 [ -0.53]( 2.45)     1.00 [ -0.23]( 2.18)
>    16     1.00 [  0.00]( 1.08)     0.97 [ -3.37]( 2.12)     0.95 [ -5.28]( 1.92)
>    32     1.00 [  0.00]( 2.90)     0.44 [-55.53]( 1.44)     0.98 [ -2.23]( 1.72)
>    64     1.00 [  0.00]( 1.02)     0.27 [-72.58]( 0.35)     0.74 [-25.64]( 2.43)
>   128     1.00 [  0.00]( 0.88)     0.19 [-81.29]( 0.51)     0.52 [-48.47]( 3.92)
>   256     1.00 [  0.00]( 0.28)     0.17 [-82.80]( 0.29)     0.88 [-12.23]( 1.76)
>   512     1.00 [  0.00]( 2.78)     1.33 [ 33.50]( 4.12)     1.22 [ 22.33]( 2.59)
>  1024     1.00 [  0.00]( 0.46)     1.34 [ 34.27]( 0.37)     1.31 [ 31.36]( 1.65)
>  2048     1.00 [  0.00]( 0.75)     1.40 [ 40.42]( 0.05)     1.20 [ 20.09]( 1.98)
> 
> 
> ==================================================================
> Test          : stream (10 Runs)
> Units         : Normalized Bandwidth, MB/s 
> Interpretation: Higher is better 
> Statistic     : HMean 
> ==================================================================
> Test:     tip[pct imp](CV)      sh_rq_v3[pct imp](CV)    sh_rq_v3_tgload_fix[pct imp](CV)
>  Copy     1.00 [  0.00]( 0.84)     1.00 [ -0.22]( 0.59)     1.00 [  0.08]( 0.90)
> Scale     1.00 [  0.00]( 0.42)     1.00 [ -0.33]( 0.39)     1.00 [ -0.15]( 0.42)
>   Add     1.00 [  0.00]( 0.58)     1.00 [ -0.48]( 0.28)     1.00 [ -0.22]( 0.34)
> Triad     1.00 [  0.00]( 0.41)     0.99 [ -0.65]( 0.38)     1.00 [ -0.29]( 0.34)
> 
> 
> ==================================================================
> Test          : stream (100 runs)
> Units         : Normalized Bandwidth, MB/s 
> Interpretation: Higher is better 
> Statistic     : HMean 
> ==================================================================
> Test:     tip[pct imp](CV)       sh_rq_v3[pct imp](CV)    sh_rq_v3_tgload_fix[pct imp](CV)
>  Copy     1.00 [  0.00]( 0.52)     1.00 [ -0.16]( 0.45)     1.00 [  0.35]( 0.73)
> Scale     1.00 [  0.00]( 0.35)     1.00 [ -0.20]( 0.38)     1.00 [  0.07]( 0.34)
>   Add     1.00 [  0.00]( 0.37)     1.00 [ -0.07]( 0.42)     1.00 [  0.07]( 0.46)
> Triad     1.00 [  0.00]( 0.57)     1.00 [ -0.22]( 0.45)     1.00 [ -0.04]( 0.49)
> 
> 
> ==================================================================
> Test          : netperf 

Could you please share exactly how you're invoking netperf? Is this
with -t UDP_RR (which is what Aaron was running with), or the default?

[...]

As far as I can tell, there weren't many changes between v2 and v3 that
could have caused this regression. I strongly suspect the heuristic I
mentioned above, especially given your analysis showing that we're
excessively calling load_balance().

Thanks,
David
  
David Vernet Aug. 31, 2023, 2:32 a.m. UTC | #2
On Wed, Aug 30, 2023 at 03:26:40PM +0530, K Prateek Nayak wrote:
> Hello David,

Hello Prateek,

> 
> Short update based on some of my experimentation.
> 
> Disclaimer: I've been only looking at tbench 128 client case on a dual
> socket 3rd Generation EPYC system (2x 64C/128T). Wider results may
> vary but I have some information that may help with the debug and to
> proceed further.
> 
> On 8/25/2023 4:21 AM, David Vernet wrote:
> > On Thu, Aug 24, 2023 at 04:44:19PM +0530, Gautham R. Shenoy wrote:
> >> Hello David,
> >>
> >> On Fri, Aug 18, 2023 at 02:19:03PM +0530, Gautham R. Shenoy wrote:
> >>> Hello David,
> >>>
> >>> On Fri, Aug 18, 2023 at 12:03:55AM -0500, David Vernet wrote:
> >>>> On Thu, Aug 17, 2023 at 02:12:03PM +0530, Gautham R. Shenoy wrote:
> >>>>> Hello David,
> >>>>
> >>>> Hello Gautham,
> >>>>
> >>>> Thanks a lot as always for running some benchmarks and analyzing these
> >>>> changes.
> >>>>
> >>>>> On Wed, Aug 09, 2023 at 05:12:11PM -0500, David Vernet wrote:
> >>>>>> Changes
> >>>>>> -------
> >>>>>>
> >>>>>> This is v3 of the shared runqueue patchset. This patch set is based off
> >>>>>> of commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
> >>>>>> bandwidth in use") on the sched/core branch of tip.git.
> >>>>>
> >>>>>
> >>>>> I tested the patches on Zen3 and Zen4 EPYC Servers like last time. I
> >>>>> notice that apart from hackbench, every other benchmark is showing
> >>>>> regressions with this patch series. Quick summary of my observations:
> >>>>
> >>>> Just to verify per our prior conversation [0], was this latest set of
> >>>> benchmarks run with boost disabled?
> >>>
> >>> Boost is enabled by default. I will queue a run tonight with boost
> >>> disabled.
> >>
> >> Apologies for the delay. I didn't see any changes with boost-disabled
> >> and with reverting the optimization to bail out of the
> >> newidle_balance() for SMT and MC domains when there was no task to be
> >> pulled from the shared-runq. I reran the whole thing once again, just
> >> to rule out any possible variance. The results came out the same.
> > 
> > Thanks a lot for taking the time to run more benchmarks.
> > 
> >> With the boost disabled, and the optimization reverted, the results
> >> don't change much.
> > 
> > Hmmm, I see. So, that was the only real substantive "change" between v2
> > -> v3. The other changes were supporting hotplug / domain recreation,
> > optimizing locking a bit, and fixing small bugs like the return value
> > from shared_runq_pick_next_task(), draining the queue when the feature
> > is disabled, and fixing the lkp errors.
> > 
> > With all that said, it seems very possible that the regression is due to
> > changes in sched/core between commit ebb83d84e49b ("sched/core: Avoid
> > multiple calling update_rq_clock() in __cfsb_csd_unthrottle()") in v2,
> > and commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
> > bandwidth in use") in v3. EEVDF was merged in that window, so that could
> > be one explanation for the context switch rate being so much higher.
> > 
> >> It doesn't appear that the optimization is the cause for increase in
> >> the number of load-balancing attempts at the DIE and the NUMA
> >> domains. I have shared the counts of the newidle_balance with and
> >> without SHARED_RUNQ below for tbench and it can be noticed that the
> >> counts are significantly higher for the 64 clients and 128 clients. I
> >> also captured the counts/s of find_busiest_group() using funccount.py
> >> which tells the same story. So the drop in the performance for tbench
> >> with your patches strongly correlates with the increase in
> >> load-balancing attempts.
> >>
> >> newidle balance is undertaken only if the overload flag is set and the
> >> expected idle duration is greater than the avg load balancing cost. It
> >> is hard to imagine why should the shared runq cause the overload flag
> >> to be set!
> > 
> > Yeah, I'm not sure either about how or why shared_runq would cause this.
> > This is purely hypothetical, but is it possible that shared_runq causes
> > idle cores to on average _stay_ idle longer due to other cores pulling
> > tasks that would have otherwise been load balanced to those cores?
> > 
> > Meaning -- say CPU0 is idle, and there are tasks on other rqs which
> > could be load balanced. Without shared_runq, CPU0 might be woken up to
> > run a task from a periodic load balance. With shared_runq, any active
> > core that would otherwise have gone idle could pull the task, keeping
> > CPU0 idle.
> > 
> > What do you think? I could be totally off here.
> > 
> > From my perspective, I'm not too worried about this given that we're
> > seeing gains in other areas such as kernel compile as I showed in [0],
> > though I definitely would like to better understand it.
> 
> Let me paste a cumulative diff containing everything I've tried since
> it'll be easy to explain.
> 
> o Performance numbers for tbench 128 clients:
> 
> tip			: 1.00 (Var: 0.57%)
> tip + vanilla v3	: 0.39 (var: 1.15%) (%diff: -60.74%)
> tip + v3 + diff		: 0.99 (var: 0.61%) (%diff: -00.24%)
> 
> tip is at commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when
> cfs bandwidth in use"), same as what Gautham used, so no EEVDF yet.
> 
> o Cumulative Diff
> 
> Should apply cleanly on top of tip at above commit + this series as is.
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d67d86d3bfdf..f1e64412fd48 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -198,7 +198,7 @@ struct shared_runq_shard {
>  } ____cacheline_aligned;
>  
>  /* This would likely work better as a configurable knob via debugfs */
> -#define SHARED_RUNQ_SHARD_SZ 6
> +#define SHARED_RUNQ_SHARD_SZ 16
>  #define SHARED_RUNQ_MAX_SHARDS \
>  	((NR_CPUS / SHARED_RUNQ_SHARD_SZ) + (NR_CPUS % SHARED_RUNQ_SHARD_SZ != 0))
>  
> @@ -322,20 +322,36 @@ void shared_runq_toggle(bool enabling)
>  }
>  
>  static struct task_struct *
> -shared_runq_pop_task(struct shared_runq_shard *shard, int target)
> +shared_runq_pop_task(struct shared_runq_shard *shard, struct rq *rq)
>  {
> +	int target = cpu_of(rq);
>  	struct task_struct *p;
>  
>  	if (list_empty(&shard->list))
>  		return NULL;
>  
>  	raw_spin_lock(&shard->lock);
> +again:
>  	p = list_first_entry_or_null(&shard->list, struct task_struct,
>  				     shared_runq_node);
> -	if (p && is_cpu_allowed(p, target))
> +
> +	/* If we find a task, delete it from the list regardless */
> +	if (p) {
>  		list_del_init(&p->shared_runq_node);
> -	else
> -		p = NULL;
> +
> +		if (!task_on_rq_queued(p) ||
> +		    task_on_cpu(task_rq(p), p) ||

Have you observed !task_on_rq_queued() or task_on_cpu() returning true
here? The task should have removed itself from the shard when
__dequeue_entity() is called from set_next_entity() when it's scheduled
in pick_next_task_fair(). The reason we have to check in
shared_runq_pick_next_task() is that between dequeuing the task from the
shared_runq and getting its rq lock, it could have been scheduled on its
current rq. But if the task was scheduled first, it should have removed
itself from the shard.

> +		    !is_cpu_allowed(p, target)) {
> +			if (rq->ttwu_pending) {
> +				p = NULL;
> +				goto out;
> +			}

Have you observed this as well? If the task is enqueued on the ttwu
queue wakelist, it isn't enqueued on the waking CPU, so it shouldn't be
added to the shared_runq right?

> +
> +			goto again;
> +		}
> +	}
> +
> +out:
>  	raw_spin_unlock(&shard->lock);
>  
>  	return p;
> @@ -380,9 +396,12 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
>  		curr_idx = (starting_idx + i) % num_shards;
>  		shard = &shared_runq->shards[curr_idx];
>  
> -		p = shared_runq_pop_task(shard, cpu_of(rq));
> +		p = shared_runq_pop_task(shard, rq);
>  		if (p)
>  			break;
> +
> +		if (rq->ttwu_pending)
> +			return 0;

Same here r.e. rq->ttwu_pending. This should be handled in the

if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p))

check below, no? Note that task_on_rq_queued(p) should only return true
if the task has made it to ttwu_do_activate(), and if it hasn't, I don't
think it should be in the shard in the first place.

>  	}
>  	if (!p)
>  		return 0;
> @@ -395,17 +414,16 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
>  	if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p)) {
>  		update_rq_clock(src_rq);
>  		src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq));
> -		ret = 1;
>  	}
>  
>  	if (src_rq != rq) {
>  		task_rq_unlock(src_rq, p, &src_rf);
>  		raw_spin_rq_lock(rq);
>  	} else {
> +		ret = 1;
>  		rq_unpin_lock(rq, &src_rf);
>  		raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
>  	}
> -	rq_repin_lock(rq, rf);

Huh, wouldn't this cause a WARN to be issued the next time we invoke
rq_clock() in newidle_balance() if we weren't able to find a task? Or
was it because we moved the SHARED_RUNQ logic to below where we check
rq_clock()? In general though, I don't think this should be removed. At
the very least, it should be tested with lockdep.
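
For reference, this is the pairing I'd want to keep intact. Below is a
condensed sketch of the pattern newidle_balance() relies on (the wrapper
function is purely illustrative):

static void newidle_unlock_pattern_sketch(struct rq *this_rq,
					  struct rq_flags *rf)
{
	/*
	 * The caller's rq_flags must be unpinned before the rq lock is
	 * dropped, and re-pinned once the lock is re-acquired, so the
	 * pin / rq clock sanity checks don't trip later in the caller.
	 */
	rq_unpin_lock(this_rq, rf);
	raw_spin_rq_unlock(this_rq);

	/* ... pull / balance work while unlocked ... */

	raw_spin_rq_lock(this_rq);
	rq_repin_lock(this_rq, rf);
}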

>  	return ret;
>  }
> @@ -12344,50 +12362,59 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  	if (!cpu_active(this_cpu))
>  		return 0;
>  
> -	if (sched_feat(SHARED_RUNQ)) {
> -		pulled_task = shared_runq_pick_next_task(this_rq, rf);
> -		if (pulled_task)
> -			return pulled_task;
> -	}
> -
>  	/*
>  	 * We must set idle_stamp _before_ calling idle_balance(), such that we
>  	 * measure the duration of idle_balance() as idle time.
>  	 */
>  	this_rq->idle_stamp = rq_clock(this_rq);
>  
> -	/*
> -	 * This is OK, because current is on_cpu, which avoids it being picked
> -	 * for load-balance and preemption/IRQs are still disabled avoiding
> -	 * further scheduler activity on it and we're being very careful to
> -	 * re-start the picking loop.
> -	 */
> -	rq_unpin_lock(this_rq, rf);
> -
>  	rcu_read_lock();
> -	sd = rcu_dereference_check_sched_domain(this_rq->sd);
> -
> -	/*
> -	 * Skip <= LLC domains as they likely won't have any tasks if the
> -	 * shared runq is empty.
> -	 */
> -	if (sched_feat(SHARED_RUNQ)) {
> +	if (sched_feat(SHARED_RUNQ))
>  		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> -		if (likely(sd))
> -			sd = sd->parent;
> -	}
> +	else
> +		sd = rcu_dereference_check_sched_domain(this_rq->sd);
>  
>  	if (!READ_ONCE(this_rq->rd->overload) ||
>  	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>  
> -		if (sd)
> +		while (sd) {
>  			update_next_balance(sd, &next_balance);
> +			sd = sd->child;
> +		}
> +
>  		rcu_read_unlock();
>  
>  		goto out;
>  	}
>  	rcu_read_unlock();
>  
> +	t0 = sched_clock_cpu(this_cpu);
> +	if (sched_feat(SHARED_RUNQ)) {
> +		pulled_task = shared_runq_pick_next_task(this_rq, rf);
> +		if (pulled_task) {
> +			curr_cost = sched_clock_cpu(this_cpu) - t0;
> +			update_newidle_cost(sd, curr_cost);
> +			goto out_swq;
> +		}
> +	}

Hmmm, why did you move this further down in newidle_balance()? We don't
want to skip trying to get a task from the shared_runq if rq->avg_idle <
sd->max_newidle_lb_cost.

> +
> +	/* Check again for pending wakeups */
> +	if (this_rq->ttwu_pending)
> +		return 0;
> +
> +	t1 = sched_clock_cpu(this_cpu);
> +	curr_cost += t1 - t0;
> +
> +	if (sd)
> +		update_newidle_cost(sd, curr_cost);
> +
> +	/*
> +	 * This is OK, because current is on_cpu, which avoids it being picked
> +	 * for load-balance and preemption/IRQs are still disabled avoiding
> +	 * further scheduler activity on it and we're being very careful to
> +	 * re-start the picking loop.
> +	 */
> +	rq_unpin_lock(this_rq, rf);
>  	raw_spin_rq_unlock(this_rq);
>  
>  	t0 = sched_clock_cpu(this_cpu);
> @@ -12400,6 +12427,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  
>  		update_next_balance(sd, &next_balance);
>  
> +		/*
> +		 * Skip <= LLC domains as they likely won't have any tasks if the
> +		 * shared runq is empty.
> +		 */
> +		if (sched_feat(SHARED_RUNQ) && (sd->flags & SD_SHARE_PKG_RESOURCES))
> +			continue;

This makes sense to me, good call.

> +
>  		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
>  			break;
>  
> @@ -12429,6 +12463,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  
>  	raw_spin_rq_lock(this_rq);
>  
> +out_swq:
>  	if (curr_cost > this_rq->max_idle_balance_cost)
>  		this_rq->max_idle_balance_cost = curr_cost;
>  
> --
> 
> o Breakdown
> 
> I'll proceed to annotate a copy of the diff with the reasoning behind the changes:

Ah, ok, you provided explanations :-) I'll leave my questions above
regardless for posterity.

> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d67d86d3bfdf..f1e64412fd48 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -198,7 +198,7 @@ struct shared_runq_shard {
>  } ____cacheline_aligned;
>  
>  /* This would likely work better as a configurable knob via debugfs */
> -#define SHARED_RUNQ_SHARD_SZ 6
> +#define SHARED_RUNQ_SHARD_SZ 16
>  #define SHARED_RUNQ_MAX_SHARDS \
>  	((NR_CPUS / SHARED_RUNQ_SHARD_SZ) + (NR_CPUS % SHARED_RUNQ_SHARD_SZ != 0))
> 
> --
> 
> 	Here I'm setting SHARED_RUNQ_SHARD_SZ to the sd_llc_size of
> 	my machine. I played around with this and did not see any
> 	contention on the shared_runq lock while running tbench.

I don't really mind making the shard bigger because it will never be one
size fits all, but for what it's worth, I saw less contention in netperf
with a size of 6, but everything else performed fine with a larger
shard. This is one of those "there's no right answer" things, I'm
afraid. I think it will be inevitable to make this configurable at some
point, if we find that it's really causing inefficiencies.
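
If/when we do make it configurable, I'd expect something like the below
to suffice. This is purely a hypothetical sketch: it assumes the shard
size becomes a plain variable that the shard allocation / indexing code
reads, and it reuses the debugfs_sched dentry from kernel/sched/debug.c:

/* Hypothetical: replaces the SHARED_RUNQ_SHARD_SZ compile-time constant. */
static u32 shared_runq_shard_sz = 6;

static int __init shared_runq_shard_sz_debugfs_init(void)
{
	/* Expose the knob as /sys/kernel/debug/sched/shared_runq_shard_sz. */
	debugfs_create_u32("shared_runq_shard_sz", 0644, debugfs_sched,
			   &shared_runq_shard_sz);
	return 0;
}
late_initcall(shared_runq_shard_sz_debugfs_init);

Note that actually honoring a new value at runtime would also mean
re-creating the shards, similar to what we already do when sched domains
are rebuilt, so it wouldn't be completely free.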

> --
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d67d86d3bfdf..f1e64412fd48 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -322,20 +322,36 @@ void shared_runq_toggle(bool enabling)
>  }
>  
>  static struct task_struct *
> -shared_runq_pop_task(struct shared_runq_shard *shard, int target)
> +shared_runq_pop_task(struct shared_runq_shard *shard, struct rq *rq)
>  {
> +	int target = cpu_of(rq);
>  	struct task_struct *p;
>  
>  	if (list_empty(&shard->list))
>  		return NULL;
>  
>  	raw_spin_lock(&shard->lock);
> +again:
>  	p = list_first_entry_or_null(&shard->list, struct task_struct,
>  				     shared_runq_node);
> -	if (p && is_cpu_allowed(p, target))
> +
> +	/* If we find a task, delete it from the list regardless */

As I mentioned in my other reply [0], I don't think we should always
have to delete here. Let me know if I'm missing something.

[0]: https://lore.kernel.org/all/20230831013435.GB506447@maniforge/

> +	if (p) {
>  		list_del_init(&p->shared_runq_node);
> -	else
> -		p = NULL;
> +
> +		if (!task_on_rq_queued(p) ||
> +		    task_on_cpu(task_rq(p), p) ||
> +		    !is_cpu_allowed(p, target)) {
> +			if (rq->ttwu_pending) {
> +				p = NULL;
> +				goto out;
> +			}
> +
> +			goto again;
> +		}
> +	}
> +
> +out:
>  	raw_spin_unlock(&shard->lock);
>  
>  	return p;
> --
> 
> 	Context: When running perf with IBS, I saw the following lock
> 	contention:
> 
> -   12.17%  swapper          [kernel.vmlinux]          [k] native_queued_spin_lock_slowpath
>    - 10.48% native_queued_spin_lock_slowpath
>       - 10.30% _raw_spin_lock
>          - 9.11% __schedule
>               schedule_idle
>               do_idle
>             + cpu_startup_entry
>          - 0.86% task_rq_lock
>               newidle_balance
>               pick_next_task_fair
>               __schedule
>               schedule_idle
>               do_idle
>             + cpu_startup_entry
> 
> 	So I imagined newidle_balance() is contending with another
> 	runqueue going idle when pulling a task. Hence, I moved some of
> 	the checks from shared_runq_pick_next_task() to here.

Hmm, so the idea was to avoid contending on the rq lock? As I mentioned
above, I'm not sure these checks actually buy us anything.

> 	I was not sure if the task's rq lock needs to be held to get an
> 	accurate result, so I've left the original checks in
> 	shared_runq_pick_next_task() as they are.

Yep, we need to have the rq lock held for these functions to return
consistent data.
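
In other words, any check of p->on_rq / p->on_cpu really wants to look
something like the below (a hypothetical helper, purely for illustration;
the name is made up):

static bool task_is_pullable(struct task_struct *p, int dst_cpu)
{
	struct rq_flags rf;
	struct rq *rq;
	bool pullable;

	/*
	 * p->on_rq and p->on_cpu can change at any time unless we hold
	 * the task's rq lock, so the checks have to be made (or re-made)
	 * with task_rq_lock() held.
	 */
	rq = task_rq_lock(p, &rf);
	pullable = task_on_rq_queued(p) && !task_on_cpu(rq, p) &&
		   is_cpu_allowed(p, dst_cpu);
	task_rq_unlock(rq, p, &rf);

	return pullable;
}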

> 	Since retrying may be costly, I'm using "rq->ttwu_pending" as a
> 	bail-out condition. Maybe there are better alternatives based on
> 	the lb_cost and rq->avg_idle, but this was simpler for now.

Hmm, not sure I'm quite understanding this one. As I mentioned above, I
don't _think_ this should ever be set for a task enqueued in a shard.
Were you observing that?

> 	(Realizing as I write this that this will cause more contention
> 	with enqueue/dequeue in a busy system. I'll check if that is the
> 	case)
> 
> 	P.S. This did not affect the ~60% regression I was seeing one
> 	bit so the problem was deeper.
> 
> --
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d67d86d3bfdf..f1e64412fd48 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -380,9 +396,12 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
>  		curr_idx = (starting_idx + i) % num_shards;
>  		shard = &shared_runq->shards[curr_idx];
>  
> -		p = shared_runq_pop_task(shard, cpu_of(rq));
> +		p = shared_runq_pop_task(shard, rq);
>  		if (p)
>  			break;
> +
> +		if (rq->ttwu_pending)
> +			return 0;
>  	}
>  	if (!p)
>  		return 0;
> --
> 
> 	More bailout logic.
> 
> --
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d67d86d3bfdf..f1e64412fd48 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -395,17 +414,16 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
>  	if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p)) {
>  		update_rq_clock(src_rq);
>  		src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq));
> -		ret = 1;
>  	}
>  
>  	if (src_rq != rq) {
>  		task_rq_unlock(src_rq, p, &src_rf);
>  		raw_spin_rq_lock(rq);
>  	} else {
> +		ret = 1;
>  		rq_unpin_lock(rq, &src_rf);
>  		raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
>  	}
> -	rq_repin_lock(rq, rf);
>  
>  	return ret;
>  }
> --
> 
> 	Only return 1 if a task is actually pulled, else return -1,
> 	signifying the path has released and re-acquired the lock.

Not sure I'm following. If we migrate the task to the current rq, we
want to return 1 to signify that there are new fair tasks present in the
rq don't we? It doesn't need to have started there originally for it to
be present after we move it.

> 
> 	Also leave the rq_repin_lock() part to caller, i.e.,
> 	newidle_balance() since it makes up for a nice flow (see
> 	below).
> 
> --
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d67d86d3bfdf..f1e64412fd48 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12344,50 +12362,59 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  	if (!cpu_active(this_cpu))
>  		return 0;
>  
> -	if (sched_feat(SHARED_RUNQ)) {
> -		pulled_task = shared_runq_pick_next_task(this_rq, rf);
> -		if (pulled_task)
> -			return pulled_task;
> -	}
> -
>  	/*
>  	 * We must set idle_stamp _before_ calling idle_balance(), such that we
>  	 * measure the duration of idle_balance() as idle time.
>  	 */
>  	this_rq->idle_stamp = rq_clock(this_rq);
>  
> -	/*
> -	 * This is OK, because current is on_cpu, which avoids it being picked
> -	 * for load-balance and preemption/IRQs are still disabled avoiding
> -	 * further scheduler activity on it and we're being very careful to
> -	 * re-start the picking loop.
> -	 */
> -	rq_unpin_lock(this_rq, rf);
> -
>  	rcu_read_lock();
> -	sd = rcu_dereference_check_sched_domain(this_rq->sd);
> -
> -	/*
> -	 * Skip <= LLC domains as they likely won't have any tasks if the
> -	 * shared runq is empty.
> -	 */
> -	if (sched_feat(SHARED_RUNQ)) {
> +	if (sched_feat(SHARED_RUNQ))
>  		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> -		if (likely(sd))
> -			sd = sd->parent;
> -	}
> +	else
> +		sd = rcu_dereference_check_sched_domain(this_rq->sd);
>  
>  	if (!READ_ONCE(this_rq->rd->overload) ||
>  	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>  
> -		if (sd)
> +		while (sd) {
>  			update_next_balance(sd, &next_balance);
> +			sd = sd->child;
> +		}
> +
>  		rcu_read_unlock();
>  
>  		goto out;
>  	}
>  	rcu_read_unlock();
>  
> +	t0 = sched_clock_cpu(this_cpu);
> +	if (sched_feat(SHARED_RUNQ)) {
> +		pulled_task = shared_runq_pick_next_task(this_rq, rf);
> +		if (pulled_task) {
> +			curr_cost = sched_clock_cpu(this_cpu) - t0;
> +			update_newidle_cost(sd, curr_cost);
> +			goto out_swq;
> +		}
> +	}
> +
> +	/* Check again for pending wakeups */
> +	if (this_rq->ttwu_pending)
> +		return 0;
> +
> +	t1 = sched_clock_cpu(this_cpu);
> +	curr_cost += t1 - t0;
> +
> +	if (sd)
> +		update_newidle_cost(sd, curr_cost);
> +
> +	/*
> +	 * This is OK, because current is on_cpu, which avoids it being picked
> +	 * for load-balance and preemption/IRQs are still disabled avoiding
> +	 * further scheduler activity on it and we're being very careful to
> +	 * re-start the picking loop.
> +	 */
> +	rq_unpin_lock(this_rq, rf);
>  	raw_spin_rq_unlock(this_rq);
>  
>  	t0 = sched_clock_cpu(this_cpu);
> --
> 
> 	This hunk does a few things:
> 
> 	1. If a task is successfully pulled from the shared rq, or if the
> 	   rq lock had been released and re-acquired, jump to the very
> 	   end where we check a bunch of conditions and return
> 	   accordingly.
> 
> 	2. Move the shared rq picking after the "rd->overload" check and
> 	   the check against "rq->avg_idle".
> 
> 	   P.S. This recovered half the performance that was lost.

Sorry, which performance are you referring to? I'm not thrilled about
this part because it's another heuristic for users to have to reason
about. _Maybe_ it makes sense to keep the rd->overload check? I don't
think we should keep the rq->avg_idle check though unless it's
absolutely necessary, and I'd have to think more about the rq->overload
check.

> 	3. Update the newidle_balance_cost via update_newidle_cost()
> 	   since that is also used to determine the previous bailout
> 	   threshold.

I think this makes sense as well, but need to think about it more.

> 	4. A bunch of update_next_balance().

I guess this should be OK, though I would expect us to have to load
balance less with SHARED_RUNQ.

> 	5. Move rq_unpin_lock() below. I do not know the implications of
> 	   this; the kernel is not complaining so far (mind you, I'm on
> 	   x86 and I do not have lockdep enabled).

If you move rq_unpin_lock(), you should probably run with lockdep to see
what happens :-) There are also implications for tracking whether it's
safe to look at the rq clock.

> 
> 	A combination of 3 and 4 seemed to give back the other half of
> 	tbench performance.
> 
> --
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d67d86d3bfdf..f1e64412fd48 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12400,6 +12427,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  
>  		update_next_balance(sd, &next_balance);
>  
> +		/*
> +		 * Skip <= LLC domains as they likely won't have any tasks if the
> +		 * shared runq is empty.
> +		 */
> +		if (sched_feat(SHARED_RUNQ) && (sd->flags & SD_SHARE_PKG_RESOURCES))
> +			continue;
> +

Agreed on this.

>  		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
>  			break;
>  
> --
> 
> 	This was based on my suggestion in the parallel thread.
> 
> 	P.S. This alone, without the changes in the previous hunk, showed
> 	no difference in performance; the results were the same as
> 	vanilla v3.
> 
> --
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d67d86d3bfdf..f1e64412fd48 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12429,6 +12463,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  
>  	raw_spin_rq_lock(this_rq);
>  
> +out_swq:
>  	if (curr_cost > this_rq->max_idle_balance_cost)
>  		this_rq->max_idle_balance_cost = curr_cost;
>  
> --
> 
> 	The last part of newidle_balance() does a bunch of accounting
> 	which is relevant after the above changes. Also the
> 	rq_repin_lock() I had removed now happens here.
> 
> --
> 
> Again, most of this is lightly tested with just one workload, but I would
> like to hear your thoughts, especially on the significance of
> "rd->overload", "max_newidle_lb_cost", and "update_next_balance()".
> However, I'm afraid these may be the bits that led to the drop in
> utilization you mentioned in the first place.

Exactly :-( I think your proposal to fix how we skip load balancing on
the LLC if SHARED_RUNQ is enabled makes sense, but I'd really prefer to
avoid adding these heuristics to avoid contention for specific
workloads. The goal of SHARED_RUNQ is really to drive up CPU util. I
don't think we're really doing the user many favors if we try to guess
for them when they actually want that to happen.

If there's a time where we _really_ don't want or need to do it then
absolutely, let's skip it. But I would really like to see this go in
without checks on max_newidle_lb_cost, etc. The whole point of
SHARED_RUNQ is that it should be less costly than iterating over O(n)
cores to find tasks, so we _want_ to be more aggressive in doing so.

> Most of the experimentation (except for rq lock contention using IBS)
> was done by reading the newidle_balance() code.

And kudos for doing so!

> Finally, a look at the newidle_balance() counts (tip vs tip + v3 + diff)
> for 128 clients of tbench on the test machine:
> 
> 
> < ----------------------------------------  Category:  newidle (SMT)  ---------------------------------------- >
> load_balance cnt on cpu newly idle                         :     921871,   0	(diff: -100.00%)
> --
> < ----------------------------------------  Category:  newidle (MC)   ---------------------------------------- >
> load_balance cnt on cpu newly idle                         :     472412,   0	(diff: -100.00%)
> --
> < ----------------------------------------  Category:  newidle (DIE)  ---------------------------------------- >
> load_balance cnt on cpu newly idle                         :        114, 279	(diff: +144.74%)
> --
> < ----------------------------------------  Category:  newidle (NUMA) ---------------------------------------- >
> load_balance cnt on cpu newly idle                         :          9,   9	(diff: +00.00%)
> --
> 
> Let me know if you have any queries. I'll go back and try to bisect the
> diff to see if only a couple of the changes that I thought were important
> are good enough to win back the lost performance. I'll do wider
> testing after hearing your thoughts.

Hopefully my feedback above gives you enough context to bisect and
find the changes that we really think are most helpful. To reiterate: I
definitely think your change to avoid iterating over the LLC sd is
correct and makes sense. Others possibly do as well such as checking
rd->overload (though not 100% sure), but others such as the
max_newidle_lb_cost checks I would strongly prefer to avoid.

Prateek -- thank you for doing all of this work, it's very much
appreciated.

As I mentioned on the other email, I'll be on vacation for about a week
and a half starting tomorrow afternoon, so I may be slow to respond in
that time.

Thanks,
David
  
Aboorva Devarajan Nov. 27, 2023, 8:28 a.m. UTC | #3
On Wed, 2023-08-09 at 17:12 -0500, David Vernet wrote:

Hi David,

I have been benchmarking the patch set on a POWER9 machine to understand
its impact. However, I've run into a recurring hard lockup in
newidle_balance(), specifically when the SHARED_RUNQ feature is enabled.
It doesn't happen all the time, but it's something worth noting. I wanted
to inform you about this, and I can provide more details if needed.

-----------------------------------------

Some initial information regarding the hard lockup:

Base Kernel:
-----------

Base kernel is up to commit 88c56cfeaec4 ("sched/fair: Block nohz
tick_stop when cfs bandwidth in use").

Patched Kernel:
-------------

Base kernel + v3 (shared runqueue patch set):
https://lore.kernel.org/all/20230809221218.163894-1-void@manifault.com/

The hard lockup mostly occurs when running the Apache2 benchmark with
ab (the Apache HTTP benchmarking tool) on the patched kernel. However,
the problem is not exclusive to this benchmark, and it only occurs while
the SHARED_RUNQ feature is enabled. Disabling the SHARED_RUNQ feature
prevents the lockup.

ab (Apache HTTP benchmarking tool): 
https://httpd.apache.org/docs/2.4/programs/ab.html

Hardlockup with Patched Kernel:
------------------------------

[ 3289.727912][  C123] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 3289.727943][  C123] rcu: 	124-...0: (1 GPs behind) idle=f174/1/0x4000000000000000 softirq=12283/12289 fqs=732
[ 3289.727976][  C123] rcu: 	(detected by 123, t=2103 jiffies, g=127061, q=5517 ncpus=128)
[ 3289.728008][  C123] Sending NMI from CPU 123 to CPUs 124:
[ 3295.182378][  C123] CPU 124 didn't respond to backtrace IPI, inspecting paca.
[ 3295.182403][  C123] irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 15 (ksoftirqd/124)
[ 3295.182421][  C123] Back trace of paca->saved_r1 (0xc000000de13e79b0) (possibly stale):
[ 3295.182437][  C123] Call Trace:
[ 3295.182456][  C123] [c000000de13e79b0] [c000000de13e7a70] 0xc000000de13e7a70 (unreliable)
[ 3295.182477][  C123] [c000000de13e7ac0] [0000000000000008] 0x8
[ 3295.182500][  C123] [c000000de13e7b70] [c000000de13e7c98] 0xc000000de13e7c98
[ 3295.182519][  C123] [c000000de13e7ba0] [c0000000001da8bc] move_queued_task+0x14c/0x280
[ 3295.182557][  C123] [c000000de13e7c30] [c0000000001f22d8] newidle_balance+0x648/0x940
[ 3295.182602][  C123] [c000000de13e7d30] [c0000000001f26ac] pick_next_task_fair+0x7c/0x680
[ 3295.182647][  C123] [c000000de13e7dd0] [c0000000010f175c] __schedule+0x15c/0x1040
[ 3295.182675][  C123] [c000000de13e7ec0] [c0000000010f26b4] schedule+0x74/0x140
[ 3295.182694][  C123] [c000000de13e7f30] [c0000000001c4994] smpboot_thread_fn+0x244/0x250
[ 3295.182731][  C123] [c000000de13e7f90] [c0000000001bc6e8] kthread+0x138/0x140
[ 3295.182769][  C123] [c000000de13e7fe0] [c00000000000ded8] start_kernel_thread+0x14/0x18
[ 3295.182806][  C123] rcu: rcu_sched kthread starved for 544 jiffies! g127061 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=66
[ 3295.182845][  C123] rcu: 	Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 3295.182878][  C123] rcu: RCU grace-period kthread stack dump:

-----------------------------------------

[ 3943.438625][  C112] watchdog: CPU 112 self-detected hard LOCKUP @ _raw_spin_lock_irqsave+0x4c/0xc0
[ 3943.438631][  C112] watchdog: CPU 112 TB:115060212303626, last heartbeat TB:115054309631589 (11528ms ago)
[ 3943.438673][  C112] CPU: 112 PID: 2090 Comm: kworker/112:2 Tainted: G        W    L     6.5.0-rc2-00028-g7475adccd76b #51
[ 3943.438676][  C112] Hardware name: 8335-GTW POWER9 (raw) 0x4e1203 opal:skiboot-v6.5.3-35-g1851b2a06 PowerNV
[ 3943.438678][  C112] Workqueue:  0x0 (events)
[ 3943.438682][  C112] NIP:  c0000000010ff01c LR: c0000000001d1064 CTR: c0000000001e8580
[ 3943.438684][  C112] REGS: c000007fffb6bd60 TRAP: 0900   Tainted: G        W    L      (6.5.0-rc2-00028-g7475adccd76b)
[ 3943.438686][  C112] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24082222  XER: 00000000
[ 3943.438693][  C112] CFAR: 0000000000000000 IRQMASK: 1 
[ 3943.438693][  C112] GPR00: c0000000001d1064 c000000e16d1fb20 c0000000014e8200 c000000e092fed3c 
[ 3943.438693][  C112] GPR04: c000000e16d1fc58 c000000e092fe3c8 00000000000000e1 fffffffffffe0000 
[ 3943.438693][  C112] GPR08: 0000000000000000 00000000000000e1 0000000000000000 c00000000299ccd8 
[ 3943.438693][  C112] GPR12: 0000000024088222 c000007ffffb8300 c0000000001bc5b8 c000000deb46f740 
[ 3943.438693][  C112] GPR16: 0000000000000008 c000000e092fe280 0000000000000001 c000007ffedd7b00 
[ 3943.438693][  C112] GPR20: 0000000000000001 c0000000029a1280 0000000000000000 0000000000000001 
[ 3943.438693][  C112] GPR24: 0000000000000000 c000000e092fed3c c000000e16d1fdf0 c00000000299ccd8 
[ 3943.438693][  C112] GPR28: c000000e16d1fc58 c0000000021fbf00 c000007ffee6bf00 0000000000000001 
[ 3943.438722][  C112] NIP [c0000000010ff01c] _raw_spin_lock_irqsave+0x4c/0xc0
[ 3943.438725][  C112] LR [c0000000001d1064] task_rq_lock+0x64/0x1b0
[ 3943.438727][  C112] Call Trace:
[ 3943.438728][  C112] [c000000e16d1fb20] [c000000e16d1fb60] 0xc000000e16d1fb60 (unreliable)
[ 3943.438731][  C112] [c000000e16d1fb50] [c000000e16d1fbf0] 0xc000000e16d1fbf0
[ 3943.438733][  C112] [c000000e16d1fbf0] [c0000000001f214c] newidle_balance+0x4bc/0x940
[ 3943.438737][  C112] [c000000e16d1fcf0] [c0000000001f26ac] pick_next_task_fair+0x7c/0x680
[ 3943.438739][  C112] [c000000e16d1fd90] [c0000000010f175c] __schedule+0x15c/0x1040
[ 3943.438743][  C112] [c000000e16d1fe80] [c0000000010f26b4] schedule+0x74/0x140
[ 3943.438747][  C112] [c000000e16d1fef0] [c0000000001afd44] worker_thread+0x134/0x580
[ 3943.438749][  C112] [c000000e16d1ff90] [c0000000001bc6e8] kthread+0x138/0x140
[ 3943.438753][  C112] [c000000e16d1ffe0] [c00000000000ded8] start_kernel_thread+0x14/0x18
[ 3943.438756][  C112] Code: 63e90001 992d0932 a12d0008 3ce0fffe 5529083c 61290001 7d001

-----------------------------------------

System configuration:
--------------------

# lscpu
Architecture:                    ppc64le
Byte Order:                      Little Endian
CPU(s):                          128
On-line CPU(s) list:             0-127
Thread(s) per core:              4
Core(s) per socket:              16
Socket(s):                       2
NUMA node(s):                    8
Model:                           2.3 (pvr 004e 1203)
Model name:                      POWER9 (raw), altivec supported
Frequency boost:                 enabled
CPU max MHz:                     3800.0000
CPU min MHz:                     2300.0000
L1d cache:                       1 MiB
L1i cache:                       1 MiB
NUMA node0 CPU(s):               64-127
NUMA node8 CPU(s):               0-63
NUMA node250 CPU(s):             
NUMA node251 CPU(s):             
NUMA node252 CPU(s):             
NUMA node253 CPU(s):             
NUMA node254 CPU(s):             
NUMA node255 CPU(s):             

# uname -r
6.5.0-rc2-00028-g7475adccd76b

# cat /sys/kernel/debug/sched/features
GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY
CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_HRTICK_DL NO_DOUBLE_TICK
NONTASK_CAPACITY TTWU_QUEUE NO_SIS_PROP SIS_UTIL NO_WARN_DOUBLE_CLOCK
RT_PUSH_IPI NO_RT_RUNTIME_SHARE NO_LB_MIN ATTACH_AGE_LOAD WA_IDLE
WA_WEIGHT WA_BIAS UTIL_EST UTIL_EST_FASTUP NO_LATENCY_WARN ALT_PERIOD
BASE_SLICE HZ_BW SHARED_RUNQ

-----------------------------------------

Please let me know if I've missed anything here. I'll continue
investigating and share any additional information I find.

Thanks and Regards,
Aboorva


> Changes
> -------
> 
> This is v3 of the shared runqueue patchset. This patch set is based
> off
> of commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
> bandwidth in use") on the sched/core branch of tip.git.
> 
> v1 (RFC): 
> https://lore.kernel.org/lkml/20230613052004.2836135-1-void@manifault.com/
> v2: 
> https://lore.kernel.org/lkml/20230710200342.358255-1-void@manifault.com/
> 
> v2 -> v3 changes:
> - Don't leave stale tasks in the lists when the SHARED_RUNQ feature
> is
>   disabled (Abel Wu)
> 
> - Use raw spin lock instead of spinlock_t (Peter)
> 
> - Fix return value from shared_runq_pick_next_task() to match the
>   semantics expected by newidle_balance() (Gautham, Abel)
> 
> - Fold patch __enqueue_entity() / __dequeue_entity() into previous
> patch
>   (Peter)
> 
> - Skip <= LLC domains in newidle_balance() if SHARED_RUNQ is enabled
>   (Peter)
> 
> - Properly support hotplug and recreating sched domains (Peter)
> 
> - Avoid unnecessary task_rq_unlock() + raw_spin_rq_lock() when src_rq
> ==
>   target_rq in shared_runq_pick_next_task() (Abel)
> 
> - Only issue list_del_init() in shared_runq_dequeue_task() if the
> task
>   is still in the list after acquiring the lock (Aaron Lu)
> 
> - Slightly change shared_runq_shard_idx() to make it more likely to
> keep
>   SMT siblings on the same bucket (Peter)
> 
> v1 -> v2 changes:
> - Change name from swqueue to shared_runq (Peter)
> 
> - Shard per-LLC shared runqueues to avoid contention on scheduler-
> heavy
>   workloads (Peter)
> 
> - Pull tasks from the shared_runq in newidle_balance() rather than in
>   pick_next_task_fair() (Peter and Vincent)
> 
> - Rename a few functions to reflect their actual purpose. For
> example,
>   shared_runq_dequeue_task() instead of swqueue_remove_task() (Peter)
> 
> - Expose move_queued_task() from core.c rather than migrate_task_to()
>   (Peter)
> 
> - Properly check is_cpu_allowed() when pulling a task from a
> shared_runq
>   to ensure it can actually be migrated (Peter and Gautham)
> 
> - Dropped RFC tag
> 
> Overview
> ========
> 
> The scheduler must constantly strike a balance between work
> conservation, and avoiding costly migrations which harm performance
> due
> to e.g. decreased cache locality. The matter is further complicated
> by
> the topology of the system. Migrating a task between cores on the
> same
> LLC may be more optimal than keeping a task local to the CPU, whereas
> migrating a task between LLCs or NUMA nodes may tip the balance in
> the
> other direction.
> 
> With that in mind, while CFS is by and large mostly a work conserving
> scheduler, there are certain instances where the scheduler will
> choose
> to keep a task local to a CPU, when it would have been more optimal
> to
> migrate it to an idle core.
> 
> An example of such a workload is the HHVM / web workload at Meta.
> HHVM
> is a VM that JITs Hack and PHP code in service of web requests. Like
> other JIT / compilation workloads, it tends to be heavily CPU bound,
> and
> exhibit generally poor cache locality. To try and address this, we
> set
> several debugfs (/sys/kernel/debug/sched) knobs on our HHVM
> workloads:
> 
> - migration_cost_ns -> 0
> - latency_ns -> 20000000
> - min_granularity_ns -> 10000000
> - wakeup_granularity_ns -> 12000000
> 
> These knobs are intended both to encourage the scheduler to be as
> work
> conserving as possible (migration_cost_ns -> 0), and also to keep
> tasks
> running for relatively long time slices so as to avoid the overhead
> of
> context switching (the other knobs). Collectively, these knobs
> provide a
> substantial performance win; resulting in roughly a 20% improvement
> in
> throughput. Worth noting, however, is that this improvement is _not_
> at
> full machine saturation.
> 
> That said, even with these knobs, we noticed that CPUs were still
> going
> idle even when the host was overcommitted. In response, we wrote the
> "shared runqueue" (SHARED_RUNQ) feature proposed in this patch set.
> The
> idea behind SHARED_RUNQ is simple: it enables the scheduler to be
> more
> aggressively work conserving by placing a waking task into a sharded
> per-LLC FIFO queue that can be pulled from by another core in the LLC
> FIFO queue which can then be pulled from before it goes idle.
> 
> With this simple change, we were able to achieve a 1 - 1.6%
> improvement
> in throughput, as well as a small, consistent improvement in p95 and
> p99
> latencies, in HHVM. These performance improvements were in addition
> to
> the wins from the debugfs knobs mentioned above, and to other
> benchmarks
> outlined below in the Results section.
> 
> Design
> ======
> 
> Note that the design described here reflects sharding, which is the
> implementation added in the final patch of the series (following the
> initial unsharded implementation added in patch 6/7). The design is
> described that way in this commit summary as the benchmarks described
> in
> the results section below all reflect a sharded SHARED_RUNQ.
> 
> The design of SHARED_RUNQ is quite simple. A shared_runq is simply a
> list of struct shared_runq_shard objects, which itself is simply a
> struct list_head of tasks, and a spinlock:
> 
> struct shared_runq_shard {
> 	struct list_head list;
> 	raw_spinlock_t lock;
> } ____cacheline_aligned;
> 
> struct shared_runq {
> 	u32 num_shards;
> 	struct shared_runq_shard shards[];
> } ____cacheline_aligned;
> 
> We create a struct shared_runq per LLC, ensuring they're in their own
> cachelines to avoid false sharing between CPUs on different LLCs, and
> we
> create a number of struct shared_runq_shard objects that are housed
> there.
> 
> When a task first wakes up, it enqueues itself in the
> shared_runq_shard
> of its current LLC at the end of enqueue_task_fair(). Enqueues only
> happen if the task was not manually migrated to the current core by
> select_task_rq(), and is not pinned to a specific CPU.
> 
> A core will pull a task from the shards in its LLC's shared_runq at
> the
> beginning of newidle_balance().
> 
> Difference between SHARED_RUNQ and SIS_NODE
> ===========================================
> 
> In [0] Peter proposed a patch that addresses Tejun's observations
> that
> when workqueues are targeted towards a specific LLC on his Zen2
> machine
> with small CCXs, that there would be significant idle time due to
> select_idle_sibling() not considering anything outside of the current
> LLC.
> 
> This patch (SIS_NODE) is essentially the complement to the proposal
> here. SID_NODE causes waking tasks to look for idle cores in
> neighboring
> LLCs on the same die, whereas SHARED_RUNQ causes cores about to go
> idle
> to look for enqueued tasks. That said, in its current form, the two
> features at are a different scope as SIS_NODE searches for idle cores
> between LLCs, while SHARED_RUNQ enqueues tasks within a single LLC.
> 
> The patch was since removed in [1], and we compared the results to
> SHARED_RUNQ (previously called "swqueue") in [2]. SIS_NODE did not
> outperform SHARED_RUNQ on any of the benchmarks, so we elect to not
> compare against it again for this v2 patch set.
> 
> [0]: 
> https://lore.kernel.org/all/20230530113249.GA156198@hirez.programming.kicks-ass.net/
> [1]: 
> https://lore.kernel.org/all/20230605175636.GA4253@hirez.programming.kicks-ass.net/
> [2]: 
> https://lore.kernel.org/lkml/20230613052004.2836135-1-void@manifault.com/
> 
> Worth noting as well is that pointed out in [3] that the logic behind
> including SIS_NODE in the first place should apply to SHARED_RUNQ
> (meaning that e.g. very small Zen2 CPUs with only 3/4 cores per LLC
> should benefit from having a single shared_runq stretch across
> multiple
> LLCs). I drafted a patch that implements this by having a minimum LLC
> size for creating a shard, and stretches a shared_runq across
> multiple
> LLCs if they're smaller than that size, and sent it to Tejun to test
> on
> his Zen2. Tejun reported back that SIS_NODE did not seem to make a
> difference:
> 
> [3]: 
> https://lore.kernel.org/lkml/20230711114207.GK3062772@hirez.programming.kicks-ass.net/
> 
> 			    o____________o__________o
> 			    |    mean    | Variance |
> 			    o------------o----------o
> Vanilla:		    | 108.84s    | 0.0057   |
> NO_SHARED_RUNQ:		    | 108.82s    | 0.119s   |
> SHARED_RUNQ:		    | 108.17s    | 0.038s   |
> SHARED_RUNQ w/ SIS_NODE:    | 108.87s    | 0.111s   |
> 			    o------------o----------o
> 
> I similarly tried running kcompile on SHARED_RUNQ with SIS_NODE on my
> 7950X Zen3, but didn't see any gain relative to plain SHARED_RUNQ
> (though
> a gain was observed relative to NO_SHARED_RUNQ, as described below).
> 
> Results
> =======
> 
> Note that the motivation for the shared runqueue feature was
> originally
> arrived at using experiments in the sched_ext framework that's
> currently
> being proposed upstream. The ~1 - 1.6% improvement in HHVM throughput
> is similarly visible using work-conserving sched_ext schedulers (even
> very simple ones like global FIFO).
> 
> In both single and multi socket / CCX hosts, this can measurably
> improve
> performance. In addition to the performance gains observed on our
> internal web workloads, we also observed an improvement in common
> workloads such as kernel compile and hackbench, when running shared
> runqueue.
> 
> On the other hand, some workloads suffer from SHARED_RUNQ. Workloads
> that hammer the runqueue hard, such as netperf UDP_RR, or schbench -L
> -m 52 -p 512 -r 10 -t 1. This can be mitigated somewhat by sharding
> the
> shared datastructures within a CCX, but it doesn't seem to eliminate
> all
> contention in every scenario. On the positive side, it seems that
> sharding does not materially harm the benchmarks run for this patch
> series; and in fact seems to improve some workloads such as kernel
> compile.
> 
> Note that for the kernel compile workloads below, the compilation was
> done by running make -j$(nproc) built-in.a on several different types
> of
> hosts configured with make allyesconfig on commit a27648c74210 ("afs:
> Fix setting of mtime when creating a file/dir/symlink") on Linus'
> tree
> (boost and turbo were disabled on all of these hosts when the
> experiments were performed).
> 
> Finally, note that these results were from the patch set built off of
> commit ebb83d84e49b ("sched/core: Avoid multiple calling
> update_rq_clock() in __cfsb_csd_unthrottle()") on the sched/core
> branch
> of tip.git for easy comparison with the v2 patch set results. The
> patches in their final form from this set were rebased onto commit
> 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs bandwidth in
> use") on the sched/core branch of tip.git.
> 
> === Single-socket | 16 core / 32 thread | 2-CCX | AMD 7950X Zen4 ===
> 
> CPU max MHz: 5879.8818
> CPU min MHz: 3000.0000
> 
> Command: make -j$(nproc) built-in.a
> 			    o____________o__________o
> 			    |    mean    | Variance |
> 			    o------------o----------o
> NO_SHARED_RUNQ:		    | 581.95s    | 2.639s   |
> SHARED_RUNQ:		    | 577.02s    | 0.084s   |
> 			    o------------o----------o
> 
> Takeaway: SHARED_RUNQ results in a statistically significant ~.85%
> improvement over NO_SHARED_RUNQ. This suggests that enqueuing tasks
> in
> the shared runqueue on every enqueue improves work conservation, and
> thanks to sharding, does not result in contention.
> 
> Command: hackbench --loops 10000
>                             o____________o__________o
>                             |    mean    | Variance |
>                             o------------o----------o
> NO_SHARED_RUNQ:             | 2.2492s    | .00001s  |
> SHARED_RUNQ:		    | 2.0217s    | .00065s  |
>                             o------------o----------o
> 
> Takeaway: SHARED_RUNQ in both forms performs exceptionally well
> compared
> to NO_SHARED_RUNQ here, beating it by over 10%. This was a surprising
> result given that it seems advantageous to err on the side of
> avoiding
> migration in hackbench given that tasks are short lived in sending
> only
> 10k bytes worth of messages, but the results of the benchmark would
> seem
> to suggest that minimizing runqueue delays is preferable.
> 
> Command:
> for i in `seq 128`; do
>     netperf -6 -t UDP_RR -c -C -l $runtime &
> done
>                             o_______________________o
>                             | Throughput | Variance |
>                             o-----------------------o
> NO_SHARED_RUNQ:             | 25037.45   | 2243.44  |
> SHARED_RUNQ:                | 24952.50   | 1268.06  |
>                             o-----------------------o
> 
> Takeaway: No statistical significance, though it is worth noting that
> there is no regression for shared runqueue on the 7950X, while there
> is
> a small regression on the Skylake and Milan hosts for SHARED_RUNQ as
> described below.
> 
> === Single-socket | 18 core / 36 thread | 1-CCX | Intel Skylake ===
> 
> CPU max MHz: 1601.0000
> CPU min MHz: 800.0000
> 
> Command: make -j$(nproc) built-in.a
> 			    o____________o__________o
> 			    |    mean    | Variance |
> 			    o------------o----------o
> NO_SHARED_RUNQ:		    | 1517.44s   | 2.8322s  |
> SHARED_RUNQ:		    | 1516.51s   | 2.9450s  |
> 			    o------------o----------o
> 
> Takeaway: There's on statistically significant gain here. I observed
> what I claimed was a .23% win in v2, but it appears that this is not
> actually statistically significant.
> 
> Command: hackbench --loops 10000
>                             o____________o__________o
>                             |    mean    | Variance |
>                             o------------o----------o
> NO_SHARED_RUNQ:             | 5.3370s    | .0012s   |
> SHARED_RUNQ:		    | 5.2668s    | .0033s   |
>                             o------------o----------o
> 
> Takeaway: SHARED_RUNQ results in a ~1.3% improvement over
> NO_SHARED_RUNQ. Also statistically significant, but smaller than the
> 10+% improvement observed on the 7950X.
> 
> Command: netperf -n $(nproc) -l 60 -t TCP_RR
> for i in `seq 128`; do
>         netperf -6 -t UDP_RR -c -C -l $runtime &
> done
>                             o_______________________o
>                             | Throughput | Variance |
>                             o-----------------------o
> NO_SHARED_RUNQ:             | 15699.32   | 377.01   |
> SHARED_RUNQ:                | 14966.42   | 714.13   |
>                             o-----------------------o
> 
> Takeaway: NO_SHARED_RUNQ beats SHARED_RUNQ by ~4.6%. This result
> makes
> sense -- the workload is very heavy on the runqueue, so enqueuing
> tasks
> in the shared runqueue in __enqueue_entity() would intuitively result
> in
> increased contention on the shard lock.
> 
> === Single-socket | 72-core | 6-CCX | AMD Milan Zen3 ===
> 
> CPU max MHz: 700.0000
> CPU min MHz: 700.0000
> 
> Command: make -j$(nproc) built-in.a
> 			    o____________o__________o
> 			    |    mean    | Variance |
> 			    o------------o----------o
> NO_SHARED_RUNQ:		    | 1568.55s   | 0.1568s  |
> SHARED_RUNQ:		    | 1568.26s   | 1.2168s  |
> 			    o------------o----------o
> 
> Takeaway: No statistically significant difference here. It might be
> worth experimenting with work stealing in a follow-on patch set.
> 
> Command: hackbench --loops 10000
>                             o____________o__________o
>                             |    mean    | Variance |
>                             o------------o----------o
> NO_SHARED_RUNQ:             | 5.2716s    | .00143s  |
> SHARED_RUNQ:		    | 5.1716s    | .00289s  |
>                             o------------o----------o
> 
> Takeaway: SHARED_RUNQ again wins, by about 2%.
> 
> Command: netperf -n $(nproc) -l 60 -t TCP_RR
> for i in `seq 128`; do
>         netperf -6 -t UDP_RR -c -C -l $runtime &
> done
>                             o_______________________o
>                             | Throughput | Variance |
>                             o-----------------------o
> NO_SHARED_RUNQ:             | 17482.03   | 4675.99  |
> SHARED_RUNQ:                | 16697.25   | 9812.23  |
>                             o-----------------------o
> 
> Takeaway: Similar to the Skylake runs, NO_SHARED_RUNQ still beats
> SHARED_RUNQ, this time by ~4.5%. It's worth noting that in v2, the
> NO_SHARED_RUNQ was only ~1.8% faster. The variance is very high here,
> so
> the results of this benchmark should be taken with a large grain of
> salt (noting that we do consistently see NO_SHARED_RUNQ on top due to
> not contending on the shard lock).
> 
> Finally, let's look at how sharding affects the following schbench
> incantation suggested by Chris in [4]:
> 
> schbench -L -m 52 -p 512 -r 10 -t 1
> 
> [4]: 
> https://lore.kernel.org/lkml/c8419d9b-2b31-2190-3058-3625bdbcb13d@meta.com/
> 
> The TL;DR is that sharding improves things a lot, but doesn't
> completely
> fix the problem. Here are the results from running the schbench
> command
> on the 18 core / 36 thread single CCX, single-socket Skylake:
> 
> -------------------------------------------------------------------
> -------------------------------------------------------------------
> -----------------------------------------------------------------
> class name         con-bounces    contentions       waittime-
> min   waittime-max waittime-total   waittime-avg    acq-
> bounces   acquisitions   holdtime-min   holdtime-max holdtime-
> total   holdtime-avg
> -------------------------------------------------------------------
> -------------------------------------------------------------------
> -----------------------------------------------------------------
> 
> &shard-
> >lock:      31510503       31510711           0.08          19.98    
>     168932319.64     5.36            31700383      31843851       0.0
> 3           17.50        10273968.33      0.32
> ------------
> &shard->lock       15731657          [<0000000068c0fd75>]
> pick_next_task_fair+0x4dd/0x510
> &shard->lock       15756516          [<000000001faf84f9>]
> enqueue_task_fair+0x459/0x530
> &shard->lock          21766          [<00000000126ec6ab>]
> newidle_balance+0x45a/0x650
> &shard->lock            772          [<000000002886c365>]
> dequeue_task_fair+0x4c9/0x540
> ------------
> &shard->lock          23458          [<00000000126ec6ab>]
> newidle_balance+0x45a/0x650
> &shard->lock       16505108          [<000000001faf84f9>]
> enqueue_task_fair+0x459/0x530
> &shard->lock       14981310          [<0000000068c0fd75>]
> pick_next_task_fair+0x4dd/0x510
> &shard->lock            835          [<000000002886c365>]
> dequeue_task_fair+0x4c9/0x540
> 
> These results are when we create only 3 shards (16 logical cores per
> shard), so the contention may be a result of overly-coarse sharding.
> If
> we run the schbench incantation with no sharding whatsoever, we see
> the
> following significantly worse lock stats contention:
> 
> -------------------------------------------------------------------
> -------------------------------------------------------------------
> -------------------------------------------------------------------
> ----
> class name        con-bounces    contentions         waittime-
> min   waittime-max waittime-total         waittime-avg    acq-
> bounces   acquisitions   holdtime-min  holdtime-max holdtime-
> total   holdtime-avg
> -------------------------------------------------------------------
> -------------------------------------------------------------------
> -------------------------------------------------------------------
> ----
> 
> &shard-
> >lock:     117868635      118361486           0.09           393.01  
>      1250954097.25          10.57           119345882     119780601  
>     0.05          343.35       38313419.51      0.32
> ------------
> &shard->lock       59169196          [<0000000060507011>]
> __enqueue_entity+0xdc/0x110
> &shard->lock       59084239          [<00000000f1c67316>]
> __dequeue_entity+0x78/0xa0
> &shard->lock         108051          [<00000000084a6193>]
> newidle_balance+0x45a/0x650
> ------------
> &shard->lock       60028355          [<0000000060507011>]
> __enqueue_entity+0xdc/0x110
> &shard->lock         119882          [<00000000084a6193>]
> newidle_balance+0x45a/0x650
> &shard->lock       58213249          [<00000000f1c67316>]
> __dequeue_entity+0x78/0xa0
> 
> The contention is ~3-4x worse if we don't shard at all. This roughly
> matches the fact that we had 3 shards on the first workload run
> above.
> If we make the shards even smaller, the contention is comparably much
> lower:
> 
> -------------------------------------------------------------------
> -------------------------------------------------------------------
> ----------------------------------------------------------
> class name         con-bounces    contentions   waittime-
> min  waittime-max waittime-total   waittime-avg   acq-
> bounces   acquisitions   holdtime-min  holdtime-max holdtime-
> total   holdtime-avg
> -------------------------------------------------------------------
> -------------------------------------------------------------------
> ----------------------------------------------------------
> 
> &shard-
> >lock:      13839849       13877596      0.08          13.23        5
> 389564.95       0.39           46910241      48069307       0.06     
>      16.40        16534469.35      0.34
> ------------
> &shard->lock           3559          [<00000000ea455dcc>]
> newidle_balance+0x45a/0x650
> &shard->lock        6992418          [<000000002266f400>]
> __dequeue_entity+0x78/0xa0
> &shard->lock        6881619          [<000000002a62f2e0>]
> __enqueue_entity+0xdc/0x110
> ------------
> &shard->lock        6640140          [<000000002266f400>]
> __dequeue_entity+0x78/0xa0
> &shard->lock           3523          [<00000000ea455dcc>]
> newidle_balance+0x45a/0x650
> &shard->lock        7233933          [<000000002a62f2e0>]
> __enqueue_entity+0xdc/0x110
> 
> Interestingly, SHARED_RUNQ performs worse than NO_SHARED_RUNQ on the
> schbench
> benchmark on Milan as well, but we contend more on the rq lock than
> the
> shard lock:
> 
> -------------------------------------------------------------------
> -------------------------------------------------------------------
> -----------------------------------------------------------
> class name         con-bounces    contentions   waittime-
> min  waittime-max waittime-total   waittime-avg   acq-
> bounces   acquisitions   holdtime-min   holdtime-max holdtime-
> total   holdtime-avg
> -------------------------------------------------------------------
> -------------------------------------------------------------------
> -----------------------------------------------------------
> 
> &rq-
> >__lock:       9617614        9656091       0.10          79.64      
>   69665812.00      7.21           18092700      67652829       0.11  
>          82.38        344524858.87     5.09
> -----------
> &rq->__lock        6301611          [<000000003e63bf26>]
> task_rq_lock+0x43/0xe0
> &rq->__lock        2530807          [<00000000516703f0>]
> __schedule+0x72/0xaa0
> &rq->__lock         109360          [<0000000011be1562>]
> raw_spin_rq_lock_nested+0xa/0x10
> &rq->__lock         178218          [<00000000c38a30f9>]
> sched_ttwu_pending+0x3d/0x170
> -----------
> &rq->__lock        3245506          [<00000000516703f0>]
> __schedule+0x72/0xaa0
> &rq->__lock        1294355          [<00000000c38a30f9>]
> sched_ttwu_pending+0x3d/0x170
> &rq->__lock        2837804          [<000000003e63bf26>]
> task_rq_lock+0x43/0xe0
> &rq->__lock        1627866          [<0000000011be1562>]
> raw_spin_rq_lock_nested+0xa/0x10
> 
> .....................................................................
> .....................................................................
> ........................................................
> 
> &shard-
> >lock:       7338558       7343244       0.10          35.97        7
> 173949.14       0.98           30200858      32679623       0.08     
>       35.59        16270584.52      0.50
> ------------
> &shard->lock        2004142          [<00000000f8aa2c91>]
> __dequeue_entity+0x78/0xa0
> &shard->lock        2611264          [<00000000473978cc>]
> newidle_balance+0x45a/0x650
> &shard->lock        2727838          [<0000000028f55bb5>]
> __enqueue_entity+0xdc/0x110
> ------------
> &shard->lock        2737232          [<00000000473978cc>]
> newidle_balance+0x45a/0x650
> &shard->lock        1693341          [<00000000f8aa2c91>]
> __dequeue_entity+0x78/0xa0
> &shard->lock        2912671          [<0000000028f55bb5>]
> __enqueue_entity+0xdc/0x110
> 
> .....................................................................
> .....................................................................
> .........................................................
> 
> If we look at the lock stats with SHARED_RUNQ disabled, the rq lock
> still
> contends the most, but it's significantly less than with it enabled:
> 
> -------------------------------------------------------------------
> -------------------------------------------------------------------
> --------------------------------------------------------------
> class name          con-bounces    contentions   waittime-
> min   waittime-max waittime-total   waittime-avg    acq-
> bounces   acquisitions   holdtime-min   holdtime-max holdtime-
> total   holdtime-avg
> -------------------------------------------------------------------
> -------------------------------------------------------------------
> --------------------------------------------------------------
> 
> &rq-
> >__lock:        791277         791690        0.12           110.54   
>     4889787.63       6.18            1575996       62390275       0.1
> 3           112.66       316262440.56     5.07
> -----------
> &rq->__lock         263343          [<00000000516703f0>]
> __schedule+0x72/0xaa0
> &rq->__lock          19394          [<0000000011be1562>]
> raw_spin_rq_lock_nested+0xa/0x10
> &rq->__lock           4143          [<000000003b542e83>]
> __task_rq_lock+0x51/0xf0
> &rq->__lock          51094          [<00000000c38a30f9>]
> sched_ttwu_pending+0x3d/0x170
> -----------
> &rq->__lock          23756          [<0000000011be1562>]
> raw_spin_rq_lock_nested+0xa/0x10
> &rq->__lock         379048          [<00000000516703f0>]
> __schedule+0x72/0xaa0
> &rq->__lock            677          [<000000003b542e83>]
> __task_rq_lock+0x51/0xf0
> 
> Worth noting is that increasing the granularity of the shards (i.e.
> using more, smaller shards) generally improves very runqueue-heavy
> workloads such as netperf UDP_RR and this schbench command, but it
> doesn't necessarily make a big difference for every workload, or for
> sufficiently small CCXs such as the 7950X's. It may make sense to
> eventually allow users to control this with a debugfs knob, but for now
> we'll choose a default that resulted in good performance for the
> benchmarks run for this patch series.
> 
> Conclusion
> ==========
> 
> SHARED_RUNQ in this form provides statistically significant wins for
> several types of workloads, and various CPU topologies. The reason for
> this is roughly the same for all workloads: SHARED_RUNQ encourages work
> conservation inside of a CCX by having a CPU do an O(# per-LLC shards)
> iteration over the shared_runq shards in an LLC. We could similarly do
> an O(n) iteration over all of the runqueues in the current LLC when a
> core is going idle, but that's quite costly (especially for larger
> LLCs), and sharded SHARED_RUNQ seems to provide a performant middle
> ground between doing an O(n) walk, and doing an O(1) pull from a single
> per-LLC shared runq.
> 
> For the workloads above, kernel compile and hackbench were clear
> winners for SHARED_RUNQ (especially in __enqueue_entity()). The reason
> for the improvement in kernel compile is of course that we have a
> heavily CPU-bound workload where cache locality doesn't mean much;
> getting a CPU is the #1 goal. As mentioned above, while I didn't expect
> to see an improvement in hackbench, the results of the benchmark
> suggest that minimizing runqueue delays is preferable to optimizing for
> L1/L2 locality.
> 
> Not all workloads benefit from SHARED_RUNQ, however. Workloads that
> hammer the runqueue hard, such as netperf UDP_RR, or schbench -L -m 52
> -p 512 -r 10 -t 1, tend to run into contention on the shard locks;
> especially when enqueuing tasks in __enqueue_entity(). This can be
> mitigated significantly by sharding the shared datastructures within a
> CCX, but it doesn't eliminate all contention, as described above.
> 
> Worth noting as well is that Gautham Shenoy ran some interesting
> experiments on a few more ideas in [5], such as walking the shared_runq
> on the pop path until a task is found that can be migrated to the
> calling CPU. I didn't run those experiments in this patch set, but it
> might be worth doing so.
> 
> [5]: 
> https://lore.kernel.org/lkml/ZJkqeXkPJMTl49GB@BLR-5CG11610CF.amd.com/
> 
> Gautham also ran some other benchmarks in [6], which we may want to try
> again on this v3, but with boost disabled.
> 
> [6]: 
> https://lore.kernel.org/lkml/ZLpMGVPDXqWEu+gm@BLR-5CG11610CF.amd.com/
> 
> Finally, while SHARED_RUNQ in this form encourages work conservation,
> it of course does not guarantee it given that we don't implement any
> kind of work stealing between shared_runq's. In the future, we could
> potentially push CPU utilization even higher by enabling work stealing
> between shared_runq's, likely between CCXs on the same NUMA node.
> 
> Originally-by: Roman Gushchin <roman.gushchin@linux.dev>
> Signed-off-by: David Vernet <void@manifault.com>
> 
> David Vernet (7):
>   sched: Expose move_queued_task() from core.c
>   sched: Move is_cpu_allowed() into sched.h
>   sched: Check cpu_active() earlier in newidle_balance()
>   sched: Enable sched_feat callbacks on enable/disable
>   sched/fair: Add SHARED_RUNQ sched feature and skeleton calls
>   sched: Implement shared runqueue in CFS
>   sched: Shard per-LLC shared runqueues
> 
>  include/linux/sched.h   |   2 +
>  kernel/sched/core.c     |  52 ++----
>  kernel/sched/debug.c    |  18 ++-
>  kernel/sched/fair.c     | 340 +++++++++++++++++++++++++++++++++++++++-
>  kernel/sched/features.h |   1 +
>  kernel/sched/sched.h    |  56 ++++++-
>  kernel/sched/topology.c |   4 +-
>  7 files changed, 420 insertions(+), 53 deletions(-)
>
  
David Vernet Nov. 27, 2023, 7:49 p.m. UTC | #4
On Mon, Nov 27, 2023 at 01:58:34PM +0530, Aboorva Devarajan wrote:
> On Wed, 2023-08-09 at 17:12 -0500, David Vernet wrote:
> 
> Hi David,
> 
> I have been benchmarking the patch-set on a POWER9 machine to understand
> its impact. However, I've run into a recurring hard-lockup in
> newidle_balance, specifically when the SHARED_RUNQ feature is enabled. It
> doesn't happen all the time, but it's something worth noting. I wanted
> to inform you about this, and I can provide more details if needed.

Hello Aboorva,

Thank you for testing out this patch set and for the report. One issue
that v4 will correct is that the shared_runq list could become corrupted
if you enable and disable the feature, as a stale task could remain in
the list after the feature has been disabled. I'll be including a fix
for that in v4, which I'm currently benchmarking, but other stuff keeps
seeming to preempt it.
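
To make the failure mode concrete, here's a rough sketch of the relevant
paths. This is purely illustrative; the helper and field names below
(e.g. task_shard() and p->shared_runq_node) are approximations rather
than the literal patch code:

static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p)
{
	/* Hypothetical helper mapping a task to its per-LLC shard. */
	struct shared_runq_shard *shard = task_shard(rq, p);

	raw_spin_lock(&shard->lock);
	list_add_tail(&p->shared_runq_node, &shard->list);
	raw_spin_unlock(&shard->lock);
}

static void shared_runq_dequeue_task(struct task_struct *p)
{
	struct shared_runq_shard *shard = task_shard(task_rq(p), p);

	raw_spin_lock(&shard->lock);
	/* The task may have already been pulled by another CPU. */
	if (!list_empty(&p->shared_runq_node))
		list_del_init(&p->shared_runq_node);
	raw_spin_unlock(&shard->lock);
}

Both calls are gated on sched_feat(SHARED_RUNQ). If the feature is
disabled after a task was enqueued but before it was dequeued, the
dequeue is skipped and p->shared_runq_node stays linked into the shard.
A later enqueue into that shard then does list_add_tail() next to a
stale node whose neighbours may have since been unlinked or freed, which
is exactly the kind of corruption that list debugging will flag.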

By any chance, did you run into this when you were enabling / disabling
the feature? Or did you just enable it once and then hit this issue
after some time, which would indicate a different issue? I'm trying to
repro using ab, but haven't been successful thus far. If you're able to
repro consistently, it might be useful to run with CONFIG_DEBUG_LIST=y.

Thanks,
David

> -----------------------------------------
> 
> Some initial information regarding the hard-lockup:
> 
> Base Kernel:
> -----------
> 
> Base kernel is up to commit 88c56cfeaec4 ("sched/fair: Block nohz
> tick_stop when cfs bandwidth in use").
> 
> Patched Kernel:
> -------------
> 
> Base Kernel + v3 (shared runqueue patch-set)(
> https://lore.kernel.org/all/20230809221218.163894-1-void@manifault.com/
> )
> 
> The hard-lockup mostly occurs when running the Apache2 benchmarks with
> ab (Apache HTTP benchmarking tool) on the patched kernel. However, this
> problem is not exclusive to the mentioned benchmark and only occurs
> while the SHARED_RUNQ feature is enabled. Disabling the SHARED_RUNQ
> feature prevents the lockup.
> 
> ab (Apache HTTP benchmarking tool): 
> https://httpd.apache.org/docs/2.4/programs/ab.html
> 
> Hardlockup with Patched Kernel:
> ------------------------------
> 
> [ 3289.727912][  C123] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 3289.727943][  C123] rcu: 	124-...0: (1 GPs behind) idle=f174/1/0x4000000000000000 softirq=12283/12289 fqs=732
> [ 3289.727976][  C123] rcu: 	(detected by 123, t=2103 jiffies, g=127061, q=5517 ncpus=128)
> [ 3289.728008][  C123] Sending NMI from CPU 123 to CPUs 124:
> [ 3295.182378][  C123] CPU 124 didn't respond to backtrace IPI, inspecting paca.
> [ 3295.182403][  C123] irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 15 (ksoftirqd/124)
> [ 3295.182421][  C123] Back trace of paca->saved_r1 (0xc000000de13e79b0) (possibly stale):
> [ 3295.182437][  C123] Call Trace:
> [ 3295.182456][  C123] [c000000de13e79b0] [c000000de13e7a70] 0xc000000de13e7a70 (unreliable)
> [ 3295.182477][  C123] [c000000de13e7ac0] [0000000000000008] 0x8
> [ 3295.182500][  C123] [c000000de13e7b70] [c000000de13e7c98] 0xc000000de13e7c98
> [ 3295.182519][  C123] [c000000de13e7ba0] [c0000000001da8bc] move_queued_task+0x14c/0x280
> [ 3295.182557][  C123] [c000000de13e7c30] [c0000000001f22d8] newidle_balance+0x648/0x940
> [ 3295.182602][  C123] [c000000de13e7d30] [c0000000001f26ac] pick_next_task_fair+0x7c/0x680
> [ 3295.182647][  C123] [c000000de13e7dd0] [c0000000010f175c] __schedule+0x15c/0x1040
> [ 3295.182675][  C123] [c000000de13e7ec0] [c0000000010f26b4] schedule+0x74/0x140
> [ 3295.182694][  C123] [c000000de13e7f30] [c0000000001c4994] smpboot_thread_fn+0x244/0x250
> [ 3295.182731][  C123] [c000000de13e7f90] [c0000000001bc6e8] kthread+0x138/0x140
> [ 3295.182769][  C123] [c000000de13e7fe0] [c00000000000ded8] start_kernel_thread+0x14/0x18
> [ 3295.182806][  C123] rcu: rcu_sched kthread starved for 544 jiffies! g127061 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=66
> [ 3295.182845][  C123] rcu: 	Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
> [ 3295.182878][  C123] rcu: RCU grace-period kthread stack dump:
> 
> -----------------------------------------
> 
> [ 3943.438625][  C112] watchdog: CPU 112 self-detected hard LOCKUP @ _raw_spin_lock_irqsave+0x4c/0xc0
> [ 3943.438631][  C112] watchdog: CPU 112 TB:115060212303626, last heartbeat TB:115054309631589 (11528ms ago)
> [ 3943.438673][  C112] CPU: 112 PID: 2090 Comm: kworker/112:2 Tainted: G        W    L     6.5.0-rc2-00028-g7475adccd76b #51
> [ 3943.438676][  C112] Hardware name: 8335-GTW POWER9 (raw) 0x4e1203 opal:skiboot-v6.5.3-35-g1851b2a06 PowerNV
> [ 3943.438678][  C112] Workqueue:  0x0 (events)
> [ 3943.438682][  C112] NIP:  c0000000010ff01c LR: c0000000001d1064 CTR: c0000000001e8580
> [ 3943.438684][  C112] REGS: c000007fffb6bd60 TRAP: 0900   Tainted: G        W    L      (6.5.0-rc2-00028-g7475adccd76b)
> [ 3943.438686][  C112] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24082222  XER: 00000000
> [ 3943.438693][  C112] CFAR: 0000000000000000 IRQMASK: 1 
> [ 3943.438693][  C112] GPR00: c0000000001d1064 c000000e16d1fb20 c0000000014e8200 c000000e092fed3c 
> [ 3943.438693][  C112] GPR04: c000000e16d1fc58 c000000e092fe3c8 00000000000000e1 fffffffffffe0000 
> [ 3943.438693][  C112] GPR08: 0000000000000000 00000000000000e1 0000000000000000 c00000000299ccd8 
> [ 3943.438693][  C112] GPR12: 0000000024088222 c000007ffffb8300 c0000000001bc5b8 c000000deb46f740 
> [ 3943.438693][  C112] GPR16: 0000000000000008 c000000e092fe280 0000000000000001 c000007ffedd7b00 
> [ 3943.438693][  C112] GPR20: 0000000000000001 c0000000029a1280 0000000000000000 0000000000000001 
> [ 3943.438693][  C112] GPR24: 0000000000000000 c000000e092fed3c c000000e16d1fdf0 c00000000299ccd8 
> [ 3943.438693][  C112] GPR28: c000000e16d1fc58 c0000000021fbf00 c000007ffee6bf00 0000000000000001 
> [ 3943.438722][  C112] NIP [c0000000010ff01c] _raw_spin_lock_irqsave+0x4c/0xc0
> [ 3943.438725][  C112] LR [c0000000001d1064] task_rq_lock+0x64/0x1b0
> [ 3943.438727][  C112] Call Trace:
> [ 3943.438728][  C112] [c000000e16d1fb20] [c000000e16d1fb60] 0xc000000e16d1fb60 (unreliable)
> [ 3943.438731][  C112] [c000000e16d1fb50] [c000000e16d1fbf0] 0xc000000e16d1fbf0
> [ 3943.438733][  C112] [c000000e16d1fbf0] [c0000000001f214c] newidle_balance+0x4bc/0x940
> [ 3943.438737][  C112] [c000000e16d1fcf0] [c0000000001f26ac] pick_next_task_fair+0x7c/0x680
> [ 3943.438739][  C112] [c000000e16d1fd90] [c0000000010f175c] __schedule+0x15c/0x1040
> [ 3943.438743][  C112] [c000000e16d1fe80] [c0000000010f26b4] schedule+0x74/0x140
> [ 3943.438747][  C112] [c000000e16d1fef0] [c0000000001afd44] worker_thread+0x134/0x580
> [ 3943.438749][  C112] [c000000e16d1ff90] [c0000000001bc6e8] kthread+0x138/0x140
> [ 3943.438753][  C112] [c000000e16d1ffe0] [c00000000000ded8] start_kernel_thread+0x14/0x18
> [ 3943.438756][  C112] Code: 63e90001 992d0932 a12d0008 3ce0fffe 5529083c 61290001 7d001
> 
> -----------------------------------------
> 
> System configuration:
> --------------------
> 
> # lscpu
> Architecture:                    ppc64le
> Byte Order:                      Little Endian
> CPU(s):                          128
> On-line CPU(s) list:             0-127
> Thread(s) per core:              4
> Core(s) per socket:              16
> Socket(s):                       2
> NUMA node(s):                    8
> Model:                           2.3 (pvr 004e 1203)
> Model name:                      POWER9 (raw), altivec supported
> Frequency boost:                 enabled
> CPU max MHz:                     3800.0000
> CPU min MHz:                     2300.0000
> L1d cache:                       1 MiB
> L1i cache:                       1 MiB
> NUMA node0 CPU(s):               64-127
> NUMA node8 CPU(s):               0-63
> NUMA node250 CPU(s):             
> NUMA node251 CPU(s):             
> NUMA node252 CPU(s):             
> NUMA node253 CPU(s):             
> NUMA node254 CPU(s):             
> NUMA node255 CPU(s):             
> 
> # uname -r
> 6.5.0-rc2-00028-g7475adccd76b
> 
> # cat /sys/kernel/debug/sched/features
> GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY
> CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_HRTICK_DL NO_DOUBLE_TICK
> NONTASK_CAPACITY TTWU_QUEUE NO_SIS_PROP SIS_UTIL NO_WARN_DOUBLE_CLOCK
> RT_PUSH_IPI NO_RT_RUNTIME_SHARE NO_LB_MIN ATTACH_AGE_LOAD WA_IDLE
> WA_WEIGHT WA_BIAS UTIL_EST UTIL_EST_FASTUP NO_LATENCY_WARN ALT_PERIOD
> BASE_SLICE HZ_BW SHARED_RUNQ
> 
> -----------------------------------------
> 
> Please let me know if I've missed anything here. I'll continue
> investigating and share any additional information I find.
> 
> Thanks and Regards,
> Aboorva
> 
> 
> > [... v3 cover letter quoted in full, trimmed ...]
  
David Vernet Dec. 4, 2023, 7:30 p.m. UTC | #5
On Wed, Aug 09, 2023 at 05:12:11PM -0500, David Vernet wrote:
> Changes
> -------
> 
> This is v3 of the shared runqueue patchset. This patch set is based off
> of commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
> bandwidth in use") on the sched/core branch of tip.git.
> 
> v1 (RFC): https://lore.kernel.org/lkml/20230613052004.2836135-1-void@manifault.com/
> v2: https://lore.kernel.org/lkml/20230710200342.358255-1-void@manifault.com/

Hello everyone,

I wanted to give an update on this, as I've been promising a v4 of the
patch set for quite some time. I ran more experiments over the last few
weeks, and EEVDF has changed the performance profile of the SHARED_RUNQ
feature quite a bit compared to what we were observing with CFS. We may
pick this back up again at a later point, but for now we're going to
take a step back and re-evaluate.

Thanks,
David
  
Aboorva Devarajan Dec. 7, 2023, 6 a.m. UTC | #6
On Mon, 2023-11-27 at 13:49 -0600, David Vernet wrote:
> On Mon, Nov 27, 2023 at 01:58:34PM +0530, Aboorva Devarajan wrote:
> > On Wed, 2023-08-09 at 17:12 -0500, David Vernet wrote:
> > 
> > Hi David,
> > 
> > I have been benchmarking the patch-set on a POWER9 machine to understand
> > its impact. However, I've run into a recurring hard-lockup in
> > newidle_balance, specifically when the SHARED_RUNQ feature is enabled. It
> > doesn't happen all the time, but it's something worth noting. I wanted
> > to inform you about this, and I can provide more details if needed.
> 
> Hello Aboorva,
> 
> Thank you for testing out this patch set and for the report. One issue
> that v4 will correct is that the shared_runq list could become corrupted
> if you enable and disable the feature, as a stale task could remain in
> the list after the feature has been disabled. I'll be including a fix
> for that in v4, which I'm currently benchmarking, but other stuff keeps
> seeming to preempt it.

Hi David,

Thank you for your response. While testing, I did observe the
shared_runq list becoming corrupted when enabling and disabling the
feature. 

Please find the logs below with CONFIG_DEBUG_LIST enabled:
------------------------------------------

[ 4952.270819] list_add corruption. prev->next should be next (c0000003fae87a80), but was c0000000ba027ec8. (prev=c0000000ba027ec8).
[ 4952.270926] ------------[ cut here ]------------
[ 4952.270935] kernel BUG at lib/list_debug.c:30!
[ 4952.270947] Oops: Exception in kernel mode, sig: 5 [#1]
[ 4952.270956] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[ 4952.271029] CPU: 10 PID: 31426 Comm: cc1 Kdump: loaded Not tainted 6.5.0-rc2+ #1
[ 4952.271042] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_012) hv:phyp pSeries
[ 4952.271054] NIP:  c000000000872f88 LR: c000000000872f84 CTR: 00000000006d1a1c
[ 4952.271070] REGS: c00000006e1b34e0 TRAP: 0700   Not tainted  (6.5.0-rc2+)
[ 4952.271079] MSR:  8000000002029033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 28048222  XER: 00000006
[ 4952.271102] CFAR: c0000000001ffa24 IRQMASK: 1 
[ 4952.271102] GPR00: c000000000872f84 c00000006e1b3780 c0000000019a3b00 0000000000000075 
[ 4952.271102] GPR04: c0000003faff2c08 c0000003fb077e80 c00000006e1b35c8 00000003f8e70000 
[ 4952.271102] GPR08: 0000000000000027 c000000002185f30 00000003f8e70000 0000000000000001 
[ 4952.271102] GPR12: 0000000000000000 c0000003fffe2c80 c000000068ecb100 0000000000000000 
[ 4952.271102] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
[ 4952.271102] GPR20: 0000000000000000 0000000000000000 0000000000000041 c00000006e1b3bb0 
[ 4952.271102] GPR24: c000000002c72058 00000003f8e70000 0000000000000001 c00000000e919948 
[ 4952.271102] GPR28: c0000000ba027ec8 c0000003fae87a80 c000000080ce6c00 c00000000e919980 
[ 4952.271212] NIP [c000000000872f88] __list_add_valid+0xb8/0x100
[ 4952.271236] LR [c000000000872f84] __list_add_valid+0xb4/0x100
[ 4952.271248] Call Trace:
[ 4952.271254] [c00000006e1b3780] [c000000000872f84] __list_add_valid+0xb4/0x100 (unreliable)
[ 4952.271270] [c00000006e1b37e0] [c0000000001b8f50] __enqueue_entity+0x110/0x1c0
[ 4952.271288] [c00000006e1b3830] [c0000000001bec9c] enqueue_entity+0x16c/0x690
[ 4952.271301] [c00000006e1b38e0] [c0000000001bf280] enqueue_task_fair+0xc0/0x490
[ 4952.271315] [c00000006e1b3980] [c0000000001ada0c] ttwu_do_activate+0xac/0x410
[ 4952.271328] [c00000006e1b3a10] [c0000000001ae59c] try_to_wake_up+0x5fc/0x8b0
[ 4952.271341] [c00000006e1b3ae0] [c0000000001df6dc] autoremove_wake_function+0x2c/0xc0
[ 4952.271359] [c00000006e1b3b20] [c0000000001e1018] __wake_up_common+0xc8/0x240
[ 4952.271372] [c00000006e1b3b90] [c0000000001e123c] __wake_up_common_lock+0xac/0x120
[ 4952.271385] [c00000006e1b3c20] [c0000000005bd4a4] pipe_write+0xd4/0x980
[ 4952.271401] [c00000006e1b3d00] [c0000000005ad720] vfs_write+0x350/0x4b0
[ 4952.271420] [c00000006e1b3dc0] [c0000000005adc24] ksys_write+0xf4/0x140
[ 4952.271433] [c00000006e1b3e10] [c000000000031108] system_call_exception+0x128/0x340
[ 4952.271449] [c00000006e1b3e50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
[ 4952.271470] --- interrupt: 3000 at 0x7fff8df3aa34
[ 4952.271482] NIP:  00007fff8df3aa34 LR: 0000000000000000 CTR: 0000000000000000
[ 4952.271492] REGS: c00000006e1b3e80 TRAP: 3000   Not tainted  (6.5.0-rc2+)
[ 4952.271502] MSR:  800000000000f033 <SF,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 44002822  XER: 00000000
[ 4952.271526] IRQMASK: 0 
[ 4952.271526] GPR00: 0000000000000004 00007fffea094d00 0000000112467a00 0000000000000001 
[ 4952.271526] GPR04: 0000000132c6a810 0000000000002000 00000000000004e4 0000000000000036 
[ 4952.271526] GPR08: 0000000132c6c810 0000000000000000 0000000000000000 0000000000000000 
[ 4952.271526] GPR12: 0000000000000000 00007fff8e71cac0 0000000000000000 0000000000000000 
[ 4952.271526] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
[ 4952.271526] GPR20: 00007fffea09c76f 00000001123b6898 0000000000000003 0000000132c6c820 
[ 4952.271526] GPR24: 0000000112469d88 00000001124686b8 0000000132c6a810 0000000000002000 
[ 4952.271526] GPR28: 0000000000002000 00007fff8e0418e0 0000000132c6a810 0000000000002000 
[ 4952.271627] NIP [00007fff8df3aa34] 0x7fff8df3aa34
[ 4952.271635] LR [0000000000000000] 0x0
[ 4952.271642] --- interrupt: 3000
[ 4952.271648] Code: f8010070 4b98ca81 60000000 0fe00000 7c0802a6 3c62ffa6 7d064378 7d244b78 38637f68 f8010070 4b98ca5d 60000000 <0fe00000> 7c0802a6 3c62ffa6 7ca62b78 
[ 4952.271685] ---[ end trace 0000000000000000 ]---
[ 4952.282562] pstore: backend (nvram) writing error (-1)
------------------------------------------

> 
> By any chance, did you run into this when you were enabling / disabling
> the feature? Or did you just enable it once and then hit this issue
> after some time, which would indicate a different issue? I'm trying to
> repro using ab, but haven't been successful thus far. If you're able to
> repro consistently, it might be useful to run with CONFIG_LIST_DEBUG=y.
> 

Additionally, I noticed a sporadic issue persisting even after enabling
the feature once, and the issue surfaced over time.  However, it
occurred specifically on a particular system, and my attempts to
recreate it were unsuccessful. I will provide more details if I can
successfully reproduce the issue with debug enabled. But it looks like
the primary issue revolves around the shared_runq list getting corrupted
when toggling the feature on and off repeatedly, as you pointed out.
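
In case it's useful for the discussion, I'd expect the eventual fix to
involve clearing the shards when the feature is turned off. A very
rough, purely hypothetical sketch of that idea (the shared_runq_node
field name is an assumption, not the actual v4 change):

static void shared_runq_drain(struct shared_runq *runq)
{
	int i;

	for (i = 0; i < runq->num_shards; i++) {
		struct shared_runq_shard *shard = &runq->shards[i];
		struct task_struct *p, *tmp;

		raw_spin_lock(&shard->lock);
		/* Unlink any tasks still queued so nothing stale remains. */
		list_for_each_entry_safe(p, tmp, &shard->list, shared_runq_node)
			list_del_init(&p->shared_runq_node);
		raw_spin_unlock(&shard->lock);
	}
}

Something like this could be invoked when the feature is disabled, so
that re-enabling it always starts from empty shards.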

I will keep an eye out for v4 and will test it once it's available.

Thanks,
Aboorva


> Thanks,
> David
> 
> > [... original report quoted in full, trimmed ...]