Message ID | 20230613052004.2836135-4-void@manifault.com |
---|---|
State | New |
Series | sched: Implement shared wakequeue in CFS |
Commit Message
David Vernet
June 13, 2023, 5:20 a.m. UTC
Overview
========
The scheduler must constantly strike a balance between work conservation
and avoiding costly migrations which harm performance due to e.g.
decreased cache locality. The matter is further complicated by the
topology of the system. Migrating a task between cores on the same LLC
may be better than keeping the task local to its CPU, whereas migrating
a task between LLCs or NUMA nodes may tip the balance in the other
direction.
With that in mind, while CFS is by and large a work-conserving
scheduler, there are certain instances where the scheduler will choose
to keep a task local to a CPU when it would have been better to migrate
it to an idle core.
An example of such a workload is the HHVM / web workload at Meta. HHVM
is a VM that JITs Hack and PHP code in service of web requests. Like
other JIT / compilation workloads, it tends to be heavily CPU bound and
to exhibit generally poor cache locality. To try to address this, we set
several debugfs (/sys/kernel/debug/sched) knobs on our HHVM workloads:
- migration_cost_ns -> 0
- latency_ns -> 20000000
- min_granularity_ns -> 10000000
- wakeup_granularity_ns -> 12000000
These knobs are intended both to encourage the scheduler to be as work
conserving as possible (migration_cost_ns -> 0), and to keep tasks
running for relatively long time slices so as to avoid the overhead of
context switching (the other knobs). Collectively, these knobs provide a
substantial performance win, resulting in roughly a 20% improvement in
throughput. Worth noting, however, is that this improvement is _not_ at
full machine saturation.
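For context, migration_cost_ns is consulted by the load balancer's
task_hot() check when deciding whether a recently-run task may be
migrated. A rough sketch of that gate, paraphrasing mainline
kernel/sched/fair.c (several unrelated checks omitted), is:

/*
 * Rough paraphrase of task_hot() in kernel/sched/fair.c; NUMA, buddy and
 * idle-policy checks are omitted. With migration_cost_ns set to 0, a task
 * is never considered cache hot, so the balancer is free to migrate it.
 */
static int task_hot(struct task_struct *p, struct lb_env *env)
{
	s64 delta;

	if (sysctl_sched_migration_cost == -1)
		return 1;	/* always treat the task as cache hot */
	if (sysctl_sched_migration_cost == 0)
		return 0;	/* never treat the task as cache hot */

	delta = rq_clock_task(env->src_rq) - p->se.exec_start;
	return delta < (s64)sysctl_sched_migration_cost;
}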
That said, even with these knobs, we noticed that CPUs were still going
idle even when the host was overcommitted. In response, we wrote the
"shared wakequeue" (swqueue) feature proposed in this patch set. The
idea behind swqueue is simple: it enables the scheduler to be
aggressively work conserving by placing a waking task into a per-LLC
FIFO queue that can be pulled from by another core in the LLC before it
goes idle.
With this simple change, we were able to achieve a 1 - 1.6% improvement
in throughput, as well as a small, consistent improvement in p95 and p99
latencies, in HHVM. These performance improvements were in addition to
the wins from the debugfs knobs mentioned above.
Design
======
The design of swqueue is quite simple. An swqueue is just a struct
list_head and a spinlock:
struct swqueue {
struct list_head list;
spinlock_t lock;
} ____cacheline_aligned;
We create a struct swqueue per LLC, ensuring they're in their own
cachelines to avoid false sharing between CPUs on different LLCs.
When a task first wakes up, it enqueues itself in the swqueue of its
current LLC at the end of enqueue_task_fair(). Enqueues only happen if
the task was not manually migrated to the current core by
select_task_rq(), and is not pinned to a specific CPU.
A core will pull a task from its LLC's swqueue before calling
newidle_balance().
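For reference, below is a condensed sketch of the two hot paths,
paraphrased from the patch hunks quoted in Peter's review further down
(error handling and the cross-rq migration performed on the pick path
are trimmed):

/* Called at the end of enqueue_task_fair() when a task wakes up. */
static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
{
	struct swqueue *swqueue = rq_swqueue(rq);
	unsigned long flags;

	/* Skip tasks that select_task_rq() already migrated, or that are pinned. */
	if (!(enq_flags & ENQUEUE_WAKEUP) || (enq_flags & ENQUEUE_MIGRATED) ||
	    p->nr_cpus_allowed == 1)
		return;

	spin_lock_irqsave(&swqueue->lock, flags);
	list_add_tail(&p->swqueue_node, &swqueue->list);
	spin_unlock_irqrestore(&swqueue->lock, flags);
}

/* Called from pick_next_task_fair() before falling back to newidle_balance(). */
static struct task_struct *swqueue_pull_task(struct swqueue *swqueue)
{
	struct task_struct *p;
	unsigned long flags;

	spin_lock_irqsave(&swqueue->lock, flags);
	p = list_first_entry_or_null(&swqueue->list, struct task_struct,
				     swqueue_node);
	if (p)
		list_del_init(&p->swqueue_node);
	spin_unlock_irqrestore(&swqueue->lock, flags);

	return p;
}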
Differences from SIS_NODE
=========================
In [0], Peter proposed a patch addressing Tejun's observation that when
workqueues are targeted towards a specific LLC on his Zen2 machine with
small CCXs, there would be significant idle time due to
select_idle_sibling() not considering anything outside of the current
LLC.
This patch (SIS_NODE) is essentially the complement to the proposal
here. SIS_NODE causes waking tasks to look for idle cores in neighboring
LLCs on the same die, whereas swqueue causes cores about to go idle to
look for enqueued tasks. That said, in their current forms the two
features operate at different scopes: SIS_NODE searches for idle cores
across LLCs, while swqueue enqueues tasks within a single LLC.
The patch has since been removed in [1], but we still compare its
performance to swqueue's given that, as described above, it's
conceptually complementary.
[0]: https://lore.kernel.org/all/20230530113249.GA156198@hirez.programming.kicks-ass.net/
[1]: https://lore.kernel.org/all/20230605175636.GA4253@hirez.programming.kicks-ass.net/
I observed that while SIS_NODE works quite well for hosts with small
CCXs, it can result in degraded performance on machines either with a
large number of total cores in a CCD, or for which the cache miss
penalty of migrating between CCXs is high, even on the same die.
For example, on Zen 4c hosts (Bergamo), CCXs within a CCD are muxed
through a single link to the IO die, and thus have similar cache miss
latencies as cores in remote CCDs.
Such subtleties could be taken into account with SIS_NODE, but
regardless, both features are conceptually complementary sides of the
same coin. SIS_NODE searches for idle cores for waking threads, whereas
swqueue searches for available work before a core goes idle.
Results
=======
Note that the motivation for the shared wakequeue feature originally
came from experiments in the sched_ext framework that's currently being
proposed upstream. The ~1 - 1.6% improvement in HHVM throughput is
similarly visible using work-conserving sched_ext schedulers (even
very simple ones like global FIFO).
On both single- and multi-socket / multi-CCX hosts, this can measurably
improve performance. In addition to the performance gains observed on our
internal web workloads, we also observed an improvement in common
workloads such as kernel compile when running shared wakequeue. Here are
the results of running make -j$(nproc) built-in.a on several different
types of hosts configured with make allyesconfig on commit a27648c74210
("afs: Fix setting of mtime when creating a file/dir/symlink") on Linus'
tree (boost was disabled on all of these hosts when the experiments were
performed):
Single-socket | 32-core | 2-CCX | AMD 7950X Zen4
CPU max MHz: 5879.8818
CPU min MHz: 3000.0000
                          o____________o_______o
                          |    mean    |  CPU  |
                          o------------o-------o
NO_SWQUEUE + NO_SIS_NODE: |  590.52s   | 3103% |
NO_SWQUEUE + SIS_NODE:    |  590.80s   | 3102% |
SWQUEUE + NO_SIS_NODE:    |  589.65s   | 3116% |
SWQUEUE + SIS_NODE:       |  589.99s   | 3115% |
                          o------------o-------o
Takeaway: swqueue doesn't seem to provide a statistically significant
improvement for kernel compile on my 7950X. SIS_NODE similarly does not
have a noticeable effect on performance.
-------------------------------------------------------------------------------
Single-socket | 72-core | 6-CCX | AMD Milan Zen3
CPU max MHz: 3245.0190
CPU min MHz: 700.0000
                          o_____________o_______o
                          |     mean    |  CPU  |
                          o-------------o-------o
NO_SWQUEUE + NO_SIS_NODE: |  1608.69s   | 6488% |
NO_SWQUEUE + SIS_NODE:    |  1610.24s   | 6473% |
SWQUEUE + NO_SIS_NODE:    |  1605.80s   | 6504% |
SWQUEUE + SIS_NODE:       |  1606.96s   | 6488% |
                          o-------------o-------o
Takeaway: swqueue does provide a small but statistically significant
improvement on Milan, but the compile times in general were quite long
relative to the 7950X Zen4 and the Bergamo Zen4c due to Milan's lower
clock frequency. Milan also has larger CCXs than Bergamo, so it stands
to reason that select_idle_sibling() will have an easier time finding
idle cores inside the current CCX.
It also seems logical that SIS_NODE would hurt performance a bit here,
as all cores / CCXs are in the same NUMA node, so select_idle_sibling()
has to iterate over 72 cores, delaying task wakeup. That said, I'm not
sure that's a viable theory if total CPU% is lower with SIS_NODE.
-------------------------------------------------------------------------------
Single-socket | 176-core | 11-CCX | 2-CCX per CCD | AMD Bergamo Zen4c
CPU max MHz: 1200.0000
CPU min MHz: 1000.0000
                          o____________o________o
                          |    mean    |  CPU   |
                          o------------o--------o
NO_SWQUEUE + NO_SIS_NODE: |  322.44s   | 15534% |
NO_SWQUEUE + SIS_NODE:    |  324.39s   | 15508% |
SWQUEUE + NO_SIS_NODE:    |  321.54s   | 15603% |
SWQUEUE + SIS_NODE:       |  321.88s   | 15622% |
                          o------------o--------o
Takeaway: swqueue barely beats NO_SWQUEUE + NO_SIS_NODE, to the point
that it's arguably not statistically significant.
SIS_NODE results in a ~.9% performance degradation, for likely the same
reason as Milan: the host has a large number of LLCs within a single
socket, so task wakeup latencies suffer due to select_idle_node()
searching up to 11 CCXs.
Conclusion
==========
swqueue in this form seems to provide a small, but noticeable win for
front-end CPU-bound workloads spread over multiple CCXs. The reason
seems fairly straightforward: swqueue encourages work conservation
inside of a CCX by having a CPU do an O(1) pull from a per-LLC queue of
runnable tasks. As mentioned above, it is complementary to SIS_NODE,
which searches for idle cores on the wakeup path.
While swqueue in this form encourages work conservation, it of course
does not guarantee it given that we don't implement any kind of work
stealing between swqueues. In the future, we could potentially push CPU
utilization even higher by enabling work stealing between swqueues,
likely between CCXs on the same NUMA node.
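A purely hypothetical sketch of what such stealing could look like is
below; the helper name swqueue_steal_task(), the node-local iteration,
and the omission of affinity / migration handling are all illustrative
rather than anything proposed in this series:

/*
 * Hypothetical illustration only: when the local swqueue is empty, scan
 * the swqueues of sibling LLCs on the same NUMA node for work. Affinity
 * checks and the actual task migration are glossed over.
 */
static struct task_struct *swqueue_steal_task(struct rq *rq)
{
	struct swqueue *local = rq_swqueue(rq);
	struct task_struct *p;
	int cpu;

	for_each_cpu(cpu, cpumask_of_node(cpu_to_node(rq->cpu))) {
		struct swqueue *remote = rq_swqueue(cpu_rq(cpu));

		if (remote == local || list_empty(&remote->list))
			continue;

		p = swqueue_pull_task(remote);
		if (p)
			return p;
	}

	return NULL;
}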
Originally-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: David Vernet <void@manifault.com>
---
include/linux/sched.h | 2 +
kernel/sched/core.c | 2 +
kernel/sched/fair.c | 175 ++++++++++++++++++++++++++++++++++++++++--
kernel/sched/sched.h | 2 +
4 files changed, 176 insertions(+), 5 deletions(-)
Comments
Still gotta read it properly, however: On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote: > Single-socket | 32-core | 2-CCX | AMD 7950X Zen4 > Single-socket | 72-core | 6-CCX | AMD Milan Zen3 > Single-socket | 176-core | 11-CCX | 2-CCX per CCD | AMD Bergamo Zen4c Could you please also benchmark on something Intel that has these stupid large LLCs ? Because the last time I tried something like this, it came apart real quick. And AMD has these relatively small 8-core LLCs.
On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote: > +struct swqueue { > + struct list_head list; > + spinlock_t lock; > +} ____cacheline_aligned; > + > #ifdef CONFIG_SMP > +static struct swqueue *rq_swqueue(struct rq *rq) > +{ > + return rq->cfs.swqueue; > +} > + > +static struct task_struct *swqueue_pull_task(struct swqueue *swqueue) > +{ > + unsigned long flags; > + > + struct task_struct *p; > + > + spin_lock_irqsave(&swqueue->lock, flags); > + p = list_first_entry_or_null(&swqueue->list, struct task_struct, > + swqueue_node); > + if (p) > + list_del_init(&p->swqueue_node); > + spin_unlock_irqrestore(&swqueue->lock, flags); > + > + return p; > +} > + > +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags) > +{ > + unsigned long flags; > + struct swqueue *swqueue; > + bool task_migrated = enq_flags & ENQUEUE_MIGRATED; > + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP; > + > + /* > + * Only enqueue the task in the shared wakequeue if: > + * > + * - SWQUEUE is enabled > + * - The task is on the wakeup path > + * - The task wasn't purposefully migrated to the current rq by > + * select_task_rq() > + * - The task isn't pinned to a specific CPU > + */ > + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1) > + return; > + > + swqueue = rq_swqueue(rq); > + spin_lock_irqsave(&swqueue->lock, flags); > + list_add_tail(&p->swqueue_node, &swqueue->list); > + spin_unlock_irqrestore(&swqueue->lock, flags); > +} > + > static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf) > { > - return 0; > + struct swqueue *swqueue; > + struct task_struct *p = NULL; > + struct rq *src_rq; > + struct rq_flags src_rf; > + int ret; > + > + swqueue = rq_swqueue(rq); > + if (!list_empty(&swqueue->list)) > + p = swqueue_pull_task(swqueue); > + > + if (!p) > + return 0; > + > + rq_unpin_lock(rq, rf); > + raw_spin_rq_unlock(rq); > + > + src_rq = task_rq_lock(p, &src_rf); > + > + if (task_on_rq_queued(p) && !task_on_cpu(rq, p)) > + src_rq = migrate_task_to(src_rq, &src_rf, p, cpu_of(rq)); > + > + if (src_rq->cpu != rq->cpu) > + ret = 1; > + else > + ret = -1; > + > + task_rq_unlock(src_rq, p, &src_rf); > + > + raw_spin_rq_lock(rq); > + rq_repin_lock(rq, rf); > + > + return ret; > } > > static void swqueue_remove_task(struct task_struct *p) > -{} > +{ > + unsigned long flags; > + struct swqueue *swqueue; > + > + if (!list_empty(&p->swqueue_node)) { > + swqueue = rq_swqueue(task_rq(p)); > + spin_lock_irqsave(&swqueue->lock, flags); > + list_del_init(&p->swqueue_node); > + spin_unlock_irqrestore(&swqueue->lock, flags); > + } > +} > > /* > * For asym packing, by default the lower numbered CPU has higher priority. *sigh*... pretty much all, if not all of this is called with rq->lock held. So why the irqsave and big fat fail for using spinlock :-(
On Tue, Jun 13, 2023 at 10:32:03AM +0200, Peter Zijlstra wrote: > > Still gotta read it properly, however: > > On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote: > > Single-socket | 32-core | 2-CCX | AMD 7950X Zen4 > > Single-socket | 72-core | 6-CCX | AMD Milan Zen3 > > Single-socket | 176-core | 11-CCX | 2-CCX per CCD | AMD Bergamo Zen4c > > Could you please also benchmark on something Intel that has these stupid > large LLCs ? > > Because the last time I tried something like this, it came apart real > quick. And AMD has these relatively small 8-core LLCs. I tested on Intel(R) Xeon(R) Platinum 8358, which has 2 sockets and each socket has a single LLC with 32 cores/64threads. When running netperf with nr_thread=128, runtime=60: " netserver -4 for i in `seq $nr_threads`; do netperf -4 -H 127.0.0.1 -t UDP_RR -c -C -l $runtime & done wait " The lock contention due to the per-LLC swqueue->lock is quite serious: 83.39% 83.33% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath - - | |--42.86%--__libc_sendto | entry_SYSCALL_64 | do_syscall_64 | | | --42.81%--__x64_sys_sendto | __sys_sendto | sock_sendmsg | inet_sendmsg | udp_sendmsg | udp_send_skb | ip_send_skb | ip_output | ip_finish_output | __ip_finish_output | ip_finish_output2 | __dev_queue_xmit | __local_bh_enable_ip | do_softirq.part.0 | __do_softirq | | | --42.81%--net_rx_action | __napi_poll | process_backlog | __netif_receive_skb | __netif_receive_skb_one_core | ip_rcv | ip_local_deliver | ip_local_deliver_finish | ip_protocol_deliver_rcu | udp_rcv | __udp4_lib_rcv | udp_unicast_rcv_skb | udp_queue_rcv_skb | udp_queue_rcv_one_skb | __udp_enqueue_schedule_skb | sock_def_readable | __wake_up_sync_key | __wake_up_common_lock | | | --42.81%--__wake_up_common | receiver_wake_function | autoremove_wake_function | default_wake_function | try_to_wake_up | ttwu_do_activate | enqueue_task | enqueue_task_fair | _raw_spin_lock_irqsave | | | --42.81%--native_queued_spin_lock_slowpath | |--20.39%--0 | __libc_recvfrom | entry_SYSCALL_64 | do_syscall_64 | __x64_sys_recvfrom | __sys_recvfrom | sock_recvmsg | inet_recvmsg | udp_recvmsg | __skb_recv_udp | __skb_wait_for_more_packets | schedule_timeout | schedule | __schedule | pick_next_task_fair | | | --20.39%--swqueue_remove_task | _raw_spin_lock_irqsave | | | --20.39%--native_queued_spin_lock_slowpath | --20.07%--__libc_recvfrom entry_SYSCALL_64 do_syscall_64 __x64_sys_recvfrom __sys_recvfrom sock_recvmsg inet_recvmsg udp_recvmsg __skb_recv_udp __skb_wait_for_more_packets schedule_timeout schedule __schedule | --20.06%--pick_next_task_fair swqueue_remove_task _raw_spin_lock_irqsave | --20.06%--native_queued_spin_lock_slowpath I suppose that is because there are too many CPUs in a single LLC on this machine and when all these CPUs try to queue/pull tasks in this per-LLC shared wakequeue, it just doesn't scale well.
On Wed, Jun 14, 2023 at 12:35:29PM +0800, Aaron Lu wrote: > On Tue, Jun 13, 2023 at 10:32:03AM +0200, Peter Zijlstra wrote: > > > > Still gotta read it properly, however: > > > > On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote: > > > Single-socket | 32-core | 2-CCX | AMD 7950X Zen4 > > > Single-socket | 72-core | 6-CCX | AMD Milan Zen3 > > > Single-socket | 176-core | 11-CCX | 2-CCX per CCD | AMD Bergamo Zen4c > > > > Could you please also benchmark on something Intel that has these stupid > > large LLCs ? > > > > Because the last time I tried something like this, it came apart real > > quick. And AMD has these relatively small 8-core LLCs. > > I tested on Intel(R) Xeon(R) Platinum 8358, which has 2 sockets and each > socket has a single LLC with 32 cores/64threads. > The lock contention due to the per-LLC swqueue->lock is quite serious: Yep, that's what I was expecting -- complete meltdown that.
On Tue, Jun 13, 2023 at 10:41:11AM +0200, Peter Zijlstra wrote: > On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote: > > +struct swqueue { > > + struct list_head list; > > + spinlock_t lock; > > +} ____cacheline_aligned; > > + > > #ifdef CONFIG_SMP > > +static struct swqueue *rq_swqueue(struct rq *rq) > > +{ > > + return rq->cfs.swqueue; > > +} > > + > > +static struct task_struct *swqueue_pull_task(struct swqueue *swqueue) > > +{ > > + unsigned long flags; > > + > > + struct task_struct *p; > > + > > + spin_lock_irqsave(&swqueue->lock, flags); > > + p = list_first_entry_or_null(&swqueue->list, struct task_struct, > > + swqueue_node); > > + if (p) > > + list_del_init(&p->swqueue_node); > > + spin_unlock_irqrestore(&swqueue->lock, flags); > > + > > + return p; > > +} > > + > > +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags) > > +{ > > + unsigned long flags; > > + struct swqueue *swqueue; > > + bool task_migrated = enq_flags & ENQUEUE_MIGRATED; > > + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP; > > + > > + /* > > + * Only enqueue the task in the shared wakequeue if: > > + * > > + * - SWQUEUE is enabled > > + * - The task is on the wakeup path > > + * - The task wasn't purposefully migrated to the current rq by > > + * select_task_rq() > > + * - The task isn't pinned to a specific CPU > > + */ > > + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1) > > + return; > > + > > + swqueue = rq_swqueue(rq); > > + spin_lock_irqsave(&swqueue->lock, flags); > > + list_add_tail(&p->swqueue_node, &swqueue->list); > > + spin_unlock_irqrestore(&swqueue->lock, flags); > > +} > > + > > static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf) > > { > > - return 0; > > + struct swqueue *swqueue; > > + struct task_struct *p = NULL; > > + struct rq *src_rq; > > + struct rq_flags src_rf; > > + int ret; > > + > > + swqueue = rq_swqueue(rq); > > + if (!list_empty(&swqueue->list)) > > + p = swqueue_pull_task(swqueue); > > + > > + if (!p) > > + return 0; > > + > > + rq_unpin_lock(rq, rf); > > + raw_spin_rq_unlock(rq); > > + > > + src_rq = task_rq_lock(p, &src_rf); > > + > > + if (task_on_rq_queued(p) && !task_on_cpu(rq, p)) > > + src_rq = migrate_task_to(src_rq, &src_rf, p, cpu_of(rq)); > > + > > + if (src_rq->cpu != rq->cpu) > > + ret = 1; > > + else > > + ret = -1; > > + > > + task_rq_unlock(src_rq, p, &src_rf); > > + > > + raw_spin_rq_lock(rq); > > + rq_repin_lock(rq, rf); > > + > > + return ret; > > } > > > > static void swqueue_remove_task(struct task_struct *p) > > -{} > > +{ > > + unsigned long flags; > > + struct swqueue *swqueue; > > + > > + if (!list_empty(&p->swqueue_node)) { > > + swqueue = rq_swqueue(task_rq(p)); > > + spin_lock_irqsave(&swqueue->lock, flags); > > + list_del_init(&p->swqueue_node); > > + spin_unlock_irqrestore(&swqueue->lock, flags); > > + } > > +} > > > > /* > > * For asym packing, by default the lower numbered CPU has higher priority. > > *sigh*... pretty much all, if not all of this is called with rq->lock > held. So why the irqsave and big fat fail for using spinlock :-( Hi Peter, Thanks for the quick review. Yeah good call about the irq's -- looks like we're holding an rq lock on all swqueue paths so we can just use a raw spinlock. I'll make that change for v2. Regarding the per-swqueue spinlock being a potential bottleneck, I'll reply to Aaron's thread on [0] with some numbers I collected locally on a 26 core / 52 thread Cooperlake host, and a 20/40 x 2 Skylake. 
The TL;DR is that I'm not observing the spinlock be contended on either netperf or kernel compile workloads, with swqueue actually performing ~1 - 2% better than non-swqueue on both of these hosts for netperf. [0]: https://lore.kernel.org/all/20230614043529.GA1942@ziqianlu-dell/ Thanks, David
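For illustration, the change being discussed would presumably look
something like the sketch below (not the actual v2, just a sketch of
the locking point): since every swqueue path already runs with the rq
lock held and interrupts disabled, the queue can use a raw spinlock and
drop the flags save/restore.

struct swqueue {
	struct list_head list;
	raw_spinlock_t lock;
} ____cacheline_aligned;

static void swqueue_remove_task(struct task_struct *p)
{
	struct swqueue *swqueue;

	if (!list_empty(&p->swqueue_node)) {
		swqueue = rq_swqueue(task_rq(p));
		/* rq lock already held with IRQs off on this path. */
		raw_spin_lock(&swqueue->lock);
		list_del_init(&p->swqueue_node);
		raw_spin_unlock(&swqueue->lock);
	}
}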
On Wed, Jun 14, 2023 at 12:35:29PM +0800, Aaron Lu wrote: > On Tue, Jun 13, 2023 at 10:32:03AM +0200, Peter Zijlstra wrote: > > > > Still gotta read it properly, however: > > > > On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote: > > > Single-socket | 32-core | 2-CCX | AMD 7950X Zen4 > > > Single-socket | 72-core | 6-CCX | AMD Milan Zen3 > > > Single-socket | 176-core | 11-CCX | 2-CCX per CCD | AMD Bergamo Zen4c > > > > Could you please also benchmark on something Intel that has these stupid > > large LLCs ? Sure, please see below. > > Because the last time I tried something like this, it came apart real > > quick. And AMD has these relatively small 8-core LLCs. > > I tested on Intel(R) Xeon(R) Platinum 8358, which has 2 sockets and each > socket has a single LLC with 32 cores/64threads. > > When running netperf with nr_thread=128, runtime=60: > > " > netserver -4 > > for i in `seq $nr_threads`; do > netperf -4 -H 127.0.0.1 -t UDP_RR -c -C -l $runtime & > done > > wait > " > > The lock contention due to the per-LLC swqueue->lock is quite serious: > > 83.39% 83.33% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath - - > | > |--42.86%--__libc_sendto > | entry_SYSCALL_64 > | do_syscall_64 > | | > | --42.81%--__x64_sys_sendto > | __sys_sendto > | sock_sendmsg > | inet_sendmsg > | udp_sendmsg > | udp_send_skb > | ip_send_skb > | ip_output > | ip_finish_output > | __ip_finish_output > | ip_finish_output2 > | __dev_queue_xmit > | __local_bh_enable_ip > | do_softirq.part.0 > | __do_softirq > | | > | --42.81%--net_rx_action > | __napi_poll > | process_backlog > | __netif_receive_skb > | __netif_receive_skb_one_core > | ip_rcv > | ip_local_deliver > | ip_local_deliver_finish > | ip_protocol_deliver_rcu > | udp_rcv > | __udp4_lib_rcv > | udp_unicast_rcv_skb > | udp_queue_rcv_skb > | udp_queue_rcv_one_skb > | __udp_enqueue_schedule_skb > | sock_def_readable > | __wake_up_sync_key > | __wake_up_common_lock > | | > | --42.81%--__wake_up_common > | receiver_wake_function > | autoremove_wake_function > | default_wake_function > | try_to_wake_up > | ttwu_do_activate > | enqueue_task > | enqueue_task_fair > | _raw_spin_lock_irqsave > | | > | --42.81%--native_queued_spin_lock_slowpath > | > |--20.39%--0 > | __libc_recvfrom > | entry_SYSCALL_64 > | do_syscall_64 > | __x64_sys_recvfrom > | __sys_recvfrom > | sock_recvmsg > | inet_recvmsg > | udp_recvmsg > | __skb_recv_udp > | __skb_wait_for_more_packets > | schedule_timeout > | schedule > | __schedule > | pick_next_task_fair > | | > | --20.39%--swqueue_remove_task > | _raw_spin_lock_irqsave > | | > | --20.39%--native_queued_spin_lock_slowpath > | > --20.07%--__libc_recvfrom > entry_SYSCALL_64 > do_syscall_64 > __x64_sys_recvfrom > __sys_recvfrom > sock_recvmsg > inet_recvmsg > udp_recvmsg > __skb_recv_udp > __skb_wait_for_more_packets > schedule_timeout > schedule > __schedule > | > --20.06%--pick_next_task_fair > swqueue_remove_task > _raw_spin_lock_irqsave > | > --20.06%--native_queued_spin_lock_slowpath > > I suppose that is because there are too many CPUs in a single LLC on > this machine and when all these CPUs try to queue/pull tasks in this > per-LLC shared wakequeue, it just doesn't scale well. Hi Aaron, Thanks for taking a look and running some benchmarks. I tried to reproduce your results on a 26 core / 52 thread Cooperlake host, and a 20 core / 40 thread x 2 Skylake host, but I wasn't able to observe the contention on the per-swqueue spinlock you were saw on your Ice Lake. 
I ran the following netperf benchmarks: - netperf -l 60 -n $(nproc) -6 - netperf -l 60 -n $(nproc) -6 -t UDP_RR Here are the results I'm seeing from taking the average of running those benchmarks three times on each host (all results are from the "Throughput" column of the below table). Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput <<< bytes bytes bytes secs. 10^6bits/sec 26 / 52 x 1 Cooperlake ---------------------- Default workload: NO_SWQUEUE: 34103.76 SWQUEUE: 34707.46 Delta: +1.77% UDP_RR: NO_SWQUEUE: 57695.29 SWQUEUE: 54827.36 Delta: -4.97% There's clearly a statistically significant decline in performance for UDP_RR here, but surprisingly, I don't see any contention on swqueue locks when I profile with perf. Rather, we seem to be contending on the rq lock (presumably in swqueue_pick_next_task() which is unexpected (to me, at least): 7.81% netperf [kernel.vmlinux] [k] queued_spin_lock_slowpath | --7.81%--queued_spin_lock_slowpath | --7.55%--task_rq_lock pick_next_task_fair schedule schedule_timeout __skb_wait_for_more_packets __skb_recv_udp udpv6_recvmsg inet6_recvmsg __x64_sys_recvfrom do_syscall_64 entry_SYSCALL_64 __libc_recvfrom 0x7f923fdfacf0 0 4.97% netperf [kernel.vmlinux] [k] _raw_spin_lock_irqsave | --4.96%--_raw_spin_lock_irqsave | |--1.13%--prepare_to_wait_exclusive | __skb_wait_for_more_packets | __skb_recv_udp | udpv6_recvmsg | inet6_recvmsg | __x64_sys_recvfrom | do_syscall_64 | entry_SYSCALL_64 | __libc_recvfrom | 0x7f923fdfacf0 | 0 | |--0.92%--default_wake_function | autoremove_wake_function | __wake_up_sync_key | sock_def_readable | __udp_enqueue_schedule_skb | udpv6_queue_rcv_one_skb | __udp6_lib_rcv | ip6_input | ipv6_rcv | process_backlog | net_rx_action | __softirqentry_text_start | __local_bh_enable_ip | ip6_output | ip6_local_out | ip6_send_skb | udp_v6_send_skb | udpv6_sendmsg | __sys_sendto | __x64_sys_sendto | do_syscall_64 | entry_SYSCALL_64 | __libc_sendto | |--0.81%--__wake_up_sync_key | sock_def_readable | __udp_enqueue_schedule_skb | udpv6_queue_rcv_one_skb | __udp6_lib_rcv | ip6_input | ipv6_rcv | process_backlog | net_rx_action | __softirqentry_text_start | __local_bh_enable_ip | ip6_output | ip6_local_out | ip6_send_skb | udp_v6_send_skb | udpv6_sendmsg | __sys_sendto | __x64_sys_sendto | do_syscall_64 | entry_SYSCALL_64 | __libc_sendto | |--0.73%--enqueue_task_fair | | | --0.72%--default_wake_function | autoremove_wake_function | __wake_up_sync_key | sock_def_readable | __udp_enqueue_schedule_skb | udpv6_queue_rcv_one_skb | __udp6_lib_rcv | ip6_input | ipv6_rcv | process_backlog | net_rx_action | __softirqentry_text_start | __local_bh_enable_ip | ip6_output | ip6_local_out | ip6_send_skb | udp_v6_send_skb | udpv6_sendmsg | __sys_sendto | __x64_sys_sendto | do_syscall_64 | entry_SYSCALL_64 | __libc_sendto | --0.68%--task_rq_lock pick_next_task_fair schedule schedule_timeout __skb_wait_for_more_packets __skb_recv_udp udpv6_recvmsg inet6_recvmsg __x64_sys_recvfrom do_syscall_64 entry_SYSCALL_64 __libc_recvfrom 0x7f923fdfacf0 0 The profile without swqueue doesn't show any contention on the rq lock, but does show us spending a good amount of time in the scheduler: 4.03% netperf [kernel.vmlinux] [k] default_wake_function | --3.98%--default_wake_function autoremove_wake_function __wake_up_sync_key sock_def_readable __udp_enqueue_schedule_skb udpv6_queue_rcv_one_skb __udp6_lib_rcv ip6_input ipv6_rcv process_backlog net_rx_action __softirqentry_text_start __local_bh_enable_ip ip6_output ip6_local_out ip6_send_skb udp_v6_send_skb 
udpv6_sendmsg __sys_sendto __x64_sys_sendto do_syscall_64 entry_SYSCALL_64 __libc_sendto 3.70% netperf [kernel.vmlinux] [k] enqueue_entity | --3.66%--enqueue_entity enqueue_task_fair | --3.66%--default_wake_function autoremove_wake_function __wake_up_sync_key sock_def_readable __udp_enqueue_schedule_skb udpv6_queue_rcv_one_skb __udp6_lib_rcv ip6_input ipv6_rcv process_backlog net_rx_action __softirqentry_text_start __local_bh_enable_ip ip6_output ip6_local_out ip6_send_skb udp_v6_send_skb udpv6_sendmsg __sys_sendto __x64_sys_sendto do_syscall_64 entry_SYSCALL_64 __libc_sendto There's clearly a measurable impact on performance here due to swqueue (negative for UDP_RR, positive for the default workload), but this looks quite different than what you were observing. 20 / 40 x 2 Skylake ------------------- Default workload: NO_SWQUEUE: 57437.45 SWQUEUE: 58801.11 Delta: +2.37% UDP_RR: NO_SWQUEUE: 75932.28 SWQUEUE: 75232.77 Delta: -.92% Same observation here. I didn't collect a profile, but the trend seems consistent, and there's clearly a tradeoff. Despite the small drop in perf for UDP_RR, it seems quite a bit less drastic than what would be expected with the contention you showed in your profile. 7950X (8 cores / 16 threads per CCX, 2 CCXs) -------------------------------------------- Default workload: NO_SWQUEUE: 77615.08 SWQUEUE: 77465.73 Delta: -.19% UDP_RR: NO_SWQUEUE: 230258.75 SWQUEUE: 230280.34 Delta: ~0% I'd call this essentially a no-op. Milan (6 cores / 12 threads per CCX, 6 CCXs) -------------------------------------------- Default workload: NO_SWQUEUE: 23571.39 SWQUEUE: 23791.47 Delta: +.93% UDP_RR: NO_SWQUEUE: 40954.66 SWQUEUE: 40385.08 Delta: -1.39% Same story here as above. Interesting to note though that Milan has a small numer of cores per-LLC. 
I collected a profile, but I'm not sure how useful it is given that so much of the profile seems to be indicting perf as the bottleneck: 24.93% 664 netperf [kernel.vmlinux] [k] x86_pmu_disable_all | ---x86_pmu_disable_all amd_pmu_disable_all | |--23.87%--__perf_event_task_sched_out | schedule | | | --23.84%--schedule_timeout | __skb_wait_for_more_packets | __skb_recv_udp | udpv6_recvmsg | inet6_recvmsg | __x64_sys_recvfrom | do_syscall_64 | entry_SYSCALL_64 | __libc_recvfrom | 0x7fbfa69facf0 | 0 | --0.62%--perf_mux_hrtimer_handler hrtimer_interrupt __sysvec_apic_timer_interrupt sysvec_apic_timer_interrupt asm_sysvec_apic_timer_interrupt 5.33% 135 netperf [kernel.vmlinux] [k] amd_pmu_test_overflow_topbit | ---amd_pmu_test_overflow_topbit amd_pmu_check_overflow | --5.18%--__perf_event_task_sched_out schedule schedule_timeout __skb_wait_for_more_packets __skb_recv_udp udpv6_recvmsg inet6_recvmsg __x64_sys_recvfrom do_syscall_64 entry_SYSCALL_64 __libc_recvfrom 0x7fbfa69facf0 0 3.73% 102 netperf [kernel.vmlinux] [k] native_sched_clock | ---native_sched_clock | |--1.18%--sched_clock_cpu | | | --0.75%--irqtime_account_irq | | | --0.71%--__softirqentry_text_start | __local_bh_enable_ip | ip6_output | ip6_local_out | ip6_send_skb | udp_v6_send_skb | udpv6_sendmsg | __sys_sendto | __x64_sys_sendto | do_syscall_64 | entry_SYSCALL_64 | __libc_sendto | |--1.18%--vtime_user_exit | __context_tracking_exit | syscall_enter_from_user_mode | do_syscall_64 | entry_SYSCALL_64 | | | |--0.61%--__libc_recvfrom | | 0x7fbfa69facf0 | | 0 | | | --0.57%--__libc_sendto | --0.51%--vtime_user_enter __context_tracking_enter syscall_exit_to_user_mode do_syscall_64 entry_SYSCALL_64 3.53% 107 netperf [kernel.vmlinux] [k] copy_user_generic_string | ---copy_user_generic_string | |--1.45%--_copy_to_iter | udpv6_recvmsg | inet6_recvmsg | __x64_sys_recvfrom | do_syscall_64 | entry_SYSCALL_64 | __libc_recvfrom | 0x7fbfa69facf0 | 0 | |--0.91%--_copy_to_user | move_addr_to_user | __x64_sys_recvfrom | do_syscall_64 | entry_SYSCALL_64 | __libc_recvfrom | 0x7fbfa69facf0 | 0 | --0.73%--_copy_from_user __sys_sendto __x64_sys_sendto do_syscall_64 entry_SYSCALL_64 __libc_sendto 2.98% 79 netperf [kernel.vmlinux] [k] default_wake_function | ---default_wake_function autoremove_wake_function __wake_up_sync_key sock_def_readable __udp_enqueue_schedule_skb udpv6_queue_rcv_one_skb __udp6_lib_rcv ip6_input ipv6_rcv process_backlog net_rx_action __softirqentry_text_start __local_bh_enable_ip ip6_output ip6_local_out ip6_send_skb udp_v6_send_skb udpv6_sendmsg __sys_sendto __x64_sys_sendto do_syscall_64 entry_SYSCALL_64 __libc_sendto In any case, as expected, the swqueue lock doesn't seem to be causing any issues here. I was wondering if the contention you were seeing could be due to something like TTWU_QUEUE (though I wouldn't have expected that because we don't use the swqueue if a task is migrated following select_task_rq()), so I also tried collecting data with NO_TTWU_QUEUE (only on the 2-socket SKL), but the results matched what I observed without it: Default workload: NO_SWQUEUE: 57552.69 SWQUEUE: 58320.29 Delta: +1.33% UDP_RR: NO_SWQUEUE: 76296.46 SWQUEUE: 75418.35 Delta: -1.15% I also tried running kernel compile workloads just to verify they didn't regress per Peter's request: 26 / 52 x 1 Cooperlake ---------------------- NO_SWQUEUE: 1187.72s SWQUEUE: 1185.39s Delta: +.20% 20 / 40 x 2 Skylake ------------------- NO_SWQUEUE: 418.15s SWQUEUE: 418.87s Delta: -.17% I'd call this essentially a no-op for kernel compile. 
In general, the results indicate some wins and some losses. For kernel compile, larger LLCs with more cores was a no-op. For netperf, the default workload had some gains, and for UPD_RR, some losses. With all that said, I have a few thoughts. Firstly, would you mind please sharing your .config? It's possible that the hosts I'm testing on just don't have big enough LLCs to observe the contention you're seeing on the swqueue spinlock, as my 26 / 52 CPL host is smaller than a single socket of your 32/64 ICL host. On the other hand, your ICL isn't _that_ much bigger per-LLC, so I'd be curious to see if there's a .config difference here that's causing the contention. Also, the fact that Milan (which has only 6 cores / 12 threads per LLC) also saw a performance hit with swqueue for UDP_RR suggests to me that the issue with UDP_RR is not the scalability of the per-LLC swqueue spinlock. Regardless, even if there are some SKUs that have so many cores per LLC that we do end up contending, I don't think that should preclude adding a feature that which does benefit plenty of other configurations that have fewer cores per LLC. Thanks, David
Hi David, On Wed, Jun 14, 2023 at 07:01:03PM -0500, David Vernet wrote: > Hi Aaron, > > Thanks for taking a look and running some benchmarks. I tried to > reproduce your results on a 26 core / 52 thread Cooperlake host, and a > 20 core / 40 thread x 2 Skylake host, but I wasn't able to observe the > contention on the per-swqueue spinlock you were saw on your Ice Lake. > > I ran the following netperf benchmarks: > > - netperf -l 60 -n $(nproc) -6 > - netperf -l 60 -n $(nproc) -6 -t UDP_RR Just to confirm: did you run multiple of the above or just one? > > Here are the results I'm seeing from taking the average of running those > benchmarks three times on each host (all results are from the "Throughput" > column of the below table). > > Recv Send Send > Socket Socket Message Elapsed > Size Size Size Time Throughput <<< > bytes bytes bytes secs. 10^6bits/sec > > > 26 / 52 x 1 Cooperlake > ---------------------- > > Default workload: > > NO_SWQUEUE: 34103.76 > SWQUEUE: 34707.46 > Delta: +1.77% > > UDP_RR: > > NO_SWQUEUE: 57695.29 > SWQUEUE: 54827.36 > Delta: -4.97% > > There's clearly a statistically significant decline in performance for > UDP_RR here, but surprisingly, I don't see any contention on swqueue > locks when I profile with perf. Rather, we seem to be contending on the > rq lock (presumably in swqueue_pick_next_task() which is unexpected (to > me, at least): I don't see rq lock contention though and since you have seen some contention here, I suppose you should have started multiple instances of netperf client? > > 7.81% netperf [kernel.vmlinux] [k] queued_spin_lock_slowpath > | > --7.81%--queued_spin_lock_slowpath > | > --7.55%--task_rq_lock > pick_next_task_fair > schedule > schedule_timeout > __skb_wait_for_more_packets > __skb_recv_udp > udpv6_recvmsg > inet6_recvmsg > __x64_sys_recvfrom > do_syscall_64 > entry_SYSCALL_64 > __libc_recvfrom > 0x7f923fdfacf0 > 0 > 4.97% netperf [kernel.vmlinux] [k] _raw_spin_lock_irqsave > | > --4.96%--_raw_spin_lock_irqsave > | > |--1.13%--prepare_to_wait_exclusive > | __skb_wait_for_more_packets > | __skb_recv_udp > | udpv6_recvmsg > | inet6_recvmsg > | __x64_sys_recvfrom > | do_syscall_64 > | entry_SYSCALL_64 > | __libc_recvfrom > | 0x7f923fdfacf0 > | 0 > | > |--0.92%--default_wake_function > | autoremove_wake_function > | __wake_up_sync_key > | sock_def_readable > | __udp_enqueue_schedule_skb > | udpv6_queue_rcv_one_skb > | __udp6_lib_rcv > | ip6_input > | ipv6_rcv > | process_backlog > | net_rx_action > | __softirqentry_text_start > | __local_bh_enable_ip > | ip6_output > | ip6_local_out > | ip6_send_skb > | udp_v6_send_skb > | udpv6_sendmsg > | __sys_sendto > | __x64_sys_sendto > | do_syscall_64 > | entry_SYSCALL_64 > | __libc_sendto > | > |--0.81%--__wake_up_sync_key > | sock_def_readable > | __udp_enqueue_schedule_skb > | udpv6_queue_rcv_one_skb > | __udp6_lib_rcv > | ip6_input > | ipv6_rcv > | process_backlog > | net_rx_action > | __softirqentry_text_start > | __local_bh_enable_ip > | ip6_output > | ip6_local_out > | ip6_send_skb > | udp_v6_send_skb > | udpv6_sendmsg > | __sys_sendto > | __x64_sys_sendto > | do_syscall_64 > | entry_SYSCALL_64 > | __libc_sendto > | > |--0.73%--enqueue_task_fair > | | > | --0.72%--default_wake_function > | autoremove_wake_function > | __wake_up_sync_key > | sock_def_readable > | __udp_enqueue_schedule_skb > | udpv6_queue_rcv_one_skb > | __udp6_lib_rcv > | ip6_input > | ipv6_rcv > | process_backlog > | net_rx_action > | __softirqentry_text_start > | __local_bh_enable_ip > | ip6_output > | ip6_local_out > | 
ip6_send_skb > | udp_v6_send_skb > | udpv6_sendmsg > | __sys_sendto > | __x64_sys_sendto > | do_syscall_64 > | entry_SYSCALL_64 > | __libc_sendto > | > --0.68%--task_rq_lock > pick_next_task_fair > schedule > schedule_timeout > __skb_wait_for_more_packets > __skb_recv_udp > udpv6_recvmsg > inet6_recvmsg > __x64_sys_recvfrom > do_syscall_64 > entry_SYSCALL_64 > __libc_recvfrom > 0x7f923fdfacf0 > 0 > > The profile without swqueue doesn't show any contention on the rq lock, > but does show us spending a good amount of time in the scheduler: > > 4.03% netperf [kernel.vmlinux] [k] default_wake_function > | > --3.98%--default_wake_function > autoremove_wake_function > __wake_up_sync_key > sock_def_readable > __udp_enqueue_schedule_skb > udpv6_queue_rcv_one_skb > __udp6_lib_rcv > ip6_input > ipv6_rcv > process_backlog > net_rx_action > __softirqentry_text_start > __local_bh_enable_ip > ip6_output > ip6_local_out > ip6_send_skb > udp_v6_send_skb > udpv6_sendmsg > __sys_sendto > __x64_sys_sendto > do_syscall_64 > entry_SYSCALL_64 > __libc_sendto > 3.70% netperf [kernel.vmlinux] [k] enqueue_entity > | > --3.66%--enqueue_entity > enqueue_task_fair > | > --3.66%--default_wake_function > autoremove_wake_function > __wake_up_sync_key > sock_def_readable > __udp_enqueue_schedule_skb > udpv6_queue_rcv_one_skb > __udp6_lib_rcv > ip6_input > ipv6_rcv > process_backlog > net_rx_action > __softirqentry_text_start > __local_bh_enable_ip > ip6_output > ip6_local_out > ip6_send_skb > udp_v6_send_skb > udpv6_sendmsg > __sys_sendto > __x64_sys_sendto > do_syscall_64 > entry_SYSCALL_64 > __libc_sendto > > There's clearly a measurable impact on performance here due to swqueue > (negative for UDP_RR, positive for the default workload), but this looks > quite different than what you were observing. Yes. > > 20 / 40 x 2 Skylake > ------------------- > > Default workload: > > NO_SWQUEUE: 57437.45 > SWQUEUE: 58801.11 > Delta: +2.37% > > UDP_RR: > > NO_SWQUEUE: 75932.28 > SWQUEUE: 75232.77 > Delta: -.92% > > Same observation here. I didn't collect a profile, but the trend seems > consistent, and there's clearly a tradeoff. Despite the small drop in > perf for UDP_RR, it seems quite a bit less drastic than what would be > expected with the contention you showed in your profile. Indeed. > > 7950X (8 cores / 16 threads per CCX, 2 CCXs) > -------------------------------------------- > > Default workload: > > NO_SWQUEUE: 77615.08 > SWQUEUE: 77465.73 > Delta: -.19% > > UDP_RR: > > NO_SWQUEUE: 230258.75 > SWQUEUE: 230280.34 > Delta: ~0% > > I'd call this essentially a no-op. > > With all that said, I have a few thoughts. > > Firstly, would you mind please sharing your .config? It's possible that > the hosts I'm testing on just don't have big enough LLCs to observe the > contention you're seeing on the swqueue spinlock, as my 26 / 52 CPL host > is smaller than a single socket of your 32/64 ICL host. On the other > hand, your ICL isn't _that_ much bigger per-LLC, so I'd be curious to Agreed. Your 26cores/52threads isn't that smaller than mine and I had expected to see something similar. > see if there's a .config difference here that's causing the contention. Attached my .config. 
I also pushed the branch I used for testing to github just in case you want to take a look: https://github.com/aaronlu/linux.git swqueue > Also, the fact that Milan (which has only 6 cores / 12 threads per LLC) > also saw a performance hit with swqueue for UDP_RR suggests to me that > the issue with UDP_RR is not the scalability of the per-LLC swqueue > spinlock. I've tested again today and I still saw serious contention on swqueue->lock with your cmdline. I did it this way: " $ netserver Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC $ for i in `seq 128`; do netperf -l 60 -n 128 -6 -t UDP_RR & done " And the profile is about the same as my last posting: 83.23% 83.18% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath - - | |--40.23%--sendto | entry_SYSCALL_64 | do_syscall_64 | | | --40.22%--__x64_sys_sendto | __sys_sendto | sock_sendmsg | inet6_sendmsg | udpv6_sendmsg | udp_v6_send_skb | ip6_send_skb | ip6_local_out | ip6_output | ip6_finish_output | ip6_finish_output2 | __dev_queue_xmit | __local_bh_enable_ip | do_softirq.part.0 | __do_softirq | net_rx_action | __napi_poll | process_backlog | __netif_receive_skb | __netif_receive_skb_one_core | ipv6_rcv | ip6_input | ip6_input_finish | ip6_protocol_deliver_rcu | udpv6_rcv | __udp6_lib_rcv | udp6_unicast_rcv_skb | udpv6_queue_rcv_skb | udpv6_queue_rcv_one_skb | __udp_enqueue_schedule_skb | sock_def_readable | __wake_up_sync_key | __wake_up_common_lock | | | --40.22%--__wake_up_common | receiver_wake_function | autoremove_wake_function | default_wake_function | try_to_wake_up | ttwu_do_activate | enqueue_task | enqueue_task_fair | | | --40.22%--_raw_spin_lock_irqsave | | | --40.21%--native_queued_spin_lock_slowpath | |--38.59%--recvfrom | | | --38.59%--entry_SYSCALL_64 | do_syscall_64 | __x64_sys_recvfrom | __sys_recvfrom | sock_recvmsg | inet6_recvmsg | udpv6_recvmsg | __skb_recv_udp | __skb_wait_for_more_packets | schedule_timeout | schedule | __schedule | | | --38.59%--pick_next_task_fair | | | --38.59%--swqueue_remove_task | | | --38.59%--_raw_spin_lock_irqsave | | | --38.58%--native_queued_spin_lock_slowpath | |--2.25%--sendto | entry_SYSCALL_64 | do_syscall_64 | | | --2.25%--__x64_sys_sendto | __sys_sendto | sock_sendmsg | inet6_sendmsg | udpv6_sendmsg | udp_v6_send_skb | ip6_send_skb | ip6_local_out | ip6_output | ip6_finish_output | ip6_finish_output2 | __dev_queue_xmit | __local_bh_enable_ip | do_softirq.part.0 | __do_softirq | net_rx_action | __napi_poll | process_backlog | __netif_receive_skb | __netif_receive_skb_one_core | ipv6_rcv | ip6_input | ip6_input_finish | ip6_protocol_deliver_rcu | udpv6_rcv | __udp6_lib_rcv | udp6_unicast_rcv_skb | udpv6_queue_rcv_skb | udpv6_queue_rcv_one_skb | __udp_enqueue_schedule_skb | sock_def_readable | __wake_up_sync_key | __wake_up_common_lock | | | --2.25%--__wake_up_common | receiver_wake_function | autoremove_wake_function | default_wake_function | try_to_wake_up | ttwu_do_activate | enqueue_task | enqueue_task_fair | _raw_spin_lock_irqsave | | | --2.25%--native_queued_spin_lock_slowpath | --2.10%--recvfrom entry_SYSCALL_64 do_syscall_64 __x64_sys_recvfrom __sys_recvfrom sock_recvmsg inet6_recvmsg udpv6_recvmsg __skb_recv_udp __skb_wait_for_more_packets schedule_timeout schedule __schedule | --2.09%--pick_next_task_fair swqueue_remove_task _raw_spin_lock_irqsave | --2.09%--native_queued_spin_lock_slowpath During the test, vmstat showed lines like this: procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- r b swpd free buff 
cache si so bi bo in cs us sy id wa st 185 0 0 128555784 47348 1791244 0 0 0 0 32387 4380484 1 98 0 0 0 When swqueue is disabled, vmstat showed lines like this and there is no lock contention: procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 116 0 0 128977136 58324 1338760 0 0 0 16 599677 9734395 5 72 23 0 0 The runnable tasks are a lot more when swqueue is in use and sys% also increased, perhaps due to the lock contention. I'll see if I can find a smaller machine and give it a run there too. Thanks, Aaron
On Thu, Jun 15, 2023 at 12:49:17PM +0800, Aaron Lu wrote:
> I'll see if I can find a smaller machine and give it a run there too.
Found a Skylake with 18cores/36threads on each socket/LLC and with
netperf, the contention is still serious.
"
$ netserver
$ sudo sh -c "echo SWQUEUE > /sys/kernel/debug/sched/features"
$ for i in `seq 72`; do netperf -l 60 -n 72 -6 -t UDP_RR & done
"
53.61% 53.61% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath - -
|
|--27.93%--sendto
| entry_SYSCALL_64
| do_syscall_64
| |
| --27.93%--__x64_sys_sendto
| __sys_sendto
| sock_sendmsg
| inet6_sendmsg
| udpv6_sendmsg
| udp_v6_send_skb
| ip6_send_skb
| ip6_local_out
| ip6_output
| ip6_finish_output
| ip6_finish_output2
| __dev_queue_xmit
| __local_bh_enable_ip
| do_softirq.part.0
| __do_softirq
| net_rx_action
| __napi_poll
| process_backlog
| __netif_receive_skb
| __netif_receive_skb_one_core
| ipv6_rcv
| ip6_input
| ip6_input_finish
| ip6_protocol_deliver_rcu
| udpv6_rcv
| __udp6_lib_rcv
| udp6_unicast_rcv_skb
| udpv6_queue_rcv_skb
| udpv6_queue_rcv_one_skb
| __udp_enqueue_schedule_skb
| sock_def_readable
| __wake_up_sync_key
| __wake_up_common_lock
| |
| --27.85%--__wake_up_common
| receiver_wake_function
| autoremove_wake_function
| default_wake_function
| try_to_wake_up
| |
| --27.85%--ttwu_do_activate
| enqueue_task
| enqueue_task_fair
| |
| --27.85%--_raw_spin_lock_irqsave
| |
| --27.85%--native_queued_spin_lock_slowpath
|
--25.67%--recvfrom
entry_SYSCALL_64
do_syscall_64
__x64_sys_recvfrom
__sys_recvfrom
sock_recvmsg
inet6_recvmsg
udpv6_recvmsg
__skb_recv_udp
|
--25.67%--__skb_wait_for_more_packets
schedule_timeout
schedule
__schedule
|
--25.66%--pick_next_task_fair
|
--25.65%--swqueue_remove_task
|
--25.65%--_raw_spin_lock_irqsave
|
--25.65%--native_queued_spin_lock_slowpath
I didn't aggregate the throughput(Trans. Rate per sec) from all these
clients, but a glimpse from the result showed that the throughput of
each client dropped from 4xxxx(NO_SWQUEUE) to 2xxxx(SWQUEUE).
Thanks,
Aaron
On Thu, Jun 15, 2023 at 03:31:53PM +0800, Aaron Lu wrote: > On Thu, Jun 15, 2023 at 12:49:17PM +0800, Aaron Lu wrote: > > I'll see if I can find a smaller machine and give it a run there too. > > Found a Skylake with 18cores/36threads on each socket/LLC and with > netperf, the contention is still serious. > > " > $ netserver > $ sudo sh -c "echo SWQUEUE > /sys/kernel/debug/sched/features" > $ for i in `seq 72`; do netperf -l 60 -n 72 -6 -t UDP_RR & done > " > > 53.61% 53.61% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath - - > | > |--27.93%--sendto > | entry_SYSCALL_64 > | do_syscall_64 > | | > | --27.93%--__x64_sys_sendto > | __sys_sendto > | sock_sendmsg > | inet6_sendmsg > | udpv6_sendmsg > | udp_v6_send_skb > | ip6_send_skb > | ip6_local_out > | ip6_output > | ip6_finish_output > | ip6_finish_output2 > | __dev_queue_xmit > | __local_bh_enable_ip > | do_softirq.part.0 > | __do_softirq > | net_rx_action > | __napi_poll > | process_backlog > | __netif_receive_skb > | __netif_receive_skb_one_core > | ipv6_rcv > | ip6_input > | ip6_input_finish > | ip6_protocol_deliver_rcu > | udpv6_rcv > | __udp6_lib_rcv > | udp6_unicast_rcv_skb > | udpv6_queue_rcv_skb > | udpv6_queue_rcv_one_skb > | __udp_enqueue_schedule_skb > | sock_def_readable > | __wake_up_sync_key > | __wake_up_common_lock > | | > | --27.85%--__wake_up_common > | receiver_wake_function > | autoremove_wake_function > | default_wake_function > | try_to_wake_up > | | > | --27.85%--ttwu_do_activate > | enqueue_task > | enqueue_task_fair > | | > | --27.85%--_raw_spin_lock_irqsave > | | > | --27.85%--native_queued_spin_lock_slowpath > | > --25.67%--recvfrom > entry_SYSCALL_64 > do_syscall_64 > __x64_sys_recvfrom > __sys_recvfrom > sock_recvmsg > inet6_recvmsg > udpv6_recvmsg > __skb_recv_udp > | > --25.67%--__skb_wait_for_more_packets > schedule_timeout > schedule > __schedule > | > --25.66%--pick_next_task_fair > | > --25.65%--swqueue_remove_task > | > --25.65%--_raw_spin_lock_irqsave > | > --25.65%--native_queued_spin_lock_slowpath > > I didn't aggregate the throughput(Trans. Rate per sec) from all these > clients, but a glimpse from the result showed that the throughput of > each client dropped from 4xxxx(NO_SWQUEUE) to 2xxxx(SWQUEUE). > > Thanks, > Aaron Ok, it seems that the issue is that I wasn't creating enough netperf clients. I assumed that -n $(nproc) was sufficient. 
I was able to repro the contention on my 26 core / 52 thread skylake client as well: 41.01% netperf [kernel.vmlinux] [k] queued_spin_lock_slowpath | --41.01%--queued_spin_lock_slowpath | --40.63%--_raw_spin_lock_irqsave | |--21.18%--enqueue_task_fair | | | --21.09%--default_wake_function | | | --21.09%--autoremove_wake_function | | | --21.09%--__wake_up_sync_key | sock_def_readable | __udp_enqueue_schedule_skb | udpv6_queue_rcv_one_skb | __udp6_lib_rcv | ip6_input | ipv6_rcv | process_backlog | net_rx_action | | | --21.09%--__softirqentry_text_start | __local_bh_enable_ip | ip6_output | ip6_local_out | ip6_send_skb | udp_v6_send_skb | udpv6_sendmsg | __sys_sendto | __x64_sys_sendto | do_syscall_64 | entry_SYSCALL_64 | --19.44%--swqueue_remove_task | --19.42%--pick_next_task_fair | --19.42%--schedule | --19.21%--schedule_timeout __skb_wait_for_more_packets __skb_recv_udp udpv6_recvmsg inet6_recvmsg __x64_sys_recvfrom do_syscall_64 entry_SYSCALL_64 40.87% netserver [kernel.vmlinux] [k] queued_spin_lock_slowpath | --40.87%--queued_spin_lock_slowpath | --40.51%--_raw_spin_lock_irqsave | |--21.03%--enqueue_task_fair | | | --20.94%--default_wake_function | | | --20.94%--autoremove_wake_function | | | --20.94%--__wake_up_sync_key | sock_def_readable | __udp_enqueue_schedule_skb | udpv6_queue_rcv_one_skb | __udp6_lib_rcv | ip6_input | ipv6_rcv | process_backlog | net_rx_action | | | --20.94%--__softirqentry_text_start | __local_bh_enable_ip | ip6_output | ip6_local_out | ip6_send_skb | udp_v6_send_skb | udpv6_sendmsg | __sys_sendto | __x64_sys_sendto | do_syscall_64 | entry_SYSCALL_64 | --19.48%--swqueue_remove_task | --19.47%--pick_next_task_fair schedule | --19.38%--schedule_timeout __skb_wait_for_more_packets __skb_recv_udp udpv6_recvmsg inet6_recvmsg __x64_sys_recvfrom do_syscall_64 entry_SYSCALL_64 Thanks for the help in getting the repro on my end. So yes, there is certainly a scalability concern to bear in mind for swqueue for LLCs with a lot of cores. If you have a lot of tasks quickly e.g. blocking and waking on futexes in a tight loop, I expect a similar issue would be observed. On the other hand, the issue did not occur on my 7950X. I also wasn't able to repro the contention on the Skylake if I ran with the default netperf workload rather than UDP_RR (even with the additional clients). I didn't bother to take the mean of all of the throughput results between NO_SWQUEUE and SWQUEUE, but they looked roughly equal. So swqueue isn't ideal for every configuration, but I'll echo my sentiment from [0] that this shouldn't on its own necessarily preclude it from being merged given that it does help a large class of configurations and workloads, and it's disabled by default. [0]: https://lore.kernel.org/all/20230615000103.GC2883716@maniforge/ Thanks, David
On Thu, Jun 15, 2023 at 06:26:05PM -0500, David Vernet wrote:
> Ok, it seems that the issue is that I wasn't creating enough netperf
> clients. I assumed that -n $(nproc) was sufficient. I was able to repro

Yes, that switch is confusing.

> the contention on my 26 core / 52 thread skylake client as well:
>
>
> Thanks for the help in getting the repro on my end.

You are welcome.

> So yes, there is certainly a scalability concern to bear in mind for
> swqueue for LLCs with a lot of cores. If you have a lot of tasks quickly
> e.g. blocking and waking on futexes in a tight loop, I expect a similar
> issue would be observed.
>
> On the other hand, the issue did not occur on my 7950X. I also wasn't

Using netperf/UDP_RR?

> able to repro the contention on the Skylake if I ran with the default
> netperf workload rather than UDP_RR (even with the additional clients).

I also tried that on the Skylake with 18 cores/36 threads per LLC, and the
contention is indeed much smaller than with UDP_RR:

    7.30%    7.29%  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath

But I wouldn't say it's entirely gone. Also consider that Skylake has far
fewer cores per LLC than later Intel servers like Icelake and Sapphire
Rapids, so I expect things would be worse on those two machines.

> I didn't bother to take the mean of all of the throughput results
> between NO_SWQUEUE and SWQUEUE, but they looked roughly equal.
>
> So swqueue isn't ideal for every configuration, but I'll echo my
> sentiment from [0] that this shouldn't on its own necessarily preclude
> it from being merged given that it does help a large class of
> configurations and workloads, and it's disabled by default.
>
> [0]: https://lore.kernel.org/all/20230615000103.GC2883716@maniforge/

I was wondering: does it make sense to split things up on machines with big
LLCs? Like converting the per-LLC swqueue into per-group swqueues, where
each group is made up of ~8 CPUs of the same LLC. That would have an effect
similar to reducing the number of CPUs per LLC, so the scalability issue
can hopefully be fixed while still helping some workloads. I realize this
isn't ideal in that wakeups happen at LLC scale, so the group idea may not
fit very well here.

Just a thought, feel free to ignore it if you don't think this is
feasible :-)

Thanks,
Aaron
On Tue, 13 Jun 2023 at 07:20, David Vernet <void@manifault.com> wrote: > > Overview > ======== > > The scheduler must constantly strike a balance between work > conservation, and avoiding costly migrations which harm performance due > to e.g. decreased cache locality. The matter is further complicated by > the topology of the system. Migrating a task between cores on the same > LLC may be more optimal than keeping a task local to the CPU, whereas > migrating a task between LLCs or NUMA nodes may tip the balance in the > other direction. > > With that in mind, while CFS is by and large mostly a work conserving > scheduler, there are certain instances where the scheduler will choose > to keep a task local to a CPU, when it would have been more optimal to > migrate it to an idle core. > > An example of such a workload is the HHVM / web workload at Meta. HHVM > is a VM that JITs Hack and PHP code in service of web requests. Like > other JIT / compilation workloads, it tends to be heavily CPU bound, and > exhibit generally poor cache locality. To try and address this, we set > several debugfs (/sys/kernel/debug/sched) knobs on our HHVM workloads: > > - migration_cost_ns -> 0 > - latency_ns -> 20000000 > - min_granularity_ns -> 10000000 > - wakeup_granularity_ns -> 12000000 > > These knobs are intended both to encourage the scheduler to be as work > conserving as possible (migration_cost_ns -> 0), and also to keep tasks > running for relatively long time slices so as to avoid the overhead of > context switching (the other knobs). Collectively, these knobs provide a > substantial performance win; resulting in roughly a 20% improvement in > throughput. Worth noting, however, is that this improvement is _not_ at > full machine saturation. > > That said, even with these knobs, we noticed that CPUs were still going > idle even when the host was overcommitted. In response, we wrote the > "shared wakequeue" (swqueue) feature proposed in this patch set. The > idea behind swqueue is simple: it enables the scheduler to be > aggressively work conserving by placing a waking task into a per-LLC > FIFO queue that can be pulled from by another core in the LLC FIFO queue > which can then be pulled from before it goes idle. This seems to be just another newly idle load balance outside the current one ! The knobs above are not the only thing preventing a rq to pull a new task. We have rq->avg_idle, curr_cost and sd->max_newidle_lb_cost stuff which might be one main root cause for one of your cpu not pulling a waiting task It's not clear in your explanation why fixing newly_idle_load_balance was not possible instead of adding outside code and what prevents newly_idle_load balance from picking a task in your case ? For example, have you tried to disable the early break because of avg_idle ? > > With this simple change, we were able to achieve a 1 - 1.6% improvement > in throughput, as well as a small, consistent improvement in p95 and p99 > latencies, in HHVM. These performance improvements were in addition to > the wins from the debugfs knobs mentioned above. > > Design > ====== > > The design of swqueue is quite simple. An swqueue is simply a struct > list_head, and a spinlock: > > struct swqueue { > struct list_head list; > spinlock_t lock; > } ____cacheline_aligned; > > We create a struct swqueue per LLC, ensuring they're in their own > cachelines to avoid false sharing between CPUs on different LLCs. 
> > When a task first wakes up, it enqueues itself in the swqueue of its > current LLC at the end of enqueue_task_fair(). Enqueues only happen if > the task was not manually migrated to the current core by > select_task_rq(), and is not pinned to a specific CPU. > > A core will pull a task from its LLC's swqueue before calling > newidle_balance(). > > Difference between SIS_NODE > =========================== > > In [0] Peter proposed a patch that addresses Tejun's observations that > when workqueues are targeted towards a specific LLC on his Zen2 machine > with small CCXs, that there would be significant idle time due to > select_idle_sibling() not considering anything outside of the current > LLC. > > This patch (SIS_NODE) is essentially the complement to the proposal > here. SID_NODE causes waking tasks to look for idle cores in neighboring > LLCs on the same die, whereas swqueue causes cores about to go idle to > look for enqueued tasks. That said, in its current form, the two > features at are a different scope as SIS_NODE searches for idle cores > between LLCs, while swqueue enqueues tasks within a single LLC. > > The patch was since removed in [1], but we elect to compare its > performance to swqueue given that as described above, it's conceptually > complementary. > > [0]: https://lore.kernel.org/all/20230530113249.GA156198@hirez.programming.kicks-ass.net/ > [1]: https://lore.kernel.org/all/20230605175636.GA4253@hirez.programming.kicks-ass.net/ > > I observed that while SIS_NODE works quite well for hosts with small > CCXs, it can result in degraded performance on machines either with a > large number of total cores in a CCD, or for which the cache miss > penalty of migrating between CCXs is high, even on the same die. > > For example, on Zen 4c hosts (Bergamo), CCXs within a CCD are muxed > through a single link to the IO die, and thus have similar cache miss > latencies as cores in remote CCDs. > > Such subtleties could be taken into account with SIS_NODE, but > regardless, both features are conceptually complementary sides of the > same coin. SIS_NODE searches for idle cores for waking threads, whereas > swqueue searches for available work before a core goes idle. > > Results > ======= > > Note that the motivation for the shared wakequeue feature was originally > arrived at using experiments in the sched_ext framework that's currently > being proposed upstream. The ~1 - 1.6% improvement in HHVM throughput > is similarly visible using work-conserving sched_ext schedulers (even > very simple ones like global FIFO). > > In both single and multi socket / CCX hosts, this can measurably improve > performance. In addition to the performance gains observed on our > internal web workloads, we also observed an improvement in common > workloads such as kernel compile when running shared wakequeue. 
Here are > the results of running make -j$(nproc) built-in.a on several different > types of hosts configured with make allyesconfig on commit a27648c74210 > ("afs: Fix setting of mtime when creating a file/dir/symlink") on Linus' > tree (boost was disabled on all of these hosts when the experiments were > performed): > > Single-socket | 32-core | 2-CCX | AMD 7950X Zen4 > > CPU max MHz: 5879.8818 > CPU min MHz: 3000.0000 > o____________o_______o > | mean | CPU | > o------------o-------o > NO_SWQUEUE + NO_SIS_NODE: | 590.52s | 3103% | > NO_SWQUEUE + SIS_NODE: | 590.80s | 3102% | > SWQUEUE + NO_SIS_NODE: | 589.65s | 3116% | > SWQUEUE + SIS_NODE: | 589.99s | 3115% | > o------------o-------o > > Takeaway: swqueue doesn't seem to provide a statistically significant > improvement for kernel compile on my 7950X. SIS_NODE similarly does not > have a noticeable effect on performance. > > ------------------------------------------------------------------------------- > > Single-socket | 72-core | 6-CCX | AMD Milan Zen3 > > CPU max MHz: 3245.0190 > CPU min MHz: 700.0000 > o_____________o_______o > | mean | CPU | > o-------------o-------o > NO_SWQUEUE + NO_SIS_NODE: | 1608.69s | 6488% | > NO_SWQUEUE + SIS_NODE: | 1610.24s | 6473% | > SWQUEUE + NO_SIS_NODE: | 1605.80s | 6504% | > SWQUEUE + SIS_NODE: | 1606.96s | 6488% | > o-------------o-------o > > Takeaway: swqueue does provide a small statistically significant > improvement on Milan, but the compile times in general were quite long > relative to the 7950X Zen4, and the Bergamo Zen4c due to the lower clock > frequency. Milan also has larger CCXs than Bergamo, so it stands to > reason that select_idle_sibling() will have an easier time finding idle > cores inside the current CCX. > > It also seems logical that SIS_NODE would hurt performance a bit here, > as all cores / CCXs are in the same NUMA node, so select_idle_sibling() > has to iterate over 72 cores; delaying task wakeup. That said, I'm not > sure that's a viable theory if total CPU% is lower with SIS_NODE. > > ------------------------------------------------------------------------------- > > Single-socket | 176-core | 11-CCX | 2-CCX per CCD | AMD Bergamo Zen4c > > CPU max MHz: 1200.0000 > CPU min MHz: 1000.0000 > > o____________o________o > | mean | CPU | > o------------o--------o > NO_SWQUEUE + NO_SIS_NODE: | 322.44s | 15534% | > NO_SWQUEUE + SIS_NODE: | 324.39s | 15508% | > SWQUEUE + NO_SIS_NODE: | 321.54s | 15603% | > SWQUEUE + SIS_NODE: | 321.88s | 15622% | > o------------o--------o > > Takeaway: swqueue barely beats NO_SWQUEUE + NO_SIS_NODE, to the point > that it's arguably not statistically significant. > SIS_NODE results in a ~.9% performance degradation, for likely the same > reason as Milan: the host has a large number of LLCs within a single > socket, so task wakeup latencies suffer due to select_idle_node() > searching up to 11 CCXs. > > Conclusion > ========== > > swqueue in this form seems to provide a small, but noticeable win for > front-end CPU-bound workloads spread over multiple CCXs. The reason > seems fairly straightforward: swqueue encourages work conservation > inside of a CCX by having a CPU do an O(1) pull from a per-LLC queue of > runnable tasks. As mentioned above, it is complementary to SIS_NODE, > which searches for idle cores on the wakeup path. > > While swqueue in this form encourages work conservation, it of course > does not guarantee it given that we don't implement any kind of work > stealing between swqueues. 
In the future, we could potentially push CPU > utilization even higher by enabling work stealing between swqueues, > likely between CCXs on the same NUMA node. > > Originally-by: Roman Gushchin <roman.gushchin@linux.dev> > Signed-off-by: David Vernet <void@manifault.com> > --- > include/linux/sched.h | 2 + > kernel/sched/core.c | 2 + > kernel/sched/fair.c | 175 ++++++++++++++++++++++++++++++++++++++++-- > kernel/sched/sched.h | 2 + > 4 files changed, 176 insertions(+), 5 deletions(-) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 1292d38d66cc..1f4fd22f88a8 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -770,6 +770,8 @@ struct task_struct { > unsigned long wakee_flip_decay_ts; > struct task_struct *last_wakee; > > + struct list_head swqueue_node; > + > /* > * recent_used_cpu is initially set as the last CPU used by a task > * that wakes affine another task. Waker/wakee relationships can > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index d911b0631e7b..e04f0daf1f05 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -4533,6 +4533,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) > #ifdef CONFIG_SMP > p->wake_entry.u_flags = CSD_TYPE_TTWU; > p->migration_pending = NULL; > + INIT_LIST_HEAD(&p->swqueue_node); > #endif > init_sched_mm_cid(p); > } > @@ -9872,6 +9873,7 @@ void __init sched_init_smp(void) > > init_sched_rt_class(); > init_sched_dl_class(); > + init_sched_fair_class_late(); > > sched_smp_initialized = true; > } > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 807986bd6ea6..29fe25794884 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -139,17 +139,151 @@ static int __init setup_sched_thermal_decay_shift(char *str) > } > __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift); > > +/** > + * struct swqueue - Per-LLC queue structure for enqueuing and pulling waking > + * tasks. > + * > + * WHAT > + * ==== > + * > + * This structure enables the scheduler to be more aggressively work > + * conserving, by placing waking tasks on a per-LLC FIFO queue that can then be > + * pulled from when another core in the LLC is going to go idle. > + * > + * struct rq stores a pointer to its LLC's swqueue via struct cfs_rq. Waking > + * tasks are enqueued in a swqueue at the end of enqueue_task_fair(), and are > + * opportunistically pulled from the swqueue in pick_next_task_fair() prior to > + * invoking newidle_balance(). Tasks enqueued in a swqueue be scheduled prior > + * to being pulled from the swqueue, in which case they're simply removed from > + * the swqueue. A waking task is only enqueued to a swqueue when it was _not_ > + * manually migrated to the current runqueue by select_task_rq_fair(). > + * > + * There is currently no task-stealing between swqueues in different LLCs, > + * which means that swqueue is not fully work conserving. This could be added > + * at a later time, with tasks likely only being stolen across swqueues on the > + * same NUMA node to avoid violating NUMA affinities. > + * > + * HOW > + * === > + * > + * An swqueue is comprised of a list, and a spinlock for synchronization. Given > + * that the critical section for a swqueue is typically a fast list operation, > + * and that the swqueue is localized to a single LLC, the spinlock does not > + * seem to be contended; even on a heavily utilized host. 
struct swqueues are > + * also cacheline aligned to prevent false sharing between CPUs manipulating > + * swqueues in other LLCs. > + * > + * WHY > + * === > + * > + * As mentioned above, the main benefit of swqueue is that it enables more > + * aggressive work conservation in the scheduler. This can benefit workloads > + * that benefit more from CPU utilization than from L1/L2 cache locality. > + * > + * swqueues are segmented across LLCs both to avoid contention on the swqueue > + * spinlock by minimizing the number of CPUs that could contend on it, as well > + * as to strike a balance between work conservation, and L3 cache locality. > + */ > +struct swqueue { > + struct list_head list; > + spinlock_t lock; > +} ____cacheline_aligned; > + > #ifdef CONFIG_SMP > -static void swqueue_enqueue(struct rq *rq, struct task_struct *p, > - int enq_flags) > -{} > +static struct swqueue *rq_swqueue(struct rq *rq) > +{ > + return rq->cfs.swqueue; > +} > + > +static struct task_struct *swqueue_pull_task(struct swqueue *swqueue) > +{ > + unsigned long flags; > + > + struct task_struct *p; > + > + spin_lock_irqsave(&swqueue->lock, flags); > + p = list_first_entry_or_null(&swqueue->list, struct task_struct, > + swqueue_node); > + if (p) > + list_del_init(&p->swqueue_node); > + spin_unlock_irqrestore(&swqueue->lock, flags); > + > + return p; > +} > + > +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags) > +{ > + unsigned long flags; > + struct swqueue *swqueue; > + bool task_migrated = enq_flags & ENQUEUE_MIGRATED; > + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP; > + > + /* > + * Only enqueue the task in the shared wakequeue if: > + * > + * - SWQUEUE is enabled > + * - The task is on the wakeup path > + * - The task wasn't purposefully migrated to the current rq by > + * select_task_rq() > + * - The task isn't pinned to a specific CPU > + */ > + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1) > + return; > + > + swqueue = rq_swqueue(rq); > + spin_lock_irqsave(&swqueue->lock, flags); > + list_add_tail(&p->swqueue_node, &swqueue->list); > + spin_unlock_irqrestore(&swqueue->lock, flags); > +} > + > static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf) > { > - return 0; > + struct swqueue *swqueue; > + struct task_struct *p = NULL; > + struct rq *src_rq; > + struct rq_flags src_rf; > + int ret; > + > + swqueue = rq_swqueue(rq); > + if (!list_empty(&swqueue->list)) > + p = swqueue_pull_task(swqueue); > + > + if (!p) > + return 0; > + > + rq_unpin_lock(rq, rf); > + raw_spin_rq_unlock(rq); > + > + src_rq = task_rq_lock(p, &src_rf); > + > + if (task_on_rq_queued(p) && !task_on_cpu(rq, p)) > + src_rq = migrate_task_to(src_rq, &src_rf, p, cpu_of(rq)); > + > + if (src_rq->cpu != rq->cpu) > + ret = 1; > + else > + ret = -1; > + > + task_rq_unlock(src_rq, p, &src_rf); > + > + raw_spin_rq_lock(rq); > + rq_repin_lock(rq, rf); > + > + return ret; > } > > static void swqueue_remove_task(struct task_struct *p) > -{} > +{ > + unsigned long flags; > + struct swqueue *swqueue; > + > + if (!list_empty(&p->swqueue_node)) { > + swqueue = rq_swqueue(task_rq(p)); > + spin_lock_irqsave(&swqueue->lock, flags); > + list_del_init(&p->swqueue_node); > + spin_unlock_irqrestore(&swqueue->lock, flags); > + } > +} > > /* > * For asym packing, by default the lower numbered CPU has higher priority. 
> @@ -12839,3 +12973,34 @@ __init void init_sched_fair_class(void) > #endif /* SMP */ > > } > + > +__init void init_sched_fair_class_late(void) > +{ > +#ifdef CONFIG_SMP > + int i; > + struct swqueue *swqueue; > + struct rq *rq; > + struct rq *llc_rq; > + > + for_each_possible_cpu(i) { > + if (per_cpu(sd_llc_id, i) == i) { > + llc_rq = cpu_rq(i); > + > + swqueue = kzalloc_node(sizeof(struct swqueue), > + GFP_KERNEL, cpu_to_node(i)); > + INIT_LIST_HEAD(&swqueue->list); > + spin_lock_init(&swqueue->lock); > + llc_rq->cfs.swqueue = swqueue; > + } > + } > + > + for_each_possible_cpu(i) { > + rq = cpu_rq(i); > + llc_rq = cpu_rq(per_cpu(sd_llc_id, i)); > + > + if (rq == llc_rq) > + continue; > + rq->cfs.swqueue = llc_rq->cfs.swqueue; > + } > +#endif /* SMP */ > +} > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 5a86e9795731..daee5c64af87 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -575,6 +575,7 @@ struct cfs_rq { > #endif > > #ifdef CONFIG_SMP > + struct swqueue *swqueue; > /* > * CFS load tracking > */ > @@ -2380,6 +2381,7 @@ extern void update_max_interval(void); > extern void init_sched_dl_class(void); > extern void init_sched_rt_class(void); > extern void init_sched_fair_class(void); > +extern void init_sched_fair_class_late(void); > > extern void reweight_task(struct task_struct *p, int prio); > > -- > 2.40.1 >
Hello David, On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote: [..snip..] > +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags) > +{ > + unsigned long flags; > + struct swqueue *swqueue; > + bool task_migrated = enq_flags & ENQUEUE_MIGRATED; > + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP; > + > + /* > + * Only enqueue the task in the shared wakequeue if: > + * > + * - SWQUEUE is enabled > + * - The task is on the wakeup path > + * - The task wasn't purposefully migrated to the current rq by > + * select_task_rq() > + * - The task isn't pinned to a specific CPU > + */ > + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1) > + return; In select_task_rq_fair(), having determined if the target of task wakeup should be the task's previous CPU vs the waker's current CPU, we spend quite a bit of time already to determine if there is an idle core/CPU in the target's LLC. @rq would correspond to CPU chosen as a result of that scan or if no idle CPU exists, @rq corresponds to the target CPU determined by wake_affine_idle()/wake_affine_weight(). So if the CPU of @rq is idle here, can we not simply return here? Or if the idea is to avoid the scan for an idle core/CPU in select_task_rq_fair(), then Perhaps I am missing something... > + > + swqueue = rq_swqueue(rq); > + spin_lock_irqsave(&swqueue->lock, flags); > + list_add_tail(&p->swqueue_node, &swqueue->list); > + spin_unlock_irqrestore(&swqueue->lock, flags); > +} > + -- Thanks and Regards gautham.
On Fri, Jun 16, 2023 at 08:53:38AM +0800, Aaron Lu wrote: > On Thu, Jun 15, 2023 at 06:26:05PM -0500, David Vernet wrote: > > > Ok, it seems that the issue is that I wasn't creating enough netperf > > clients. I assumed that -n $(nproc) was sufficient. I was able to repro > > Yes that switch is confusing. > > > the contention on my 26 core / 52 thread skylake client as well: > > > > > > > Thanks for the help in getting the repro on my end. > > You are welcome. > > > So yes, there is certainly a scalability concern to bear in mind for > > swqueue for LLCs with a lot of cores. If you have a lot of tasks quickly > > e.g. blocking and waking on futexes in a tight loop, I expect a similar > > issue would be observed. > > > > On the other hand, the issue did not occur on my 7950X. I also wasn't > > Using netperf/UDP_RR? Correct > > able to repro the contention on the Skylake if I ran with the default > > netperf workload rather than UDP_RR (even with the additional clients). > > I also tried that on the 18cores/36threads/LLC Skylake and the contention > is indeed much smaller than UDP_RR: > > 7.30% 7.29% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath > > But I wouldn't say it's entirely gone. Also consider Skylake has a lot > fewer cores per LLC than later Intel servers like Icelake and Sapphire > Rapids and I expect things would be worse on those two machines. I cannot reproduce this contention locally, even on a slightly larger Skylake. Not really sure what to make of the difference here. Perhaps it's because you're running with CONFIG_SCHED_CORE=y? What is the change in throughput when you run the default workload on your SKL? > > I didn't bother to take the mean of all of the throughput results > > between NO_SWQUEUE and SWQUEUE, but they looked roughly equal. > > > > So swqueue isn't ideal for every configuration, but I'll echo my > > sentiment from [0] that this shouldn't on its own necessarily preclude > > it from being merged given that it does help a large class of > > configurations and workloads, and it's disabled by default. > > > > [0]: https://lore.kernel.org/all/20230615000103.GC2883716@maniforge/ > > I was wondering: does it make sense to do some divide on machines with > big LLCs? Like converting the per-LLC swqueue to per-group swqueue where > the group can be made of ~8 cpus of the same LLC. This will have a > similar effect of reducing the number of CPUs in a single LLC so the > scalability issue can hopefully be fixed while at the same time, it > might still help some workloads. I realized this isn't ideal in that > wakeup happens at LLC scale so the group thing may not fit very well > here. > > Just a thought, feel free to ignore it if you don't think this is > feasible :-) That's certainly an idea we could explore, but my inclination would be to keep everything at a per-LLC granularity. It makes it easier to reason about performance; both in terms of work conservation per-LLC (again, not every workload suffers from having large LLCs even if others do, and halving the size of a swqueue in an LLC could harm other workloads which benefit from the increased work conservation), and in terms of contention. To the latter point, I think it would be difficult to choose an LLC size that wasn't somewhat artificial and workload specific. If someone has that requirement, I think sched_ext would be a better alternative. Thanks, David
On Fri, Jun 16, 2023 at 10:08:57AM +0200, Vincent Guittot wrote: > On Tue, 13 Jun 2023 at 07:20, David Vernet <void@manifault.com> wrote: > > > > Overview > > ======== > > > > The scheduler must constantly strike a balance between work > > conservation, and avoiding costly migrations which harm performance due > > to e.g. decreased cache locality. The matter is further complicated by > > the topology of the system. Migrating a task between cores on the same > > LLC may be more optimal than keeping a task local to the CPU, whereas > > migrating a task between LLCs or NUMA nodes may tip the balance in the > > other direction. > > > > With that in mind, while CFS is by and large mostly a work conserving > > scheduler, there are certain instances where the scheduler will choose > > to keep a task local to a CPU, when it would have been more optimal to > > migrate it to an idle core. > > > > An example of such a workload is the HHVM / web workload at Meta. HHVM > > is a VM that JITs Hack and PHP code in service of web requests. Like > > other JIT / compilation workloads, it tends to be heavily CPU bound, and > > exhibit generally poor cache locality. To try and address this, we set > > several debugfs (/sys/kernel/debug/sched) knobs on our HHVM workloads: > > > > - migration_cost_ns -> 0 > > - latency_ns -> 20000000 > > - min_granularity_ns -> 10000000 > > - wakeup_granularity_ns -> 12000000 > > > > These knobs are intended both to encourage the scheduler to be as work > > conserving as possible (migration_cost_ns -> 0), and also to keep tasks > > running for relatively long time slices so as to avoid the overhead of > > context switching (the other knobs). Collectively, these knobs provide a > > substantial performance win; resulting in roughly a 20% improvement in > > throughput. Worth noting, however, is that this improvement is _not_ at > > full machine saturation. > > > > That said, even with these knobs, we noticed that CPUs were still going > > idle even when the host was overcommitted. In response, we wrote the > > "shared wakequeue" (swqueue) feature proposed in this patch set. The > > idea behind swqueue is simple: it enables the scheduler to be > > aggressively work conserving by placing a waking task into a per-LLC > > FIFO queue that can be pulled from by another core in the LLC FIFO queue > > which can then be pulled from before it goes idle. > > This seems to be just another newly idle load balance outside the current one ! Hi Vincent, I can bring the swqueue logic inside of newidle_balance(). In hindsight I think it makes more sense there. To answer your point more generally though, yes, this is a new approach to load balancing that eschews tracking migration costs, scanning runqueues, etc in favor of optimizing for work conservation, and tracking runnable tasks in a shared data structure. More on this below in response to your other points. > > The knobs above are not the only thing preventing a rq to pull a new > task. We have rq->avg_idle, curr_cost and sd->max_newidle_lb_cost > stuff which might be one main root cause for one of your cpu not > pulling a waiting task > > It's not clear in your explanation why fixing newly_idle_load_balance > was not possible instead of adding outside code and what prevents > newly_idle_load balance from picking a task in your case ? > > For example, have you tried to disable the early break because of avg_idle ? The goal of swqueue is to enable work conservation using a shared, per-LLC data structure. 
The shared data structure is really the salient point as to why just updating newidle_balance() wouldn't achieve the same result. newidle_balance() finds tasks to migrate by (sometimes) iterating over runqueues and scanning for tasks. It's an expensive operation, which doesn't scale to large machines or to being performed on every idle path. swqueue, on the other hand, is shared across all cores in an LLC, so pulling a runnable task is simply a matter of a spinlock acquire and a list operation. This doesn't scale to every single configuration as Aaron pointed out in [0], but it works well in many other configurations. [0]: https://lore.kernel.org/all/20230614043529.GA1942@ziqianlu-dell/ Another consideration is that even if we could adjust newidle_balance() to load balance well enough for our specific purpose, we're still relying on heuristics to determine when it's appropriate to load balance; and that will a) Inevitably be suboptimal for certain workloads and configurations. For example, if we got rid of the following check: 12021 for_each_domain(this_cpu, sd) { 12022 int continue_balancing = 1; 12023 u64 domain_cost; 12024 12025 update_next_balance(sd, &next_balance); 12026 12027 if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) 12028 break; we may end up load balancing too frequently, which could be a regression in and of itself per the scalability concerns mentioned above. Or, for certain workloads, we'll load balance too aggressively and miss out on L1/L2 locality. b) Be harder for users to effectively tune or use. At OSPM, Peter made it quite clear that users should not be tuning any of the debugfs knobs. Relying on heuristics and knobs like this feels antithetical to that policy. swqueue feels like a reasonable, self-contained alternative to that. Thanks, David
On Mon, Jun 19, 2023 at 11:43:13AM +0530, Gautham R. Shenoy wrote: > Hello David, > > > On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote: > [..snip..] > > > +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags) > > +{ > > + unsigned long flags; > > + struct swqueue *swqueue; > > + bool task_migrated = enq_flags & ENQUEUE_MIGRATED; > > + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP; > > + > > + /* > > + * Only enqueue the task in the shared wakequeue if: > > + * > > + * - SWQUEUE is enabled > > + * - The task is on the wakeup path > > + * - The task wasn't purposefully migrated to the current rq by > > + * select_task_rq() > > + * - The task isn't pinned to a specific CPU > > + */ > > + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1) > > + return; > > In select_task_rq_fair(), having determined if the target of task > wakeup should be the task's previous CPU vs the waker's current CPU, > we spend quite a bit of time already to determine if there is an idle > core/CPU in the target's LLC. @rq would correspond to CPU chosen as a > result of that scan or if no idle CPU exists, @rq corresponds to the > target CPU determined by wake_affine_idle()/wake_affine_weight(). > > So if the CPU of @rq is idle here, can we not simply return here? Hi Gautum, Sorry, I'm not sure I'm quite following the issue you're pointing out. We don't use swqueue if the task was migrated following select_task_rq_fair(). That's the idea with us returning if the task was migrated (the second conditional in that if). If I messed up that logic somehow, it should be fixed. > Or if the idea is to avoid the scan for an idle core/CPU in > select_task_rq_fair(), then No, swqueue_enqueue() is called from enqueue_task_fair(), not select_task_rq_fair(). If select_task_rq_fair() (which is of course called beforehand for a waking task) found an idle core/CPU, we don't bother using swqueue, as mentioned above. Thanks, David
On Tue, Jun 20, 2023 at 02:54:23PM -0500, David Vernet wrote: > On Fri, Jun 16, 2023 at 10:08:57AM +0200, Vincent Guittot wrote: > > On Tue, 13 Jun 2023 at 07:20, David Vernet <void@manifault.com> wrote: > > > > > > Overview > > > ======== > > > > > > The scheduler must constantly strike a balance between work > > > conservation, and avoiding costly migrations which harm performance due > > > to e.g. decreased cache locality. The matter is further complicated by > > > the topology of the system. Migrating a task between cores on the same > > > LLC may be more optimal than keeping a task local to the CPU, whereas > > > migrating a task between LLCs or NUMA nodes may tip the balance in the > > > other direction. > > > > > > With that in mind, while CFS is by and large mostly a work conserving > > > scheduler, there are certain instances where the scheduler will choose > > > to keep a task local to a CPU, when it would have been more optimal to > > > migrate it to an idle core. > > > > > > An example of such a workload is the HHVM / web workload at Meta. HHVM > > > is a VM that JITs Hack and PHP code in service of web requests. Like > > > other JIT / compilation workloads, it tends to be heavily CPU bound, and > > > exhibit generally poor cache locality. To try and address this, we set > > > several debugfs (/sys/kernel/debug/sched) knobs on our HHVM workloads: > > > > > > - migration_cost_ns -> 0 > > > - latency_ns -> 20000000 > > > - min_granularity_ns -> 10000000 > > > - wakeup_granularity_ns -> 12000000 > > > > > > These knobs are intended both to encourage the scheduler to be as work > > > conserving as possible (migration_cost_ns -> 0), and also to keep tasks > > > running for relatively long time slices so as to avoid the overhead of > > > context switching (the other knobs). Collectively, these knobs provide a > > > substantial performance win; resulting in roughly a 20% improvement in > > > throughput. Worth noting, however, is that this improvement is _not_ at > > > full machine saturation. > > > > > > That said, even with these knobs, we noticed that CPUs were still going > > > idle even when the host was overcommitted. In response, we wrote the > > > "shared wakequeue" (swqueue) feature proposed in this patch set. The > > > idea behind swqueue is simple: it enables the scheduler to be > > > aggressively work conserving by placing a waking task into a per-LLC > > > FIFO queue that can be pulled from by another core in the LLC FIFO queue > > > which can then be pulled from before it goes idle. > > > > This seems to be just another newly idle load balance outside the current one ! > > Hi Vincent, > > I can bring the swqueue logic inside of newidle_balance(). In hindsight > I think it makes more sense there. > > To answer your point more generally though, yes, this is a new approach > to load balancing that eschews tracking migration costs, scanning > runqueues, etc in favor of optimizing for work conservation, and > tracking runnable tasks in a shared data structure. More on this below > in response to your other points. > > > > > The knobs above are not the only thing preventing a rq to pull a new > > task. We have rq->avg_idle, curr_cost and sd->max_newidle_lb_cost > > stuff which might be one main root cause for one of your cpu not > > pulling a waiting task > > > > It's not clear in your explanation why fixing newly_idle_load_balance > > was not possible instead of adding outside code and what prevents > > newly_idle_load balance from picking a task in your case ? 
> > > > For example, have you tried to disable the early break because of avg_idle ? > > The goal of swqueue is to enable work conservation using a shared, per-LLC data > structure. The shared data structure is really the salient point as to why just > updating newidle_balance() wouldn't achieve the same result. newidle_balance() > finds tasks to migrate by (sometimes) iterating over runqueues and scanning for > tasks. It's an expensive operation, which doesn't scale to large machines or to > being performed on every idle path. swqueue, on the other hand, is shared > across all cores in an LLC, so pulling a runnable task is simply a matter of a > spinlock acquire and a list operation. This doesn't scale to every single > configuration as Aaron pointed out in [0], but it works well in many other > configurations. +1 Additionally, swqueue minimizes the number of races when two tasks pick the same cpu for being woken up. It's also serializing the wakeup order, which effectively pessimises tasks which are woken very often and promotes cpu-bound tasks, which usually positively affects the throughput. I agree that swqueue can be seen as an alternative form of the idle load balancing, I thought this way when I was working on it. Can it replace it completely? Idk, maybe. But maybe we need some sort of a balancing between multiple wait queues. E.g. if there are two swqueues on the system, one is empty and another is long, load idle balancing can borrow some tasks from the "foreign" swqueue. Just some ideas. Thanks!
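For illustration, a minimal sketch of the borrowing idea above, reusing the rq_swqueue() and swqueue_pull_task() helpers from the posted patch. The swqueue_steal_task() name, the node-local scope, and the per-CPU walk are assumptions made for the sketch, not something proposed in the series:

/*
 * Hypothetical sketch of borrowing work from a "foreign" swqueue on the
 * same NUMA node when the local one is empty. Reuses rq_swqueue() and
 * swqueue_pull_task() from the posted patch; everything else here is
 * illustrative only.
 */
static struct task_struct *swqueue_steal_task(struct rq *rq)
{
	struct swqueue *local = rq_swqueue(rq);
	struct task_struct *p = NULL;
	int cpu;

	for_each_cpu(cpu, cpumask_of_node(cpu_to_node(rq->cpu))) {
		struct swqueue *remote = rq_swqueue(cpu_rq(cpu));

		/* Skip our own LLC's queue and queues with nothing in them. */
		if (remote == local || list_empty(&remote->list))
			continue;

		p = swqueue_pull_task(remote);
		if (p) {
			/*
			 * The caller would still have to check that @p is
			 * allowed to run on this CPU before migrating it.
			 */
			break;
		}
	}

	return p;
}

A real version would presumably walk the other LLCs on the node rather than every CPU, and would need a policy for when stealing from a remote LLC is worth the cache-locality cost.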
On Tue, Jun 20, 2023 at 12:36:26PM -0500, David Vernet wrote:
> On Fri, Jun 16, 2023 at 08:53:38AM +0800, Aaron Lu wrote:
> > I also tried that on the 18cores/36threads/LLC Skylake and the contention
> > is indeed much smaller than UDP_RR:
> >
> > 7.30%    7.29%  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
> >
> > But I wouldn't say it's entirely gone. Also consider Skylake has a lot
> > fewer cores per LLC than later Intel servers like Icelake and Sapphire
> > Rapids and I expect things would be worse on those two machines.
>
> I cannot reproduce this contention locally, even on a slightly larger

With netperf client number equal to nr_cpu?

> Skylake. Not really sure what to make of the difference here. Perhaps
> it's because you're running with CONFIG_SCHED_CORE=y? What is the

Yes, I had that config on but I didn't tag any tasks or groups.

> change in throughput when you run the default workload on your SKL?

The throughput dropped a little with SWQUEUE:

              avg_throughput       native_queued_spin_lock_slowpath%
NO_SWQUEUE:   9528.061111111108    0.09%
SWQUEUE:      8984.369722222222    8.05%

avg_throughput: the average of all netperf clients' throughput; higher is
better.

I run this workload like this:
"
netserver

for i in `seq 72`; do
	netperf -l 60 -n 72 -6 &
done

sleep 30
perf record -ag -e cycles:pp -- sleep 5 &

wait
"
(the '-n 72' should be redundant but I just keep it there)

Thanks,
Aaron
On Wed, Jun 21, 2023 at 10:35:34AM +0800, Aaron Lu wrote: > On Tue, Jun 20, 2023 at 12:36:26PM -0500, David Vernet wrote: > > On Fri, Jun 16, 2023 at 08:53:38AM +0800, Aaron Lu wrote: > > > I also tried that on the 18cores/36threads/LLC Skylake and the contention > > > is indeed much smaller than UDP_RR: > > > > > > 7.30% 7.29% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath > > > > > > But I wouldn't say it's entirely gone. Also consider Skylake has a lot > > > fewer cores per LLC than later Intel servers like Icelake and Sapphire > > > Rapids and I expect things would be worse on those two machines. > > > > I cannot reproduce this contention locally, even on a slightly larger > > With netperf client number equal to nr_cpu? No, that confusion was only the first time around. See below though, I'm not sure what insights are to be gained by continuing to tinker with netperf runs. > > Skylake. Not really sure what to make of the difference here. Perhaps > > it's because you're running with CONFIG_SCHED_CORE=y? What is the > > Yes I had that config on but I didn't tag any tasks or groups. > > > change in throughput when you run the default workload on your SKL? > > The throughput dropped a little with SWQUEUE: > > avg_throughput native_queued_spin_lock_slowpath% > NO_SWQUEUE: 9528.061111111108 0.09% > SWQUEUE: 8984.369722222222 8.05% > > avg_throughput: average throughput of all netperf client's throughput, > higher is better. > > I run this workload like this: > " > netserver > > for i in `seq 72`; do > netperf -l 60 -n 72 -6 & > done > > sleep 30 > perf record -ag -e cycles:pp -- sleep 5 & > > wait > " > (the '-n 72' should be redundant but I just keep it there) At this point I'd say we've spent quite a bit of time discussing netperf results. We understand where the contention is coming from, and yes, we've established that there are going to be some configurations where swqueue is not well suited. We've also established that there are configurations where it will and does perform well, including on netperf. I'm not sure what we're hoping to gain by continuing to run various netperf workloads with your specific parameters? Thanks, David
On Tue, Jun 20, 2023 at 09:43:00PM -0500, David Vernet wrote: > On Wed, Jun 21, 2023 at 10:35:34AM +0800, Aaron Lu wrote: > > On Tue, Jun 20, 2023 at 12:36:26PM -0500, David Vernet wrote: > > > On Fri, Jun 16, 2023 at 08:53:38AM +0800, Aaron Lu wrote: > > > > I also tried that on the 18cores/36threads/LLC Skylake and the contention > > > > is indeed much smaller than UDP_RR: > > > > > > > > 7.30% 7.29% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath > > > > > > > > But I wouldn't say it's entirely gone. Also consider Skylake has a lot > > > > fewer cores per LLC than later Intel servers like Icelake and Sapphire > > > > Rapids and I expect things would be worse on those two machines. > > > > > > I cannot reproduce this contention locally, even on a slightly larger > > > > With netperf client number equal to nr_cpu? > > No, that confusion was only the first time around. See below though, I'm > not sure what insights are to be gained by continuing to tinker with > netperf runs. > > > > Skylake. Not really sure what to make of the difference here. Perhaps > > > it's because you're running with CONFIG_SCHED_CORE=y? What is the > > > > Yes I had that config on but I didn't tag any tasks or groups. > > > > > change in throughput when you run the default workload on your SKL? > > > > The throughput dropped a little with SWQUEUE: > > > > avg_throughput native_queued_spin_lock_slowpath% > > NO_SWQUEUE: 9528.061111111108 0.09% > > SWQUEUE: 8984.369722222222 8.05% > > > > avg_throughput: average throughput of all netperf client's throughput, > > higher is better. > > > > I run this workload like this: > > " > > netserver > > > > for i in `seq 72`; do > > netperf -l 60 -n 72 -6 & > > done > > > > sleep 30 > > perf record -ag -e cycles:pp -- sleep 5 & > > > > wait > > " > > (the '-n 72' should be redundant but I just keep it there) > > At this point I'd say we've spent quite a bit of time discussing netperf > results. We understand where the contention is coming from, and yes, > we've established that there are going to be some configurations where > swqueue is not well suited. We've also established that there are > configurations where it will and does perform well, including on > netperf. > > I'm not sure what we're hoping to gain by continuing to run various > netperf workloads with your specific parameters? I don't quite follow you. I thought we were in the process of figuring out why for the same workload(netperf/default_mode/nr_client=nr_cpu) on two similar machines(both are Skylake) you saw no contention while I saw some so I tried to be exact on how I run the workload. If that's not the case, then yes there is no much value continuing this discussion. Thanks, Aaron
On Wed, Jun 21, 2023 at 12:54:16PM +0800, Aaron Lu wrote: > On Tue, Jun 20, 2023 at 09:43:00PM -0500, David Vernet wrote: > > On Wed, Jun 21, 2023 at 10:35:34AM +0800, Aaron Lu wrote: > > > On Tue, Jun 20, 2023 at 12:36:26PM -0500, David Vernet wrote: > > > > On Fri, Jun 16, 2023 at 08:53:38AM +0800, Aaron Lu wrote: > > > > > I also tried that on the 18cores/36threads/LLC Skylake and the contention > > > > > is indeed much smaller than UDP_RR: > > > > > > > > > > 7.30% 7.29% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath > > > > > > > > > > But I wouldn't say it's entirely gone. Also consider Skylake has a lot > > > > > fewer cores per LLC than later Intel servers like Icelake and Sapphire > > > > > Rapids and I expect things would be worse on those two machines. > > > > > > > > I cannot reproduce this contention locally, even on a slightly larger > > > > > > With netperf client number equal to nr_cpu? > > > > No, that confusion was only the first time around. See below though, I'm > > not sure what insights are to be gained by continuing to tinker with > > netperf runs. > > > > > > Skylake. Not really sure what to make of the difference here. Perhaps > > > > it's because you're running with CONFIG_SCHED_CORE=y? What is the > > > > > > Yes I had that config on but I didn't tag any tasks or groups. > > > > > > > change in throughput when you run the default workload on your SKL? > > > > > > The throughput dropped a little with SWQUEUE: > > > > > > avg_throughput native_queued_spin_lock_slowpath% > > > NO_SWQUEUE: 9528.061111111108 0.09% > > > SWQUEUE: 8984.369722222222 8.05% > > > > > > avg_throughput: average throughput of all netperf client's throughput, > > > higher is better. > > > > > > I run this workload like this: > > > " > > > netserver > > > > > > for i in `seq 72`; do > > > netperf -l 60 -n 72 -6 & > > > done > > > > > > sleep 30 > > > perf record -ag -e cycles:pp -- sleep 5 & > > > > > > wait > > > " > > > (the '-n 72' should be redundant but I just keep it there) > > > > At this point I'd say we've spent quite a bit of time discussing netperf > > results. We understand where the contention is coming from, and yes, > > we've established that there are going to be some configurations where > > swqueue is not well suited. We've also established that there are > > configurations where it will and does perform well, including on > > netperf. > > > > I'm not sure what we're hoping to gain by continuing to run various > > netperf workloads with your specific parameters? > > I don't quite follow you. > > I thought we were in the process of figuring out why for the same > workload(netperf/default_mode/nr_client=nr_cpu) on two similar > machines(both are Skylake) you saw no contention while I saw some so I > tried to be exact on how I run the workload. I just reran the workload on a 26 core / 52 thread Cooper Lake using your exact command below and still don't observe any contention whatsoever on the swqueue lock: for i in `seq 72`; do netperf -l 60 -n 72 -6 & done > If that's not the case, then yes there is no much value continuing this > discussion. We can iterate until we find out why we're seeing slightly different contention (different configs, different amount of RAM, maybe you have turbo enabled or other things running on your host, etc), but I don't see what that would tell us that would meaningfully drive the discussion forward for the patch set. 
Is there anything in particular you're trying to determine? Or do you have reason to think that the contention you're observing is due to something other than a lot of tasks waking up at the same time, just as it was with UDP_RR?

Thanks,
David
On Wed, Jun 21, 2023 at 12:43:52AM -0500, David Vernet wrote: > On Wed, Jun 21, 2023 at 12:54:16PM +0800, Aaron Lu wrote: > > On Tue, Jun 20, 2023 at 09:43:00PM -0500, David Vernet wrote: > > > On Wed, Jun 21, 2023 at 10:35:34AM +0800, Aaron Lu wrote: > > > > On Tue, Jun 20, 2023 at 12:36:26PM -0500, David Vernet wrote: > > > > > On Fri, Jun 16, 2023 at 08:53:38AM +0800, Aaron Lu wrote: > > > > > > I also tried that on the 18cores/36threads/LLC Skylake and the contention > > > > > > is indeed much smaller than UDP_RR: > > > > > > > > > > > > 7.30% 7.29% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath > > > > > > > > > > > > But I wouldn't say it's entirely gone. Also consider Skylake has a lot > > > > > > fewer cores per LLC than later Intel servers like Icelake and Sapphire > > > > > > Rapids and I expect things would be worse on those two machines. > > > > > > > > > > I cannot reproduce this contention locally, even on a slightly larger > > > > > > > > With netperf client number equal to nr_cpu? > > > > > > No, that confusion was only the first time around. See below though, I'm > > > not sure what insights are to be gained by continuing to tinker with > > > netperf runs. > > > > > > > > Skylake. Not really sure what to make of the difference here. Perhaps > > > > > it's because you're running with CONFIG_SCHED_CORE=y? What is the > > > > > > > > Yes I had that config on but I didn't tag any tasks or groups. > > > > > > > > > change in throughput when you run the default workload on your SKL? > > > > > > > > The throughput dropped a little with SWQUEUE: > > > > > > > > avg_throughput native_queued_spin_lock_slowpath% > > > > NO_SWQUEUE: 9528.061111111108 0.09% > > > > SWQUEUE: 8984.369722222222 8.05% > > > > > > > > avg_throughput: average throughput of all netperf client's throughput, > > > > higher is better. > > > > > > > > I run this workload like this: > > > > " > > > > netserver > > > > > > > > for i in `seq 72`; do > > > > netperf -l 60 -n 72 -6 & > > > > done > > > > > > > > sleep 30 > > > > perf record -ag -e cycles:pp -- sleep 5 & > > > > > > > > wait > > > > " > > > > (the '-n 72' should be redundant but I just keep it there) > > > > > > At this point I'd say we've spent quite a bit of time discussing netperf > > > results. We understand where the contention is coming from, and yes, > > > we've established that there are going to be some configurations where > > > swqueue is not well suited. We've also established that there are > > > configurations where it will and does perform well, including on > > > netperf. > > > > > > I'm not sure what we're hoping to gain by continuing to run various > > > netperf workloads with your specific parameters? > > > > I don't quite follow you. > > > > I thought we were in the process of figuring out why for the same > > workload(netperf/default_mode/nr_client=nr_cpu) on two similar > > machines(both are Skylake) you saw no contention while I saw some so I > > tried to be exact on how I run the workload. > > I just reran the workload on a 26 core / 52 thread Cooper Lake using > your exact command below and still don't observe any contention > whatsoever on the swqueue lock: Well, it's a puzzle to me. But as you said below, I guess I'll just move on. Thanks, Aaron > > for i in `seq 72`; do > netperf -l 60 -n 72 -6 & > done > > > If that's not the case, then yes there is no much value continuing this > > discussion. 
> > We can iterate until we find out why we're seeing slightly different > contention (different configs, different amount of RAM, maybe you have > turbo enabled or other things running on your host, etc), but I don't > see what that would tell us that would meaningfully drive the discussion > forward for the patch set. Is there anything in particular you're trying > to determine and/or do you have reason to think that the contention > you're observing is due to something other than a lot of tasks waking up > at the same time, just as it was with UDP_RR? > > Thanks, > David
Hello David, On Tue, Jun 20, 2023 at 03:08:22PM -0500, David Vernet wrote: > On Mon, Jun 19, 2023 at 11:43:13AM +0530, Gautham R. Shenoy wrote: > > Hello David, > > > > > > On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote: > > [..snip..] > > > > > +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags) > > > +{ > > > + unsigned long flags; > > > + struct swqueue *swqueue; > > > + bool task_migrated = enq_flags & ENQUEUE_MIGRATED; > > > + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP; > > > + > > > + /* > > > + * Only enqueue the task in the shared wakequeue if: > > > + * > > > + * - SWQUEUE is enabled > > > + * - The task is on the wakeup path > > > + * - The task wasn't purposefully migrated to the current rq by > > > + * select_task_rq() > > > + * - The task isn't pinned to a specific CPU > > > + */ > > > + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1) > > > + return; > > > > In select_task_rq_fair(), having determined if the target of task > > wakeup should be the task's previous CPU vs the waker's current CPU, > > we spend quite a bit of time already to determine if there is an idle > > core/CPU in the target's LLC. @rq would correspond to CPU chosen as a > > result of that scan or if no idle CPU exists, @rq corresponds to the > > target CPU determined by wake_affine_idle()/wake_affine_weight(). > > > > So if the CPU of @rq is idle here, can we not simply return here? > > Hi Gautum, > > Sorry, I'm not sure I'm quite following the issue you're pointing out. > We don't use swqueue if the task was migrated following > select_task_rq_fair(). That's the idea with us returning if the task was > migrated (the second conditional in that if). If I messed up that logic > somehow, it should be fixed. Sorry, my bad. I see it now. So as per this patch, the only time we enqueue the task on the shared wakeup is if the target of try_to_wake_up() is the same CPU where the task ran previously. When wake_affine logic fails and the previous CPU is chosen as the target, and when there are no other idle cores/threads in the LLC of the previous CPU, it makes sense to queue the task on the shared-wakequeue instead of on a busy previous CPU. And when that previous CPU is idle, the try_to_wake_up() would have woken it up via ttwu_queue(), so before going idle the next time it will check the shared queue for the task and find it. We should be good in this case. Now, it is possible that select_task_rq_fair() ended up selecting the waker's CPU as the target based on the wake_affine_idle()/wake_affine_weight() logic. And if there is no idle core/thread on the waker's LLC, the target would be the busy waker CPU. In the case when the waker CPU is different from the task's previous CPU, due to ENQUEUE_MIGRATE flag being set, the task won't be queued on the shared wakequeue and instead has to wait on the busy waker CPU. I wonder if it makes sense to enqueue the task on the shared wakequeue in this scenario as well. > > > Or if the idea is to avoid the scan for an idle core/CPU in > > select_task_rq_fair(), then > > No, swqueue_enqueue() is called from enqueue_task_fair(), not > select_task_rq_fair(). If select_task_rq_fair() (which is of course > called beforehand for a waking task) found an idle core/CPU, we don't > bother using swqueue, as mentioned above. Got it. Thanks! > > Thanks, > David -- Thanks and Regards gautham.
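For reference, a rough sketch of the relaxation being asked about here: only skip the swqueue for migrated wakeups when the destination CPU was otherwise idle, so that a wakee placed on a busy waker CPU still gets queued. The rq->nr_running test (which assumes this hook runs at the end of enqueue_task_fair() and therefore already counts @p) is an assumption used purely for illustration, not something posted in the thread:

/*
 * Hypothetical variant of the swqueue_enqueue() gate from the posted
 * patch: also queue tasks that select_task_rq() migrated to a busy CPU,
 * rather than skipping all migrated wakeups.
 */
static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
{
	unsigned long flags;
	struct swqueue *swqueue;
	bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
	bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;

	if (!task_wakeup || p->nr_cpus_allowed == 1)
		return;

	/*
	 * A task migrated to an otherwise-idle CPU doesn't need the
	 * swqueue; one landing on a busy CPU arguably still does.
	 * rq->nr_running includes @p at this point (assumption).
	 */
	if (task_migrated && rq->nr_running <= 1)
		return;

	swqueue = rq_swqueue(rq);
	spin_lock_irqsave(&swqueue->lock, flags);
	list_add_tail(&p->swqueue_node, &swqueue->list);
	spin_unlock_irqrestore(&swqueue->lock, flags);
}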
On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote: > +struct swqueue { > + struct list_head list; > + spinlock_t lock; > +} ____cacheline_aligned; I'm thinking you can shard this just fine, it makes that pop() needs to iterate all shards, but that shouldn't be a problem, and it would still only need to take a single lock. I'm thinking 4 or 8 shards should be plenty, even for Intel LLC. > #ifdef CONFIG_SMP > +static struct task_struct *swqueue_pull_task(struct swqueue *swqueue) > +{ > + unsigned long flags; > + > + struct task_struct *p; > + > + spin_lock_irqsave(&swqueue->lock, flags); > + p = list_first_entry_or_null(&swqueue->list, struct task_struct, > + swqueue_node); > + if (p) > + list_del_init(&p->swqueue_node); > + spin_unlock_irqrestore(&swqueue->lock, flags); > + > + return p; > +} Would this not normally be called pop() or somesuch? > +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags) > +{ > + unsigned long flags; > + struct swqueue *swqueue; > + bool task_migrated = enq_flags & ENQUEUE_MIGRATED; > + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP; > + > + /* > + * Only enqueue the task in the shared wakequeue if: > + * > + * - SWQUEUE is enabled > + * - The task is on the wakeup path > + * - The task wasn't purposefully migrated to the current rq by > + * select_task_rq() > + * - The task isn't pinned to a specific CPU > + */ > + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1) > + return; Elsewhere you mentioned heuristics, this smells like them. This and the is_cpus_allowed() thing makes you loose plenty of opportunities. > + swqueue = rq_swqueue(rq); > + spin_lock_irqsave(&swqueue->lock, flags); > + list_add_tail(&p->swqueue_node, &swqueue->list); > + spin_unlock_irqrestore(&swqueue->lock, flags); > +} > + > static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf) > { > - return 0; > + struct swqueue *swqueue; > + struct task_struct *p = NULL; > + struct rq *src_rq; > + struct rq_flags src_rf; > + int ret; > + > + swqueue = rq_swqueue(rq); > + if (!list_empty(&swqueue->list)) > + p = swqueue_pull_task(swqueue); > + > + if (!p) > + return 0; At this point you can do the whole is_cpu_allowed() and avoid the whole lock dance if not. > + > + rq_unpin_lock(rq, rf); > + raw_spin_rq_unlock(rq); > + > + src_rq = task_rq_lock(p, &src_rf); > + > + if (task_on_rq_queued(p) && !task_on_cpu(rq, p)) > + src_rq = migrate_task_to(src_rq, &src_rf, p, cpu_of(rq)); And then this becomes move_queued_task(). > + if (src_rq->cpu != rq->cpu) > + ret = 1; > + else > + ret = -1; > + > + task_rq_unlock(src_rq, p, &src_rf); > + > + raw_spin_rq_lock(rq); > + rq_repin_lock(rq, rf); > + > + return ret; > } > > static void swqueue_remove_task(struct task_struct *p) > +{ > + unsigned long flags; > + struct swqueue *swqueue; > + > + if (!list_empty(&p->swqueue_node)) { > + swqueue = rq_swqueue(task_rq(p)); > + spin_lock_irqsave(&swqueue->lock, flags); > + list_del_init(&p->swqueue_node); > + spin_unlock_irqrestore(&swqueue->lock, flags); > + } > +} dequeue()
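For illustration, one reading of the sharding suggested above: several list/lock pairs per LLC, with enqueue hashing on the CPU number and pop scanning the shards while only ever holding one shard lock at a time. The shard count, the hashing scheme, and the helper names are assumptions, not code from the thread:

/* Hypothetical sharded swqueue: SWQUEUE_SHARDS list/lock pairs per LLC. */
#define SWQUEUE_SHARDS	8

struct swqueue_shard {
	struct list_head list;
	spinlock_t lock;
} ____cacheline_aligned;

struct swqueue {
	struct swqueue_shard shards[SWQUEUE_SHARDS];
};

static void swqueue_enqueue_sharded(struct swqueue *swqueue,
				    struct task_struct *p, int cpu)
{
	/* Pick a shard from the enqueueing CPU to spread out contention. */
	struct swqueue_shard *shard = &swqueue->shards[cpu % SWQUEUE_SHARDS];
	unsigned long flags;

	spin_lock_irqsave(&shard->lock, flags);
	list_add_tail(&p->swqueue_node, &shard->list);
	spin_unlock_irqrestore(&shard->lock, flags);
}

static struct task_struct *swqueue_pop_sharded(struct swqueue *swqueue)
{
	struct task_struct *p = NULL;
	unsigned long flags;
	int i;

	/* Scan the shards; each pull still only takes a single shard lock. */
	for (i = 0; i < SWQUEUE_SHARDS; i++) {
		struct swqueue_shard *shard = &swqueue->shards[i];

		if (list_empty(&shard->list))
			continue;

		spin_lock_irqsave(&shard->lock, flags);
		p = list_first_entry_or_null(&shard->list, struct task_struct,
					     swqueue_node);
		if (p)
			list_del_init(&p->swqueue_node);
		spin_unlock_irqrestore(&shard->lock, flags);
		if (p)
			break;
	}

	return p;
}

Note that with shards, removal on dequeue would also need to find the right shard lock, e.g. by recording the shard index in the task at enqueue time.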
On Tue, Jun 20, 2023 at 02:54:23PM -0500, David Vernet wrote:
> Another consideration is that even if we could adjust newidle_balance()
> to load balance well enough for our specific purpose, we're still
> relying on heuristics to determine when it's appropriate to load
> balance; and that will

Your thing has heuristics too. But if we're going to do this, then it
should replace newidle <= LLC domains.
On Wed, Jun 21, 2023 at 04:20:20PM +0200, Peter Zijlstra wrote: > On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote: > > +struct swqueue { > > + struct list_head list; > > + spinlock_t lock; > > +} ____cacheline_aligned; > > I'm thinking you can shard this just fine, it makes that pop() needs to > iterate all shards, but that shouldn't be a problem, and it would still > only need to take a single lock. Is the idea here to have multiple sharded queues per LLC, with a single lock protecting them? Assuming so, is the idea that it would avoid bouncing the list heads amongst all of the cores' cachelines? I could see that being useful in some scenarios, but it also feels a bit complicated for what it gives you. If you have tasks being pulled quickly I don't think that will help much because the list heads will just bounce to the pullers. Also, if the lock is already heavily contended, it seems possible that the cores inside a single shard could still bounce the head back and forth amongst them, or the cache line bouncing will be a small-order overhead compared to the lock itself. Or did I misunderstand your suggestion entirely? > I'm thinking 4 or 8 shards should be plenty, even for Intel LLC. > > > #ifdef CONFIG_SMP > > > +static struct task_struct *swqueue_pull_task(struct swqueue *swqueue) > > +{ > > + unsigned long flags; > > + > > + struct task_struct *p; > > + > > + spin_lock_irqsave(&swqueue->lock, flags); > > + p = list_first_entry_or_null(&swqueue->list, struct task_struct, > > + swqueue_node); > > + if (p) > > + list_del_init(&p->swqueue_node); > > + spin_unlock_irqrestore(&swqueue->lock, flags); > > + > > + return p; > > +} > > Would this not normally be called pop() or somesuch? Yes, I'll improve the name in the next iteration. swqueue_dequeue() and swqueue_enqueue() seem like the most canonical. Let me know if you have another preference. > > > +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags) > > +{ > > + unsigned long flags; > > + struct swqueue *swqueue; > > + bool task_migrated = enq_flags & ENQUEUE_MIGRATED; > > + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP; > > + > > + /* > > + * Only enqueue the task in the shared wakequeue if: > > + * > > + * - SWQUEUE is enabled > > + * - The task is on the wakeup path > > + * - The task wasn't purposefully migrated to the current rq by > > + * select_task_rq() > > + * - The task isn't pinned to a specific CPU > > + */ > > + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1) > > + return; > > Elsewhere you mentioned heuristics, this smells like them. This and the > is_cpus_allowed() thing makes you loose plenty of opportunities. Yeah fair enough, these certainly are heuristics as well. I thought it best to try and avoid swqueue getting in the way of select_task_rq_fair() (at least to start out with), but we could always remove that and run other experiments to see how it does. 
> > + swqueue = rq_swqueue(rq); > > + spin_lock_irqsave(&swqueue->lock, flags); > > + list_add_tail(&p->swqueue_node, &swqueue->list); > > + spin_unlock_irqrestore(&swqueue->lock, flags); > > +} > > + > > static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf) > > { > > - return 0; > > + struct swqueue *swqueue; > > + struct task_struct *p = NULL; > > + struct rq *src_rq; > > + struct rq_flags src_rf; > > + int ret; > > + > > + swqueue = rq_swqueue(rq); > > + if (!list_empty(&swqueue->list)) > > + p = swqueue_pull_task(swqueue); > > + > > + if (!p) > > + return 0; > > At this point you can do the whole is_cpu_allowed() and avoid the whole > lock dance if not. Good idea, will incorporate into the next iteration. > > + > > + rq_unpin_lock(rq, rf); > > + raw_spin_rq_unlock(rq); > > + > > + src_rq = task_rq_lock(p, &src_rf); > > + > > + if (task_on_rq_queued(p) && !task_on_cpu(rq, p)) > > + src_rq = migrate_task_to(src_rq, &src_rf, p, cpu_of(rq)); > > And then this becomes move_queued_task(). Yep, will make this change per your suggestion in [0]. [0]: https://lore.kernel.org/all/20230621130439.GF2053369@hirez.programming.kicks-ass.net/ > > + if (src_rq->cpu != rq->cpu) > > + ret = 1; > > + else > > + ret = -1; > > + > > + task_rq_unlock(src_rq, p, &src_rf); > > + > > + raw_spin_rq_lock(rq); > > + rq_repin_lock(rq, rf); > > + > > + return ret; > > } > > > > static void swqueue_remove_task(struct task_struct *p) > > +{ > > + unsigned long flags; > > + struct swqueue *swqueue; > > + > > + if (!list_empty(&p->swqueue_node)) { > > + swqueue = rq_swqueue(task_rq(p)); > > + spin_lock_irqsave(&swqueue->lock, flags); > > + list_del_init(&p->swqueue_node); > > + spin_unlock_irqrestore(&swqueue->lock, flags); > > + } > > +} > > dequeue() Ack
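Taken together, those suggestions would make the pick path look roughly like the sketch below. This is untested and assumes move_queued_task() (currently static to kernel/sched/core.c) is made visible to fair.c; since is_cpu_allowed() is likewise private to core.c, a plain cpus_ptr test stands in for it here:

static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf)
{
	struct swqueue *swqueue = rq_swqueue(rq);
	struct task_struct *p = NULL;
	struct rq *src_rq;
	struct rq_flags src_rf;
	int ret = -1;

	if (!list_empty(&swqueue->list))
		p = swqueue_pull_task(swqueue);
	if (!p)
		return 0;

	/*
	 * Cheap rejection before any rq lock juggling. (A fancier variant
	 * could instead skip non-matching tasks inside swqueue_pull_task()
	 * so they stay queued for other CPUs.)
	 */
	if (!cpumask_test_cpu(cpu_of(rq), p->cpus_ptr))
		return 0;

	rq_unpin_lock(rq, rf);
	raw_spin_rq_unlock(rq);

	src_rq = task_rq_lock(p, &src_rf);
	if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p))
		src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq));

	if (src_rq->cpu == rq->cpu)
		ret = 1;

	task_rq_unlock(src_rq, p, &src_rf);

	raw_spin_rq_lock(rq);
	rq_repin_lock(rq, rf);

	return ret;
}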
On Wed, Jun 21, 2023 at 01:47:00PM +0530, Gautham R. Shenoy wrote: > Hello David, > On Tue, Jun 20, 2023 at 03:08:22PM -0500, David Vernet wrote: > > On Mon, Jun 19, 2023 at 11:43:13AM +0530, Gautham R. Shenoy wrote: > > > Hello David, > > > > > > > > > On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote: > > > [..snip..] > > > > > > > +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags) > > > > +{ > > > > + unsigned long flags; > > > > + struct swqueue *swqueue; > > > > + bool task_migrated = enq_flags & ENQUEUE_MIGRATED; > > > > + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP; > > > > + > > > > + /* > > > > + * Only enqueue the task in the shared wakequeue if: > > > > + * > > > > + * - SWQUEUE is enabled > > > > + * - The task is on the wakeup path > > > > + * - The task wasn't purposefully migrated to the current rq by > > > > + * select_task_rq() > > > > + * - The task isn't pinned to a specific CPU > > > > + */ > > > > + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1) > > > > + return; > > > > > > In select_task_rq_fair(), having determined if the target of task > > > wakeup should be the task's previous CPU vs the waker's current CPU, > > > we spend quite a bit of time already to determine if there is an idle > > > core/CPU in the target's LLC. @rq would correspond to CPU chosen as a > > > result of that scan or if no idle CPU exists, @rq corresponds to the > > > target CPU determined by wake_affine_idle()/wake_affine_weight(). > > > > > > So if the CPU of @rq is idle here, can we not simply return here? > > > > Hi Gautum, > > > > Sorry, I'm not sure I'm quite following the issue you're pointing out. > > We don't use swqueue if the task was migrated following > > select_task_rq_fair(). That's the idea with us returning if the task was > > migrated (the second conditional in that if). If I messed up that logic > > somehow, it should be fixed. > > Sorry, my bad. I see it now. > > So as per this patch, the only time we enqueue the task on the shared > wakeup is if the target of try_to_wake_up() is the same CPU where the > task ran previously. > > When wake_affine logic fails and the previous CPU is chosen as the > target, and when there are no other idle cores/threads in the LLC of > the previous CPU, it makes sense to queue the task on the > shared-wakequeue instead of on a busy previous CPU. > > And when that previous CPU is idle, the try_to_wake_up() would have > woken it up via ttwu_queue(), so before going idle the next time it > will check the shared queue for the task and find it. We should be > good in this case. > > Now, it is possible that select_task_rq_fair() ended up selecting the > waker's CPU as the target based on the > wake_affine_idle()/wake_affine_weight() logic. And if there is no idle > core/thread on the waker's LLC, the target would be the busy waker > CPU. In the case when the waker CPU is different from the task's > previous CPU, due to ENQUEUE_MIGRATE flag being set, the task won't be > queued on the shared wakequeue and instead has to wait on the busy > waker CPU. > > I wonder if it makes sense to enqueue the task on the shared wakequeue > in this scenario as well. Hello Gautham, That's a good point. My original intention with opting out of using swqueue if select_task_rq_fair() caused us to migrate is so that it wouldn't interfere with decisions made with other select_task_rq_fair() heuristics like wake_affine_*(). Basically just minimizing the possible impact of swqueue. 
That said, I think it probably does make sense to just enqueue in the swqueue regardless of whether ENQUEUE_MIGRATED is set. One of the main goals of swqueue is work conservation, and in hindsight it does feel somewhat artificial to add a heuristic that works against that. I'd like to hear what others think. In my opinion it's worth at least running some tests on workloads that heavily utilize the CPU such as kernel compile, and seeing what the outcomes are. Thanks, David
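Concretely, the relaxation being floated here is just dropping the ENQUEUE_MIGRATED test from the gate in swqueue_enqueue(), e.g. (untested):

	bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;

	/*
	 * Enqueue any waking, non-pinned task on the shared wakequeue,
	 * even if select_task_rq() purposefully migrated it to this rq.
	 */
	if (!task_wakeup || p->nr_cpus_allowed == 1)
		return;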
On Wed, Jun 21, 2023 at 08:43:29PM -0500, David Vernet wrote: > On Wed, Jun 21, 2023 at 01:47:00PM +0530, Gautham R. Shenoy wrote: > > Hello David, > > On Tue, Jun 20, 2023 at 03:08:22PM -0500, David Vernet wrote: > > > On Mon, Jun 19, 2023 at 11:43:13AM +0530, Gautham R. Shenoy wrote: > > > > Hello David, > > > > > > > > > > > > On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote: > > > > [..snip..] > > > > > > > > > +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags) > > > > > +{ > > > > > + unsigned long flags; > > > > > + struct swqueue *swqueue; > > > > > + bool task_migrated = enq_flags & ENQUEUE_MIGRATED; > > > > > + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP; > > > > > + > > > > > + /* > > > > > + * Only enqueue the task in the shared wakequeue if: > > > > > + * > > > > > + * - SWQUEUE is enabled > > > > > + * - The task is on the wakeup path > > > > > + * - The task wasn't purposefully migrated to the current rq by > > > > > + * select_task_rq() > > > > > + * - The task isn't pinned to a specific CPU > > > > > + */ > > > > > + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1) > > > > > + return; > > > > > > > > In select_task_rq_fair(), having determined if the target of task > > > > wakeup should be the task's previous CPU vs the waker's current CPU, > > > > we spend quite a bit of time already to determine if there is an idle > > > > core/CPU in the target's LLC. @rq would correspond to CPU chosen as a > > > > result of that scan or if no idle CPU exists, @rq corresponds to the > > > > target CPU determined by wake_affine_idle()/wake_affine_weight(). > > > > > > > > So if the CPU of @rq is idle here, can we not simply return here? > > > > > > Hi Gautum, > > > > > > Sorry, I'm not sure I'm quite following the issue you're pointing out. > > > We don't use swqueue if the task was migrated following > > > select_task_rq_fair(). That's the idea with us returning if the task was > > > migrated (the second conditional in that if). If I messed up that logic > > > somehow, it should be fixed. > > > > Sorry, my bad. I see it now. > > > > So as per this patch, the only time we enqueue the task on the shared > > wakeup is if the target of try_to_wake_up() is the same CPU where the > > task ran previously. > > > > When wake_affine logic fails and the previous CPU is chosen as the > > target, and when there are no other idle cores/threads in the LLC of > > the previous CPU, it makes sense to queue the task on the > > shared-wakequeue instead of on a busy previous CPU. > > > > And when that previous CPU is idle, the try_to_wake_up() would have > > woken it up via ttwu_queue(), so before going idle the next time it > > will check the shared queue for the task and find it. We should be > > good in this case. > > > > Now, it is possible that select_task_rq_fair() ended up selecting the > > waker's CPU as the target based on the > > wake_affine_idle()/wake_affine_weight() logic. And if there is no idle > > core/thread on the waker's LLC, the target would be the busy waker > > CPU. In the case when the waker CPU is different from the task's > > previous CPU, due to ENQUEUE_MIGRATE flag being set, the task won't be > > queued on the shared wakequeue and instead has to wait on the busy > > waker CPU. > > > > I wonder if it makes sense to enqueue the task on the shared wakequeue > > in this scenario as well. > > Hello Gautham, > > That's a good point. 
My original intention with opting out of using > swqueue if select_task_rq_fair() caused us to migrate is so that it > wouldn't interfere with decisions made with other select_task_rq_fair() > heuristics like wake_affine_*(). Basically just minimizing the possible > impact of swqueue. > That said, I think it probably does make sense to > just enqueue in the swqueue regardless of whether ENQUEUE_MIGRATED is > set. One of the main goals of swqueue is work conservation, and in > hindsight it does feel somewhat artificial to add a heuristic that works > against that. In that case we can perhaps have an explicit flag that is passed by try_to_wake_up() when it cannot find an idle CPU and the chosen target is just a fallback. The task gets enqueued on the swqueue of the target only in such cases. Something like the following entirely untested: ---------------------------- x8---------------------------- diff --git a/include/linux/sched.h b/include/linux/sched.h index b64fec55a381..38005262a7fe 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -910,6 +910,12 @@ struct task_struct { */ unsigned sched_remote_wakeup:1; + /* + * Bit used by select_idle_sibling() to signal enqueuing the + * task on a shared wakequeue. + */ + unsigned sched_add_to_swq:1; + /* Bit to tell LSMs we're in execve(): */ unsigned in_execve:1; unsigned in_iowait:1; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e311d1c3b992..f4246c33f3c5 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -215,21 +215,17 @@ static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags) { unsigned long flags; struct swqueue *swqueue; - bool task_migrated = enq_flags & ENQUEUE_MIGRATED; - bool task_wakeup = enq_flags & ENQUEUE_WAKEUP; /* * Only enqueue the task in the shared wakequeue if: * * - SWQUEUE is enabled - * - The task is on the wakeup path - * - The task wasn't purposefully migrated to the current rq by - * select_task_rq() - * - The task isn't pinned to a specific CPU + * - select_idle_sibling() didn't find an idle CPU to enqueue this wakee task. */ - if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1) + if (!READ_ONCE(p->sched_add_to_swq)) return; + WRITE_ONCE(p->sched_add_to_swq, 0); swqueue = rq_swqueue(rq); spin_lock_irqsave(&swqueue->lock, flags); list_add_tail(&p->swqueue_node, &swqueue->list); @@ -7361,6 +7357,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) if ((unsigned)i < nr_cpumask_bits) return i; + /* + * No idle sibling was found. Ok to queue this task on the + * shared wakequeue of the target. + */ + WRITE_ONCE(p->sched_add_to_swq, 1); return target; } > > I'd like to hear what others think. In my opinion it's worth at least > running some tests on workloads that heavily utilize the CPU such as > kernel compile, and seeing what the outcomes are. I will try and get some numbers for such workloads on our setup over this weekend. > > Thanks, > David -- Thanks and Regards gautham.
On Thu, Jun 22, 2023 at 02:41:57PM +0530, Gautham R. Shenoy wrote: > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -215,21 +215,17 @@ static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags) > { > unsigned long flags; > struct swqueue *swqueue; > - bool task_migrated = enq_flags & ENQUEUE_MIGRATED; > - bool task_wakeup = enq_flags & ENQUEUE_WAKEUP; > > /* > * Only enqueue the task in the shared wakequeue if: > * > * - SWQUEUE is enabled > - * - The task is on the wakeup path > - * - The task wasn't purposefully migrated to the current rq by > - * select_task_rq() > - * - The task isn't pinned to a specific CPU > + * - select_idle_sibling() didn't find an idle CPU to enqueue this wakee task. > */ > - if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1) You lost the nr_cpus_allowed == 1 case, that actually made sense, except he didn't don't have a hook in set_cpus_allowed() to fix it back up. > + if (!READ_ONCE(p->sched_add_to_swq)) > return; So suppose we fill all CPUs with tasks T_{0..n-n1}, sis finds them a nice idle cpu and we don't get queued. Then wake up another n tasks T_{n...2n-1}, none of them find an idle CPU, as per the above them all taken. These all do get queued. However, suppose T>=n does a wakeup-preemption, and we end up with T>=n running while T<n get queued on the rq, but not the 'queue' (I so hate the swqueue name). Then one CPU has both it's tasks go 'away' and becomes idle. At this point the 'queue' contains only running tasks that are not migratable and all the tasks that could be migrated are not on the queue -- IOW it is completely useless. Is this the intention?
On Wed, Jun 21, 2023 at 03:34:45PM -0500, David Vernet wrote: > On Wed, Jun 21, 2023 at 04:20:20PM +0200, Peter Zijlstra wrote: > > On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote: > > > +struct swqueue { > > > + struct list_head list; > > > + spinlock_t lock; > > > +} ____cacheline_aligned; > > > > I'm thinking you can shard this just fine, it makes that pop() needs to > > iterate all shards, but that shouldn't be a problem, and it would still > > only need to take a single lock. > > Is the idea here to have multiple sharded queues per LLC, with a single > lock protecting them? No, each shard will have it's own lock, each shard it's own cacheline. > Or did I misunderstand your suggestion entirely? Yup, so each shard is basically the above struct, whole cacheline with a lock and list. Then on enqueue we hash to a shard (possibly based on cpu -- this gives the least amount of bounces) and stick it on. Then on pop we iterate the shards, the first non-empty shard gets the lock taken, and pop'ed. If you start iteration based on the cpu->shard map used for enqueue you can even get a sense of 'locality' depending on the mapping. So pop is O(n) in shards, but on average will only take a single lock -- there is an obvious race where it might see a non-empty queue, but race on lock and find the queue empty once acquired, but meh. > > I'm thinking 4 or 8 shards should be plenty, even for Intel LLC. > > > > > #ifdef CONFIG_SMP > > > > > +static struct task_struct *swqueue_pull_task(struct swqueue *swqueue) > > > +{ > > > + unsigned long flags; > > > + > > > + struct task_struct *p; > > > + > > > + spin_lock_irqsave(&swqueue->lock, flags); > > > + p = list_first_entry_or_null(&swqueue->list, struct task_struct, > > > + swqueue_node); > > > + if (p) > > > + list_del_init(&p->swqueue_node); > > > + spin_unlock_irqrestore(&swqueue->lock, flags); > > > + > > > + return p; > > > +} > > > > Would this not normally be called pop() or somesuch? > > Yes, I'll improve the name in the next iteration. swqueue_dequeue() and > swqueue_enqueue() seem like the most canonical. Let me know if you have another > preference. Well, there's two layers intermixed here, which is, I think, what gave rise to the whole wonky naming. If you implement the queue primitives as: push, pop, del and then build, using them, the scheduler primitives: enqueue, dequeue, pick, things should become saner. > > > + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1) > > > + return; > > > > Elsewhere you mentioned heuristics, this smells like them. This and the > > is_cpus_allowed() thing makes you loose plenty of opportunities. > > Yeah fair enough, these certainly are heuristics as well. > > I thought it best to try and avoid swqueue getting in the way of > select_task_rq_fair() (at least to start out with), but we could always > remove that and run other experiments to see how it does. So, part of the problem is that you might be hooking at the wrong level (see also my earlier email about the queue only containing running tasks and none of the ready tasks). If you were to hook at the __{en,de}queue_entity() level you'll automagically dequeue running tasks and enqueue any ready tasks. OTOH you get to deal with the problem that not all entities are tasks, but that shouldn't be too hard. Also, you end up with more enqueue/dequeue -- so that might hurt. Also, at this point you're nr_cpus/N away from simply scanning the runqueues and pulling things from non-empty runqueues. 
But for large nr_cpus and smallish N it might be a win over-all. Dunno..
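To make the sharding idea concrete, a rough and untested sketch of the structure and the enqueue/pop hashing described above might look like the following; the shard count, field names, and helpers are illustrative, not from the patch:

#define SWQUEUE_NR_SHARDS	8

struct swqueue_shard {
	struct list_head list;
	spinlock_t	 lock;
} ____cacheline_aligned;

struct swqueue {
	struct swqueue_shard shards[SWQUEUE_NR_SHARDS];
};

/* Hash a CPU to a shard; CPU-based hashing minimizes cacheline bouncing. */
static unsigned int swqueue_shard_idx(int cpu)
{
	return cpu % SWQUEUE_NR_SHARDS;
}

static void swqueue_push_task(struct swqueue *swqueue, struct task_struct *p, int cpu)
{
	struct swqueue_shard *shard = &swqueue->shards[swqueue_shard_idx(cpu)];
	unsigned long flags;

	spin_lock_irqsave(&shard->lock, flags);
	list_add_tail(&p->swqueue_node, &shard->list);
	spin_unlock_irqrestore(&shard->lock, flags);
}

/* O(nr_shards) scan, but on average only one shard lock is taken. */
static struct task_struct *swqueue_pop_task(struct swqueue *swqueue, int cpu)
{
	unsigned int start = swqueue_shard_idx(cpu), i;
	struct task_struct *p = NULL;
	unsigned long flags;

	for (i = 0; i < SWQUEUE_NR_SHARDS; i++) {
		struct swqueue_shard *shard =
			&swqueue->shards[(start + i) % SWQUEUE_NR_SHARDS];

		/* Racy peek; losing the race just means an empty pop. */
		if (list_empty(&shard->list))
			continue;

		spin_lock_irqsave(&shard->lock, flags);
		p = list_first_entry_or_null(&shard->list, struct task_struct,
					     swqueue_node);
		if (p)
			list_del_init(&p->swqueue_node);
		spin_unlock_irqrestore(&shard->lock, flags);
		if (p)
			break;
	}
	return p;
}

A real version would also need swqueue_remove_task() to find the task's shard (e.g. by stashing a shard pointer next to swqueue_node), and the unlocked list_empty() peek is exactly the benign race mentioned above.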
On Thu, Jun 22, 2023 at 12:58:41PM +0200, Peter Zijlstra wrote: > On Wed, Jun 21, 2023 at 03:34:45PM -0500, David Vernet wrote: > > On Wed, Jun 21, 2023 at 04:20:20PM +0200, Peter Zijlstra wrote: > > > On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote: > > > > +struct swqueue { > > > > + struct list_head list; > > > > + spinlock_t lock; > > > > +} ____cacheline_aligned; > > > > > > I'm thinking you can shard this just fine, it makes that pop() needs to > > > iterate all shards, but that shouldn't be a problem, and it would still > > > only need to take a single lock. > > > > Is the idea here to have multiple sharded queues per LLC, with a single > > lock protecting them? > > No, each shard will have it's own lock, each shard it's own cacheline. > > > Or did I misunderstand your suggestion entirely? > > Yup, so each shard is basically the above struct, whole cacheline with a > lock and list. Then on enqueue we hash to a shard (possibly based on cpu > -- this gives the least amount of bounces) and stick it on. > > Then on pop we iterate the shards, the first non-empty shard gets the > lock taken, and pop'ed. If you start iteration based on the cpu->shard > map used for enqueue you can even get a sense of 'locality' depending on > the mapping. That's nifty. Even if we don't get locality from it, it seems prudent to do to avoid contention. > So pop is O(n) in shards, but on average will only take a single lock -- Ah, so it's 1 lock on average -- thanks for clarifying. I like this idea and will implement it for v2. I'm not keen on the idea of artificially having multiple independent swqueues (yes, I'll think of a less evil name) per-LLC to avoid contending, but sharding like this seems like a simple but effective solution to the scalability issue; assuming having a sufficiently high N doesn't slow down the pop path by too much as you pointed out below. > there is an obvious race where it might see a non-empty queue, but race > on lock and find the queue empty once acquired, but meh. Yep, I would be surprised if this race caused a problem in practice. Especially if we spread the poppers by starting iteration based on the cpu->shared map as you suggested. > > > I'm thinking 4 or 8 shards should be plenty, even for Intel LLC. I'll try out a few combinations and see what works. > > > > > > > #ifdef CONFIG_SMP > > > > > > > +static struct task_struct *swqueue_pull_task(struct swqueue *swqueue) > > > > +{ > > > > + unsigned long flags; > > > > + > > > > + struct task_struct *p; > > > > + > > > > + spin_lock_irqsave(&swqueue->lock, flags); > > > > + p = list_first_entry_or_null(&swqueue->list, struct task_struct, > > > > + swqueue_node); > > > > + if (p) > > > > + list_del_init(&p->swqueue_node); > > > > + spin_unlock_irqrestore(&swqueue->lock, flags); > > > > + > > > > + return p; > > > > +} > > > > > > Would this not normally be called pop() or somesuch? > > > > Yes, I'll improve the name in the next iteration. swqueue_dequeue() and > > swqueue_enqueue() seem like the most canonical. Let me know if you have another > > preference. > > Well, there's two layers intermixed here, which is, I think, what gave > rise to the whole wonky naming. > > If you implement the queue primitives as: push, pop, del > and then build, using them, the scheduler primitives: enqueue, dequeue, > pick, things should become saner. Right, will do > > > > + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1) > > > > + return; > > > > > > Elsewhere you mentioned heuristics, this smells like them. 
This and the > > > is_cpus_allowed() thing makes you loose plenty of opportunities. > > > > Yeah fair enough, these certainly are heuristics as well. > > > > I thought it best to try and avoid swqueue getting in the way of > > select_task_rq_fair() (at least to start out with), but we could always > > remove that and run other experiments to see how it does. > > So, part of the problem is that you might be hooking at the wrong level > (see also my earlier email about the queue only containing running tasks > and none of the ready tasks). > > If you were to hook at the __{en,de}queue_entity() level you'll > automagically dequeue running tasks and enqueue any ready tasks. OTOH > you get to deal with the problem that not all entities are tasks, but > that shouldn't be too hard. > > Also, you end up with more enqueue/dequeue -- so that might hurt. If this doesn't cause untenable contention then it seems preferable if the overall goal is work conservation. Using your shard idea would hopefully mitigate that concern somewhat as well. I'll try this out for v2 and compare it to keeping the enqueues to only be on the wakeup path. If there's too much contention we can always just go back to the wakeup case, but I'm hopeful this works out. > Also, at this point you're nr_cpus/N away from simply scanning the > runqueues and pulling things from non-empty runqueues. But for large > nr_cpus and smallish N it might be a win over-all. Dunno.. Given that we're not seeing any contention on smaller LLCs but still observing perf wins, I'm optimistic. Thanks, David
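For reference, hooking at the entity level as suggested would look roughly like the sketch below (untested, against the current __enqueue_entity()/__dequeue_entity() helpers in fair.c). The existing wakeup/migration gating in swqueue_enqueue() would need to be rethought, since enqueue flags aren't available at this layer; a zero is passed here purely as a placeholder:

static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	rb_add_cached(&se->run_node, &cfs_rq->tasks_timeline, __entity_less);

	/* Group entities aren't tasks and can't go on the shared wakequeue. */
	if (entity_is_task(se))
		swqueue_enqueue(rq_of(cfs_rq), task_of(se), 0);
}

static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline);

	/* Covers both a task being picked to run and a true dequeue. */
	if (entity_is_task(se))
		swqueue_remove_task(task_of(se));
}

This is what gives the "running tasks drop off, ready tasks stay on" behavior automatically, at the cost of a shared-queue lock/unlock on every entity enqueue and dequeue.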
On 6/21/23 2:03 AM, Aaron Lu wrote: > On Wed, Jun 21, 2023 at 12:43:52AM -0500, David Vernet wrote: >> On Wed, Jun 21, 2023 at 12:54:16PM +0800, Aaron Lu wrote: >>> On Tue, Jun 20, 2023 at 09:43:00PM -0500, David Vernet wrote: >>>> On Wed, Jun 21, 2023 at 10:35:34AM +0800, Aaron Lu wrote: >>>>> On Tue, Jun 20, 2023 at 12:36:26PM -0500, David Vernet wrote: [ ... ] >>>> I'm not sure what we're hoping to gain by continuing to run various >>>> netperf workloads with your specific parameters? >>> >>> I don't quite follow you. >>> >>> I thought we were in the process of figuring out why for the same >>> workload(netperf/default_mode/nr_client=nr_cpu) on two similar >>> machines(both are Skylake) you saw no contention while I saw some so I >>> tried to be exact on how I run the workload. >> >> I just reran the workload on a 26 core / 52 thread Cooper Lake using >> your exact command below and still don't observe any contention >> whatsoever on the swqueue lock: > > Well, it's a puzzle to me. > > But as you said below, I guess I'll just move on. Thanks for bringing this up Aaron. The discussion moved on to different ways to fix the netperf triggered contention, but I wanted to toss this out as an easy way to see the same problem: # swqueue disabled: # ./schbench -L -m 52 -p 512 -r 10 -t 1 Wakeup Latencies percentiles (usec) runtime 10 (s) (14674354 total samples) 20.0th: 8 (4508866 samples) 50.0th: 11 (2879648 samples) 90.0th: 35 (5865268 samples) * 99.0th: 70 (1282166 samples) 99.9th: 110 (124040 samples) min=1, max=9312 avg worker transfer: 28211.91 ops/sec 13.78MB/s During the swqueue=0 run, the system was ~30% idle # swqueue enabled: # ./schbench -L -m 52 -p 512 -r 10 -t 1 Wakeup Latencies percentiles (usec) runtime 10 (s) (6448414 total samples) 20.0th: 30 (1383423 samples) 50.0th: 39 (1986827 samples) 90.0th: 63 (2446965 samples) * 99.0th: 108 (567275 samples) 99.9th: 158 (57487 samples) min=1, max=15018 avg worker transfer: 12395.27 ops/sec 6.05MB/s During the swqueue=1 run, the CPU was at was 97% system time, all stuck on spinlock contention in the scheduler. This is a single socket cooperlake with 26 cores/52 threads. The work is similar to perf pipe test, 52 messenger threads each bouncing a message back and forth with their own private worker for a 10 second run. Adding more messenger threads (-m 128) increases the swqueue=0 ops/sec to about 19MB/s and drags down the swqueue=1 ops/sec to 2MB/s. -chris
On Thu, Jun 22, 2023 at 12:29:35PM +0200, Peter Zijlstra wrote: > On Thu, Jun 22, 2023 at 02:41:57PM +0530, Gautham R. Shenoy wrote: > > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -215,21 +215,17 @@ static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags) > > { > > unsigned long flags; > > struct swqueue *swqueue; > > - bool task_migrated = enq_flags & ENQUEUE_MIGRATED; > > - bool task_wakeup = enq_flags & ENQUEUE_WAKEUP; > > > > /* > > * Only enqueue the task in the shared wakequeue if: > > * > > * - SWQUEUE is enabled > > - * - The task is on the wakeup path > > - * - The task wasn't purposefully migrated to the current rq by > > - * select_task_rq() > > - * - The task isn't pinned to a specific CPU > > + * - select_idle_sibling() didn't find an idle CPU to enqueue this wakee task. > > */ > > - if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1) > > You lost the nr_cpus_allowed == 1 case, that actually made sense, except > he didn't don't have a hook in set_cpus_allowed() to fix it back up. > I didn't retain the "nr_cpus_allowed == 1" check because select_task_rq() calls select_task_rq_fair() only when p->nr_cpus_allowed > 1. So if p was affined to a single CPU, it would always have p->sched_add_to_swq set to 0, thereby not queueing it on the 'queue'. > > + if (!READ_ONCE(p->sched_add_to_swq)) > > return; > > So suppose we fill all CPUs with tasks T_{0..n-n1}, sis finds them a > nice idle cpu and we don't get queued. Yes. > > Then wake up another n tasks T_{n...2n-1}, none of them find an idle > CPU, as per the above them all taken. These all do get queued. Yes. > > However, suppose T>=n does a wakeup-preemption, and we end up with T>=n > running while T<n get queued on the rq, but not the 'queue' (I so hate > the swqueue name). Yes. But note that prior to running T>=n on the CPU, it will be removed from the 'queue'. So if every T>=n preempts the corresponding T<n, then, the queue becomes empty. And yes, in this case the queue served no purpose and does not justify the additional overhead. > > Then one CPU has both it's tasks go 'away' and becomes idle. Yes, and this CPU prior to newidle_balance() will now search the 'queue'. > > At this point the 'queue' contains only running tasks that are not > migratable and all the tasks that could be migrated are not on the queue > -- IOW it is completely useless. At this point, the CPU will call swqueue_pick_next_task() and we can have one of the three cases: * The 'queue' is empty: As mentioned above, the queue didn't do anything for the tasks and was cmpletely useless. * The task at the head of 'queue' is not runnable on the CPU which is going idle. Again, wasted effort of adding this task to the queue, because in the current implementation, the task is removed from the 'queue' even when it is not runnable on this CPU. * The 'queue' has some task which can be run on the CPU going idle. The task is successfully migrated to this CPU, which no longer has to do newidle_balance() and the task has reduced it waiting period. > > Is this the intention? IIUC, the intention is to reduce the avg post-wakeup wait time of the task by running it on a newidle CPU, when the task's CPU is still busy running something else. This is assuming that the task won't win wakeup-preemption. But the flipside is the additional cost of swqueue enqueue/dequeue. 
I have queued a run across a set of benchmarks, but just to get an idea how the patch performs, I ran hackbench and this is what the results look like on a 2 Socket Gen3 EPYC configured in NPS=2 with 64 cores 128 threads per socket: Test: tip tip_swq_disabled tip_swq_enabled ============================================================================= 1-groups: 3.82 (0.00 pct) 3.84 (-0.52 pct) 3.46 (+9.42 pct) 2-groups: 4.40 (0.00 pct) 4.42 (-0.45 pct) 3.66 (+16.81 pct) 4-groups: 4.84 (0.00 pct) 4.82 (+0.41 pct) 4.50 (+7.02 pct) 8-groups: 5.45 (0.00 pct) 5.43 (+0.36 pct) 5.85 (-7.33 pct) 16-groups: 6.94 (0.00 pct) 6.90 (+0.57 pct) 8.41 (-21.18 pct) We can see that patchset does quite well with 1,2,4 groups, but with 8 and 16 groups as the system becomes overloaded, we can see the performance degrade. In the overloaded case, I could observe native_queued_spin_lock_slowpath() looming large in the perf profiles when the swqueue feature was enabled. So perhaps the cost of swqueue enqueue/dequeue is not justifiable, especially if the tasks are anyway going to run on their target CPUs. I will post more results later. -- Thanks and Regards gautham.
Hello Peter, David, On Fri, Jun 23, 2023 at 03:20:15PM +0530, Gautham R. Shenoy wrote: > On Thu, Jun 22, 2023 at 12:29:35PM +0200, Peter Zijlstra wrote: > > On Thu, Jun 22, 2023 at 02:41:57PM +0530, Gautham R. Shenoy wrote: > > I will post more results later. I was able to get some numbers for hackbench, schbench (old), and tbench over the weekend on a 2 Socket Zen3 box with 64 cores 128 threads per socket configured in NPS1 mode. The legend is as follows: tip : tip/sched/core with HEAD being commit e2a1f85bf9f5 ("sched/psi: Avoid resetting the min update period when it is unnecessary") david : This patchset david-ego-1 : David's patchset + my modification to allow SIS signal that a task should be queued on the shared-wakequeue when SIS cannot find an idle CPU to wake up the task. david-ego-2 : David's patchset + david-ego-1 + my modification to remove the first task from the shared-wakequeue whose cpus_allowed contains this CPU. Currently we don't do this check and always remove the first task. david-ego-1 and david-ego-2 are attached with this mail. hackbench (Measure: time taken to complete, in seconds) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Test: tip david david-ego-1 david-ego-2 1-groups: 3.92 (0.00 pct) 3.35 (14.54 pct) 3.53 (9.94 pct) 3.30 (15.81 pct) 2-groups: 4.58 (0.00 pct) 3.89 (15.06 pct) 3.95 (13.75 pct) 3.79 (17.24 pct) 4-groups: 4.99 (0.00 pct) 4.42 (11.42 pct) 4.76 (4.60 pct) 4.77 (4.40 pct) 8-groups: 5.67 (0.00 pct) 5.08 (10.40 pct) 6.16 (-8.64 pct) 6.33 (-11.64 pct) 16-groups: 7.88 (0.00 pct) 7.32 (7.10 pct) 8.57 (-8.75 pct) 9.77 (-23.98 pct) Observation: We see that David's patchset does very well across all the groups. Expanding the scope of the shared-wakequeue with david-ego-1 doesn't give us much and in fact hurts at higher utilization. Same is the case with david-ego-2 which only pulls allowed tasks from the shared-wakequeue. In david-ego-2 we see a greater amount of spin-lock contention for 8 and 16 groups, as the code holds the spinlock and iterates through the list members while checking cpu-affinity. So, David's original patchset wins this one. schbench (Measure : 99th Percentile latency, in us) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #workers: tip david david-ego-1 david-ego-2 1: 26.00 (0.00 pct) 21.00 (19.23 pct) 28.00 (-7.69 pct) 22.00 (15.38 pct) 2: 27.00 (0.00 pct) 29.00 (-7.40 pct) 28.00 (-3.70 pct) 30.00 (-11.11 pct) 4: 31.00 (0.00 pct) 31.00 (0.00 pct) 31.00 (0.00 pct) 28.00 (9.67 pct) 8: 36.00 (0.00 pct) 37.00 (-2.77 pct) 34.00 (5.55 pct) 39.00 (-8.33 pct) 16: 49.00 (0.00 pct) 49.00 (0.00 pct) 48.00 (2.04 pct) 50.00 (-2.04 pct) 32: 80.00 (0.00 pct) 80.00 (0.00 pct) 88.00 (-10.00 pct) 79.00 (1.25 pct) 64: 169.00 (0.00 pct) 180.00 (-6.50 pct) 174.00 (-2.95 pct) 168.00 (0.59 pct) 128: 343.00 (0.00 pct) 355.00 (-3.49 pct) 356.00 (-3.79 pct) 344.00 (-0.29 pct) 256: 42048.00 (0.00 pct) 46528.00 (-10.65 pct) 51904.00 (-23.43 pct) 48064.00 (-14.30 pct) 512: 95104.00 (0.00 pct) 95872.00 (-0.80 pct) 95360.00 (-0.26 pct) 97152.00 (-2.15 pct) Observations: There are run-to-run variations with this benchmark. I will try with the newer schbench later this week. 
tbench (Measure: Throughput, records/s) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Clients: tip sis-node david david-ego-1 ego-david-2 1 452.49 (0.00 pct) 457.94 (1.20 pct) 448.52 (-0.87 pct) 447.11 (-1.18 pct) 458.45 (1.31 pct) 2 862.44 (0.00 pct) 879.99 (2.03 pct) 860.14 (-0.26 pct) 873.27 (1.25 pct) 891.72 (3.39 pct) 4 1604.27 (0.00 pct) 1618.87 (0.91 pct) 1610.95 (0.41 pct) 1628.45 (1.50 pct) 1657.26 (3.30 pct) 8 2966.77 (0.00 pct) 3040.90 (2.49 pct) 2991.07 (0.81 pct) 3063.31 (3.25 pct) 3106.50 (4.70 pct) 16 5176.70 (0.00 pct) 5292.29 (2.23 pct) 5478.32 (5.82 pct) 5462.05 (5.51 pct) 5537.15 (6.96 pct) 32 8205.24 (0.00 pct) 8949.12 (9.06 pct) 9039.63 (10.16 pct) 9466.07 (15.36 pct) 9365.06 (14.13 pct) 64 13956.71 (0.00 pct) 14461.42 (3.61 pct) 16337.65 (17.05 pct) 16941.63 (21.38 pct) 15697.47 (12.47 pct) 128 24005.50 (0.00 pct) 26052.75 (8.52 pct) 25605.24 (6.66 pct) 27243.19 (13.48 pct) 24854.60 (3.53 pct) 256 32457.61 (0.00 pct) 21999.41 (-32.22 pct) 36953.22 (13.85 pct) 32299.31 (-0.48 pct) 33037.03 (1.78 pct) 512 34345.24 (0.00 pct) 41166.39 (19.86 pct) 40845.23 (18.92 pct) 40797.97 (18.78 pct) 38150.17 (11.07 pct) 1024 33432.92 (0.00 pct) 40900.84 (22.33 pct) 39749.35 (18.89 pct) 41133.82 (23.03 pct) 38464.26 (15.04 pct) Observations: tbench really likes all variants of shared-wakeueue. I have also included sis-node numbers since we saw that tbench liked sis-node. Also, it can be noted that except for the 256 clients case (number of clients == number of threads in the system), in all other cases, we see a benefit with david-ego-1 which extends the usage of shared-wakequeue to the waker's target when the waker's LLC is busy. Will try and get the netperf, postgresql, SPECjbb and Deathstarbench numbers this week. -- Thanks and Regards gautham.
On Mon, Jun 26, 2023 at 11:34:41AM +0530, Gautham R. Shenoy wrote: > Hello Peter, David, Hello Gautham, Thanks a lot for running these experiments and experimenting with some variations. > > On Fri, Jun 23, 2023 at 03:20:15PM +0530, Gautham R. Shenoy wrote: > > On Thu, Jun 22, 2023 at 12:29:35PM +0200, Peter Zijlstra wrote: > > > On Thu, Jun 22, 2023 at 02:41:57PM +0530, Gautham R. Shenoy wrote: > > > > > I will post more results later. > > I was able to get some numbers for hackbench, schbench (old), and > tbench over the weekend on a 2 Socket Zen3 box with 64 cores 128 > threads per socket configured in NPS1 mode. > > The legend is as follows: > > tip : tip/sched/core with HEAD being commit e2a1f85bf9f5 ("sched/psi: > Avoid resetting the min update period when it is unnecessary") > > > david : This patchset > > david-ego-1 : David's patchset + my modification to allow SIS signal > that a task should be queued on the shared-wakequeue when SIS cannot > find an idle CPU to wake up the task. > > david-ego-2 : David's patchset + david-ego-1 + my modification to > remove the first task from the shared-wakequeue whose > cpus_allowed contains this CPU. Currently we don't do > this check and always remove the first task. > > > david-ego-1 and david-ego-2 are attached with this mail. > > hackbench (Measure: time taken to complete, in seconds) > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > Test: tip david david-ego-1 david-ego-2 > 1-groups: 3.92 (0.00 pct) 3.35 (14.54 pct) 3.53 (9.94 pct) 3.30 (15.81 pct) > 2-groups: 4.58 (0.00 pct) 3.89 (15.06 pct) 3.95 (13.75 pct) 3.79 (17.24 pct) > 4-groups: 4.99 (0.00 pct) 4.42 (11.42 pct) 4.76 (4.60 pct) 4.77 (4.40 pct) > 8-groups: 5.67 (0.00 pct) 5.08 (10.40 pct) 6.16 (-8.64 pct) 6.33 (-11.64 pct) > 16-groups: 7.88 (0.00 pct) 7.32 (7.10 pct) 8.57 (-8.75 pct) 9.77 (-23.98 pct) Nice, these are pretty significant improvements. > Observation: We see that David's patchset does very well across all > the groups. Expanding the scope of the shared-wakequeue with > david-ego-1 doesn't give us much and in fact hurts at higher > utilization. Same is the case with david-ego-2 which only pulls > allowed tasks from the shared-wakequeue. In david-ego-2 we see a > greater amount of spin-lock contention for 8 and 16 groups, as the > code holds the spinlock and iterates through the list members while > checking cpu-affinity. Peter's idea to shard may help with the contention here. I also wonder how we'd do with david-ego-2 minus david-ego-1. Perhaps the slightly less aggressive enqueing would somewhat lower the contention? > so, david's original patchset wins this one. 
> > > > > schbench (measure : 99th percentile latency, in us) > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > #workers: tip david david-ego-1 david-ego-2 > 1: 26.00 (0.00 pct) 21.00 (19.23 pct) 28.00 (-7.69 pct) 22.00 (15.38 pct) > 2: 27.00 (0.00 pct) 29.00 (-7.40 pct) 28.00 (-3.70 pct) 30.00 (-11.11 pct) > 4: 31.00 (0.00 pct) 31.00 (0.00 pct) 31.00 (0.00 pct) 28.00 (9.67 pct) > 8: 36.00 (0.00 pct) 37.00 (-2.77 pct) 34.00 (5.55 pct) 39.00 (-8.33 pct) > 16: 49.00 (0.00 pct) 49.00 (0.00 pct) 48.00 (2.04 pct) 50.00 (-2.04 pct) > 32: 80.00 (0.00 pct) 80.00 (0.00 pct) 88.00 (-10.00 pct) 79.00 (1.25 pct) > 64: 169.00 (0.00 pct) 180.00 (-6.50 pct) 174.00 (-2.95 pct) 168.00 (0.59 pct) > 128: 343.00 (0.00 pct) 355.00 (-3.49 pct) 356.00 (-3.79 pct) 344.00 (-0.29 pct) > 256: 42048.00 (0.00 pct) 46528.00 (-10.65 pct) 51904.00 (-23.43 pct) 48064.00 (-14.30 pct) > 512: 95104.00 (0.00 pct) 95872.00 (-0.80 pct) 95360.00 (-0.26 pct) 97152.00 (-2.15 pct) > > > observations: there are run-to-run variations with this benchmark. i > will try with the newer schbench later this week. +cc Chris. He showed how you can use schbench to cause painful contention with swqueue in [0]. Chris -- any other schbench incantations that you'd recommend we try with Gautham's patches? [0]: https://lore.kernel.org/lkml/c8419d9b-2b31-2190-3058-3625bdbcb13d@meta.com/ > tbench (measure: throughput, records/s) > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > clients: tip sis-node david david-ego-1 ego-david-2 > 1 452.49 (0.00 pct) 457.94 (1.20 pct) 448.52 (-0.87 pct) 447.11 (-1.18 pct) 458.45 (1.31 pct) > 2 862.44 (0.00 pct) 879.99 (2.03 pct) 860.14 (-0.26 pct) 873.27 (1.25 pct) 891.72 (3.39 pct) > 4 1604.27 (0.00 pct) 1618.87 (0.91 pct) 1610.95 (0.41 pct) 1628.45 (1.50 pct) 1657.26 (3.30 pct) > 8 2966.77 (0.00 pct) 3040.90 (2.49 pct) 2991.07 (0.81 pct) 3063.31 (3.25 pct) 3106.50 (4.70 pct) > 16 5176.70 (0.00 pct) 5292.29 (2.23 pct) 5478.32 (5.82 pct) 5462.05 (5.51 pct) 5537.15 (6.96 pct) > 32 8205.24 (0.00 pct) 8949.12 (9.06 pct) 9039.63 (10.16 pct) 9466.07 (15.36 pct) 9365.06 (14.13 pct) > 64 13956.71 (0.00 pct) 14461.42 (3.61 pct) 16337.65 (17.05 pct) 16941.63 (21.38 pct) 15697.47 (12.47 pct) > 128 24005.50 (0.00 pct) 26052.75 (8.52 pct) 25605.24 (6.66 pct) 27243.19 (13.48 pct) 24854.60 (3.53 pct) > 256 32457.61 (0.00 pct) 21999.41 (-32.22 pct) 36953.22 (13.85 pct) 32299.31 (-0.48 pct) 33037.03 (1.78 pct) > 512 34345.24 (0.00 pct) 41166.39 (19.86 pct) 40845.23 (18.92 pct) 40797.97 (18.78 pct) 38150.17 (11.07 pct) > 1024 33432.92 (0.00 pct) 40900.84 (22.33 pct) 39749.35 (18.89 pct) 41133.82 (23.03 pct) 38464.26 (15.04 pct) > > > observations: tbench really likes all variants of shared-wakeueue. i It's encouraging to see how well swqueue is handling these benchmarks; especially on CPUs with larger LLCs and with a lot of clients. I'm keen to see how we do with the sharding approach and your patches. > have also included sis-node numbers since we saw that tbench liked > sis-node. > > also, it can be noted that except for the 256 clients case (number of > clients == number of threads in the system), in all other cases, we > see a benefit with david-ego-1 which extends the usage of > shared-wakequeue to the waker's target when the waker's llc is busy. Also confused why the 256 client case would perform worse for SIS_NODE and david-ego-1. Was it consistently negative if you do multiple runs? 
Overall, my handwavey impression is that david-ego-1 doesn't seem to significantly improve over david, though I'd very much like to see the results of the other benchmarks you listed below. Iterating over tasks until you find a usable cpus->ptr also makes sense, and I'm hoping that the sharding idea will help with the contention there. > will try and get the netperf, postgresql, specjbb and deathstarbench > numbers this week. Thanks, that would be great. I'll be on vacation this week starting Wednesday, but will try to put together a v2 that includes Peter's suggestions by the end of the day tomorrow. Otherwise, I'll likely send it out next week (along with your patch(es) if they're shown to be an improvement over the baseline). Thanks, David > > -- > thanks and regards > gautham. > > > > > > > from 05d8efe2f3ae3abafd4bf94a0579d378dba63bb6 mon sep 17 00:00:00 2001 > from: "gautham r. shenoy" <gautham.shenoy@amd.com> > date: fri, 23 jun 2023 11:02:03 +0530 > subject: [patch 1/2] swqueue: control if a task should be queued on swq in > select_idle_sibling() > > if select_idle_sibling() fails to find an idle cpu to wakeup the task > on, then update the newly defined sched_add_to_swq field in its task > struct. > > use the value in this field to later on to determine if the task > should also be queued on the shared-wakequeue of the llc of the target > cpu. > > this extends the use of shared-wakequeue to cases when the target of a > wakeup is the current cpu instead of the task's previous cpu. > > signed-off-by: gautham r. shenoy <gautham.shenoy@amd.com> > --- > include/linux/sched.h | 6 ++++++ > kernel/sched/fair.c | 15 ++++++++------- > 2 files changed, 14 insertions(+), 7 deletions(-) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index b64fec55a381..38005262a7fe 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -910,6 +910,12 @@ struct task_struct { > */ > unsigned sched_remote_wakeup:1; > > + /* > + * bit used by select_idle_sibling() to signal enqueuing the > + * task on a shared wakequeue when it failed find an idle cpu. > + */ > + unsigned sched_add_to_swq:1; > + > /* bit to tell lsms we're in execve(): */ > unsigned in_execve:1; > unsigned in_iowait:1; > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index e311d1c3b992..fe33f6b13299 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -215,21 +215,17 @@ static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags) > { > unsigned long flags; > struct swqueue *swqueue; > - bool task_migrated = enq_flags & enqueue_migrated; > - bool task_wakeup = enq_flags & enqueue_wakeup; > > /* > * only enqueue the task in the shared wakequeue if: > * > * - swqueue is enabled > - * - the task is on the wakeup path > - * - the task wasn't purposefully migrated to the current rq by > - * select_task_rq() > - * - the task isn't pinned to a specific cpu > + * - select_idle_sibling() didn't find an idle cpu to enqueue this wakee task. > */ > - if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1) > + if (!p->sched_add_to_swq) > return; > > + p->sched_add_to_swq = 0; > swqueue = rq_swqueue(rq); > spin_lock_irqsave(&swqueue->lock, flags); > list_add_tail(&p->swqueue_node, &swqueue->list); > @@ -7361,6 +7357,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) > if ((unsigned)i < nr_cpumask_bits) > return i; > > + /* > + * no idle sibling was found. ok to queue this task on the > + * shared wakequeue of the target. 
> + */ > + p->sched_add_to_swq = 1; > return target; > } > > -- > 2.25.1 > > from 88f52c2df8a2d92423ddd12c92edec949148bf3c mon sep 17 00:00:00 2001 > from: "gautham r. shenoy" <gautham.shenoy@amd.com> > date: fri, 23 jun 2023 23:25:04 +0530 > subject: [patch 2/2] swqueue: only pull a task with valid affinity from > swqueue > > currently swqueue_pull_task() dequeues the task at the head of the > shared-wakequeue and then tries to migrate the task onto the current > cpu. > > this may fail, since the current cpu may not be set in the task's > affinity mask. > > hence in swqueue_pull_task(), pull the first task from the > shared-wakequeue that can be run on this cpu. with this, > swqueue_pick_next_task() can return a 0/1 instead of 0/-1/1 as it is > done now. > > singed-off-by: gautham r. shenoy <gautham.shenoy@amd.com> > --- > kernel/sched/fair.c | 22 ++++++++++++---------- > 1 file changed, 12 insertions(+), 10 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index fe33f6b13299..e78b8302b4c8 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -195,17 +195,21 @@ static struct swqueue *rq_swqueue(struct rq *rq) > return rq->cfs.swqueue; > } > > -static struct task_struct *swqueue_pull_task(struct swqueue *swqueue) > +static struct task_struct *swqueue_pull_task(struct swqueue *swqueue, int cpu) > { > unsigned long flags; > > struct task_struct *p; > > spin_lock_irqsave(&swqueue->lock, flags); > - p = list_first_entry_or_null(&swqueue->list, struct task_struct, > - swqueue_node); > - if (p) > - list_del_init(&p->swqueue_node); > + list_for_each_entry(p, &swqueue->list, swqueue_node) { > + if (cpumask_test_cpu(cpu, p->cpus_ptr)) { > + list_del_init(&p->swqueue_node); > + goto found; > + } > + } > + p = null; > +found: > spin_unlock_irqrestore(&swqueue->lock, flags); > > return p; > @@ -238,11 +242,11 @@ static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf) > struct task_struct *p = null; > struct rq *src_rq; > struct rq_flags src_rf; > - int ret; > + int ret = 0; > > swqueue = rq_swqueue(rq); > if (!list_empty(&swqueue->list)) > - p = swqueue_pull_task(swqueue); > + p = swqueue_pull_task(swqueue, rq->cpu); > > if (!p) > return 0; > @@ -255,10 +259,8 @@ static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf) > if (task_on_rq_queued(p) && !task_on_cpu(rq, p)) > src_rq = migrate_task_to(src_rq, &src_rf, p, cpu_of(rq)); > > - if (src_rq->cpu != rq->cpu) > + if (src_rq->cpu == rq->cpu) > ret = 1; > - else > - ret = -1; > > task_rq_unlock(src_rq, p, &src_rf); > > -- > 2.25.1 >
On 6/26/23 11:17 PM, David Vernet wrote: > On Mon, Jun 26, 2023 at 11:34:41AM +0530, Gautham R. Shenoy wrote: >> >> Observations: There are run-to-run variations with this benchmark. I >> will try with the newer schbench later this week. > > +cc Chris. He showed how you can use schbench to cause painful > contention with swqueue in [0]. Chris -- any other schbench incantations > that you'd recommend we try with Gautham's patches? > > [0]: https://lore.kernel.org/lkml/c8419d9b-2b31-2190-3058-3625bdbcb13d@meta.com/ Hopefully my command line from this other email will be consistent with the new schbench; please let me know if you're still seeing variations. In terms of other things to try, I was creating one worker per messenger to maximize runqueue lock contention, but variations on -t may be interesting. Basically I did -t 1 -m <increasing> until all the idle time on unpatched Linus was soaked up, and then I compared Linus and swqueue. -chris
diff --git a/include/linux/sched.h b/include/linux/sched.h index 1292d38d66cc..1f4fd22f88a8 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -770,6 +770,8 @@ struct task_struct { unsigned long wakee_flip_decay_ts; struct task_struct *last_wakee; + struct list_head swqueue_node; + /* * recent_used_cpu is initially set as the last CPU used by a task * that wakes affine another task. Waker/wakee relationships can diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d911b0631e7b..e04f0daf1f05 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4533,6 +4533,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) #ifdef CONFIG_SMP p->wake_entry.u_flags = CSD_TYPE_TTWU; p->migration_pending = NULL; + INIT_LIST_HEAD(&p->swqueue_node); #endif init_sched_mm_cid(p); } @@ -9872,6 +9873,7 @@ void __init sched_init_smp(void) init_sched_rt_class(); init_sched_dl_class(); + init_sched_fair_class_late(); sched_smp_initialized = true; } diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 807986bd6ea6..29fe25794884 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -139,17 +139,151 @@ static int __init setup_sched_thermal_decay_shift(char *str) } __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift); +/** + * struct swqueue - Per-LLC queue structure for enqueuing and pulling waking + * tasks. + * + * WHAT + * ==== + * + * This structure enables the scheduler to be more aggressively work + * conserving, by placing waking tasks on a per-LLC FIFO queue that can then be + * pulled from when another core in the LLC is going to go idle. + * + * struct rq stores a pointer to its LLC's swqueue via struct cfs_rq. Waking + * tasks are enqueued in a swqueue at the end of enqueue_task_fair(), and are + * opportunistically pulled from the swqueue in pick_next_task_fair() prior to + * invoking newidle_balance(). Tasks enqueued in a swqueue be scheduled prior + * to being pulled from the swqueue, in which case they're simply removed from + * the swqueue. A waking task is only enqueued to a swqueue when it was _not_ + * manually migrated to the current runqueue by select_task_rq_fair(). + * + * There is currently no task-stealing between swqueues in different LLCs, + * which means that swqueue is not fully work conserving. This could be added + * at a later time, with tasks likely only being stolen across swqueues on the + * same NUMA node to avoid violating NUMA affinities. + * + * HOW + * === + * + * An swqueue is comprised of a list, and a spinlock for synchronization. Given + * that the critical section for a swqueue is typically a fast list operation, + * and that the swqueue is localized to a single LLC, the spinlock does not + * seem to be contended; even on a heavily utilized host. struct swqueues are + * also cacheline aligned to prevent false sharing between CPUs manipulating + * swqueues in other LLCs. + * + * WHY + * === + * + * As mentioned above, the main benefit of swqueue is that it enables more + * aggressive work conservation in the scheduler. This can benefit workloads + * that benefit more from CPU utilization than from L1/L2 cache locality. + * + * swqueues are segmented across LLCs both to avoid contention on the swqueue + * spinlock by minimizing the number of CPUs that could contend on it, as well + * as to strike a balance between work conservation, and L3 cache locality. 
+ */ +struct swqueue { + struct list_head list; + spinlock_t lock; +} ____cacheline_aligned; + #ifdef CONFIG_SMP -static void swqueue_enqueue(struct rq *rq, struct task_struct *p, - int enq_flags) -{} +static struct swqueue *rq_swqueue(struct rq *rq) +{ + return rq->cfs.swqueue; +} + +static struct task_struct *swqueue_pull_task(struct swqueue *swqueue) +{ + unsigned long flags; + + struct task_struct *p; + + spin_lock_irqsave(&swqueue->lock, flags); + p = list_first_entry_or_null(&swqueue->list, struct task_struct, + swqueue_node); + if (p) + list_del_init(&p->swqueue_node); + spin_unlock_irqrestore(&swqueue->lock, flags); + + return p; +} + +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags) +{ + unsigned long flags; + struct swqueue *swqueue; + bool task_migrated = enq_flags & ENQUEUE_MIGRATED; + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP; + + /* + * Only enqueue the task in the shared wakequeue if: + * + * - SWQUEUE is enabled + * - The task is on the wakeup path + * - The task wasn't purposefully migrated to the current rq by + * select_task_rq() + * - The task isn't pinned to a specific CPU + */ + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1) + return; + + swqueue = rq_swqueue(rq); + spin_lock_irqsave(&swqueue->lock, flags); + list_add_tail(&p->swqueue_node, &swqueue->list); + spin_unlock_irqrestore(&swqueue->lock, flags); +} + static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf) { - return 0; + struct swqueue *swqueue; + struct task_struct *p = NULL; + struct rq *src_rq; + struct rq_flags src_rf; + int ret; + + swqueue = rq_swqueue(rq); + if (!list_empty(&swqueue->list)) + p = swqueue_pull_task(swqueue); + + if (!p) + return 0; + + rq_unpin_lock(rq, rf); + raw_spin_rq_unlock(rq); + + src_rq = task_rq_lock(p, &src_rf); + + if (task_on_rq_queued(p) && !task_on_cpu(rq, p)) + src_rq = migrate_task_to(src_rq, &src_rf, p, cpu_of(rq)); + + if (src_rq->cpu != rq->cpu) + ret = 1; + else + ret = -1; + + task_rq_unlock(src_rq, p, &src_rf); + + raw_spin_rq_lock(rq); + rq_repin_lock(rq, rf); + + return ret; } static void swqueue_remove_task(struct task_struct *p) -{} +{ + unsigned long flags; + struct swqueue *swqueue; + + if (!list_empty(&p->swqueue_node)) { + swqueue = rq_swqueue(task_rq(p)); + spin_lock_irqsave(&swqueue->lock, flags); + list_del_init(&p->swqueue_node); + spin_unlock_irqrestore(&swqueue->lock, flags); + } +} /* * For asym packing, by default the lower numbered CPU has higher priority. 
@@ -12839,3 +12973,34 @@ __init void init_sched_fair_class(void) #endif /* SMP */ } + +__init void init_sched_fair_class_late(void) +{ +#ifdef CONFIG_SMP + int i; + struct swqueue *swqueue; + struct rq *rq; + struct rq *llc_rq; + + for_each_possible_cpu(i) { + if (per_cpu(sd_llc_id, i) == i) { + llc_rq = cpu_rq(i); + + swqueue = kzalloc_node(sizeof(struct swqueue), + GFP_KERNEL, cpu_to_node(i)); + INIT_LIST_HEAD(&swqueue->list); + spin_lock_init(&swqueue->lock); + llc_rq->cfs.swqueue = swqueue; + } + } + + for_each_possible_cpu(i) { + rq = cpu_rq(i); + llc_rq = cpu_rq(per_cpu(sd_llc_id, i)); + + if (rq == llc_rq) + continue; + rq->cfs.swqueue = llc_rq->cfs.swqueue; + } +#endif /* SMP */ +} diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 5a86e9795731..daee5c64af87 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -575,6 +575,7 @@ struct cfs_rq { #endif #ifdef CONFIG_SMP + struct swqueue *swqueue; /* * CFS load tracking */ @@ -2380,6 +2381,7 @@ extern void update_max_interval(void); extern void init_sched_dl_class(void); extern void init_sched_rt_class(void); extern void init_sched_fair_class(void); +extern void init_sched_fair_class_late(void); extern void reweight_task(struct task_struct *p, int prio);