[RFC,3/3] sched: Implement shared wakequeue in CFS

Message ID 20230613052004.2836135-4-void@manifault.com
State New
Series sched: Implement shared wakequeue in CFS

Commit Message

David Vernet June 13, 2023, 5:20 a.m. UTC
Overview
========

The scheduler must constantly strike a balance between work
conservation, and avoiding costly migrations which harm performance due
to e.g. decreased cache locality. The matter is further complicated by
the topology of the system. Migrating a task between cores on the same
LLC may be more optimal than keeping a task local to the CPU, whereas
migrating a task between LLCs or NUMA nodes may tip the balance in the
other direction.

With that in mind, while CFS is largely a work conserving scheduler,
there are certain instances where the scheduler will choose to keep a
task local to a CPU when it would have been more optimal to migrate it
to an idle core.

An example of such a workload is the HHVM / web workload at Meta. HHVM
is a VM that JITs Hack and PHP code in service of web requests. Like
other JIT / compilation workloads, it tends to be heavily CPU bound and
to exhibit generally poor cache locality. To try to address this, we set
several debugfs (/sys/kernel/debug/sched) knobs on our HHVM workloads:

- migration_cost_ns -> 0
- latency_ns -> 20000000
- min_granularity_ns -> 10000000
- wakeup_granularity_ns -> 12000000

These knobs are intended both to encourage the scheduler to be as work
conserving as possible (migration_cost_ns -> 0), and also to keep tasks
running for relatively long time slices so as to avoid the overhead of
context switching (the other knobs). Collectively, these knobs provide a
substantial performance win, resulting in roughly a 20% improvement in
throughput. Worth noting, however, is that this improvement is _not_ at
full machine saturation.

That said, even with these knobs, we noticed that CPUs were still going
idle even when the host was overcommitted. In response, we wrote the
"shared wakequeue" (swqueue) feature proposed in this patch set. The
idea behind swqueue is simple: it enables the scheduler to be
aggressively work conserving by placing a waking task into a per-LLC
FIFO queue which another core in the LLC can then pull from before
going idle.

With this simple change, we were able to achieve a 1 - 1.6% improvement
in throughput, as well as a small, consistent improvement in p95 and p99
latencies, in HHVM. These performance improvements were in addition to
the wins from the debugfs knobs mentioned above.

Design
======

The design of swqueue is quite simple. An swqueue is just a struct
list_head and a spinlock:

struct swqueue {
	struct list_head list;
	spinlock_t lock;
} ____cacheline_aligned;

We create a struct swqueue per LLC, ensuring they're in their own
cachelines to avoid false sharing between CPUs on different LLCs.

When a task first wakes up, it enqueues itself in the swqueue of its
current LLC at the end of enqueue_task_fair(). Enqueues only happen if
the task was not manually migrated to the current core by
select_task_rq(), and is not pinned to a specific CPU.

A core will pull a task from its LLC's swqueue before calling
newidle_balance().
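
To make the hook placement concrete, here is a rough sketch of where the
two hooks end up. The swqueue_* helper names follow the patch, but the
surrounding fair-path code is condensed and purely illustrative rather
than the literal diff:

/*
 * Illustrative placement only -- not the literal diff. The wakeup path
 * feeds the per-LLC queue; the pick path drains it before falling back
 * to newidle_balance().
 */
static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
	/* ... existing enqueue path ... */

	/* No-op for pinned tasks or tasks placed here by select_task_rq(). */
	swqueue_enqueue(rq, p, flags);
}

static struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
	/* ... existing pick path; nothing locally runnable ... */

	/* Try an O(1) pull from the LLC's swqueue before idle balancing. */
	if (!swqueue_pick_next_task(rq, rf))
		newidle_balance(rq, rf);

	/* ... retry the pick with whatever was pulled, if anything ... */
}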

Difference between SIS_NODE
===========================

In [0], Peter proposed a patch that addresses Tejun's observation that
when workqueues are targeted at a specific LLC on his Zen2 machine with
small CCXs, there is significant idle time due to select_idle_sibling()
not considering anything outside of the current LLC.

This patch (SIS_NODE) is essentially the complement to the proposal
here. SIS_NODE causes waking tasks to look for idle cores in neighboring
LLCs on the same die, whereas swqueue causes cores about to go idle to
look for enqueued tasks. That said, in its current form, the two
features operate at different scopes: SIS_NODE searches for idle cores
between LLCs, while swqueue enqueues tasks within a single LLC.

The patch has since been removed in [1], but we elect to compare its
performance to swqueue given that, as described above, it's conceptually
complementary.

[0]: https://lore.kernel.org/all/20230530113249.GA156198@hirez.programming.kicks-ass.net/
[1]: https://lore.kernel.org/all/20230605175636.GA4253@hirez.programming.kicks-ass.net/

I observed that while SIS_NODE works quite well for hosts with small
CCXs, it can result in degraded performance on machines that either have
a large number of total cores in a CCD, or for which the cache miss
penalty of migrating between CCXs is high, even on the same die.

For example, on Zen 4c hosts (Bergamo), CCXs within a CCD are muxed
through a single link to the IO die, and thus have similar cache miss
latencies as cores in remote CCDs.

Such subtleties could be taken into account with SIS_NODE, but
regardless, both features are conceptually complementary sides of the
same coin.  SIS_NODE searches for idle cores for waking threads, whereas
swqueue searches for available work before a core goes idle.

Results
=======

Note that the motivation for the shared wakequeue feature originally
came from experiments with the sched_ext framework that's currently
being proposed upstream. The ~1 - 1.6% improvement in HHVM throughput
is similarly visible using work-conserving sched_ext schedulers (even
very simple ones like global FIFO).

On both single- and multi-socket / multi-CCX hosts, this can measurably
improve performance. In addition to the performance gains observed on our
internal web workloads, we also observed an improvement in common
workloads such as kernel compile when running with swqueue. Here are
the results of running make -j$(nproc) built-in.a on several different
types of hosts configured with make allyesconfig on commit a27648c74210
("afs: Fix setting of mtime when creating a file/dir/symlink") on Linus'
tree (boost was disabled on all of these hosts when the experiments were
performed):

Single-socket | 32-core | 2-CCX | AMD 7950X Zen4

CPU max MHz:                     5879.8818
CPU min MHz:                     3000.0000
				o____________o_______o
				|    mean    | CPU   |
				o------------o-------o
NO_SWQUEUE + NO_SIS_NODE:	| 590.52s    | 3103% |
NO_SWQUEUE + SIS_NODE:		| 590.80s    | 3102% |
SWQUEUE + NO_SIS_NODE:		| 589.65s    | 3116% |
SWQUEUE + SIS_NODE:		| 589.99s    | 3115% |
				o------------o-------o

Takeaway: swqueue doesn't seem to provide a statistically significant
improvement for kernel compile on my 7950X. SIS_NODE similarly does not
have a noticeable effect on performance.

-------------------------------------------------------------------------------

Single-socket | 72-core | 6-CCX | AMD Milan Zen3

CPU max MHz:                     3245.0190
CPU min MHz:                     700.0000
				o_____________o_______o
				|    mean     | CPU   |
				o-------------o-------o
NO_SWQUEUE + NO_SIS_NODE:	| 1608.69s    | 6488% |
NO_SWQUEUE + SIS_NODE:		| 1610.24s    | 6473% |
SWQUEUE + NO_SIS_NODE:		| 1605.80s    | 6504% |
SWQUEUE + SIS_NODE:		| 1606.96s    | 6488% |
				o-------------o-------o

Takeaway: swqueue does provide a small, statistically significant
improvement on Milan, but the compile times in general were quite long
relative to the 7950X Zen4 and the Bergamo Zen4c due to Milan's lower
clock frequency. Milan also has larger CCXs than Bergamo, so it stands to
reason that select_idle_sibling() will have an easier time finding idle
cores inside the current CCX.

It also seems logical that SIS_NODE would hurt performance a bit here,
as all cores / CCXs are in the same NUMA node, so select_idle_sibling()
has to iterate over 72 cores, delaying task wakeup. That said, I'm not
sure that's a viable theory if total CPU% is lower with SIS_NODE.

-------------------------------------------------------------------------------

Single-socket | 176-core | 11-CCX | 2-CCX per CCD | AMD Bergamo Zen4c

CPU max MHz:                     1200.0000
CPU min MHz:                     1000.0000

				o____________o________o
				|    mean    | CPU    |
				o------------o--------o
NO_SWQUEUE + NO_SIS_NODE:	| 322.44s    | 15534% |
NO_SWQUEUE + SIS_NODE:		| 324.39s    | 15508% |
SWQUEUE + NO_SIS_NODE:		| 321.54s    | 15603% |
SWQUEUE + SIS_NODE:		| 321.88s    | 15622% |
				o------------o--------o

Takeaway: swqueue barely beats NO_SWQUEUE + NO_SIS_NODE, to the point
that it's arguably not statistically significant.
SIS_NODE results in a ~.9% performance degradation, likely for the same
reason as on Milan: the host has a large number of LLCs within a single
socket, so task wakeup latencies suffer due to select_idle_node()
searching up to 11 CCXs.

Conclusion
==========

swqueue in this form seems to provide a small but noticeable win for
front-end CPU-bound workloads spread over multiple CCXs. The reason
seems fairly straightforward: swqueue encourages work conservation
inside of a CCX by having a CPU do an O(1) pull from a per-LLC queue of
runnable tasks. As mentioned above, it is complementary to SIS_NODE,
which searches for idle cores on the wakeup path.

While swqueue in this form encourages work conservation, it of course
does not guarantee it given that we don't implement any kind of work
stealing between swqueues. In the future, we could potentially push CPU
utilization even higher by enabling work stealing between swqueues,
likely between CCXs on the same NUMA node.
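
Purely as a hypothetical illustration of what such stealing could look
like (none of this is in the patch set, and the iterator over a node's
swqueues below is a made-up placeholder):

/*
 * Hypothetical sketch only -- not part of this series. Drain a sibling
 * LLC's swqueue once the local one is empty. for_each_node_swqueue()
 * is a placeholder for some yet-to-be-written iterator over the other
 * per-LLC swqueues in this NUMA node; swqueue_pull_task() is the
 * helper from this series.
 */
static struct task_struct *swqueue_steal_task(struct rq *rq)
{
	struct swqueue *local = rq_swqueue(rq), *remote;
	struct task_struct *p;

	for_each_node_swqueue(cpu_of(rq), remote) {	/* placeholder */
		if (remote == local)
			continue;

		p = swqueue_pull_task(remote);
		if (p)
			return p;
	}

	return NULL;
}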

Originally-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: David Vernet <void@manifault.com>
---
 include/linux/sched.h |   2 +
 kernel/sched/core.c   |   2 +
 kernel/sched/fair.c   | 175 ++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h  |   2 +
 4 files changed, 176 insertions(+), 5 deletions(-)
  

Comments

Peter Zijlstra June 13, 2023, 8:32 a.m. UTC | #1
Still gotta read it properly, however:

On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> Single-socket | 32-core | 2-CCX | AMD 7950X Zen4
> Single-socket | 72-core | 6-CCX | AMD Milan Zen3
> Single-socket | 176-core | 11-CCX | 2-CCX per CCD | AMD Bergamo Zen4c

Could you please also benchmark on something Intel that has these stupid
large LLCs ?

Because the last time I tried something like this, it came apart real
quick. And AMD has these relatively small 8-core LLCs.
  
Peter Zijlstra June 13, 2023, 8:41 a.m. UTC | #2
On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> +struct swqueue {
> +	struct list_head list;
> +	spinlock_t lock;
> +} ____cacheline_aligned;
> +
>  #ifdef CONFIG_SMP
> +static struct swqueue *rq_swqueue(struct rq *rq)
> +{
> +	return rq->cfs.swqueue;
> +}
> +
> +static struct task_struct *swqueue_pull_task(struct swqueue *swqueue)
> +{
> +	unsigned long flags;
> +
> +	struct task_struct *p;
> +
> +	spin_lock_irqsave(&swqueue->lock, flags);
> +	p = list_first_entry_or_null(&swqueue->list, struct task_struct,
> +				     swqueue_node);
> +	if (p)
> +		list_del_init(&p->swqueue_node);
> +	spin_unlock_irqrestore(&swqueue->lock, flags);
> +
> +	return p;
> +}
> +
> +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
> +{
> +	unsigned long flags;
> +	struct swqueue *swqueue;
> +	bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> +	bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> +
> +	/*
> +	 * Only enqueue the task in the shared wakequeue if:
> +	 *
> +	 * - SWQUEUE is enabled
> +	 * - The task is on the wakeup path
> +	 * - The task wasn't purposefully migrated to the current rq by
> +	 *   select_task_rq()
> +	 * - The task isn't pinned to a specific CPU
> +	 */
> +	if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> +		return;
> +
> +	swqueue = rq_swqueue(rq);
> +	spin_lock_irqsave(&swqueue->lock, flags);
> +	list_add_tail(&p->swqueue_node, &swqueue->list);
> +	spin_unlock_irqrestore(&swqueue->lock, flags);
> +}
> +
>  static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf)
>  {
> -	return 0;
> +	struct swqueue *swqueue;
> +	struct task_struct *p = NULL;
> +	struct rq *src_rq;
> +	struct rq_flags src_rf;
> +	int ret;
> +
> +	swqueue = rq_swqueue(rq);
> +	if (!list_empty(&swqueue->list))
> +		p = swqueue_pull_task(swqueue);
> +
> +	if (!p)
> +		return 0;
> +
> +	rq_unpin_lock(rq, rf);
> +	raw_spin_rq_unlock(rq);
> +
> +	src_rq = task_rq_lock(p, &src_rf);
> +
> +	if (task_on_rq_queued(p) && !task_on_cpu(rq, p))
> +		src_rq = migrate_task_to(src_rq, &src_rf, p, cpu_of(rq));
> +
> +	if (src_rq->cpu != rq->cpu)
> +		ret = 1;
> +	else
> +		ret = -1;
> +
> +	task_rq_unlock(src_rq, p, &src_rf);
> +
> +	raw_spin_rq_lock(rq);
> +	rq_repin_lock(rq, rf);
> +
> +	return ret;
>  }
>  
>  static void swqueue_remove_task(struct task_struct *p)
> -{}
> +{
> +	unsigned long flags;
> +	struct swqueue *swqueue;
> +
> +	if (!list_empty(&p->swqueue_node)) {
> +		swqueue = rq_swqueue(task_rq(p));
> +		spin_lock_irqsave(&swqueue->lock, flags);
> +		list_del_init(&p->swqueue_node);
> +		spin_unlock_irqrestore(&swqueue->lock, flags);
> +	}
> +}
>  
>  /*
>   * For asym packing, by default the lower numbered CPU has higher priority.

*sigh*... pretty much all, if not all of this is called with rq->lock
held. So why the irqsave and big fat fail for using spinlock :-(
  
Aaron Lu June 14, 2023, 4:35 a.m. UTC | #3
On Tue, Jun 13, 2023 at 10:32:03AM +0200, Peter Zijlstra wrote:
> 
> Still gotta read it properly, however:
> 
> On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> > Single-socket | 32-core | 2-CCX | AMD 7950X Zen4
> > Single-socket | 72-core | 6-CCX | AMD Milan Zen3
> > Single-socket | 176-core | 11-CCX | 2-CCX per CCD | AMD Bergamo Zen4c
> 
> Could you please also benchmark on something Intel that has these stupid
> large LLCs ?
> 
> Because the last time I tried something like this, it came apart real
> quick. And AMD has these relatively small 8-core LLCs.

I tested on Intel(R) Xeon(R) Platinum 8358, which has 2 sockets and each
socket has a single LLC with 32 cores/64threads.

When running netperf with nr_thread=128, runtime=60:

"
netserver -4

for i in `seq $nr_threads`; do
	netperf -4 -H 127.0.0.1 -t UDP_RR -c -C -l $runtime &
done

wait
"

The lock contention due to the per-LLC swqueue->lock is quite serious:

    83.39%    83.33%  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath                                  -      -            
            |          
            |--42.86%--__libc_sendto
            |          entry_SYSCALL_64
            |          do_syscall_64
            |          |          
            |           --42.81%--__x64_sys_sendto
            |                     __sys_sendto
            |                     sock_sendmsg
            |                     inet_sendmsg
            |                     udp_sendmsg
            |                     udp_send_skb
            |                     ip_send_skb
            |                     ip_output
            |                     ip_finish_output
            |                     __ip_finish_output
            |                     ip_finish_output2
            |                     __dev_queue_xmit
            |                     __local_bh_enable_ip
            |                     do_softirq.part.0
            |                     __do_softirq
            |                     |          
            |                      --42.81%--net_rx_action
            |                                __napi_poll
            |                                process_backlog
            |                                __netif_receive_skb
            |                                __netif_receive_skb_one_core
            |                                ip_rcv
            |                                ip_local_deliver
            |                                ip_local_deliver_finish
            |                                ip_protocol_deliver_rcu
            |                                udp_rcv
            |                                __udp4_lib_rcv
            |                                udp_unicast_rcv_skb
            |                                udp_queue_rcv_skb
            |                                udp_queue_rcv_one_skb
            |                                __udp_enqueue_schedule_skb
            |                                sock_def_readable
            |                                __wake_up_sync_key
            |                                __wake_up_common_lock
            |                                |          
            |                                 --42.81%--__wake_up_common
            |                                           receiver_wake_function
            |                                           autoremove_wake_function
            |                                           default_wake_function
            |                                           try_to_wake_up
            |                                           ttwu_do_activate
            |                                           enqueue_task
            |                                           enqueue_task_fair
            |                                           _raw_spin_lock_irqsave
            |                                           |          
            |                                            --42.81%--native_queued_spin_lock_slowpath
            |          
            |--20.39%--0
            |          __libc_recvfrom
            |          entry_SYSCALL_64
            |          do_syscall_64
            |          __x64_sys_recvfrom
            |          __sys_recvfrom
            |          sock_recvmsg
            |          inet_recvmsg
            |          udp_recvmsg
            |          __skb_recv_udp
            |          __skb_wait_for_more_packets
            |          schedule_timeout
            |          schedule
            |          __schedule
            |          pick_next_task_fair
            |          |          
            |           --20.39%--swqueue_remove_task
            |                     _raw_spin_lock_irqsave
            |                     |          
            |                      --20.39%--native_queued_spin_lock_slowpath
            |          
             --20.07%--__libc_recvfrom
                       entry_SYSCALL_64
                       do_syscall_64
                       __x64_sys_recvfrom
                       __sys_recvfrom
                       sock_recvmsg
                       inet_recvmsg
                       udp_recvmsg
                       __skb_recv_udp
                       __skb_wait_for_more_packets
                       schedule_timeout
                       schedule
                       __schedule
                       |          
                        --20.06%--pick_next_task_fair
                                  swqueue_remove_task
                                  _raw_spin_lock_irqsave
                                  |          
                                   --20.06%--native_queued_spin_lock_slowpath

I suppose that is because there are too many CPUs in a single LLC on
this machine and when all these CPUs try to queue/pull tasks in this
per-LLC shared wakequeue, it just doesn't scale well.
  
Peter Zijlstra June 14, 2023, 9:27 a.m. UTC | #4
On Wed, Jun 14, 2023 at 12:35:29PM +0800, Aaron Lu wrote:
> On Tue, Jun 13, 2023 at 10:32:03AM +0200, Peter Zijlstra wrote:
> > 
> > Still gotta read it properly, however:
> > 
> > On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> > > Single-socket | 32-core | 2-CCX | AMD 7950X Zen4
> > > Single-socket | 72-core | 6-CCX | AMD Milan Zen3
> > > Single-socket | 176-core | 11-CCX | 2-CCX per CCD | AMD Bergamo Zen4c
> > 
> > Could you please also benchmark on something Intel that has these stupid
> > large LLCs ?
> > 
> > Because the last time I tried something like this, it came apart real
> > quick. And AMD has these relatively small 8-core LLCs.
> 
> I tested on Intel(R) Xeon(R) Platinum 8358, which has 2 sockets and each
> socket has a single LLC with 32 cores/64threads.

> The lock contention due to the per-LLC swqueue->lock is quite serious:

Yep, that's what I was expecting -- complete meltdown that.
  
David Vernet June 14, 2023, 8:26 p.m. UTC | #5
On Tue, Jun 13, 2023 at 10:41:11AM +0200, Peter Zijlstra wrote:
> On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> > +struct swqueue {
> > +	struct list_head list;
> > +	spinlock_t lock;
> > +} ____cacheline_aligned;
> > +
> >  #ifdef CONFIG_SMP
> > +static struct swqueue *rq_swqueue(struct rq *rq)
> > +{
> > +	return rq->cfs.swqueue;
> > +}
> > +
> > +static struct task_struct *swqueue_pull_task(struct swqueue *swqueue)
> > +{
> > +	unsigned long flags;
> > +
> > +	struct task_struct *p;
> > +
> > +	spin_lock_irqsave(&swqueue->lock, flags);
> > +	p = list_first_entry_or_null(&swqueue->list, struct task_struct,
> > +				     swqueue_node);
> > +	if (p)
> > +		list_del_init(&p->swqueue_node);
> > +	spin_unlock_irqrestore(&swqueue->lock, flags);
> > +
> > +	return p;
> > +}
> > +
> > +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
> > +{
> > +	unsigned long flags;
> > +	struct swqueue *swqueue;
> > +	bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> > +	bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> > +
> > +	/*
> > +	 * Only enqueue the task in the shared wakequeue if:
> > +	 *
> > +	 * - SWQUEUE is enabled
> > +	 * - The task is on the wakeup path
> > +	 * - The task wasn't purposefully migrated to the current rq by
> > +	 *   select_task_rq()
> > +	 * - The task isn't pinned to a specific CPU
> > +	 */
> > +	if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> > +		return;
> > +
> > +	swqueue = rq_swqueue(rq);
> > +	spin_lock_irqsave(&swqueue->lock, flags);
> > +	list_add_tail(&p->swqueue_node, &swqueue->list);
> > +	spin_unlock_irqrestore(&swqueue->lock, flags);
> > +}
> > +
> >  static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf)
> >  {
> > -	return 0;
> > +	struct swqueue *swqueue;
> > +	struct task_struct *p = NULL;
> > +	struct rq *src_rq;
> > +	struct rq_flags src_rf;
> > +	int ret;
> > +
> > +	swqueue = rq_swqueue(rq);
> > +	if (!list_empty(&swqueue->list))
> > +		p = swqueue_pull_task(swqueue);
> > +
> > +	if (!p)
> > +		return 0;
> > +
> > +	rq_unpin_lock(rq, rf);
> > +	raw_spin_rq_unlock(rq);
> > +
> > +	src_rq = task_rq_lock(p, &src_rf);
> > +
> > +	if (task_on_rq_queued(p) && !task_on_cpu(rq, p))
> > +		src_rq = migrate_task_to(src_rq, &src_rf, p, cpu_of(rq));
> > +
> > +	if (src_rq->cpu != rq->cpu)
> > +		ret = 1;
> > +	else
> > +		ret = -1;
> > +
> > +	task_rq_unlock(src_rq, p, &src_rf);
> > +
> > +	raw_spin_rq_lock(rq);
> > +	rq_repin_lock(rq, rf);
> > +
> > +	return ret;
> >  }
> >  
> >  static void swqueue_remove_task(struct task_struct *p)
> > -{}
> > +{
> > +	unsigned long flags;
> > +	struct swqueue *swqueue;
> > +
> > +	if (!list_empty(&p->swqueue_node)) {
> > +		swqueue = rq_swqueue(task_rq(p));
> > +		spin_lock_irqsave(&swqueue->lock, flags);
> > +		list_del_init(&p->swqueue_node);
> > +		spin_unlock_irqrestore(&swqueue->lock, flags);
> > +	}
> > +}
> >  
> >  /*
> >   * For asym packing, by default the lower numbered CPU has higher priority.
> 
> *sigh*... pretty much all, if not all of this is called with rq->lock
> held. So why the irqsave and big fat fail for using spinlock :-(

Hi Peter,

Thanks for the quick review. Yeah, good call about the IRQs -- looks
like we're holding the rq lock on all swqueue paths, so we can just use
a raw spinlock. I'll make that change for v2.
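
For reference, a minimal sketch of what that v2 change would look like
(untested, and only showing one of the helpers):

struct swqueue {
	struct list_head list;
	raw_spinlock_t lock;
} ____cacheline_aligned;

static struct task_struct *swqueue_pull_task(struct swqueue *swqueue)
{
	struct task_struct *p;

	/* Callers hold rq->lock with IRQs disabled, so no irqsave dance. */
	raw_spin_lock(&swqueue->lock);
	p = list_first_entry_or_null(&swqueue->list, struct task_struct,
				     swqueue_node);
	if (p)
		list_del_init(&p->swqueue_node);
	raw_spin_unlock(&swqueue->lock);

	return p;
}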

Regarding the per-swqueue spinlock being a potential bottleneck, I'll
reply to Aaron's thread on [0] with some numbers I collected locally on
a 26 core / 52 thread Cooperlake host, and a 20/40 x 2 Skylake. The
TL;DR is that I'm not observing the spinlock being contended on either
netperf or kernel compile workloads, with swqueue actually performing
~1 - 2% better than non-swqueue on both of these hosts for netperf.

[0]: https://lore.kernel.org/all/20230614043529.GA1942@ziqianlu-dell/

Thanks,
David
  
David Vernet June 15, 2023, 12:01 a.m. UTC | #6
On Wed, Jun 14, 2023 at 12:35:29PM +0800, Aaron Lu wrote:
> On Tue, Jun 13, 2023 at 10:32:03AM +0200, Peter Zijlstra wrote:
> > 
> > Still gotta read it properly, however:
> > 
> > On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> > > Single-socket | 32-core | 2-CCX | AMD 7950X Zen4
> > > Single-socket | 72-core | 6-CCX | AMD Milan Zen3
> > > Single-socket | 176-core | 11-CCX | 2-CCX per CCD | AMD Bergamo Zen4c
> > 
> > Could you please also benchmark on something Intel that has these stupid
> > large LLCs ?

Sure, please see below.

> > Because the last time I tried something like this, it came apart real
> > quick. And AMD has these relatively small 8-core LLCs.
> 
> I tested on Intel(R) Xeon(R) Platinum 8358, which has 2 sockets and each
> socket has a single LLC with 32 cores/64threads.
> 
> When running netperf with nr_thread=128, runtime=60:
> 
> "
> netserver -4
> 
> for i in `seq $nr_threads`; do
> 	netperf -4 -H 127.0.0.1 -t UDP_RR -c -C -l $runtime &
> done
> 
> wait
> "
> 
> The lock contention due to the per-LLC swqueue->lock is quite serious:
> 
>     83.39%    83.33%  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath                                  -      -            
>             |          
>             |--42.86%--__libc_sendto
>             |          entry_SYSCALL_64
>             |          do_syscall_64
>             |          |          
>             |           --42.81%--__x64_sys_sendto
>             |                     __sys_sendto
>             |                     sock_sendmsg
>             |                     inet_sendmsg
>             |                     udp_sendmsg
>             |                     udp_send_skb
>             |                     ip_send_skb
>             |                     ip_output
>             |                     ip_finish_output
>             |                     __ip_finish_output
>             |                     ip_finish_output2
>             |                     __dev_queue_xmit
>             |                     __local_bh_enable_ip
>             |                     do_softirq.part.0
>             |                     __do_softirq
>             |                     |          
>             |                      --42.81%--net_rx_action
>             |                                __napi_poll
>             |                                process_backlog
>             |                                __netif_receive_skb
>             |                                __netif_receive_skb_one_core
>             |                                ip_rcv
>             |                                ip_local_deliver
>             |                                ip_local_deliver_finish
>             |                                ip_protocol_deliver_rcu
>             |                                udp_rcv
>             |                                __udp4_lib_rcv
>             |                                udp_unicast_rcv_skb
>             |                                udp_queue_rcv_skb
>             |                                udp_queue_rcv_one_skb
>             |                                __udp_enqueue_schedule_skb
>             |                                sock_def_readable
>             |                                __wake_up_sync_key
>             |                                __wake_up_common_lock
>             |                                |          
>             |                                 --42.81%--__wake_up_common
>             |                                           receiver_wake_function
>             |                                           autoremove_wake_function
>             |                                           default_wake_function
>             |                                           try_to_wake_up
>             |                                           ttwu_do_activate
>             |                                           enqueue_task
>             |                                           enqueue_task_fair
>             |                                           _raw_spin_lock_irqsave
>             |                                           |          
>             |                                            --42.81%--native_queued_spin_lock_slowpath
>             |          
>             |--20.39%--0
>             |          __libc_recvfrom
>             |          entry_SYSCALL_64
>             |          do_syscall_64
>             |          __x64_sys_recvfrom
>             |          __sys_recvfrom
>             |          sock_recvmsg
>             |          inet_recvmsg
>             |          udp_recvmsg
>             |          __skb_recv_udp
>             |          __skb_wait_for_more_packets
>             |          schedule_timeout
>             |          schedule
>             |          __schedule
>             |          pick_next_task_fair
>             |          |          
>             |           --20.39%--swqueue_remove_task
>             |                     _raw_spin_lock_irqsave
>             |                     |          
>             |                      --20.39%--native_queued_spin_lock_slowpath
>             |          
>              --20.07%--__libc_recvfrom
>                        entry_SYSCALL_64
>                        do_syscall_64
>                        __x64_sys_recvfrom
>                        __sys_recvfrom
>                        sock_recvmsg
>                        inet_recvmsg
>                        udp_recvmsg
>                        __skb_recv_udp
>                        __skb_wait_for_more_packets
>                        schedule_timeout
>                        schedule
>                        __schedule
>                        |          
>                         --20.06%--pick_next_task_fair
>                                   swqueue_remove_task
>                                   _raw_spin_lock_irqsave
>                                   |          
>                                    --20.06%--native_queued_spin_lock_slowpath
> 
> I suppose that is because there are too many CPUs in a single LLC on
> this machine and when all these CPUs try to queue/pull tasks in this
> per-LLC shared wakequeue, it just doesn't scale well.

Hi Aaron,

Thanks for taking a look and running some benchmarks. I tried to
reproduce your results on a 26 core / 52 thread Cooperlake host, and a
20 core / 40 thread x 2 Skylake host, but I wasn't able to observe the
contention on the per-swqueue spinlock you saw on your Ice Lake.

I ran the following netperf benchmarks:

- netperf -l 60 -n $(nproc) -6
- netperf -l 60 -n $(nproc) -6 -t UDP_RR

Here are the results I'm seeing from taking the average of running those
benchmarks three times on each host (all results are from the "Throughput"
column of the below table).

Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput <<<
bytes  bytes   bytes    secs.    10^6bits/sec


26 / 52 x 1 Cooperlake
----------------------

Default workload:

NO_SWQUEUE: 34103.76
SWQUEUE:    34707.46
Delta: +1.77%

UDP_RR:

NO_SWQUEUE: 57695.29
SWQUEUE: 54827.36
Delta: -4.97%

There's clearly a statistically significant decline in performance for
UDP_RR here, but surprisingly, I don't see any contention on swqueue
locks when I profile with perf. Rather, we seem to be contending on the
rq lock (presumably in swqueue_pick_next_task()), which is unexpected (to
me, at least):

     7.81%  netperf  [kernel.vmlinux]                                           [k] queued_spin_lock_slowpath
            |          
             --7.81%--queued_spin_lock_slowpath
                       |          
                        --7.55%--task_rq_lock
                                  pick_next_task_fair
                                  schedule
                                  schedule_timeout
                                  __skb_wait_for_more_packets
                                  __skb_recv_udp
                                  udpv6_recvmsg
                                  inet6_recvmsg
                                  __x64_sys_recvfrom
                                  do_syscall_64
                                  entry_SYSCALL_64
                                  __libc_recvfrom
                                  0x7f923fdfacf0
                                  0
     4.97%  netperf  [kernel.vmlinux]                                           [k] _raw_spin_lock_irqsave
            |          
             --4.96%--_raw_spin_lock_irqsave
                       |          
                       |--1.13%--prepare_to_wait_exclusive
                       |          __skb_wait_for_more_packets
                       |          __skb_recv_udp
                       |          udpv6_recvmsg
                       |          inet6_recvmsg
                       |          __x64_sys_recvfrom
                       |          do_syscall_64
                       |          entry_SYSCALL_64
                       |          __libc_recvfrom
                       |          0x7f923fdfacf0
                       |          0
                       |          
                       |--0.92%--default_wake_function
                       |          autoremove_wake_function
                       |          __wake_up_sync_key
                       |          sock_def_readable
                       |          __udp_enqueue_schedule_skb
                       |          udpv6_queue_rcv_one_skb
                       |          __udp6_lib_rcv
                       |          ip6_input
                       |          ipv6_rcv
                       |          process_backlog
                       |          net_rx_action
                       |          __softirqentry_text_start
                       |          __local_bh_enable_ip
                       |          ip6_output
                       |          ip6_local_out
                       |          ip6_send_skb
                       |          udp_v6_send_skb
                       |          udpv6_sendmsg
                       |          __sys_sendto
                       |          __x64_sys_sendto
                       |          do_syscall_64
                       |          entry_SYSCALL_64
                       |          __libc_sendto
                       |          
                       |--0.81%--__wake_up_sync_key
                       |          sock_def_readable
                       |          __udp_enqueue_schedule_skb
                       |          udpv6_queue_rcv_one_skb
                       |          __udp6_lib_rcv
                       |          ip6_input
                       |          ipv6_rcv
                       |          process_backlog
                       |          net_rx_action
                       |          __softirqentry_text_start
                       |          __local_bh_enable_ip
                       |          ip6_output
                       |          ip6_local_out
                       |          ip6_send_skb
                       |          udp_v6_send_skb
                       |          udpv6_sendmsg
                       |          __sys_sendto
                       |          __x64_sys_sendto
                       |          do_syscall_64
                       |          entry_SYSCALL_64
                       |          __libc_sendto
                       |          
                       |--0.73%--enqueue_task_fair
                       |          |          
                       |           --0.72%--default_wake_function
                       |                     autoremove_wake_function
                       |                     __wake_up_sync_key
                       |                     sock_def_readable
                       |                     __udp_enqueue_schedule_skb
                       |                     udpv6_queue_rcv_one_skb
                       |                     __udp6_lib_rcv
                       |                     ip6_input
                       |                     ipv6_rcv
                       |                     process_backlog
                       |                     net_rx_action
                       |                     __softirqentry_text_start
                       |                     __local_bh_enable_ip
                       |                     ip6_output
                       |                     ip6_local_out
                       |                     ip6_send_skb
                       |                     udp_v6_send_skb
                       |                     udpv6_sendmsg
                       |                     __sys_sendto
                       |                     __x64_sys_sendto
                       |                     do_syscall_64
                       |                     entry_SYSCALL_64
                       |                     __libc_sendto
                       |          
                        --0.68%--task_rq_lock
                                  pick_next_task_fair
                                  schedule
                                  schedule_timeout
                                  __skb_wait_for_more_packets
                                  __skb_recv_udp
                                  udpv6_recvmsg
                                  inet6_recvmsg
                                  __x64_sys_recvfrom
                                  do_syscall_64
                                  entry_SYSCALL_64
                                  __libc_recvfrom
                                  0x7f923fdfacf0
                                  0

The profile without swqueue doesn't show any contention on the rq lock,
but does show us spending a good amount of time in the scheduler:

     4.03%  netperf  [kernel.vmlinux]                                           [k] default_wake_function
            |          
             --3.98%--default_wake_function
                       autoremove_wake_function
                       __wake_up_sync_key
                       sock_def_readable
                       __udp_enqueue_schedule_skb
                       udpv6_queue_rcv_one_skb
                       __udp6_lib_rcv
                       ip6_input
                       ipv6_rcv
                       process_backlog
                       net_rx_action
                       __softirqentry_text_start
                       __local_bh_enable_ip
                       ip6_output
                       ip6_local_out
                       ip6_send_skb
                       udp_v6_send_skb
                       udpv6_sendmsg
                       __sys_sendto
                       __x64_sys_sendto
                       do_syscall_64
                       entry_SYSCALL_64
                       __libc_sendto
     3.70%  netperf  [kernel.vmlinux]                                           [k] enqueue_entity
            |          
             --3.66%--enqueue_entity
                       enqueue_task_fair
                       |          
                        --3.66%--default_wake_function
                                  autoremove_wake_function
                                  __wake_up_sync_key
                                  sock_def_readable
                                  __udp_enqueue_schedule_skb
                                  udpv6_queue_rcv_one_skb
                                  __udp6_lib_rcv
                                  ip6_input
                                  ipv6_rcv
                                  process_backlog
                                  net_rx_action
                                  __softirqentry_text_start
                                  __local_bh_enable_ip
                                  ip6_output
                                  ip6_local_out
                                  ip6_send_skb
                                  udp_v6_send_skb
                                  udpv6_sendmsg
                                  __sys_sendto
                                  __x64_sys_sendto
                                  do_syscall_64
                                  entry_SYSCALL_64
                                  __libc_sendto

There's clearly a measurable impact on performance here due to swqueue
(negative for UDP_RR, positive for the default workload), but this looks
quite different than what you were observing.

20 / 40 x 2 Skylake
-------------------

Default workload:

NO_SWQUEUE: 57437.45
SWQUEUE: 58801.11
Delta: +2.37%

UDP_RR:

NO_SWQUEUE: 75932.28
SWQUEUE: 75232.77
Delta: -.92%

Same observation here. I didn't collect a profile, but the trend seems
consistent, and there's clearly a tradeoff. Despite the small drop in
perf for UDP_RR, it seems quite a bit less drastic than what would be
expected with the contention you showed in your profile.

7950X (8 cores / 16 threads per CCX, 2 CCXs)
--------------------------------------------

Default workload:

NO_SWQUEUE: 77615.08
SWQUEUE: 77465.73
Delta: -.19%

UDP_RR:

NO_SWQUEUE: 230258.75
SWQUEUE: 230280.34
Delta: ~0%

I'd call this essentially a no-op.

Milan (6 cores / 12 threads per CCX, 6 CCXs)
--------------------------------------------

Default workload:

NO_SWQUEUE: 23571.39
SWQUEUE: 23791.47
Delta: +.93%

UDP_RR:

NO_SWQUEUE: 40954.66
SWQUEUE: 40385.08
Delta: -1.39%

Same story here as above. Interesting to note though that Milan has a
small number of cores per LLC. I collected a profile, but I'm not sure
how useful it is given that so much of the profile seems to be indicting
perf as the bottleneck:

    24.93%           664  netperf  [kernel.vmlinux]                                           [k] x86_pmu_disable_all
            |
            ---x86_pmu_disable_all
               amd_pmu_disable_all
               |          
               |--23.87%--__perf_event_task_sched_out
               |          schedule
               |          |          
               |           --23.84%--schedule_timeout
               |                     __skb_wait_for_more_packets
               |                     __skb_recv_udp
               |                     udpv6_recvmsg
               |                     inet6_recvmsg
               |                     __x64_sys_recvfrom
               |                     do_syscall_64
               |                     entry_SYSCALL_64
               |                     __libc_recvfrom
               |                     0x7fbfa69facf0
               |                     0
               |          
                --0.62%--perf_mux_hrtimer_handler
                          hrtimer_interrupt
                          __sysvec_apic_timer_interrupt
                          sysvec_apic_timer_interrupt
                          asm_sysvec_apic_timer_interrupt
     5.33%           135  netperf  [kernel.vmlinux]                                           [k] amd_pmu_test_overflow_topbit
            |
            ---amd_pmu_test_overflow_topbit
               amd_pmu_check_overflow
               |          
                --5.18%--__perf_event_task_sched_out
                          schedule
                          schedule_timeout
                          __skb_wait_for_more_packets
                          __skb_recv_udp
                          udpv6_recvmsg
                          inet6_recvmsg
                          __x64_sys_recvfrom
                          do_syscall_64
                          entry_SYSCALL_64
                          __libc_recvfrom
                          0x7fbfa69facf0
                          0
     3.73%           102  netperf  [kernel.vmlinux]                                           [k] native_sched_clock
            |
            ---native_sched_clock
               |          
               |--1.18%--sched_clock_cpu
               |          |          
               |           --0.75%--irqtime_account_irq
               |                     |          
               |                      --0.71%--__softirqentry_text_start
               |                                __local_bh_enable_ip
               |                                ip6_output
               |                                ip6_local_out
               |                                ip6_send_skb
               |                                udp_v6_send_skb
               |                                udpv6_sendmsg
               |                                __sys_sendto
               |                                __x64_sys_sendto
               |                                do_syscall_64
               |                                entry_SYSCALL_64
               |                                __libc_sendto
               |          
               |--1.18%--vtime_user_exit
               |          __context_tracking_exit
               |          syscall_enter_from_user_mode
               |          do_syscall_64
               |          entry_SYSCALL_64
               |          |          
               |          |--0.61%--__libc_recvfrom
               |          |          0x7fbfa69facf0
               |          |          0
               |          |          
               |           --0.57%--__libc_sendto
               |          
                --0.51%--vtime_user_enter
                          __context_tracking_enter
                          syscall_exit_to_user_mode
                          do_syscall_64
                          entry_SYSCALL_64
     3.53%           107  netperf  [kernel.vmlinux]                                           [k] copy_user_generic_string
            |
            ---copy_user_generic_string
               |          
               |--1.45%--_copy_to_iter
               |          udpv6_recvmsg
               |          inet6_recvmsg
               |          __x64_sys_recvfrom
               |          do_syscall_64
               |          entry_SYSCALL_64
               |          __libc_recvfrom
               |          0x7fbfa69facf0
               |          0
               |          
               |--0.91%--_copy_to_user
               |          move_addr_to_user
               |          __x64_sys_recvfrom
               |          do_syscall_64
               |          entry_SYSCALL_64
               |          __libc_recvfrom
               |          0x7fbfa69facf0
               |          0
               |          
                --0.73%--_copy_from_user
                          __sys_sendto
                          __x64_sys_sendto
                          do_syscall_64
                          entry_SYSCALL_64
                          __libc_sendto
     2.98%            79  netperf  [kernel.vmlinux]                                           [k] default_wake_function
            |
            ---default_wake_function
               autoremove_wake_function
               __wake_up_sync_key
               sock_def_readable
               __udp_enqueue_schedule_skb
               udpv6_queue_rcv_one_skb
               __udp6_lib_rcv
               ip6_input
               ipv6_rcv
               process_backlog
               net_rx_action
               __softirqentry_text_start
               __local_bh_enable_ip
               ip6_output
               ip6_local_out
               ip6_send_skb
               udp_v6_send_skb
               udpv6_sendmsg
               __sys_sendto
               __x64_sys_sendto
               do_syscall_64
               entry_SYSCALL_64
               __libc_sendto

In any case, as expected, the swqueue lock doesn't seem to be causing
any issues here.


I was wondering if the contention you were seeing could be due to
something like TTWU_QUEUE (though I wouldn't have expected that because
we don't use the swqueue if a task is migrated following
select_task_rq()), so I also tried collecting data with NO_TTWU_QUEUE
(only on the 2-socket SKL), but the results matched what I observed
without it:

Default workload:

NO_SWQUEUE: 57552.69
SWQUEUE: 58320.29
Delta: +1.33%

UDP_RR:

NO_SWQUEUE: 76296.46
SWQUEUE: 75418.35
Delta: -1.15%


I also tried running kernel compile workloads just to verify they didn't
regress per Peter's request:


26 / 52 x 1 Cooperlake
----------------------

NO_SWQUEUE: 1187.72s
SWQUEUE: 1185.39s
Delta: +.20%


20 / 40 x 2 Skylake
-------------------

NO_SWQUEUE: 418.15s
SWQUEUE: 418.87s
Delta: -.17%

I'd call this essentially a no-op for kernel compile.


In general, the results indicate some wins and some losses.

For kernel compile, larger LLCs with more cores were a no-op. For
netperf, the default workload had some gains, and for UDP_RR, some
losses.

With all that said, I have a few thoughts.

Firstly, would you mind please sharing your .config? It's possible that
the hosts I'm testing on just don't have big enough LLCs to observe the
contention you're seeing on the swqueue spinlock, as my 26 / 52 CPL host
is smaller than a single socket of your 32/64 ICL host. On the other
hand, your ICL isn't _that_ much bigger per-LLC, so I'd be curious to
see if there's a .config difference here that's causing the contention.
Also, the fact that Milan (which has only 6 cores / 12 threads per LLC)
also saw a performance hit with swqueue for UDP_RR suggests to me that
the issue with UDP_RR is not the scalability of the per-LLC swqueue
spinlock.

Regardless, even if there are some SKUs that have so many cores per LLC
that we do end up contending, I don't think that should preclude adding
a feature which does benefit plenty of other configurations that
have fewer cores per LLC.

Thanks,
David
  
Aaron Lu June 15, 2023, 4:49 a.m. UTC | #7
Hi David,

On Wed, Jun 14, 2023 at 07:01:03PM -0500, David Vernet wrote:
> Hi Aaron,
> 
> Thanks for taking a look and running some benchmarks. I tried to
> reproduce your results on a 26 core / 52 thread Cooperlake host, and a
> 20 core / 40 thread x 2 Skylake host, but I wasn't able to observe the
> contention on the per-swqueue spinlock you saw on your Ice Lake.
> 
> I ran the following netperf benchmarks:
> 
> - netperf -l 60 -n $(nproc) -6
> - netperf -l 60 -n $(nproc) -6 -t UDP_RR

Just to confirm: did you run multiple of the above or just one?

> 
> Here are the results I'm seeing from taking the average of running those
> benchmarks three times on each host (all results are from the "Throughput"
> column of the below table).
> 
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput <<<
> bytes  bytes   bytes    secs.    10^6bits/sec
> 
> 
> 26 / 52 x 1 Cooperlake
> ----------------------
> 
> Default workload:
> 
> NO_SWQUEUE: 34103.76
> SWQUEUE:    34707.46
> Delta: +1.77%
> 
> UDP_RR:
> 
> NO_SWQUEUE: 57695.29
> SWQUEUE: 54827.36
> Delta: -4.97%
> 
> There's clearly a statistically significant decline in performance for
> UDP_RR here, but surprisingly, I don't see any contention on swqueue
> locks when I profile with perf. Rather, we seem to be contending on the
> rq lock (presumably in swqueue_pick_next_task()), which is unexpected (to
> me, at least):

I don't see rq lock contention though and since you have seen some
contention here, I suppose you should have started multiple instances of
netperf client?

> 
>      7.81%  netperf  [kernel.vmlinux]                                           [k] queued_spin_lock_slowpath
>             |          
>              --7.81%--queued_spin_lock_slowpath
>                        |          
>                         --7.55%--task_rq_lock
>                                   pick_next_task_fair
>                                   schedule
>                                   schedule_timeout
>                                   __skb_wait_for_more_packets
>                                   __skb_recv_udp
>                                   udpv6_recvmsg
>                                   inet6_recvmsg
>                                   __x64_sys_recvfrom
>                                   do_syscall_64
>                                   entry_SYSCALL_64
>                                   __libc_recvfrom
>                                   0x7f923fdfacf0
>                                   0
>      4.97%  netperf  [kernel.vmlinux]                                           [k] _raw_spin_lock_irqsave
>             |          
>              --4.96%--_raw_spin_lock_irqsave
>                        |          
>                        |--1.13%--prepare_to_wait_exclusive
>                        |          __skb_wait_for_more_packets
>                        |          __skb_recv_udp
>                        |          udpv6_recvmsg
>                        |          inet6_recvmsg
>                        |          __x64_sys_recvfrom
>                        |          do_syscall_64
>                        |          entry_SYSCALL_64
>                        |          __libc_recvfrom
>                        |          0x7f923fdfacf0
>                        |          0
>                        |          
>                        |--0.92%--default_wake_function
>                        |          autoremove_wake_function
>                        |          __wake_up_sync_key
>                        |          sock_def_readable
>                        |          __udp_enqueue_schedule_skb
>                        |          udpv6_queue_rcv_one_skb
>                        |          __udp6_lib_rcv
>                        |          ip6_input
>                        |          ipv6_rcv
>                        |          process_backlog
>                        |          net_rx_action
>                        |          __softirqentry_text_start
>                        |          __local_bh_enable_ip
>                        |          ip6_output
>                        |          ip6_local_out
>                        |          ip6_send_skb
>                        |          udp_v6_send_skb
>                        |          udpv6_sendmsg
>                        |          __sys_sendto
>                        |          __x64_sys_sendto
>                        |          do_syscall_64
>                        |          entry_SYSCALL_64
>                        |          __libc_sendto
>                        |          
>                        |--0.81%--__wake_up_sync_key
>                        |          sock_def_readable
>                        |          __udp_enqueue_schedule_skb
>                        |          udpv6_queue_rcv_one_skb
>                        |          __udp6_lib_rcv
>                        |          ip6_input
>                        |          ipv6_rcv
>                        |          process_backlog
>                        |          net_rx_action
>                        |          __softirqentry_text_start
>                        |          __local_bh_enable_ip
>                        |          ip6_output
>                        |          ip6_local_out
>                        |          ip6_send_skb
>                        |          udp_v6_send_skb
>                        |          udpv6_sendmsg
>                        |          __sys_sendto
>                        |          __x64_sys_sendto
>                        |          do_syscall_64
>                        |          entry_SYSCALL_64
>                        |          __libc_sendto
>                        |          
>                        |--0.73%--enqueue_task_fair
>                        |          |          
>                        |           --0.72%--default_wake_function
>                        |                     autoremove_wake_function
>                        |                     __wake_up_sync_key
>                        |                     sock_def_readable
>                        |                     __udp_enqueue_schedule_skb
>                        |                     udpv6_queue_rcv_one_skb
>                        |                     __udp6_lib_rcv
>                        |                     ip6_input
>                        |                     ipv6_rcv
>                        |                     process_backlog
>                        |                     net_rx_action
>                        |                     __softirqentry_text_start
>                        |                     __local_bh_enable_ip
>                        |                     ip6_output
>                        |                     ip6_local_out
>                        |                     ip6_send_skb
>                        |                     udp_v6_send_skb
>                        |                     udpv6_sendmsg
>                        |                     __sys_sendto
>                        |                     __x64_sys_sendto
>                        |                     do_syscall_64
>                        |                     entry_SYSCALL_64
>                        |                     __libc_sendto
>                        |          
>                         --0.68%--task_rq_lock
>                                   pick_next_task_fair
>                                   schedule
>                                   schedule_timeout
>                                   __skb_wait_for_more_packets
>                                   __skb_recv_udp
>                                   udpv6_recvmsg
>                                   inet6_recvmsg
>                                   __x64_sys_recvfrom
>                                   do_syscall_64
>                                   entry_SYSCALL_64
>                                   __libc_recvfrom
>                                   0x7f923fdfacf0
>                                   0
> 
> The profile without swqueue doesn't show any contention on the rq lock,
> but does show us spending a good amount of time in the scheduler:
> 
>      4.03%  netperf  [kernel.vmlinux]                                           [k] default_wake_function
>             |          
>              --3.98%--default_wake_function
>                        autoremove_wake_function
>                        __wake_up_sync_key
>                        sock_def_readable
>                        __udp_enqueue_schedule_skb
>                        udpv6_queue_rcv_one_skb
>                        __udp6_lib_rcv
>                        ip6_input
>                        ipv6_rcv
>                        process_backlog
>                        net_rx_action
>                        __softirqentry_text_start
>                        __local_bh_enable_ip
>                        ip6_output
>                        ip6_local_out
>                        ip6_send_skb
>                        udp_v6_send_skb
>                        udpv6_sendmsg
>                        __sys_sendto
>                        __x64_sys_sendto
>                        do_syscall_64
>                        entry_SYSCALL_64
>                        __libc_sendto
>      3.70%  netperf  [kernel.vmlinux]                                           [k] enqueue_entity
>             |          
>              --3.66%--enqueue_entity
>                        enqueue_task_fair
>                        |          
>                         --3.66%--default_wake_function
>                                   autoremove_wake_function
>                                   __wake_up_sync_key
>                                   sock_def_readable
>                                   __udp_enqueue_schedule_skb
>                                   udpv6_queue_rcv_one_skb
>                                   __udp6_lib_rcv
>                                   ip6_input
>                                   ipv6_rcv
>                                   process_backlog
>                                   net_rx_action
>                                   __softirqentry_text_start
>                                   __local_bh_enable_ip
>                                   ip6_output
>                                   ip6_local_out
>                                   ip6_send_skb
>                                   udp_v6_send_skb
>                                   udpv6_sendmsg
>                                   __sys_sendto
>                                   __x64_sys_sendto
>                                   do_syscall_64
>                                   entry_SYSCALL_64
>                                   __libc_sendto
> 
> There's clearly a measurable impact on performance here due to swqueue
> (negative for UDP_RR, positive for the default workload), but this looks
> quite different than what you were observing.

Yes.

> 
> 20 / 40 x 2 Skylake
> -------------------
> 
> Default workload:
> 
> NO_SWQUEUE: 57437.45
> SWQUEUE: 58801.11
> Delta: +2.37%
> 
> UDP_RR:
> 
> NO_SWQUEUE: 75932.28
> SWQUEUE: 75232.77
> Delta: -.92%
> 
> Same observation here. I didn't collect a profile, but the trend seems
> consistent, and there's clearly a tradeoff. Despite the small drop in
> perf for UDP_RR, it seems quite a bit less drastic than what would be
> expected with the contention you showed in your profile.

Indeed.

> 
> 7950X (8 cores / 16 threads per CCX, 2 CCXs)
> --------------------------------------------
> 
> Default workload:
> 
> NO_SWQUEUE: 77615.08
> SWQUEUE: 77465.73
> Delta: -.19%
> 
> UDP_RR:
> 
> NO_SWQUEUE: 230258.75
> SWQUEUE: 230280.34
> Delta: ~0%
> 
> I'd call this essentially a no-op.
> 
 
> With all that said, I have a few thoughts.
> 
> Firstly, would you mind please sharing your .config? It's possible that
> the hosts I'm testing on just don't have big enough LLCs to observe the
> contention you're seeing on the swqueue spinlock, as my 26 / 52 SKL host
> is smaller than a single socket of your 32/64 ICL host. On the other
> hand, your ICL isn't _that_ much bigger per-LLC, so I'd be curious to

Agreed. Your 26 cores/52 threads isn't that much smaller than mine and
I had expected to see something similar.

> see if there's a .config difference here that's causing the contention.

Attached my .config.
I also pushed the branch I used for testing to github just in case you
want to take a look: https://github.com/aaronlu/linux.git swqueue

> Also, the fact that Milan (which has only 6 cores / 12 threads per LLC)
> also saw a performance hit with swqueue for UDP_RR suggests to me that
> the issue with UDP_RR is not the scalability of the per-LLC swqueue
> spinlock.

I've tested again today and I still saw serious contention on
swqueue->lock with your cmdline. I did it this way:
"
$ netserver                                                                                      
Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC

$ for i in `seq 128`; do netperf -l 60 -n 128 -6 -t UDP_RR & done
"
And the profile is about the same as my last posting:

    83.23%    83.18%  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath            -      -            
            |          
            |--40.23%--sendto
            |          entry_SYSCALL_64
            |          do_syscall_64
            |          |          
            |           --40.22%--__x64_sys_sendto
            |                     __sys_sendto
            |                     sock_sendmsg
            |                     inet6_sendmsg
            |                     udpv6_sendmsg
            |                     udp_v6_send_skb
            |                     ip6_send_skb
            |                     ip6_local_out
            |                     ip6_output
            |                     ip6_finish_output
            |                     ip6_finish_output2
            |                     __dev_queue_xmit
            |                     __local_bh_enable_ip
            |                     do_softirq.part.0
            |                     __do_softirq
            |                     net_rx_action
            |                     __napi_poll
            |                     process_backlog
            |                     __netif_receive_skb
            |                     __netif_receive_skb_one_core
            |                     ipv6_rcv
            |                     ip6_input
            |                     ip6_input_finish
            |                     ip6_protocol_deliver_rcu
            |                     udpv6_rcv
            |                     __udp6_lib_rcv
            |                     udp6_unicast_rcv_skb
            |                     udpv6_queue_rcv_skb
            |                     udpv6_queue_rcv_one_skb
            |                     __udp_enqueue_schedule_skb
            |                     sock_def_readable
            |                     __wake_up_sync_key
            |                     __wake_up_common_lock
            |                     |          
            |                      --40.22%--__wake_up_common
            |                                receiver_wake_function
            |                                autoremove_wake_function
            |                                default_wake_function
            |                                try_to_wake_up
            |                                ttwu_do_activate
            |                                enqueue_task
            |                                enqueue_task_fair
            |                                |          
            |                                 --40.22%--_raw_spin_lock_irqsave
            |                                           |          
            |                                            --40.21%--native_queued_spin_lock_slowpath
            |          
            |--38.59%--recvfrom
            |          |          
            |           --38.59%--entry_SYSCALL_64
            |                     do_syscall_64
            |                     __x64_sys_recvfrom
            |                     __sys_recvfrom
            |                     sock_recvmsg
            |                     inet6_recvmsg
            |                     udpv6_recvmsg
            |                     __skb_recv_udp
            |                     __skb_wait_for_more_packets
            |                     schedule_timeout
            |                     schedule
            |                     __schedule
            |                     |          
            |                      --38.59%--pick_next_task_fair
            |                                |          
            |                                 --38.59%--swqueue_remove_task
            |                                           |          
            |                                            --38.59%--_raw_spin_lock_irqsave
            |                                                      |          
            |                                                       --38.58%--native_queued_spin_lock_slowpath
            |          
            |--2.25%--sendto
            |          entry_SYSCALL_64
            |          do_syscall_64
            |          |          
            |           --2.25%--__x64_sys_sendto
            |                     __sys_sendto
            |                     sock_sendmsg
            |                     inet6_sendmsg
            |                     udpv6_sendmsg
            |                     udp_v6_send_skb
            |                     ip6_send_skb
            |                     ip6_local_out
            |                     ip6_output
            |                     ip6_finish_output
            |                     ip6_finish_output2
            |                     __dev_queue_xmit
            |                     __local_bh_enable_ip
            |                     do_softirq.part.0
            |                     __do_softirq
            |                     net_rx_action
            |                     __napi_poll
            |                     process_backlog
            |                     __netif_receive_skb
            |                     __netif_receive_skb_one_core
            |                     ipv6_rcv
            |                     ip6_input
            |                     ip6_input_finish
            |                     ip6_protocol_deliver_rcu
            |                     udpv6_rcv
            |                     __udp6_lib_rcv
            |                     udp6_unicast_rcv_skb
            |                     udpv6_queue_rcv_skb
            |                     udpv6_queue_rcv_one_skb
            |                     __udp_enqueue_schedule_skb
            |                     sock_def_readable
            |                     __wake_up_sync_key
            |                     __wake_up_common_lock
            |                     |          
            |                      --2.25%--__wake_up_common
            |                                receiver_wake_function
            |                                autoremove_wake_function
            |                                default_wake_function
            |                                try_to_wake_up
            |                                ttwu_do_activate
            |                                enqueue_task
            |                                enqueue_task_fair
            |                                _raw_spin_lock_irqsave
            |                                |          
            |                                 --2.25%--native_queued_spin_lock_slowpath
            |          
             --2.10%--recvfrom
                       entry_SYSCALL_64
                       do_syscall_64
                       __x64_sys_recvfrom
                       __sys_recvfrom
                       sock_recvmsg
                       inet6_recvmsg
                       udpv6_recvmsg
                       __skb_recv_udp
                       __skb_wait_for_more_packets
                       schedule_timeout
                       schedule
                       __schedule
                       |          
                        --2.09%--pick_next_task_fair
                                  swqueue_remove_task
                                  _raw_spin_lock_irqsave
                                  |          
                                   --2.09%--native_queued_spin_lock_slowpath

During the test, vmstat showed lines like this:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free     buff  cache     si   so    bi    bo   in     cs   us sy id wa st
185 0      0 128555784  47348 1791244    0    0     0     0  32387 4380484  1 98  0  0  0

When swqueue is disabled, vmstat showed lines like this and there is no
lock contention:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free     buff  cache     si   so    bi    bo   in     cs   us sy id wa st
116 0      0 128977136  58324 1338760    0    0     0    16 599677 9734395  5 72 23  0  0

There are a lot more runnable tasks when swqueue is in use, and sys%
also increased, perhaps due to the lock contention.

I'll see if I can find a smaller machine and give it a run there too.

Thanks,
Aaron
  
Aaron Lu June 15, 2023, 7:31 a.m. UTC | #8
On Thu, Jun 15, 2023 at 12:49:17PM +0800, Aaron Lu wrote:
> I'll see if I can find a smaller machine and give it a run there too.

Found a Skylake with 18 cores/36 threads per socket/LLC, and with
netperf the contention is still serious.

"
$ netserver
$ sudo sh -c "echo SWQUEUE > /sys/kernel/debug/sched/features"
$ for i in `seq 72`; do netperf -l 60 -n 72 -6 -t UDP_RR & done
"

        53.61%    53.61%  [kernel.vmlinux]            [k] native_queued_spin_lock_slowpath            -      -            
            |          
            |--27.93%--sendto
            |          entry_SYSCALL_64
            |          do_syscall_64
            |          |          
            |           --27.93%--__x64_sys_sendto
            |                     __sys_sendto
            |                     sock_sendmsg
            |                     inet6_sendmsg
            |                     udpv6_sendmsg
            |                     udp_v6_send_skb
            |                     ip6_send_skb
            |                     ip6_local_out
            |                     ip6_output
            |                     ip6_finish_output
            |                     ip6_finish_output2
            |                     __dev_queue_xmit
            |                     __local_bh_enable_ip
            |                     do_softirq.part.0
            |                     __do_softirq
            |                     net_rx_action
            |                     __napi_poll
            |                     process_backlog
            |                     __netif_receive_skb
            |                     __netif_receive_skb_one_core
            |                     ipv6_rcv
            |                     ip6_input
            |                     ip6_input_finish
            |                     ip6_protocol_deliver_rcu
            |                     udpv6_rcv
            |                     __udp6_lib_rcv
            |                     udp6_unicast_rcv_skb
            |                     udpv6_queue_rcv_skb
            |                     udpv6_queue_rcv_one_skb
            |                     __udp_enqueue_schedule_skb
            |                     sock_def_readable
            |                     __wake_up_sync_key
            |                     __wake_up_common_lock
            |                     |          
            |                      --27.85%--__wake_up_common
            |                                receiver_wake_function
            |                                autoremove_wake_function
            |                                default_wake_function
            |                                try_to_wake_up
            |                                |          
            |                                 --27.85%--ttwu_do_activate
            |                                           enqueue_task
            |                                           enqueue_task_fair
            |                                           |          
            |                                            --27.85%--_raw_spin_lock_irqsave
            |                                                      |          
            |                                                       --27.85%--native_queued_spin_lock_slowpath
            |          
             --25.67%--recvfrom
                       entry_SYSCALL_64
                       do_syscall_64
                       __x64_sys_recvfrom
                       __sys_recvfrom
                       sock_recvmsg
                       inet6_recvmsg
                       udpv6_recvmsg
                       __skb_recv_udp
                       |          
                        --25.67%--__skb_wait_for_more_packets
                                  schedule_timeout
                                  schedule
                                  __schedule
                                  |          
                                   --25.66%--pick_next_task_fair
                                             |          
                                              --25.65%--swqueue_remove_task
                                                        |          
                                                         --25.65%--_raw_spin_lock_irqsave
                                                                   |          
                                                                    --25.65%--native_queued_spin_lock_slowpath

I didn't aggregate the throughput (Trans. Rate per sec) from all these
clients, but a glance at the results showed that the throughput of
each client dropped from 4xxxx (NO_SWQUEUE) to 2xxxx (SWQUEUE).

Thanks,
Aaron
  
David Vernet June 15, 2023, 11:26 p.m. UTC | #9
On Thu, Jun 15, 2023 at 03:31:53PM +0800, Aaron Lu wrote:
> On Thu, Jun 15, 2023 at 12:49:17PM +0800, Aaron Lu wrote:
> > I'll see if I can find a smaller machine and give it a run there too.
> 
> Found a Skylake with 18cores/36threads on each socket/LLC and with
> netperf, the contention is still serious.
> 
> "
> $ netserver
> $ sudo sh -c "echo SWQUEUE > /sys/kernel/debug/sched/features"
> $ for i in `seq 72`; do netperf -l 60 -n 72 -6 -t UDP_RR & done
> "
> 
>         53.61%    53.61%  [kernel.vmlinux]            [k] native_queued_spin_lock_slowpath            -      -            
>             |          
>             |--27.93%--sendto
>             |          entry_SYSCALL_64
>             |          do_syscall_64
>             |          |          
>             |           --27.93%--__x64_sys_sendto
>             |                     __sys_sendto
>             |                     sock_sendmsg
>             |                     inet6_sendmsg
>             |                     udpv6_sendmsg
>             |                     udp_v6_send_skb
>             |                     ip6_send_skb
>             |                     ip6_local_out
>             |                     ip6_output
>             |                     ip6_finish_output
>             |                     ip6_finish_output2
>             |                     __dev_queue_xmit
>             |                     __local_bh_enable_ip
>             |                     do_softirq.part.0
>             |                     __do_softirq
>             |                     net_rx_action
>             |                     __napi_poll
>             |                     process_backlog
>             |                     __netif_receive_skb
>             |                     __netif_receive_skb_one_core
>             |                     ipv6_rcv
>             |                     ip6_input
>             |                     ip6_input_finish
>             |                     ip6_protocol_deliver_rcu
>             |                     udpv6_rcv
>             |                     __udp6_lib_rcv
>             |                     udp6_unicast_rcv_skb
>             |                     udpv6_queue_rcv_skb
>             |                     udpv6_queue_rcv_one_skb
>             |                     __udp_enqueue_schedule_skb
>             |                     sock_def_readable
>             |                     __wake_up_sync_key
>             |                     __wake_up_common_lock
>             |                     |          
>             |                      --27.85%--__wake_up_common
>             |                                receiver_wake_function
>             |                                autoremove_wake_function
>             |                                default_wake_function
>             |                                try_to_wake_up
>             |                                |          
>             |                                 --27.85%--ttwu_do_activate
>             |                                           enqueue_task
>             |                                           enqueue_task_fair
>             |                                           |          
>             |                                            --27.85%--_raw_spin_lock_irqsave
>             |                                                      |          
>             |                                                       --27.85%--native_queued_spin_lock_slowpath
>             |          
>              --25.67%--recvfrom
>                        entry_SYSCALL_64
>                        do_syscall_64
>                        __x64_sys_recvfrom
>                        __sys_recvfrom
>                        sock_recvmsg
>                        inet6_recvmsg
>                        udpv6_recvmsg
>                        __skb_recv_udp
>                        |          
>                         --25.67%--__skb_wait_for_more_packets
>                                   schedule_timeout
>                                   schedule
>                                   __schedule
>                                   |          
>                                    --25.66%--pick_next_task_fair
>                                              |          
>                                               --25.65%--swqueue_remove_task
>                                                         |          
>                                                          --25.65%--_raw_spin_lock_irqsave
>                                                                    |          
>                                                                     --25.65%--native_queued_spin_lock_slowpath
> 
> I didn't aggregate the throughput(Trans. Rate per sec) from all these
> clients, but a glimpse from the result showed that the throughput of
> each client dropped from 4xxxx(NO_SWQUEUE) to 2xxxx(SWQUEUE).
> 
> Thanks,
> Aaron

Ok, it seems that the issue is that I wasn't creating enough netperf
clients. I assumed that -n $(nproc) was sufficient. I was able to repro
the contention on my 26 core / 52 thread skylake client as well:


    41.01%  netperf          [kernel.vmlinux]                                                 [k] queued_spin_lock_slowpath
            |          
             --41.01%--queued_spin_lock_slowpath
                       |          
                        --40.63%--_raw_spin_lock_irqsave
                                  |          
                                  |--21.18%--enqueue_task_fair
                                  |          |          
                                  |           --21.09%--default_wake_function
                                  |                     |          
                                  |                      --21.09%--autoremove_wake_function
                                  |                                |          
                                  |                                 --21.09%--__wake_up_sync_key
                                  |                                           sock_def_readable
                                  |                                           __udp_enqueue_schedule_skb
                                  |                                           udpv6_queue_rcv_one_skb
                                  |                                           __udp6_lib_rcv
                                  |                                           ip6_input
                                  |                                           ipv6_rcv
                                  |                                           process_backlog
                                  |                                           net_rx_action
                                  |                                           |          
                                  |                                            --21.09%--__softirqentry_text_start
                                  |                                                      __local_bh_enable_ip
                                  |                                                      ip6_output
                                  |                                                      ip6_local_out
                                  |                                                      ip6_send_skb
                                  |                                                      udp_v6_send_skb
                                  |                                                      udpv6_sendmsg
                                  |                                                      __sys_sendto
                                  |                                                      __x64_sys_sendto
                                  |                                                      do_syscall_64
                                  |                                                      entry_SYSCALL_64
                                  |          
                                   --19.44%--swqueue_remove_task
                                             |          
                                              --19.42%--pick_next_task_fair
                                                        |          
                                                         --19.42%--schedule
                                                                   |          
                                                                    --19.21%--schedule_timeout
                                                                              __skb_wait_for_more_packets
                                                                              __skb_recv_udp
                                                                              udpv6_recvmsg
                                                                              inet6_recvmsg
                                                                              __x64_sys_recvfrom
                                                                              do_syscall_64
                                                                              entry_SYSCALL_64
    40.87%  netserver        [kernel.vmlinux]                                                 [k] queued_spin_lock_slowpath
            |          
             --40.87%--queued_spin_lock_slowpath
                       |          
                        --40.51%--_raw_spin_lock_irqsave
                                  |          
                                  |--21.03%--enqueue_task_fair
                                  |          |          
                                  |           --20.94%--default_wake_function
                                  |                     |          
                                  |                      --20.94%--autoremove_wake_function
                                  |                                |          
                                  |                                 --20.94%--__wake_up_sync_key
                                  |                                           sock_def_readable
                                  |                                           __udp_enqueue_schedule_skb
                                  |                                           udpv6_queue_rcv_one_skb
                                  |                                           __udp6_lib_rcv
                                  |                                           ip6_input
                                  |                                           ipv6_rcv
                                  |                                           process_backlog
                                  |                                           net_rx_action
                                  |                                           |          
                                  |                                            --20.94%--__softirqentry_text_start
                                  |                                                      __local_bh_enable_ip
                                  |                                                      ip6_output
                                  |                                                      ip6_local_out
                                  |                                                      ip6_send_skb
                                  |                                                      udp_v6_send_skb
                                  |                                                      udpv6_sendmsg
                                  |                                                      __sys_sendto
                                  |                                                      __x64_sys_sendto
                                  |                                                      do_syscall_64
                                  |                                                      entry_SYSCALL_64
                                  |          
                                   --19.48%--swqueue_remove_task
                                             |          
                                              --19.47%--pick_next_task_fair
                                                        schedule
                                                        |          
                                                         --19.38%--schedule_timeout
                                                                   __skb_wait_for_more_packets
                                                                   __skb_recv_udp
                                                                   udpv6_recvmsg
                                                                   inet6_recvmsg
                                                                   __x64_sys_recvfrom
                                                                   do_syscall_64
                                                                   entry_SYSCALL_64

Thanks for the help in getting the repro on my end.

So yes, there is certainly a scalability concern to bear in mind with
swqueue for LLCs with a lot of cores. If you have a lot of tasks that
e.g. block and wake quickly on futexes in a tight loop, I expect a
similar issue would be observed.
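
To illustrate the shape of workload I mean, here's a minimal futex
ping-pong (just a sketch I haven't benchmarked; running one copy per
pair of CPUs should hammer the wakeup / dequeue paths in much the same
way the UDP_RR ping-pong does):

/* futex-pong.c: two threads block and wake each other on a futex in a
 * tight loop. Build with: gcc -O2 -pthread futex-pong.c
 */
#define _GNU_SOURCE
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int ftx;	/* 0: main's turn, 1: pong's turn */

static void fwait(int val)
{
	/* Sleep only if the futex word still holds @val. */
	syscall(SYS_futex, &ftx, FUTEX_WAIT, val, NULL, NULL, 0);
}

static void fwake(void)
{
	syscall(SYS_futex, &ftx, FUTEX_WAKE, 1, NULL, NULL, 0);
}

static void *pong(void *arg)
{
	for (;;) {
		while (atomic_load(&ftx) != 1)
			fwait(0);
		atomic_store(&ftx, 0);
		fwake();
	}
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, pong, NULL);
	for (;;) {	/* run under timeout(1), or stop with ^C */
		atomic_store(&ftx, 1);
		fwake();
		while (atomic_load(&ftx) != 0)
			fwait(1);
	}
}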

On the other hand, the issue did not occur on my 7950X. I also wasn't
able to repro the contention on the Skylake if I ran with the default
netperf workload rather than UDP_RR (even with the additional clients).
I didn't bother to take the mean of all of the throughput results
between NO_SWQUEUE and SWQUEUE, but they looked roughly equal.

So swqueue isn't ideal for every configuration, but I'll echo my
sentiment from [0] that this shouldn't on its own necessarily preclude
it from being merged given that it does help a large class of
configurations and workloads, and it's disabled by default.

[0]: https://lore.kernel.org/all/20230615000103.GC2883716@maniforge/

Thanks,
David
  
Aaron Lu June 16, 2023, 12:53 a.m. UTC | #10
On Thu, Jun 15, 2023 at 06:26:05PM -0500, David Vernet wrote:
 
> Ok, it seems that the issue is that I wasn't creating enough netperf
> clients. I assumed that -n $(nproc) was sufficient. I was able to repro

Yes that switch is confusing.

> the contention on my 26 core / 52 thread skylake client as well:
> 
> 
 
> Thanks for the help in getting the repro on my end.

You are welcome.

> So yes, there is certainly a scalability concern to bear in mind for
> swqueue for LLCs with a lot of cores. If you have a lot of tasks quickly
> e.g. blocking and waking on futexes in a tight loop, I expect a similar
> issue would be observed.
> 
> On the other hand, the issue did not occur on my 7950X. I also wasn't

Using netperf/UDP_RR?

> able to repro the contention on the Skylake if I ran with the default
> netperf workload rather than UDP_RR (even with the additional clients).

I also tried that on the 18 cores/36 threads/LLC Skylake and the
contention is indeed much smaller than with UDP_RR:

     7.30%     7.29%  [kernel.vmlinux]      [k]      native_queued_spin_lock_slowpath

But I wouldn't say it's entirely gone. Also, consider that Skylake has
a lot fewer cores per LLC than later Intel servers like Icelake and
Sapphire Rapids; I expect things would be worse on those two machines.

> I didn't bother to take the mean of all of the throughput results
> between NO_SWQUEUE and SWQUEUE, but they looked roughly equal.
> 
> So swqueue isn't ideal for every configuration, but I'll echo my
> sentiment from [0] that this shouldn't on its own necessarily preclude
> it from being merged given that it does help a large class of
> configurations and workloads, and it's disabled by default.
> 
> [0]: https://lore.kernel.org/all/20230615000103.GC2883716@maniforge/

I was wondering: does it make sense to do some dividing on machines
with big LLCs? Like converting the per-LLC swqueue into per-group
swqueues, where each group is made up of ~8 CPUs of the same LLC. This
would have an effect similar to reducing the number of CPUs in a single
LLC, so the scalability issue can hopefully be fixed while, at the same
time, it might still help some workloads. I realize this isn't ideal in
that wakeups happen at LLC scale, so the group thing may not fit very
well here.

Just a thought, feel free to ignore it if you don't think this is
feasible :-)
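
To make it a bit more concrete, something like the below is what I'm
imagining (purely illustrative: the swqueue_shards[] array is made up,
SWQUEUE_SHARD_SZ is arbitrary, and the cpu -> shard mapping assumes
roughly contiguous CPU ids within an LLC, which isn't true for every
topology):

#define SWQUEUE_SHARD_SZ	8	/* ~8 CPUs share one list/lock */

static struct swqueue *rq_swqueue(struct rq *rq)
{
	int cpu = cpu_of(rq);
	int llc = per_cpu(sd_llc_id, cpu);

	/*
	 * Hypothetical per-LLC array of shards hanging off the LLC rq
	 * instead of a single queue; pick this CPU's shard so that only
	 * ~8 CPUs ever contend on each lock.
	 */
	return &cpu_rq(llc)->cfs.swqueue_shards[(cpu - llc) / SWQUEUE_SHARD_SZ];
}

A CPU about to go idle would then probably want to check its own shard
first and only peek at the sibling shards in the same LLC after that,
otherwise we give up most of the work conservation this is meant to buy.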

Thanks,
Aaron
  
Vincent Guittot June 16, 2023, 8:08 a.m. UTC | #11
On Tue, 13 Jun 2023 at 07:20, David Vernet <void@manifault.com> wrote:
>
> Overview
> ========
>
> The scheduler must constantly strike a balance between work
> conservation, and avoiding costly migrations which harm performance due
> to e.g. decreased cache locality. The matter is further complicated by
> the topology of the system. Migrating a task between cores on the same
> LLC may be more optimal than keeping a task local to the CPU, whereas
> migrating a task between LLCs or NUMA nodes may tip the balance in the
> other direction.
>
> With that in mind, while CFS is by and large mostly a work conserving
> scheduler, there are certain instances where the scheduler will choose
> to keep a task local to a CPU, when it would have been more optimal to
> migrate it to an idle core.
>
> An example of such a workload is the HHVM / web workload at Meta. HHVM
> is a VM that JITs Hack and PHP code in service of web requests. Like
> other JIT / compilation workloads, it tends to be heavily CPU bound, and
> exhibit generally poor cache locality. To try and address this, we set
> several debugfs (/sys/kernel/debug/sched) knobs on our HHVM workloads:
>
> - migration_cost_ns -> 0
> - latency_ns -> 20000000
> - min_granularity_ns -> 10000000
> - wakeup_granularity_ns -> 12000000
>
> These knobs are intended both to encourage the scheduler to be as work
> conserving as possible (migration_cost_ns -> 0), and also to keep tasks
> running for relatively long time slices so as to avoid the overhead of
> context switching (the other knobs). Collectively, these knobs provide a
> substantial performance win; resulting in roughly a 20% improvement in
> throughput. Worth noting, however, is that this improvement is _not_ at
> full machine saturation.
>
> That said, even with these knobs, we noticed that CPUs were still going
> idle even when the host was overcommitted. In response, we wrote the
> "shared wakequeue" (swqueue) feature proposed in this patch set. The
> idea behind swqueue is simple: it enables the scheduler to be
> aggressively work conserving by placing a waking task into a per-LLC
> FIFO queue that can be pulled from by another core in the LLC FIFO queue
> which can then be pulled from before it goes idle.

This seems to be just another newly idle load balance outside the current one!

The knobs above are not the only things preventing an rq from pulling a
new task. We also have the rq->avg_idle, curr_cost and
sd->max_newidle_lb_cost checks, which might be one main root cause of
one of your CPUs not pulling a waiting task.

It's not clear from your explanation why fixing the newly idle load
balance was not possible instead of adding code outside it, and what
prevents the newly idle load balance from picking a task in your case?

For example, have you tried to disable the early break because of avg_idle?
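
To be explicit about which checks I mean, newidle_balance() does roughly
the following today (paraphrased; the exact conditions depend on the
kernel version):

	/* Don't even try if this CPU isn't expected to stay idle long
	 * enough to amortize a balance pass. */
	if (!READ_ONCE(this_rq->rd->overload) ||
	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost))
		goto out;

	for_each_domain(this_cpu, sd) {
		/* Stop once the expected idle time would be eaten up by
		 * the cost of balancing this domain. */
		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
			break;

		if (sd->flags & SD_BALANCE_NEWIDLE)
			pulled_task = load_balance(this_cpu, this_rq, sd,
						   CPU_NEWLY_IDLE,
						   &continue_balancing);
		...
	}

These early bailouts are what can leave a runnable task waiting even
though a CPU is about to go idle.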

>
> With this simple change, we were able to achieve a 1 - 1.6% improvement
> in throughput, as well as a small, consistent improvement in p95 and p99
> latencies, in HHVM. These performance improvements were in addition to
> the wins from the debugfs knobs mentioned above.
>
> Design
> ======
>
> The design of swqueue is quite simple. An swqueue is simply a struct
> list_head, and a spinlock:
>
> struct swqueue {
>         struct list_head list;
>         spinlock_t lock;
> } ____cacheline_aligned;
>
> We create a struct swqueue per LLC, ensuring they're in their own
> cachelines to avoid false sharing between CPUs on different LLCs.
>
> When a task first wakes up, it enqueues itself in the swqueue of its
> current LLC at the end of enqueue_task_fair(). Enqueues only happen if
> the task was not manually migrated to the current core by
> select_task_rq(), and is not pinned to a specific CPU.
>
> A core will pull a task from its LLC's swqueue before calling
> newidle_balance().
>
> Difference between SIS_NODE
> ===========================
>
> In [0] Peter proposed a patch that addresses Tejun's observations that
> when workqueues are targeted towards a specific LLC on his Zen2 machine
> with small CCXs, that there would be significant idle time due to
> select_idle_sibling() not considering anything outside of the current
> LLC.
>
> This patch (SIS_NODE) is essentially the complement to the proposal
> here. SIS_NODE causes waking tasks to look for idle cores in neighboring
> LLCs on the same die, whereas swqueue causes cores about to go idle to
> look for enqueued tasks. That said, in its current form, the two
> features are at a different scope, as SIS_NODE searches for idle cores
> between LLCs, while swqueue enqueues tasks within a single LLC.
>
> The patch was since removed in [1], but we elect to compare its
> performance to swqueue given that as described above, it's conceptually
> complementary.
>
> [0]: https://lore.kernel.org/all/20230530113249.GA156198@hirez.programming.kicks-ass.net/
> [1]: https://lore.kernel.org/all/20230605175636.GA4253@hirez.programming.kicks-ass.net/
>
> I observed that while SIS_NODE works quite well for hosts with small
> CCXs, it can result in degraded performance on machines either with a
> large number of total cores in a CCD, or for which the cache miss
> penalty of migrating between CCXs is high, even on the same die.
>
> For example, on Zen 4c hosts (Bergamo), CCXs within a CCD are muxed
> through a single link to the IO die, and thus have similar cache miss
> latencies as cores in remote CCDs.
>
> Such subtleties could be taken into account with SIS_NODE, but
> regardless, both features are conceptually complementary sides of the
> same coin.  SIS_NODE searches for idle cores for waking threads, whereas
> swqueue searches for available work before a core goes idle.
>
> Results
> =======
>
> Note that the motivation for the shared wakequeue feature was originally
> arrived at using experiments in the sched_ext framework that's currently
> being proposed upstream. The ~1 - 1.6% improvement in HHVM throughput
> is similarly visible using work-conserving sched_ext schedulers (even
> very simple ones like global FIFO).
>
> In both single and multi socket / CCX hosts, this can measurably improve
> performance. In addition to the performance gains observed on our
> internal web workloads, we also observed an improvement in common
> workloads such as kernel compile when running shared wakequeue. Here are
> the results of running make -j$(nproc) built-in.a on several different
> types of hosts configured with make allyesconfig on commit a27648c74210
> ("afs: Fix setting of mtime when creating a file/dir/symlink") on Linus'
> tree (boost was disabled on all of these hosts when the experiments were
> performed):
>
> Single-socket | 32-core | 2-CCX | AMD 7950X Zen4
>
> CPU max MHz:                     5879.8818
> CPU min MHz:                     3000.0000
>                                 o____________o_______o
>                                 |    mean    | CPU   |
>                                 o------------o-------o
> NO_SWQUEUE + NO_SIS_NODE:       | 590.52s    | 3103% |
> NO_SWQUEUE + SIS_NODE:          | 590.80s    | 3102% |
> SWQUEUE + NO_SIS_NODE:          | 589.65s    | 3116% |
> SWQUEUE + SIS_NODE:             | 589.99s    | 3115% |
>                                 o------------o-------o
>
> Takeaway: swqueue doesn't seem to provide a statistically significant
> improvement for kernel compile on my 7950X. SIS_NODE similarly does not
> have a noticeable effect on performance.
>
> -------------------------------------------------------------------------------
>
> Single-socket | 72-core | 6-CCX | AMD Milan Zen3
>
> CPU max MHz:                     3245.0190
> CPU min MHz:                     700.0000
>                                 o_____________o_______o
>                                 |    mean     | CPU   |
>                                 o-------------o-------o
> NO_SWQUEUE + NO_SIS_NODE:       | 1608.69s    | 6488% |
> NO_SWQUEUE + SIS_NODE:          | 1610.24s    | 6473% |
> SWQUEUE + NO_SIS_NODE:          | 1605.80s    | 6504% |
> SWQUEUE + SIS_NODE:             | 1606.96s    | 6488% |
>                                 o-------------o-------o
>
> Takeaway: swqueue does provide a small statistically significant
> improvement on Milan, but the compile times in general were quite long
> relative to the 7950X Zen4, and the Bergamo Zen4c due to the lower clock
> frequency. Milan also has larger CCXs than Bergamo, so it stands to
> reason that select_idle_sibling() will have an easier time finding idle
> cores inside the current CCX.
>
> It also seems logical that SIS_NODE would hurt performance a bit here,
> as all cores / CCXs are in the same NUMA node, so select_idle_sibling()
> has to iterate over 72 cores; delaying task wakeup. That said, I'm not
> sure that's a viable theory if total CPU% is lower with SIS_NODE.
>
> -------------------------------------------------------------------------------
>
> Single-socket | 176-core | 11-CCX | 2-CCX per CCD | AMD Bergamo Zen4c
>
> CPU max MHz:                     1200.0000
> CPU min MHz:                     1000.0000
>
>                                 o____________o________o
>                                 |    mean    | CPU    |
>                                 o------------o--------o
> NO_SWQUEUE + NO_SIS_NODE:       | 322.44s    | 15534% |
> NO_SWQUEUE + SIS_NODE:          | 324.39s    | 15508% |
> SWQUEUE + NO_SIS_NODE:          | 321.54s    | 15603% |
> SWQUEUE + SIS_NODE:             | 321.88s    | 15622% |
>                                 o------------o--------o
>
> Takeaway: swqueue barely beats NO_SWQUEUE + NO_SIS_NODE, to the point
> that it's arguably not statistically significant.
> SIS_NODE results in a ~.9% performance degradation, for likely the same
> reason as Milan: the host has a large number of LLCs within a single
> socket, so task wakeup latencies suffer due to select_idle_node()
> searching up to 11 CCXs.
>
> Conclusion
> ==========
>
> swqueue in this form seems to provide a small, but noticeable win for
> front-end CPU-bound workloads spread over multiple CCXs. The reason
> seems fairly straightforward: swqueue encourages work conservation
> inside of a CCX by having a CPU do an O(1) pull from a per-LLC queue of
> runnable tasks. As mentioned above, it is complementary to SIS_NODE,
> which searches for idle cores on the wakeup path.
>
> While swqueue in this form encourages work conservation, it of course
> does not guarantee it given that we don't implement any kind of work
> stealing between swqueues. In the future, we could potentially push CPU
> utilization even higher by enabling work stealing between swqueues,
> likely between CCXs on the same NUMA node.
>
> Originally-by: Roman Gushchin <roman.gushchin@linux.dev>
> Signed-off-by: David Vernet <void@manifault.com>
> ---
>  include/linux/sched.h |   2 +
>  kernel/sched/core.c   |   2 +
>  kernel/sched/fair.c   | 175 ++++++++++++++++++++++++++++++++++++++++--
>  kernel/sched/sched.h  |   2 +
>  4 files changed, 176 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 1292d38d66cc..1f4fd22f88a8 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -770,6 +770,8 @@ struct task_struct {
>         unsigned long                   wakee_flip_decay_ts;
>         struct task_struct              *last_wakee;
>
> +       struct list_head                swqueue_node;
> +
>         /*
>          * recent_used_cpu is initially set as the last CPU used by a task
>          * that wakes affine another task. Waker/wakee relationships can
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d911b0631e7b..e04f0daf1f05 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4533,6 +4533,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
>  #ifdef CONFIG_SMP
>         p->wake_entry.u_flags = CSD_TYPE_TTWU;
>         p->migration_pending = NULL;
> +       INIT_LIST_HEAD(&p->swqueue_node);
>  #endif
>         init_sched_mm_cid(p);
>  }
> @@ -9872,6 +9873,7 @@ void __init sched_init_smp(void)
>
>         init_sched_rt_class();
>         init_sched_dl_class();
> +       init_sched_fair_class_late();
>
>         sched_smp_initialized = true;
>  }
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 807986bd6ea6..29fe25794884 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -139,17 +139,151 @@ static int __init setup_sched_thermal_decay_shift(char *str)
>  }
>  __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
>
> +/**
> + * struct swqueue - Per-LLC queue structure for enqueuing and pulling waking
> + * tasks.
> + *
> + * WHAT
> + * ====
> + *
> + * This structure enables the scheduler to be more aggressively work
> + * conserving, by placing waking tasks on a per-LLC FIFO queue that can then be
> + * pulled from when another core in the LLC is going to go idle.
> + *
> + * struct rq stores a pointer to its LLC's swqueue via struct cfs_rq. Waking
> + * tasks are enqueued in a swqueue at the end of enqueue_task_fair(), and are
> + * opportunistically pulled from the swqueue in pick_next_task_fair() prior to
> + * invoking newidle_balance(). Tasks enqueued in a swqueue may be scheduled prior
> + * to being pulled from the swqueue, in which case they're simply removed from
> + * the swqueue. A waking task is only enqueued to a swqueue when it was _not_
> + * manually migrated to the current runqueue by select_task_rq_fair().
> + *
> + * There is currently no task-stealing between swqueues in different LLCs,
> + * which means that swqueue is not fully work conserving. This could be added
> + * at a later time, with tasks likely only being stolen across swqueues on the
> + * same NUMA node to avoid violating NUMA affinities.
> + *
> + * HOW
> + * ===
> + *
> + * An swqueue is comprised of a list, and a spinlock for synchronization. Given
> + * that the critical section for a swqueue is typically a fast list operation,
> + * and that the swqueue is localized to a single LLC, the spinlock does not
> + * seem to be contended; even on a heavily utilized host. struct swqueues are
> + * also cacheline aligned to prevent false sharing between CPUs manipulating
> + * swqueues in other LLCs.
> + *
> + * WHY
> + * ===
> + *
> + * As mentioned above, the main benefit of swqueue is that it enables more
> + * aggressive work conservation in the scheduler. This can benefit workloads
> + * that benefit more from CPU utilization than from L1/L2 cache locality.
> + *
> + * swqueues are segmented across LLCs both to avoid contention on the swqueue
> + * spinlock by minimizing the number of CPUs that could contend on it, as well
> + * as to strike a balance between work conservation, and L3 cache locality.
> + */
> +struct swqueue {
> +       struct list_head list;
> +       spinlock_t lock;
> +} ____cacheline_aligned;
> +
>  #ifdef CONFIG_SMP
> -static void swqueue_enqueue(struct rq *rq, struct task_struct *p,
> -                           int enq_flags)
> -{}
> +static struct swqueue *rq_swqueue(struct rq *rq)
> +{
> +       return rq->cfs.swqueue;
> +}
> +
> +static struct task_struct *swqueue_pull_task(struct swqueue *swqueue)
> +{
> +       unsigned long flags;
> +
> +       struct task_struct *p;
> +
> +       spin_lock_irqsave(&swqueue->lock, flags);
> +       p = list_first_entry_or_null(&swqueue->list, struct task_struct,
> +                                    swqueue_node);
> +       if (p)
> +               list_del_init(&p->swqueue_node);
> +       spin_unlock_irqrestore(&swqueue->lock, flags);
> +
> +       return p;
> +}
> +
> +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
> +{
> +       unsigned long flags;
> +       struct swqueue *swqueue;
> +       bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> +       bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> +
> +       /*
> +        * Only enqueue the task in the shared wakequeue if:
> +        *
> +        * - SWQUEUE is enabled
> +        * - The task is on the wakeup path
> +        * - The task wasn't purposefully migrated to the current rq by
> +        *   select_task_rq()
> +        * - The task isn't pinned to a specific CPU
> +        */
> +       if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> +               return;
> +
> +       swqueue = rq_swqueue(rq);
> +       spin_lock_irqsave(&swqueue->lock, flags);
> +       list_add_tail(&p->swqueue_node, &swqueue->list);
> +       spin_unlock_irqrestore(&swqueue->lock, flags);
> +}
> +
>  static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf)
>  {
> -       return 0;
> +       struct swqueue *swqueue;
> +       struct task_struct *p = NULL;
> +       struct rq *src_rq;
> +       struct rq_flags src_rf;
> +       int ret;
> +
> +       swqueue = rq_swqueue(rq);
> +       if (!list_empty(&swqueue->list))
> +               p = swqueue_pull_task(swqueue);
> +
> +       if (!p)
> +               return 0;
> +
> +       rq_unpin_lock(rq, rf);
> +       raw_spin_rq_unlock(rq);
> +
> +       src_rq = task_rq_lock(p, &src_rf);
> +
> +       if (task_on_rq_queued(p) && !task_on_cpu(rq, p))
> +               src_rq = migrate_task_to(src_rq, &src_rf, p, cpu_of(rq));
> +
> +       if (src_rq->cpu != rq->cpu)
> +               ret = 1;
> +       else
> +               ret = -1;
> +
> +       task_rq_unlock(src_rq, p, &src_rf);
> +
> +       raw_spin_rq_lock(rq);
> +       rq_repin_lock(rq, rf);
> +
> +       return ret;
>  }
>
>  static void swqueue_remove_task(struct task_struct *p)
> -{}
> +{
> +       unsigned long flags;
> +       struct swqueue *swqueue;
> +
> +       if (!list_empty(&p->swqueue_node)) {
> +               swqueue = rq_swqueue(task_rq(p));
> +               spin_lock_irqsave(&swqueue->lock, flags);
> +               list_del_init(&p->swqueue_node);
> +               spin_unlock_irqrestore(&swqueue->lock, flags);
> +       }
> +}
>
>  /*
>   * For asym packing, by default the lower numbered CPU has higher priority.
> @@ -12839,3 +12973,34 @@ __init void init_sched_fair_class(void)
>  #endif /* SMP */
>
>  }
> +
> +__init void init_sched_fair_class_late(void)
> +{
> +#ifdef CONFIG_SMP
> +       int i;
> +       struct swqueue *swqueue;
> +       struct rq *rq;
> +       struct rq *llc_rq;
> +
> +       for_each_possible_cpu(i) {
> +               if (per_cpu(sd_llc_id, i) == i) {
> +                       llc_rq = cpu_rq(i);
> +
> +                       swqueue = kzalloc_node(sizeof(struct swqueue),
> +                                              GFP_KERNEL, cpu_to_node(i));
> +                       INIT_LIST_HEAD(&swqueue->list);
> +                       spin_lock_init(&swqueue->lock);
> +                       llc_rq->cfs.swqueue = swqueue;
> +               }
> +       }
> +
> +       for_each_possible_cpu(i) {
> +               rq = cpu_rq(i);
> +               llc_rq = cpu_rq(per_cpu(sd_llc_id, i));
> +
> +               if (rq == llc_rq)
> +                       continue;
> +               rq->cfs.swqueue = llc_rq->cfs.swqueue;
> +       }
> +#endif /* SMP */
> +}
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 5a86e9795731..daee5c64af87 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -575,6 +575,7 @@ struct cfs_rq {
>  #endif
>
>  #ifdef CONFIG_SMP
> +       struct swqueue          *swqueue;
>         /*
>          * CFS load tracking
>          */
> @@ -2380,6 +2381,7 @@ extern void update_max_interval(void);
>  extern void init_sched_dl_class(void);
>  extern void init_sched_rt_class(void);
>  extern void init_sched_fair_class(void);
> +extern void init_sched_fair_class_late(void);
>
>  extern void reweight_task(struct task_struct *p, int prio);
>
> --
> 2.40.1
>
  
Gautham R. Shenoy June 19, 2023, 6:13 a.m. UTC | #12
Hello David,


On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
[..snip..]

> +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
> +{
> +	unsigned long flags;
> +	struct swqueue *swqueue;
> +	bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> +	bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> +
> +	/*
> +	 * Only enqueue the task in the shared wakequeue if:
> +	 *
> +	 * - SWQUEUE is enabled
> +	 * - The task is on the wakeup path
> +	 * - The task wasn't purposefully migrated to the current rq by
> +	 *   select_task_rq()
> +	 * - The task isn't pinned to a specific CPU
> +	 */
> +	if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> +		return;

In select_task_rq_fair(), having determined whether the target of the
task wakeup should be the task's previous CPU or the waker's current
CPU, we already spend quite a bit of time determining whether there is
an idle core/CPU in the target's LLC. @rq would correspond to the CPU
chosen as a result of that scan or, if no idle CPU exists, to the
target CPU determined by wake_affine_idle()/wake_affine_weight().

So if the CPU of @rq is idle here, can we not simply return here?

Or if the idea is to avoid the scan for an idle core/CPU in
select_task_rq_fair(), then 

Perhaps I am missing something...

> +
> +	swqueue = rq_swqueue(rq);
> +	spin_lock_irqsave(&swqueue->lock, flags);
> +	list_add_tail(&p->swqueue_node, &swqueue->list);
> +	spin_unlock_irqrestore(&swqueue->lock, flags);
> +}
> +

--
Thanks and Regards
gautham.
  
David Vernet June 20, 2023, 5:36 p.m. UTC | #13
On Fri, Jun 16, 2023 at 08:53:38AM +0800, Aaron Lu wrote:
> On Thu, Jun 15, 2023 at 06:26:05PM -0500, David Vernet wrote:
>  
> > Ok, it seems that the issue is that I wasn't creating enough netperf
> > clients. I assumed that -n $(nproc) was sufficient. I was able to repro
> 
> Yes that switch is confusing.
> 
> > the contention on my 26 core / 52 thread skylake client as well:
> > 
> > 
>  
> > Thanks for the help in getting the repro on my end.
> 
> You are welcome.
> 
> > So yes, there is certainly a scalability concern to bear in mind for
> > swqueue for LLCs with a lot of cores. If you have a lot of tasks quickly
> > e.g. blocking and waking on futexes in a tight loop, I expect a similar
> > issue would be observed.
> > 
> > On the other hand, the issue did not occur on my 7950X. I also wasn't
> 
> Using netperf/UDP_RR?

Correct

> > able to repro the contention on the Skylake if I ran with the default
> > netperf workload rather than UDP_RR (even with the additional clients).
> 
> I also tried that on the 18cores/36threads/LLC Skylake and the contention
> is indeed much smaller than UDP_RR:
> 
>      7.30%     7.29%  [kernel.vmlinux]      [k]      native_queued_spin_lock_slowpath
> 
> But I wouldn't say it's entirely gone. Also consider Skylake has a lot
> fewer cores per LLC than later Intel servers like Icelake and Sapphire
> Rapids and I expect things would be worse on those two machines.

I cannot reproduce this contention locally, even on a slightly larger
Skylake. Not really sure what to make of the difference here. Perhaps
it's because you're running with CONFIG_SCHED_CORE=y? What is the
change in throughput when you run the default workload on your SKL?

> > I didn't bother to take the mean of all of the throughput results
> > between NO_SWQUEUE and SWQUEUE, but they looked roughly equal.
> > 
> > So swqueue isn't ideal for every configuration, but I'll echo my
> > sentiment from [0] that this shouldn't on its own necessarily preclude
> > it from being merged given that it does help a large class of
> > configurations and workloads, and it's disabled by default.
> > 
> > [0]: https://lore.kernel.org/all/20230615000103.GC2883716@maniforge/
> 
> I was wondering: does it make sense to do some divide on machines with
> big LLCs? Like converting the per-LLC swqueue to per-group swqueue where
> the group can be made of ~8 cpus of the same LLC. This will have a
> similar effect of reducing the number of CPUs in a single LLC so the
> scalability issue can hopefully be fixed while at the same time, it
> might still help some workloads. I realized this isn't ideal in that
> wakeup happens at LLC scale so the group thing may not fit very well
> here.
> 
> Just a thought, feel free to ignore it if you don't think this is
> feasible :-)

That's certainly an idea we could explore, but my inclination would be
to keep everything at a per-LLC granularity. It makes it easier to
reason about performance; both in terms of work conservation per-LLC
(again, not every workload suffers from having large LLCs even if others
do, and halving the size of a swqueue in an LLC could harm other
workloads which benefit from the increased work conservation), and in
terms of contention. To the latter point, I think it would be difficult
to choose an LLC size that wasn't somewhat artificial and workload
specific. If someone has that requirement, I think sched_ext would be a
better alternative.

Thanks,
David
  
David Vernet June 20, 2023, 7:54 p.m. UTC | #14
On Fri, Jun 16, 2023 at 10:08:57AM +0200, Vincent Guittot wrote:
> On Tue, 13 Jun 2023 at 07:20, David Vernet <void@manifault.com> wrote:
> >
> > Overview
> > ========
> >
> > The scheduler must constantly strike a balance between work
> > conservation, and avoiding costly migrations which harm performance due
> > to e.g. decreased cache locality. The matter is further complicated by
> > the topology of the system. Migrating a task between cores on the same
> > LLC may be more optimal than keeping a task local to the CPU, whereas
> > migrating a task between LLCs or NUMA nodes may tip the balance in the
> > other direction.
> >
> > With that in mind, while CFS is by and large mostly a work conserving
> > scheduler, there are certain instances where the scheduler will choose
> > to keep a task local to a CPU, when it would have been more optimal to
> > migrate it to an idle core.
> >
> > An example of such a workload is the HHVM / web workload at Meta. HHVM
> > is a VM that JITs Hack and PHP code in service of web requests. Like
> > other JIT / compilation workloads, it tends to be heavily CPU bound, and
> > exhibit generally poor cache locality. To try and address this, we set
> > several debugfs (/sys/kernel/debug/sched) knobs on our HHVM workloads:
> >
> > - migration_cost_ns -> 0
> > - latency_ns -> 20000000
> > - min_granularity_ns -> 10000000
> > - wakeup_granularity_ns -> 12000000
> >
> > These knobs are intended both to encourage the scheduler to be as work
> > conserving as possible (migration_cost_ns -> 0), and also to keep tasks
> > running for relatively long time slices so as to avoid the overhead of
> > context switching (the other knobs). Collectively, these knobs provide a
> > substantial performance win; resulting in roughly a 20% improvement in
> > throughput. Worth noting, however, is that this improvement is _not_ at
> > full machine saturation.
> >
> > That said, even with these knobs, we noticed that CPUs were still going
> > idle even when the host was overcommitted. In response, we wrote the
> > "shared wakequeue" (swqueue) feature proposed in this patch set. The
> > idea behind swqueue is simple: it enables the scheduler to be
> > aggressively work conserving by placing a waking task into a per-LLC
> > FIFO queue that can be pulled from by another core in the LLC FIFO queue
> > which can then be pulled from before it goes idle.
> 
> This seems to be just another newly idle load balance outside the current one !

Hi Vincent,

I can bring the swqueue logic inside of newidle_balance(). In hindsight
I think it makes more sense there.

To answer your point more generally though, yes, this is a new approach
to load balancing that eschews tracking migration costs, scanning
runqueues, etc in favor of optimizing for work conservation, and
tracking runnable tasks in a shared data structure. More on this below
in response to your other points.

> 
> The knobs above are not the only thing preventing a rq to pull a new
> task. We have rq->avg_idle, curr_cost and sd->max_newidle_lb_cost
> stuff which might be one main root cause for one of your cpu not
> pulling a waiting task
>
> It's not clear in your explanation why fixing newly_idle_load_balance
> was not possible instead of adding outside code and what prevents
> newly_idle_load balance from picking a task in your case ?
> 
> For example, have you tried to disable the early break because of avg_idle ?

The goal of swqueue is to enable work conservation using a shared, per-LLC data
structure. The shared data structure is really the salient point as to why just
updating newidle_balance() wouldn't achieve the same result. newidle_balance()
finds tasks to migrate by (sometimes) iterating over runqueues and scanning for
tasks. It's an expensive operation, which doesn't scale to large machines or to
being performed on every idle path. swqueue, on the other hand, is shared
across all cores in an LLC, so pulling a runnable task is simply a matter of a
spinlock acquire and a list operation. This doesn't scale to every single
configuration as Aaron pointed out in [0], but it works well in many other
configurations.

[0]: https://lore.kernel.org/all/20230614043529.GA1942@ziqianlu-dell/

Another consideration is that even if we could adjust newidle_balance()
to load balance well enough for our specific purpose, we're still
relying on heuristics to determine when it's appropriate to load
balance; and that will 

a) Inevitably be suboptimal for certain workloads and configurations. For
example, if we got rid of the following check in newidle_balance():

	for_each_domain(this_cpu, sd) {
		int continue_balancing = 1;
		u64 domain_cost;

		update_next_balance(sd, &next_balance);

		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
			break;

we may end up load balancing too frequently, which could be a regression in and
of itself per the scalability concerns mentioned above. Or, for certain
workloads, we'll load balance too aggressively and miss out on L1/L2 locality.

b) Be harder for users to effectively tune or use. At OSPM, Peter made it quite
clear that users should not be tuning any of the debugfs knobs. Relying on
heuristics and knobs like this feels antithetical to that policy. swqueue feels
like a reasonable, self-contained alternative to that.

Thanks,
David
  
David Vernet June 20, 2023, 8:08 p.m. UTC | #15
On Mon, Jun 19, 2023 at 11:43:13AM +0530, Gautham R. Shenoy wrote:
> Hello David,
> 
> 
> On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> [..snip..]
> 
> > +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
> > +{
> > +	unsigned long flags;
> > +	struct swqueue *swqueue;
> > +	bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> > +	bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> > +
> > +	/*
> > +	 * Only enqueue the task in the shared wakequeue if:
> > +	 *
> > +	 * - SWQUEUE is enabled
> > +	 * - The task is on the wakeup path
> > +	 * - The task wasn't purposefully migrated to the current rq by
> > +	 *   select_task_rq()
> > +	 * - The task isn't pinned to a specific CPU
> > +	 */
> > +	if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> > +		return;
> 
> In select_task_rq_fair(), having determined if the target of task
> wakeup should be the task's previous CPU vs the waker's current CPU,
> we spend quite a bit of time already to determine if there is an idle
> core/CPU in the target's LLC. @rq would correspond to CPU chosen as a
> result of that scan or if no idle CPU exists, @rq corresponds to the
> target CPU determined by wake_affine_idle()/wake_affine_weight().
> 
> So if the CPU of @rq is idle here, can we not simply return here?

Hi Gautham,

Sorry, I'm not sure I'm quite following the issue you're pointing out.
We don't use swqueue if the task was migrated following
select_task_rq_fair(). That's the idea with us returning if the task was
migrated (the second conditional in that if). If I messed up that logic
somehow, it should be fixed.

> Or if the idea is to avoid the scan for an idle core/CPU in
> select_task_rq_fair(), then 

No, swqueue_enqueue() is called from enqueue_task_fair(), not
select_task_rq_fair(). If select_task_rq_fair() (which is of course
called beforehand for a waking task) found an idle core/CPU, we don't
bother using swqueue, as mentioned above.

Thanks,
David
  
Roman Gushchin June 20, 2023, 9:37 p.m. UTC | #16
On Tue, Jun 20, 2023 at 02:54:23PM -0500, David Vernet wrote:
> On Fri, Jun 16, 2023 at 10:08:57AM +0200, Vincent Guittot wrote:
> > On Tue, 13 Jun 2023 at 07:20, David Vernet <void@manifault.com> wrote:
> > >
> > > Overview
> > > ========
> > >
> > > The scheduler must constantly strike a balance between work
> > > conservation, and avoiding costly migrations which harm performance due
> > > to e.g. decreased cache locality. The matter is further complicated by
> > > the topology of the system. Migrating a task between cores on the same
> > > LLC may be more optimal than keeping a task local to the CPU, whereas
> > > migrating a task between LLCs or NUMA nodes may tip the balance in the
> > > other direction.
> > >
> > > With that in mind, while CFS is by and large mostly a work conserving
> > > scheduler, there are certain instances where the scheduler will choose
> > > to keep a task local to a CPU, when it would have been more optimal to
> > > migrate it to an idle core.
> > >
> > > An example of such a workload is the HHVM / web workload at Meta. HHVM
> > > is a VM that JITs Hack and PHP code in service of web requests. Like
> > > other JIT / compilation workloads, it tends to be heavily CPU bound, and
> > > exhibit generally poor cache locality. To try and address this, we set
> > > several debugfs (/sys/kernel/debug/sched) knobs on our HHVM workloads:
> > >
> > > - migration_cost_ns -> 0
> > > - latency_ns -> 20000000
> > > - min_granularity_ns -> 10000000
> > > - wakeup_granularity_ns -> 12000000
> > >
> > > These knobs are intended both to encourage the scheduler to be as work
> > > conserving as possible (migration_cost_ns -> 0), and also to keep tasks
> > > running for relatively long time slices so as to avoid the overhead of
> > > context switching (the other knobs). Collectively, these knobs provide a
> > > substantial performance win; resulting in roughly a 20% improvement in
> > > throughput. Worth noting, however, is that this improvement is _not_ at
> > > full machine saturation.
> > >
> > > That said, even with these knobs, we noticed that CPUs were still going
> > > idle even when the host was overcommitted. In response, we wrote the
> > > "shared wakequeue" (swqueue) feature proposed in this patch set. The
> > > idea behind swqueue is simple: it enables the scheduler to be
> > > aggressively work conserving by placing a waking task into a per-LLC
> > > FIFO queue that can be pulled from by another core in the LLC FIFO queue
> > > which can then be pulled from before it goes idle.
> > 
> > This seems to be just another newly idle load balance outside the current one !
> 
> Hi Vincent,
> 
> I can bring the swqueue logic inside of newidle_balance(). In hindsight
> I think it makes more sense there.
> 
> To answer your point more generally though, yes, this is a new approach
> to load balancing that eschews tracking migration costs, scanning
> runqueues, etc in favor of optimizing for work conservation, and
> tracking runnable tasks in a shared data structure. More on this below
> in response to your other points.
> 
> > 
> > The knobs above are not the only thing preventing a rq to pull a new
> > task. We have rq->avg_idle, curr_cost and sd->max_newidle_lb_cost
> > stuff which might be one main root cause for one of your cpu not
> > pulling a waiting task
> >
> > It's not clear in your explanation why fixing newly_idle_load_balance
> > was not possible instead of adding outside code and what prevents
> > newly_idle_load balance from picking a task in your case ?
> > 
> > For example, have you tried to disable the early break because of avg_idle ?
> 
> The goal of swqueue is to enable work conservation using a shared, per-LLC data
> structure. The shared data structure is really the salient point as to why just
> updating newidle_balance() wouldn't achieve the same result. newidle_balance()
> finds tasks to migrate by (sometimes) iterating over runqueues and scanning for
> tasks. It's an expensive operation, which doesn't scale to large machines or to
> being performed on every idle path. swqueue, on the other hand, is shared
> across all cores in an LLC, so pulling a runnable task is simply a matter of a
> spinlock acquire and a list operation. This doesn't scale to every single
> configuration as Aaron pointed out in [0], but it works well in many other
> configurations.

+1

Additionally, swqueue minimizes the number of races when two tasks pick
the same cpu for being woken up.
It also serializes the wakeup order, which effectively pessimises tasks
that are woken very often and favors cpu-bound tasks, which usually
benefits throughput.

I agree that swqueue can be seen as an alternative form of idle load
balancing; I thought of it this way when I was working on it.
Can it replace it completely? Idk, maybe. But maybe we need some sort of
balancing between multiple wait queues. E.g. if there are two swqueues on
the system, one is empty and another is long, idle load balancing could
borrow some tasks from the "foreign" swqueue.
Just some ideas.

Thanks!
  
Aaron Lu June 21, 2023, 2:35 a.m. UTC | #17
On Tue, Jun 20, 2023 at 12:36:26PM -0500, David Vernet wrote:
> On Fri, Jun 16, 2023 at 08:53:38AM +0800, Aaron Lu wrote:
> > I also tried that on the 18cores/36threads/LLC Skylake and the contention
> > is indeed much smaller than UDP_RR:
> > 
> >      7.30%     7.29%  [kernel.vmlinux]      [k]      native_queued_spin_lock_slowpath
> > 
> > But I wouldn't say it's entirely gone. Also consider Skylake has a lot
> > fewer cores per LLC than later Intel servers like Icelake and Sapphire
> > Rapids and I expect things would be worse on those two machines.
> 
> I cannot reproduce this contention locally, even on a slightly larger

With netperf client number equal to nr_cpu?

> Skylake. Not really sure what to make of the difference here. Perhaps
> it's because you're running with CONFIG_SCHED_CORE=y? What is the

Yes I had that config on but I didn't tag any tasks or groups.

> change in throughput when you run the default workload on your SKL?

The throughput dropped a little with SWQUEUE:

                 avg_throughput    native_queued_spin_lock_slowpath%
NO_SWQUEUE:      9528.061111111108      0.09%
SWQUEUE:         8984.369722222222      8.05%

avg_throughput: average throughput of all netperf client's throughput,
higher is better.

I run this workload like this:
"
netserver

for i in `seq 72`; do
        netperf -l 60 -n 72 -6 &
done

sleep 30
perf record -ag -e cycles:pp -- sleep 5 &

wait
"
(the '-n 72' should be redundant but I just keep it there)

Thanks,
Aaron
  
David Vernet June 21, 2023, 2:43 a.m. UTC | #18
On Wed, Jun 21, 2023 at 10:35:34AM +0800, Aaron Lu wrote:
> On Tue, Jun 20, 2023 at 12:36:26PM -0500, David Vernet wrote:
> > On Fri, Jun 16, 2023 at 08:53:38AM +0800, Aaron Lu wrote:
> > > I also tried that on the 18cores/36threads/LLC Skylake and the contention
> > > is indeed much smaller than UDP_RR:
> > > 
> > >      7.30%     7.29%  [kernel.vmlinux]      [k]      native_queued_spin_lock_slowpath
> > > 
> > > But I wouldn't say it's entirely gone. Also consider Skylake has a lot
> > > fewer cores per LLC than later Intel servers like Icelake and Sapphire
> > > Rapids and I expect things would be worse on those two machines.
> > 
> > I cannot reproduce this contention locally, even on a slightly larger
> 
> With netperf client number equal to nr_cpu?

No, that confusion was only the first time around. See below though, I'm
not sure what insights are to be gained by continuing to tinker with
netperf runs.

> > Skylake. Not really sure what to make of the difference here. Perhaps
> > it's because you're running with CONFIG_SCHED_CORE=y? What is the
> 
> Yes I had that config on but I didn't tag any tasks or groups.
> 
> > change in throughput when you run the default workload on your SKL?
> 
> The throughput dropped a little with SWQUEUE:
> 
>                  avg_throughput    native_queued_spin_lock_slowpath%
> NO_SWQUEUE:      9528.061111111108      0.09%
> SWQUEUE:         8984.369722222222      8.05%
> 
> avg_throughput: average throughput of all netperf client's throughput,
> higher is better.
> 
> I run this workload like this:
> "
> netserver
> 
> for i in `seq 72`; do
>         netperf -l 60 -n 72 -6 &
> done
> 
> sleep 30
> perf record -ag -e cycles:pp -- sleep 5 &
> 
> wait
> "
> (the '-n 72' should be redundant but I just keep it there)

At this point I'd say we've spent quite a bit of time discussing netperf
results. We understand where the contention is coming from, and yes,
we've established that there are going to be some configurations where
swqueue is not well suited. We've also established that there are
configurations where it will and does perform well, including on
netperf.

I'm not sure what we're hoping to gain by continuing to run various
netperf workloads with your specific parameters?

Thanks,
David
  
Aaron Lu June 21, 2023, 4:54 a.m. UTC | #19
On Tue, Jun 20, 2023 at 09:43:00PM -0500, David Vernet wrote:
> On Wed, Jun 21, 2023 at 10:35:34AM +0800, Aaron Lu wrote:
> > On Tue, Jun 20, 2023 at 12:36:26PM -0500, David Vernet wrote:
> > > On Fri, Jun 16, 2023 at 08:53:38AM +0800, Aaron Lu wrote:
> > > > I also tried that on the 18cores/36threads/LLC Skylake and the contention
> > > > is indeed much smaller than UDP_RR:
> > > > 
> > > >      7.30%     7.29%  [kernel.vmlinux]      [k]      native_queued_spin_lock_slowpath
> > > > 
> > > > But I wouldn't say it's entirely gone. Also consider Skylake has a lot
> > > > fewer cores per LLC than later Intel servers like Icelake and Sapphire
> > > > Rapids and I expect things would be worse on those two machines.
> > > 
> > > I cannot reproduce this contention locally, even on a slightly larger
> > 
> > With netperf client number equal to nr_cpu?
> 
> No, that confusion was only the first time around. See below though, I'm
> not sure what insights are to be gained by continuing to tinker with
> netperf runs.
> 
> > > Skylake. Not really sure what to make of the difference here. Perhaps
> > > it's because you're running with CONFIG_SCHED_CORE=y? What is the
> > 
> > Yes I had that config on but I didn't tag any tasks or groups.
> > 
> > > change in throughput when you run the default workload on your SKL?
> > 
> > The throughput dropped a little with SWQUEUE:
> > 
> >                  avg_throughput    native_queued_spin_lock_slowpath%
> > NO_SWQUEUE:      9528.061111111108      0.09%
> > SWQUEUE:         8984.369722222222      8.05%
> > 
> > avg_throughput: average throughput of all netperf client's throughput,
> > higher is better.
> > 
> > I run this workload like this:
> > "
> > netserver
> > 
> > for i in `seq 72`; do
> >         netperf -l 60 -n 72 -6 &
> > done
> > 
> > sleep 30
> > perf record -ag -e cycles:pp -- sleep 5 &
> > 
> > wait
> > "
> > (the '-n 72' should be redundant but I just keep it there)
> 
> At this point I'd say we've spent quite a bit of time discussing netperf
> results. We understand where the contention is coming from, and yes,
> we've established that there are going to be some configurations where
> swqueue is not well suited. We've also established that there are
> configurations where it will and does perform well, including on
> netperf.
> 
> I'm not sure what we're hoping to gain by continuing to run various
> netperf workloads with your specific parameters?

I don't quite follow you.

I thought we were in the process of figuring out why for the same
workload(netperf/default_mode/nr_client=nr_cpu) on two similar
machines(both are Skylake) you saw no contention while I saw some so I
tried to be exact on how I run the workload.

If that's not the case, then yes there is not much value in continuing this
discussion.

Thanks,
Aaron
  
David Vernet June 21, 2023, 5:43 a.m. UTC | #20
On Wed, Jun 21, 2023 at 12:54:16PM +0800, Aaron Lu wrote:
> On Tue, Jun 20, 2023 at 09:43:00PM -0500, David Vernet wrote:
> > On Wed, Jun 21, 2023 at 10:35:34AM +0800, Aaron Lu wrote:
> > > On Tue, Jun 20, 2023 at 12:36:26PM -0500, David Vernet wrote:
> > > > On Fri, Jun 16, 2023 at 08:53:38AM +0800, Aaron Lu wrote:
> > > > > I also tried that on the 18cores/36threads/LLC Skylake and the contention
> > > > > is indeed much smaller than UDP_RR:
> > > > > 
> > > > >      7.30%     7.29%  [kernel.vmlinux]      [k]      native_queued_spin_lock_slowpath
> > > > > 
> > > > > But I wouldn't say it's entirely gone. Also consider Skylake has a lot
> > > > > fewer cores per LLC than later Intel servers like Icelake and Sapphire
> > > > > Rapids and I expect things would be worse on those two machines.
> > > > 
> > > > I cannot reproduce this contention locally, even on a slightly larger
> > > 
> > > With netperf client number equal to nr_cpu?
> > 
> > No, that confusion was only the first time around. See below though, I'm
> > not sure what insights are to be gained by continuing to tinker with
> > netperf runs.
> > 
> > > > Skylake. Not really sure what to make of the difference here. Perhaps
> > > > it's because you're running with CONFIG_SCHED_CORE=y? What is the
> > > 
> > > Yes I had that config on but I didn't tag any tasks or groups.
> > > 
> > > > change in throughput when you run the default workload on your SKL?
> > > 
> > > The throughput dropped a little with SWQUEUE:
> > > 
> > >                  avg_throughput    native_queued_spin_lock_slowpath%
> > > NO_SWQUEUE:      9528.061111111108      0.09%
> > > SWQUEUE:         8984.369722222222      8.05%
> > > 
> > > avg_throughput: average throughput of all netperf client's throughput,
> > > higher is better.
> > > 
> > > I run this workload like this:
> > > "
> > > netserver
> > > 
> > > for i in `seq 72`; do
> > >         netperf -l 60 -n 72 -6 &
> > > done
> > > 
> > > sleep 30
> > > perf record -ag -e cycles:pp -- sleep 5 &
> > > 
> > > wait
> > > "
> > > (the '-n 72' should be redundant but I just keep it there)
> > 
> > At this point I'd say we've spent quite a bit of time discussing netperf
> > results. We understand where the contention is coming from, and yes,
> > we've established that there are going to be some configurations where
> > swqueue is not well suited. We've also established that there are
> > configurations where it will and does perform well, including on
> > netperf.
> > 
> > I'm not sure what we're hoping to gain by continuing to run various
> > netperf workloads with your specific parameters?
> 
> I don't quite follow you.
> 
> I thought we were in the process of figuring out why for the same
> workload(netperf/default_mode/nr_client=nr_cpu) on two similar
> machines(both are Skylake) you saw no contention while I saw some so I
> tried to be exact on how I run the workload.

I just reran the workload on a 26 core / 52 thread Cooper Lake using
your exact command below and still don't observe any contention
whatsoever on the swqueue lock:

for i in `seq 72`; do
	netperf -l 60 -n 72 -6 &
done

> If that's not the case, then yes there is not much value in continuing this
> discussion.

We can iterate until we find out why we're seeing slightly different
contention (different configs, different amount of RAM, maybe you have
turbo enabled or other things running on your host, etc), but I don't
see what that would tell us that would meaningfully drive the discussion
forward for the patch set. Is there anything in particular you're trying
to determine and/or do you have reason to think that the contention
you're observing is due to something other than a lot of tasks waking up
at the same time, just as it was with UDP_RR?

Thanks,
David
  
Aaron Lu June 21, 2023, 6:03 a.m. UTC | #21
On Wed, Jun 21, 2023 at 12:43:52AM -0500, David Vernet wrote:
> On Wed, Jun 21, 2023 at 12:54:16PM +0800, Aaron Lu wrote:
> > On Tue, Jun 20, 2023 at 09:43:00PM -0500, David Vernet wrote:
> > > On Wed, Jun 21, 2023 at 10:35:34AM +0800, Aaron Lu wrote:
> > > > On Tue, Jun 20, 2023 at 12:36:26PM -0500, David Vernet wrote:
> > > > > On Fri, Jun 16, 2023 at 08:53:38AM +0800, Aaron Lu wrote:
> > > > > > I also tried that on the 18cores/36threads/LLC Skylake and the contention
> > > > > > is indeed much smaller than UDP_RR:
> > > > > > 
> > > > > >      7.30%     7.29%  [kernel.vmlinux]      [k]      native_queued_spin_lock_slowpath
> > > > > > 
> > > > > > But I wouldn't say it's entirely gone. Also consider Skylake has a lot
> > > > > > fewer cores per LLC than later Intel servers like Icelake and Sapphire
> > > > > > Rapids and I expect things would be worse on those two machines.
> > > > > 
> > > > > I cannot reproduce this contention locally, even on a slightly larger
> > > > 
> > > > With netperf client number equal to nr_cpu?
> > > 
> > > No, that confusion was only the first time around. See below though, I'm
> > > not sure what insights are to be gained by continuing to tinker with
> > > netperf runs.
> > > 
> > > > > Skylake. Not really sure what to make of the difference here. Perhaps
> > > > > it's because you're running with CONFIG_SCHED_CORE=y? What is the
> > > > 
> > > > Yes I had that config on but I didn't tag any tasks or groups.
> > > > 
> > > > > change in throughput when you run the default workload on your SKL?
> > > > 
> > > > The throughput dropped a little with SWQUEUE:
> > > > 
> > > >                  avg_throughput    native_queued_spin_lock_slowpath%
> > > > NO_SWQUEUE:      9528.061111111108      0.09%
> > > > SWQUEUE:         8984.369722222222      8.05%
> > > > 
> > > > avg_throughput: average throughput of all netperf client's throughput,
> > > > higher is better.
> > > > 
> > > > I run this workload like this:
> > > > "
> > > > netserver
> > > > 
> > > > for i in `seq 72`; do
> > > >         netperf -l 60 -n 72 -6 &
> > > > done
> > > > 
> > > > sleep 30
> > > > perf record -ag -e cycles:pp -- sleep 5 &
> > > > 
> > > > wait
> > > > "
> > > > (the '-n 72' should be redundant but I just keep it there)
> > > 
> > > At this point I'd say we've spent quite a bit of time discussing netperf
> > > results. We understand where the contention is coming from, and yes,
> > > we've established that there are going to be some configurations where
> > > swqueue is not well suited. We've also established that there are
> > > configurations where it will and does perform well, including on
> > > netperf.
> > > 
> > > I'm not sure what we're hoping to gain by continuing to run various
> > > netperf workloads with your specific parameters?
> > 
> > I don't quite follow you.
> > 
> > I thought we were in the process of figuring out why for the same
> > workload(netperf/default_mode/nr_client=nr_cpu) on two similar
> > machines(both are Skylake) you saw no contention while I saw some so I
> > tried to be exact on how I run the workload.
> 
> I just reran the workload on a 26 core / 52 thread Cooper Lake using
> your exact command below and still don't observe any contention
> whatsoever on the swqueue lock:

Well, it's a puzzle to me.

But as you said below, I guess I'll just move on.

Thanks,
Aaron

> 
> for i in `seq 72`; do
> 	netperf -l 60 -n 72 -6 &
> done
> 
> > If that's not the case, then yes there is not much value in continuing this
> > discussion.
> 
> We can iterate until we find out why we're seeing slightly different
> contention (different configs, different amount of RAM, maybe you have
> turbo enabled or other things running on your host, etc), but I don't
> see what that would tell us that would meaningfully drive the discussion
> forward for the patch set. Is there anything in particular you're trying
> to determine and/or do you have reason to think that the contention
> you're observing is due to something other than a lot of tasks waking up
> at the same time, just as it was with UDP_RR?
> 
> Thanks,
> David
  
Gautham R. Shenoy June 21, 2023, 8:17 a.m. UTC | #22
Hello David,

On Tue, Jun 20, 2023 at 03:08:22PM -0500, David Vernet wrote:
> On Mon, Jun 19, 2023 at 11:43:13AM +0530, Gautham R. Shenoy wrote:
> > Hello David,
> > 
> > 
> > On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> > [..snip..]
> > 
> > > +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
> > > +{
> > > +	unsigned long flags;
> > > +	struct swqueue *swqueue;
> > > +	bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> > > +	bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> > > +
> > > +	/*
> > > +	 * Only enqueue the task in the shared wakequeue if:
> > > +	 *
> > > +	 * - SWQUEUE is enabled
> > > +	 * - The task is on the wakeup path
> > > +	 * - The task wasn't purposefully migrated to the current rq by
> > > +	 *   select_task_rq()
> > > +	 * - The task isn't pinned to a specific CPU
> > > +	 */
> > > +	if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> > > +		return;
> > 
> > In select_task_rq_fair(), having determined if the target of task
> > wakeup should be the task's previous CPU vs the waker's current CPU,
> > we spend quite a bit of time already to determine if there is an idle
> > core/CPU in the target's LLC. @rq would correspond to CPU chosen as a
> > result of that scan or if no idle CPU exists, @rq corresponds to the
> > target CPU determined by wake_affine_idle()/wake_affine_weight().
> > 
> > So if the CPU of @rq is idle here, can we not simply return here?
> 
> Hi Gautham,
> 
> Sorry, I'm not sure I'm quite following the issue you're pointing out.
> We don't use swqueue if the task was migrated following
> select_task_rq_fair(). That's the idea with us returning if the task was
> migrated (the second conditional in that if). If I messed up that logic
> somehow, it should be fixed.

Sorry, my bad. I see it now.

So as per this patch, the only time we enqueue the task on the shared
wakequeue is if the target of try_to_wake_up() is the same CPU where the
task ran previously.

When wake_affine logic fails and the previous CPU is chosen as the
target, and when there are no other idle cores/threads in the LLC of
the previous CPU, it makes sense to queue the task on the
shared-wakequeue instead of on a busy previous CPU.

And when that previous CPU is idle, the try_to_wake_up() would have
woken it up via ttwu_queue(), so before going idle the next time it
will check the shared queue for the task and find it. We should be
good in this case.

Now, it is possible that select_task_rq_fair() ended up selecting the
waker's CPU as the target based on the
wake_affine_idle()/wake_affine_weight() logic. And if there is no idle
core/thread on the waker's LLC, the target would be the busy waker
CPU. In the case when the waker CPU is different from the task's
previous CPU, due to ENQUEUE_MIGRATE flag being set, the task won't be
queued on the shared wakequeue and instead has to wait on the busy
waker CPU.

I wonder if it makes sense to enqueue the task on the shared wakequeue
in this scenario as well.

> 
> > Or if the idea is to avoid the scan for an idle core/CPU in
> > select_task_rq_fair(), then 
> 
> No, swqueue_enqueue() is called from enqueue_task_fair(), not
> select_task_rq_fair(). If select_task_rq_fair() (which is of course
> called beforehand for a waking task) found an idle core/CPU, we don't
> bother using swqueue, as mentioned above.

Got it. Thanks!

> 
> Thanks,
> David

--
Thanks and Regards
gautham.
  
Peter Zijlstra June 21, 2023, 2:20 p.m. UTC | #23
On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> +struct swqueue {
> +	struct list_head list;
> +	spinlock_t lock;
> +} ____cacheline_aligned;

I'm thinking you can shard this just fine, it makes that pop() needs to
iterate all shards, but that shouldn't be a problem, and it would still
only need to take a single lock.

I'm thinking 4 or 8 shards should be plenty, even for Intel LLC.

>  #ifdef CONFIG_SMP

> +static struct task_struct *swqueue_pull_task(struct swqueue *swqueue)
> +{
> +	unsigned long flags;
> +
> +	struct task_struct *p;
> +
> +	spin_lock_irqsave(&swqueue->lock, flags);
> +	p = list_first_entry_or_null(&swqueue->list, struct task_struct,
> +				     swqueue_node);
> +	if (p)
> +		list_del_init(&p->swqueue_node);
> +	spin_unlock_irqrestore(&swqueue->lock, flags);
> +
> +	return p;
> +}

Would this not normally be called pop() or somesuch?

> +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
> +{
> +	unsigned long flags;
> +	struct swqueue *swqueue;
> +	bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> +	bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> +
> +	/*
> +	 * Only enqueue the task in the shared wakequeue if:
> +	 *
> +	 * - SWQUEUE is enabled
> +	 * - The task is on the wakeup path
> +	 * - The task wasn't purposefully migrated to the current rq by
> +	 *   select_task_rq()
> +	 * - The task isn't pinned to a specific CPU
> +	 */
> +	if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> +		return;

Elsewhere you mentioned heuristics, this smells like them. This and the
is_cpus_allowed() thing makes you lose plenty of opportunities.

> +	swqueue = rq_swqueue(rq);
> +	spin_lock_irqsave(&swqueue->lock, flags);
> +	list_add_tail(&p->swqueue_node, &swqueue->list);
> +	spin_unlock_irqrestore(&swqueue->lock, flags);
> +}
> +
>  static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf)
>  {
> -	return 0;
> +	struct swqueue *swqueue;
> +	struct task_struct *p = NULL;
> +	struct rq *src_rq;
> +	struct rq_flags src_rf;
> +	int ret;
> +
> +	swqueue = rq_swqueue(rq);
> +	if (!list_empty(&swqueue->list))
> +		p = swqueue_pull_task(swqueue);
> +
> +	if (!p)
> +		return 0;

At this point you can do the whole is_cpu_allowed() and avoid the whole
lock dance if not.

> +
> +	rq_unpin_lock(rq, rf);
> +	raw_spin_rq_unlock(rq);
> +
> +	src_rq = task_rq_lock(p, &src_rf);
> +
> +	if (task_on_rq_queued(p) && !task_on_cpu(rq, p))
> +		src_rq = migrate_task_to(src_rq, &src_rf, p, cpu_of(rq));

And then this becomes move_queued_task().

> +	if (src_rq->cpu != rq->cpu)
> +		ret = 1;
> +	else
> +		ret = -1;
> +
> +	task_rq_unlock(src_rq, p, &src_rf);
> +
> +	raw_spin_rq_lock(rq);
> +	rq_repin_lock(rq, rf);
> +
> +	return ret;
>  }
>  
>  static void swqueue_remove_task(struct task_struct *p)
> +{
> +	unsigned long flags;
> +	struct swqueue *swqueue;
> +
> +	if (!list_empty(&p->swqueue_node)) {
> +		swqueue = rq_swqueue(task_rq(p));
> +		spin_lock_irqsave(&swqueue->lock, flags);
> +		list_del_init(&p->swqueue_node);
> +		spin_unlock_irqrestore(&swqueue->lock, flags);
> +	}
> +}

dequeue()
  
Peter Zijlstra June 21, 2023, 2:22 p.m. UTC | #24
On Tue, Jun 20, 2023 at 02:54:23PM -0500, David Vernet wrote:

> Another consideration is that even if we could adjust newidle_balance()
> to load balance well enough for our specific purpose, we're still
> relying on heuristics to determine when it's appropriate to load
> balance; and that will 

Your thing has heuristics too.

But if we're going to do this, then it should replace newidle <= LLC
domains.
  
David Vernet June 21, 2023, 8:34 p.m. UTC | #25
On Wed, Jun 21, 2023 at 04:20:20PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> > +struct swqueue {
> > +	struct list_head list;
> > +	spinlock_t lock;
> > +} ____cacheline_aligned;
> 
> I'm thinking you can shard this just fine, it makes that pop() needs to
> iterate all shards, but that shouldn't be a problem, and it would still
> only need to take a single lock.

Is the idea here to have multiple sharded queues per LLC, with a single
lock protecting them? Assuming so, is the idea that it would avoid
bouncing the list heads amongst all of the cores' cachelines? I could
see that being useful in some scenarios, but it also feels a bit
complicated for what it gives you. If you have tasks being pulled
quickly I don't think that will help much because the list heads will
just bounce to the pullers. Also, if the lock is already heavily
contended, it seems possible that the cores inside a single shard could
still bounce the head back and forth amongst them, or the cache line
bouncing will be a small-order overhead compared to the lock itself.

Or did I misunderstand your suggestion entirely?

> I'm thinking 4 or 8 shards should be plenty, even for Intel LLC.
> 
> >  #ifdef CONFIG_SMP
> 
> > +static struct task_struct *swqueue_pull_task(struct swqueue *swqueue)
> > +{
> > +	unsigned long flags;
> > +
> > +	struct task_struct *p;
> > +
> > +	spin_lock_irqsave(&swqueue->lock, flags);
> > +	p = list_first_entry_or_null(&swqueue->list, struct task_struct,
> > +				     swqueue_node);
> > +	if (p)
> > +		list_del_init(&p->swqueue_node);
> > +	spin_unlock_irqrestore(&swqueue->lock, flags);
> > +
> > +	return p;
> > +}
> 
> Would this not normally be called pop() or somesuch?

Yes, I'll improve the name in the next iteration. swqueue_dequeue() and
swqueue_enqueue() seem like the most canonical. Let me know if you have another
preference.

> 
> > +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
> > +{
> > +	unsigned long flags;
> > +	struct swqueue *swqueue;
> > +	bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> > +	bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> > +
> > +	/*
> > +	 * Only enqueue the task in the shared wakequeue if:
> > +	 *
> > +	 * - SWQUEUE is enabled
> > +	 * - The task is on the wakeup path
> > +	 * - The task wasn't purposefully migrated to the current rq by
> > +	 *   select_task_rq()
> > +	 * - The task isn't pinned to a specific CPU
> > +	 */
> > +	if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> > +		return;
> 
> Elsewhere you mentioned heuristics, this smells like them. This and the
> is_cpus_allowed() thing makes you lose plenty of opportunities.

Yeah fair enough, these certainly are heuristics as well.

I thought it best to try and avoid swqueue getting in the way of
select_task_rq_fair() (at least to start out with), but we could always
remove that and run other experiments to see how it does.

> > +	swqueue = rq_swqueue(rq);
> > +	spin_lock_irqsave(&swqueue->lock, flags);
> > +	list_add_tail(&p->swqueue_node, &swqueue->list);
> > +	spin_unlock_irqrestore(&swqueue->lock, flags);
> > +}
> > +
> >  static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf)
> >  {
> > -	return 0;
> > +	struct swqueue *swqueue;
> > +	struct task_struct *p = NULL;
> > +	struct rq *src_rq;
> > +	struct rq_flags src_rf;
> > +	int ret;
> > +
> > +	swqueue = rq_swqueue(rq);
> > +	if (!list_empty(&swqueue->list))
> > +		p = swqueue_pull_task(swqueue);
> > +
> > +	if (!p)
> > +		return 0;
> 
> At this point you can do the whole is_cpu_allowed() and avoid the whole
> lock dance if not.

Good idea, will incorporate into the next iteration.

> > +
> > +	rq_unpin_lock(rq, rf);
> > +	raw_spin_rq_unlock(rq);
> > +
> > +	src_rq = task_rq_lock(p, &src_rf);
> > +
> > +	if (task_on_rq_queued(p) && !task_on_cpu(rq, p))
> > +		src_rq = migrate_task_to(src_rq, &src_rf, p, cpu_of(rq));
> 
> And then this becomes move_queued_task().

Yep, will make this change per your suggestion in [0].

[0]: https://lore.kernel.org/all/20230621130439.GF2053369@hirez.programming.kicks-ass.net/

> > +	if (src_rq->cpu != rq->cpu)
> > +		ret = 1;
> > +	else
> > +		ret = -1;
> > +
> > +	task_rq_unlock(src_rq, p, &src_rf);
> > +
> > +	raw_spin_rq_lock(rq);
> > +	rq_repin_lock(rq, rf);
> > +
> > +	return ret;
> >  }
> >  
> >  static void swqueue_remove_task(struct task_struct *p)
> > +{
> > +	unsigned long flags;
> > +	struct swqueue *swqueue;
> > +
> > +	if (!list_empty(&p->swqueue_node)) {
> > +		swqueue = rq_swqueue(task_rq(p));
> > +		spin_lock_irqsave(&swqueue->lock, flags);
> > +		list_del_init(&p->swqueue_node);
> > +		spin_unlock_irqrestore(&swqueue->lock, flags);
> > +	}
> > +}
> 
> dequeue()

Ack
  
David Vernet June 22, 2023, 1:43 a.m. UTC | #26
On Wed, Jun 21, 2023 at 01:47:00PM +0530, Gautham R. Shenoy wrote:
> Hello David,
> On Tue, Jun 20, 2023 at 03:08:22PM -0500, David Vernet wrote:
> > On Mon, Jun 19, 2023 at 11:43:13AM +0530, Gautham R. Shenoy wrote:
> > > Hello David,
> > > 
> > > 
> > > On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> > > [..snip..]
> > > 
> > > > +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
> > > > +{
> > > > +	unsigned long flags;
> > > > +	struct swqueue *swqueue;
> > > > +	bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> > > > +	bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> > > > +
> > > > +	/*
> > > > +	 * Only enqueue the task in the shared wakequeue if:
> > > > +	 *
> > > > +	 * - SWQUEUE is enabled
> > > > +	 * - The task is on the wakeup path
> > > > +	 * - The task wasn't purposefully migrated to the current rq by
> > > > +	 *   select_task_rq()
> > > > +	 * - The task isn't pinned to a specific CPU
> > > > +	 */
> > > > +	if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> > > > +		return;
> > > 
> > > In select_task_rq_fair(), having determined if the target of task
> > > wakeup should be the task's previous CPU vs the waker's current CPU,
> > > we spend quite a bit of time already to determine if there is an idle
> > > core/CPU in the target's LLC. @rq would correspond to CPU chosen as a
> > > result of that scan or if no idle CPU exists, @rq corresponds to the
> > > target CPU determined by wake_affine_idle()/wake_affine_weight().
> > > 
> > > So if the CPU of @rq is idle here, can we not simply return here?
> > 
> > Hi Gautham,
> > 
> > Sorry, I'm not sure I'm quite following the issue you're pointing out.
> > We don't use swqueue if the task was migrated following
> > select_task_rq_fair(). That's the idea with us returning if the task was
> > migrated (the second conditional in that if). If I messed up that logic
> > somehow, it should be fixed.
> 
> Sorry, my bad. I see it now.
> 
> So as per this patch, the only time we enqueue the task on the shared
> wakequeue is if the target of try_to_wake_up() is the same CPU where the
> task ran previously.
> 
> When wake_affine logic fails and the previous CPU is chosen as the
> target, and when there are no other idle cores/threads in the LLC of
> the previous CPU, it makes sense to queue the task on the
> shared-wakequeue instead of on a busy previous CPU.
> 
> And when that previous CPU is idle, the try_to_wake_up() would have
> woken it up via ttwu_queue(), so before going idle the next time it
> will check the shared queue for the task and find it. We should be
> good in this case.
> 
> Now, it is possible that select_task_rq_fair() ended up selecting the
> waker's CPU as the target based on the
> wake_affine_idle()/wake_affine_weight() logic. And if there is no idle
> core/thread on the waker's LLC, the target would be the busy waker
> CPU. In the case when the waker CPU is different from the task's
> previous CPU, due to ENQUEUE_MIGRATE flag being set, the task won't be
> queued on the shared wakequeue and instead has to wait on the busy
> waker CPU.
> 
> I wonder if it makes sense to enqueue the task on the shared wakequeue
> in this scenario as well.

Hello Gautham,

That's a good point. My original intention with opting out of using
swqueue if select_task_rq_fair() caused us to migrate is so that it
wouldn't interfere with decisions made with other select_task_rq_fair()
heuristics like wake_affine_*(). Basically just minimizing the possible
impact of swqueue. That said, I think it probably does make sense to
just enqueue in the swqueue regardless of whether ENQUEUE_MIGRATED is
set. One of the main goals of swqueue is work conservation, and in
hindsight it does feel somewhat artificial to add a heuristic that works
against that.

I'd like to hear what others think. In my opinion it's worth at least
running some tests on workloads that heavily utilize the CPU such as
kernel compile, and seeing what the outcomes are.

Thanks,
David
  
Gautham R. Shenoy June 22, 2023, 9:11 a.m. UTC | #27
On Wed, Jun 21, 2023 at 08:43:29PM -0500, David Vernet wrote:
> On Wed, Jun 21, 2023 at 01:47:00PM +0530, Gautham R. Shenoy wrote:
> > Hello David,
> > On Tue, Jun 20, 2023 at 03:08:22PM -0500, David Vernet wrote:
> > > On Mon, Jun 19, 2023 at 11:43:13AM +0530, Gautham R. Shenoy wrote:
> > > > Hello David,
> > > > 
> > > > 
> > > > On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> > > > [..snip..]
> > > > 
> > > > > +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
> > > > > +{
> > > > > +	unsigned long flags;
> > > > > +	struct swqueue *swqueue;
> > > > > +	bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> > > > > +	bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> > > > > +
> > > > > +	/*
> > > > > +	 * Only enqueue the task in the shared wakequeue if:
> > > > > +	 *
> > > > > +	 * - SWQUEUE is enabled
> > > > > +	 * - The task is on the wakeup path
> > > > > +	 * - The task wasn't purposefully migrated to the current rq by
> > > > > +	 *   select_task_rq()
> > > > > +	 * - The task isn't pinned to a specific CPU
> > > > > +	 */
> > > > > +	if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> > > > > +		return;
> > > > 
> > > > In select_task_rq_fair(), having determined if the target of task
> > > > wakeup should be the task's previous CPU vs the waker's current CPU,
> > > > we spend quite a bit of time already to determine if there is an idle
> > > > core/CPU in the target's LLC. @rq would correspond to CPU chosen as a
> > > > result of that scan or if no idle CPU exists, @rq corresponds to the
> > > > target CPU determined by wake_affine_idle()/wake_affine_weight().
> > > > 
> > > > So if the CPU of @rq is idle here, can we not simply return here?
> > > 
> > > Hi Gautham,
> > > 
> > > Sorry, I'm not sure I'm quite following the issue you're pointing out.
> > > We don't use swqueue if the task was migrated following
> > > select_task_rq_fair(). That's the idea with us returning if the task was
> > > migrated (the second conditional in that if). If I messed up that logic
> > > somehow, it should be fixed.
> > 
> > Sorry, my bad. I see it now.
> > 
> > So as per this patch, the only time we enqueue the task on the shared
> > wakequeue is if the target of try_to_wake_up() is the same CPU where the
> > task ran previously.
> > 
> > When wake_affine logic fails and the previous CPU is chosen as the
> > target, and when there are no other idle cores/threads in the LLC of
> > the previous CPU, it makes sense to queue the task on the
> > shared-wakequeue instead of on a busy previous CPU.
> > 
> > And when that previous CPU is idle, the try_to_wake_up() would have
> > woken it up via ttwu_queue(), so before going idle the next time it
> > will check the shared queue for the task and find it. We should be
> > good in this case.
> > 
> > Now, it is possible that select_task_rq_fair() ended up selecting the
> > waker's CPU as the target based on the
> > wake_affine_idle()/wake_affine_weight() logic. And if there is no idle
> > core/thread on the waker's LLC, the target would be the busy waker
> > CPU. In the case when the waker CPU is different from the task's
> > previous CPU, due to ENQUEUE_MIGRATE flag being set, the task won't be
> > queued on the shared wakequeue and instead has to wait on the busy
> > waker CPU.
> > 
> > I wonder if it makes sense to enqueue the task on the shared wakequeue
> > in this scenario as well.
> 
> Hello Gautham,
> 
> That's a good point. My original intention with opting out of using
> swqueue if select_task_rq_fair() caused us to migrate is so that it
> wouldn't interfere with decisions made with other select_task_rq_fair()
> heuristics like wake_affine_*(). Basically just minimizing the possible
> impact of swqueue.
> That said, I think it probably does make sense to
> just enqueue in the swqueue regardless of whether ENQUEUE_MIGRATED is
> set. One of the main goals of swqueue is work conservation, and in
> hindsight it does feel somewhat artificial to add a heuristic that works
> against that.

In that case we can perhaps have an explicit flag that is passed by
try_to_wake_up() when it cannot find an idle CPU and the chosen target
is just a fallback. The task gets enqueued on the swqueue of the
target only in such cases. Something like the following entirely
untested:

---------------------------- x8----------------------------
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b64fec55a381..38005262a7fe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -910,6 +910,12 @@ struct task_struct {
 	 */
 	unsigned			sched_remote_wakeup:1;
 
+	/*
+	 * Bit used by select_idle_sibling() to signal enqueuing the
+	 * task on a shared wakequeue.
+	 */
+	unsigned			sched_add_to_swq:1;
+
 	/* Bit to tell LSMs we're in execve(): */
 	unsigned			in_execve:1;
 	unsigned			in_iowait:1;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e311d1c3b992..f4246c33f3c5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -215,21 +215,17 @@ static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
 {
 	unsigned long flags;
 	struct swqueue *swqueue;
-	bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
-	bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
 
 	/*
 	 * Only enqueue the task in the shared wakequeue if:
 	 *
 	 * - SWQUEUE is enabled
-	 * - The task is on the wakeup path
-	 * - The task wasn't purposefully migrated to the current rq by
-	 *   select_task_rq()
-	 * - The task isn't pinned to a specific CPU
+	 * - select_idle_sibling() didn't find an idle CPU to enqueue this wakee task.
 	 */
-	if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
+	if (!READ_ONCE(p->sched_add_to_swq))
 		return;
 
+	WRITE_ONCE(p->sched_add_to_swq, 0);
 	swqueue = rq_swqueue(rq);
 	spin_lock_irqsave(&swqueue->lock, flags);
 	list_add_tail(&p->swqueue_node, &swqueue->list);
@@ -7361,6 +7357,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
+	/*
+	 * No idle sibling was found. Ok to queue this task on the
+	 * shared wakequeue of the target.
+	 */
+	WRITE_ONCE(p->sched_add_to_swq, 1);
 	return target;
 }
 

> 
> I'd like to hear what others think. In my opinion it's worth at least
> running some tests on workloads that heavily utilize the CPU such as
> kernel compile, and seeing what the outcomes are.

I will try and get some numbers for such workloads on our setup over
this weekend.

> 
> Thanks,
> David

--
Thanks and Regards
gautham.
  
Peter Zijlstra June 22, 2023, 10:29 a.m. UTC | #28
On Thu, Jun 22, 2023 at 02:41:57PM +0530, Gautham R. Shenoy wrote:

> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -215,21 +215,17 @@ static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
>  {
>  	unsigned long flags;
>  	struct swqueue *swqueue;
> -	bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> -	bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
>  
>  	/*
>  	 * Only enqueue the task in the shared wakequeue if:
>  	 *
>  	 * - SWQUEUE is enabled
> -	 * - The task is on the wakeup path
> -	 * - The task wasn't purposefully migrated to the current rq by
> -	 *   select_task_rq()
> -	 * - The task isn't pinned to a specific CPU
> +	 * - select_idle_sibling() didn't find an idle CPU to enqueue this wakee task.
>  	 */
> -	if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)

You lost the nr_cpus_allowed == 1 case, that actually made sense, except
he didn't have a hook in set_cpus_allowed() to fix it back up.

> +	if (!READ_ONCE(p->sched_add_to_swq))
>  		return;

So suppose we fill all CPUs with tasks T_{0..n-1}, sis finds them a
nice idle cpu and we don't get queued.

Then wake up another n tasks T_{n...2n-1}, none of them find an idle
CPU, as per the above them all taken. These all do get queued.

However, suppose T>=n does a wakeup-preemption, and we end up with T>=n
running while T<n get queued on the rq, but not the 'queue' (I so hate
the swqueue name).

Then one CPU has both it's tasks go 'away' and becomes idle.

At this point the 'queue' contains only running tasks that are not
migratable and all the tasks that could be migrated are not on the queue
-- IOW it is completely useless.

Is this the intention?
  
Peter Zijlstra June 22, 2023, 10:58 a.m. UTC | #29
On Wed, Jun 21, 2023 at 03:34:45PM -0500, David Vernet wrote:
> On Wed, Jun 21, 2023 at 04:20:20PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> > > +struct swqueue {
> > > +	struct list_head list;
> > > +	spinlock_t lock;
> > > +} ____cacheline_aligned;
> > 
> > I'm thinking you can shard this just fine, it makes that pop() needs to
> > iterate all shards, but that shouldn't be a problem, and it would still
> > only need to take a single lock.
> 
> Is the idea here to have multiple sharded queues per LLC, with a single
> lock protecting them? 

No, each shard will have its own lock, each shard its own cacheline.

> Or did I misunderstand your suggestion entirely?

Yup, so each shard is basically the above struct, whole cacheline with a
lock and list. Then on enqueue we hash to a shard (possibly based on cpu
-- this gives the least amount of bounces) and stick it on.

Then on pop we iterate the shards, the first non-empty shard gets the
lock taken, and pop'ed. If you start iteration based on the cpu->shard
map used for enqueue you can even get a sense of 'locality' depending on
the mapping.

So pop is O(n) in shards, but on average will only take a single lock --
there is an obvious race where it might see a non-empty queue, but race
on lock and find the queue empty once acquired, but meh.
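
Something like the following, just to make the sharding concrete (a
sketch only, not code from this series; SWQUEUE_SHARDS and the
swqueue_push()/swqueue_pop() names are made up):

#define SWQUEUE_SHARDS	8

struct swqueue_shard {
	struct list_head	list;
	spinlock_t		lock;
} ____cacheline_aligned;

struct swqueue {
	struct swqueue_shard	shards[SWQUEUE_SHARDS];
};

/* Enqueue: hash the waking CPU to a shard to keep enqueues local-ish. */
static void swqueue_push(struct swqueue *swq, struct task_struct *p, int cpu)
{
	struct swqueue_shard *shard = &swq->shards[cpu % SWQUEUE_SHARDS];
	unsigned long flags;

	spin_lock_irqsave(&shard->lock, flags);
	list_add_tail(&p->swqueue_node, &shard->list);
	spin_unlock_irqrestore(&shard->lock, flags);
}

/*
 * Pop: scan the shards starting from this CPU's own shard, so that on
 * average only one lock is taken and a little locality is retained. The
 * unlocked list_empty() check is the benign race mentioned above.
 */
static struct task_struct *swqueue_pop(struct swqueue *swq, int cpu)
{
	int i, start = cpu % SWQUEUE_SHARDS;
	struct task_struct *p = NULL;
	unsigned long flags;

	for (i = 0; i < SWQUEUE_SHARDS; i++) {
		struct swqueue_shard *shard =
			&swq->shards[(start + i) % SWQUEUE_SHARDS];

		if (list_empty(&shard->list))
			continue;

		spin_lock_irqsave(&shard->lock, flags);
		p = list_first_entry_or_null(&shard->list, struct task_struct,
					     swqueue_node);
		if (p)
			list_del_init(&p->swqueue_node);
		spin_unlock_irqrestore(&shard->lock, flags);
		if (p)
			return p;
	}
	return NULL;
}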

> > I'm thinking 4 or 8 shards should be plenty, even for Intel LLC.
> > 
> > >  #ifdef CONFIG_SMP
> > 
> > > +static struct task_struct *swqueue_pull_task(struct swqueue *swqueue)
> > > +{
> > > +	unsigned long flags;
> > > +
> > > +	struct task_struct *p;
> > > +
> > > +	spin_lock_irqsave(&swqueue->lock, flags);
> > > +	p = list_first_entry_or_null(&swqueue->list, struct task_struct,
> > > +				     swqueue_node);
> > > +	if (p)
> > > +		list_del_init(&p->swqueue_node);
> > > +	spin_unlock_irqrestore(&swqueue->lock, flags);
> > > +
> > > +	return p;
> > > +}
> > 
> > Would this not normally be called pop() or somesuch?
> 
> Yes, I'll improve the name in the next iteration. swqueue_dequeue() and
> swqueue_enqueue() seem like the most canonical. Let me know if you have another
> preference.

Well, there's two layers intermixed here, which is, I think, what gave
rise to the whole wonky naming.

If you implement the queue primitives as: push, pop, del
and then build, using them, the scheduler primitives: enqueue, dequeue,
pick, things should become saner.
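
Roughly (a sketch, not code from the series; it leans on the
hypothetical swqueue_push()/swqueue_pop() helpers from above and elides
the cross-rq migration that the real pick path has to do):

/* Scheduler layer: policy only; all list handling is in the queue layer. */
static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
{
	if (!(enq_flags & ENQUEUE_WAKEUP) || p->nr_cpus_allowed == 1)
		return;

	swqueue_push(rq_swqueue(rq), p, cpu_of(rq));
}

/* Called before newidle_balance() when this CPU is about to go idle. */
static struct task_struct *swqueue_pick(struct rq *rq)
{
	return swqueue_pop(rq_swqueue(rq), cpu_of(rq));
}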

> > > +	if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> > > +		return;
> > 
> > Elsewhere you mentioned heuristics, this smells like them. This and the
> > is_cpus_allowed() thing makes you lose plenty of opportunities.
> 
> Yeah fair enough, these certainly are heuristics as well.
> 
> I thought it best to try and avoid swqueue getting in the way of
> select_task_rq_fair() (at least to start out with), but we could always
> remove that and run other experiments to see how it does.

So, part of the problem is that you might be hooking at the wrong level
(see also my earlier email about the queue only containing running tasks
and none of the ready tasks).

If you were to hook at the __{en,de}queue_entity() level you'll
automagically dequeue running tasks and enqueue any ready tasks. OTOH
you get to deal with the problem that not all entities are tasks, but
that shouldn't be too hard.
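
As a rough sketch of where such hooks would sit (illustrative only; it
reuses the hypothetical swqueue_push() from above plus the series'
swqueue_remove_task(), and glosses over throttling, locking and the fact
that cfs_rq->curr is not kept in the rbtree):

static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	rb_add_cached(&se->run_node, &cfs_rq->tasks_timeline, __entity_less);

	/* Only tasks are stealable; skip group entities. */
	if (entity_is_task(se))
		swqueue_push(rq_swqueue(rq_of(cfs_rq)), task_of(se),
			     cpu_of(rq_of(cfs_rq)));
}

static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline);

	/*
	 * Covers both "picked to run" and "went to sleep": either way the
	 * task is no longer sitting ready in the tree, so stop advertising
	 * it as stealable.
	 */
	if (entity_is_task(se))
		swqueue_remove_task(task_of(se));
}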

Also, you end up with more enqueue/dequeue -- so that might hurt.

Also, at this point you're nr_cpus/N away from simply scanning the
runqueues and pulling things from non-empty runqueues. But for large
nr_cpus and smallish N it might be a win over-all. Dunno..
  
David Vernet June 22, 2023, 2:43 p.m. UTC | #30
On Thu, Jun 22, 2023 at 12:58:41PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 21, 2023 at 03:34:45PM -0500, David Vernet wrote:
> > On Wed, Jun 21, 2023 at 04:20:20PM +0200, Peter Zijlstra wrote:
> > > On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> > > > +struct swqueue {
> > > > +	struct list_head list;
> > > > +	spinlock_t lock;
> > > > +} ____cacheline_aligned;
> > > 
> > > I'm thinking you can shard this just fine, it means that pop() needs to
> > > iterate all shards, but that shouldn't be a problem, and it would still
> > > only need to take a single lock.
> > 
> > Is the idea here to have multiple sharded queues per LLC, with a single
> > lock protecting them? 
> 
> No, each shard will have its own lock, each shard its own cacheline.
> 
> > Or did I misunderstand your suggestion entirely?
> 
> Yup, so each shard is basically the above struct, whole cacheline with a
> lock and list. Then on enqueue we hash to a shard (possibly based on cpu
> -- this gives the least amount of bounces) and stick it on.
> 
> Then on pop we iterate the shards, the first non-empty shard gets the
> lock taken, and pop'ed. If you start iteration based on the cpu->shard
> map used for enqueue you can even get a sense of 'locality' depending on
> the mapping.

That's nifty. Even if we don't get locality from it, it seems prudent to
do so to avoid contention.

> So pop is O(n) in shards, but on average will only take a single lock --

Ah, so it's 1 lock on average -- thanks for clarifying.

I like this idea and will implement it for v2. I'm not keen on the idea
of artificially having multiple independent swqueues (yes, I'll think of
a less evil name) per-LLC to avoid contending, but sharding like this
seems like a simple but effective solution to the scalability issue;
assuming a sufficiently high N doesn't slow down the pop path by
too much as you pointed out below.

> there is an obvious race where it might see a non-empty queue, but race
> on lock and find the queue empty once acquired, but meh.

Yep, I would be surprised if this race caused a problem in practice.
Especially if we spread the poppers by starting iteration based on the
cpu->shard map as you suggested.

> > > I'm thinking 4 or 8 shards should be plenty, even for Intel LLC.

I'll try out a few combinations and see what works.

> > > 
> > > >  #ifdef CONFIG_SMP
> > > 
> > > > +static struct task_struct *swqueue_pull_task(struct swqueue *swqueue)
> > > > +{
> > > > +	unsigned long flags;
> > > > +
> > > > +	struct task_struct *p;
> > > > +
> > > > +	spin_lock_irqsave(&swqueue->lock, flags);
> > > > +	p = list_first_entry_or_null(&swqueue->list, struct task_struct,
> > > > +				     swqueue_node);
> > > > +	if (p)
> > > > +		list_del_init(&p->swqueue_node);
> > > > +	spin_unlock_irqrestore(&swqueue->lock, flags);
> > > > +
> > > > +	return p;
> > > > +}
> > > 
> > > Would this not normally be called pop() or somesuch?
> > 
> > Yes, I'll improve the name in the next iteration. swqueue_dequeue() and
> > swqueue_enqueue() seem like the most canonical. Let me know if you have another
> > preference.
> 
> Well, there's two layers intermixed here, which is, I think, what gave
> rise to the whole wonky naming.
> 
> If you implement the queue primitives as: push, pop, del
> and then build, using them, the scheduler primitives: enqueue, dequeue,
> pick, things should become saner.

Right, will do

> > > > +	if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> > > > +		return;
> > > 
> > > Elsewhere you mentioned heuristics, this smells like them. This and the
> > > is_cpus_allowed() thing makes you lose plenty of opportunities.
> > 
> > Yeah fair enough, these certainly are heuristics as well.
> > 
> > I thought it best to try and avoid swqueue getting in the way of
> > select_task_rq_fair() (at least to start out with), but we could always
> > remove that and run other experiments to see how it does.
> 
> So, part of the problem is that you might be hooking at the wrong level
> (see also my earlier email about the queue only containing running tasks
> and none of the ready tasks).
> 
> If you were to hook at the __{en,de}queue_entity() level you'll
> automagically dequeue running tasks and enqueue any ready tasks. OTOH
> you get to deal with the problem that not all entities are tasks, but
> that shouldn't be too hard.
> 
> Also, you end up with more enqueue/dequeue -- so that might hurt.

If this doesn't cause untenable contention then it seems preferable if
the overall goal is work conservation. Using your shard idea would
hopefully mitigate that concern somewhat as well. I'll try this out for
v2 and compare it to keeping the enqueues to only be on the wakeup path.

If there's too much contention we can always just go back to the wakeup
case, but I'm hopeful this works out.

> Also, at this point you're nr_cpus/N away from simply scanning the
> runqueues and pulling things from non-empty runqueues. But for large
> nr_cpus and smallish N it might be a win over-all. Dunno..

Given that we're not seeing any contention on smaller LLCs but still
observing perf wins, I'm optimistic.

Thanks,
David
  
Chris Mason June 22, 2023, 3:57 p.m. UTC | #31
On 6/21/23 2:03 AM, Aaron Lu wrote:
> On Wed, Jun 21, 2023 at 12:43:52AM -0500, David Vernet wrote:
>> On Wed, Jun 21, 2023 at 12:54:16PM +0800, Aaron Lu wrote:
>>> On Tue, Jun 20, 2023 at 09:43:00PM -0500, David Vernet wrote:
>>>> On Wed, Jun 21, 2023 at 10:35:34AM +0800, Aaron Lu wrote:
>>>>> On Tue, Jun 20, 2023 at 12:36:26PM -0500, David Vernet wrote:

[ ... ]

>>>> I'm not sure what we're hoping to gain by continuing to run various
>>>> netperf workloads with your specific parameters?
>>>
>>> I don't quite follow you.
>>>
>>> I thought we were in the process of figuring out why for the same
>>> workload(netperf/default_mode/nr_client=nr_cpu) on two similar
>>> machines(both are Skylake) you saw no contention while I saw some so I
>>> tried to be exact on how I run the workload.
>>
>> I just reran the workload on a 26 core / 52 thread Cooper Lake using
>> your exact command below and still don't observe any contention
>> whatsoever on the swqueue lock:
> 
> Well, it's a puzzle to me.
> 
> But as you said below, I guess I'll just move on.

Thanks for bringing this up Aaron.  The discussion moved on to different
ways to fix the netperf triggered contention, but I wanted to toss this
out as an easy way to see the same problem:

# swqueue disabled:
# ./schbench -L -m 52 -p 512 -r 10 -t 1
Wakeup Latencies percentiles (usec) runtime 10 (s) (14674354 total samples)
          20.0th: 8          (4508866 samples)
          50.0th: 11         (2879648 samples)
          90.0th: 35         (5865268 samples)
        * 99.0th: 70         (1282166 samples)
          99.9th: 110        (124040 samples)
          min=1, max=9312
avg worker transfer: 28211.91 ops/sec 13.78MB/s

During the swqueue=0 run,  the system was ~30% idle

# swqueue enabled:
# ./schbench -L -m 52 -p 512 -r 10 -t 1
Wakeup Latencies percentiles (usec) runtime 10 (s) (6448414 total samples)
          20.0th: 30         (1383423 samples)
          50.0th: 39         (1986827 samples)
          90.0th: 63         (2446965 samples)
        * 99.0th: 108        (567275 samples)
          99.9th: 158        (57487 samples)
          min=1, max=15018
avg worker transfer: 12395.27 ops/sec 6.05MB/s

During the swqueue=1 run, the CPU was at 97% system time, all stuck
on spinlock contention in the scheduler.

This is a single socket cooperlake with 26 cores/52 threads.

The work is similar to perf pipe test, 52 messenger threads each bouncing
a message back and forth with their own private worker for a 10 second run.

Adding more messenger threads (-m 128) increases the swqueue=0 ops/sec
to about 19MB/s and drags down the swqueue=1 ops/sec to 2MB/s.

-chris
  
Gautham R. Shenoy June 23, 2023, 9:50 a.m. UTC | #32
On Thu, Jun 22, 2023 at 12:29:35PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 22, 2023 at 02:41:57PM +0530, Gautham R. Shenoy wrote:
> 
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -215,21 +215,17 @@ static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
> >  {
> >  	unsigned long flags;
> >  	struct swqueue *swqueue;
> > -	bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> > -	bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> >  
> >  	/*
> >  	 * Only enqueue the task in the shared wakequeue if:
> >  	 *
> >  	 * - SWQUEUE is enabled
> > -	 * - The task is on the wakeup path
> > -	 * - The task wasn't purposefully migrated to the current rq by
> > -	 *   select_task_rq()
> > -	 * - The task isn't pinned to a specific CPU
> > +	 * - select_idle_sibling() didn't find an idle CPU to enqueue this wakee task.
> >  	 */
> > -	if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> 
> You lost the nr_cpus_allowed == 1 case, that actually made sense, except
> he didn't have a hook in set_cpus_allowed() to fix it back up.
>

I didn't retain the "nr_cpus_allowed == 1" check because
select_task_rq() calls select_task_rq_fair() only when
p->nr_cpus_allowed > 1.

So if p was affined to a single CPU, it would always have
p->sched_add_to_swq set to 0, thereby not queueing it on the 'queue'.
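
For reference, the gating being referred to is in select_task_rq() in
kernel/sched/core.c, which looks roughly like this (simplified
paraphrase; the fallback handling for disallowed CPUs is omitted):

static inline
int select_task_rq(struct task_struct *p, int cpu, int wake_flags)
{
	lockdep_assert_held(&p->pi_lock);

	if (p->nr_cpus_allowed > 1 && !is_migration_disabled(p))
		cpu = p->sched_class->select_task_rq(p, cpu, wake_flags);
	else
		cpu = cpumask_any(p->cpus_ptr);

	return cpu;
}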


> > +	if (!READ_ONCE(p->sched_add_to_swq))
> >  		return;
> 
> So suppose we fill all CPUs with tasks T_{0..n-1}, sis finds them a
> nice idle cpu and we don't get queued.

Yes. 

> 
> Then wake up another n tasks T_{n...2n-1}, none of them find an idle
> CPU, as per the above they're all taken. These all do get queued.

Yes.

> 
> However, suppose T>=n does a wakeup-preemption, and we end up with T>=n
> running while T<n get queued on the rq, but not the 'queue' (I so hate
> the swqueue name).

Yes. But note that prior to running T>=n on the CPU, it will be
removed from the 'queue'. So if every T>=n preempts the corresponding
T<n, then the queue becomes empty. And yes, in this case the queue
served no purpose and does not justify the additional overhead.

> 
> Then one CPU has both its tasks go 'away' and becomes idle.

Yes, and this CPU prior to newidle_balance() will now search the
'queue'.

> 
> At this point the 'queue' contains only running tasks that are not
> migratable and all the tasks that could be migrated are not on the queue
> -- IOW it is completely useless.

At this point, the CPU will call swqueue_pick_next_task() and we can
have one of the three cases:

    * The 'queue' is empty: As mentioned above, the queue didn't do
      anything for the tasks and was completely useless.

    * The task at the head of the 'queue' is not runnable on the CPU which
      is going idle. Again, wasted effort of adding this task to the
      queue, because in the current implementation, the task is
      removed from the 'queue' even when it is not runnable on this
      CPU.

    * The 'queue' has some task which can be run on the CPU going
      idle. The task is successfully migrated to this CPU, which no
      longer has to do newidle_balance() and the task has reduced its
      waiting period.

> 
> Is this the intention?

IIUC, the intention is to reduce the avg post-wakeup wait time of the
task by running it on a newidle CPU, when the task's CPU is still busy
running something else. This is assuming that the task won't win
wakeup-preemption. But the flipside is the additional cost of swqueue
enqueue/dequeue.

I have queued a run across a set of benchmarks, but just to get an
idea how the patch performs, I ran hackbench and this is what the
results look like on a 2 Socket Gen3 EPYC configured in NPS=2 with 64
cores 128 threads per socket:

Test:            tip                  tip_swq_disabled        tip_swq_enabled
=============================================================================
1-groups:       3.82 (0.00 pct)       3.84 (-0.52 pct)        3.46 (+9.42 pct)
2-groups:       4.40 (0.00 pct)       4.42 (-0.45 pct)        3.66 (+16.81 pct)
4-groups:       4.84 (0.00 pct)       4.82 (+0.41 pct)        4.50 (+7.02 pct)
8-groups:       5.45 (0.00 pct)       5.43 (+0.36 pct)        5.85 (-7.33 pct)
16-groups:      6.94 (0.00 pct)       6.90 (+0.57 pct)        8.41 (-21.18 pct)
 
We can see that the patchset does quite well with 1,2,4 groups, but with 8
and 16 groups as the system becomes overloaded, we can see the
performance degrade. In the overloaded case, I could observe
native_queued_spin_lock_slowpath() looming large in the perf profiles
when the swqueue feature was enabled. So perhaps the cost of swqueue
enqueue/dequeue is not justifiable, especially if the tasks are anyway
going to run on their target CPUs.

I will post more results later.
--
Thanks and Regards
gautham.
  
Gautham R. Shenoy June 26, 2023, 6:04 a.m. UTC | #33
Hello Peter, David,

On Fri, Jun 23, 2023 at 03:20:15PM +0530, Gautham R. Shenoy wrote:
> On Thu, Jun 22, 2023 at 12:29:35PM +0200, Peter Zijlstra wrote:
> > On Thu, Jun 22, 2023 at 02:41:57PM +0530, Gautham R. Shenoy wrote:

> 
> I will post more results later.

I was able to get some numbers for hackbench, schbench (old), and
tbench over the weekend on a 2 Socket Zen3 box with 64 cores 128
threads per socket configured in NPS1 mode.

The legend is as follows:

tip : tip/sched/core with HEAD being commit e2a1f85bf9f5 ("sched/psi:
      Avoid resetting the min update period when it is unnecessary")


david : This patchset

david-ego-1 : David's patchset + my modification to allow SIS to signal
              that a task should be queued on the shared-wakequeue when SIS cannot
              find an idle CPU to wake up the task.

david-ego-2 : David's patchset + david-ego-1 + my modification to
	      remove the first task from the shared-wakequeue whose
	      cpus_allowed contains this CPU. Currently we don't do
	      this check and always remove the first task. 


david-ego-1 and david-ego-2 are attached with this mail.

hackbench (Measure: time taken to complete, in seconds)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Test:         tip                david                david-ego-1          david-ego-2
1-groups:     3.92 (0.00 pct)    3.35 (14.54 pct)     3.53 (9.94 pct)      3.30 (15.81 pct)
2-groups:     4.58 (0.00 pct)    3.89 (15.06 pct)     3.95 (13.75 pct)     3.79 (17.24 pct)
4-groups:     4.99 (0.00 pct)    4.42 (11.42 pct)     4.76 (4.60 pct)      4.77 (4.40 pct)
8-groups:     5.67 (0.00 pct)    5.08 (10.40 pct)     6.16 (-8.64 pct)     6.33 (-11.64 pct)
16-groups:    7.88 (0.00 pct)    7.32 (7.10 pct)      8.57 (-8.75 pct)     9.77 (-23.98 pct)


Observation: We see that David's patchset does very well across all
the groups.  Expanding the scope of the shared-wakequeue with
david-ego-1 doesn't give us much and in fact hurts at higher
utilization. Same is the case with david-ego-2 which only pulls
allowed tasks from the shared-wakequeue. In david-ego-2 we see a
greater amount of spin-lock contention for 8 and 16 groups, as the
code holds the spinlock and iterates through the list members while
checking cpu-affinity.

So, David's original patchset wins this one.




schbench (Measure : 99th Percentile latency, in us)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#workers: tip                     david                   david-ego-1             david-ego-2
 1:      26.00 (0.00 pct)         21.00 (19.23 pct)       28.00 (-7.69 pct)       22.00 (15.38 pct)
 2:      27.00 (0.00 pct)         29.00 (-7.40 pct)       28.00 (-3.70 pct)       30.00 (-11.11 pct)
 4:      31.00 (0.00 pct)         31.00 (0.00 pct)        31.00 (0.00 pct)        28.00 (9.67 pct)
 8:      36.00 (0.00 pct)         37.00 (-2.77 pct)       34.00 (5.55 pct)        39.00 (-8.33 pct)
16:      49.00 (0.00 pct)         49.00 (0.00 pct)        48.00 (2.04 pct)        50.00 (-2.04 pct)
32:      80.00 (0.00 pct)         80.00 (0.00 pct)        88.00 (-10.00 pct)      79.00 (1.25 pct)
64:     169.00 (0.00 pct)        180.00 (-6.50 pct)      174.00 (-2.95 pct)      168.00 (0.59 pct)
128:     343.00 (0.00 pct)       355.00 (-3.49 pct)      356.00 (-3.79 pct)      344.00 (-0.29 pct)
256:     42048.00 (0.00 pct)   46528.00 (-10.65 pct)   51904.00 (-23.43 pct)   48064.00 (-14.30 pct)
512:     95104.00 (0.00 pct)   95872.00 (-0.80 pct)    95360.00 (-0.26 pct)    97152.00 (-2.15 pct)


Observations: There are run-to-run variations with this benchmark. I
will try with the newer schbench later this week. 

tbench (Measure: Throughput, records/s)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Clients: tip			 sis-node		 david			 david-ego-1		 david-ego-2
    1	 452.49 (0.00 pct)	 457.94 (1.20 pct)	 448.52 (-0.87 pct)	 447.11 (-1.18 pct)	 458.45 (1.31 pct)
    2	 862.44 (0.00 pct)	 879.99 (2.03 pct)	 860.14 (-0.26 pct)	 873.27 (1.25 pct)	 891.72 (3.39 pct)
    4	 1604.27 (0.00 pct)	 1618.87 (0.91 pct)	 1610.95 (0.41 pct)	 1628.45 (1.50 pct)	 1657.26 (3.30 pct)
    8	 2966.77 (0.00 pct)	 3040.90 (2.49 pct)	 2991.07 (0.81 pct)	 3063.31 (3.25 pct)	 3106.50 (4.70 pct)
   16	 5176.70 (0.00 pct)	 5292.29 (2.23 pct)	 5478.32 (5.82 pct)	 5462.05 (5.51 pct)	 5537.15 (6.96 pct)
   32	 8205.24 (0.00 pct)	 8949.12 (9.06 pct)	 9039.63 (10.16 pct)	 9466.07 (15.36 pct)	 9365.06 (14.13 pct)
   64	 13956.71 (0.00 pct)	 14461.42 (3.61 pct)	 16337.65 (17.05 pct)	 16941.63 (21.38 pct)	 15697.47 (12.47 pct)
  128	 24005.50 (0.00 pct)	 26052.75 (8.52 pct)	 25605.24 (6.66 pct)	 27243.19 (13.48 pct)	 24854.60 (3.53 pct)
  256	 32457.61 (0.00 pct)	 21999.41 (-32.22 pct)	 36953.22 (13.85 pct)	 32299.31 (-0.48 pct)	 33037.03 (1.78 pct)
  512	 34345.24 (0.00 pct)	 41166.39 (19.86 pct)	 40845.23 (18.92 pct)	 40797.97 (18.78 pct)	 38150.17 (11.07 pct)
 1024	 33432.92 (0.00 pct)	 40900.84 (22.33 pct)	 39749.35 (18.89 pct)	 41133.82 (23.03 pct)	 38464.26 (15.04 pct)


 Observations: tbench really likes all variants of shared-wakequeue. I
 have also included sis-node numbers since we saw that tbench liked
 sis-node.

Also, it can be noted that except for the 256 clients case (number of
clients == number of threads in the system), in all other cases, we
see a benefit with david-ego-1 which extends the usage of
shared-wakequeue to the waker's target when the waker's LLC is busy.

Will try and get the netperf, postgresql, SPECjbb and Deathstarbench
numbers this week.

--
Thanks and Regards
gautham.
  
David Vernet June 27, 2023, 3:17 a.m. UTC | #34
On Mon, Jun 26, 2023 at 11:34:41AM +0530, Gautham R. Shenoy wrote:
> Hello Peter, David,

Hello Gautham,

Thanks a lot for running these experiments and experimenting with some
variations.

> 
> On Fri, Jun 23, 2023 at 03:20:15PM +0530, Gautham R. Shenoy wrote:
> > On Thu, Jun 22, 2023 at 12:29:35PM +0200, Peter Zijlstra wrote:
> > > On Thu, Jun 22, 2023 at 02:41:57PM +0530, Gautham R. Shenoy wrote:
> 
> > 
> > I will post more results later.
> 
> I was able to get some numbers for hackbench, schbench (old), and
> tbench over the weekend on a 2 Socket Zen3 box with 64 cores 128
> threads per socket configured in NPS1 mode.
> 
> The legend is as follows:
> 
> tip : tip/sched/core with HEAD being commit e2a1f85bf9f5 ("sched/psi:
>       Avoid resetting the min update period when it is unnecessary")
> 
> 
> david : This patchset
> 
> david-ego-1 : David's patchset + my modification to allow SIS to signal
>               that a task should be queued on the shared-wakequeue when SIS cannot
>               find an idle CPU to wake up the task.
> 
> david-ego-2 : David's patchset + david-ego-1 + my modification to
> 	      remove the first task from the shared-wakequeue whose
> 	      cpus_allowed contains this CPU. Currently we don't do
> 	      this check and always remove the first task. 
> 
> 
> david-ego-1 and david-ego-2 are attached with this mail.
> 
> hackbench (Measure: time taken to complete, in seconds)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Test:         tip                david                david-ego-1          david-ego-2
> 1-groups:     3.92 (0.00 pct)    3.35 (14.54 pct)     3.53 (9.94 pct)      3.30 (15.81 pct)
> 2-groups:     4.58 (0.00 pct)    3.89 (15.06 pct)     3.95 (13.75 pct)     3.79 (17.24 pct)
> 4-groups:     4.99 (0.00 pct)    4.42 (11.42 pct)     4.76 (4.60 pct)      4.77 (4.40 pct)
> 8-groups:     5.67 (0.00 pct)    5.08 (10.40 pct)     6.16 (-8.64 pct)     6.33 (-11.64 pct)
> 16-groups:    7.88 (0.00 pct)    7.32 (7.10 pct)      8.57 (-8.75 pct)     9.77 (-23.98 pct)

Nice, these are pretty significant improvements.

> Observation: We see that David's patchset does very well across all
> the groups.  Expanding the scope of the shared-wakequeue with
> david-ego-1 doesn't give us much and in fact hurts at higher
> utilization. Same is the case with david-ego-2 which only pulls
> allowed tasks from the shared-wakequeue. In david-ego-2 we see a
> greater amount of spin-lock contention for 8 and 16 groups, as the
> code holds the spinlock and iterates through the list members while
> checking cpu-affinity.

Peter's idea to shard may help with the contention here. I also wonder
how we'd do with david-ego-2 minus david-ego-1. Perhaps the slightly
less aggressive enqueueing would somewhat lower the contention?

> So, David's original patchset wins this one.
> 
> 
> 
> 
> schbench (Measure : 99th Percentile latency, in us)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> #workers: tip                     david                   david-ego-1             david-ego-2
>  1:      26.00 (0.00 pct)         21.00 (19.23 pct)       28.00 (-7.69 pct)       22.00 (15.38 pct)
>  2:      27.00 (0.00 pct)         29.00 (-7.40 pct)       28.00 (-3.70 pct)       30.00 (-11.11 pct)
>  4:      31.00 (0.00 pct)         31.00 (0.00 pct)        31.00 (0.00 pct)        28.00 (9.67 pct)
>  8:      36.00 (0.00 pct)         37.00 (-2.77 pct)       34.00 (5.55 pct)        39.00 (-8.33 pct)
> 16:      49.00 (0.00 pct)         49.00 (0.00 pct)        48.00 (2.04 pct)        50.00 (-2.04 pct)
> 32:      80.00 (0.00 pct)         80.00 (0.00 pct)        88.00 (-10.00 pct)      79.00 (1.25 pct)
> 64:     169.00 (0.00 pct)        180.00 (-6.50 pct)      174.00 (-2.95 pct)      168.00 (0.59 pct)
> 128:     343.00 (0.00 pct)       355.00 (-3.49 pct)      356.00 (-3.79 pct)      344.00 (-0.29 pct)
> 256:     42048.00 (0.00 pct)   46528.00 (-10.65 pct)   51904.00 (-23.43 pct)   48064.00 (-14.30 pct)
> 512:     95104.00 (0.00 pct)   95872.00 (-0.80 pct)    95360.00 (-0.26 pct)    97152.00 (-2.15 pct)
> 
> 
> Observations: There are run-to-run variations with this benchmark. I
> will try with the newer schbench later this week. 

+cc Chris. He showed how you can use schbench to cause painful
contention with swqueue in [0]. Chris -- any other schbench incantations
that you'd recommend we try with Gautham's patches?

[0]: https://lore.kernel.org/lkml/c8419d9b-2b31-2190-3058-3625bdbcb13d@meta.com/

> tbench (Measure: Throughput, records/s)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Clients: tip			 sis-node		 david			 david-ego-1		 david-ego-2
>     1	 452.49 (0.00 pct)	 457.94 (1.20 pct)	 448.52 (-0.87 pct)	 447.11 (-1.18 pct)	 458.45 (1.31 pct)
>     2	 862.44 (0.00 pct)	 879.99 (2.03 pct)	 860.14 (-0.26 pct)	 873.27 (1.25 pct)	 891.72 (3.39 pct)
>     4	 1604.27 (0.00 pct)	 1618.87 (0.91 pct)	 1610.95 (0.41 pct)	 1628.45 (1.50 pct)	 1657.26 (3.30 pct)
>     8	 2966.77 (0.00 pct)	 3040.90 (2.49 pct)	 2991.07 (0.81 pct)	 3063.31 (3.25 pct)	 3106.50 (4.70 pct)
>    16	 5176.70 (0.00 pct)	 5292.29 (2.23 pct)	 5478.32 (5.82 pct)	 5462.05 (5.51 pct)	 5537.15 (6.96 pct)
>    32	 8205.24 (0.00 pct)	 8949.12 (9.06 pct)	 9039.63 (10.16 pct)	 9466.07 (15.36 pct)	 9365.06 (14.13 pct)
>    64	 13956.71 (0.00 pct)	 14461.42 (3.61 pct)	 16337.65 (17.05 pct)	 16941.63 (21.38 pct)	 15697.47 (12.47 pct)
>   128	 24005.50 (0.00 pct)	 26052.75 (8.52 pct)	 25605.24 (6.66 pct)	 27243.19 (13.48 pct)	 24854.60 (3.53 pct)
>   256	 32457.61 (0.00 pct)	 21999.41 (-32.22 pct)	 36953.22 (13.85 pct)	 32299.31 (-0.48 pct)	 33037.03 (1.78 pct)
>   512	 34345.24 (0.00 pct)	 41166.39 (19.86 pct)	 40845.23 (18.92 pct)	 40797.97 (18.78 pct)	 38150.17 (11.07 pct)
>  1024	 33432.92 (0.00 pct)	 40900.84 (22.33 pct)	 39749.35 (18.89 pct)	 41133.82 (23.03 pct)	 38464.26 (15.04 pct)
> 
> 
>  Observations: tbench really likes all variants of shared-wakequeue. I

It's encouraging to see how well swqueue is handling these benchmarks;
especially on CPUs with larger LLCs and with a lot of clients. I'm keen
to see how we do with the sharding approach and your patches.

>  have also included sis-node numbers since we saw that tbench liked
>  sis-node.
> 
> Also, it can be noted that except for the 256 clients case (number of
> clients == number of threads in the system), in all other cases, we
> see a benefit with david-ego-1 which extends the usage of
> shared-wakequeue to the waker's target when the waker's LLC is busy.

Also confused why the 256 client case would perform worse for SIS_NODE
and david-ego-1. Was it consistently negative if you do multiple runs?

Overall, my handwavey impression is that david-ego-1 doesn't seem to
significantly improve over david, though I'd very much like to see the
results of the other benchmarks you listed below. Iterating over tasks
until you find a usable cpus->ptr also makes sense, and I'm hoping that
the sharding idea will help with the contention there.

> Will try and get the netperf, postgresql, SPECjbb and Deathstarbench
> numbers this week.

Thanks, that would be great. I'll be on vacation this week starting
Wednesday, but will try to put together a v2 that includes Peter's
suggestions by the end of the day tomorrow. Otherwise, I'll likely send
it out next week (along with your patch(es) if they're shown to be an
improvement over the baseline).

Thanks,
David

> 
> --
> Thanks and Regards
> gautham.
> 
> 
> 
> 
> 
> 

> From 05d8efe2f3ae3abafd4bf94a0579d378dba63bb6 Mon Sep 17 00:00:00 2001
> From: "Gautham R. Shenoy" <gautham.shenoy@amd.com>
> Date: Fri, 23 Jun 2023 11:02:03 +0530
> Subject: [PATCH 1/2] swqueue: Control if a task should be queued on swq in
>  select_idle_sibling()
> 
> If select_idle_sibling() fails to find an idle CPU to wakeup the task
> on, then update the newly defined sched_add_to_swq field in its task
> struct.
> 
> Use the value in this field later on to determine if the task should
> also be queued on the shared-wakequeue of the LLC of the target CPU.
> 
> This extends the use of shared-wakequeue to cases when the target of a
> wakeup is the current CPU instead of the task's previous CPU.
> 
> Signed-off-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
> ---
>  include/linux/sched.h |  6 ++++++
>  kernel/sched/fair.c   | 15 ++++++++-------
>  2 files changed, 14 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b64fec55a381..38005262a7fe 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -910,6 +910,12 @@ struct task_struct {
>  	 */
>  	unsigned			sched_remote_wakeup:1;
>  
> +	/*
> +	 * Bit used by select_idle_sibling() to signal enqueuing the
> +	 * task on a shared wakequeue when it failed to find an idle CPU.
> +	 */
> +	unsigned			sched_add_to_swq:1;
> +
>  	/* Bit to tell LSMs we're in execve(): */
>  	unsigned			in_execve:1;
>  	unsigned			in_iowait:1;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e311d1c3b992..fe33f6b13299 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -215,21 +215,17 @@ static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
>  {
>  	unsigned long flags;
>  	struct swqueue *swqueue;
> -	bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> -	bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
>  
>  	/*
>  	 * Only enqueue the task in the shared wakequeue if:
>  	 *
>  	 * - SWQUEUE is enabled
> -	 * - The task is on the wakeup path
> -	 * - The task wasn't purposefully migrated to the current rq by
> -	 *   select_task_rq()
> -	 * - The task isn't pinned to a specific CPU
> +	 * - select_idle_sibling() didn't find an idle CPU to enqueue this wakee task.
>  	 */
> -	if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> +	if (!p->sched_add_to_swq)
>  		return;
>  
> +	p->sched_add_to_swq = 0;
>  	swqueue = rq_swqueue(rq);
>  	spin_lock_irqsave(&swqueue->lock, flags);
>  	list_add_tail(&p->swqueue_node, &swqueue->list);
> @@ -7361,6 +7357,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>  	if ((unsigned)i < nr_cpumask_bits)
>  		return i;
>  
> +	/*
> +	 * No idle sibling was found. Ok to queue this task on the
> +	 * shared wakequeue of the target.
> +	 */
> +	p->sched_add_to_swq = 1;
>  	return target;
>  }
>  
> -- 
> 2.25.1
> 

> From 88f52c2df8a2d92423ddd12c92edec949148bf3c Mon Sep 17 00:00:00 2001
> From: "Gautham R. Shenoy" <gautham.shenoy@amd.com>
> Date: Fri, 23 Jun 2023 23:25:04 +0530
> Subject: [PATCH 2/2] swqueue: Only pull a task with valid affinity from
>  swqueue
> 
> Currently swqueue_pull_task() dequeues the task at the head of the
> shared-wakequeue and then tries to migrate the task onto the current
> CPU.
> 
> This may fail, since the current CPU may not be set in the task's
> affinity mask.
> 
> Hence in swqueue_pull_task(), pull the first task from the
> shared-wakequeue that can be run on this CPU. With this,
> swqueue_pick_next_task() can return 0/1 instead of 0/-1/1 as it is
> done now.
> 
> Signed-off-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
> ---
>  kernel/sched/fair.c | 22 ++++++++++++----------
>  1 file changed, 12 insertions(+), 10 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fe33f6b13299..e78b8302b4c8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -195,17 +195,21 @@ static struct swqueue *rq_swqueue(struct rq *rq)
>  	return rq->cfs.swqueue;
>  }
>  
> -static struct task_struct *swqueue_pull_task(struct swqueue *swqueue)
> +static struct task_struct *swqueue_pull_task(struct swqueue *swqueue, int cpu)
>  {
>  	unsigned long flags;
>  
>  	struct task_struct *p;
>  
>  	spin_lock_irqsave(&swqueue->lock, flags);
> -	p = list_first_entry_or_null(&swqueue->list, struct task_struct,
> -				     swqueue_node);
> -	if (p)
> -		list_del_init(&p->swqueue_node);
> +	list_for_each_entry(p, &swqueue->list, swqueue_node) {
> +		if (cpumask_test_cpu(cpu, p->cpus_ptr)) {
> +			list_del_init(&p->swqueue_node);
> +			goto found;
> +		}
> +	}
> +	p = NULL;
> +found:
>  	spin_unlock_irqrestore(&swqueue->lock, flags);
>  
>  	return p;
> @@ -238,11 +242,11 @@ static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf)
>  	struct task_struct *p = NULL;
>  	struct rq *src_rq;
>  	struct rq_flags src_rf;
> -	int ret;
> +	int ret = 0;
>  
>  	swqueue = rq_swqueue(rq);
>  	if (!list_empty(&swqueue->list))
> -		p = swqueue_pull_task(swqueue);
> +		p = swqueue_pull_task(swqueue, rq->cpu);
>  
>  	if (!p)
>  		return 0;
> @@ -255,10 +259,8 @@ static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf)
>  	if (task_on_rq_queued(p) && !task_on_cpu(rq, p))
>  		src_rq = migrate_task_to(src_rq, &src_rf, p, cpu_of(rq));
>  
> -	if (src_rq->cpu != rq->cpu)
> +	if (src_rq->cpu == rq->cpu)
>  		ret = 1;
> -	else
> -		ret = -1;
>  
>  	task_rq_unlock(src_rq, p, &src_rf);
>  
> -- 
> 2.25.1
>
  
Chris Mason June 27, 2023, 4:31 p.m. UTC | #35
On 6/26/23 11:17 PM, David Vernet wrote:
> On Mon, Jun 26, 2023 at 11:34:41AM +0530, Gautham R. Shenoy wrote:

>>
>> Observations: There are run-to-run variations with this benchmark. I
>> will try with the newer schbench later this week. 
> 
> +cc Chris. He showed how you can use schbench to cause painful
> contention with swqueue in [0]. Chris -- any other schbench incantations
> that you'd recommend we try with Gautham's patches?
> 
> [0]: https://lore.kernel.org/lkml/c8419d9b-2b31-2190-3058-3625bdbcb13d@meta.com/

Hopefully my command line from this other email will be consistent with
the new schbench, please let me know if you're still seeing variations.

In terms of other things to try, I was creating one worker per messenger
to maximize runqueue lock contention, but variations on -t may be
interesting.

Basically I did -t 1 -m <increasing> until all the idle time on
unpatched Linus was soaked up, and then I compared Linus and swqueue.

-chris
  

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1292d38d66cc..1f4fd22f88a8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -770,6 +770,8 @@  struct task_struct {
 	unsigned long			wakee_flip_decay_ts;
 	struct task_struct		*last_wakee;
 
+	struct list_head		swqueue_node;
+
 	/*
 	 * recent_used_cpu is initially set as the last CPU used by a task
 	 * that wakes affine another task. Waker/wakee relationships can
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d911b0631e7b..e04f0daf1f05 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4533,6 +4533,7 @@  static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 #ifdef CONFIG_SMP
 	p->wake_entry.u_flags = CSD_TYPE_TTWU;
 	p->migration_pending = NULL;
+	INIT_LIST_HEAD(&p->swqueue_node);
 #endif
 	init_sched_mm_cid(p);
 }
@@ -9872,6 +9873,7 @@  void __init sched_init_smp(void)
 
 	init_sched_rt_class();
 	init_sched_dl_class();
+	init_sched_fair_class_late();
 
 	sched_smp_initialized = true;
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 807986bd6ea6..29fe25794884 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -139,17 +139,151 @@  static int __init setup_sched_thermal_decay_shift(char *str)
 }
 __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
 
+/**
+ * struct swqueue - Per-LLC queue structure for enqueuing and pulling waking
+ * tasks.
+ *
+ * WHAT
+ * ====
+ *
+ * This structure enables the scheduler to be more aggressively work
+ * conserving, by placing waking tasks on a per-LLC FIFO queue that can then be
+ * pulled from when another core in the LLC is going to go idle.
+ *
+ * struct rq stores a pointer to its LLC's swqueue via struct cfs_rq. Waking
+ * tasks are enqueued in a swqueue at the end of enqueue_task_fair(), and are
+ * opportunistically pulled from the swqueue in pick_next_task_fair() prior to
+ * invoking newidle_balance(). Tasks enqueued in a swqueue may be scheduled prior
+ * to being pulled from the swqueue, in which case they're simply removed from
+ * the swqueue. A waking task is only enqueued to a swqueue when it was _not_
+ * manually migrated to the current runqueue by select_task_rq_fair().
+ *
+ * There is currently no task-stealing between swqueues in different LLCs,
+ * which means that swqueue is not fully work conserving. This could be added
+ * at a later time, with tasks likely only being stolen across swqueues on the
+ * same NUMA node to avoid violating NUMA affinities.
+ *
+ * HOW
+ * ===
+ *
+ * An swqueue is comprised of a list, and a spinlock for synchronization. Given
+ * that the critical section for a swqueue is typically a fast list operation,
+ * and that the swqueue is localized to a single LLC, the spinlock does not
+ * seem to be contended; even on a heavily utilized host. struct swqueues are
+ * also cacheline aligned to prevent false sharing between CPUs manipulating
+ * swqueues in other LLCs.
+ *
+ * WHY
+ * ===
+ *
+ * As mentioned above, the main benefit of swqueue is that it enables more
+ * aggressive work conservation in the scheduler. This can benefit workloads
+ * that benefit more from CPU utilization than from L1/L2 cache locality.
+ *
+ * swqueues are segmented across LLCs both to avoid contention on the swqueue
+ * spinlock by minimizing the number of CPUs that could contend on it, as well
+ * as to strike a balance between work conservation, and L3 cache locality.
+ */
+struct swqueue {
+	struct list_head list;
+	spinlock_t lock;
+} ____cacheline_aligned;
+
 #ifdef CONFIG_SMP
-static void swqueue_enqueue(struct rq *rq, struct task_struct *p,
-			    int enq_flags)
-{}
+static struct swqueue *rq_swqueue(struct rq *rq)
+{
+	return rq->cfs.swqueue;
+}
+
+static struct task_struct *swqueue_pull_task(struct swqueue *swqueue)
+{
+	unsigned long flags;
+
+	struct task_struct *p;
+
+	spin_lock_irqsave(&swqueue->lock, flags);
+	p = list_first_entry_or_null(&swqueue->list, struct task_struct,
+				     swqueue_node);
+	if (p)
+		list_del_init(&p->swqueue_node);
+	spin_unlock_irqrestore(&swqueue->lock, flags);
+
+	return p;
+}
+
+static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
+{
+	unsigned long flags;
+	struct swqueue *swqueue;
+	bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
+	bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
+
+	/*
+	 * Only enqueue the task in the shared wakequeue if:
+	 *
+	 * - SWQUEUE is enabled
+	 * - The task is on the wakeup path
+	 * - The task wasn't purposefully migrated to the current rq by
+	 *   select_task_rq()
+	 * - The task isn't pinned to a specific CPU
+	 */
+	if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
+		return;
+
+	swqueue = rq_swqueue(rq);
+	spin_lock_irqsave(&swqueue->lock, flags);
+	list_add_tail(&p->swqueue_node, &swqueue->list);
+	spin_unlock_irqrestore(&swqueue->lock, flags);
+}
+
 static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf)
 {
-	return 0;
+	struct swqueue *swqueue;
+	struct task_struct *p = NULL;
+	struct rq *src_rq;
+	struct rq_flags src_rf;
+	int ret;
+
+	swqueue = rq_swqueue(rq);
+	if (!list_empty(&swqueue->list))
+		p = swqueue_pull_task(swqueue);
+
+	if (!p)
+		return 0;
+
+	rq_unpin_lock(rq, rf);
+	raw_spin_rq_unlock(rq);
+
+	src_rq = task_rq_lock(p, &src_rf);
+
+	if (task_on_rq_queued(p) && !task_on_cpu(rq, p))
+		src_rq = migrate_task_to(src_rq, &src_rf, p, cpu_of(rq));
+
+	if (src_rq->cpu != rq->cpu)
+		ret = 1;
+	else
+		ret = -1;
+
+	task_rq_unlock(src_rq, p, &src_rf);
+
+	raw_spin_rq_lock(rq);
+	rq_repin_lock(rq, rf);
+
+	return ret;
 }
 
 static void swqueue_remove_task(struct task_struct *p)
-{}
+{
+	unsigned long flags;
+	struct swqueue *swqueue;
+
+	if (!list_empty(&p->swqueue_node)) {
+		swqueue = rq_swqueue(task_rq(p));
+		spin_lock_irqsave(&swqueue->lock, flags);
+		list_del_init(&p->swqueue_node);
+		spin_unlock_irqrestore(&swqueue->lock, flags);
+	}
+}
 
 /*
  * For asym packing, by default the lower numbered CPU has higher priority.
@@ -12839,3 +12973,34 @@  __init void init_sched_fair_class(void)
 #endif /* SMP */
 
 }
+
+__init void init_sched_fair_class_late(void)
+{
+#ifdef CONFIG_SMP
+	int i;
+	struct swqueue *swqueue;
+	struct rq *rq;
+	struct rq *llc_rq;
+
+	for_each_possible_cpu(i) {
+		if (per_cpu(sd_llc_id, i) == i) {
+			llc_rq = cpu_rq(i);
+
+			swqueue = kzalloc_node(sizeof(struct swqueue),
+					       GFP_KERNEL, cpu_to_node(i));
+			INIT_LIST_HEAD(&swqueue->list);
+			spin_lock_init(&swqueue->lock);
+			llc_rq->cfs.swqueue = swqueue;
+		}
+	}
+
+	for_each_possible_cpu(i) {
+		rq = cpu_rq(i);
+		llc_rq = cpu_rq(per_cpu(sd_llc_id, i));
+
+		if (rq == llc_rq)
+			continue;
+		rq->cfs.swqueue = llc_rq->cfs.swqueue;
+	}
+#endif /* SMP */
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5a86e9795731..daee5c64af87 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -575,6 +575,7 @@  struct cfs_rq {
 #endif
 
 #ifdef CONFIG_SMP
+	struct swqueue		*swqueue;
 	/*
 	 * CFS load tracking
 	 */
@@ -2380,6 +2381,7 @@  extern void update_max_interval(void);
 extern void init_sched_dl_class(void);
 extern void init_sched_rt_class(void);
 extern void init_sched_fair_class(void);
+extern void init_sched_fair_class_late(void);
 
 extern void reweight_task(struct task_struct *p, int prio);