Message ID | xm26zfwx7z5p.fsf@google.com |
---|---|
State | New |
Headers |
From: Benjamin Segall <bsegall@google.com>
To: Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@redhat.com>, Will Deacon <will@kernel.org>, Waiman Long <longman@redhat.com>, Boqun Feng <boqun.feng@gmail.com>
Cc: linux-kernel@vger.kernel.org
Subject: [RFC PATCH] locking/percpu-rwsem: do not do lock handoff in percpu_up_write
Date: Mon, 22 Jan 2024 14:59:14 -0800
Message-ID: <xm26zfwx7z5p.fsf@google.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
List-Id: <linux-kernel.vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain |
Series | [RFC] locking/percpu-rwsem: do not do lock handoff in percpu_up_write |
Commit Message
Benjamin Segall
Jan. 22, 2024, 10:59 p.m. UTC
The waitq wakeup in percpu_up_write necessarily runs the wake function
immediately in the current thread. With it calling
__percpu_rwsem_trylock on behalf of the thread being woken, the lock is
extremely fair and FIFO, with the window for unfairness merely being the
time between the release of sem->block and the completion of a relevant
trylock.
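For reference, the path being described looks roughly like this; a condensed
sketch of percpu_up_write() based on kernel/locking/percpu-rwsem.c, with the
lockdep and rcu_sync handling omitted:

void percpu_up_write(struct percpu_rw_semaphore *sem)
{
        /* Release write ownership; from here on a new locker can race in. */
        atomic_set_release(&sem->block, 0);

        /*
         * The wake function (percpu_rwsem_wake_function) runs right here,
         * in the releasing thread, and takes the lock on behalf of each
         * waiter it wakes via __percpu_rwsem_trylock(), so the window for
         * an unfair acquisition is only the gap between the release above
         * and those trylocks.
         */
        __wake_up(&sem->waiters, TASK_NORMAL, 1, sem);
}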
However, the woken threads that now hold the lock may not be chosen to
run for a long time, and it would be useful to have more of this window
available for a currently running thread to unfairly take the lock
immediately and use it. This can result in priority-inversion issues
with high contention or things like CFS_BANDWIDTH quotas.
The even older version of percpu_rwsem that used an rwsem at its core
provided some related gains in a different fashion through
RWSEM_SPIN_ON_OWNER; while it had a similarly small window, waiting
writers could spin, making it far more likely that a thread would hit
this window.
Signed-off-by: Ben Segall <bsegall@google.com>
---
So the actual problem we saw was that one job had severe slowdowns
during startup with certain other jobs on the machine, and the slowdowns
turned out to be some cgroup moves it did during startup. The antagonist
jobs were spawning huge numbers of threads and some other internal bugs
were exacerbating their contention. The lock handoff meant that a batch
of antagonist threads would receive the read lock of
cgroup_threadgroup_rwsem and at least some of those threads would take a
long time to be scheduled.
As a totally artificial test, you can see occasional latencies up
to multiple seconds on moving singlethreaded processes with
cgroup.procs, with an antagonist cpu cgroup of thousands of cpusoakers
and tasks doing (pthread_create; pthread_join) or fork/join. A simple
antagonist is just a cpu+cpuset cgroup and:
bash -c 'enter cgroup; for i in {1..4000}; do bash -c "while true; do :; done&"; done' &
bash -c 'enter cgroup; for i in {1..4000}; do bash -c "while true; do /bin/true; done&"; done' &
favordynmods improves the average cost of the migrate, but doesn't help
much with the spikes. This patch does.
The locktorture results under the current locktorture are mixed (10m
runtime, 72 cpu machine, default settings):
With this patch:
Writes: Total: 56094 Max/Min: 922/680 Fail: 0
Reads : Total: 21522 Max/Min: 305/292 Fail: 0
Baseline:
Writes: Total: 33801 Max/Min: 472/468 Fail: 0
Reads : Total: 33649 Max/Min: 475/463 Fail: 0
Under the old locktorture I originally tested against (or patching it
back in), where torture_rwsem_write_delay basically did
"mdelay(long_hold / 10)" when it wasn't doing the long hold, the results
were more clearly beneficial (there's actual noticeable variance in this
config, so I have stats and an example run near the median):
Baseline:
Writes: Total: 11852 Max/Min: 167/164 Fail: 0
Reads : Total: 11858 Max/Min: 169/164 Fail: 0
N Min Max Median Avg Stddev
R 50 10297 15142 11690 11951.72 1437.2562
W 50 10296 15144 11692 11952.04 1437.5479
Patched:
Writes: Total: 27624 Max/Min: 442/334 Fail: 0
Reads : Total: 13957 Max/Min: 197/192 Fail: 0
N Min Max Median Avg Stddev
R 30 13725 17133 13902 14458.467 1199.125
W 30 27525 28880 27683 27844.533 372.82701
Which of these locktorture setups is a more useful test in general I
don't know; the patch removing the mdelay ("locktorture: Add long_hold
to adjust lock-hold delays") left in some of the shortdelay-type
mdelay/udelay calls for other locks.
Of course, whenever proxy execution lands, the downsides of the current
nearly 100% FIFO behavior will be gone.
---
kernel/locking/percpu-rwsem.c | 55 +++++++++++++++++++++--------------
1 file changed, 33 insertions(+), 22 deletions(-)
Comments
On Mon, 22 Jan 2024 14:59:14 -0800 Benjamin Segall <bsegall@google.com>
> The waitq wakeup in percpu_up_write necessarily runs the wake function
> immediately in the current thread. With it calling
> __percpu_rwsem_trylock on behalf of the thread being woken, the lock is
> extremely fair and FIFO, with the window for unfairness merely being the
> time between the release of sem->block and the completion of a relevant
> trylock.
>
> However, the woken threads that now hold the lock may not be chosen to
> run for a long time, and it would be useful to have more of this window
> available for a currently running thread to unfairly take the lock
> immediately and use it.

It makes no sense for lock acquirer to probe owner's activity except for
spining on owner. Nor for owner to guess if any acquirer comes soon.

> This can result in priority-inversion issues
> with high contention or things like CFS_BANDWIDTH quotas.

Given mutex could not avoid PI (priority-inversion) and deadlock, why is
percpu-rwsem special wrt PI?

>
> The even older version of percpu_rwsem that used an rwsem at its core
> provided some related gains in a different fashion through
> RWSEM_SPIN_ON_OWNER; while it had a similarly small window, waiting
> writers could spin, making it far more likely that a thread would hit
> this window.
>
> Signed-off-by: Ben Segall <bsegall@google.com>
>
> ---
>
> So the actual problem we saw was that one job had severe slowdowns
> during startup with certain other jobs on the machine, and the slowdowns
> turned out to be some cgroup moves it did during startup. The antagonist
> jobs were spawning huge numbers of threads and some other internal bugs
> were exacerbating their contention. The lock handoff meant that a batch
> of antagonist threads would receive the read lock of
> cgroup_threadgroup_rwsem and at least some of those threads would take a
> long time to be scheduled.

If you want to avoid starved lock waiter, take a look at RWSEM_FLAG_HANDOFF
in rwsem_down_read_slowpath().
Hillf Danton <hdanton@sina.com> writes:

> On Mon, 22 Jan 2024 14:59:14 -0800 Benjamin Segall <bsegall@google.com>
>> The waitq wakeup in percpu_up_write necessarily runs the wake function
>> immediately in the current thread. With it calling
>> __percpu_rwsem_trylock on behalf of the thread being woken, the lock is
>> extremely fair and FIFO, with the window for unfairness merely being the
>> time between the release of sem->block and the completion of a relevant
>> trylock.
>>
>> However, the woken threads that now hold the lock may not be chosen to
>> run for a long time, and it would be useful to have more of this window
>> available for a currently running thread to unfairly take the lock
>> immediately and use it.
>
> It makes no sense for lock acquirer to probe owner's activity except for
> spining on owner. Nor for owner to guess if any acquirer comes soon.

The code is not doing that; this text is just describing why we might
choose a less fair heuristic for which thread gets the lock.

>
>> This can result in priority-inversion issues
>> with high contention or things like CFS_BANDWIDTH quotas.
>
> Given mutex could not avoid PI (priority-inversion) and deadlock, why is
> percpu-rwsem special wrt PI?

I was going to say that mutex/rwsem have SPIN_ON_OWNER that dodge this
somewhat (and percpu-rwsem cannot do that). Switching
cgroup_threadgroup_rwsem to an actual rwsem and even disabling read-side
RWSEM_FLAG_HANDOFF doesn't actually help noticeably for my artificial
benchmark though, so the test may not be as representative as I hoped.

The most obvious possibility is that with the real problem
solving/not-causing the internal contention issues was sufficient, and
that also attacking it from the percpu-rwsem angle was overkill. It
wasn't sufficient for the artificial test, but cranking up the load to
get a reliable test could easily have blown past the point where the
other fix was sufficient.

>>
>> Signed-off-by: Ben Segall <bsegall@google.com>
>>
>> ---
>>
>> So the actual problem we saw was that one job had severe slowdowns
>> during startup with certain other jobs on the machine, and the slowdowns
>> turned out to be some cgroup moves it did during startup. The antagonist
>> jobs were spawning huge numbers of threads and some other internal bugs
>> were exacerbating their contention. The lock handoff meant that a batch
>> of antagonist threads would receive the read lock of
>> cgroup_threadgroup_rwsem and at least some of those threads would take a
>> long time to be scheduled.
>
> If you want to avoid starved lock waiter, take a look at RWSEM_FLAG_HANDOFF
> in rwsem_down_read_slowpath().

rwsem's HANDOFF flag is the exact opposite of what this patch is doing.
Percpu-rwsem's current code has perfect handoff for read->write, and a very
short window for write->read (or write->write) to be beaten by a new writer.
On Wed, 24 Jan 2024 14:10:43 -0800 Benjamin Segall <bsegall@google.com>
> Hillf Danton <hdanton@sina.com> writes:
> > On Mon, 22 Jan 2024 14:59:14 -0800 Benjamin Segall <bsegall@google.com>
> >> So the actual problem we saw was that one job had severe slowdowns
> >> during startup with certain other jobs on the machine, and the slowdowns
> >> turned out to be some cgroup moves it did during startup. The antagonist
> >> jobs were spawning huge numbers of threads and some other internal bugs
> >> were exacerbating their contention. The lock handoff meant that a batch
> >> of antagonist threads would receive the read lock of
> >> cgroup_threadgroup_rwsem and at least some of those threads would take a
> >> long time to be scheduled.
> >
> > If you want to avoid starved lock waiter, take a look at RWSEM_FLAG_HANDOFF
> > in rwsem_down_read_slowpath().
>
> rwsem's HANDOFF flag is the exact opposite of what this patch is doing.

You and I are not on the same page.

> Percpu-rwsem's current code has perfect handoff for read->write, and a very
> short window for write->read (or write->write) to be beaten by a new writer.

Given no chance left for spin on owner who is legal to take a ten-minute nap,
the right thing known to do on behalf of starved waiters is to add the HANDOFF
mechanism without any heuristic like you proposed for instance, in order to
force lock acquirers to go the slow path.

Only for thoughts.

--- x/kernel/locking/percpu-rwsem.c
+++ y/kernel/locking/percpu-rwsem.c
@@ -22,6 +22,7 @@ int __percpu_init_rwsem(struct percpu_rw
         rcuwait_init(&sem->writer);
         init_waitqueue_head(&sem->waiters);
         atomic_set(&sem->block, 0);
+        atomic_set(&sem->ww, 0); /* write waiters */
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
         debug_check_no_locks_freed((void *)sem, sizeof(*sem));
         lockdep_init_map(&sem->dep_map, name, key, 0);
@@ -135,6 +136,9 @@ static int percpu_rwsem_wake_function(st
         wake_up_process(p);
         put_task_struct(p);
 
+        if (!reader)
+                atomic_dec(&sem->ww);
+
         return !reader; /* wake (readers until) 1 writer */
 }
 
@@ -148,8 +152,10 @@ static void percpu_rwsem_wait(struct per
          * Serialize against the wakeup in percpu_up_write(), if we fail
          * the trylock, the wakeup must see us on the list.
          */
-        wait = !__percpu_rwsem_trylock(sem, reader);
+        wait = atomic_read(&sem->ww) || !__percpu_rwsem_trylock(sem, reader);
         if (wait) {
+                if (!reader)
+                        atomic_inc(&sem->ww);
                 wq_entry.flags |= WQ_FLAG_EXCLUSIVE | reader * WQ_FLAG_CUSTOM;
                 __add_wait_queue_entry_tail(&sem->waiters, &wq_entry);
         }
@@ -166,7 +172,7 @@ static void percpu_rwsem_wait(struct per
 
 bool __sched __percpu_down_read(struct percpu_rw_semaphore *sem, bool try)
 {
-        if (__percpu_down_read_trylock(sem))
+        if (!atomic_read(&sem->ww) && __percpu_down_read_trylock(sem))
                 return true;
 
         if (try)
@@ -234,7 +240,7 @@ void __sched percpu_down_write(struct pe
          * Try set sem->block; this provides writer-writer exclusion.
          * Having sem->block set makes new readers block.
          */
-        if (!__percpu_down_write_trylock(sem))
+        if (atomic_read(&sem->ww) || !__percpu_down_read_trylock(sem))
                 percpu_rwsem_wait(sem, /* .reader = */ false);
 
         /* smp_mb() implied by __percpu_down_write_trylock() on success -- D matches A */
Hillf Danton <hdanton@sina.com> writes:

> On Wed, 24 Jan 2024 14:10:43 -0800 Benjamin Segall <bsegall@google.com>
>> Hillf Danton <hdanton@sina.com> writes:
>> > On Mon, 22 Jan 2024 14:59:14 -0800 Benjamin Segall <bsegall@google.com>
>> >> So the actual problem we saw was that one job had severe slowdowns
>> >> during startup with certain other jobs on the machine, and the slowdowns
>> >> turned out to be some cgroup moves it did during startup. The antagonist
>> >> jobs were spawning huge numbers of threads and some other internal bugs
>> >> were exacerbating their contention. The lock handoff meant that a batch
>> >> of antagonist threads would receive the read lock of
>> >> cgroup_threadgroup_rwsem and at least some of those threads would take a
>> >> long time to be scheduled.
>> >
>> > If you want to avoid starved lock waiter, take a look at RWSEM_FLAG_HANDOFF
>> > in rwsem_down_read_slowpath().
>>
>> rwsem's HANDOFF flag is the exact opposite of what this patch is doing.
>
> You and I are not on the same page.
>
>> Percpu-rwsem's current code has perfect handoff for read->write, and a very
>> short window for write->read (or write->write) to be beaten by a new writer.
>
> Given no chance left for spin on owner who is legal to take a ten-minute nap,
> the right thing known to do on behalf of starved waiters is to add the HANDOFF
> mechanism without any heuristic like you proposed for instance, in order to
> force lock acquirers to go the slow path.
>
> Only for thoughts.

This is not the type of slowdown that is the problem my patch is trying
to address. (And due to the way percpu-rwsem works sem->ww is nearly
entirely redundant with sem->block - the first waiting writer is instead
waiting on rcuwait and holds sem->block while doing so)

The problem that my patch addresses is:

Writer is done: percpu_up_write
  atomic_set_release(&sem->block, 0); // #1
  wake a batch of readers:
    percpu_rwsem_wake_function -> __percpu_rwsem_trylock(reader) // #2
  wake a single writer
    percpu_rwsem_wake_function -> __percpu_rwsem_trylock(writer) // #3
new writer wakes up (holding sem->block from #3)
  sees the readers holding the lock from #2, now sleeps on rcuwait
time passes // #4
readers finally get to run, run quickly and release the lock
now the writer gets to run

Currently the only source of unfairness/optimistic locking is the window
between #1 and #2, which occur in quick succession, on the same thread,
and with no SPIN_ON_OWNER to make this window more likely than it
otherwise would be.

My patch makes the entire #4 available to writers (or new readers), so
that the woken writer will instead get to run immediately. This is
obviously much less fair, but provides much better throughput (ideally
it might have some sort of delay, so that in more normal circumstances
readers don't have to win the wakeup race by luck and being woken
slightly sooner, but I don't have that).

This is also only useful because of the assumption that readers will
almost always not actually block (among other required assumptions) - if
they regularly sleep while holding the lock, then saving writers from
that first wakeup latency of readers isn't particularly helpful anymore,
because they'll still be delayed by the other wakeup latencies.
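For readers following that timeline, the writer-side path it refers to is
roughly the following; a condensed sketch of percpu_down_write() based on
kernel/locking/percpu-rwsem.c, with lockdep annotations omitted and the #N
comments matching the steps above:

void percpu_down_write(struct percpu_rw_semaphore *sem)
{
        /* Shut down the per-cpu reader fast path. */
        rcu_sync_enter(&sem->rss);

        /*
         * Take sem->block, or sleep in percpu_rwsem_wait() until the wake
         * function in percpu_up_write() hands it over (#3 above).
         */
        if (!__percpu_down_write_trylock(sem))
                percpu_rwsem_wait(sem, /* .reader = */ false);

        /*
         * Wait for the readers woken at #2 to actually run and drop their
         * per-cpu read counts; this wait is the #4 window the patch makes
         * available to other lockers.
         */
        rcuwait_wait_event(&sem->writer, readers_active_check(sem),
                           TASK_UNINTERRUPTIBLE);
}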
On Thu, 25 Jan 2024 13:08:02 -0800 Benjamin Segall <bsegall@google.com>
> Hillf Danton <hdanton@sina.com> writes:
> > On Wed, 24 Jan 2024 14:10:43 -0800 Benjamin Segall <bsegall@google.com>
> >> Hillf Danton <hdanton@sina.com> writes:
> >> > On Mon, 22 Jan 2024 14:59:14 -0800 Benjamin Segall <bsegall@google.com>
> >> >> So the actual problem we saw was that one job had severe slowdowns
> >> >> during startup with certain other jobs on the machine, and the slowdowns
> >> >> turned out to be some cgroup moves it did during startup. The antagonist
> >> >> jobs were spawning huge numbers of threads and some other internal bugs
> >> >> were exacerbating their contention. The lock handoff meant that a batch
> >> >> of antagonist threads would receive the read lock of
> >> >> cgroup_threadgroup_rwsem and at least some of those threads would take a
> >> >> long time to be scheduled.
> >> >
> >> > If you want to avoid starved lock waiter, take a look at RWSEM_FLAG_HANDOFF
> >> > in rwsem_down_read_slowpath().
> >>
> >> rwsem's HANDOFF flag is the exact opposite of what this patch is doing.
> >
> > You and I are not on the same page.
> >
> >> Percpu-rwsem's current code has perfect handoff for read->write, and a very
> >> short window for write->read (or write->write) to be beaten by a new writer.
> >
> > Given no chance left for spin on owner who is legal to take a ten-minute nap,
> > the right thing known to do on behalf of starved waiters is to add the HANDOFF
> > mechanism without any heuristic like you proposed for instance, in order to
> > force lock acquirers to go the slow path.
> >
> > Only for thoughts.
>
> This is not the type of slowdown that is the problem my patch is trying
> to address. (And due to the way percpu-rwsem works sem->ww is nearly
> entirely redundant with sem->block - the first waiting writer is instead
> waiting on rcuwait and holds sem->block while doing so)
>
> The problem that my patch addresses is:
>
> Writer is done: percpu_up_write
>   atomic_set_release(&sem->block, 0); // #1
>   wake a batch of readers:
>     percpu_rwsem_wake_function -> __percpu_rwsem_trylock(reader) // #2
>   wake a single writer
>     percpu_rwsem_wake_function -> __percpu_rwsem_trylock(writer) // #3
> new writer wakes up (holding sem->block from #3)
>   sees the readers holding the lock from #2, now sleeps on rcuwait
> time passes // #4
> readers finally get to run, run quickly and release the lock
> now the writer gets to run
>
> Currently the only source of unfairness/optimistic locking is the window
> between #1 and #2, which occur in quick succession, on the same thread,
> and with no SPIN_ON_OWNER to make this window more likely than it
> otherwise would be.

The sem->ww introduced closes the window between #1 and #2 by define
as it is derived from rwsem's HANDOFF.

>
> My patch makes the entire #4 available to writers (or new readers), so
> that the woken writer will instead get to run immediately. This is

Victims rise in case the woken readers at #2 have been waiting more
than a minute while the woken writer less than 20ms.

> obviously much less fair, but provides much better throughput (ideally
> it might have some sort of delay, so that in more normal circumstances
> readers don't have to win the wakeup race by luck and being woken
> slightly sooner, but I don't have that).
>
> This is also only useful because of the assumption that readers will
> almost always not actually block (among other required assumptions) - if

Like heuristic, any assumption makes the locking game more complex than
thought without real win.

> they regularly sleep while holding the lock, then saving writers from
> that first wakeup latency of readers isn't particularly helpful anymore,
> because they'll still be delayed by the other wakeup latencies.
Hillf Danton <hdanton@sina.com> writes:

> On Thu, 25 Jan 2024 13:08:02 -0800 Benjamin Segall <bsegall@google.com>
>> Hillf Danton <hdanton@sina.com> writes:
>> > On Wed, 24 Jan 2024 14:10:43 -0800 Benjamin Segall <bsegall@google.com>
>> >> Hillf Danton <hdanton@sina.com> writes:
>> >> > On Mon, 22 Jan 2024 14:59:14 -0800 Benjamin Segall <bsegall@google.com>
>> >> >> So the actual problem we saw was that one job had severe slowdowns
>> >> >> during startup with certain other jobs on the machine, and the slowdowns
>> >> >> turned out to be some cgroup moves it did during startup. The antagonist
>> >> >> jobs were spawning huge numbers of threads and some other internal bugs
>> >> >> were exacerbating their contention. The lock handoff meant that a batch
>> >> >> of antagonist threads would receive the read lock of
>> >> >> cgroup_threadgroup_rwsem and at least some of those threads would take a
>> >> >> long time to be scheduled.
>> >> >
>> >> > If you want to avoid starved lock waiter, take a look at RWSEM_FLAG_HANDOFF
>> >> > in rwsem_down_read_slowpath().
>> >>
>> >> rwsem's HANDOFF flag is the exact opposite of what this patch is doing.
>> >
>> > You and I are not on the same page.
>> >
>> >> Percpu-rwsem's current code has perfect handoff for read->write, and a very
>> >> short window for write->read (or write->write) to be beaten by a new writer.
>> >
>> > Given no chance left for spin on owner who is legal to take a ten-minute nap,
>> > the right thing known to do on behalf of starved waiters is to add the HANDOFF
>> > mechanism without any heuristic like you proposed for instance, in order to
>> > force lock acquirers to go the slow path.
>> >
>> > Only for thoughts.
>>
>> This is not the type of slowdown that is the problem my patch is trying
>> to address. (And due to the way percpu-rwsem works sem->ww is nearly
>> entirely redundant with sem->block - the first waiting writer is instead
>> waiting on rcuwait and holds sem->block while doing so)
>>
>> The problem that my patch addresses is:
>>
>> Writer is done: percpu_up_write
>>   atomic_set_release(&sem->block, 0); // #1
>>   wake a batch of readers:
>>     percpu_rwsem_wake_function -> __percpu_rwsem_trylock(reader) // #2
>>   wake a single writer
>>     percpu_rwsem_wake_function -> __percpu_rwsem_trylock(writer) // #3
>> new writer wakes up (holding sem->block from #3)
>>   sees the readers holding the lock from #2, now sleeps on rcuwait
>> time passes // #4
>> readers finally get to run, run quickly and release the lock
>> now the writer gets to run
>>
>> Currently the only source of unfairness/optimistic locking is the window
>> between #1 and #2, which occur in quick succession, on the same thread,
>> and with no SPIN_ON_OWNER to make this window more likely than it
>> otherwise would be.
>
> The sem->ww introduced closes the window between #1 and #2 by define
> as it is derived from rwsem's HANDOFF.

Yes.

>>
>> My patch makes the entire #4 available to writers (or new readers), so
>> that the woken writer will instead get to run immediately. This is
>
> Victims rise in case the woken readers at #2 have been waiting more
> than a minute while the woken writer less than 20ms.
>
>> obviously much less fair, but provides much better throughput (ideally
>> it might have some sort of delay, so that in more normal circumstances
>> readers don't have to win the wakeup race by luck and being woken
>> slightly sooner, but I don't have that).
>>
>> This is also only useful because of the assumption that readers will
>> almost always not actually block (among other required assumptions) - if
>
> Like heuristic, any assumption makes the locking game more complex than
> thought without real win.
>

I'm fine with "no, fairness is more important than these performance
numbers or mitigating already-sorta-broken situations", but it's not
clear to me you've even understood the patch, because you keep only
talking about completely different forms of starvation, and suggesting
changes that would if anything make the situation worse.
On Fri, 26 Jan 2024 12:40:43 -0800 Benjamin Segall <bsegall@google.com>
>
> I'm fine with "no, fairness is more important than these performance
> numbers or mitigating already-sorta-broken situations", but it's not

Fine too because your patch is never able to escape standing ovation.

And feel free to specify the broken situations you saw.

> clear to me you've even understood the patch, because you keep only
> talking about completely different forms of starvation, and suggesting

Given woken writer in your reply and sem->ww is write waiters, there is
only one starvation in this thread.

|> >> My patch makes the entire #4 available to writers (or new readers), so
|> >> that the woken writer will instead get to run immediately. This is

> changes that would if anything make the situation worse.

I mitigate the starvation by making use of the known method without any
heuristic added, though, I am happy to see why it no longer works.
Hillf Danton <hdanton@sina.com> writes:

> On Fri, 26 Jan 2024 12:40:43 -0800 Benjamin Segall <bsegall@google.com>
>>
>> I'm fine with "no, fairness is more important than these performance
>> numbers or mitigating already-sorta-broken situations", but it's not
>
> Fine too because your patch is never able to escape standing ovation.
>
> And feel free to specify the broken situations you saw.

The basic locking issue was due to userspace rapidly spawning threads
(or processes) more rapidly than the cpus they are running on can
support, and this causing issues for unrelated threads doing cgroup
operations on other cpus.

The contention can be due to a combination of just straight up spawning
way too many, userspace misconfiguration of cpus allowed, or load
balancer weaknesses. (If you pick minimum cpu.shares values and have
large machines, SCHED_LOAD_RESOLUTION isn't really enough for load
balance to do a good job, and what you're telling the load balancer you
want isn't really a good idea in the first place).

>
>> clear to me you've even understood the patch, because you keep only
>> talking about completely different forms of starvation, and suggesting
>
> Given woken writer in your reply and sem->ww is write waiters, there is
> only one starvation in this thread.

The problem is *not* new readers/writers coming in and taking the lock
before the waiting ones, which is what your patch would solve. The
problem is that some of the woken readers do not get scheduled for a
long time, and nothing can take the lock until they do.
On Mon, 29 Jan 2024 12:36:20 -0800 Benjamin Segall <bsegall@google.com>
>
> The basic locking issue was due to userspace rapidly spawning threads
> (or processes) more rapidly than the cpus they are running on can
> support, and this causing issues for unrelated threads doing cgroup
> operations on other cpus.
>
> The contention can be due to a combination of just straight up spawning
> way too many, userspace misconfiguration of cpus allowed, or load
> balancer weaknesses. (If you pick minimum cpu.shares values and have
> large machines, SCHED_LOAD_RESOLUTION isn't really enough for load
> balance to do a good job, and what you're telling the load balancer you
> want isn't really a good idea in the first place).

Sigh, add change to percpu-rwsem's handoff because cgroup has a cough
in chest.
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 185bd1c906b01..251c1a7a94e40 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -112,22 +112,22 @@ static bool __percpu_rwsem_trylock(struct percpu_rw_semaphore *sem, bool reader)
  *
  * We use EXCLUSIVE for both readers and writers to preserve FIFO order,
  * and play games with the return value to allow waking multiple readers.
  *
  * Specifically, we wake readers until we've woken a single writer, or until a
- * trylock fails.
+ * check of the sem->block value fails.
  */
 static int percpu_rwsem_wake_function(struct wait_queue_entry *wq_entry,
                                       unsigned int mode, int wake_flags,
                                       void *key)
 {
         bool reader = wq_entry->flags & WQ_FLAG_CUSTOM;
         struct percpu_rw_semaphore *sem = key;
         struct task_struct *p;
 
-        /* concurrent against percpu_down_write(), can get stolen */
-        if (!__percpu_rwsem_trylock(sem, reader))
+        /* racy, just an optimization to stop waking if the lock is taken */
+        if (atomic_read(&sem->block))
                 return 1;
 
         p = get_task_struct(wq_entry->private);
         list_del_init(&wq_entry->entry);
         smp_store_release(&wq_entry->private, NULL);
@@ -138,32 +138,44 @@ static int percpu_rwsem_wake_function(struct wait_queue_entry *wq_entry,
         return !reader; /* wake (readers until) 1 writer */
 }
 
 static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
 {
-        DEFINE_WAIT_FUNC(wq_entry, percpu_rwsem_wake_function);
         bool wait;
+        bool first = true;
 
-        spin_lock_irq(&sem->waiters.lock);
-        /*
-         * Serialize against the wakeup in percpu_up_write(), if we fail
-         * the trylock, the wakeup must see us on the list.
-         */
-        wait = !__percpu_rwsem_trylock(sem, reader);
-        if (wait) {
-                wq_entry.flags |= WQ_FLAG_EXCLUSIVE | reader * WQ_FLAG_CUSTOM;
-                __add_wait_queue_entry_tail(&sem->waiters, &wq_entry);
-        }
-        spin_unlock_irq(&sem->waiters.lock);
+        do {
+                DEFINE_WAIT_FUNC(wq_entry, percpu_rwsem_wake_function);
 
-        while (wait) {
-                set_current_state(TASK_UNINTERRUPTIBLE);
-                if (!smp_load_acquire(&wq_entry.private))
-                        break;
-                schedule();
-        }
-        __set_current_state(TASK_RUNNING);
+                spin_lock_irq(&sem->waiters.lock);
+                /*
+                 * Serialize against the wakeup in percpu_up_write(), if we fail
+                 * the trylock, the wakeup must see us on the list.
+                 */
+                wait = !__percpu_rwsem_trylock(sem, reader);
+                if (wait) {
+                        wq_entry.flags |= WQ_FLAG_EXCLUSIVE | reader * WQ_FLAG_CUSTOM;
+                        /*
+                         * If we wake but lose a race for the lock, preserve
+                         * FIFO-ish behavior by skipping the queue
+                         */
+                        if (first)
+                                __add_wait_queue_entry_tail(&sem->waiters, &wq_entry);
+                        else
+                                __add_wait_queue(&sem->waiters, &wq_entry);
+                        first = false;
+                }
+                spin_unlock_irq(&sem->waiters.lock);
+
+                while (wait) {
+                        set_current_state(TASK_UNINTERRUPTIBLE);
+                        if (!smp_load_acquire(&wq_entry.private))
+                                break;
+                        schedule();
+                }
+                __set_current_state(TASK_RUNNING);
+        } while (wait && !__percpu_rwsem_trylock(sem, reader));
 }
 
 bool __sched __percpu_down_read(struct percpu_rw_semaphore *sem, bool try)
 {
         if (__percpu_down_read_trylock(sem))