[RFC] locking/percpu-rwsem: do not do lock handoff in percpu_up_write

Message ID xm26zfwx7z5p.fsf@google.com
State New
Series [RFC] locking/percpu-rwsem: do not do lock handoff in percpu_up_write

Commit Message

Benjamin Segall Jan. 22, 2024, 10:59 p.m. UTC
  The waitq wakeup in percpu_up_write necessarily runs the wake function
immediately in the current thread. With it calling
__percpu_rwsem_trylock on behalf of the thread being woken, the lock is
extremely fair and FIFO, with the window for unfairness merely being the
time between the release of sem->block and the completion of a relevant
trylock.
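
For reference, the handoff path in question currently looks roughly like
this (abridged from kernel/locking/percpu-rwsem.c, with comments added
here rather than quoted verbatim):

void percpu_up_write(struct percpu_rw_semaphore *sem)
{
	rwsem_release(&sem->dep_map, _RET_IP_);

	/* Release writer ownership; from here a running thread could steal. */
	atomic_set_release(&sem->block, 0);

	/*
	 * Runs percpu_rwsem_wake_function() synchronously in this thread for
	 * the queued waiters; that function takes the lock on the waiter's
	 * behalf via __percpu_rwsem_trylock() before waking it, so the only
	 * stealing window is between the release above and those trylocks.
	 */
	__wake_up(&sem->waiters, TASK_NORMAL, 1, sem);

	rcu_sync_exit(&sem->rss);
}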

However, the woken threads that now hold the lock may not be chosen to
run for a long time, and it would be useful to have more of this window
available for a currently running thread to unfairly take the lock
immediately and use it. This can result in priority-inversion issues
with high contention or things like CFS_BANDWIDTH quotas.

The even older version of percpu_rwsem that used an rwsem at its core
provided some related gains in a different fashion through
RWSEM_SPIN_ON_OWNER; while it had a similarly small window, waiting
writers could spin, making it far more likely that a thread would hit
this window.

Signed-off-by: Ben Segall <bsegall@google.com>

---

So the actual problem we saw was that one job had severe slowdowns
during startup with certain other jobs on the machine, and the slowdowns
turned out to be some cgroup moves it did during startup. The antagonist
jobs were spawning huge numbers of threads and some other internal bugs
were exacerbating their contention. The lock handoff meant that a batch
of antagonist threads would receive the read lock of
cgroup_threadgroup_rwsem and at least some of those threads would take a
long time to be scheduled.

As a totally artificial test, you can see occasional latencies of up
to multiple seconds when moving single-threaded processes with
cgroup.procs, given an antagonist cpu cgroup of thousands of cpu-soakers
and tasks doing (pthread_create; pthread_join) or fork/join. A simple
antagonist is just a cpu+cpuset cgroup and:

bash -c 'enter cgroup; for i in {1..4000}; do bash -c "while true; do :; done&"; done' &
bash -c 'enter cgroup; for i in {1..4000}; do bash -c "while true; do /bin/true; done&"; done' &

favordynmods improves the average cost of the migrate, but doesn't help
much with the spikes. This patch does.

The locktorture results under the current locktorture are mixed (10m
runtime, 72 cpu machine, default settings):

With this patch:
Writes:  Total: 56094  Max/Min: 922/680   Fail: 0 
Reads :  Total: 21522  Max/Min: 305/292   Fail: 0

Baseline:
Writes:  Total: 33801  Max/Min: 472/468   Fail: 0
Reads :  Total: 33649  Max/Min: 475/463   Fail: 0

Under the old locktorture I originally tested against (or with that
behavior patched back in), where torture_rwsem_write_delay basically did
"mdelay(long_hold / 10)" when it wasn't doing the long hold, the results
were more clearly beneficial (there's noticeable variance in this
config, so I have stats and an example run near the median):

Baseline:
Writes:  Total: 11852  Max/Min: 167/164   Fail: 0
Reads :  Total: 11858  Max/Min: 169/164   Fail: 0
    N           Min           Max        Median           Avg        Stddev
R  50         10297         15142         11690      11951.72     1437.2562
W  50         10296         15144         11692      11952.04     1437.5479

Patched:
Writes:  Total: 27624  Max/Min: 442/334   Fail: 0
Reads :  Total: 13957  Max/Min: 197/192   Fail: 0
    N           Min           Max        Median           Avg        Stddev
R  30         13725         17133         13902     14458.467      1199.125
W  30         27525         28880         27683     27844.533     372.82701


Which of these locktorture setups is the more useful test in general I
don't know; the patch removing the mdelay ("locktorture: Add long_hold
to adjust lock-hold delays") left in some of the short-delay
mdelay/udelay calls for other locks.
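
To make the two configurations concrete, the old-style behaviour amounts
to roughly the following (a sketch based on the description above, not
verbatim locktorture code; the names and the scaling factor are
illustrative):

static void old_style_write_delay(struct torture_random_state *trsp,
				  int nwriters, unsigned long long_hold_ms)
{
	/* Rarely, hold the lock for a long time to force big contention... */
	if (!(torture_random(trsp) % (nwriters * 2000 * long_hold_ms)))
		mdelay(long_hold_ms);
	else
		/* ...otherwise still busy-wait briefly while holding the lock. */
		mdelay(long_hold_ms / 10);
}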

Of course, whenever proxy execution lands, the downsides of the current
nearly 100% FIFO behavior will be gone.


---
 kernel/locking/percpu-rwsem.c | 55 +++++++++++++++++++++--------------
 1 file changed, 33 insertions(+), 22 deletions(-)
  

Comments

Hillf Danton Jan. 23, 2024, 3:05 p.m. UTC | #1
On Mon, 22 Jan 2024 14:59:14 -0800 Benjamin Segall <bsegall@google.com>
> The waitq wakeup in percpu_up_write necessarily runs the wake function
> immediately in the current thread. With it calling
> __percpu_rwsem_trylock on behalf of the thread being woken, the lock is
> extremely fair and FIFO, with the window for unfairness merely being the
> time between the release of sem->block and the completion of a relevant
> trylock.
> 
> However, the woken threads that now hold the lock may not be chosen to
> run for a long time, and it would be useful to have more of this window
> available for a currently running thread to unfairly take the lock
> immediately and use it.

It makes no sense for a lock acquirer to probe the owner's activity except
when spinning on the owner, nor for the owner to guess whether any acquirer
will come soon.

> This can result in priority-inversion issues
> with high contention or things like CFS_BANDWIDTH quotas.

Given that a mutex could not avoid PI (priority inversion) and deadlock, why
is percpu-rwsem special wrt PI?
> 
> The even older version of percpu_rwsem that used an rwsem at its core
> provided some related gains in a different fashion through
> RWSEM_SPIN_ON_OWNER; while it had a similarly small window, waiting
> writers could spin, making it far more likely that a thread would hit
> this window.
> 
> Signed-off-by: Ben Segall <bsegall@google.com>
> 
> ---
> 
> So the actual problem we saw was that one job had severe slowdowns
> during startup with certain other jobs on the machine, and the slowdowns
> turned out to be some cgroup moves it did during startup. The antagonist
> jobs were spawning huge numbers of threads and some other internal bugs
> were exacerbating their contention. The lock handoff meant that a batch
> of antagonist threads would receive the read lock of
> cgroup_threadgroup_rwsem and at least some of those threads would take a
> long time to be scheduled.

If you want to avoid starved lock waiter, take a look at RWSEM_FLAG_HANDOFF
in rwsem_down_read_slowpath().
  
Benjamin Segall Jan. 24, 2024, 10:10 p.m. UTC | #2
Hillf Danton <hdanton@sina.com> writes:

> On Mon, 22 Jan 2024 14:59:14 -0800 Benjamin Segall <bsegall@google.com>
>> The waitq wakeup in percpu_up_write necessarily runs the wake function
>> immediately in the current thread. With it calling
>> __percpu_rwsem_trylock on behalf of the thread being woken, the lock is
>> extremely fair and FIFO, with the window for unfairness merely being the
>> time between the release of sem->block and the completion of a relevant
>> trylock.
>> 
>> However, the woken threads that now hold the lock may not be chosen to
>> run for a long time, and it would be useful to have more of this window
>> available for a currently running thread to unfairly take the lock
>> immediately and use it.
>
> It makes no sense for a lock acquirer to probe the owner's activity except
> when spinning on the owner, nor for the owner to guess whether any acquirer
> will come soon.

The code is not doing that; this text is just describing why we might
choose a less fair heuristic for which thread gets the lock.

>
>> This can result in priority-inversion issues
>> with high contention or things like CFS_BANDWIDTH quotas.
>
> Given mutex could not avoid PI (priority-inversion) and deadlock, why is
> percpu-rwsem special wrt PI?

I was going to say that mutex/rwsem have SPIN_ON_OWNER, which dodges this
somewhat (and percpu-rwsem cannot do that). Switching
cgroup_threadgroup_rwsem to an actual rwsem and even disabling read-side
RWSEM_FLAG_HANDOFF doesn't actually help noticeably for my artificial
benchmark though, so the test may not be as representative as I hoped.

The most obvious possibility is that, for the real problem,
solving/not-causing the internal contention issues was sufficient, and
that also attacking it from the percpu-rwsem angle was overkill. It
wasn't sufficient for the artificial test, but cranking up the load to
get a reliable test could easily have blown past the point where the
other fix was sufficient.

>> 
>> Signed-off-by: Ben Segall <bsegall@google.com>
>> 
>> ---
>> 
>> So the actual problem we saw was that one job had severe slowdowns
>> during startup with certain other jobs on the machine, and the slowdowns
>> turned out to be some cgroup moves it did during startup. The antagonist
>> jobs were spawning huge numbers of threads and some other internal bugs
>> were exacerbating their contention. The lock handoff meant that a batch
>> of antagonist threads would receive the read lock of
>> cgroup_threadgroup_rwsem and at least some of those threads would take a
>> long time to be scheduled.
>
> If you want to avoid starved lock waiter, take a look at RWSEM_FLAG_HANDOFF
> in rwsem_down_read_slowpath().

rwsem's HANDOFF flag is the exact opposite of what this patch is doing.
Percpu-rwsem's current code has perfect handoff for read->write, and a very
short window for write->read (or write->write) to be beaten by a new writer.
  
Hillf Danton Jan. 25, 2024, 11:04 a.m. UTC | #3
On Wed, 24 Jan 2024 14:10:43 -0800 Benjamin Segall <bsegall@google.com>
> Hillf Danton <hdanton@sina.com> writes:
> > On Mon, 22 Jan 2024 14:59:14 -0800 Benjamin Segall <bsegall@google.com>
> >> So the actual problem we saw was that one job had severe slowdowns
> >> during startup with certain other jobs on the machine, and the slowdowns
> >> turned out to be some cgroup moves it did during startup. The antagonist
> >> jobs were spawning huge numbers of threads and some other internal bugs
> >> were exacerbating their contention. The lock handoff meant that a batch
> >> of antagonist threads would receive the read lock of
> >> cgroup_threadgroup_rwsem and at least some of those threads would take a
> >> long time to be scheduled.
> >
> > If you want to avoid starved lock waiter, take a look at RWSEM_FLAG_HANDOFF
> > in rwsem_down_read_slowpath().
> 
> rwsem's HANDOFF flag is the exact opposite of what this patch is doing.

You and I are not on the same page.

> Percpu-rwsem's current code has perfect handoff for read->write, and a very
> short window for write->read (or write->write) to be beaten by a new writer.

Given no chance left for spin on owner who is legal to take a ten-minute nap,
the right thing known to do on behalf of starved waiters is to add the HANDOFF
mechanism without any heuristic like you proposed for instance, in order to
force lock acquirers to go the slow path.

Only for thoughts.

--- x/kernel/locking/percpu-rwsem.c
+++ y/kernel/locking/percpu-rwsem.c
@@ -22,6 +22,7 @@ int __percpu_init_rwsem(struct percpu_rw
 	rcuwait_init(&sem->writer);
 	init_waitqueue_head(&sem->waiters);
 	atomic_set(&sem->block, 0);
+	atomic_set(&sem->ww, 0);	/* write waiters */
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	debug_check_no_locks_freed((void *)sem, sizeof(*sem));
 	lockdep_init_map(&sem->dep_map, name, key, 0);
@@ -135,6 +136,9 @@ static int percpu_rwsem_wake_function(st
 	wake_up_process(p);
 	put_task_struct(p);
 
+	if (!reader)
+		atomic_dec(&sem->ww);
+
 	return !reader; /* wake (readers until) 1 writer */
 }
 
@@ -148,8 +152,10 @@ static void percpu_rwsem_wait(struct per
 	 * Serialize against the wakeup in percpu_up_write(), if we fail
 	 * the trylock, the wakeup must see us on the list.
 	 */
-	wait = !__percpu_rwsem_trylock(sem, reader);
+	wait = atomic_read(&sem->ww) || !__percpu_rwsem_trylock(sem, reader);
 	if (wait) {
+		if (!reader)
+			atomic_inc(&sem->ww);
 		wq_entry.flags |= WQ_FLAG_EXCLUSIVE | reader * WQ_FLAG_CUSTOM;
 		__add_wait_queue_entry_tail(&sem->waiters, &wq_entry);
 	}
@@ -166,7 +172,7 @@ static void percpu_rwsem_wait(struct per
 
 bool __sched __percpu_down_read(struct percpu_rw_semaphore *sem, bool try)
 {
-	if (__percpu_down_read_trylock(sem))
+	if (!atomic_read(&sem->ww) && __percpu_down_read_trylock(sem))
 		return true;
 
 	if (try)
@@ -234,7 +240,7 @@ void __sched percpu_down_write(struct pe
 	 * Try set sem->block; this provides writer-writer exclusion.
 	 * Having sem->block set makes new readers block.
 	 */
-	if (!__percpu_down_write_trylock(sem))
+	if (atomic_read(&sem->ww) || !__percpu_down_write_trylock(sem))
 		percpu_rwsem_wait(sem, /* .reader = */ false);
 
 	/* smp_mb() implied by __percpu_down_write_trylock() on success -- D matches A */
  
Benjamin Segall Jan. 25, 2024, 9:08 p.m. UTC | #4
Hillf Danton <hdanton@sina.com> writes:

> On Wed, 24 Jan 2024 14:10:43 -0800 Benjamin Segall <bsegall@google.com>
>> Hillf Danton <hdanton@sina.com> writes:
>> > On Mon, 22 Jan 2024 14:59:14 -0800 Benjamin Segall <bsegall@google.com>
>> >> So the actual problem we saw was that one job had severe slowdowns
>> >> during startup with certain other jobs on the machine, and the slowdowns
>> >> turned out to be some cgroup moves it did during startup. The antagonist
>> >> jobs were spawning huge numbers of threads and some other internal bugs
>> >> were exacerbating their contention. The lock handoff meant that a batch
>> >> of antagonist threads would receive the read lock of
>> >> cgroup_threadgroup_rwsem and at least some of those threads would take a
>> >> long time to be scheduled.
>> >
>> > If you want to avoid starved lock waiter, take a look at RWSEM_FLAG_HANDOFF
>> > in rwsem_down_read_slowpath().
>> 
>> rwsem's HANDOFF flag is the exact opposite of what this patch is doing.
>
> You and I are not on the same page.
>
>> Percpu-rwsem's current code has perfect handoff for read->write, and a very
>> short window for write->read (or write->write) to be beaten by a new writer.
>
> Given no chance left for spin on owner who is legal to take a ten-minute nap,
> the right thing known to do on behalf of starved waiters is to add the HANDOFF
> mechanism without any heuristic like you proposed for instance, in order to
> force lock acquirers to go the slow path.
>
> Only for thoughts.

This is not the type of slowdown that is the problem my patch is trying
to address. (And due to the way percpu-rwsem works sem->ww is nearly
entirely redundant with sem->block - the first waiting writer is instead
waiting on rcuwait and holds sem->block while doing so)
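
(For reference, the write-side slow path in current mainline looks roughly
like this, abridged; the writer that wins sem->block then sleeps on
sem->writer via rcuwait:

void __sched percpu_down_write(struct percpu_rw_semaphore *sem)
{
	might_sleep();
	rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);

	/* Notify readers to take the slow path. */
	rcu_sync_enter(&sem->rss);

	/*
	 * Try to set sem->block; this provides writer-writer exclusion.
	 * Having sem->block set makes new readers block.
	 */
	if (!__percpu_down_write_trylock(sem))
		percpu_rwsem_wait(sem, /* .reader = */ false);

	/*
	 * If readers are active or pending, wait on rcuwait for them to
	 * finish their critical sections.
	 */
	rcuwait_wait_event(&sem->writer, readers_active_check(sem),
			   TASK_UNINTERRUPTIBLE);
}
)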

The problem that my patch addresses is:

Writer is done: percpu_up_write
  atomic_set_release(&sem->block, 0);  // #1
  wake a batch of readers:
    percpu_rwsem_wake_function -> __percpu_rwsem_trylock(reader) // #2
  wake a single writer
    percpu_rwsem_wake_function -> __percpu_rwsem_trylock(writer) // #3
new writer wakes up (holding sem->block from #3)
  sees the readers holding the lock from #2, now sleeps on rcuwait
time passes // #4
readers finally get to run, run quickly and release the lock
now the writer gets to run

Currently the only source of unfairness/optimistic locking is the window
between #1 and #2, which occur in quick succession, on the same thread,
and with no SPIN_ON_OWNER to make this window more likely than it
otherwise would be.

My patch makes the entire #4 available to writers (or new readers), so
that the woken writer will instead get to run immediately. This is
obviously much less fair, but provides much better throughput (ideally
it might have some sort of delay, so that in more normal circumstances
readers don't have to win the wakeup race by luck and being woken
slightly sooner, but I don't have that).

This is also only useful because of the assumption that readers will
almost always not actually block (among other required assumptions) - if
they regularly sleep while holding the lock, then saving writers from
that first wakeup latency of readers isn't particularly helpful anymore,
because they'll still be delayed by the other wakeup latencies.
  
Hillf Danton Jan. 26, 2024, 12:22 p.m. UTC | #5
On Thu, 25 Jan 2024 13:08:02 -0800 Benjamin Segall <bsegall@google.com>
> Hillf Danton <hdanton@sina.com> writes:
> > On Wed, 24 Jan 2024 14:10:43 -0800 Benjamin Segall <bsegall@google.com>
> >> Hillf Danton <hdanton@sina.com> writes:
> >> > On Mon, 22 Jan 2024 14:59:14 -0800 Benjamin Segall <bsegall@google.com>
> >> >> So the actual problem we saw was that one job had severe slowdowns
> >> >> during startup with certain other jobs on the machine, and the slowdowns
> >> >> turned out to be some cgroup moves it did during startup. The antagonist
> >> >> jobs were spawning huge numbers of threads and some other internal bugs
> >> >> were exacerbating their contention. The lock handoff meant that a batch
> >> >> of antagonist threads would receive the read lock of
> >> >> cgroup_threadgroup_rwsem and at least some of those threads would take a
> >> >> long time to be scheduled.
> >> >
> >> > If you want to avoid starved lock waiter, take a look at RWSEM_FLAG_HANDOFF
> >> > in rwsem_down_read_slowpath().
> >> 
> >> rwsem's HANDOFF flag is the exact opposite of what this patch is doing.
> >
> > You and I are not on the same page.
> >
> >> Percpu-rwsem's current code has perfect handoff for read->write, and a very
> >> short window for write->read (or write->write) to be beaten by a new writer.
> >
> > Given no chance left for spin on owner who is legal to take a ten-minute nap,
> > the right thing known to do on behalf of starved waiters is to add the HANDOFF
> > mechanism without any heuristic like you proposed for instance, in order to
> > force lock acquirers to go the slow path.
> >
> > Only for thoughts.
> 
> This is not the type of slowdown that is the problem my patch is trying
> to address. (And due to the way percpu-rwsem works sem->ww is nearly
> entirely redundant with sem->block - the first waiting writer is instead
> waiting on rcuwait and holds sem->block while doing so)
> 
> The problem that my patch addresses is:
> 
> Writer is done: percpu_up_write
>   atomic_set_release(&sem->block, 0);  // #1
>   wake a batch of readers:
>     percpu_rwsem_wake_function -> __percpu_rwsem_trylock(reader) // #2
>   wake a single writer
>     percpu_rwsem_wake_function -> __percpu_rwsem_trylock(writer) // #3
> new writer wakes up (holding sem->block from #3)
>   sees the readers holding the lock from #2, now sleeps on rcuwait
> time passes // #4
> readers finally get to run, run quickly and release the lock
> now the writer gets to run
> 
> Currently the only source of unfairness/optimistic locking is the window
> between #1 and #2, which occur in quick succession, on the same thread,
> and with no SPIN_ON_OWNER to make this window more likely than it
> otherwise would be.

The sem->ww introduced closes the window between #1 and #2 by definition,
as it is derived from rwsem's HANDOFF.
> 
> My patch makes the entire #4 available to writers (or new readers), so
> that the woken writer will instead get to run immediately. This is

Victims arise in case the woken readers at #2 have been waiting more
than a minute while the woken writer has waited less than 20ms.

> obviously much less fair, but provides much better throughput (ideally
> it might have some sort of delay, so that in more normal circumstances
> readers don't have to win the wakeup race by luck and being woken
> slightly sooner, but I don't have that).
> 
> This is also only useful because of the assumption that readers will
> almost always not actually block (among other required assumptions) - if

Like heuristic, any assumption makes the locking game more complex than
thought without real win.

> they regularly sleep while holding the lock, then saving writers from
> that first wakeup latency of readers isn't particularly helpful anymore,
> because they'll still be delayed by the other wakeup latencies.
  
Benjamin Segall Jan. 26, 2024, 8:40 p.m. UTC | #6
Hillf Danton <hdanton@sina.com> writes:

> On Thu, 25 Jan 2024 13:08:02 -0800 Benjamin Segall <bsegall@google.com>
>> Hillf Danton <hdanton@sina.com> writes:
>> > On Wed, 24 Jan 2024 14:10:43 -0800 Benjamin Segall <bsegall@google.com>
>> >> Hillf Danton <hdanton@sina.com> writes:
>> >> > On Mon, 22 Jan 2024 14:59:14 -0800 Benjamin Segall <bsegall@google.com>
>> >> >> So the actual problem we saw was that one job had severe slowdowns
>> >> >> during startup with certain other jobs on the machine, and the slowdowns
>> >> >> turned out to be some cgroup moves it did during startup. The antagonist
>> >> >> jobs were spawning huge numbers of threads and some other internal bugs
>> >> >> were exacerbating their contention. The lock handoff meant that a batch
>> >> >> of antagonist threads would receive the read lock of
>> >> >> cgroup_threadgroup_rwsem and at least some of those threads would take a
>> >> >> long time to be scheduled.
>> >> >
>> >> > If you want to avoid starved lock waiter, take a look at RWSEM_FLAG_HANDOFF
>> >> > in rwsem_down_read_slowpath().
>> >> 
>> >> rwsem's HANDOFF flag is the exact opposite of what this patch is doing.
>> >
>> > You and I are not on the same page.
>> >
>> >> Percpu-rwsem's current code has perfect handoff for read->write, and a very
>> >> short window for write->read (or write->write) to be beaten by a new writer.
>> >
>> > Given no chance left for spin on owner who is legal to take a ten-minute nap,
>> > the right thing known to do on behalf of starved waiters is to add the HANDOFF
>> > mechanism without any heuristic like you proposed for instance, in order to
>> > force lock acquirers to go the slow path.
>> >
>> > Only for thoughts.
>> 
>> This is not the type of slowdown that is the problem my patch is trying
>> to address. (And due to the way percpu-rwsem works sem->ww is nearly
>> entirely redundant with sem->block - the first waiting writer is instead
>> waiting on rcuwait and holds sem->block while doing so)
>> 
>> The problem that my patch addresses is:
>> 
>> Writer is done: percpu_up_write
>>   atomic_set_release(&sem->block, 0);  // #1
>>   wake a batch of readers:
>>     percpu_rwsem_wake_function -> __percpu_rwsem_trylock(reader) // #2
>>   wake a single writer
>>     percpu_rwsem_wake_function -> __percpu_rwsem_trylock(writer) // #3
>> new writer wakes up (holding sem->block from #3)
>>   sees the readers holding the lock from #2, now sleeps on rcuwait
>> time passes // #4
>> readers finally get to run, run quickly and release the lock
>> now the writer gets to run
>> 
>> Currently the only source of unfairness/optimistic locking is the window
>> between #1 and #2, which occur in quick succession, on the same thread,
>> and with no SPIN_ON_OWNER to make this window more likely than it
>> otherwise would be.
>
> The sem->ww introduced closes the window between #1 and #2 by definition,
> as it is derived from rwsem's HANDOFF.

Yes.

>> 
>> My patch makes the entire #4 available to writers (or new readers), so
>> that the woken writer will instead get to run immediately. This is
>
> Victims arise in case the woken readers at #2 have been waiting more
> than a minute while the woken writer has waited less than 20ms.
>
>> obviously much less fair, but provides much better throughput (ideally
>> it might have some sort of delay, so that in more normal circumstances
>> readers don't have to win the wakeup race by luck and being woken
>> slightly sooner, but I don't have that).
>> 
>> This is also only useful because of the assumption that readers will
>> almost always not actually block (among other required assumptions) - if
>
> Like heuristic, any assumption makes the locking game more complex than
> thought without real win.
>

I'm fine with "no, fairness is more important than these performance
numbers or mitigating already-sorta-broken situations", but it's not
clear to me you've even understood the patch, because you keep only
talking about completely different forms of starvation, and suggesting
changes that would if anything make the situation worse.
  
Hillf Danton Jan. 27, 2024, 11:20 a.m. UTC | #7
On Fri, 26 Jan 2024 12:40:43 -0800 Benjamin Segall <bsegall@google.com>
> 
> I'm fine with "no, fairness is more important than these performance
> numbers or mitigating already-sorta-broken situations", but it's not

Fine too because your patch is never able to escape standing ovation.

And feel free to specify the broken situations you saw.

> clear to me you've even understood the patch, because you keep only
> talking about completely different forms of starvation, and suggesting

Given the woken writer in your reply, and that sem->ww means write waiters,
there is only one starvation in this thread.

  |> >> My patch makes the entire #4 available to writers (or new readers), so
  |> >> that the woken writer will instead get to run immediately. This is

> changes that would if anything make the situation worse.

I mitigate the starvation by making use of the known method, without any
heuristic added, though I am happy to see why it no longer works.
  
Benjamin Segall Jan. 29, 2024, 8:36 p.m. UTC | #8
Hillf Danton <hdanton@sina.com> writes:

> On Fri, 26 Jan 2024 12:40:43 -0800 Benjamin Segall <bsegall@google.com>
>> 
>> I'm fine with "no, fairness is more important than these performance
>> numbers or mitigating already-sorta-broken situations", but it's not
>
> Fine too because your patch is never able to escape standing ovation.
>
> And feel free to specify the broken situations you saw.

The basic locking issue was due to userspace spawning threads (or
processes) more rapidly than the cpus they are running on can support,
causing issues for unrelated threads doing cgroup operations on other
cpus.

The contention can be due to a combination of just straight up spawning
way too many, userspace misconfiguration of cpus allowed, or load
balancer weaknesses. (If you pick minimum cpu.shares values and have
large machines, SCHED_LOAD_RESOLUTION isn't really enough for load
balance to do a good job, and what you're telling the load balancer you
want isn't really a good idea in the first place).

>
>> clear to me you've even understood the patch, because you keep only
>> talking about completely different forms of starvation, and suggesting
>
> Given woken writer in your reply and sem->ww is write waiters, there is
> only one starvation in this thread.

The problem is *not* new readers/writers coming in and taking the lock
before the waiting ones, which is what your patch would solve. The
problem is that some of the woken readers do not get scheduled for a
long time, and nothing can take the lock until they do.
  
Hillf Danton Jan. 30, 2024, 11:41 a.m. UTC | #9
On Mon, 29 Jan 2024 12:36:20 -0800 Benjamin Segall <bsegall@google.com>
> 
> The basic locking issue was due to userspace spawning threads (or
> processes) more rapidly than the cpus they are running on can support,
> causing issues for unrelated threads doing cgroup operations on other
> cpus.
> 
> The contention can be due to a combination of just straight up spawning
> way too many, userspace misconfiguration of cpus allowed, or load
> balancer weaknesses. (If you pick minimum cpu.shares values and have
> large machines, SCHED_LOAD_RESOLUTION isn't really enough for load
> balance to do a good job, and what you're telling the load balancer you
> want isn't really a good idea in the first place).

Sigh, adding a change to percpu-rwsem's handoff because cgroup has a cough
in its chest.
  

Patch

diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 185bd1c906b01..251c1a7a94e40 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -112,22 +112,22 @@  static bool __percpu_rwsem_trylock(struct percpu_rw_semaphore *sem, bool reader)
  *
  * We use EXCLUSIVE for both readers and writers to preserve FIFO order,
  * and play games with the return value to allow waking multiple readers.
  *
  * Specifically, we wake readers until we've woken a single writer, or until a
- * trylock fails.
+ * check of the sem->block value fails.
  */
 static int percpu_rwsem_wake_function(struct wait_queue_entry *wq_entry,
 				      unsigned int mode, int wake_flags,
 				      void *key)
 {
 	bool reader = wq_entry->flags & WQ_FLAG_CUSTOM;
 	struct percpu_rw_semaphore *sem = key;
 	struct task_struct *p;
 
-	/* concurrent against percpu_down_write(), can get stolen */
-	if (!__percpu_rwsem_trylock(sem, reader))
+	/* racy, just an optimization to stop waking if the lock is taken */
+	if (atomic_read(&sem->block))
 		return 1;
 
 	p = get_task_struct(wq_entry->private);
 	list_del_init(&wq_entry->entry);
 	smp_store_release(&wq_entry->private, NULL);
@@ -138,32 +138,44 @@  static int percpu_rwsem_wake_function(struct wait_queue_entry *wq_entry,
 	return !reader; /* wake (readers until) 1 writer */
 }
 
 static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
 {
-	DEFINE_WAIT_FUNC(wq_entry, percpu_rwsem_wake_function);
 	bool wait;
+	bool first = true;
 
-	spin_lock_irq(&sem->waiters.lock);
-	/*
-	 * Serialize against the wakeup in percpu_up_write(), if we fail
-	 * the trylock, the wakeup must see us on the list.
-	 */
-	wait = !__percpu_rwsem_trylock(sem, reader);
-	if (wait) {
-		wq_entry.flags |= WQ_FLAG_EXCLUSIVE | reader * WQ_FLAG_CUSTOM;
-		__add_wait_queue_entry_tail(&sem->waiters, &wq_entry);
-	}
-	spin_unlock_irq(&sem->waiters.lock);
+	do {
+		DEFINE_WAIT_FUNC(wq_entry, percpu_rwsem_wake_function);
 
-	while (wait) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
-		if (!smp_load_acquire(&wq_entry.private))
-			break;
-		schedule();
-	}
-	__set_current_state(TASK_RUNNING);
+		spin_lock_irq(&sem->waiters.lock);
+		/*
+		 * Serialize against the wakeup in percpu_up_write(), if we fail
+		 * the trylock, the wakeup must see us on the list.
+		 */
+		wait = !__percpu_rwsem_trylock(sem, reader);
+		if (wait) {
+			wq_entry.flags |= WQ_FLAG_EXCLUSIVE | reader * WQ_FLAG_CUSTOM;
+			/*
+			 * If we wake but lose a race for the lock, preserve
+			 * FIFO-ish behavior by skipping the queue
+			 */
+			if (first)
+				__add_wait_queue_entry_tail(&sem->waiters, &wq_entry);
+			else
+				__add_wait_queue(&sem->waiters, &wq_entry);
+			first = false;
+		}
+		spin_unlock_irq(&sem->waiters.lock);
+
+		while (wait) {
+			set_current_state(TASK_UNINTERRUPTIBLE);
+			if (!smp_load_acquire(&wq_entry.private))
+				break;
+			schedule();
+		}
+		__set_current_state(TASK_RUNNING);
+	} while (wait && !__percpu_rwsem_trylock(sem, reader));
 }
 
 bool __sched __percpu_down_read(struct percpu_rw_semaphore *sem, bool try)
 {
 	if (__percpu_down_read_trylock(sem))