[v3,2/5] locking/rwsem: Limit # of null owner retries for handoff writer

Message ID 20221017211356.333862-3-longman@redhat.com
State New
Series locking/rwsem: Fix rwsem bugs & enable true lock handoff

Commit Message

Waiman Long Oct. 17, 2022, 9:13 p.m. UTC
Commit 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically
spin on owner") assumes that when the owner field is changed to NULL,
the lock will become free soon. That assumption may not be correct,
especially if the handoff writer doing the spinning is an RT task which
may preempt another task and so prevent it from completing its action
of either freeing the rwsem or properly setting up the owner.

To prevent this livelock scenario, limit the number of trylock attempts
made without sleeping. The limit is set to 8 to allow enough time for
the other task to hopefully complete its action.

New lock events were added to track the number of NULL-owner retries
with the handoff flag set before a successful trylock. Running a
96-thread locking microbenchmark with an equal number of readers and
writers on a 2-socket 96-thread system for 15 seconds gave the stats
below. Note that none of the locking threads are RT tasks.

  Retries of successful trylock    Count
  -----------------------------    -----
             1                     1738
             2                       19
             3                       11
             4                        2
             5                        1
             6                        1
             7                        1
             8                        0
             X                        1

The last row is the one attempt that needed more than 8 retries. So a
retry cap of 8 should capture most cases if no RT task is in the mix.
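
The capped retry can be illustrated with a small self-contained
userspace model (made-up names and values, not the kernel code; the
actual change is the kernel diff at the bottom of this page):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_NULL_OWNER_RETRIES  8       /* the bound used by this patch */

static atomic_long count;               /* nonzero => readers/writer present */
static atomic_long owner;               /* 0 models a NULL owner */

static bool try_write_lock(void)
{
        long expected = 0;

        return atomic_compare_exchange_strong(&count, &expected, 1);
}

/* Returns true if the lock was taken, false if the caller should sleep. */
static bool write_lock_capped_spin(void)
{
        int retries = 0;

        while (!try_write_lock()) {
                /*
                 * A NULL owner does not mean the lock is free: a reader may
                 * have bumped count without having set itself up yet.
                 */
                if (atomic_load(&owner) != 0 ||
                    retries++ >= MAX_NULL_OWNER_RETRIES)
                        return false;   /* stop spinning, go to sleep */
        }
        return true;
}

int main(void)
{
        atomic_store(&count, 256);      /* a stuck reader bias ... */
        atomic_store(&owner, 0);        /* ... with a NULL owner */
        printf("acquired: %d\n", write_lock_capped_spin());
        return 0;
}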

Fixes: 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically spin on owner")
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-and-tested-by: Mukesh Ojha <quic_mojha@quicinc.com>
---
 kernel/locking/rwsem.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)
  

Comments

Peter Zijlstra Oct. 24, 2022, 1:31 p.m. UTC | #1
On Mon, Oct 17, 2022 at 05:13:53PM -0400, Waiman Long wrote:
> Commit 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically
> spin on owner") assumes that when the owner field is changed to NULL,
> the lock will become free soon.  That assumption may not be correct
> especially if the handoff writer doing the spinning is a RT task which
> may preempt another task from completing its action of either freeing
> the rwsem or properly setting up owner.

I'm confused again -- rwsem_*_owner() has
lockdep_assert_preemption_disabled(). But more specifically; why can the
RT task preempt a lock-op like that?
  
Waiman Long Oct. 24, 2022, 3:55 p.m. UTC | #2
On 10/24/22 09:31, Peter Zijlstra wrote:
> On Mon, Oct 17, 2022 at 05:13:53PM -0400, Waiman Long wrote:
>> Commit 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically
>> spin on owner") assumes that when the owner field is changed to NULL,
>> the lock will become free soon.  That assumption may not be correct
>> especially if the handoff writer doing the spinning is a RT task which
>> may preempt another task from completing its action of either freeing
>> the rwsem or properly setting up owner.
> I'm confused again -- rwsem_*_owner() has
> lockdep_assert_preemption_disabled(). But more specifically; why can the
> RT task preempt a lock-op like that?

There is a special case raised by Mukesh that can happen. I quoted his 
text here:

---------------------------

Looks like there is still a window for a race.

There is a chance that a reader which came first adds its BIAS and heads
to the slowpath, but before it gets added to the wait list it is
preempted by an RT task which takes the slowpath as well; being the
first waiter it gets its hand-off bit set, yet it is not able to get the
lock due to the following condition in rwsem_try_write_lock():

  630                 if (count & RWSEM_LOCK_MASK) {  ==> reader has set its bias
  ...
  635                         new |= RWSEM_FLAG_HANDOFF;
  636                 } else {
  637                         new |= RWSEM_WRITER_LOCKED;

---------------------------------------------------------------------------
First reader (1)           Writer (2), an RT task        Lock holder (3)
---------------------------------------------------------------------------
Sets RWSEM_READER_BIAS.
While heading to the
slowpath (the lock is
held by (3)), and before
it can add itself to the
waiters list, it is
preempted by (2).
                           The RT task also takes the
                           slowpath and adds itself      Releases the rwsem
                           to the waiting list. Being    lock and clears the
                           the first waiter it is next   owner.
                           in line for the lock, but it
                           cannot get it because
                           (count & RWSEM_LOCK_MASK) is
                           set: (1) has added its bias
                           and is not able to remove
                           its adjustment.
---------------------------------------------------------------------------

To fix that, we either have to disable preemption in down_read() and
re-enable it in rwsem_down_read_slowpath() after decrementing the
RWSEM_READER_BIAS, or limit the number of trylock-spinning attempts as
this patch does. The latter approach seems a bit less messy, and I am
going to take it back out anyway in patch 4. I will put a summary of
that special case in the patch description.
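
For illustration, the window can be reproduced step by step in a tiny
userspace model (made-up names and values, not the kernel code):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define WRITER_LOCKED   1L
#define READER_BIAS     256L

static atomic_long count;               /* reader biases + writer bit */
static atomic_long owner;               /* 0 models a NULL owner */

/*
 * Mirrors the rwsem_try_write_lock() condition quoted above: the writer
 * cannot take the lock while any bias or lock bit is left in count.
 */
static bool writer_trylock(void)
{
        long expected = 0;

        return atomic_compare_exchange_strong(&count, &expected, WRITER_LOCKED);
}

int main(void)
{
        /* (3) The lock holder owns the rwsem for write. */
        atomic_store(&count, WRITER_LOCKED);
        atomic_store(&owner, 3);

        /*
         * (1) First reader: the fastpath adds its bias, sees that the lock
         * is held, and is preempted by (2) before it reaches the wait list
         * (and before it could back the bias out in the slowpath).
         */
        atomic_fetch_add(&count, READER_BIAS);

        /* (3) Releases the rwsem and clears the owner. */
        atomic_fetch_sub(&count, WRITER_LOCKED);
        atomic_store(&owner, 0);

        /*
         * (2) RT writer: owner is NULL, yet the trylock keeps failing
         * because (1)'s bias is stuck in count; without a retry bound it
         * would spin here for as long as (1) stays preempted.
         */
        printf("owner=%ld trylock=%d count=%ld\n",
               (long)atomic_load(&owner), writer_trylock(),
               (long)atomic_load(&count));
        return 0;
}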

Cheers,
Longman
  
Peter Zijlstra Oct. 25, 2022, 11:22 a.m. UTC | #3
On Mon, Oct 24, 2022 at 11:55:53AM -0400, Waiman Long wrote:

> Looks like there is still a window for a race.
> 
> There is a chance that a reader which came first adds its BIAS and heads
> to the slowpath, but before it gets added to the wait list it is
> preempted by an RT task which takes the slowpath as well; being the
> first waiter it gets its hand-off bit set, yet it is not able to get the
> lock due to the following condition in rwsem_try_write_lock():
> 
>  630                 if (count & RWSEM_LOCK_MASK) {  ==> reader has set its bias
>  ...
>  635                         new |= RWSEM_FLAG_HANDOFF;
>  636                 } else {
>  637                         new |= RWSEM_WRITER_LOCKED;
> 
> 
> ---------------------------------------------------------------------------
> First reader (1)           Writer (2), an RT task        Lock holder (3)
> ---------------------------------------------------------------------------
> Sets RWSEM_READER_BIAS.
> While heading to the
> slowpath (the lock is
> held by (3)), and before
> it can add itself to the
> waiters list, it is
> preempted by (2).
>                            The RT task also takes the
>                            slowpath and adds itself      Releases the rwsem
>                            to the waiting list. Being    lock and clears the
>                            the first waiter it is next   owner.
>                            in line for the lock, but it
>                            cannot get it because
>                            (count & RWSEM_LOCK_MASK) is
>                            set: (1) has added its bias
>                            and is not able to remove
>                            its adjustment.
> ---------------------------------------------------------------------------
> 
> To fix that, we either have to disable preemption in down_read() and
> re-enable it in rwsem_down_read_slowpath() after decrementing the
> RWSEM_READER_BIAS, or limit the number of trylock-spinning attempts as
> this patch does. The latter approach seems a bit less messy, and I am
> going to take it back out anyway in patch 4. I will put a summary of
> that special case in the patch description.

Funny, I find the former approach much saner. Disabling preemption
around the whole thing fixes the fundamental problem while spin-limiting
is a band-aid.

Note how rwsem_write_trylock() already does preempt_disable(), having
the read-side do something similar only makes sense.
  
Peter Zijlstra Oct. 25, 2022, 11:48 a.m. UTC | #4
On Tue, Oct 25, 2022 at 01:22:22PM +0200, Peter Zijlstra wrote:

> Funny, I find the former approach much saner. Disabling preemption
> around the whole thing fixes the fundamental problem while spin-limiting
> is a band-aid.
> 
> Note how rwsem_write_trylock() already does preempt_disable(), having
> the read-side do something similar only makes sense.

Something like the completely untested below perhaps...


diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 44873594de03..350fb004b0fb 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -256,16 +256,13 @@ static inline bool rwsem_read_trylock(struct rw_semaphore *sem, long *cntp)
 static inline bool rwsem_write_trylock(struct rw_semaphore *sem)
 {
 	long tmp = RWSEM_UNLOCKED_VALUE;
-	bool ret = false;
 
-	preempt_disable();
 	if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp, RWSEM_WRITER_LOCKED)) {
 		rwsem_set_owner(sem);
-		ret = true;
+		return true;
 	}
 
-	preempt_enable();
-	return ret;
+	return false;
 }
 
 /*
@@ -717,7 +714,6 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 		return false;
 	}
 
-	preempt_disable();
 	/*
 	 * Disable preemption is equal to the RCU read-side crital section,
 	 * thus the task_strcut structure won't go away.
@@ -729,7 +725,6 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 	if ((flags & RWSEM_NONSPINNABLE) ||
 	    (owner && !(flags & RWSEM_READER_OWNED) && !owner_on_cpu(owner)))
 		ret = false;
-	preempt_enable();
 
 	lockevent_cond_inc(rwsem_opt_fail, !ret);
 	return ret;
@@ -829,8 +824,6 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	int loop = 0;
 	u64 rspin_threshold = 0;
 
-	preempt_disable();
-
 	/* sem->wait_lock should not be held when doing optimistic spinning */
 	if (!osq_lock(&sem->osq))
 		goto done;
@@ -938,7 +931,6 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	}
 	osq_unlock(&sem->osq);
 done:
-	preempt_enable();
 	lockevent_cond_inc(rwsem_opt_fail, !taken);
 	return taken;
 }
@@ -1092,7 +1084,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int stat
 			/* Ordered by sem->wait_lock against rwsem_mark_wake(). */
 			break;
 		}
-		schedule();
+		schedule_preempt_disabled();
 		lockevent_inc(rwsem_sleep_reader);
 	}
 
@@ -1179,15 +1171,12 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 		if (waiter.handoff_set) {
 			enum owner_state owner_state;
 
-			preempt_disable();
 			owner_state = rwsem_spin_on_owner(sem);
-			preempt_enable();
-
 			if (owner_state == OWNER_NULL)
 				goto trylock_again;
 		}
 
-		schedule();
+		schedule_preempt_disabled();
 		lockevent_inc(rwsem_sleep_writer);
 		set_current_state(state);
 trylock_again:
@@ -1254,14 +1243,20 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
  */
 static inline int __down_read_common(struct rw_semaphore *sem, int state)
 {
+	int ret = 0;
 	long count;
 
+	preempt_disable();
 	if (!rwsem_read_trylock(sem, &count)) {
-		if (IS_ERR(rwsem_down_read_slowpath(sem, count, state)))
-			return -EINTR;
+		if (IS_ERR(rwsem_down_read_slowpath(sem, count, state))) {
+			ret = -EINTR;
+			goto out;
+		}
 		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	}
-	return 0;
+out:
+	preempt_enable();
+	return ret;
 }
 
 static inline void __down_read(struct rw_semaphore *sem)
@@ -1281,19 +1276,23 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
 
 static inline int __down_read_trylock(struct rw_semaphore *sem)
 {
+	int ret = 0;
 	long tmp;
 
 	DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
 
+	preempt_disable();
 	tmp = atomic_long_read(&sem->count);
 	while (!(tmp & RWSEM_READ_FAILED_MASK)) {
 		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
 						    tmp + RWSEM_READER_BIAS)) {
 			rwsem_set_reader_owned(sem);
-			return 1;
+			ret = 1;
+			break;
 		}
 	}
-	return 0;
+	preempt_enable();
+	return ret;
 }
 
 /*
@@ -1301,10 +1300,14 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
  */
 static inline int __down_write_common(struct rw_semaphore *sem, int state)
 {
+	int ret = 0;
+
+	preempt_disable();
 	if (unlikely(!rwsem_write_trylock(sem))) {
 		if (IS_ERR(rwsem_down_write_slowpath(sem, state)))
-			return -EINTR;
+			ret = -EINTR;
 	}
+	preempt_enable();
 
 	return 0;
 }
@@ -1321,8 +1324,14 @@ static inline int __down_write_killable(struct rw_semaphore *sem)
 
 static inline int __down_write_trylock(struct rw_semaphore *sem)
 {
+	int ret;
+
+	preempt_disable();
 	DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
-	return rwsem_write_trylock(sem);
+	ret = rwsem_write_trylock(sem);
+	preempt_enable();
+
+	return ret;
 }
 
 /*
  
Waiman Long Oct. 25, 2022, 7 p.m. UTC | #5
On 10/25/22 10:58, Hillf Danton wrote:
>> @@ -1179,15 +1171,12 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
>>   		if (waiter.handoff_set) {
>>   			enum owner_state owner_state;
>>   
>> -			preempt_disable();
>>   			owner_state = rwsem_spin_on_owner(sem);
>> -			preempt_enable();
>> -
>>   			if (owner_state == OWNER_NULL)
>>   				goto trylock_again;
>>   		}
> 	__up_write()
> 	{
> 	rwsem_clear_owner(sem);
> 	/*
> 	 If lockup can happen when a bound kworker gets preempted here by
> 	 a FIFO acquirer for write, this is a case of preemption deeper
> 	 than thought IMO
> 	*/
> 	tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
> 	if (unlikely(tmp & RWSEM_FLAG_WAITERS))
> 		rwsem_wake(sem);
>
A preempt_disable()/preempt_enable() pair has been added to __up_write()
by commit 48dfb5d2560 ("locking/rwsem: Disable preemption while trying
for rwsem lock"), so that should not be a problem. However, it does mean
that this change, if implemented, depends on the presence of that commit
to be functionally complete.
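
For reference, after that commit the release path has roughly the shape
sketched below (simplified from memory; debug checks omitted and the
exact extent of the non-preemptible section may differ, see the commit
itself for the authoritative version):

static inline void __up_write(struct rw_semaphore *sem)
{
        long tmp;

        /*
         * The owner is cleared and the writer bit released with
         * preemption disabled, so a FIFO acquirer cannot preempt the
         * releaser in the window pointed at above.
         */
        preempt_disable();
        rwsem_clear_owner(sem);
        tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
        preempt_enable();

        if (unlikely(tmp & RWSEM_FLAG_WAITERS))
                rwsem_wake(sem);
}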

Cheers,
Longman
  
Waiman Long Oct. 25, 2022, 7:55 p.m. UTC | #6
On 10/25/22 07:48, Peter Zijlstra wrote:
> On Tue, Oct 25, 2022 at 01:22:22PM +0200, Peter Zijlstra wrote:
>
>> Funny, I find the former approach much saner. Disabling preemption
>> around the whole thing fixes the fundamental problem while spin-limiting
>> is a band-aid.
>>
>> Note how rwsem_write_trylock() already does preempt_disable(), having
>> the read-side do something similar only makes sense.
> Something like the completely untested below perhaps...

That is quite a number of changes spread over many different functions,
the kind of change that may make it harder to backport to stable
releases.

This patch is just a stop-gap measure for stable releases which I
essentially revert in a later patch. I have no objection to disabling
preemption within the rwsem code, except for what has to be backported
to a stable release. So I can add another patch on top of the series to
essentially do that.

Cheers,
Longman
  
Peter Zijlstra Oct. 25, 2022, 8:14 p.m. UTC | #7
On Tue, Oct 25, 2022 at 03:55:09PM -0400, Waiman Long wrote:

> That is quite a number of changes spread over many different functions. That
> is the kind of changes that may make it harder to backport to stable
> releases.

Yeah; don't care. it's the right thing to do. It also doesn't need to be
reverted since it's a sane and good property for lock-ops to have.
  
Waiman Long Oct. 26, 2022, 1:44 a.m. UTC | #8
On 10/25/22 16:14, Peter Zijlstra wrote:
> On Tue, Oct 25, 2022 at 03:55:09PM -0400, Waiman Long wrote:
>
>> That is quite a number of changes spread over many different functions. That
>> is the kind of changes that may make it harder to backport to stable
>> releases.
> Yeah; don't care. it's the right thing to do. It also doesn't need to be
> reverted since it's a sane and good property for lock-ops to have.

I am sorry for the confusion. What I was referring to is the original
patch 2 of the series, not what you have proposed. I am fine with
disabling preemption while running in the core rwsem code.

As a counter-proposal, I suggest modifying patch 2 to disable preemption
in the reader rwsem code, as the writer code already has preemption
disabled in the most critical parts. Then I will add another patch to
complete the writer-side conversion for symmetry. Are you OK with that?
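
Concretely, the reader side of that counter-proposal would boil down to
something like the sketch below (essentially the reader-side half of the
diff Peter posted above, with the debug check elided):

static inline int __down_read_common(struct rw_semaphore *sem, int state)
{
        int ret = 0;
        long count;

        /*
         * Keep the whole reader fastpath plus slowpath entry
         * non-preemptible so a preempted task cannot leave its
         * READER_BIAS behind in sem->count.
         */
        preempt_disable();
        if (!rwsem_read_trylock(sem, &count)) {
                if (IS_ERR(rwsem_down_read_slowpath(sem, count, state)))
                        ret = -EINTR;
        }
        preempt_enable();

        return ret;
}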

Cheers,
Longman
  

Patch

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index be2df9ea7c30..c68d76fc8c68 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1115,6 +1115,7 @@  static struct rw_semaphore __sched *
 rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 {
 	struct rwsem_waiter waiter;
+	int null_owner_retries;
 	DEFINE_WAKE_Q(wake_q);
 
 	/* do optimistic spinning and steal lock if possible */
@@ -1156,7 +1157,7 @@  rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	set_current_state(state);
 	trace_contention_begin(sem, LCB_F_WRITE);
 
-	for (;;) {
+	for (null_owner_retries = 0;;) {
 		if (rwsem_try_write_lock(sem, &waiter)) {
 			/* rwsem_try_write_lock() implies ACQUIRE on success */
 			break;
@@ -1182,8 +1183,21 @@  rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 			owner_state = rwsem_spin_on_owner(sem);
 			preempt_enable();
 
-			if (owner_state == OWNER_NULL)
+			/*
+			 * owner is NULL doesn't guarantee the lock is free.
+			 * An incoming reader will temporarily increment the
+			 * reader count without changing owner and the
+			 * rwsem_try_write_lock() will fail if the reader
+			 * is not able to decrement it in time. Allow 8
+			 * trylock attempts when hitting a NULL owner before
+			 * going to sleep.
+			 */
+			if ((owner_state == OWNER_NULL) &&
+			    (null_owner_retries < 8)) {
+				null_owner_retries++;
 				goto trylock_again;
+			}
+			null_owner_retries = 0;
 		}
 
 		schedule();