rtmutex: ensure we wake up the top waiter

Message ID 20230117172649.52465-1-wander@redhat.com
State New
Series rtmutex: ensure we wake up the top waiter

Commit Message

Wander Lairson Costa Jan. 17, 2023, 5:26 p.m. UTC
  In task_blocked_on_lock() we save the owner, release the wait_lock and
call rt_mutex_adjust_prio_chain(). Before we acquire the wait_lock
again, the owner may release the lock and deboost.

rt_mutex_adjust_prio_chain() acquires the wait_lock. In the requeue
phase, the waiter may initially be at the top of the queue, but after
being dequeued and requeued this may no longer be true.

This scenario ends up waking the wrong task, which will verify it is not
the top waiter and go back to sleep. Now we have a situation in which
no task is holding the lock but no one acquires it.

We can reproduce the bug in PREEMPT_RT with stress-ng:

while true; do
    stress-ng --sched deadline --sched-period 1000000000 \
    	    --sched-runtime 800000000 --sched-deadline \
    	    1000000000 --mmapfork 23 -t 20
done

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
---
 kernel/locking/rtmutex.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
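
For readability, here is the patched block of rt_mutex_adjust_prio_chain() with
explanatory comments added (a sketch for illustration only; the comments are
not part of the submitted patch):

	if (!rt_mutex_owner(lock)) {
		/*
		 * The lock is ownerless, so the chain walk stops here.
		 * Re-read the top waiter *after* the dequeue/requeue
		 * above: if it changed, the task to wake is the new top
		 * waiter, not necessarily @waiter. Waking @waiter when it
		 * is no longer on top makes it notice it is not the top
		 * waiter and go back to sleep, leaving the lock ownerless
		 * with a non-empty wait queue.
		 */
		top_waiter = rt_mutex_top_waiter(lock);
		if (prerequeue_top_waiter != top_waiter)
			wake_up_state(top_waiter->task, top_waiter->wake_state);
		raw_spin_unlock_irq(&lock->wait_lock);
		return 0;
	}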
  

Comments

Waiman Long Jan. 17, 2023, 7:32 p.m. UTC | #1
On 1/17/23 12:26, Wander Lairson Costa wrote:
> In task_blocked_on_lock() we save the owner, release the wait_lock and
> call rt_mutex_adjust_prio_chain(). Before we acquire the wait_lock
> again, the owner may release the lock and deboost.
Are you referring to task_blocks_on_rt_mutex(), not task_blocked_on_lock()?
>
> rt_mutex_adjust_prio_chain() acquires the wait_lock. In the requeue
> phase, the waiter may initially be at the top of the queue, but after
> being dequeued and requeued this may no longer be true.
>
> This scenario ends up waking the wrong task, which will verify it is not
> the top waiter and go back to sleep. Now we have a situation in which
> no task is holding the lock but no one acquires it.
>
> We can reproduce the bug in PREEMPT_RT with stress-ng:
>
> while true; do
>      stress-ng --sched deadline --sched-period 1000000000 \
>      	    --sched-runtime 800000000 --sched-deadline \
>      	    1000000000 --mmapfork 23 -t 20
> done
>
> Signed-off-by: Wander Lairson Costa <wander@redhat.com>
> ---
>   kernel/locking/rtmutex.c | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
> index 010cf4e6d0b8..728f434de2bb 100644
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -901,8 +901,9 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
>   		 * then we need to wake the new top waiter up to try
>   		 * to get the lock.
>   		 */
> -		if (prerequeue_top_waiter != rt_mutex_top_waiter(lock))
> -			wake_up_state(waiter->task, waiter->wake_state);
> +		top_waiter = rt_mutex_top_waiter(lock);
> +		if (prerequeue_top_waiter != top_waiter)
> +			wake_up_state(top_waiter->task, top_waiter->wake_state);
>   		raw_spin_unlock_irq(&lock->wait_lock);
>   		return 0;
>   	}

I would say that if an rt_mutex has no owner but has waiters, we should 
always wake up the top waiter whether or not it is the current waiter. 
So what is the result of the stress-ng run above? Is it a hang? It is 
not clear from your patch description.

I am not that familiar with the rt_mutex code, so I am cc'ing Thomas and 
Sebastian for their input.

Cheers,
Longman
  
Waiman Long Jan. 17, 2023, 7:40 p.m. UTC | #2
On 1/17/23 14:32, Waiman Long wrote:
> On 1/17/23 12:26, Wander Lairson Costa wrote:
>> In task_blocked_on_lock() we save the owner, release the wait_lock and
>> call rt_mutex_adjust_prio_chain(). Before we acquire the wait_lock
>> again, the owner may release the lock and deboost.
> Are you referring to task_blocks_on_rt_mutex(), not 
> task_blocked_on_lock()?
>>
>> rt_mutex_adjust_prio_chain() acquires the wait_lock. In the requeue
>> phase, the waiter may initially be at the top of the queue, but after
>> being dequeued and requeued this may no longer be true.
>>
>> This scenario ends up waking the wrong task, which will verify it is not
>> the top waiter and go back to sleep. Now we have a situation in which
>> no task is holding the lock but no one acquires it.
>>
>> We can reproduce the bug in PREEMPT_RT with stress-ng:
>>
>> while true; do
>>      stress-ng --sched deadline --sched-period 1000000000 \
>>              --sched-runtime 800000000 --sched-deadline \
>>              1000000000 --mmapfork 23 -t 20
>> done
>>
>> Signed-off-by: Wander Lairson Costa <wander@redhat.com>
>> ---
>>   kernel/locking/rtmutex.c | 5 +++--
>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
>> index 010cf4e6d0b8..728f434de2bb 100644
>> --- a/kernel/locking/rtmutex.c
>> +++ b/kernel/locking/rtmutex.c
>> @@ -901,8 +901,9 @@ static int __sched 
>> rt_mutex_adjust_prio_chain(struct task_struct *task,
>>            * then we need to wake the new top waiter up to try
>>            * to get the lock.
>>            */
>> -        if (prerequeue_top_waiter != rt_mutex_top_waiter(lock))
>> -            wake_up_state(waiter->task, waiter->wake_state);
>> +        top_waiter = rt_mutex_top_waiter(lock);
>> +        if (prerequeue_top_waiter != top_waiter)
>> +            wake_up_state(top_waiter->task, top_waiter->wake_state);
>>           raw_spin_unlock_irq(&lock->wait_lock);
>>           return 0;
>>       }
>
> I would say that if an rt_mutex has no owner but has waiters, we 
> should always wake up the top waiter whether or not it is the current 
> waiter. So what is the result of the stress-ng run above? Is it a 
> hang? It is not clear from your patch description.

BTW, if it is a hang, what arch has this problem? x86 or arm64? There is 
a recent report of some rt_mutex locking issue on arm64, I believe. I 
don't know if it is related, so it is important to know on which arch 
you see this problem.

Cheers,
Longman
  
Wander Lairson Costa Jan. 17, 2023, 8 p.m. UTC | #3
On Tue, Jan 17, 2023 at 4:32 PM Waiman Long <longman@redhat.com> wrote:
>
> On 1/17/23 12:26, Wander Lairson Costa wrote:
> > In task_blocked_on_lock() we save the owner, release the wait_lock and
> > call rt_mutex_adjust_prio_chain(). Before we acquire the wait_lock
> > again, the owner may release the lock and deboost.
> Are you referring to task_blocks_on_rt_mutex(), not task_blocked_on_lock()?
> >
> > rt_mutex_adjust_prio_chain() acquires the wait_lock. In the requeue
> > phase, the waiter may initially be at the top of the queue, but after
> > being dequeued and requeued this may no longer be true.
> >
> > This scenario ends up waking the wrong task, which will verify it is not
> > the top waiter and go back to sleep. Now we have a situation in which
> > no task is holding the lock but no one acquires it.
> >
> > We can reproduce the bug in PREEMPT_RT with stress-ng:
> >
> > while true; do
> >      stress-ng --sched deadline --sched-period 1000000000 \
> >           --sched-runtime 800000000 --sched-deadline \
> >           1000000000 --mmapfork 23 -t 20
> > done
> >
> > Signed-off-by: Wander Lairson Costa <wander@redhat.com>
> > ---
> >   kernel/locking/rtmutex.c | 5 +++--
> >   1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
> > index 010cf4e6d0b8..728f434de2bb 100644
> > --- a/kernel/locking/rtmutex.c
> > +++ b/kernel/locking/rtmutex.c
> > @@ -901,8 +901,9 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
> >                * then we need to wake the new top waiter up to try
> >                * to get the lock.
> >                */
> > -             if (prerequeue_top_waiter != rt_mutex_top_waiter(lock))
> > -                     wake_up_state(waiter->task, waiter->wake_state);
> > +             top_waiter = rt_mutex_top_waiter(lock);
> > +             if (prerequeue_top_waiter != top_waiter)
> > +                     wake_up_state(top_waiter->task, top_waiter->wake_state);
> >               raw_spin_unlock_irq(&lock->wait_lock);
> >               return 0;
> >       }
>
> I would say that if an rt_mutex has no owner but has waiters, we should
> always wake up the top waiter whether or not it is the current waiter.

In practice, there is a case in which the unlock code wakes up the
top waiter, but before that task runs, a third running task changes
the top waiter.

> So what is the result of the stress-ng run above? Is it a hang? It is
> not clear from your patch description.

It manifests as an rcu_preempt stall.

>
> I am not that familiar with the rt_mutex code, I am cc'ing Thomas and
> Sebastian for their input.
>
> Cheers,
> Longman
>
  
Wander Lairson Costa Jan. 17, 2023, 8:01 p.m. UTC | #4
On Tue, Jan 17, 2023 at 4:40 PM Waiman Long <longman@redhat.com> wrote:
>
> On 1/17/23 14:32, Waiman Long wrote:
> > On 1/17/23 12:26, Wander Lairson Costa wrote:
> >> In task_blocked_on_lock() we save the owner, release the wait_lock and
> >> call rt_mutex_adjust_prio_chain(). Before we acquire the wait_lock
> >> again, the owner may release the lock and deboost.
> > Are you referring to task_blocks_on_rt_mutex(), not
> > task_blocked_on_lock()?
> >>
> >> rt_mutex_adjust_prio_chain() acquires the wait_lock. In the requeue
> >> phase, the waiter may initially be at the top of the queue, but after
> >> being dequeued and requeued this may no longer be true.
> >>
> >> This scenario ends up waking the wrong task, which will verify it is not
> >> the top waiter and go back to sleep. Now we have a situation in which
> >> no task is holding the lock but no one acquires it.
> >>
> >> We can reproduce the bug in PREEMPT_RT with stress-ng:
> >>
> >> while true; do
> >>      stress-ng --sched deadline --sched-period 1000000000 \
> >>              --sched-runtime 800000000 --sched-deadline \
> >>              1000000000 --mmapfork 23 -t 20
> >> done
> >>
> >> Signed-off-by: Wander Lairson Costa <wander@redhat.com>
> >> ---
> >>   kernel/locking/rtmutex.c | 5 +++--
> >>   1 file changed, 3 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
> >> index 010cf4e6d0b8..728f434de2bb 100644
> >> --- a/kernel/locking/rtmutex.c
> >> +++ b/kernel/locking/rtmutex.c
> >> @@ -901,8 +901,9 @@ static int __sched
> >> rt_mutex_adjust_prio_chain(struct task_struct *task,
> >>            * then we need to wake the new top waiter up to try
> >>            * to get the lock.
> >>            */
> >> -        if (prerequeue_top_waiter != rt_mutex_top_waiter(lock))
> >> -            wake_up_state(waiter->task, waiter->wake_state);
> >> +        top_waiter = rt_mutex_top_waiter(lock);
> >> +        if (prerequeue_top_waiter != top_waiter)
> >> +            wake_up_state(top_waiter->task, top_waiter->wake_state);
> >>           raw_spin_unlock_irq(&lock->wait_lock);
> >>           return 0;
> >>       }
> >
> > I would say that if an rt_mutex has no owner but has waiters, we
> > should always wake up the top waiter whether or not it is the current
> > waiter. So what is the result of the stress-ng run above? Is it a
> > hang? It is not clear from your patch description.
>
> BTW, if it is a hang, what arch has this problem? x86 or arm64? There is
> a recent report of some rt_mutex locking issue on arm64, I believe. I
> don't know if it is related, so it is important to know on which arch
> you see this problem.
>

x86_64. Note that, at least in my test case, it only manifested under PREEMPT_RT.

> Cheers,
> Longman
>
  
Thomas Gleixner Jan. 18, 2023, 12:05 a.m. UTC | #5
Wander!

On Tue, Jan 17 2023 at 14:26, Wander Lairson Costa wrote:
> In task_blocked_on_lock() we save the owner, release the wait_lock and
> call rt_mutex_adjust_prio_chain(). Before we acquire the wait_lock
> again, the owner may release the lock and deboost.

This does not make sense in several aspects:

  1) Who is 'we'? You, me, someone else? None of us does anything of the
     above.

        https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#changelog

  2) What has task_blocked_on_lock() to do with the logic in
     rt_mutex_adjust_prio_chain() which is called by other callsites
     too?

  3) If the owner releases the lock and deboosts then this has
     absolutely nothing to do with the lock because the priority of
     the owner is determined by its own priority and the priority of the
     top most waiter. If the owner releases the lock then it marks the
     lock ownerless, wakes the top most waiter and deboosts itself. In
     this owner deboost rt_mutex_adjust_prio_chain() is not involved at
     all. Why?

     Because the owner deboost does not affect the priority of the
     waiters at all. It's the other way round: Waiter priority affects
     the owner priority if the waiter priority is higher than the owner
     priority.

> rt_mutex_adjust_prio_chain() acquires the wait_lock. In the requeue
> phase, the waiter may initially be at the top of the queue, but after
> being dequeued and requeued this may no longer be true.

That's related to your above argumentation in which way?

rt_mutex_adjust_prio_chain()

        lock->wait_lock is held across the whole operation

        prerequeue_top_waiter = rt_mutex_top_waiter(lock);

  This saves the current top waiter before the dequeue()/enqueue()
  sequence.

	rt_mutex_dequeue(lock, waiter);
	waiter_update_prio(waiter, task);
	rt_mutex_enqueue(lock, waiter);

	if (!rt_mutex_owner(lock)) {

  This is the case where the lock has no owner, i.e. the previous owner
  unlocked and the chainwalk cannot be continued.

  Now the code checks whether the requeue changed the top waiter task:

		if (prerequeue_top_waiter != rt_mutex_top_waiter(lock))

  What can make this condition true?

    1) @waiter is the new top waiter due to the requeue operation

    2) @waiter is no longer the top waiter due to the requeue operation

  So in both cases the new top waiter must be woken up so it can take over
  the ownerless lock.

  Here is where the code is buggy. It only considers case #1, but not
  case #2, right?

So your patch is correct, but the explanation in your changelog has
absolutely nothing to do with the problem.

Why?

  #2 is caused by a top waiter dropping out due to a signal or timeout
     and thereby deboosting the whole lock chain.

  So the relevant callchain which causes the problem originates from
  remove_waiter()

See?

Thanks,

        tglx
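
To make the two cases concrete, here is the pre-patch block again with the
case analysis written out as comments (identifiers as in
kernel/locking/rtmutex.c; the case comments are illustrative, not from the
source):

	prerequeue_top_waiter = rt_mutex_top_waiter(lock); /* top before requeue */

	rt_mutex_dequeue(lock, waiter);
	waiter_update_prio(waiter, task);
	rt_mutex_enqueue(lock, waiter);

	if (!rt_mutex_owner(lock)) {
		/*
		 * Case #1: the requeue made @waiter the new top waiter.
		 * rt_mutex_top_waiter(lock) == waiter, so waking
		 * waiter->task happens to be correct.
		 *
		 * Case #2: the requeue demoted @waiter and some other
		 * waiter is now on top. The condition below is still
		 * true, but waking waiter->task wakes a task that is not
		 * entitled to take the ownerless lock.
		 */
		if (prerequeue_top_waiter != rt_mutex_top_waiter(lock))
			wake_up_state(waiter->task, waiter->wake_state);
		...
	}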
  
Wander Lairson Costa Jan. 18, 2023, 6:49 p.m. UTC | #6
On Tue, Jan 17, 2023 at 9:05 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Wander!
>
> On Tue, Jan 17 2023 at 14:26, Wander Lairson Costa wrote:
> > In task_blocked_on_lock() we save the owner, release the wait_lock and
> > call rt_mutex_adjust_prio_chain(). Before we acquire the wait_lock
> > again, the owner may release the lock and deboost.
>
> This does not make sense in several aspects:
>
>   1) Who is 'we'? You, me, someone else? None of us does anything of the
>      above.
>
>         https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#changelog
>
>   2) What has task_blocked_on_lock() to do with the logic in
>      rt_mutex_adjust_prio_chain() which is called by other callsites
>      too?
>
>   3) If the owner releases the lock and deboosts then this has
>      absolutely nothing to do with the lock because the priority of
>      the owner is determined by its own priority and the priority of the
>      top most waiter. If the owner releases the lock then it marks the
>      lock ownerless, wakes the top most waiter and deboosts itself. In
>      this owner deboost rt_mutex_adjust_prio_chain() is not involved at
>      all. Why?
>
>      Because the owner deboost does not affect the priority of the
>      waiters at all. It's the other way round: Waiter priority affects
>      the owner priority if the waiter priority is higher than the owner
>      priority.
>
> > rt_mutex_adjust_prio_chain() acquires the wait_lock. In the requeue
> > phase, the waiter may initially be at the top of the queue, but after
> > being dequeued and requeued this may no longer be true.
>
> That's related to your above argumentation in which way?
>

I think I made the mistake of not explicitly saying at least three
tasks are involved:

- A Task T1 that currently holds the mutex
- A Task T2 that is the top waiter
- A Task T3 that changes the top waiter

T3 tries to acquire the mutex, but as T1 holds it, it calls
task_blocked_on_lock() and saves the owner. It eventually calls
rt_mutex_adjust_prio_chain(), but it releases the wait lock before
doing so. This opens a window for T1 to release the mutex and wake up
T2. Before T2 runs, T3 acquires the wait lock again inside
rt_mutex_adjust_prio_chain(). If the "dequeue/requeue" piece of code
changes the top waiter, then: 1) when T2 runs, it will verify that it
is no longer the top waiter and go back to sleep; 2) as you observed
below, the waiter doesn't point to the top waiter and, therefore, it
will wake up the wrong task.


> rt_mutex_adjust_prio_chain()
>
>         lock->wait_lock is held across the whole operation
>
>         prerequeue_top_waiter = rt_mutex_top_waiter(lock);
>
>   This saves the current top waiter before the dequeue()/enqueue()
>   sequence.
>
>         rt_mutex_dequeue(lock, waiter);
>         waiter_update_prio(waiter, task);
>         rt_mutex_enqueue(lock, waiter);
>
>         if (!rt_mutex_owner(lock)) {
>
>   This is the case where the lock has no owner, i.e. the previous owner
>   unlocked and the chainwalk cannot be continued.
>
>   Now the code checks whether the requeue changed the top waiter task:
>
>                 if (prerequeue_top_waiter != rt_mutex_top_waiter(lock))
>
>   What can make this condition true?
>
>     1) @waiter is the new top waiter due to the requeue operation
>
>     2) @waiter is no longer the top waiter due to the requeue operation
>
>   So in both cases the new top waiter must be woken up so it can take over
>   the ownerless lock.
>
>   Here is where the code is buggy. It only considers case #1, but not
>   case #2, right?
>
> So your patch is correct, but the explanation in your changelog has
> absolutely nothing to do with the problem.
>
> Why?
>
>   #2 is caused by a top waiter dropping out due to a signal or timeout
>      and thereby deboosting the whole lock chain.
>
>   So the relevant callchain which causes the problem originates from
>   remove_waiter()
>
> See?
>

Another piece of information I forgot: I spotted the bug in
spinlock_rt, which uses an rtmutex under the hood. It has a different
code path in the lock scenario, and there is no call to
remove_waiter() (or am I missing something?).
Anyway, you summed it up pretty well here: "@waiter is no longer the
top waiter due to the requeue operation". I tried (and failed) to
explain the call chain that ends up in the buggy scenario, but now I
think I should just describe the fundamental problem (the waiter
doesn't point to the top waiter).

> Thanks,
>
>         tglx
>
  
Thomas Gleixner Jan. 19, 2023, 8:53 p.m. UTC | #7
Wander!

On Wed, Jan 18 2023 at 15:49, Wander Lairson Costa wrote:
> On Tue, Jan 17, 2023 at 9:05 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> On Tue, Jan 17 2023 at 14:26, Wander Lairson Costa wrote:
>> > rt_mutex_adjust_prio_chain() acquires the wait_lock. In the requeue
>> > phase, the waiter may initially be at the top of the queue, but after
>> > being dequeued and requeued this may no longer be true.
>>
>> That's related to your above argumentation in which way?
>>
>
> I think I made the mistake of not explicitly saying at least three
> tasks are involved:
>
> - A Task T1 that currently holds the mutex
> - A Task T2 that is the top waiter
> - A Task T3 that changes the top waiter
>
> T3 tries to acquire the mutex, but as T1 holds it, it calls
> task_blocked_on_lock() and saves the owner. It eventually calls
> rt_mutex_adjust_prio_chain(), but it releases the wait lock before
> doing so. This opens a window for T1 to release the mutex and wake up
> T2. Before T2 runs, T3 acquires the wait lock again inside
> rt_mutex_adjust_prio_chain(). If the "dequeue/requeue" piece of code
> changes the top waiter, then: 1) when T2 runs, it will verify that it
> is no longer the top waiter and go back to sleep; 2) as you observed
> below, the waiter doesn't point to the top waiter and, therefore, it
> will wake up the wrong task.

This is still confusing as hell because the wait locks you are talking
about belong to different rtmutexes. So there is no "drop wait_lock and
reacquire wait_lock" window.

There must be (multiple) lock chains involved, but I couldn't figure out
yet what the actual scenario is in the case of a pure rt_spinlock lock
chain:

> Another piece of information I forgot: I spotted the bug in
> spinlock_rt, which uses an rtmutex under the hood. It has a different
> code path in the lock scenario, and there is no call to
> remove_waiter() (or am I missing something?).

Correct. But this still might be a lock chain issue involving a non
rt_spinlock, which allows early removal.

> Anyway, you summed it up pretty well here: "@waiter is no longer the
> top waiter due to the requeue operation". I tried (and failed) to
> explain the call chain that ends up in the buggy scenario, but now I
> think I should just describe the fundamental problem (the waiter
> doesn't point to the top waiter).

You really want to provide the information WHY this case can happen at
all. If it's not the removal case and related to some other obscure lock
chain problem, then we really need to understand the scenario because
that lets us analyze whether there are other holes we did not think
about yet.

If you have traces which show the sequence of lock events leading to
this problem, then you should be able to decode the scenario. If you
fail to extract the information, then please provide the traces so we
can stare at them.

Thanks,

        tglx
  
Wander Lairson Costa Jan. 31, 2023, 5:46 p.m. UTC | #8
On Thu, Jan 19, 2023 at 5:54 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Wander!
>
> On Wed, Jan 18 2023 at 15:49, Wander Lairson Costa wrote:
> > On Tue, Jan 17, 2023 at 9:05 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> On Tue, Jan 17 2023 at 14:26, Wander Lairson Costa wrote:
> >> > rt_mutex_adjust_prio_chain() acquires the wait_lock. In the requeue
> >> > phase, the waiter may initially be at the top of the queue, but after
> >> > being dequeued and requeued this may no longer be true.
> >>
> >> That's related to your above argumentation in which way?
> >>
> >
> > I think I made the mistake of not explicitly saying at least three
> > tasks are involved:
> >
> > - A Task T1 that currently holds the mutex
> > - A Task T2 that is the top waiter
> > - A Task T3 that changes the top waiter
> >
> > T3 tries to acquire the mutex, but as T1 holds it, it calls
> > task_blocked_on_lock() and saves the owner. It eventually calls
> > rt_mutex_adjust_prio_chain(), but it releases the wait lock before
> > doing so. This opens a window for T1 to release the mutex and wake up
> > T2. Before T2 runs, T3 acquires the wait lock again inside
> > rt_mutex_adjust_prio_chain(). If the "dequeue/requeue" piece of code
> > changes the top waiter, then: 1) when T2 runs, it will verify that it
> > is no longer the top waiter and go back to sleep; 2) as you observed
> > below, the waiter doesn't point to the top waiter and, therefore, it
> > will wake up the wrong task.
>
> This is still confusing as hell because the wait locks you are talking
> about belong to different rtmutexes. So there is no "drop wait_lock and
> reacquire wait_lock" window.
>
> There must be (multiple) lock chains involved, but I couldn't figure out
> yet what the actual scenario is in the case of a pure rt_spinlock lock
> chain:
>

Sorry it took so long to reply. I didn't have the traces anymore and
had to regenerate them. You spotted an error in my analysis, so I
had to start over.

> > Another piece of information I forgot: I spotted the bug in
> > spinlock_rt, which uses an rtmutex under the hood. It has a different
> > code path in the lock scenario, and there is no call to
> > remove_waiter() (or am I missing something?).
>
> Correct. But this still might be a lock chain issue involving a non
> rt_spinlock, which allows early removal.
>
> > Anyway, you summed it up pretty well here: "@waiter is no longer the
> > top waiter due to the requeue operation". I tried (and failed) to
> > explain the call chain that ends up in the buggy scenario, but now I
> > think I should just describe the fundamental problem (the waiter
> > doesn't point to the top waiter).
>
> You really want to provide the information WHY this case can happen at
> all. If it's not the removal case and related to some other obscure lock
> chain problem, then we really need to understand the scenario because
> that lets us analyze whether there are other holes we did not think
> about yet.
>
> If you have traces which show the sequence of lock events leading to
> this problem, then you should be able to decode the scenario. If you
> fail to extract the information, then please provide the traces so we
> can stare at them.
>

Here we go:

Let L1 and L2 be two spinlocks.

Let T1 be a task holding L1 and blocked on L2. T1, currently, is the top
waiter of L2.

Let T2 be the task holding L2.

Let T3 be a task trying to acquire L1.

The following events will lead to a state in which the wait queue of L2
isn't empty but nobody holds it.

T1                              T2                              T3
                                                                spin_lock(L1)
                                                                  raw_spin_lock(L1->wait_lock)
                                                                  rtlock_slowlock_locked(L1)
                                                                    task_blocks_on_rt_mutex(L1, T3)
                                                                      orig_waiter->lock = L1
                                                                      orig_waiter->task = T3
                                                                      raw_spin_unlock(L1->wait_lock)
                                                                      rt_mutex_adjust_prio_chain(T1, L1, L2, orig_waiter, T3)

                                spin_unlock(L2)
                                  rt_mutex_slowunlock(L2)
                                    raw_spin_lock(L2->wait_lock)
                                    wakeup(T1)
                                    raw_spin_unlock(L2->wait_lock)
                                                                              waiter = T1->pi_blocked_on
                                                                              waiter == rt_mutex_top_waiter(L2)
                                                                              waiter->task == T1
                                                                              raw_spin_lock(L2->wait_lock)
                                                                              dequeue(L2, waiter)
                                                                              update_prio(waiter, T1)
                                                                              enqueue(L2, waiter)
                                                                              waiter != rt_mutex_top_waiter(L2)
                                                                              L2->owner == NULL
                                                                              wakeup(T1)
                                                                              raw_spin_unlock(L2->wait_lock)
T1 wakes up
T1 != top_waiter(L2)
schedule_rtlock()

What I observed is that T1's deadline is updated before the call to
update_prio(), and the new deadline pushes it out of the top waiter
spot in L2's wait queue.

Does that make sense?
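
Mapped onto the code, what T3 executes in that window is the requeue block of
rt_mutex_adjust_prio_chain() while walking from L1 to L2. A condensed,
comment-annotated paraphrase (the comments tie the identifiers to the
scenario above):

	/* T3, in rt_mutex_adjust_prio_chain(), chain walk reaches L2 */
	raw_spin_lock_irq(&lock->wait_lock);               /* lock == L2 */

	prerequeue_top_waiter = rt_mutex_top_waiter(lock); /* T1's waiter */

	rt_mutex_dequeue(lock, waiter);    /* waiter == T1->pi_blocked_on */
	waiter_update_prio(waiter, task);  /* T1's deadline changed: deboost */
	rt_mutex_enqueue(lock, waiter);    /* T1 no longer on top */

	if (!rt_mutex_owner(lock)) {       /* T2 already unlocked L2 */
		/* Pre-patch: wakes T1 again instead of the new top waiter. */
		if (prerequeue_top_waiter != rt_mutex_top_waiter(lock))
			wake_up_state(waiter->task, waiter->wake_state);
		raw_spin_unlock_irq(&lock->wait_lock);
		return 0;
	}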
  
Wander Lairson Costa Jan. 31, 2023, 5:53 p.m. UTC | #9
On Tue, Jan 31, 2023 at 02:46:19PM -0300, Wander Lairson Costa wrote:
> On Thu, Jan 19, 2023 at 5:54 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > Wander!
> >
> > On Wed, Jan 18 2023 at 15:49, Wander Lairson Costa wrote:
> > > On Tue, Jan 17, 2023 at 9:05 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > >> On Tue, Jan 17 2023 at 14:26, Wander Lairson Costa wrote:
> > >> > rt_mutex_adjust_prio_chain() acquires the wait_lock. In the requeue
> > >> > phase, the waiter may initially be at the top of the queue, but after
> > >> > being dequeued and requeued this may no longer be true.
> > >>
> > >> That's related to your above argumentation in which way?
> > >>
> > >
> > > I think I made the mistake of not explicitly saying at least three
> > > tasks are involved:
> > >
> > > - A Task T1 that currently holds the mutex
> > > - A Task T2 that is the top waiter
> > > - A Task T3 that changes the top waiter
> > >
> > > T3 tries to acquire the mutex, but as T1 holds it, it calls
> > > task_blocked_on_lock() and saves the owner. It eventually calls
> > > rt_mutex_adjust_prio_chain(), but it releases the wait lock before
> > > doing so. This opens a window for T1 to release the mutex and wake up
> > > T2. Before T2 runs, T3 acquires the wait lock again inside
> > > rt_mutex_adjust_prio_chain(). If the "dequeue/requeue" piece of code
> > > changes the top waiter, then: 1) when T2 runs, it will verify that it
> > > is no longer the top waiter and go back to sleep; 2) as you observed
> > > below, the waiter doesn't point to the top waiter and, therefore, it
> > > will wake up the wrong task.
> >
> > This is still confusing as hell because the wait locks you are talking
> > about belong to different rtmutexes. So there is no "drop wait_lock and
> > reacquire wait_lock" window.
> >
> > There must be (multiple) lock chains involved, but I couldn't figure out
> > yet what the actual scenario is in the case of a pure rt_spinlock lock
> > chain:
> >
> 
> Sorry it took so long to reply. I didn't have the traces anymore and
> had to regenerate them. You spotted an error in my analysis, so I
> had to start over.
> 
> > > Another piece of information I forgot: I spotted the bug in
> > > spinlock_rt, which uses an rtmutex under the hood. It has a different
> > > code path in the lock scenario, and there is no call to
> > > remove_waiter() (or am I missing something?).
> >
> > Correct. But this still might be a lock chain issue involving a non
> > rt_spinlock, which allows early removal.
> >
> > > Anyway, you summed it up pretty well here: "@waiter is no longer the
> > > top waiter due to the requeue operation". I tried (and failed) to
> > > explain the call chain that ends up in the buggy scenario, but now I
> > > think I should just describe the fundamental problem (the waiter
> > > doesn't point to the top waiter).
> >
> > You really want to provide the information WHY this case can happen at
> > all. If it's not the removal case and related to some other obscure lock
> > chain problem, then we really need to understand the scenario because
> > that lets us analyze whether there are other holes we did not think
> > about yet.
> >
> > If you have traces which show the sequence of lock events leading to
> > this problem, then you should be able to decode the scenario. If you
> > fail to extract the information, then please provide the traces so we
> > can stare at them.
> >
> 
> Here we go:
> 
> Let L1 and L2 be two spinlocks.
> 
> Let T1 be a task holding L1 and blocked on L2. T1, currently, is the top
> waiter of L2.
> 
> Let T2 be the task holding L2.
> 
> Let T3 be a task trying to acquire L1.
> 
> The following events will lead to a state in which the wait queue of L2
> isn't empty but nobody holds it.
> 
> T1                              T2                              T3
>                                                                 spin_lock(L1)
> 
> raw_spin_lock(L1->wait_lock)
> 
> rtlock_slowlock_locked(L1)
> 
> task_blocks_on_rt_mutex(L1, T3)
> 
> orig_waiter->lock = L1
> 
> orig_waiter->task = T3
> 
> raw_spin_unlock(L1->wait_lock)
> 
> rt_mutex_adjust_prio_chain(T1, L1, L2, orig_waiter, T3)
> 
>                                 spin_unlock(L2)
>                                   rt_mutex_slowunlock(L2)
>                                     raw_spin_lock(L2->wait_lock)
>                                     wakeup(T1)
>                                     raw_spin_unlock(L2->wait_lock)
> 
>        waiter = T1->pi_blocked_on
> 
>        waiter == rt_mutex_top_waiter(L2)
> 
>        waiter->task == T1
> 
>        raw_spin_lock(L2->wait_lock)
> 
>        dequeue(L2, waiter)
> 
>        update_prio(waiter, T1)
> 
>        enqueue(L2, waiter)
> 
>        waiter != rt_mutex_top_waiter(L2)
> 
>        L2->owner == NULL
> 
>        wakeup(T1)
> 
>        raw_spin_unlock(L2->wait_lock)
> T1 wakes up
> T1 != top_waiter(L2)
> schedule_rtlock()
> 

Arrrrghhhh, s**t mail servers... Hopefully now it is formatted correctly:

Let L1 and L2 be two spinlocks.

Let T1 be a task holding L1 and blocked on L2. T1, currently, is the top
waiter of L2.

Let T2 be the task holding L2.

Let T3 be a task trying to acquire L1.

The following events will lead to a state in which the wait queue of L2
isn't empty but nobody holds it.

T1                              T2                              T3
                                                                spin_lock(L1)
                                                                  raw_spin_lock(L1->wait_lock)
                                                                  rtlock_slowlock_locked(L1)
                                                                    task_blocks_on_rt_mutex(L1, T3)
                                                                      orig_waiter->lock = L1
                                                                      orig_waiter->task = T3
                                                                      raw_spin_unlock(L1->wait_lock)
                                                                      rt_mutex_adjust_prio_chain(T1, L1, L2, orig_waiter, T3)

                                spin_unlock(L2)
                                  rt_mutex_slowunlock(L2)
                                    raw_spin_lock(L2->wait_lock)
                                    wakeup(T1)
                                    raw_spin_unlock(L2->wait_lock)
                                                                              waiter = T1->pi_blocked_on
                                                                              waiter == rt_mutex_top_waiter(L2)
                                                                              waiter->task == T1
                                                                              raw_spin_lock(L2->wait_lock)
                                                                              dequeue(L2, waiter)
                                                                              update_prio(waiter, T1)
                                                                              enqueue(L2, waiter)
                                                                              waiter != rt_mutex_top_waiter(L2)
                                                                              L2->owner == NULL
                                                                              wakeup(T1)
                                                                              raw_spin_unlock(L2->wait_lock)
T1 wakes up
T1 != top_waiter(L2)
schedule_rtlock()
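
The last three lines are the wait loop of the rtlock slow path: the woken T1
re-checks its position and goes back to sleep. A condensed sketch of that
loop (simplified from rtlock_slowlock_locked(); details may vary across
kernel versions):

	/* T1, woken up inside rtlock_slowlock_locked(), wait_lock held */
	for (;;) {
		/* Only the top waiter may take the ownerless lock ... */
		if (try_to_take_rt_mutex(lock, current, &waiter))
			break;

		/*
		 * ... but T1 != top_waiter(L2), so T1 goes back to
		 * sleep. Nobody is left to take over L2: the lock stays
		 * ownerless with a non-empty wait queue.
		 */
		raw_spin_unlock_irq(&lock->wait_lock);
		schedule_rtlock();
		raw_spin_lock_irq(&lock->wait_lock);
		set_current_state(TASK_RTLOCK_WAIT);
	}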
  
Thomas Gleixner Feb. 2, 2023, 7:48 a.m. UTC | #10
On Tue, Jan 31 2023 at 14:53, Wander Lairson Costa wrote:
> On Tue, Jan 31, 2023 at 02:46:19PM -0300, Wander Lairson Costa wrote:
>> > If you have traces which show the sequence of lock events leading to
>> > this problem, then you should be able to decode the scenario. If you
>> > fail to extract the information, then please provide the traces so we
>> > can stare at them.
>> >
>> 
>> Here we go:
>> 
>> Let L1 and L2 be two spinlocks.
>> 
>> Let T1 be a task holding L1 and blocked on L2. T1, currently, is the top
>> waiter of L2.
>> 
>> Let T2 be the task holding L2.
>> 
>> Let T3 be a task trying to acquire L1.
>> 
>> The following events will lead to a state in which the wait queue of L2
>> isn't empty but nobody holds it.

That explains it nicely. Care to resend with proper explanations in the
changelog?

Thanks,

        tglx
  
Wander Lairson Costa Feb. 2, 2023, 11:20 a.m. UTC | #11
On Thu, Feb 2, 2023 at 5:07 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Tue, Jan 31 2023 at 14:53, Wander Lairson Costa wrote:
> > On Tue, Jan 31, 2023 at 02:46:19PM -0300, Wander Lairson Costa wrote:
> >> > If you have traces which show the sequence of lock events leading to
> >> > this problem, then you should be able to decode the scenario. If you
> >> > fail to extract the information, then please provide the traces so we
> >> > can stare at them.
> >> >
> >>
> >> Here we go:
> >>
> >> Let L1 and L2 be two spinlocks.
> >>
> >> Let T1 be a task holding L1 and blocked on L2. T1, currently, is the top
> >> waiter of L2.
> >>
> >> Let T2 be the task holding L2.
> >>
> >> Let T3 be a task trying to acquire L1.
> >>
> >> The following events will lead to a state in which the wait queue of L2
> >> isn't empty but nobody holds it.
>
> That explains it nicely. Care to resend with proper explanations in the
> changelog?
>

Sure thing, thanks.
  

Patch

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 010cf4e6d0b8..728f434de2bb 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -901,8 +901,9 @@  static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
 		 * then we need to wake the new top waiter up to try
 		 * to get the lock.
 		 */
-		if (prerequeue_top_waiter != rt_mutex_top_waiter(lock))
-			wake_up_state(waiter->task, waiter->wake_state);
+		top_waiter = rt_mutex_top_waiter(lock);
+		if (prerequeue_top_waiter != top_waiter)
+			wake_up_state(top_waiter->task, top_waiter->wake_state);
 		raw_spin_unlock_irq(&lock->wait_lock);
 		return 0;
 	}