Message ID | 20230117172649.52465-1-wander@redhat.com |
---|---|
State | New |
Series | rtmutex: ensure we wake up the top waiter |
Commit Message
Wander Lairson Costa
Jan. 17, 2023, 5:26 p.m. UTC
In task_blocked_on_lock() we save the owner, release the wait_lock and
call rt_mutex_adjust_prio_chain(). Before we acquire the wait_lock
again, the owner may release the lock and deboost.

rt_mutex_adjust_prio_chain() acquires the wait_lock. In the requeue
phase, the waiter may initially be at the top of the queue, but after
being dequeued and requeued that may no longer be true.

This scenario ends up waking the wrong task, which will verify that it
is not the top waiter and go back to sleep. Now we have a situation in
which no task is holding the lock, yet no one ever acquires it.
We can reproduce the bug in PREEMPT_RT with stress-ng:
    while true; do
        stress-ng --sched deadline --sched-period 1000000000 \
            --sched-runtime 800000000 --sched-deadline \
            1000000000 --mmapfork 23 -t 20
    done

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
---
kernel/locking/rtmutex.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
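
For context, the code this patch touches sits at the end of the requeue
step in rt_mutex_adjust_prio_chain(). Below is a pre-patch excerpt,
reconstructed from the diff at the bottom of this page and the code
quoted later in the thread; the comments are added for orientation and
are not from the kernel source:

    /* Requeue @waiter with its updated priority; wait_lock is held. */
    rt_mutex_dequeue(lock, waiter);
    waiter_update_prio(waiter, task);
    rt_mutex_enqueue(lock, waiter);

    if (!rt_mutex_owner(lock)) {
            /*
             * The previous owner unlocked, so the lock is ownerless and
             * the chainwalk cannot be continued. If the requeue changed
             * the top waiter, someone must be woken up to take the lock,
             * but this wakes @waiter's task even when the requeue demoted
             * @waiter and a different task is now the top waiter.
             */
            if (prerequeue_top_waiter != rt_mutex_top_waiter(lock))
                    wake_up_state(waiter->task, waiter->wake_state);
            raw_spin_unlock_irq(&lock->wait_lock);
            return 0;
    }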
Comments
On 1/17/23 12:26, Wander Lairson Costa wrote:
> In task_blocked_on_lock() we save the owner, release the wait_lock and
> call rt_mutex_adjust_prio_chain(). Before we acquire the wait_lock
> again, the owner may release the lock and deboost.

Are you referring to task_blocks_on_rt_mutex(), not task_blocked_on_lock()?

> rt_mutex_adjust_prio_chain() acquires the wait_lock. In the requeue
> phase, the waiter may initially be at the top of the queue, but after
> being dequeued and requeued that may no longer be true.
>
> This scenario ends up waking the wrong task, which will verify that it
> is not the top waiter and go back to sleep. Now we have a situation in
> which no task is holding the lock, yet no one ever acquires it.
>
> We can reproduce the bug in PREEMPT_RT with stress-ng:
>
>     while true; do
>         stress-ng --sched deadline --sched-period 1000000000 \
>             --sched-runtime 800000000 --sched-deadline \
>             1000000000 --mmapfork 23 -t 20
>     done
>
> Signed-off-by: Wander Lairson Costa <wander@redhat.com>
> ---
>  kernel/locking/rtmutex.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
> index 010cf4e6d0b8..728f434de2bb 100644
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -901,8 +901,9 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
>  	 * then we need to wake the new top waiter up to try
>  	 * to get the lock.
>  	 */
> -	if (prerequeue_top_waiter != rt_mutex_top_waiter(lock))
> -		wake_up_state(waiter->task, waiter->wake_state);
> +	top_waiter = rt_mutex_top_waiter(lock);
> +	if (prerequeue_top_waiter != top_waiter)
> +		wake_up_state(top_waiter->task, top_waiter->wake_state);
>  	raw_spin_unlock_irq(&lock->wait_lock);
>  	return 0;
>  }

I would say that if a rt_mutex has no owner but has waiters, we should
always wake up the top waiter whether or not it is the current waiter.
So what is the result of the stress-ng run above? Is it a hang? It is
not clear from your patch description.

I am not that familiar with the rt_mutex code, I am cc'ing Thomas and
Sebastian for their input.

Cheers,
Longman
On 1/17/23 14:32, Waiman Long wrote:
> On 1/17/23 12:26, Wander Lairson Costa wrote:
>> In task_blocked_on_lock() we save the owner, release the wait_lock and
>> call rt_mutex_adjust_prio_chain(). Before we acquire the wait_lock
>> again, the owner may release the lock and deboost.
>
> Are you referring to task_blocks_on_rt_mutex(), not
> task_blocked_on_lock()?
>
> [...]
>
> I would say that if a rt_mutex has no owner but has waiters, we
> should always wake up the top waiter whether or not it is the current
> waiter. So what is the result of the stress-ng run above? Is it a
> hang? It is not clear from your patch description.

BTW, if it is a hang, what arch has this problem? x86 or arm64? There is
a recent report of some rt_mutex locking issue in arm64, I believe. I
don't know if it will be related. So it is important to know in what
arch you see this problem.

Cheers,
Longman
On Tue, Jan 17, 2023 at 4:32 PM Waiman Long <longman@redhat.com> wrote:
>
> [...]
>
> I would say that if a rt_mutex has no owner but has waiters, we should
> always wake up the top waiter whether or not it is the current waiter.

In practice, there is this case in which the unlock code wakes up the
top waiter, but before that task awakes, a third running task changes
the top waiter.

> So what is the result of the stress-ng run above? Is it a hang? It is
> not clear from your patch description.

It manifests as a rcu_preempt stall.

> I am not that familiar with the rt_mutex code, I am cc'ing Thomas and
> Sebastian for their input.
>
> Cheers,
> Longman
On Tue, Jan 17, 2023 at 4:40 PM Waiman Long <longman@redhat.com> wrote:
>
> [...]
>
> BTW, if it is a hang, what arch has this problem? x86 or arm64? There is
> a recent report of some rt_mutex locking issue in arm64, I believe. I
> don't know if it will be related. So it is important to know in what
> arch you see this problem.

x86_64. Notice, at least in my test case, it only manifested under
PREEMPT_RT.

> Cheers,
> Longman
Wander!

On Tue, Jan 17 2023 at 14:26, Wander Lairson Costa wrote:
> In task_blocked_on_lock() we save the owner, release the wait_lock and
> call rt_mutex_adjust_prio_chain(). Before we acquire the wait_lock
> again, the owner may release the lock and deboost.

This does not make sense in several aspects:

 1) Who is 'we'? You, me, someone else? None of us does anything of the
    above.

    https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#changelog

 2) What has task_blocked_on_lock() to do with the logic in
    rt_mutex_adjust_prio_chain(), which is called by other callsites
    too?

 3) If the owner releases the lock and deboosts, then this has
    absolutely nothing to do with the lock, because the priority of the
    owner is determined by its own priority and the priority of the
    topmost waiter. If the owner releases the lock, it marks the lock
    ownerless, wakes the topmost waiter and deboosts itself. In this
    owner deboost rt_mutex_adjust_prio_chain() is not involved at all.
    Why?

    Because the owner deboost does not affect the priority of the
    waiters at all. It's the other way round: waiter priority affects
    the owner priority if the waiter priority is higher than the owner
    priority.

> rt_mutex_adjust_prio_chain() acquires the wait_lock. In the requeue
> phase, the waiter may initially be at the top of the queue, but after
> being dequeued and requeued that may no longer be true.

That's related to your above argumentation in which way?

rt_mutex_adjust_prio_chain()

  lock->wait_lock is held across the whole operation:

    prerequeue_top_waiter = rt_mutex_top_waiter(lock);

  This saves the current top waiter before the dequeue()/enqueue()
  sequence.

    rt_mutex_dequeue(lock, waiter);
    waiter_update_prio(waiter, task);
    rt_mutex_enqueue(lock, waiter);

    if (!rt_mutex_owner(lock)) {

  This is the case where the lock has no owner, i.e. the previous owner
  unlocked and the chainwalk cannot be continued.

  Now the code checks whether the requeue changed the top waiter task:

    if (prerequeue_top_waiter != rt_mutex_top_waiter(lock))

  What can make this condition true?

   1) @waiter is the new top waiter due to the requeue operation

   2) @waiter is no longer the top waiter due to the requeue operation

  So in both cases the new top waiter must be woken up so it can take
  over the ownerless lock.

  Here is where the code is buggy. It only considers case #1, but not
  case #2, right?

So your patch is correct, but the explanation in your changelog has
absolutely nothing to do with the problem. Why?

#2 is caused by a top waiter dropping out due to a signal or timeout,
thereby deboosting the whole lock chain.

So the relevant call chain which causes the problem originates from
remove_waiter().

See?

Thanks,

        tglx
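
To make the two cases above concrete, here is the pre-patch check again
with the cases marked as comments (an illustrative sketch based on the
code quoted in the reply above, not kernel source):

    /* wait_lock is held; the lock is ownerless at this point. */
    if (prerequeue_top_waiter != rt_mutex_top_waiter(lock))
            /*
             * True in two cases:
             *  #1 @waiter became the new top waiter via the requeue:
             *     waking waiter->task is correct.
             *  #2 @waiter was demoted by the requeue and some other
             *     waiter is now on top: waiter->task is the wrong task,
             *     the real top waiter is never woken, and nobody takes
             *     over the ownerless lock.
             */
            wake_up_state(waiter->task, waiter->wake_state);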
On Tue, Jan 17, 2023 at 9:05 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Wander!
>
> [...]
>
> > rt_mutex_adjust_prio_chain() acquires the wait_lock. In the requeue
> > phase, the waiter may initially be at the top of the queue, but after
> > being dequeued and requeued that may no longer be true.
>
> That's related to your above argumentation in which way?

I think I made the mistake of not explicitly saying at least three
tasks are involved:

- A task T1 that currently holds the mutex
- A task T2 that is the top waiter
- A task T3 that changes the top waiter

T3 tries to acquire the mutex, but as T1 holds it, it calls
task_blocked_on_lock() and saves the owner. It eventually calls
rt_mutex_adjust_prio_chain(), but it releases the wait lock before
doing so. This opens a window for T1 to release the mutex and wake up
T2. Before T2 runs, T3 acquires the wait lock again inside
rt_mutex_adjust_prio_chain(). If the "dequeue/requeue" piece of code
changes the top waiter, then 1) when T2 runs, it will verify that it
is no longer the top waiter and go back to sleep, and 2) as you
observed below, the waiter doesn't point to the top waiter and,
therefore, it will wake up the wrong task.

> [...]
>
> Here is where the code is buggy. It only considers case #1, but not
> case #2, right?
>
> So your patch is correct, but the explanation in your changelog has
> absolutely nothing to do with the problem. Why?
>
> #2 is caused by a top waiter dropping out due to a signal or timeout,
> thereby deboosting the whole lock chain.
>
> So the relevant call chain which causes the problem originates from
> remove_waiter().
>
> See?

Another piece of information I forgot: I spotted the bug in
spinlock_rt, which uses a rtmutex under the hood. It has a different
code path in the lock scenario, and there is no call to
remove_waiter() (or I am missing something).

Anyway, you summed it up pretty well here: "@waiter is no longer the
top waiter due to the requeue operation". I tried (and failed) to
explain the call chain that ends up in the buggy scenario, but now I
think I should just describe the fundamental problem (the waiter
doesn't point to the top waiter).

> Thanks,
>
> tglx
Wander!

On Wed, Jan 18 2023 at 15:49, Wander Lairson Costa wrote:
> On Tue, Jan 17, 2023 at 9:05 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> On Tue, Jan 17 2023 at 14:26, Wander Lairson Costa wrote:
>> > rt_mutex_adjust_prio_chain() acquires the wait_lock. In the requeue
>> > phase, the waiter may initially be at the top of the queue, but after
>> > being dequeued and requeued that may no longer be true.
>>
>> That's related to your above argumentation in which way?
>
> I think I made the mistake of not explicitly saying at least three
> tasks are involved:
>
> - A task T1 that currently holds the mutex
> - A task T2 that is the top waiter
> - A task T3 that changes the top waiter
>
> [...]

This is still confusing as hell, because the wait locks you are talking
about belong to different rtmutexes. So there is no "drops wait lock and
reacquires wait lock" window.

There must be (multiple) lock chains involved, but I couldn't figure out
yet what the actual scenario is in the case of a pure rt_spinlock lock
chain:

> Another piece of information I forgot: I spotted the bug in
> spinlock_rt, which uses a rtmutex under the hood. It has a different
> code path in the lock scenario, and there is no call to
> remove_waiter() (or I am missing something).

Correct. But this still might be a lock chain issue involving a
non-rt_spinlock lock, which allows early removal.

> Anyway, you summed it up pretty well here: "@waiter is no longer the
> top waiter due to the requeue operation". I tried (and failed) to
> explain the call chain that ends up in the buggy scenario, but now I
> think I should just describe the fundamental problem (the waiter
> doesn't point to the top waiter).

You really want to provide the information WHY this case can happen at
all. If it's not the removal case and is related to some other obscure
lock chain problem, then we really need to understand the scenario,
because that lets us analyze whether there are other holes we did not
think about yet.

If you have traces which show the sequence of lock events leading to
this problem, then you should be able to decode the scenario. If you
fail to extract the information, then please provide the traces so we
can stare at them.

Thanks,

        tglx
On Thu, Jan 19, 2023 at 5:54 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Wander!
>
> [...]
>
> This is still confusing as hell, because the wait locks you are talking
> about belong to different rtmutexes. So there is no "drops wait lock and
> reacquires wait lock" window.
>
> There must be (multiple) lock chains involved, but I couldn't figure out
> yet what the actual scenario is in the case of a pure rt_spinlock lock
> chain:

Sorry, it took so long to reply. I didn't have the traces anymore and
had to regenerate them. You spotted an error in my analysis, so I had
to start over.

> [...]
>
> If you have traces which show the sequence of lock events leading to
> this problem, then you should be able to decode the scenario. If you
> fail to extract the information, then please provide the traces so we
> can stare at them.

Here we go:

Let L1 and L2 be two spinlocks.

Let T1 be a task holding L1 and blocked on L2. T1, currently, is the
top waiter of L2.

Let T2 be the task holding L2.

Let T3 be a task trying to acquire L1.

The following events will lead to a state in which the wait queue of L2
isn't empty but nobody holds it.

T1 T2 T3
spin_lock(L1)
raw_spin_lock(L1->wait_lock)
rtlock_slowlock_locked(L1)
task_blocks_on_rt_mutex(L1, T3)
orig_waiter->lock = L1
orig_waiter->task = T3
raw_spin_unlock(L1->wait_lock)
rt_mutex_adjust_prio_chain(T1, L1, L2, orig_waiter, T3)
spin_unlock(L2)
rt_mutex_slowunlock(L2)
raw_spin_lock(L2->wait_lock)
wakeup(T1)
raw_spin_unlock(L2->wait_lock)
waiter = T1->pi_blocked_on
waiter == rt_mutex_top_waiter(L2)
waiter->task == T1
raw_spin_lock(L2->wait_lock)
dequeue(L2, waiter)
update_prio(waiter, T1)
enqueue(L2, waiter)
waiter != rt_mutex_top_waiter(L2)
L2->owner == NULL
wakeup(T1)
raw_spin_unlock(L2->wait_lock)
T1 wakes up
T1 != top_waiter(L2)
schedule_rtlock()

What I observed is that T1's deadline is updated before the call to
update_prio(), and the new deadline demotes it from being the top
waiter in the L2 wait queue.

Does that make sense?
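
As background for the deadline observation above: rtmutex waiters are
ordered by priority and, among SCHED_DEADLINE waiters, by absolute
deadline, so a later deadline picked up at requeue time can demote a
waiter. A paraphrased sketch of the waiter rb-tree comparator follows;
the field and helper names follow the kernel source of that era, but
treat the exact shape as indicative rather than verbatim:

    /*
     * Returns true if @left orders before @right in the waiter tree.
     * Lower prio value means higher priority; deadline tasks with
     * equal prio are ordered by earlier absolute deadline.
     */
    static bool rt_mutex_waiter_less(struct rt_mutex_waiter *left,
                                     struct rt_mutex_waiter *right)
    {
            if (left->prio < right->prio)
                    return true;

            /*
             * Both waiters have dl_prio(): the earlier deadline wins.
             * This is why waiter_update_prio() picking up a later
             * deadline for T1 demotes it during the requeue.
             */
            if (dl_prio(left->prio))
                    return dl_time_before(left->deadline, right->deadline);

            return false;
    }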
On Tue, Jan 31, 2023 at 02:46:19PM -0300, Wander Lairson Costa wrote:
> [...]
>
> Here we go:
>
> Let L1 and L2 be two spinlocks.
>
> Let T1 be a task holding L1 and blocked on L2. T1, currently, is the
> top waiter of L2.
>
> Let T2 be the task holding L2.
>
> Let T3 be a task trying to acquire L1.
>
> The following events will lead to a state in which the wait queue of L2
> isn't empty but nobody holds it.
>
> [...]

Arrrrghhhh, s**t mail servers... Hopefully now it is formatted
correctly:

Let L1 and L2 be two spinlocks.

Let T1 be a task holding L1 and blocked on L2. T1, currently, is the
top waiter of L2.

Let T2 be the task holding L2.

Let T3 be a task trying to acquire L1.

The following events will lead to a state in which the wait queue of L2
isn't empty but nobody holds it.

T1                      T2                      T3

                                                spin_lock(L1)
                                                  raw_spin_lock(L1->wait_lock)
                                                  rtlock_slowlock_locked(L1)
                                                    task_blocks_on_rt_mutex(L1, T3)
                                                      orig_waiter->lock = L1
                                                      orig_waiter->task = T3
                                                      raw_spin_unlock(L1->wait_lock)
                                                      rt_mutex_adjust_prio_chain(T1, L1, L2,
                                                                                 orig_waiter, T3)

                        spin_unlock(L2)
                          rt_mutex_slowunlock(L2)
                            raw_spin_lock(L2->wait_lock)
                            wakeup(T1)
                            raw_spin_unlock(L2->wait_lock)

                                                      waiter = T1->pi_blocked_on
                                                      waiter == rt_mutex_top_waiter(L2)
                                                      waiter->task == T1
                                                      raw_spin_lock(L2->wait_lock)
                                                      dequeue(L2, waiter)
                                                      update_prio(waiter, T1)
                                                      enqueue(L2, waiter)
                                                      waiter != rt_mutex_top_waiter(L2)
                                                      L2->owner == NULL
                                                      wakeup(T1)
                                                      raw_spin_unlock(L2->wait_lock)

T1 wakes up
T1 != top_waiter(L2)
schedule_rtlock()
On Tue, Jan 31 2023 at 14:53, Wander Lairson Costa wrote:
> On Tue, Jan 31, 2023 at 02:46:19PM -0300, Wander Lairson Costa wrote:
>> [...]
>>
>> Here we go:
>>
>> Let L1 and L2 be two spinlocks.
>>
>> Let T1 be a task holding L1 and blocked on L2. T1, currently, is the
>> top waiter of L2.
>>
>> Let T2 be the task holding L2.
>>
>> Let T3 be a task trying to acquire L1.
>>
>> The following events will lead to a state in which the wait queue of L2
>> isn't empty but nobody holds it.

That explains it nicely. Care to resend with proper explanations in the
changelog?

Thanks,

        tglx
On Thu, Feb 2, 2023 at 5:07 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> [...]
>
> That explains it nicely. Care to resend with proper explanations in the
> changelog?

Sure thing, thanks.
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 010cf4e6d0b8..728f434de2bb 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -901,8 +901,9 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
 		 * then we need to wake the new top waiter up to try
 		 * to get the lock.
 		 */
-		if (prerequeue_top_waiter != rt_mutex_top_waiter(lock))
-			wake_up_state(waiter->task, waiter->wake_state);
+		top_waiter = rt_mutex_top_waiter(lock);
+		if (prerequeue_top_waiter != top_waiter)
+			wake_up_state(top_waiter->task, top_waiter->wake_state);
 		raw_spin_unlock_irq(&lock->wait_lock);
 		return 0;
 	}
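
For readability, the post-patch logic with explanatory comments (a
sketch; the diff adds no declaration for top_waiter, so it presumably
reuses a local already in scope in rt_mutex_adjust_prio_chain()):

    /*
     * Re-read the top waiter after the dequeue/requeue. Waking
     * rt_mutex_top_waiter(lock)->task instead of waiter->task covers
     * both cases: the requeued @waiter became the new top waiter, or
     * it was demoted and a different waiter must take over the
     * ownerless lock.
     */
    top_waiter = rt_mutex_top_waiter(lock);
    if (prerequeue_top_waiter != top_waiter)
            wake_up_state(top_waiter->task, top_waiter->wake_state);
    raw_spin_unlock_irq(&lock->wait_lock);
    return 0;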